Postmortem on the March 31, 2017 SSL certificate expiration on cfapps.io
- April 5, 2017
We’d like to provide some additional information about the service disruption that occurred on Pivotal Web Services (PWS) at 12:00 GMT on Friday, March 31, 2017. At that time, the SSL certificate for the default PWS cfapps.io domain expired, causing clients' attempts to establish a secure connection to fail. The failure was detected when our Global Support Services (GSS) received multiple reports and escalated the matter to our engineering and cloud operations team at 12:28 GMT. Because the corrective action was clear, the operations team obtained a new certificate and installed it. The team confirmed the correct operation of the new certificate at 13:40 GMT. Secure connections to the cfapps.io domain failed for a total of 100 minutes.
How did a basic maintenance operation fail to occur, and why did our checks fail? To understand this, let us start with how the process is defined under normal operation. The operations team relies on a monitoring application, run as part of a scheduled CI build, that checks the expiration dates of certificates for domains under the operations team's control. When a certificate is set to expire within 30 days, the monitoring application creates an issue in the operations team's workflow management system and displays the build as failed. The operations team then picks up the issue from the workflow system and configures a new certificate.
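The 30-day rule described above can be illustrated with a minimal sketch. This is not the actual monitoring application; the function names `parse_not_after` and `needs_renewal` are hypothetical, and the example assumes the certificate's expiry is available as a standard "notAfter" string of the kind Python's `ssl` module reports.

```python
import ssl
from datetime import datetime, timedelta, timezone

# Threshold taken from the process described above: flag 30 days ahead.
RENEWAL_WINDOW = timedelta(days=30)


def parse_not_after(not_after: str) -> datetime:
    """Convert a 'notAfter' string such as 'Mar 31 12:00:00 2017 GMT'
    into a timezone-aware UTC datetime."""
    seconds = ssl.cert_time_to_seconds(not_after)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def needs_renewal(not_after: datetime, now: datetime,
                  window: timedelta = RENEWAL_WINDOW) -> bool:
    """Return True when the certificate has already expired or will
    expire within the renewal window."""
    return not_after - now <= window


# Example using the expiry time of the certificate in this incident.
expiry = parse_not_after("Mar 31 12:00:00 2017 GMT")
two_weeks_before = datetime(2017, 3, 15, tzinfo=timezone.utc)
print(needs_renewal(expiry, two_weeks_before))
```

In a real check, a result of `True` would be the point at which an issue is created and the build is marked as failed.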
Multiple failures led to the service disruption in this case. The first failure occurred when no issue was placed in the workflow system. Further investigation revealed that the monitoring application was neither creating issues properly nor successfully monitoring the tracked certificates: it was failing on an ill-formed certificate, which “hung” the process and prevented it from checking any of the certificates behind it in the queue. The second failure was that this failing CI build went unnoticed by the operations team.
We have made several changes to our process as a result of this service disruption to avoid similar occurrences in the future. First, we have switched to an automated service for renewal of certificates with our service provider. Second, we have enhanced our monitoring application to properly handle ill-formed certificates. Third, we have taken steps to ensure that failed builds are noisy and noticed by the operations team.
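One way to keep a single ill-formed certificate from blocking the rest of the queue is to isolate each check and record a failure as a finding in its own right. The sketch below illustrates that idea under stated assumptions and is not the change actually made to the monitoring application: `check_all`, the `get_expiry` callback, and the `.example` domain names are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, Tuple


def check_all(domains: Iterable[str],
              get_expiry: Callable[[str], datetime],
              now: datetime,
              window: timedelta = timedelta(days=30)) -> List[Tuple[str, str]]:
    """Check every domain and return (domain, reason) pairs that need
    attention. An error on one domain is recorded and the loop moves on,
    so later domains are still checked."""
    findings = []
    for domain in domains:
        try:
            expiry = get_expiry(domain)
        except Exception as exc:  # e.g. an ill-formed certificate
            findings.append((domain, f"check failed: {exc}"))
            continue
        if expiry - now <= window:
            findings.append((domain, f"expires {expiry.isoformat()}"))
    return findings


# Hypothetical lookup standing in for a real certificate fetch.
EXPIRIES = {
    "soon.example": datetime(2017, 3, 31, 12, 0, tzinfo=timezone.utc),
    "later.example": datetime(2018, 1, 1, tzinfo=timezone.utc),
}


def fake_get_expiry(domain: str) -> datetime:
    if domain == "broken.example":
        raise ValueError("ill-formed certificate")
    return EXPIRIES[domain]


findings = check_all(["broken.example", "soon.example", "later.example"],
                     fake_get_expiry,
                     now=datetime(2017, 3, 15, tzinfo=timezone.utc))
for domain, reason in findings:
    print(f"{domain}: {reason}")
```

A CI job built along these lines could exit with a non-zero status whenever `findings` is non-empty, so that a broken check is as visible as an expiring certificate. A lookup that truly hangs rather than raising an error would also need a timeout, which this sketch does not show.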
As a Pivotal Web Services customer, you entrust Pivotal with managing the infrastructure for your applications, and we hold ourselves to a high standard with those responsibilities in mind. We sincerely apologize for this occurrence and will do everything we can to keep your applications running smoothly in the future.