Below is a post-mortem analysis of Kernl’s outage on August 14, 2018. It details what happened, why it happened, and how we can improve things so that this doesn’t happen again.
What?
At 9:29PM EDT alerts were triggered saying that Kernl was down. Initial investigation showed that the marketing site was still up and that some (but not all) update requests were still going through. Upon further investigation, it was found that Kernl’s connection to MongoDB had stopped working.
In most situations the DB connection dropping would have caused only a momentary blip while we failed over to our secondary, but in this case that wasn’t possible.
Why?
Further investigation revealed that our Mongo provider (Compose.io) was experiencing an outage in some of their DigitalOcean environments. Unfortunately, this outage affected not only our primary Mongo host but also our secondary host. Because the outage hit both hosts, automatic failover for Mongo wasn’t possible.
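To illustrate why failover can fail in a situation like this, here is a minimal sketch. It relies on one known fact about MongoDB replica sets: a primary can only be elected when a strict majority of voting members can reach each other. Everything else is an assumption made for illustration: the host names, the "failure domain" labels, and the three-member layout are hypothetical and do not describe Kernl's actual cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    host: str            # hypothetical host name
    failure_domain: str  # hypothetical label, e.g. a provider environment
    votes: int = 1


def can_elect_primary(members, down_domains):
    """Return True if the reachable voting members form a strict majority."""
    total_votes = sum(m.votes for m in members)
    reachable_votes = sum(
        m.votes for m in members if m.failure_domain not in down_domains
    )
    return reachable_votes > total_votes / 2


# Hypothetical layout 1: every member lives in the same failure domain.
same_env = [
    Member("mongo-a", "env-1"),
    Member("mongo-b", "env-1"),
    Member("arbiter", "env-1"),
]

# Hypothetical layout 2: members are spread across three failure domains.
spread = [
    Member("mongo-a", "env-1"),
    Member("mongo-b", "env-2"),
    Member("arbiter", "env-3"),
]

print(can_elect_primary(same_env, {"env-1"}))  # False
print(can_elect_primary(spread, {"env-1"}))    # True
```

In the first layout a single environment outage takes out every voter, so no election can succeed. In the second, two of three voters survive and a new primary can be elected.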
Actions Taken
- Determined the source of the downtime.
- Contacted Compose.io support to resolve the issue.
- After not hearing back from support for 30 minutes, we started work on an alternate plan for bringing Kernl back up.
- After 45 minutes of no response from Compose.io support, the alternate plan was enacted.
- The alternate plan was to restore Kernl’s Mongo cluster into a different data center using the daily backup. The only downside was that the backup was 6 hours old, so some customers may need to re-build or re-upload plugin and theme versions created in that window.
- Cache lifetime for all Kernl endpoints was doubled. This was done because the new data center was outside of DigitalOcean NYC3, and the longer cache lifetime helps offset the increased latency to the database.
- Kernl’s downtime ended at roughly 11:00PM. The Compose.io incident wasn’t resolved until several hours after this, so we feel that the decision to restore from a backup was the right one.
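The cache lifetime change in the list above can be sketched roughly as follows. This is an illustration only: the endpoint paths and baseline lifetimes are hypothetical placeholders, not Kernl's real configuration, and the helper name `scale_cache_ttls` is invented for this example.

```python
def scale_cache_ttls(ttls, factor=2):
    """Return a new mapping with every cache lifetime multiplied by `factor`."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return {endpoint: int(seconds * factor) for endpoint, seconds in ttls.items()}


# Hypothetical baseline lifetimes in seconds (not Kernl's actual values).
baseline_ttls = {
    "/update-check": 60,
    "/download": 300,
}

doubled_ttls = scale_cache_ttls(baseline_ttls, factor=2)
print(doubled_ttls)  # {'/update-check': 120, '/download': 600}
```

Doubling the lifetime means cached responses are served for twice as long before the application has to go back to the database, so fewer requests pay the extra cross-data-center latency.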
What’s Next?
Compose.io has been our Mongo provider for years now and we’ve never experienced any significant downtime. That being said, they don’t actually support DigitalOcean anymore and plan to kill their support for it at some point in the future.
Our next steps are to evaluate how Kernl performs with the database in another data center. If things look good, we will likely move Kernl’s Mongo instances to the new data center permanently, where they can be better supported by Compose. We suspect that if we had been in one of their more popular data centers, we would have received help faster.
Once again, apologies for the downtime. We’ll continue to work hard so that it doesn’t happen again!