What’s New With Kernl – September 2018

It was a great month for Kernl! We didn’t do much in the way of user-facing features, but we did accomplish a lot of great infrastructure work.

Features, Bugs, & Infrastructure

  • Analytics Domain Search Speed Improvements – Prior to this work it could take up to 5 seconds to filter the list of domains in Kernl Analytics. You can now search with sub-second response times thanks to a well-placed index in Postgres (a sketch of the approach follows this list).
  • Daily Aggregates Cleanup – This task encompassed cleaning up the millions of rows of aggregate data that Kernl was hanging on to. We weren’t in jeopardy of running out of space, but a table of 300K rows is faster to query than a table of 20M rows.
  • Analytics Domain List Clickable URLs – A customer suggested that the URLs in the domain list should be clickable, so now they are!
  • Marketing Site Scaled Images – Kernl now serves properly scaled images for the marketing site.
  • GZIP Compression – All /static routes now serve resources using GZIP compression (see the compression sketch after this list).
  • MongoDB Upgrade – We’re now on Mongo 3.4 and the database is hosted in AWS via Compose.io.
  • Node.js Upgrade – Kernl now runs on Node.js 8.11.4. This upgrade addresses some security issues and bug fixes.
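
For the curious, here is a minimal sketch of the kind of index that made the domain search fast. The table and column names are hypothetical (the real Kernl Analytics schema isn’t public), but the idea is the same: add an index that Postgres can use for prefix searches on the domain name.

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the standard PG* env vars

async function addDomainSearchIndex(): Promise<void> {
  // text_pattern_ops lets Postgres use the index for LIKE 'example%' prefix
  // searches, which is what a domain filter box typically issues.
  // "domains" and "name" are illustrative names, not Kernl's actual schema.
  await pool.query(`
    CREATE INDEX IF NOT EXISTS idx_domains_name
      ON domains (name text_pattern_ops)
  `);
}

addDomainSearchIndex().catch(console.error);
```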
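
And here is roughly what the GZIP change looks like if you assume an Express-style Node.js server. Kernl’s actual setup may differ (static assets could just as easily be compressed by a reverse proxy), so treat this as a sketch rather than the exact implementation.

```typescript
import express from "express";
import compression from "compression";

const app = express();

// Compress anything served under /static before sending it to the client.
app.use("/static", compression());
app.use("/static", express.static("public"));

app.listen(3000);
```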

August 14 2018 Outage Post-Mortem

Below is a post-mortem analysis of Kernl’s August 14 2018 outage. It details what happened, why it happened, and how we can improve things so that this doesn’t happen again in the future.

What?

At 9:29PM EDT alerts were triggered saying that Kernl was down. Initial investigation showed that the marketing site was still up and that some (but not all) update requests were still going through. Upon further investigation, it was found that Kernl’s connection to MongoDB had stopped working.

In most situations the DB connection dropping would have caused only a momentary blip while we failed over to our secondary, but in this case that wasn’t possible.

Why?

Further investigation revealed that our Mongo provider (Compose.io) was experiencing an outage in some of their Digital Ocean environments. Unfortunately this outage affected not only our primary Mongo host but also our backup secondary host. Due to the nature of the outage, automatic failover for Mongo wasn’t a possibility.

Actions Taken

  • Determined the source of the downtime.
  • Contacted Compose.io support to resolve the issue.
  • After not hearing back from support for 30 minutes, work was started on an alternate plan for bringing Kernl back up.
  • After 45 minutes of no response from Compose.io support the alternate plan was enacted.
    • The alternate plan was to restore Kernl’s Mongo cluster into a different data center using the daily backup. The only downside was that the backup was 6 hours old, which meant some customers might need to re-build or re-upload some plugin and theme versions.
  • Cache lifetime for all Kernl endpoints was doubled. This was done because the new data center was outside of Digital Ocean NYC3. The increased cache lifetime helps combat the increased latency.
  • Kernl’s downtime ended at roughly 11:00PM. The Compose.io incident wasn’t resolved until several hours after this so we feel that the decision to restore from a backup was the right one.

What’s Next?

Compose.io has been our Mongo provider for years now and we’ve never experienced any significant downtime. That being said, they no longer actively support Digital Ocean and plan to end support for it entirely at some point in the future.

Our next steps are to evaluate how Kernl performs with the database in another data center. If things look good, we will likely move Kernl’s Mongo instances to the new data center permanently, where they can be better supported by Compose. It is our suspicion that if we had been in one of their more popular data centers, we would have received help faster.

Once again, apologies for the downtime and we’ll continue to work hard so that it doesn’t happen again!

Building & Scaling Kernl Analytics

Over the past 3 years I’ve often received requests from new and existing Kernl customers for some form of analytics on their plugin/theme. I avoided doing this for a long time because I wasn’t sure I could do so economically at Kernl’s scale, but I eventually decided to give Kernl Analytics a whirl and see where things ended up.

[Image: Product Versions Graph]

Concerns

After deciding to give the analytics offering a try, I had to figure out how to build it. When I first set out to build Kernl Analytics I had 3 main concerns:

  • Cost – I’ve never created a web service from scratch that needs to INSERT data at 75 rows per second with peaks of up to 500 rows per second. I wanted to be sure that running this service wouldn’t be prohibitively expensive.
  • Scale – How much would I need to distribute the load? This is tightly coupled to cost.
  • Speed – This project is going to generate a LOT of data by my standards. Can I query it in a performant manner?

As development progressed I realized that cost and scale were non-issues. The database that I chose to use (PostgreSQL) can easily withstand this sort of traffic with no tweaking, and I was able to get things started on a $5 Digital Ocean droplet.

Kernl Analytics Architecture & Technology

Kernl Analytics was created as its own micro-service with no public access to the outside world. All access to it is behind a firewall so that only Kernl’s Node.js servers can send requests to it. For data storage, PostgreSQL was chosen for a few reasons:

  1. Open Source
  2. The data is highly relational
  3. Performance

The application that captures the data, queries it, and runs periodic tasks is a Node.js application written in TypeScript. I chose TypeScript mostly because I’m familiar with it and wanted type safety so I wouldn’t need to write as many tests.
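
To make the TypeScript point concrete, here is a minimal sketch of what a typed write path into Postgres can look like with the pg driver. The event shape, table, and column names are invented for illustration; the real Kernl Analytics schema is certainly different.

```typescript
import { Pool } from "pg";

// Hypothetical shape of a single update-check event.
interface UpdateCheckEvent {
  productId: string;
  domain: string;
  productVersion: string;
  checkedAt: Date;
}

const pool = new Pool();

async function recordUpdateCheck(event: UpdateCheckEvent): Promise<void> {
  // Parameterized INSERT: pg handles escaping, and the compiler makes sure
  // the caller actually supplies every field the query expects.
  await pool.query(
    `INSERT INTO update_checks (product_id, domain, product_version, checked_at)
     VALUES ($1, $2, $3, $4)`,
    [event.productId, event.domain, event.productVersion, event.checkedAt]
  );
}
```

The appeal of the type safety is exactly what’s described above: a whole class of “forgot a field” or “passed the wrong thing” bugs gets caught at compile time instead of needing a test.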

[Image: TypeScript FTW!]

As for the size of the instance that Kernl Analytics runs on, I currently pay $15/month for a 3-core Digital Ocean droplet. I upgraded to 3 cores so that Postgres could easily handle both writes and multiple read requests at the same time. So far this setup has worked out well!

Pain Points

Overall things went well while implementing Kernl Analytics. In fact they went far better than expected. But that doesn’t mean there weren’t a few pain points along the way.

  • Write Volume – Kernl’s scale is just large enough to cause some scaling and performance pains when creating an analytics service. Kernl averages 25 req/s, which translates to roughly 75 INSERTs per second into Postgres, and has peaks of 150 req/s, which scales up to about 450 INSERTs per second. Postgres can easily handle this sort of load, but doing it on a $5 Digital Ocean droplet was taxing to say the least.
  • Hardware Upgrade – I tried to keep costs down as much as possible with Kernl Analytics, but in the end I had to increase the size of the droplet I was using to a $15 / 3-core droplet. I did that so one or two cores could be dedicated to writes while leaving a single core available for read requests. Postgres determines what actions are executed where, but adding more cores has led to a lot less resource contention.
  • Aggregation – Initially the data wasn’t aggregated at all. This caused some pain because even with some indexing, plucking data out of a table with > 2.5 million rows can be sort of slow. It also didn’t help that I was constantly writing data to the table, which further slowed things down. Recently I solved this by doing daily aggregations for Kernl Analytics charts and domain data, which has improved speed significantly (a sketch of the aggregation query follows below).
  • Backups & High Availability – To keep costs down the analytics service is not highly available. This is definitely one of those “take out some tech debt” items that will need to be addressed at a later date. Backups also happen only on a daily basis, so it’s possible to lose a day of data if something serious goes wrong.
[Image: Kernl Analytics server load. Yay for affordable hosting!]
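
As a rough illustration of the aggregation approach mentioned above, a nightly job can roll the raw events up into one row per product/version/day so that charts query a small table instead of the multi-million-row one. Again, the table and column names here are made up for the sketch.

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Summarize one day's worth of raw update-check rows into a compact
// aggregate table that the charts and domain views can query quickly.
async function aggregateDay(day: string): Promise<void> {
  await pool.query(
    `INSERT INTO daily_version_counts (day, product_id, product_version, checks)
     SELECT $1::date, product_id, product_version, COUNT(*)
       FROM update_checks
      WHERE checked_at >= $1::date
        AND checked_at <  $1::date + INTERVAL '1 day'
      GROUP BY product_id, product_version`,
    [day]
  );
}

// Typically run once per day, e.g. aggregateDay("2018-09-01")
```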

Future Plans

Kernl Analytics is a work in progress and there is always room to improve. Future plans for the architecture side of analytics are:

  • Optimize Indexes – I feel that more speed can be coaxed out of Postgres with some better indexing strategies.
  • Writes -vs- Reads – Once I have a highly available setup for Postgres, I plan to split responsibilities for writing and reading: writes will go to the primary and reads will go to the secondary (a rough sketch follows this list).
  • API – Right now the analytics API is completely private and firewalled off. Eventually I’d like to expose it to customers so that they can use it to do neat things.
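
Here is a rough sketch of how the read/write split could look from the application side once a secondary exists. The environment variable names are placeholders; nothing here reflects a finished design.

```typescript
import { Pool } from "pg";

// Writes always go to the primary; reads can be served by the replica.
const primary = new Pool({ connectionString: process.env.PG_PRIMARY_URL });
const replica = new Pool({ connectionString: process.env.PG_REPLICA_URL });

export function writer(): Pool {
  return primary;
}

export function reader(): Pool {
  // Fall back to the primary until a replica is actually configured.
  return process.env.PG_REPLICA_URL ? replica : primary;
}
```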