Migrating Kernl Analytics from Batch-Driven to Realtime

When we first started capturing realtime analytics data for Kernl it was a simple affair:

  • 1 table
  • N columns (id, date, product_uuid, domain, product_version, php_version,….)

This worked well for the first few weeks, but eventually we started to see serious performance degradation as the table size grew. The first step we took was to normalize the database. So our table full of analytics data that looked like this:

Not Normalized

slowly but surely turned into something like this:


The normalized version of the database allowed for smaller indexes, faster lookups, and data de-duplication. This scaled well even for realtime lookups. However even that started to become problematic as the table grew to > 100M rows.

The next step in our performance journey was the aggregate our data in a nightly batch job. In general, we care about our data in aggregate (“how many people have version 8.1 of PHP installed?”), so aggregating was valid choice for us. So what did we do?

  1. Kept inserting data into the normalized “meta” table.
  2. Added another table for “daily aggregates” that the API would query from.
  3. Calculate aggregates once per day.
abc1232021-12-01{ “phpVersions”: { “5.4.1”: 12, “8.1”: 2 }}
Aggregate data table

One might think “Why do this in aggregate batches? Can’t you just update the data on the fly?”. Great question! The reason we avoided this at first was race conditions. At Kernl’s scale it is extremely likely that 2 requests would attempt to update the aggregate row at the same time.

PostgreSQL load after realtime lookup caching
PostgreSQL lookup caching results. You can see system load go down as the cache starts to populate.

A sane person would look at the problem and use transactions. We took the batch approach because we could easily serialize the process and not need to worry about transactions or race conditions at all. Was it a great choice? Not particularly, but it worked. So long as the batch job finished in 24 hours, all was well. As they say:

If it’s stupid and it works, it isn’t stupid”.

– Smart Software Engineer, probably

The final iteration in our years long journey is “real-time” data processing. The future of Kernl Analytics requires this, so we finally had to dig in and get it done.

Depending on what resources you are locking, transactions aren’t cheap. To successfully update our aggregate views we needed to start a transaction and then getOrCreate() several rows across multiple tables with data in the 100M row range. Normally this isn’t an issue in a low volume setting, but for us we were starting to get a lot of resource contention on these table locks.

We had a couple of options to help alleviate the pain here:

  1. Increase server capacity (CPU, memory) – This would allow transactions to complete faster therefore reducing resource contention. Unfortunately this costs quite a bit of money and we’re trying to keep costs down.
  2. Audit indexes – Make sure that tables are appropriately indexed. We did this and it made quite a large difference, but our processing queue was still growing (albeit slower than before).
  3. Cache lookups where possible – This bit was the final piece that let us easily keep up with the volume in real-time. For tables with data that is unlikely to change, we started aggressively caching lookups and results in Redis. As a result of this, > 90% of the time our transactions were only locking the aggregate row (versus that row AND all the other tables/rows we needed) and thus because quite fast.

All that being said, you no longer have to wait a day (or two!) to see your analytics data in Kernl. It will show up as soon as we process it which will be within a few seconds of a customer sending it.