$5 WordPress VPS Performance Showdown

Want to load test your own WordPress site? Sign up for Kernl now!

In the world of affordable WordPress hosting there is an array of different VPS providers to choose from. With so many choices, how do you know which to choose? In addition to criteria such as ease of use and support, performance is a huge concern for most people deploying WordPress. In this article we’ll take a look at the performance of several different VPS providers in the $5 tier to see how they perform under load by using Kernl’s WordPress Load Testing feature.

Who are we testing?

There are a lot of VPS providers out there providing machines in the $5 / month tier, so we’ve chosen 7 of the more popular providers to test against:

  • Digital Ocean (1GB RAM, 1vCPU)
  • Linode (Nanode 1GB, 1vCPU)
  • Vultr (1vCPU, 1GB RAM)
  • AWS Lightsail (1 GB RAM, 1 vCPU)
  • Hetzner (2vCPU, 4GB RAM)
  • Google Cloud (f1-micro: 1vCPU 600MB RAM)
  • Azure ($15 A0 1vCPU, 1GB RAM)

What tests will we run?

Using Kernl’s WordPress Load Testing feature we ran 2 different tests per provider:

  • No Cache – 200 concurrent users for 30 minutes. We used this test to see raw WordPress performance with no caching enabled.
  • Cached with W3 Total Cache + Memcached – 200 concurrent users for 30 minutes. We used this test to see what a more real-world scenario looks like. In general most people use some form of caching on their site.

VPS Setup

To make sure that our test setup was consistent across all VPS providers we followed this setup guide, where we ended up with the following versions of software:

With regards to regions, for every test we kept the VPS instances on the east coast of the United States, with the exception of Hetzner, where the VPS instance was in Germany.

Results (Request & Failures)

First, let’s take a look at the request / second results across the different providers.

As you can see there is a wide spread of results depending on host. Honestly this wasn’t what I expected when I started. You’ll also notice that the Azure box cost $15/month. It was the closest I could get to finding a $5/month box in their interface (which I felt like I needed another degree in Computer Science to understand!).

So let’s visualize the data with no caching enabled.

We get lots of interesting results here. If you run a site where caching is difficult to do, your $5 will go much further depending on your host. Some notes:

  • Google Cloud and Azure performed terribly. I’m not sure why. Maybe it had to do with accessing the disk so frequently to load up PHP files? (But I expect that those were cached by PHP FPM or some other underlying process.)
  • If you are in Europe, Hetzner is your friend. $5/month gets you 4x the RAM and 2x the vCPUs of the next closest provider.
  • If you are in the US, AWS seems to be winning in this test but not by much. It feels like you would be fine going with Digital Ocean or Vultr.
  • Before making any real decisions on a host, I’d want to run 5 or so tests across different instances to make sure that there isn’t a lot of variance in my results. Noisy neighbors can often be a problem on VPS providers.
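
One way to act on that last note is to repeat the same test several times and look at the spread. Here is a minimal sketch, assuming you have recorded the average requests/second from five runs against the same provider. The numbers below are made up for illustration and are not measurements from this article.

```python
import statistics


def summarize_runs(req_per_sec):
    """Return the mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.mean(req_per_sec)
    stdev = statistics.stdev(req_per_sec)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


# Hypothetical req/s results from five runs against the same provider.
runs = [41.0, 39.5, 44.2, 28.7, 40.1]
summary = summarize_runs(runs)
print(f"mean={summary['mean']:.1f} req/s, cv={summary['cv']:.1%}")
```

A large coefficient of variation (the `cv` value) suggests that a single run, like the ones in this article, may not be representative of that host.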

Now let’s take a look at a more realistic scenario where you have a caching plugin installed.

You’ll notice that there aren’t any error bars on this graph. That’s because each host was able to handle the load without any failures. This isn’t too surprising since most of the requests would be served right out of memory via Memcached. Some notes on cached requests:

  • Once again, Google Cloud and Azure perform the worst out of any of these hosts. Given how highly regarded they are in the hosting ecosystem outside of WordPress I expected better performance.
  • All of the other providers posted impressive numbers (170 req/s – 180 req/s). At this level I would probably choose whatever provider had the best support, user interface, and reliability.
  • I suspect that most of these boxes could handle a little bit more load before going under. If I increased this test by an order of magnitude (200 users to 2000 users) I think most of the providers would tap out before Hetzner does due to how much more RAM and CPU it has.

Results (Response Time Distribution)

While requests per second and failures per second are valuable metrics, in order to get a more holistic view of raw performance for these $5 VPS instances we need to look at response time distribution.

So how should you interpret this chart? For the percentage columns, each value is in milliseconds. If you look at the 99% column, you can see that 99% of Vultr requests returned in <= 3600ms. If you look at the 80% column for Vultr, you can see that 80% of Vultr requests returned in <= 3400ms. Let’s take a look at our un-cached results.
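
To make the percentage columns concrete, here is a minimal sketch of how such a column can be computed from raw response times. It uses the nearest-rank method, which is an assumption on my part (the load testing tool may use a different percentile definition), and the sample values are hypothetical.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds.
samples = [120, 135, 150, 180, 210, 240, 300, 450, 900, 3600]
for pct in (50, 80, 99, 100):
    print(f"{pct}% of requests returned in <= {percentile(samples, pct)}ms")
```

Note how a single slow request dominates the 99% and 100% columns while barely moving the 50% column, which is why the whole distribution matters more than any one number.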

Some notes on this graph:

  • Any service that hit the 5000ms mark killed the request. I think that they likely would have gone far beyond that.
  • Once again, for $5 Hetzner is just crushing it. You simply can’t compete with 4GB RAM and 2vCPUs. Even with no caching their 99th percentile was under 2 seconds!
  • For our non-European readers, Digital Ocean, Vultr, and AWS seemed to perform the best. AWS remained remarkably consistent across the response time distribution range. This is a good thing.
  • Google Cloud… wtf? So 99% of your requests finished in <= 5 seconds, and then 90% of requests finished in <= 300ms? Something is fishy.

And now, let’s see how things change when caching is used.

As expected, most providers do very well when caching is used. Digital Ocean, Linode, Vultr, AWS, and Hetzner are all performing in the <= 200ms range (some lower!). It’s hard to decide who is better at this level because some of the difference is latency from geographic distance. The point is that you could choose any of those hosts and be OK when using a cache plugin. Once again, I’m struggling to figure out why someone would spend their $5 on Azure or Google Cloud.

Results (Total Requests)

Our final metric we need to look at before we can pass any judgement on our VPS providers is total requests. This in particular needs to be compared with response time distribution.


This graph does a great job explaining some of the discrepancies in the response time distribution data. When caching wasn’t enabled, Google Cloud and Azure barely even show up on the graph. More thoughts:

  • I’m 90% positive that the Google Cloud and Azure instances nearly stopped processing requests at some point. They were so overwhelmed that they just fell over. The data seems to support this.
  • Google Cloud and Azure are not the best place to spend $5. Even with caching I would be scared if there was ever a cache-miss.
  • Hetzner is the clear winner in un-cached data. Once again, this makes sense given how much more machine you are getting for your money.
  • On the U.S. side of the ocean, AWS seems to win here, although not by much over Digital Ocean and Vultr. Once caching is taken into consideration they all perform roughly the same (accounting for latency between data centers and load generators).

Conclusions

If performance is all you care about for your $5 (and that’s a big if), then choose Hetzner if you need a VPS in Europe. If you need a VPS in the US or elsewhere, choose AWS, Digital Ocean, or Vultr. Microsoft and Google are not great for $5.


What’s New With Kernl – February 2019

February was a pretty busy month for Kernl! We had a lot of great tweaks to load testing, a few customer feature improvements, and some infrastructure work. Let’s get started!

Features & Bugs

  • Multi-Region Load Tests – You can now select multiple regions for your WordPress load tests! Instead of having all traffic come from a single region you can have it evenly distributed across all the available regions. This is useful for testing if you have a global audience.
  • Load Testing Enters General Availability – Kernl’s WordPress load testing is now available for all customers.
  • Delete Load Tests – You are now able to delete your load tests.
  • License Max Version Bug – A customer brought to our attention that the “max_version” field behavior wasn’t quite right. This has been resolved.
  • Customer Card Expiration Cron Job Bug – We recently discovered that the cron job that checks to see if a customer has paid their invoice was broken. This was going on for about 5 months, so some of you may have received your Kernl subscription for free during that time period. 😉
  • Multiple License Domains – If you use our license management system and restrict via domains, you can now enter multiple domains on a per-license basis. This is useful if you want to use the same license for local, staging, and production.
  • License Management UI Updates – We’ve simplified the list view in license management by removing some columns that were cluttering the screen. We’ve also lined up the action buttons better and will now notify you in the plugin/theme detail pages if you have license management enabled but no licenses associated with your product.

Infrastructure

  • The Kernl Analytics server was re-sized to be smaller. It was way over allocated.
  • Load testing was moved to a Kernl sub-domain. Prior to this it had a top-level domain.
  • Load testing servers that don’t come up after 3 minutes are removed from the load testing pool.
  • Session handling (for OAuth) has been moved to cookies. Prior to this we stored sessions in Redis.
  • We have removed our dependency on the ‘Q’ promise package on the Node.js app servers.

Blog Posts

That’s it for this month!

Adventures in Scalable WordPress Hosting: Part 2

Interested in testing your WordPress scalability? Check out the Kernl WordPress Load Testing beta program!

In part 1 of this series I explored scaling WordPress using WP Super Cache and by throwing more expensive hardware at the problem. In part 2 of this series we’ll go on an adventure in horizontal scalability using load balancers, NFS, Memcached, and an externally hosted MySQL.

The Plan

To horizontally scale any app is an exercise in breaking things apart as much as possible. In the case of WordPress there are a few shared components that I wanted to break up:

  • File System – The file system is the most problematic part of scaling WordPress. Unless you change how WordPress stores plugins, themes, media, and other things you need to have a shared file system that all nodes in your cluster can access. There are likely some other solutions here, but this one provides a lot of flexibility.
  • MySQL – In many WordPress installs MySQL lives on the same machine as WordPress. For a horizontally scaled cluster this doesn’t work so we need a MySQL that is external.
  • Memcached – It was brought to my attention that during part 1 of this series using WP Super Cache to generate static pages was sort of cheating. In the spirit of making this harder for myself I introduced W3 Total Cache instead and will be using an external Memcached instance as the shared cache.

Now that the basic why and what is out of the way, let’s talk about the how. I’m a huge fan of Digital Ocean. I use them for everything except file storage so I’m going to use them for this WordPress cluster as well. Here’s how it’s going down:

  1. Create a droplet that will act as the file system for our cluster. Using NFS all droplets in the cluster will be able to mount it and use it for WordPress. I’m also going to use this for Memcached since NFS doesn’t take up many resources.
  2. Create a base droplet that has Nginx and PHP7.2-FPM installed on it. There is a little bit of boilerplate configuration here, but in general the install is typical. The only change to the Nginx configuration is where I set the root directory to be the NFS mount. Use this base droplet to configure WordPress database settings.
  3. Use Compose.io to create a MySQL database. I wanted something that was configured well that I didn’t have to think about. Totally worth the $27 / month.
  4. Once the above are done take a snapshot of the base droplet and use it to create more droplets. If all goes well you shouldn’t need to do any configuration.
  5. Using Digital Ocean’s load balancer service add your droplets to the load balancer.
  6. Voila! That’s it.
Ugly architecture diagram

No Cache Smoke Test

200 users, 10 minutes, 2 users/sec ramp up, from London

As with every load test that I do, the first test is always just to shake out any bugs in the load test itself. For this test I didn’t have any caching enabled and only a single app server behind the load balancer. It was effectively the same as the first load test I did during part 1 of this blog series.

As you can see from the graph below, performance was what we would expect from the setup that I used. We settled in to 21 requests / second with no errors.

As Expected.

The response time distribution wasn’t very great. 90% of requests finished in under 5 seconds, but that’s still a very long time. Generally if I saw this response time distribution I would think that it’s time to add caching or scale up/out.

Not bad. Not great.

So. Many. Failures.

2000 users, 120 minutes, 2 users/sec ramp up, from London

The next test I decided to run was the sustained heavy load test. This is generally where I start to see failures from managed WordPress hosting providers. Given that I didn’t add any more app servers to the load balancer and had no caching, things went as poorly as you would expect.

All the failures of failure land.

Everything was fine up until ~25 req/s and then the wheels fell off. The response time distribution was bad too. No surprises here.

50% of requests in 5 seconds, 100% in…33 seconds 🙁

Looks like it’s time to scale.

Horizontal Scalability

2000 users, 120 minutes, 2 users/sec ramp up, from London

Before adding Memcached to the setup I wanted to see how it scaled without it. That means adding more hardware. For this test I added four more application servers (Nginx + PHP) to the load balancer and ran the test again.

Linear Growth

As you can see from the request/failure graph we experience roughly linear growth in our maximum requests/second. Given we originally maxed out at ~20 req/s on one machine, maxing out at ~100 req/s with five machines seems like exactly the sort of result that I would expect to see. The response time distribution also started to look better:

Not perfect, but better.

Obviously a 90% score of 4 seconds isn’t awesome, but it is a lot better than the previous test. I did make a tiny tweak to the load balancer configuration that may have helped though. I decided to use the ‘least connections’ option instead of ‘round robin’. ‘Least connections’ tells the load balancer to send traffic to the app server with the fewest active connections. This should help with dog piling on a server with a few slower connections.
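
To illustrate the difference between the two strategies, here is a minimal sketch. It is not Digital Ocean’s implementation; the server names and connection counts are hypothetical, and a real load balancer tracks connections itself rather than being handed a snapshot.

```python
import itertools


def make_round_robin(servers):
    """Return a picker that cycles through servers regardless of their load."""
    cycle = itertools.cycle(servers)
    return lambda active: next(cycle)


def least_connections(active):
    """Pick the server that currently has the fewest active connections."""
    return min(active, key=active.get)


# Hypothetical snapshot: app1 is stuck with several slow requests.
active = {"app1": 7, "app2": 2, "app3": 3}
round_robin = make_round_robin(list(active))
print("round robin picks:", round_robin(active))
print("least connections picks:", least_connections(active))
```

Round robin keeps handing requests to the overloaded server on its turn, while least connections steers new traffic away from it until it catches up.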

Given the results above we can safely assume linear growth tied to the number of app servers that we have for quite some time. Meaning for each app server that I add I can expect to handle an additional ~20 req/s. With that in mind, I wanted to see what would happen if I enabled some caching on this cluster.
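
As a back-of-the-envelope sketch of that assumption, the helpers below estimate capacity from the ~20 req/s per app server observed in these un-cached tests. The function names are mine, and the linear model is only an approximation: it ignores the point where the shared NFS or MySQL machines become the bottleneck.

```python
import math

REQ_PER_SEC_PER_SERVER = 20  # approximate un-cached figure from the tests above


def estimated_capacity(app_servers, per_server=REQ_PER_SEC_PER_SERVER):
    """Estimated maximum req/s for a cluster, assuming linear scaling."""
    return app_servers * per_server


def servers_needed(target_req_per_sec, per_server=REQ_PER_SEC_PER_SERVER):
    """Smallest number of app servers that covers the target req/s."""
    return math.ceil(target_req_per_sec / per_server)


print(estimated_capacity(5))  # 100
print(servers_needed(250))    # 13
```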

Gotta Go Fast

In my previous test of vertical scaling I used WP Super Cache to make things go quick. WP Super Cache generates static HTML pages for your site and then serves those. The benefit is that static pages are extremely fast to serve. In this test I wanted to try a more dynamic approach using Memcached and W3 Total Cache. W3 Total Cache takes a very different approach to caching by storing pages, objects, and database queries in Memcached. In general this caching model is more flexible, but possibly a bit slower. I installed Memcached on the same server as the NFS mount because it was underutilized. In a real production scenario I wouldn’t violate this separation of concerns.
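
The general idea behind this kind of caching is often called cache-aside: look in the cache first, and only do the expensive work on a miss. The sketch below is not W3 Total Cache’s code. `DictCache` is an in-process stand-in for a Memcached client, and `render_page` is a hypothetical placeholder for the PHP and MySQL work a real page needs.

```python
import time


class DictCache:
    """In-process stand-in for a Memcached client (get/set with a TTL)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry has expired, so treat it as a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


def render_page(path):
    """Hypothetical slow path: PHP plus MySQL queries on a real site."""
    render_page.calls += 1
    return f"<html>rendered {path}</html>"


render_page.calls = 0


def get_page(cache, path, ttl_seconds=300):
    """Cache-aside: try the cache first, render and store on a miss."""
    key = f"page:{path}"
    html = cache.get(key)
    if html is None:
        html = render_page(path)
        cache.set(key, html, ttl_seconds)
    return html


cache = DictCache()
get_page(cache, "/hello-world/")  # miss: renders and stores the page
get_page(cache, "/hello-world/")  # hit: served from the cache
print("renders:", render_page.calls)
```

Because every app server talks to the same external cache, a page rendered by one server is a hit for all of the others.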

Once I enabled W3 Total Cache and re-ran the last test I got some pretty great results.

Boom.

With W3 Total Cache enabled and 5 app servers we settled in at ~370 requests/second. More impressive is that we only saw 5 failures during the entire test. For perspective, Kernl pushed 1,329,470 requests at the WordPress cluster I created. That’s a failure rate of roughly 0.0004%.
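
The failure-rate arithmetic can be checked with a couple of lines. This is just the division spelled out, using the request and failure counts quoted above; the function name is my own.

```python
def failure_rate_percent(failures, total_requests):
    """Failures as a percentage of all requests sent."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return failures / total_requests * 100


rate = failure_rate_percent(5, 1_329_470)
print(f"{rate:.4f}%")
```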

My favorite part of this test was the response time distribution. Without having to wait on MySQL for queries the response times became crazy good.

The “bad” outlier is only 2.5s.

99% of requests finished in 29ms. And the outlier at 100% was only 2.5 seconds. Not bad for WordPress.

Going Further

Being the good software developer that I am I wanted to push this setup to its limits. So I decided to try a test that is an order of magnitude more difficult:

20,000 users, 10 users/sec ramp up, for 60 minutes, from London

Things didn’t go great but not because of WordPress. I won’t show any graphs of this test but I started to get limited by the network card on the NFS/Memcached machine. Digital Ocean says that I can expect around 30MB/sec out of a given droplet and with this test I was starting to bump in to that limit. If I wanted to test it further I would have had to load balance Memcached which felt a little bit outside of scope. In a real production scenario I would likely pay for a hosted Memcached service to deal with this for me.
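
To get a feel for why the network becomes the ceiling, here is a rough sketch that converts a bandwidth limit into a maximum request rate. The ~30MB/sec figure comes from the paragraph above; the per-request payload sizes are hypothetical (I did not measure them), and 1MB is treated as 1,000,000 bytes.

```python
def max_requests_per_sec(bandwidth_mb_per_sec, payload_kb_per_request):
    """Upper bound on req/s if every request moves payload_kb over the link."""
    bytes_per_sec = bandwidth_mb_per_sec * 1_000_000
    bytes_per_request = payload_kb_per_request * 1_000
    return bytes_per_sec / bytes_per_request


# Hypothetical amounts of cached data fetched from Memcached/NFS per request.
for payload_kb in (20, 60, 150):
    limit = max_requests_per_sec(30, payload_kb)
    print(f"{payload_kb}KB per request -> at most {limit:.0f} req/s")
```

The more data each request pulls from the shared machine, the sooner its network card becomes the bottleneck, no matter how many app servers sit in front of it.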

Conclusions

With Kernl I’m always weighing the build versus buy question when it comes to infrastructure and services. Given how much effort I had to put in to making this setup horizontally scalable and how much effort it would take to make it reproducible and manageable, it hardly seems worth creating and managing my own infrastructure.

Aside from my time the cost of the hardware was also not cheap.

  • Load Balancer – $10 / month
  • MySQL Database – $27 / month
  • Memcached (if separate from NFS) – $5 / month
  • NFS Mount (if separate from Memcached) – $5 / month
  • Application Servers – $25 / month ($5 / month * 5 servers)
  • Total – $72 / month
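
The total can be reproduced, and adjusted for the shared NFS/Memcached droplet actually used in these tests, with a short sketch. The prices are the ones listed above; the `shared_nfs_memcached` flag is my own illustration.

```python
MONTHLY_COSTS_USD = {
    "Load Balancer": 10,
    "MySQL Database": 27,
    "Memcached": 5,
    "NFS Mount": 5,
    "Application Servers": 5 * 5,  # $5 / month * 5 servers
}


def monthly_total(costs, shared_nfs_memcached=False):
    """Sum the monthly costs, optionally sharing one droplet for NFS and Memcached."""
    total = sum(costs.values())
    if shared_nfs_memcached:
        # One droplet serves both roles, so drop the separate Memcached line.
        total -= costs["Memcached"]
    return total


print(monthly_total(MONTHLY_COSTS_USD))                             # 72
print(monthly_total(MONTHLY_COSTS_USD, shared_nfs_memcached=True))  # 67
```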

At $72 / month I could easily have any of the managed WordPress hosting companies (GoDaddy, SiteGround, WPEngine, etc.) run my setup, handle updates, security, etc. The only potential hiccup is the traffic limits they place on your account. This setup can handle millions of requests per day and while their setups can too, they’ll charge you a hefty fee for it.

As with any decision about hardware and scaling the choice varies from person to person and organization to organization. If you have a dedicated Ops team and existing hardware, maybe scaling on your own hardware makes sense. If you’re a WordPress freelancer and don’t want to worry about it, maybe it doesn’t. IMHO I wouldn’t scale WordPress on my own. I’d rather leave it to the professionals.


Adventures in Scalable WordPress Hosting: Part 1

If you follow the Kernl Blog you’ll know that recently I’ve been writing about load testing different managed WordPress cloud providers. Half of the reason for doing this is to shake out any bugs in Kernl’s WordPress load testing platform and the other half is to learn what’s out there in terms of managed WordPress hosting.

As I went through the first round of tests I kept thinking: “I wonder how they achieve that level of performance with WordPress?”. This blog post and the post that will follow it are a chronicle of my attempts to scale WordPress to the levels that these managed cloud providers are achieving in an economical fashion.

The Tests

Having done a handful of load tests against other cloud providers I figured that I should hold myself to the same tests. The scale I’m going to try and achieve is:

  1. 200 concurrent users for 10 minutes.
  2. 2000 concurrent users for 2 hours.
  3. 20000 concurrent users for 1 hour.

The first test is just to shake out bugs in the load test, but I have seen some providers start to throw errors at that level. The second test is testing for sustained load. And the third test is simulating a heavy traffic spike.

So. Basic.

To get things started I created a super basic WordPress install on a $5/month Digital Ocean droplet. The droplet specs:

  • 1 CPU
  • 1GB RAM
  • 1000GB data transfer
  • Ubuntu 18.10

I chose to use the LEMP stack instead of the LAMP stack mostly because I’m more familiar with tuning Nginx for performance. I followed the guide at https://www.digitalocean.com/community/tutorials/how-to-install-linux-nginx-mysql-php-lemp-stack-ubuntu-18-04 to get things running. The software specs:

  • PHP 7.2
  • Nginx 1.15.5
  • MySQL 5.7.24

The first test went really well. I didn’t performance tune anything and didn’t have any sort of cache enabled. After 10 minutes we had settled into 35 requests / second and didn’t see any failures at all.

So. Much. Blue.

For 90% of people this is probably more performance than they would ever need. The response time distribution was even awesome. 100% of requests finished in ~500ms.

Not bad for 1 hour of work and $5

And Then The Wheels Fell Off

After my early success with the basic 200 user load test I thought it was time to throw some serious load at my WordPress install. This time I did the 2000 concurrent users for 2 hours test. At this point there still wasn’t any caching plugin installed.

Things did not go well

As you can see things didn’t go well. We peaked at around 40 requests/s but then our failure rate started to increase in a really bad way. You can also see that we sorta stopped fielding requests after a while. Looking at the system load information, you can see why things went poorly. The $5 droplet just couldn’t handle any more.

The poor $5 droplet was tapped out

As you would expect in this situation, the response time distribution was pretty dismal. In fact, this is the worst response time distribution that I’ve seen in all the load testing that I’ve performed 🙂

That’s right: 2% of requests took over 500s to return 🙁

After reaching the max capacity of the $5 droplet with no tuning, it was time to try and scale.

WP Super Cache Me

WP Super Cache is a caching plugin that generates static HTML files of your WordPress site. For read-heavy sites it’s tough to beat in terms of performance. The blog that I’m load testing with definitely falls into this category so it was the right choice for this test.

This test was simply a repeat of the last test (2000 users, 2 hours, etc) but with caching enabled. The results were pretty great.

135 req/s is respectable for $5/month

With WP Super Cache enabled on the $5 droplet we were able to field around 135 req/s, however you can see that our error rate was elevated during much of the test. If you expect to see this sort of traffic on a regular basis then this isn’t a great outcome but still pretty respectable for $5/month. The response time distribution tells a different story though:

33% of requests finished in > 10 seconds :/

What’s the point of serving 135 req/s if it takes more than 10s per request for 33% of your users? People are just going to close the tab after 1 second so we obviously have some more work to do.

Scale Me Up

When scaling any website you have 2 options (and they aren’t mutually exclusive):

  1. Scale up (vertically)
  2. Scale out (horizontally)

Scaling up is usually the easiest thing to do because you’re basically throwing more hardware at the problem. Digital Ocean makes scaling up really easy so I decided to give that a go first. This test was once again just a repeat of the 2000 users for 2 hours test but with better hardware. I upgraded from 1 CPU to 3 CPUs which seemed like the right choice given that it didn’t appear that memory was the problem in my previous tests.

3 CPUs -vs- 1 CPU

So how did it go? Real good actually. Once all the load test users were sending requests we settled in at 344 requests / second. If that rate continued all day that comes out to over 29 million requests. Not bad for $15/month.
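
The daily figure is simply the sustained rate multiplied by the number of seconds in a day, as this small sketch shows (the function name is my own):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_day(req_per_sec):
    """Total requests served in a day at a constant request rate."""
    return req_per_sec * SECONDS_PER_DAY


print(f"{requests_per_day(344):,}")  # 29,721,600
```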

So. Many. Requests.

We’re still seeing some failures, but relative to the number of requests it is much lower than the previous test. We can do better but that will likely take some more vertical or horizontal scaling. But what about the response times? Turns out adding more CPUs helped out quite a bit.

This is MUCH better.

100% of our requests finished in under 1.6s. While not SUPER fast it is still a respectable showing for the sort of load that this box was receiving. Even more impressive is that 90% of requests finished in under 100ms and some of that could be attributed to latency. The droplet was spun up in NYC3 and the load test generators were in Toronto, Canada.

Conclusions

The biggest selling point (for me) with WordPress is that it’s easy. With very little configuration or effort I was able to get a WordPress installation serving > 300 req/s. Sure it wasn’t perfect. I am still getting elevated error rates and vertical scaling can only take us so far. But this is likely good enough for almost anyone.

Part II

In part 2 of this series I’ll attempt to scale WordPress horizontally by using shared block storage to host the WordPress file system, a dedicated MySQL machine, and a bunch of application servers running behind a load balancer. The goal is to serve 20,000 (or more!) concurrent users for 1 hour without any errors and response times below 1 second. Follow @kernl_ on Twitter to be notified when part 2 is published!

Load Testing the ChemiCloud Managed WordPress Hosting Service

At the beginning of December Kernl launched a closed beta for our WordPress Load Testing service. As part of the bug shakedown we’ve been spending some time load testing different managed WordPress hosting services. Some of our previous tests include WordPress.com, CloudWays, and GoDaddy. For this test, we turned our sights on ChemiCloud.

How do we judge the platform?

Using Kernl’s load testing feature we run 3 different load tests against the target system.

  • The Baseline – This is a simple baseline load test that we use to verify that our configuration is correct and that the target can handle even minor traffic. It consists of 200 concurrent users, for 10 minutes, ramping up at 2 users / second, with traffic originating in San Francisco.
  • Sustained Traffic – The sustained traffic test mimics what traffic might look like for a read-heavy website with a lot of visitors. This load test consists of 2000 concurrent users, for 2 hours, ramping up at 2 users / second, with traffic originating from San Francisco.
  • Traffic Spike – This test is brutal. We use it to mimic the sort of traffic that your WordPress site might experience if a link to it were shared by a Twitter or Instagram celebrity. The load test consists of 20,000 concurrent users, for 1 hour, ramping up at 10 users / second, with traffic originating from San Francisco.

All traffic for this test is generated out of Digital Ocean’s SFO2 data center.

What ChemiCloud plan was used?

ChemiCloud has several different tiers for managed WordPress hosting. We decided on the “Oxygen” plan. At a high level this seemed to align well with the hosting that we tested thus far.

ChemiCloud – Oxygen Plan

Caveats

This load test is intentionally simple. It is read heavy. Many WordPress sites have this sort of traffic profile, but not all do. If you need to perform a WordPress load test with a different traffic profile Kernl supports this. Ideally we should also do multiple tests over time to make sure that this test wasn’t an outlier. Future load test articles will hopefully include this sort of rigor but for now this test can give you reasonable confidence in how you can expect ChemiCloud to perform under a read-heavy load.

The Baseline Test

200 concurrent users, 2 users / s ramp up, 10 minutes, SFO

As most of the hosting providers that we test do, ChemiCloud performed well on the baseline test. They settled in at right around 25 requests / second.

ChemiCloud – Requests

We did see a few failures towards the end of the test, but it appears that it was only a spike. Once the spike passed we didn’t see any more errors for the duration of the test.

ChemiCloud – Failures

The response time distribution for ChemiCloud was solid for this baseline test. 99% of requests finished in 550ms. If we go further down the distribution you can see that 95% of requests finished in ~250ms which is quite good. Even the 100% outlier still wasn’t that bad.

ChemiCloud – Response Time Distribution

Sustained Traffic Test

2000 concurrent users, 2 users / s ramp up, 2 hours, SFO

For the sustained traffic test ChemiCloud did a great job serving requests while keeping response times down. As you can see from the graph below, the test settled in to right around 260 requests / second. The journey to that many users was smooth and there aren’t any surprises on the graph.

ChemiCloud – Requests

There were a few failures during the test period, but it appears that they were only a temporary blip. You can see that about half-way through the test we ran into ~32 failures. After that we didn’t see any more for a while, and then we had one more before not seeing any again for the rest of the test. For some perspective, we performed 1,861,230 requests against ChemiCloud and only 33 failed. That’s a failure rate of 0.0017%! Nice work team ChemiCloud.

ChemiCloud – Failures

The response time distribution was pretty great for the sustained test as well. While there was an outlier at 100% (which is common), 99% of requests finished in under 400ms. That’s an effort worthy of praise with WordPress!

ChemiCloud – Response Time Distribution

Traffic Spike Test

20000 concurrent users, 10 users / s ramp up, 1 hour, SFO

The traffic spike load test is brutal for any host. Nobody ever expects to see this kind of traffic out of nowhere so few are prepared for it. ChemiCloud handled the traffic rather well though. We eventually reached 1200 requests / second which is pretty impressive for a plan that costs $17.95 a month. There weren’t any surprises on the way up to that level of traffic, but as you’ll see we did start to see error rates increase.

ChemiCloud – Requests per Second

At about 15 minutes into the load test we started to see an uptick in failure rates. The rate of failure stayed consistent throughout the test after that. This is a fairly common pattern when hosts become overloaded with traffic. In general ChemiCloud performed well even with these failures. We sent 4,332,244 requests to ChemiCloud over an hour period and 134,893 failed. For this sort of load test a failure rate of 3.1% isn’t bad.

ChemiCloud – Failures

The most interesting graph from this load test was the response time distribution. You would expect to see a general degradation of response time performance as request failures increased but that wasn’t the case at all. Everything below the 99th percentile performed remarkably well considering the traffic we threw at it. 98% of requests finished in under 370ms. Great work!

ChemiCloud – Response Time Distribution

Conclusions

ChemiCloud competes well with the other hosts that we’ve tested. They have a solid price-point and you get a lot of control over your WordPress environment. If you need a host that can handle some solid traffic spikes they are a good choice.

Want to be part of the Kernl WordPress Load Testing Beta? Sign up and then send an email to jack@kernl.us

Load Testing the CloudWays Managed WordPress Service

At the beginning of December Kernl launched the closed beta of its WordPress load testing service. As a test to shake out any bugs we’ve decided to run a blog series load testing managed WordPress services. Today we’re going to talk about the CloudWays managed WordPress service. In particular, CloudWays deployed to Vultr.

How is the platform judged?

CloudWays will be tested using 3 different load tests:

  • The Baseline – This is a 200 concurrent user, 10 minute, 2 users/s ramp up test from San Francisco. This test is used to verify the test configuration and to make sure that CloudWays doesn’t go belly-up before we get started 🙂
  • The Sustained Traffic Test – This test is for 2000 concurrent users, ramps up at 2 users/s, from San Francisco, for 2 hours. The sustained traffic test represents a realistic load for a high traffic site.
  • The Traffic Spike Test – This test is intentionally brutal. It simulates 20000 concurrent users, ramps up at 10 users/s, from San Francisco, for 1 hour. It represents the sort of traffic pattern you might see if a Twitter celebrity shared a link to your blog.
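
To put the ramp-up rates in perspective, here is a small sketch that computes how long each test takes to reach its full user count. It assumes a linear ramp, which may not match how the load testing tool actually schedules users; the `LoadTest` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadTest:
    name: str
    concurrent_users: int    # target number of simulated users
    ramp_users_per_sec: int  # users added per second during ramp-up
    duration_minutes: int    # total length of the test

    def ramp_up_minutes(self) -> float:
        """Minutes needed to reach the target user count, assuming a linear ramp."""
        return self.concurrent_users / self.ramp_users_per_sec / 60


# Values taken from the three test descriptions above.
TESTS = [
    LoadTest("Baseline", 200, 2, 10),
    LoadTest("Sustained Traffic", 2000, 2, 120),
    LoadTest("Traffic Spike", 20000, 10, 60),
]

for test in TESTS:
    print(
        f"{test.name}: full load after ~{test.ramp_up_minutes():.1f} "
        f"of {test.duration_minutes} minutes"
    )
```

Under this assumption the traffic spike test spends more than half of its hour ramping up, so the full 20000 users are only present for the later part of the run.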

What CloudWays plan was used?

For this test we used the lowest tier plan available while hosting on Vultr. The cost of the plan is $11 / month and includes full SSH access to the box that CloudWays deploys your WordPress instance on.

CloudWays $11 / month plan hosted on Vultr
Selected CloudWays Plan

Where does the traffic originate?

The traffic for this load test originates in Digital Ocean’s SFO2 (San Francisco) data center. The Vultr server lives in their Seattle data center.

The baseline load test

200 concurrent users, 2 users / s ramp up, 10 minutes, SFO

The baseline WordPress load test that we did with CloudWays is used to test configuration. CloudWays performed well on this test. You can see from the request graph that we settled in at around 25 requests / second.

CloudWays Baseline Test – Requests per second

The failure graph for the baseline load test was empty, which is generally expected for the baseline test.

CloudWays Baseline Test – Failures

Finally, here is the response time distribution graph for the baseline test. You can see that 99% of the requests finished in ~200ms. There was at least one outlier at the ~5000ms mark, but this isn’t uncommon for load tests.

CloudWays Baseline – Response Time Distribution

The sustained heavy traffic load test

2000 concurrent users, 2 users / s ramp up, 2 hours, SFO

The sustained traffic load test represents what a WordPress site with high readership might look like day over day. The CloudWays setup responded quite well for the hardware that it was on.

CloudWays Sustained Load Test – Requests

You can see that performance was great for the first 10% of the test. The CloudWays setup had no trouble handling the load thrown at it. However, once we started getting to around 85 requests / second the hardware had trouble keeping up with the request volume. You can see from the choppy behavior of the request graph that the Varnish server which sits in front of WordPress was starting to get overwhelmed by the request volume. Considering that this particular CloudWays plan was deployed to a low-level Vultr VM, this performance isn’t bad at all.

The failure graph was a little disappointing, but not unexpected knowing the hardware that we tested on. It is very likely that if we tested on a more robust underlying Vultr box we would have had much better results. You can see that failures increased at a fairly linear rate through the whole load test.

CloudWays Sustained Load Test – Failures

The final graph for this test is the response distribution graph. This graph shows you for a given percentage of requests how many milliseconds they took to complete. In this case CloudWays didn’t perform great, but once again I’ll point to the fact that the underlying Vultr hardware isn’t that robust.

CloudWays Sustained Load Test – Response time distribution

From the graph you can see that 99% of requests completed in ~95 seconds. Yes, you read that correctly. You can interpret this graph as you like but taking the other graphs into consideration you can see that Varnish and the underlying Vultr hardware were completely overwhelmed. Knowing that makes this a little less terrible. We suspect that a smaller load test (maybe 750 concurrent users?) might yield a far better response time distribution. Once a server becomes overwhelmed the response time distribution tends to go in a bad direction.
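
For readers unfamiliar with response time distributions, the idea can be sketched in a few lines. This is an illustration only: it uses the nearest-rank percentile method (load testing tools differ in the exact method they use), and the sample timings are made up rather than taken from the tests above.

```python
import math


def percentile(samples_ms, pct):
    """Response time at or below which `pct` percent of requests finished (nearest-rank)."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


# Made-up response times in milliseconds, for illustration only.
samples = [120, 135, 150, 160, 180, 200, 240, 300, 900, 5000]

for pct in (50, 80, 99):
    print(f"{pct}% of requests finished in <= {percentile(samples, pct)} ms")
```

Note how a single slow request dominates the 99th percentile while the median stays low, which is why the high percentiles are the first to degrade when a server is overwhelmed.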

The traffic spike load test

20000 concurrent users, 10 users / s ramp up, 1 hour, SFO

Given what we know about the sustained traffic load test your expectations for how this test went are probably spot on. CloudWays did as well as can be expected with how the underlying hardware is allocated, but you would likely need to upgrade to a much larger plan to handle this level of traffic. We ended up stopping this load test after about 30 minutes due to the increased failure rate.

CloudWays Traffic Spike Load Test – Requests per Second

The requests per second never really leveled out. It isn’t clear what the underlying reason was for the uneven level at the top of the graph. Regardless, top-end performance was similar to the sustained traffic test.

The failure chart looks as we expected it to. After a certain point we start to see increased failure rates. They continue up and to the right in a mostly linear fashion.

CloudWays Traffic Spike Load Test – Failures

The response time distribution is really bad for this test.

CloudWays Traffic Spike Load Test – Response Time Distribution

As you can see 80% of the requests finished in < 50s which means that 20% of the requests took longer than that. The 99% mark was only reached after > 200s, at which point the user is likely long gone.

Conclusions

For $11 / month the CloudWays managed WordPress installation did a great job, but there are better performers out there in the same price range (GoDaddy for instance). For the sake of this review, which only looks at raw performance, CloudWays probably isn’t the best choice. But if you’re looking for good-enough performance with extreme flexibility then you would be hard pressed to find a better provider.

Want to run load tests against your own WordPress sites? Sign up for Kernl now!

What’s New With Kernl – November 2018

It has been a busy few months for Kernl. Lots of great work has gone into the WordPress load testing feature, as well as a few structural changes to increase reliability.

  • Cache moved to Redis – For as long as Kernl has existed our cache backend was powered by Memcached. We have now finished migrating to Redis hosted at Compose.io.
  • AngularJS Upgrade to 1.7.5 – Fairly straight-forward upgrade to AngularJS 1.7.5. We wanted to take advantage of performance improvements and a few bug fixes.
  • WordPress Load Testing – Over the past few months we’ve been cooking up something new. Imagine if you could easily test performance changes to your own or your client’s WordPress installation. Or be able to tell your client with confidence how many customers at a time their site can support (and what their experience will be like!). What if you could do all this without writing a single line of code or spinning up your own testing infrastructure? We’re ready to start beta testing so send an email to jack@kernl.us if you would like to be a part of it.

What’s New With Kernl – September 2018

It was a great month for Kernl! We didn’t do much in the way of user-facing features, but we did accomplish a lot of great infrastructure work.

Features, Bugs, & Infrastructure

  • Analytics Domain Search Speed Improvements – Prior to this work it could take up to 5 seconds to filter through the list of domains in Kernl Analytics. You can now search with sub-second response times due to a well-placed index in Postgres.
  • Daily Aggregates Cleanup – This task encompassed cleaning up the millions of rows of aggregate data that Kernl was hanging on to. We weren’t in jeopardy of running out of space, but a table of 300K rows is faster to query than a table of 20M rows.
  • Analytics Domain List Clickable URLs – A customer suggested that the urls in the domain list should be clickable, so now they are!
  • Marketing Site Scaled Images – Kernl now serves properly scaled images for the marketing site.
  • GZIP Compression – All /static routes now serve resources using GZIP compression.
  • MongoDB Upgrade – We’re now on Mongo 3.4 and the database is hosted in AWS via Compose.io.
  • Node.js Upgrade – Kernl now runs on Node.js 8.11.4. This upgrade addresses some security issues and bug fixes.
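
To illustrate why a well-placed index makes the kind of difference described in the domain search item above, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in because it runs without any setup; Kernl Analytics uses PostgreSQL, and the table, column, and index names here are hypothetical. The same idea applies in Postgres, where `EXPLAIN` shows whether a query uses an index.

```python
import sqlite3

# In-memory SQLite database as a stand-in; Kernl Analytics itself uses PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domains (id INTEGER PRIMARY KEY, domain TEXT, hits INTEGER)")
conn.executemany(
    "INSERT INTO domains (domain, hits) VALUES (?, ?)",
    ((f"site{i}.example.com", i) for i in range(10_000)),
)

QUERY = "SELECT id, hits FROM domains WHERE domain = ?"


def query_plan(sql, params):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(QUERY, ("site42.example.com",))
conn.execute("CREATE INDEX idx_domains_domain ON domains (domain)")
plan_after = query_plan(QUERY, ("site42.example.com",))

print("before:", plan_before)  # a scan over the whole table
print("after: ", plan_after)   # a lookup through idx_domains_domain
```

Without the index the database has to examine every row to find a match; with it, the lookup goes straight to the matching entries.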

August 14 2018 Outage Post-Mortem

Below is a post-mortem analysis of Kernl’s August 14 2018 outage. It details what happened, why it happened, and how we can improve things so that this doesn’t happen again in the future.

What?

At 9:29PM EDT alerts were triggered saying that Kernl was down. Initial investigation showed that the marketing site was still up and that some (but not all) update requests were still going through. Upon further investigation, it was found that Kernl’s connection to MongoDB had stopped working.

In most situations the DB connection dropping would have caused only a momentary blip while we failed over to our secondary, but in this case that wasn’t possible.

Why?

Further investigation revealed that our Mongo provider (Compose.io) was experiencing an outage in some of their Digital Ocean environments. Unfortunately this outage affected not only our primary Mongo host but also our secondary host. Due to the nature of the outage automatic failover for Mongo wasn’t a possibility.

Actions Taken

  • Determined the source of the downtime.
  • Contacted Compose.io support to resolve the issue.
  • After not hearing back from support for 30 minutes, work was started on an alternate plan for bringing Kernl back up.
  • After 45 minutes of no response from Compose.io support the alternate plan was enacted.
    • The alternate plan was to restore Kernl’s Mongo cluster into a different data center using the daily backup. The only downside was that the backup was 6 hours old, which means some customers may need to re-build or re-upload some plugin and theme versions.
  • Cache lifetime for all Kernl endpoints was doubled. This was done because the new data center was outside of Digital Ocean NYC3. The increased cache lifetime helps combat the increased latency.
  • Kernl’s downtime ended at roughly 11:00PM. The Compose.io incident wasn’t resolved until several hours after this so we feel that the decision to restore from a backup was the right one.

What’s Next?

Compose.io has been our Mongo provider for years now and we’ve never experienced any significant downtime. That being said, they don’t actually support Digital Ocean anymore and plan to kill their support for it at some point in the future.

Our next steps are to evaluate how Kernl performs with the database in another data center. If things look good, we will likely move Kernl’s Mongo instances to the new data center permanently where they can be better supported by Compose. It is our suspicion that if we had been in one of their more popular data centers we would have received help faster.

Once again, apologies for the downtime and we’ll continue to work hard so that it doesn’t happen again!

Building & Scaling Kernl Analytics

Over the past 3 years I’ve often received requests from new and existing Kernl customers for some form of analytics on their plugin/theme. I avoided doing this for a long time because I wasn’t sure that I could do so economically at the scale Kernl operates at, but I eventually decided to give Kernl Analytics a whirl and see where things ended up.

kernl analytics product versions
Product Versions Graph

Concerns

After deciding to give the analytics offering a try, I had to figure out how to build it. When I first set out to build Kernl Analytics I had 3 main concerns:

  • Cost – I’ve never created a web service from scratch that needs to INSERT data at 75 rows per second with peaks of up to 500 rows per second. I wanted to be sure that running this service wouldn’t be prohibitively expensive.
  • Scale – How much would I need to distribute the load? This is tightly coupled to cost.
  • Speed – This project is going to generate a LOT of data by my standards. Can I query it in a performant manner?

As development progressed I realized that cost and scale were non-issues. The database that I chose to use (PostgreSQL) can easily withstand this sort of traffic with no tweaking, and I was able to get things started on a $5 Digital Ocean droplet.
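
To get a feel for what those insert rates mean in terms of data volume, here is a back-of-the-envelope calculation. It assumes the average rate is sustained around the clock, which is a simplification, and it ignores any later aggregation or cleanup.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Insert rates quoted in the "Cost" concern above.
average_rows_per_sec = 75
peak_rows_per_sec = 500

# Rows written per day if the average rate held for a full day.
rows_per_day = average_rows_per_sec * SECONDS_PER_DAY
peak_to_average = peak_rows_per_sec / average_rows_per_sec

print(f"Rows per day at the average rate: {rows_per_day:,}")  # 6,480,000
print(f"Peak is {peak_to_average:.1f}x the average rate")      # 6.7x
```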

Kernl Analytics Architecture & Technology

Kernl Analytics was created to be its own micro-service with no public access to the world. All access to it is behind a firewall so that only Kernl’s Node.js servers can send requests to it. For data storage, PostgreSQL was chosen for a few reasons:

  1. Open Source
  2. The data is highly relational
  3. Performance

The application that captures the data, queries it, and runs periodic tasks is a Node.js application written in TypeScript. I chose TypeScript mostly because I’m familiar with it and wanted type safety so I wouldn’t need to write as many tests.

kernl analytics and typescript
TypeScript FTW!

With regards to the size of the instance that Kernl Analytics is running on, I currently pay $15/month for a 3-core Digital Ocean droplet. I upgraded to 3 cores so that Postgres could easily handle both writes and multiple read requests at the same time. So far this setup has worked out well!

Pain Points

Overall things went well while implementing Kernl Analytics. In fact they went far better than expected. But that doesn’t mean there weren’t a few pain points along the way.

  • Write Volume – Kernl’s scale is just large enough to cause some scaling and performance pains when creating an analytics service. Kernl averages 25 req/s, which translates to roughly 75 INSERTs per second into Postgres. Kernl also has peaks of 150 req/s, which scales up to about 450 INSERTs per second. Postgres can easily handle this sort of load, but doing it on a $5 Digital Ocean droplet was taxing to say the least.
  • Hardware Upgrade – I tried to keep costs down as much as possible with Kernl Analytics, but in the end I had to increase the size of the droplet I was using to a $15 / 3-core droplet. I ended up doing that so one or two cores could be dedicated to writes while leaving a single core available for read requests. Postgres determines what actions are executed where, but adding more cores has led to a lot less resource contention.
  • Aggregation – Initially the data wasn’t aggregated at all. This caused some pain because even with some indexing, plucking data out of a table with > 2.5 million rows can be sort of slow. It also didn’t help that I was writing data constantly to the table, which further slowed things down. Recently I solved this by doing daily aggregations for Kernl Analytics charts and domain data. This has improved speed significantly.
  • Backups & High Availability – To keep costs down the analytics service is not highly available. This is definitely one of those “take out some tech debt” items that will need to be addressed at a later date. Backups also happen only on a daily basis, so it’s possible to lose a day of data if something serious goes wrong.

kernl analytics server load
Yay for affordable hosting
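
The daily aggregation described in the Aggregation item above can be sketched roughly as follows. The event shape, field names, and sample rows are hypothetical, and in Kernl Analytics this work presumably happens against Postgres rather than in application memory; the point is only to show how many raw rows collapse into a few per-day counts.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events: (timestamp, domain, product version).
raw_events = [
    (datetime(2018, 9, 1, 8, 15), "a.example.com", "1.2.0"),
    (datetime(2018, 9, 1, 9, 30), "b.example.com", "1.2.0"),
    (datetime(2018, 9, 1, 17, 5), "a.example.com", "1.3.0"),
    (datetime(2018, 9, 2, 7, 45), "a.example.com", "1.3.0"),
    (datetime(2018, 9, 2, 11, 0), "c.example.com", "1.3.0"),
]


def aggregate_daily(events):
    """Collapse raw events into one count per (day, version) pair."""
    counts = Counter()
    for timestamp, _domain, version in events:
        counts[(timestamp.date().isoformat(), version)] += 1
    return counts


daily = aggregate_daily(raw_events)
for (day, version), count in sorted(daily.items()):
    print(day, version, count)
```

Charts can then read from the small aggregate table instead of scanning millions of raw rows on every request.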

Future Plans

Kernl Analytics is a work in progress and there is always room to improve. Future plans for the architecture side of analytics are:

  • Optimize Indexes – I feel that more speed can be coaxed out of Postgres with some better indexing strategies.
  • Writes -vs- Reads – Once I gain a highly available setup for Postgres I plan to split responsibilities for writing and reading. Writes will go to the primary and reads will go to the secondary.
  • API – Right now the analytics API is completely private and firewalled off. Eventually I’d like to expose it to customers so that they can use it to do neat things.
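
As a rough idea of what the planned read/write split could look like at the application level, here is a minimal sketch. It is not Kernl's implementation: the `pick_target` helper and the target names are hypothetical, and a real setup would hand back actual database connections rather than strings.

```python
# Statements that modify data and therefore must go to the primary.
WRITE_KEYWORDS = ("INSERT", "UPDATE", "DELETE")


def pick_target(sql):
    """Route write statements to the primary and everything else to the secondary."""
    words = sql.split()
    first_word = words[0].upper() if words else ""
    return "primary" if first_word in WRITE_KEYWORDS else "secondary"


print(pick_target("INSERT INTO events (domain) VALUES ('a.example.com')"))  # primary
print(pick_target("select count(*) from events"))                           # secondary
```

One thing to keep in mind with this design is replication lag: a read sent to the secondary may not yet see a row that was just written to the primary.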