Load Testing the CloudWays Managed WordPress Service

At the beginning of December, Kernl launched the closed beta of its WordPress load testing service. To shake out any bugs, we've decided to run a blog series load testing managed WordPress services. Today we're going to talk about the CloudWays managed WordPress service, in particular CloudWays deployed to Vultr.

How is the platform judged?

Cloudways will be tested using 3 different load tests:

  • The Baseline – This is a 200 concurrent user, 10 minute, 2 users/s ramp-up test from San Francisco. This test is used to verify the test configuration and to make sure that Cloudways doesn't go belly-up before we get started 🙂
  • The Sustained Traffic Test – This test is for 2000 concurrent users, ramps up at 2 users/s, from San Francisco, for 2 hours. The sustained traffic test represents a realistic load for a high traffic site.
  • The Traffic Spike Test – This test is intentionally brutal. It simulates 20000 concurrent users, ramps up at 10 users/s, from San Francisco, for 1 hour. It represents the sort of traffic pattern you might see if a Twitter celebrity shared a link to your blog.
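For reference, the three profiles boil down to a handful of parameters. The sketch below summarizes them as plain data; the field names are illustrative and are not Kernl's actual configuration format.

```typescript
// Illustrative summary of the three load test profiles described above.
// Field names are made up for this sketch, not Kernl's configuration format.
interface LoadTestProfile {
  name: string;
  concurrentUsers: number;   // peak number of simulated users
  rampUpUsersPerSec: number; // how quickly users are added
  durationMinutes: number;
  origin: string;            // where the traffic is generated from
}

const profiles: LoadTestProfile[] = [
  { name: "Baseline", concurrentUsers: 200, rampUpUsersPerSec: 2, durationMinutes: 10, origin: "San Francisco" },
  { name: "Sustained Traffic", concurrentUsers: 2000, rampUpUsersPerSec: 2, durationMinutes: 120, origin: "San Francisco" },
  { name: "Traffic Spike", concurrentUsers: 20000, rampUpUsersPerSec: 10, durationMinutes: 60, origin: "San Francisco" },
];
```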

What CloudWays plan was used?

For this test we used the lowest tier plan available while hosting on Vultr. The cost of the plan is $11 / month and includes full SSH access to the box that CloudWays deploys your WordPress instance on.

CloudWays $11 / month plan hosted on Vultr
Selected CloudWays Plan

Where does the traffic originate?

The traffic for this load test originates in Digital Ocean's SFO2 (San Francisco) data center. The Vultr server lives in their Seattle data center.

The baseline load test

200 concurrent users, 2 users / s ramp up, 10 minutes, SFO

The baseline WordPress load test that we ran against CloudWays is used to verify the test configuration. CloudWays performed well on this test. You can see from the request graph that we settled in at around 25 requests / second.

CloudWays BaseLine Load Test Requests
CloudWays Baseline Test – Requests per second

The failure graph for the baseline load test was empty, which is generally expected for the baseline test.

CloudWays BaseLine Load Test Failures
CloudWays Baseline Test – Failures

Finally, the response time distribution graph for the baseline test. You can see that 99% of the requests finished in ~200ms. There was at least one outlier at the ~5000ms mark, but this isn't uncommon for load tests.

CloudWays BaseLine Load Test Response Time Distribution
CloudWays Baseline – Response Time Distribution

The sustained heavy traffic load test

2000 concurrent users, 2 users / s ramp up, 2 hours, SFO

The sustained traffic load test represents what a WordPress site with high readership might look like day over day. The CloudWays setup responded quite well for the hardware that it was on.

CloudWays Sustained Heavy Traffic Load Test - Requests
CloudWays Sustained Load Test – Requests

You can see that performance was great for the first 10% of the test. The CloudWays setup had no trouble handling the load thrown at it. However, once we started getting to around 85 requests / second, the hardware had trouble keeping up. You can see from the choppy behavior of the request graph that the Varnish server which sits in front of WordPress was starting to get overwhelmed by the request volume. Considering that this particular CloudWays plan was deployed to a low-level Vultr VM, this performance isn't bad at all.

The failure graph was a little disappointing, but not unexpected given the hardware that we tested on. It is very likely that if we had tested on a more robust underlying Vultr box we would have had much better results. You can see that failures increased at a fairly linear rate throughout the whole load test.

CloudWays Sustained Heavy Traffic Load Test - Failures
CloudWays Sustained Load Test – Failures

The final graph for this test is the response time distribution graph. This graph shows, for a given percentage of requests, how many milliseconds they took to complete. In this case CloudWays didn't perform great, but once again I'll point to the fact that the underlying Vultr hardware isn't that robust.

CloudWays Sustained Heavy Traffic Load Test - Response Time Distribution
CloudWays Sustained Load Test – Response time distribution

From the graph you can see that 99% of requests completed in ~95 seconds. Yes, you read that correctly. You can interpret this graph as you like but taking the other graphs into consideration you can see that Varnish and the underlying Vultr hardware were completely overwhelmed. Knowing that makes this a little less terrible. We suspect that a smaller load test (maybe 750 concurrent users?) might yield a far better response time distribution. Once a server becomes overwhelmed the response time distribution tends to go in a bad direction.
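To make the distribution graphs in this article concrete, here is a minimal sketch of the standard nearest-rank percentile calculation that sits behind a chart like this. It is not Kernl's implementation, just the general idea.

```typescript
// Minimal sketch of the nearest-rank percentile calculation behind a
// response time distribution chart. Not Kernl's implementation.
function percentile(responseTimesMs: number[], pct: number): number {
  if (responseTimesMs.length === 0) return NaN;
  const sorted = [...responseTimesMs].sort((a, b) => a - b);
  // Nearest rank: the smallest sample such that at least `pct` percent
  // of all samples are at or below it.
  const rank = Math.ceil((pct / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Example: if percentile(samples, 99) returns ~95000, then 99% of
// requests completed within ~95 seconds, as in the graph above.
```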

The traffic spike load test

20000 concurrent users, 10 users / s ramp up, 1 hour, SFO

Given what we know about the sustained traffic load test, your expectations for how this test went are probably spot on. CloudWays did as well as could be expected given how the underlying hardware is allocated, but you would likely need to upgrade to a much larger plan to handle this level of traffic. We ended up stopping this load test after about 30 minutes due to the increased failure rate.

CloudWays Traffic Spike Load Test - Requests
CloudWays Traffic Spike Load Test – Requests per Second

The requests per second never really leveled out. It isn’t clear what the underlying reason was for the uneven level at the top of the graph. Regardless, top-end performance was similar to the sustained traffic test.

The failure chart looks as we expected it to. After a certain point we start to see increased failure rates. They continue up and to the right in a mostly linear fashion.

CloudWays Traffic Spike Load Test - Failures
CloudWays Traffic Spike Load Test – Failures

The response time distribution is really bad for this test.

CloudWays Traffic Spike Load Test - Response Time Distribution
CloudWays Traffic Spike Load Test – Response Time Distribution

As you can see, 80% of the requests finished in < 50s, which means that 20% of the requests took longer than that. The 99% mark was only reached after > 200s, at which point the user is likely long gone.

Conclusions

For $11 / month the CloudWays managed WordPress installation did a great job, but there are better performers out there in the same price range (GoDaddy, for instance). For the sake of this review, which only looks at raw performance, CloudWays probably isn't the best choice. But if you're looking for good-enough performance with extreme flexibility, then you would be hard-pressed to find a better provider.

Want to run load tests against your own WordPress sites? Sign up for Kernl now!

Load Testing GoDaddy’s Managed WordPress Service

Earlier this December Kernl launched a closed beta of our WordPress load testing service. As part of that beta we’ve decided to run a series of load tests against some of the common managed WordPress hosting services.

GoDaddy was chosen as our first load test target for a few different reasons:

  • GoDaddy has been around for ages.
  • They offer a managed WordPress platform.
  • They are fairly inexpensive for the service that they are offering.

How will providers be judged?

There are a number of different ways to judge a WordPress hosting provider. How reliable are they? Do they perform patches for you? What is their customer support like? How fast are they? For the purpose of our tests we’re focusing on raw speed under heavy load. We will only be judging the hosting providers on that metric. To test the speed of the hosting provider under heavy load we ran 3 tests:

  • The Small Test – 200 users, for 10 minutes, ramping up at a rate of 2 users per second. We did this test to check our configuration before we ran more intense load tests.
  • The Sustained Traffic Test – 2000 users, for 2 hours, ramping up at a rate of 2 users per second. This test was performed to see how GoDaddy’s WordPress hosting would perform under a sustained heavy load.
  • The Traffic Spike Test – 20000 users, for 1 hour, ramping up at a rate of 10 users per second. This test was used to determine how GoDaddy would handle lots of traffic coming in at once versus the slower ramp up of the sustained traffic test.

There was no configuration or tweaking done on the GoDaddy WordPress install. We simply imported the content of http://www.re-cycledair.com and started testing.

The GoDaddy WordPress Plan

An important part of this load test was which GoDaddy WordPress hosting plan was selected. As we're going to try to do this across multiple providers, we've opted to choose plans based roughly on price. This plan was the “Deluxe Managed WordPress” plan that costs $12.99 / month.

GoDaddy WordPress Deluxe plan

Load Test Location

For these three load tests we generated traffic out of Digital Ocean’s SFO2 (San Francisco, CA, United States) data center.

The Small Test

200 concurrent users, 10 minutes, 2 user / sec ramp up, San Francisco

The requests graph represents the number of requests per second that the site under load is serving successfully. From the graph below, you can see that GoDaddy had no problem serving 200 concurrent users. After the ramp up completed things settled in at around 25 requests / second.

GoDaddy load testing - requests
GoDaddy WordPress Hosting – 200 concurrent users

The failure graph shows that during this load test there weren’t any reported failures.

GoDaddy load testing - failures
GoDaddy WordPress Hosting – 200 concurrent users

The final graph of the 200 concurrent user small test is the distribution graph. This is probably the most important part of these tests because it helps you understand what your end user experience will be like when your site is under heavy load.

GoDaddy load testing - distribution
GoDaddy WordPress Hosting – 200 concurrent users

To understand the graph, select a column. We'll look at the 99% column. Now read the value of the column (~600ms). You can now say that for this load test, 99% of all requests finished in under 600ms. If you look at the 95% column you can see that the result is ~200ms, which is pretty fantastic. The 100% column is almost always an outlier, but even in this case having 1% of requests finish between 500ms – 2200ms seems ok.

The Sustained Traffic Test

2000 concurrent users, 2 hours, 2 user / sec ramp up, San Francisco

The requests graph for the sustained traffic test yielded a nice curve. The traffic ended up leveling out at 252 requests / second. The transition along the curve was smooth and there weren’t any obvious pain points for request throughput during the test.

GoDaddy load testing - requests
GoDaddy WordPress Hosting – 2000 concurrent users

The failure graph for this set of tests is particularly interesting. About 10 minutes into the test we see a HUGE spike in errors. After a short period of time the errors stop accumulating. I’m not sure what happened here, but I suspect that some sort of scaling event was triggered in GoDaddy’s infrastructure. After the scaling event completed they were able to continue serving traffic. We didn’t see any more errors for the rest of the test.

GoDaddy load testing - failures
GoDaddy WordPress Hosting – 2000 concurrent users

For the distribution graph of this load test, I would argue that GoDaddy performed very well under some fairly intense load. 99% of requests finished in 460ms. There is obviously an issue with that other 1%, but that was likely due to the weird error event that happened at around the 10 minute mark.

GoDaddy load testing - distribution
GoDaddy WordPress Hosting – 2000 concurrent users

Overall GoDaddy performed far better than I expected on the sustained traffic test. I personally haven’t used GoDaddy as a WordPress host in ages, but for this one metric (performance under load) I think they really did a great job.

The Traffic Spike Test

20000 concurrent users, 1 hour, 10 user / sec ramp up, San Francisco

The traffic spike test is absolutely brutal, but it is definitely the kind of traffic you can expect if an article or your site were shared by a Twitter celebrity with a large following.

The requests graph for this test is by far my favorite out of this entire article. It shows linear growth with no slowing down. For reasons highlighted later, I killed this test at the ~10 minute mark, but up until that point GoDaddy was a rocket ship. At the point I stopped the test we were running at 483 requests / second.

GoDaddy load testing - requests
GoDaddy WordPress Hosting – 20000 concurrent users

The failure graph for this test is interesting as well. You can see that all was well until about 9 minutes in, when errors increased sharply. I could have continued the load test but chose to stop it at this point due to the increased error rates. In hindsight I should have continued the test. Next time!

GoDaddy load testing - failures
GoDaddy WordPress Hosting – 20000 concurrent users

The most impressive aspect of the traffic spike test was the distribution chart. Even under some incredibly high load (for a WordPress site), GoDaddy still returned 99% of requests in under 500ms. Great work, team GoDaddy!

GoDaddy load testing - distribution
GoDaddy WordPress Hosting – 20000 concurrent users

Conclusions

For the single metric of speed and responsiveness under heavy load I think that GoDaddy’s managed WordPress solution did a fantastic job of handling the load that the Kernl WordPress load testing tool was throwing at it. If you have a site with really high traffic, GoDaddy should be on your list of hosts to check out.

Want to run load tests against your own WordPress sites? Sign up for Kernl now!

What’s New With Kernl – November 2018

It has been a busy few months for Kernl. Lots of great work has gone into the WordPress load testing feature, as well as a few structural changes to increase reliability.

  • Cache moved to Redis – For as long as Kernl has existed, our cache backend was powered by Memcached. We have now finished migrating to Redis hosted at Compose.io (a minimal sketch of the kind of cache wrapper involved follows this list).
  • AngularJS Upgrade to 1.7.5 – Fairly straightforward upgrade to Angular 1.7.5. We wanted to take advantage of performance improvements and a few bug fixes.
  • WordPress Load Testing – Over the past few months we've been cooking up something new. Imagine if you could easily test performance changes to your or your client's WordPress installation? Or be able to tell your client with confidence how many customers at a time their site can support (and what their experience will be like!). What if you could do all this without writing a single line of code or spinning up your own testing infrastructure? We're ready to start beta testing, so send an email to jack@kernl.us if you would like to be a part of it.
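As mentioned in the first bullet, a cache backend swap like this mostly comes down to a thin get/set wrapper around the new client. The sketch below uses the ioredis client; the key names, TTL, and function names are assumptions for illustration, not Kernl's actual code.

```typescript
// Illustrative cache wrapper using ioredis. Key names, TTL values, and
// function names are made up for this sketch; this is not Kernl's code.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

async function getCachedPlugin(slug: string): Promise<object | null> {
  const hit = await redis.get(`plugin:${slug}`);
  return hit ? JSON.parse(hit) : null;
}

async function cachePlugin(slug: string, data: object, ttlSeconds = 60): Promise<void> {
  // "EX" sets an expiry in seconds so stale entries fall out on their own.
  await redis.set(`plugin:${slug}`, JSON.stringify(data), "EX", ttlSeconds);
}
```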

Introducing The Kernl Analytics Agency Plan

Today we launched the next iteration of Kernl Analytics. The agency plan has been long in the making and we hope that you enjoy the new insights that you can extract with it.

Features

The Kernl Analytics agency plan is very similar to the “small” plan with two key differences:

  • Increased Data Retention – When the agency plan is selected, Kernl will hold on to your analytics data for 90 days (instead of the small plan’s single day). This also means that you can select a day in the past and see your analytics for it.
  • Compare Dates – With the agency plan you can select two dates and compare their data against each other. This is extremely useful if you would like to see adoption curves for WordPress versions, PHP versions, and installed versions of your plugin/theme. It allows you to make smart business decisions based on real data.

The Kernl Analytics agency plan is now available to all Kernl customers. The fee is $30/month on top of your existing Kernl plan. Reach out to jack@kernl.us if you have any questions!

What’s New With Kernl – September 2018

It was a great month for Kernl! We didn't do much in the way of user-facing features, but we did accomplish a lot of great infrastructure work.

Features, Bugs, & Infrastructure

  • Analytics Domain Search Speed Improvements – Prior to this work it could take up to 5 seconds to filter through the list of domains in Kernl Analytics. You can now search with sub-second response times due to a well-placed index in Postgres (a sketch of the kind of index involved appears after this list).
  • Daily Aggregates Cleanup – This task encompassed cleaning up the millions of rows of aggregate data that Kernl was hanging on to. We weren’t in jeopardy of running out of space, but a table of 300K rows is faster to query than a table of 20M rows.
  • Analytics Domain List Clickable URLs – A customer suggested that the urls in the domain list should be clickable, so now they are!
  • Marketing Site Scaled Images – Kernl now serves properly scaled images for the marketing site.
  • GZIP Compression – All /static routes now serve resources using GZIP compression.
  • MongoDB Upgrade – We're now on Mongo 3.4 and the database is hosted in AWS via Compose.io.
  • Node.js Upgrade – Kernl now runs on Node.js 8.11.4. This upgrade addresses some security issues and bug fixes.
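For the curious, the domain search speed-up mentioned above comes down to a single well-chosen index plus a prefix query that can use it. The sketch below shows the general shape; the table, column, and index names are assumptions, not Kernl's actual schema.

```typescript
// Sketch of the kind of index that makes a domain prefix search fast.
// Table, column, and index names are assumptions, not Kernl's schema.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function addDomainSearchIndex(): Promise<void> {
  // text_pattern_ops lets Postgres use the index for LIKE 'prefix%' queries.
  await pool.query(
    "CREATE INDEX IF NOT EXISTS idx_domains_name ON domains (name text_pattern_ops)"
  );
}

async function searchDomains(prefix: string): Promise<string[]> {
  const result = await pool.query(
    "SELECT name FROM domains WHERE name LIKE $1 || '%' ORDER BY name LIMIT 50",
    [prefix]
  );
  return result.rows.map((row) => row.name);
}
```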

August 14 2018 Outage Post-Mortem

Below is a post-mortem analysis of Kernl’s August 14 2018 outage. It details what happened, why it happened, and how we can improve things so that this doesn’t happen again in the future.

What?

At 9:29PM EDT alerts were triggered saying that Kernl was down. Initial investigation showed that the marketing site was still up and that some (but not all) update requests were still going through. Upon further investigation, it was found that Kernl’s connection to MongoDB had stopped working.

In most situations the DB connection dropping would have caused only a momentary blip while we failed over to our secondary, but in this case that wasn’t possible.

Why?

Further investigation revealed that our Mongo provider (Compose.io) was experiencing an outage in some of their Digital Ocean environments. Unfortunately this outage affected not only our primary Mongo host but also our backup secondary host. Due to the nature of the outage, automatic failover for Mongo wasn't a possibility.

Actions Taken

  • Determined the source of the downtime.
  • Contacted Compose.io support to resolve the issue.
  • After not hearing back from support for 30 minutes, work was started on an alternate plan for bringing Kernl back up.
  • After 45 minutes of no response from Compose.io support, the alternate plan was enacted.
    • The alternate plan was to restore Kernl's Mongo cluster into a different data center using the daily backup. The only downside was that the backup was 6 hours old, which meant there was a possibility that customers would need to re-build or re-upload some plugin and theme versions.
  • Cache lifetime for all Kernl endpoints was doubled. This was done because the new data center was outside of Digital Ocean NYC3. The increased cache lifetime helps combat the increased latency.
  • Kernl’s downtime ended at roughly 11:00PM. The Compose.io incident wasn’t resolved until several hours after this so we feel that the decision to restore from a backup was the right one.

What’s Next?

Compose.io has been our Mongo provider for years now and we've never experienced any significant downtime. That being said, they don't actually support Digital Ocean anymore and plan to kill their support for it at some point in the future.

Our next steps are to evaluate how Kernl performs with the database in another data center. If things look good, we will likely move Kernl's Mongo instances to the new data center permanently, where they can be better supported by Compose. It is our suspicion that if we had been in one of their more popular data centers we would have received help faster.

Once again, apologies for the downtime and we’ll continue to work hard so that it doesn’t happen again!

Building & Scaling Kernl Analytics

Over the past 3 years I’ve often received requests from new and existing Kernl customers for some form of analytics on their plugin/theme. I avoided doing this for a long time because I wasn’t sure that I could do so economically at the scale Kernl operates at, but I eventually decided to give Kernl Analytics a whirl and see where things ended up.

kernl analytics product versions
Product Versions Graph

Concerns

After deciding to give the analytics offering a try, I had to figure out how to build it. When I first set out to build Kernl Analytics I had 3 main concerns:

  • Cost – I’ve never created a web service from scratch that needs to INSERT data at 75 rows per second with peaks of up to 500 rows per second. I wanted to be sure that running this service wouldn’t be prohibitively expensive.
  • Scale – How much would I need to distribute the load? This is tightly coupled to cost.
  • Speed – This project is going to generate a LOT of data by my standards. Can I query it in a performant manner?

As development progressed I realized that cost and scale were non-issues. The database that I chose to use (PostgreSQL) can easily withstand this sort of traffic with no tweaking, and I was able to get things started on a $5 Digital Ocean droplet.

Kernl Analytics Architecture & Technology

Kernl Analytics was created to be its own microservice with no public access to the world. All access to it is behind a firewall so that only Kernl's Node.js servers can send requests to it. For data storage, PostgreSQL was chosen for a few reasons:

  1. Open Source
  2. The data is highly relational
  3. Performance

The application that captures the data, queries it, and runs periodic tasks is a Node.js application written in TypeScript. I chose TypeScript mostly because I’m familiar with it and wanted type safety so I wouldn’t need to write as many tests.
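To give a feel for how small that service can be, here is a stripped-down sketch of the kind of capture endpoint described above. The route, table, and column names are placeholders chosen for illustration, not Kernl's actual schema.

```typescript
// Stripped-down sketch of an analytics capture endpoint in TypeScript.
// The route, table, and column names are placeholders, not Kernl's schema.
import express from "express";
import { Pool } from "pg";

const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

app.use(express.json());

// Only reachable from Kernl's own servers; the service sits behind a firewall.
app.post("/events", async (req, res) => {
  const { productId, domain, productVersion, wpVersion, phpVersion } = req.body;
  await pool.query(
    `INSERT INTO analytics_events
       (product_id, domain, product_version, wp_version, php_version, created_at)
     VALUES ($1, $2, $3, $4, $5, NOW())`,
    [productId, domain, productVersion, wpVersion, phpVersion]
  );
  res.sendStatus(204);
});

app.listen(3001);
```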

kernl analytics and typescript
TypeScript FTW!

With regard to the size of the instance that Kernl Analytics is running on, I currently pay $15/month for a 3-core Digital Ocean droplet. I upgraded to 3 cores so that Postgres could easily handle both writes and multiple read requests at the same time. So far this setup has worked out well!

Pain Points

Overall things went well while implementing Kernl Analytics. In fact they went far better than expected. But that doesn’t mean there weren’t a few pain points along the way.

  • Write Volume – Kernl's scale is just large enough to cause some scaling and performance pains when creating an analytics service. Kernl averages 25 req/s, which translates to roughly 75 INSERTs per second into Postgres. Kernl also has peaks of 150 req/s, which scales up to about 450 INSERTs per second. Postgres can easily handle this sort of load, but doing it on a $5 Digital Ocean droplet was taxing, to say the least.
  • Hardware Upgrade – I tried to keep costs down as much as possible with Kernl Analytics, but in the end I had to increase the size of the droplet I was using to a $15 / 3-core droplet. I ended up doing that so one or two cores could be dedicated to writes while leaving a single core available for read requests. Postgres determines what actions are executed where, but adding more cores has led to a lot less resource contention.
  • Aggregation – Initially the data wasn't aggregated at all. This caused some pain because even with some indexing, plucking data out of a table with > 2.5 million rows can be sort of slow. It also didn't help that I was writing data constantly to the table, which further slowed things down. Recently I solved this by doing daily aggregations for Kernl Analytics charts and domain data (a sketch of this kind of rollup appears below the server load chart). This has improved speed significantly.
  • Backups & High Availability – To keep costs down, the analytics service is not highly available. This is definitely one of those "take out some tech debt" items that will need to be addressed at a later date. Backups also happen only on a daily basis, so it's possible to lose a day of data if something serious goes wrong.

kernl analytics server load
Yay for affordable hosting
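To make the aggregation fix concrete, a nightly rollup can be as simple as one INSERT ... SELECT per day, so the charts read a few hundred aggregate rows instead of millions of raw events. The sketch below shows the idea; the table and column names are assumptions, not Kernl's actual schema.

```typescript
// Hedged sketch of a nightly rollup job; table and column names are
// assumptions, not Kernl's actual schema.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Collapse one day's raw events into one row per product/version, so charts
// query the small aggregate table instead of the large raw events table.
async function aggregateDaily(day: string): Promise<void> {
  await pool.query(
    `INSERT INTO daily_product_versions (day, product_id, product_version, event_count)
     SELECT $1::date, product_id, product_version, COUNT(*)
     FROM analytics_events
     WHERE created_at >= $1::date
       AND created_at < $1::date + INTERVAL '1 day'
     GROUP BY product_id, product_version`,
    [day]
  );
}

// Example: aggregateDaily("2018-08-01");
```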

Future Plans

Kernl Analytics is a work in progress and there is always room to improve. Future plans for the architecture side of analytics are:

  • Optimize Indexes – I feel that more speed can be coaxed out of Postgres with some better indexing strategies.
  • Writes -vs- Reads – Once I gain a highly available setup for Postgres I plan to split responsibilities for writing and reading. Writes will go to the primary and reads will go to the secondary.
  • API – Right now the analytics API is completely private and firewalled off. Eventually I’d like to expose it to customers so that they can use it to do neat things.

What’s New With Kernl – August 2018

Happy (almost) August everyone! This month with Kernl was focused on fixing some technical debt and adding a few features surrounding analytics.

Features & Bug Fixes

  • Analytics Top Level Menu – Kernl Analytics now has a top-level menu item. Prior to this change you had to enter a plugin/theme page before you could access it.
  • Analytics Product & Date Selector – Coupled with the analytics top-level menu, you can now select which product you want to see analytics for directly in the page. You can also select the date if you have the “agency” plan or above.
  • Session Store Moved to Memcached – For most of Kernl’s life sessions have been stored in Mongo. Recently we moved to storing sessions in Memcached.
  • GitLab Integration Bug Fix – The GitLab integration was broken for a few days after GitLab disabled their v3 API. This has been resolved.
  • The .kernlignore file had a few bugs related to processing it. These have been resolved.
  • Along with the session storage change we cleaned up a few collections in Mongo.
  • Node.js was upgraded to 8.11.3.

What’s New With Kernl – July 2018

I hope everyone in the northern hemisphere is enjoying their summer and for those of you in the southern hemisphere, stay warm! It was a nice month for Kernl with lots of good structural changes and a few new features rolled out.

  • SendOwl Integration – If you use SendOwl to distribute your plugin or theme you can now validate license keys with Kernl. This means that every time a customer checks to see if an update is available Kernl will first validate their SendOwl license.
  • Analytics Aggregate Data – Kernl Analytics now uses aggregate data to populate charts. This means that charts load instantly versus taking a few seconds as they did before. This was a big change and enables us to do neat things in the future like calculating changes over time.
  • Analytics Domains – In addition to using aggregate data to populate charts Kernl Analytics now has improved domain list support. Data is properly paginated, populated via aggregates for speed, and searchable.
  • Version Number Improvements – Kernl now supports version numbers such as 10.2.2-alpha or 9.2.1-beta. Previously the alpha|beta tags at the end were not supported.
  • Download Graph Bug Fix – A customer reported that the download chart in the plugin/theme detail pages wasn't quite right. This bug has been fixed.
  • License Management – The license management page was occasionally showing duplicates. This bug has been fixed.

That's it for this month!

What’s New With Kernl – June 2018

It's been a great month for Kernl! Lots of new features, some bug fixes, and a few updates to the license checking on the plugin and theme update checker files. Let's dive in!

Features

  • Gumroad License Validation – You can now use Kernl to validate your Gumroad licenses! This can be enabled for plugins or themes by going to the product edit screen, clicking the “License Management” tab, and then selecting “Validate Gumroad License?” at the bottom.
  • Kernl Referral Program – Kernl has a referral program. For every 3 referrals you send us we’ll give you a free month. Customers signing up with your referral code get their first 3 months free.
  • Restrict Updates to a Maximum Version – If you use Kernl's license management you can now restrict update availability to a maximum version. For example, if the current version of your product was 1.5.0 and you gave the user a license for < 1.6.0, they would receive updates all the way through 1.5.X (a sketch of this kind of check follows this list). This is a great way to drive more sales of your product!
  • Plugin/Theme PHP Update Check Files – The license error display behavior of these files has been greatly improved. The error dismisses when it’s supposed to, only shows up on the updates and plugin|theme page, and the license error message can now be customized. It is highly recommended that you update. The file is also versioned now so knowing when to update in the future will be much easier.
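As referenced in the "Restrict Updates to a Maximum Version" item above, a check like this can be expressed in a few lines with the semver package. The sketch below is just the idea, not Kernl's actual implementation.

```typescript
// Sketch of a "maximum version" check using the semver package.
// This illustrates the idea only; it is not Kernl's implementation.
import semver from "semver";

function updateAllowed(candidateVersion: string, maxVersion: string): boolean {
  // A license capped at "< 1.6.0" receives every 1.5.x release,
  // but never 1.6.0 or later.
  return semver.lt(candidateVersion, maxVersion);
}

updateAllowed("1.5.9", "1.6.0"); // true  – update offered
updateAllowed("1.6.0", "1.6.0"); // false – update withheld
```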

Other

  • Purchase Code Deprecation – Kernl's old purchase code frontend interface has been hidden behind a feature flag. The goal is to have the old purchase code functionality completely removed by the end of July.
  • Copy Versions from Product to Product – To support special development styles, you can move versions of a product to another product. This is inherently dangerous and only toggled on for the person who requested it. If you think this might be useful please reach out to jack@kernl.us.
  • You can now easily click to a customer’s page from the License Management page.
  • A bug was fixed where the customer filter would stay set even after you had navigated away from that page.
  • Numerous copy changes were made on the License Management page and on the marketing site.
  • Some feature flags were removed from features that have proved to be stable.
  • The main route for plugin update checks (also the highest traffic route on Kernl) was refactored to use async/await instead of promise chains. This makes it much easier to maintain and improve (a generic before/after sketch follows).
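To show what that kind of refactor looks like in general, here is a generic before/after. The function names and data shapes are illustrative, not Kernl's actual route code.

```typescript
// Generic before/after for a promise-chain to async/await refactor.
// Names and data shapes are illustrative, not Kernl's actual code.
interface Plugin { id: string; }

declare function findPlugin(pluginId: string): Promise<Plugin | null>;
declare function findLatestVersion(pluginId: string): Promise<string>;

// Before: nested .then() chain.
function checkForUpdateOld(pluginId: string): Promise<string | null> {
  return findPlugin(pluginId).then((plugin) => {
    if (!plugin) return null;
    return findLatestVersion(plugin.id);
  });
}

// After: the same logic, flattened with async/await.
async function checkForUpdate(pluginId: string): Promise<string | null> {
  const plugin = await findPlugin(pluginId);
  if (!plugin) return null;
  return findLatestVersion(plugin.id);
}
```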