Scaling a Cron: Checking Hundreds of Sites Every Minute


In my last article I described how I built a cloud app using Laravel Spark. Web Uptime is a website uptime and monitoring app that checks your website’s status every minute from multiple locations around the world.

Something that excited me about building an app like this was the technical challenge of checking the status of hundreds, if not thousands, of websites every minute. In this article I’m going to describe the thought process that led me to the architecture I thought I would need, and then look at the solution I settled on.

Where to Start

The process of checking the status of a single website on a regular basis is easy enough. You set up a cron job that runs every minute and triggers a GET request that returns the HTTP status of the website (if the GET request fails or returns an error status then the website is down!). However, when you start to think about how to run this same process for hundreds or thousands of sites every minute, things get much more complex.
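To make the single-site case concrete, here is a minimal sketch of what such a check might look like. The function names isSiteUp and checkSite are hypothetical, and treating 2xx and 3xx as "up" is my assumption rather than exactly what Web Uptime does.

```php
<?php
// Hypothetical helper: decide "up" or "down" from a cURL error number
// and an HTTP status code. Treating 2xx/3xx as "up" is an assumption.
function isSiteUp(int $curlErrno, int $httpCode): bool
{
    // Any transport failure (DNS, timeout, refused connection) counts as
    // down, as does any 4xx/5xx status.
    return $curlErrno === 0 && $httpCode >= 200 && $httpCode < 400;
}

// Hypothetical single check, meant to be triggered once a minute by cron.
function checkSite(string $url, int $timeoutSeconds = 20): bool
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, $timeoutSeconds);
    curl_exec($curl);

    $errno    = curl_errno($curl);
    $httpCode = (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);

    return isSiteUp($errno, $httpCode);
}
```

Calling checkSite() for one URL blocks until that site responds or times out, which is exactly why doing this one after another for hundreds of sites gets slow.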

Web Uptime is built using Laravel (PHP), but PHP is not a naturally concurrent language. Out of the box it doesn’t give you threads the way Python does or an event loop the way Node does, so you can’t just fire off a separate process for each website check every minute and collect the results. There are projects that enable threading in PHP, such as the pthreads PECL extension, but I didn’t want something that would increase my server maintenance complexity, so I decided this was not a viable option. So how else could I do this?

  • Create a separate cron for each site
  • Use something like AWS Lambda to handle checks
  • Use a job cluster framework like Gearman
  • Use a queue with a cluster of job servers

In my initial planning I decided that creating a separate cron for each site wasn’t going to be easy to manage or scale (and using a single cron for multiple checks was going to be too slow without concurrency), so it was a no-go. AWS Lambda doesn’t support PHP as a language (at the time of writing) and I didn’t want to maintain two different codebases in different languages, so Lambda was out. Gearman looked quite promising but ended up requiring a PECL extension to work, so I turned it down for the same reason as the pthreads PECL extension. Also, I didn’t find the documentation particularly clear for getting set up.

I was left with the final option of using a queue and a cluster of job servers. Despite this sounding quite complex, it was actually relatively easy to implement. Laravel has built-in queue handling, so I could run a cron every minute to add hundreds of jobs to a queue in each region, and have a small cluster of job servers in each region pick the jobs off the queue, run them and collect the results. When the queues were getting too full I could simply spin up some new job servers and add them to the cluster to handle the extra load. I figured I could use AWS Simple Queue Service (SQS) and maybe even use AWS Auto Scaling to spin up new servers and add them to the cluster for me.

Tried and Failed

After setting up the queue and cluster I was confident I was on the right track, and happily launched Web Uptime using this setup. Of course my naivety came back to bite me pretty quickly: I soon started getting alerts that my queues were filling up and more job servers were required. What I had failed to realize is that each queue listener could still only run one check at a time (because of PHP’s lack of concurrency). Even with multiple queue listeners on each server, checking hundreds of sites every minute would take a fairly large cluster of job servers in each region to handle the load (“load” here being not computational power but the time taken to perform each check).

Given that each check could take between 1 and 10 seconds (depending on the load time of the site being checked), it dawned on me that I was going to need lots of servers, which was going to be expensive. For a small side project like Web Uptime it was going to be too expensive. Time to go back to the drawing board.
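A back-of-the-envelope calculation shows how quickly this adds up. The sketch below is only an estimate: listenersNeeded is a hypothetical function, and the 500 sites and 5 second average are illustrative numbers, not measurements from Web Uptime.

```php
<?php
// Hypothetical estimate: how many one-check-at-a-time queue listeners are
// needed to finish every check inside the window (60 seconds by default).
function listenersNeeded(int $sites, float $avgCheckSeconds, int $windowSeconds = 60): int
{
    // Total seconds of checking work divided by the seconds available.
    return (int) ceil($sites * $avgCheckSeconds / $windowSeconds);
}

// Illustrative numbers only: 500 sites averaging 5 seconds per check.
echo listenersNeeded(500, 5.0), "\n";
```

With those assumed numbers you would need 42 listeners per region just to keep up, and the figure grows linearly with every site added.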

The Solution: curl_multi

After doing some more research, and realising I needed a concurrent solution that was going to be affordable, I stumbled across something in PHP that I had never seen before. PHP’s cURL implementation has a set of functions (curl_multi_*) that allow you to make multiple requests simultaneously, meaning you can make hundreds of requests at the same time (within reason) and the whole batch takes roughly as long as the slowest request takes to return a result.

This sounded ideal so I quickly set up a test to see if it was as good as it sounded, and sure enough, I could now send a hundred GET requests in as little as 7 seconds. This would mean I could use a single server in each region (affordable) and safely run hundreds of checks every minute for the foreseeable future. This is roughly how my implementation ended up looking:

$mh      = curl_multi_init();
$handles = [];
$errors  = [];

foreach ($monitors as $monitor) {
    try {
        $url  = $monitor->site->url;
        $port = $monitor->port;

        // A monitor left on the default port 80 for an https URL should use 443.
        if ($port == 80 && starts_with($url, 'https')) {
            $port = 443;
        }

        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);
        curl_setopt($curl, CURLOPT_PORT, $port);
        curl_setopt($curl, CURLOPT_TIMEOUT, 20);
        curl_setopt($curl, CURLOPT_HEADER, 1);
        // CURLOPT_NOBODY skips the response body, so cURL sends a HEAD request.
        curl_setopt($curl, CURLOPT_NOBODY, 1);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curl, CURLOPT_USERAGENT, 'webuptime.io');
        // Tag the handle with the monitor ID so results can be matched up later.
        curl_setopt($curl, CURLOPT_PRIVATE, $monitor->id);

        curl_multi_add_handle($mh, $curl);
        $handles[] = $curl;
    } catch (\Exception $e) {
        \Log::error($e->getMessage());
    }
}

$running = null;
do {
    curl_multi_exec($mh, $running);

    // Several transfers can finish in one pass, so drain the whole message queue.
    // Errors are keyed by monitor ID; casting a handle to int is unreliable on PHP 8.
    while (($info = curl_multi_info_read($mh)) !== false) {
        $errors[curl_getinfo($info['handle'], CURLINFO_PRIVATE)] = $info['result'];
    }

    // Wait for activity on any handle (up to 100ms) rather than spinning.
    if ($running > 0 && curl_multi_select($mh, 0.1) === -1) {
        usleep(100000);
    }
} while ($running > 0);

$monitorResults = [];
foreach ($handles as $handle) {
    $monitorId    = curl_getinfo($handle, CURLINFO_PRIVATE);
    // cURL error code (CURLE_*); 0 means the transfer itself succeeded.
    $statusCode   = isset($errors[$monitorId]) ? $errors[$monitorId] : 0;
    $httpCode     = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    $responseTime = curl_getinfo($handle, CURLINFO_TOTAL_TIME);

    // Save results...

    curl_multi_remove_handle($mh, $handle);
}

curl_multi_close($mh);

As you can see, I set up a bunch of curl instances (as you would normally), but instead of executing them in order I add them to the multi handle using curl_multi_add_handle and store them in an array of $handles. I then use a do-while loop to curl_multi_exec the curl instances simultaneously, collecting any cURL error codes as transfers finish. Once the loop is finished I extract the information I need using the $handles array, and finally remove the handles and close the multi handle.
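If you want to keep the number of simultaneous requests "within reason", one option is to split the monitors into batches and run the loop above once per batch. This is a sketch under my own assumptions, not how Web Uptime does it: runInBatches and the batch size of 100 are hypothetical, and a stub stands in for the real curl_multi call so the example runs without touching the network.

```php
<?php
// Hypothetical wrapper: split the monitors into fixed-size batches and hand
// each batch to a callable that would run one curl_multi pass over it.
function runInBatches(array $monitors, int $batchSize, callable $runBatch): array
{
    $results = [];
    foreach (array_chunk($monitors, $batchSize) as $batch) {
        // In a real checker, $runBatch would wrap the curl_multi loop.
        foreach ($runBatch($batch) as $result) {
            $results[] = $result;
        }
    }
    return $results;
}

// Stub in place of real network checks: marks every monitor ID as up.
$fakeCheck = function (array $batch): array {
    return array_map(function ($id) {
        return ['id' => $id, 'up' => true];
    }, $batch);
};

// 250 made-up monitor IDs, checked 100 at a time.
$results = runInBatches(range(1, 250), 100, $fakeCheck);
echo count($results), "\n";
```

The trade-off is that batches run one after another, so the total time becomes the sum of the slowest request in each batch, and the batch size has to be tuned against the 60 second window.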

Using a quick and dirty test script I found that using regular curl_exec to check 95 sites takes ~60 seconds (on my machine). Using curl_multi_exec to check the same 95 sites took ~10 seconds. With curl_exec the total time grows with every site you add, because each request has to wait for the one before it. With curl_multi_exec the total is governed mostly by the slowest request, which is why it’s great for scaling a cron like this. Also notice that if I was using regular curl_exec requests I would need roughly one server for every 100 sites at that rate (to make sure I could complete the checks within 60 seconds), whereas using concurrent curl_multi_exec requests I could potentially run thousands of requests before needing another server. That’s a big cost saving for a small startup.
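A rough way to picture the difference: sequential checks cost the sum of the response times, while concurrent checks cost roughly the slowest single response. The sketch below is a simplified model with made-up response times. It ignores real-world overhead such as DNS lookups and bandwidth limits, and the function names are hypothetical.

```php
<?php
// Simplified model: one-at-a-time requests take the sum of all response times.
function sequentialSeconds(array $responseTimes): float
{
    return (float) array_sum($responseTimes);
}

// Simplified model: concurrent requests take about as long as the slowest one.
function concurrentSeconds(array $responseTimes): float
{
    return $responseTimes === [] ? 0.0 : (float) max($responseTimes);
}

// Made-up response times in seconds, for illustration only.
$times = [0.5, 1.0, 2.5, 10.0];

printf(
    "sequential: %.1fs, concurrent: %.1fs\n",
    sequentialSeconds($times),
    concurrentSeconds($times)
);
```

Adding another fast site to $times increases the sequential total but leaves the concurrent estimate unchanged, which is the effect I saw in the 95 site test.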

What’s Next?

While this solution is a cheap way to scale a cron job like this for the foreseeable future, it is not a permanent solution. Web Uptime currently runs around 7 million checks every month using this technique and a single server in each location seems to cope fine. However, a server’s resources are not infinite and the time will come when adding more servers, or maybe even a totally different approach, is required to scale these checks. At that time maybe offloading to something like AWS Lambda might make more sense. We’ll see.

Have you ever had to scale a cron? Have you ever used the curl_multi functions? Do you have any tips on scaling? Let us know in the comments.

About the Author

Gilbert Pellegrom

Gilbert loves to build software. From jQuery scripts to WordPress plugins to full-blown SaaS apps, Gilbert has been creating elegant software his whole career. He is probably most famous for creating the Nivo Slider.