Scaling a Cron: Checking Hundreds of Sites Every Minute

In my last article I described how I built a cloud app using Laravel Spark. Web Uptime is a website uptime and monitoring app that checks your website’s status every minute from multiple locations around the world.

Something that excited me about building an app like this was the technical challenge of checking the status of hundreds, if not thousands, of websites every minute. In this article I’m going to describe the thought process that led me to the architecture I thought I would need, and then look at the solution I settled on.

Where to Start

The process of checking the status of a single website on a regular basis is easy enough. You set up a cron job that runs every minute and triggers a GET request that returns the HTTP status of the website (if the GET request fails or there is an error status then the website is down!). However, when you start to think about how to achieve this same process for hundreds or thousands of sites every minute, things get much more complex.
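As a minimal sketch of that single-site check, assuming the standard PHP cURL extension: the function names isUp and checkSite are hypothetical, and treating any 2xx or 3xx status as "up" is my assumption rather than Web Uptime's actual rule.

```php
<?php
// Hypothetical helpers for illustration only; not Web Uptime's real code.

/**
 * Decide whether a check counts as "up".
 * $curlError is a cURL error number (0 means the transfer itself succeeded);
 * $httpCode is the HTTP status code of the response.
 */
function isUp(int $curlError, int $httpCode): bool
{
    // Assumption: any 2xx or 3xx status counts as up; 4xx/5xx count as down.
    return $curlError === 0 && $httpCode >= 200 && $httpCode < 400;
}

/** Run one blocking GET request against $url and report the outcome. */
function checkSite(string $url, int $timeoutSeconds = 20): array
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // don't echo the body
    curl_setopt($curl, CURLOPT_TIMEOUT, $timeoutSeconds);
    curl_exec($curl);

    $curlError = curl_errno($curl);
    $httpCode  = (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);

    return [
        'url'       => $url,
        'http_code' => $httpCode,
        'up'        => isUp($curlError, $httpCode),
    ];
}
```

A script that calls checkSite() for one URL could then be run from a crontab entry using the `* * * * *` schedule, which fires once a minute.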

Web Uptime is built using Laravel (PHP); however, PHP is not designed for concurrency. Unlike Python or Node, for example, a standard PHP install has no built-in threading, and by default a script works through one blocking request at a time. This means you can’t just fire off a separate check for each website every minute from a single script and collect the results. There are projects that enable threading in PHP, such as the pthreads PECL extension, but I didn’t want something that would increase my server maintenance complexity, so I decided this was not a viable option. So how else could I do this?

Create a separate cron for each site
Use something like AWS Lambda to handle checks
Use a job cluster framework like Gearman
Use a queue with a cluster of job servers

In my initial planning I decided that creating a separate cron for each site wasn’t going to be easy to manage or scale (and using a single cron for multiple checks was going to be too slow without concurrency), so it was a no-go. AWS Lambda doesn’t support PHP as a language (at the time of writing) and I didn’t want to maintain two different codebases in different languages, so Lambda was out. Gearman looked quite promising but ended up requiring a PECL extension to work, so I turned it down for the same reason as the pthreads PECL extension. Also, I didn’t find the documentation particularly clear for getting set up.

I was left with the final option of using a queue and a cluster of job servers. Despite this sounding quite complex, it was actually relatively easy to implement. Laravel has built-in queue handling so I could run a cron every minute to add hundreds of jobs to a queue in each region and have a small cluster of job servers in each region to pick the jobs off the queue and run them and collect the results. When the queues were getting too full I could simply spin up some new job servers and add them to the cluster to handle the extra load. I figured I could use AWS Simple Queue Service (SQS) and maybe even use AWS Auto Scaling to spin up new servers and add them to the cluster for me.

Tried and Failed

After setting up the queue + cluster I was confident I was on the right track and happily launched Web Uptime using this setup. Of course, my naivety came back to bite me pretty quickly: I soon started getting alerts that my queues were filling up and more job servers were required. What I had failed to realize is that these job servers could still only run one check at a time per queue listener (because of PHP’s lack of concurrency). Checking hundreds of sites every minute would therefore take a fairly large cluster of job servers in each region, even if I ran multiple queue listeners on each server. (“Load” here is not computational power but the time taken to perform each check.)

Given that each check could take between 1 and 10 seconds (depending on the load time of the site being checked), it dawned on me that I was going to need lots of servers, which was going to be expensive. For a small side project like Web Uptime it was going to be too expensive. Time to go back to the drawing board.
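To put rough numbers on that, here is a back-of-the-envelope sketch. The helper listenersNeeded is hypothetical, and the figures of 500 sites and 1, 5 and 10 seconds per check are assumptions chosen for illustration, not measurements from Web Uptime.

```php
<?php
/**
 * Estimate how many sequential queue listeners are needed to finish every
 * check inside one window. Hypothetical helper, for illustration only.
 */
function listenersNeeded(int $sites, float $avgSecondsPerCheck, int $windowSeconds = 60): int
{
    if ($avgSecondsPerCheck <= 0 || $avgSecondsPerCheck > $windowSeconds) {
        throw new InvalidArgumentException('Average check time must fit inside the window.');
    }

    // A listener runs checks one after another, so it completes
    // floor(window / average duration) checks per window.
    $checksPerListener = (int) floor($windowSeconds / $avgSecondsPerCheck);

    return (int) ceil($sites / $checksPerListener);
}

// Assumed workload: 500 sites checked once per minute.
foreach ([1.0, 5.0, 10.0] as $avg) {
    printf("%4.1fs per check: %d listeners\n", $avg, listenersNeeded(500, $avg));
}
```

Under these assumptions the estimate comes out at 9 listeners for 1-second checks, 42 for 5-second checks and 84 for 10-second checks, in every region, which is where the cost starts to hurt.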

The Solution: curl_multi

After doing some more research, and realising I needed a concurrent solution that was going to be affordable, I stumbled across something in PHP that I had never seen before. PHP’s cURL implementation has a set of functions (curl_multi_*) that allow you to make multiple requests simultaneously. You can make hundreds of requests at the same time (within reason) and the whole batch takes only about as long as the slowest request takes to return a result.

This sounded ideal so I quickly set up a test to see if it was as good as it sounded, and sure enough, I could now send a hundred GET requests in as little as 7 seconds. This would mean I could use a single server in each region (affordable) and safely run hundreds of checks every minute for the foreseeable future. This is roughly how my implementation ended up looking:

$mh        = curl_multi_init();
$handles   = [];
$errors    = [];

foreach ($monitors as $monitor) {
    try {
        $url  = $monitor->site->url;
        $port = $monitor->port;

        if ($port == 80 && starts_with($url, 'https')) {
            $port = 443;
        }

        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);
        curl_setopt($curl, CURLOPT_PORT, $port);
        curl_setopt($curl, CURLOPT_TIMEOUT, 20);
        curl_setopt($curl, CURLOPT_HEADER, 1);
        // CURLOPT_NOBODY skips the response body, so cURL sends a HEAD request.
        curl_setopt($curl, CURLOPT_NOBODY, 1);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curl, CURLOPT_USERAGENT, 'webuptime.io');
        curl_setopt($curl, CURLOPT_PRIVATE, $monitor->id);

        curl_multi_add_handle($mh, $curl);
        $handles[] = $curl;
    } catch (\Exception $e) {
        \Log::error($e->getMessage());
    }
}

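$running = null;
do {
    $status = curl_multi_exec($mh, $running);

    // Drain every completed-transfer message, not just one per pass, so no
    // cURL error code is lost when several requests finish at once.
    while (($info = curl_multi_info_read($mh)) !== false) {
        // Key by monitor ID (stored via CURLOPT_PRIVATE) instead of casting
        // the handle to int, which only works while handles are resources.
        $monitorId = curl_getinfo($info['handle'], CURLINFO_PRIVATE);
        $errors[$monitorId] = $info['result'];
    }

    // Wait for socket activity; fall back to a short sleep if select fails.
    if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
        usleep(100000);
    }
} while ($running > 0 && $status === CURLM_OK);

$monitorResults = [];
foreach ($handles as $handle) {
    $monitorId    = curl_getinfo($handle, CURLINFO_PRIVATE);
    // cURL error code for the transfer; 0 (CURLE_OK) means no transport error.
    $statusCode   = isset($errors[$monitorId]) ? $errors[$monitorId] : 0;
    $httpCode     = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    $responseTime = curl_getinfo($handle, CURLINFO_TOTAL_TIME);

    // Save results...

    curl_multi_remove_handle($mh, $handle);
}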

curl_multi_close($mh);

As you can see, I set up a bunch of curl instances (as you would normally), but instead of executing them in order I add them to the multi handle using curl_multi_add_handle and store them in the $handles array. A do-while loop then calls curl_multi_exec to run the curl instances simultaneously. Once the loop is finished I extract the information I need using the $handles array, and finally remove the handles and close the multi handle.

Using a quick and dirty test script I found that using regular curl_exec to check 95 sites takes ~60 seconds (on my machine). Using curl_multi_exec to check the same 95 sites took ~10 seconds. With curl_exec the total time grows with every site you add, because each request waits for the previous one to finish. With curl_multi_exec the total is governed mostly by the slowest request, which is why it’s great for scaling a cron like this. Also notice that at that rate regular curl_exec requests would need roughly 1 server for every 100 sites (to make sure the checks complete within 60 seconds), whereas with concurrent curl_multi_exec requests I could potentially run thousands of requests before needing another server. That’s a big cost saving for a small startup.
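The difference can be sketched with an idealised timing model. This is not a benchmark: the function names are hypothetical, the model ignores DNS, CPU and bandwidth limits, and capping the number of in-flight requests per batch is an assumption about what "within reason" might look like in practice.

```php
<?php
/**
 * Split a list into batches so that no more than $maxConcurrent requests
 * are in flight at once. Hypothetical helper for illustration.
 */
function batchMonitors(array $monitors, int $maxConcurrent): array
{
    return array_chunk($monitors, max(1, $maxConcurrent));
}

/** Sequential model: every request waits for the previous one, so times add up. */
function sequentialSeconds(array $durations): float
{
    return (float) array_sum($durations);
}

/**
 * Concurrent model: a batch takes about as long as its slowest request,
 * and batches run one after another.
 */
function concurrentSeconds(array $durations, int $maxConcurrent): float
{
    $total = 0.0;
    foreach (batchMonitors($durations, $maxConcurrent) as $batch) {
        $total += max($batch);
    }

    return $total;
}

// Assumed per-site response times in seconds.
$durations = [1, 2, 3, 10, 1, 2];

printf("sequential: %.1fs\n", sequentialSeconds($durations));
printf("concurrent, 3 at a time: %.1fs\n", concurrentSeconds($durations, 3));
printf("concurrent, all at once: %.1fs\n", concurrentSeconds($durations, 6));
```

In this model a single slow site dominates a concurrent batch but only adds its own time to a sequential run, which matches the pattern in the timings above.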

What’s Next?

While this solution is a cheap way to scale a cron job like this for the foreseeable future, it is not a permanent solution. Web Uptime currently runs around 7 million checks every month using this technique and a single server in each location seems to cope fine. However, a server’s resources are not infinite and the time will come when adding more servers, or maybe even a totally different approach, is required to scale these checks. At that time maybe offloading to something like AWS Lambda might make more sense. We’ll see.

Have you ever had to scale a cron? Have you ever used the curl_multi functions? Do you have any tips on scaling? Let us know in the comments.

This entry was tagged Development, Cron, Laravel, SaaS.

Dale says:

August 23, 2016 at 10:36 am

Hi, I am having exactly this problem with the WP RSS Aggregator plugin (I’m just a user, not the developer of the plugin). I would love to persuade the developer to switch to this methodology, as I’m currently checking around 300 feeds and the server is getting overloaded quite often. If the developer doesn’t want to make the change, I wonder if there is any way to build an addon to the plugin to handle the site checks in this manner and still be able to update the plugin?

Brian says:

August 23, 2016 at 10:57 am

Depends on whether the plugin developers added the hooks/filters you can use to add that functionality. Since it’s a completely different approach chances are you’d need the plugin dev to work with you on it. I did a presentation on doing addons to other plugins here if you’re interested in ways to see what hooks/filters a plugin offers and how you can use them: https://brianhogg.com/wordpress-actions-and-filters/

Gilbert Pellegrom says:

August 23, 2016 at 11:14 am

I’ve never used WP RSS Aggregator plugin Dale so I wouldn’t be able to say if you could build a custom addon for this.

Danny says:

August 23, 2016 at 10:57 am

Nice solution 😉

Jeremy Benson says:

August 23, 2016 at 11:01 am

Nice write up! You’re right, curl_multi is a great find for something like this. We recently had to scale a cron job that was responsible for connecting multiple APIs for different SaaS apps we use. I ended up researching system services that would run some PHP like a cron. What I like about it is that it’s written to loop, so if it’s taking a long time to process I don’t have to worry about multiple independent jobs overlapping and trying to do the same thing. It also means the variables in memory are fairly dependable. It reminds me of writing ActionScript, honestly. The only thing is you have to make sure that garbage collection is on point because the service is running non-stop. I like using the service because it has good visibility: I can tell what it’s doing and how much memory it’s using, and control its execution without much hassle. Around the web they don’t recommend services for anything that doesn’t need to run at a shorter interval than every minute, but I think that’s naive. For a job that could potentially take a while to execute, or where there might be a continuous loop of things to monitor (even if the changes are minor or there are no changes for a period), I think a service might be more appropriate than a cron job.

Gilbert Pellegrom says:

August 23, 2016 at 11:21 am

Thanks Jeremy. You make a good point about garbage collection and memory usage.

Adam Norwood says:

August 23, 2016 at 11:50 am

Yes, curl_multi is super-useful! I first discovered it when building a web application that needed to check some 10k+ URLs efficiently and download resources depending on which HTTP status code came back for each — I don’t know how I would have handled this without curl_multi (given the constraint that it needed to be written in PHP). I ended up using this wrapper class for curl_multi, appropriately called MultiCurl: https://github.com/bizonix/MultiCurl/blob/master/MultiCurl.class.php It’s a bit long in the tooth and I can’t seem to find a canonical repo for it (this GitHub repo is a fork), but AFAIK it still works well, and it abstracts away some of the tedious parts of setting things up.

Gilbert Pellegrom says:

August 23, 2016 at 12:59 pm

That looks like a nice wrapper class Adam. Thanks for sharing.

Jason says:

August 29, 2016 at 2:25 pm

I like it! New tool in the toolbox. The only thing I’d consider doing in something like this is analyze the length of time it takes to perform the full task and then have it self-adjust the number of curl requests to perform over each execution based on a running average of the time it took. That way you’re always keeping it optimized as close as possible to as many requests as you can fit in under, say, 60 seconds. That allows each server to also run optimized given its own setup. Just a thought. 🙂

Jon Pearkins says:

March 5, 2017 at 9:02 pm

Thanks for this great test code! Given how many web hosting companies have crippled their CURL capability, I really needed to get my hands on code that actually works before wasting any more time trying to determine if I wrote my Multi CURL code wrong or my web host had issues. As you can guess, it was my code….

Jon Pearkins says:

March 6, 2017 at 6:54 pm

After adapting your code to my application, I discovered that your main loop could be written with a lot less overhead.

$running = null;
do {
    curl_multi_exec($mh, $running);

    $info = curl_multi_info_read($mh);
    if ($info !== false && isset($info['handle']) && isset($info['result'])) {
        $errors[(int) $info['handle']] = $info['result'];
    }

    usleep(100000);
} while ($running > 0);

could be rewritten:

$running = null;
curl_multi_exec( $mh, $running );
sleep( 21 ); // Timeout + 1 second
curl_multi_exec( $mh, $running );
while ( FALSE !== ( $info = curl_multi_info_read( $mh ) ) ) {
    if ( isset( $info['handle'] ) && isset( $info['result'] ) ) {
        $errors[ (int) $info['handle'] ] = $info['result'];
    }
}

That is minus the bells and whistles I used to do some extra error checking, including a counter to prevent infinite loops in the WHILE clause.

Jon Pearkins says:

May 6, 2017 at 9:39 pm

Well, I was half right and half wrong. You can eliminate the usleep() if you don’t care about curl_multi_info_read()['result'], which seem to disappear after some period of time. So far, I’ve got usleep() up to half a second without any loss. This is what I ended up with:

$curl_errors = array();
$running = NULL;
$monitor_time = microtime( TRUE );
$cme_rc = curl_multi_exec( $curlm, $running );
$cme_count = 1;
while ( ( $running > 0 ) && ( CURLM_OK === $cme_rc ) ) {
    if ( -1 === curl_multi_select( $curlm ) ) {
        usleep( 500000 );
    }
    $cme_rc = curl_multi_exec( $curlm, $running );
    ++$cme_count;
    while ( FALSE !== ( $info = curl_multi_info_read( $curlm ) ) ) {
        if ( isset( $info['result'] ) && ( CURLE_OK !== $info['result'] ) && isset( $info['handle'] ) ) {
            $curl_errors[] = array( $info['handle'], $info['result'] );
        }
    }
}
