
Making Coggle Even Faster


Today we’ve got another update from the tech behind Coggle: how we cut the average response time by over 40% with some fairly simple changes, and learned a lesson in checking default configurations.

First, a bit of architecture. Coggle is divided into several separate services behind the scenes, with each service responsible for different things. One service is responsible for storing and accessing documents, another for sending email, one for generating downloads, and so on.

These services talk to each other internally with HTTP requests: so for each request from your browser for a page there will be several more requests between these services before a response is sent back to your browser.

This all adds up to quite a lot of HTTP requests - and many of these Coggle services call out to further services hosted by AWS, using (yep you guessed it!) even more HTTP requests.

So, in all, an awful lot of HTTP requests are going on.

Coggle is written in node.js, and originally we just used the default settings of the node request module, and the AWS SDK for node for most of these requests. (At this point there are better options than the request module - we’d recommend undici for new development - but there isn’t a practical alternative to the AWS SDK.)

Why does this matter? Well, it turns out both of these default configurations are absolutely not tuned for high-throughput applications…

The Investigation Begins

A few weeks ago I came across this interesting interactive debugging puzzle by @b0rk - now, no spoilers here (go try it for yourself!), but when I finally got to the solution it did make me immediately wonder if the same issue was present in Coggle - as for a long time our average response time for requests has been about 60ms:

graph showing 60ms response time over several months

It didn’t take long to confirm that the problem in the puzzle was not occurring for us, but this made me wonder why exactly our average response-time graph was so consistently high - was there room for any improvement? Are all those requests between the different services slowing things down?

What About the Database?

The first obvious place to check is the database. While the vast majority of queries are very fast, we have some occasionally slower ones. Could fast queries be piling up in a queue behind these slow ones? Tweaking the connection pool size options of the mongodb driver showed a small improvement, and this is definitely a default configuration that you should tune to your application rather than leaving as-is (note that maxPoolSize, not poolSize, is the option to use for unified topology connections).
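As a sketch of what that tuning looks like (the URI and pool sizes here are illustrative placeholders, not Coggle's values, and this assumes the official mongodb driver v4+ where the unified topology is the default):

```javascript
const { MongoClient } = require('mongodb');

// Pool sizes should be tuned to your own workload; these are examples only.
const client = new MongoClient('mongodb://localhost:27017', {
    maxPoolSize: 100, // note: maxPoolSize, not the legacy poolSize option
    minPoolSize: 10   // keep some connections warm for bursts
});
```

With too small a pool, fast queries wait behind slow ones for a free connection; with too large a pool, the database server itself can become the bottleneck.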

No dramatic improvements here though.

All Those Internal Requests…

Like the mongodb driver, nodejs itself also maintains a global connection pool (in this case an http.Agent) for outgoing connections. If you search for information about this connection pool you will find lots of articles saying that it's limited to 5 concurrent connections. Aha! This could easily be causing requests to back up.

Inter-service requests are generally slower than database requests, and just five slow requests could cause others to start piling up behind them!

Fortunately, all those articles are very out of date. The global nodejs connection pool has been unlimited in size since nodejs 0.12 in 2015. But this line of investigation does lead directly to the true culprit.

The global http Agent which our internal requests were using is constructed using default options. And a careful reading of the http agent documentation shows that the keepAlive option is false by default.

This means, simply, that after a request is complete nodejs will close the connection to the remote server, instead of keeping the connection in case another request is made to the same server within a short time period.

In Coggle, where we have a small number of component services making a large number of requests to each other, it should almost always be possible to re-use connections for additional requests. Instead, with the default configuration, a new connection was being created for every single request!

A Solution!

It is not possible to change the global default, so to make the request module use an http agent with keepAlive enabled, an agent must be created and passed in the options of each request. Separate agents are needed for http and https, but we want to re-use the same agents across requests, so a simple helper function creates or retrieves them:



const http = require('http');
const https = require('https');

const shared_agents = {'http:':null, 'https:':null};
const getAgentFor = (protocol) => {
    if(!shared_agents[protocol]){
        if(protocol === 'http:'){
            shared_agents[protocol] = new http.Agent({
                keepAlive: true
            });
        }else if(protocol === 'https:'){
            shared_agents[protocol] = new https.Agent({
                keepAlive: true,
                rejectUnauthorized: true
            });
        }else{
            throw new Error(`unsupported request protocol ${protocol}`);
        }
    }
    return shared_agents[protocol];
};

And then when making requests, simply set the agent option:


const request = require('request');

args.agent = getAgentFor(new URL(args.url).protocol);
request(args, callback);

For Coggle, this simple change had a dramatic effect not only on the latency of internal requests (requests are much faster when a new connection doesn't have to be negotiated), but also on CPU use. For one service, CPU use dropped by 70%!

graph showing dramatic reduction in CPU use

The AWS SDK

As with the request module, the AWS SDK for nodejs will also use the default http Agent options for its own connections - meaning again that a new connection is established for each request!

To change this, httpOptions.agent can be set on the constructor for individual AWS services, for example with S3:

const AWS = require('aws-sdk');
const https = require('https');

const s3 = new AWS.S3({
    httpOptions: {
        agent: new https.Agent({keepAlive: true, rejectUnauthorized: true})
    }
});
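If you construct many service clients, the same agent can instead be applied once through the SDK's global configuration. A sketch, assuming the v2 aws-sdk package (the same one the constructor example uses):

```javascript
const AWS = require('aws-sdk');
const https = require('https');

// One keep-alive agent shared by every AWS client created after this call:
AWS.config.update({
    httpOptions: {
        agent: new https.Agent({ keepAlive: true })
    }
});
```

Recent versions of the v2 SDK also read the AWS_NODEJS_CONNECTION_REUSE_ENABLED environment variable, which enables connection re-use without any code changes.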

Setting keepAlive when requests are not made frequently enough brings no performance benefit. Instead there is a slight memory and CPU cost of maintaining connections, only for them to be closed by the remote server without ever being re-used.

So how frequent do requests need to be for keepAlive to show a benefit? In other words, how long will remote servers keep the connection open?

When keepAlive Makes Sense

The default for nodejs servers is five seconds, and helpfully the Keep-Alive: timeout=5 header is set on responses to indicate this. For AWS things aren’t so clear.

While the documentation mentions enabling keepAlive in nodejs clients, it doesn’t say how long the server will keep the connection open, and so how frequent requests need to be in order to re-use it.

Some experimentation with S3 in the eu-west-1 region showed a time of about 4 seconds, though it seems possible this could vary with traffic, region, and across services.

But as a rough guide, if you’re likely to make more than one request every four seconds then there’s some gain to enabling keepAlive, and from there on, as request rates increase, the benefit only grows.

Combined Effect

For Coggle, the combined effect of keepAlive for internal and AWS requests was a reduction from about 60ms to about 40ms in the median response time, which is quite amazing for such simple changes!

In the end, this is also a cautionary tale about making sure default configurations are appropriate, especially as patterns of use change over time. Sometimes there can be dramatic gains from just making sure basic things are configured correctly.

I’ll leave you with the lovely graph of how much faster the average request to Coggle is since these changes:

graph showing reduction in response time from 60ms to 40ms

I hope this was an interesting read! As always if you have any questions or comments feel free to email hello@coggle.it


Posted by James, June 16th 2021.

