Making Coggle Even Faster

Today we’ve got another update from the tech behind Coggle: how we cut the average response time by over 40% with some fairly simple changes, and learned a lesson in checking default configurations.

First, a bit of architecture. Coggle is divided into several separate services behind the scenes, with each service responsible for different things. One service is responsible for storing and accessing documents, another for sending email, one for generating downloads, and so on.

These services talk to each other internally with HTTP requests: so for each request from your browser for a page there will be several more requests between these services before a response is sent back to your browser.

This all adds up to quite a lot of HTTP requests - and many of these Coggle services call out to further services hosted by AWS, using (yep you guessed it!) even more HTTP requests.

So, in all, an awful lot of HTTP requests are going on.

Coggle is written in node.js, and originally we just used the default settings of the node request module, and the AWS SDK for node for most of these requests. (At this point there are better options than the request module - we’d recommend undici for new development - but there isn’t a practical alternative to the AWS SDK.)

Why does this matter? Well, it turns out both of these default configurations are absolutely not tuned for high-throughput applications…

The Investigation Begins

A few weeks ago I came across this interesting interactive debugging puzzle by @b0rk - now, no spoilers here (go try it for yourself!), but when I finally got to the solution it did make me immediately wonder if the same issue was present in Coggle - as for a long time our average response time for requests has been about 60ms:

graph showing 60ms response time over several months

It didn’t take long to confirm that the problem in the puzzle was not occurring for us, but this made me wonder why exactly our average response-time graph was so consistently high - was there room for improvement? Were all those requests between the different services slowing things down?

What About the Database?

The first obvious place to check is the database. While the vast majority of requests are very fast, we have some occasionally slower ones. Could these slow requests be holding up others queued behind them? Tweaking the connection pool size options of the mongodb driver showed a small improvement, and this is definitely a default configuration that you should tune to your application rather than leaving as-is (note that maxPoolSize, not poolSize, is the option to use for unified topology connections).
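For reference, the change itself is small. Here’s a minimal sketch, assuming the 3.x driver with the unified topology - the pool size is just an example value to tune for your own workload:

const { MongoClient } = require('mongodb');

// maxPoolSize (not poolSize) controls the connection pool when using
// the unified topology - the value here is illustrative only:
const client = new MongoClient('mongodb://localhost:27017', {
    useUnifiedTopology: true,
    maxPoolSize: 20
});

client.connect().then(() => {
    // use client.db(...) as normal from here
});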

No dramatic improvements here though.

All Those Internal Requests…

Like the mongodb driver, nodejs itself also maintains a global connection pool (in this case an http.Agent) for outgoing connections. If you search for information about this connection pool you will find lots of articles saying that it’s limited to 5 concurrent connections. Aha! This could easily be causing requests to back up.

Inter-service requests are generally slower than database requests, and just five slow requests could cause others to start piling up behind them!

Fortunately, all those articles are very out of date. The global nodejs connection pool has been unlimited in size since nodejs 0.12 in 2015. But this line of investigation does lead directly to the true culprit.

The global http Agent which our internal requests were using is constructed using default options. And a careful reading of the http agent documentation shows that the keepAlive option is false by default.

This means, simply, that after a request is complete nodejs will close the connection to the remote server, instead of keeping the connection in case another request is made to the same server within a short time period.
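You can check this for yourself - on the node versions we were running at the time (newer releases may change the defaults), the global agent reports keepAlive as disabled:

const http = require('http');

// the global agent used by default for outgoing requests is created
// without keepAlive, so each request opens (and then closes) its own
// connection:
console.log(http.globalAgent.keepAlive); // false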

In Coggle, where we have a small number of component services making a large number of requests to each other, it should almost always be possible to re-use connections for additional requests. Instead, with the default configuration, a new connection was being created for every single request!

A Solution!

It is not possible to change the global default value, so to configure the request module to use an http agent with keepAlive enabled, a new agent must be created and passed in the options to each request. Separate agents are needed for http and https, but we want to make sure to re-use the same agent across multiple requests, so we use a simple helper function to create or retrieve an agent:



const http = require('http');
const https = require('https');

// one shared keep-alive agent per protocol, created lazily:
const shared_agents = {'http:':null, 'https:':null};
const getAgentFor = (protocol) => {
    if(!shared_agents[protocol]){
        if(protocol === 'http:'){
            shared_agents[protocol] = new http.Agent({
                keepAlive: true
            });
        }else if(protocol === 'https:'){
            shared_agents[protocol] = new https.Agent({
                keepAlive: true,
                rejectUnauthorized: true
            });
        }else{
            throw new Error(`unsupported request protocol ${protocol}`);
        }
    }
    return shared_agents[protocol];
};

And then when making requests, simply set the agent option:


args.agent = getAgentFor(new URL(args.url).protocol);
request(args, callback);

For Coggle, this simple change had a dramatic effect not only on the latency of internal requests (requests are much faster when a new connection doesn’t have to be negotiated), but also on CPU use. For one service a reduction of 70%!

graph showing dramatic reduction in CPU use

The AWS SDK

As with the request module, the AWS SDK for nodejs will also use the default http Agent options for its own connections - meaning again that a new connection is established for each request!

To change this, httpOptions.agent can be set on the constructor for individual AWS services, for example with S3:

const AWS = require('aws-sdk');
const https = require('https');
const s3 = new AWS.S3({
    httpOptions:{
        agent: new https.Agent({keepAlive:true, rejectUnauthorized:true})
    }
});
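If you’re constructing a lot of AWS clients, the v2 SDK also allows a default agent to be set once on the global configuration, which is then used by every client created afterwards - roughly:

const AWS = require('aws-sdk');
const https = require('https');

// one shared keep-alive agent for all AWS service clients created
// after this point:
AWS.config.update({
    httpOptions: {
        agent: new https.Agent({keepAlive: true, rejectUnauthorized: true})
    }
});

More recent releases of the v2 SDK also support enabling keep-alive without any code changes via the AWS_NODEJS_CONNECTION_REUSE_ENABLED environment variable.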

Setting keepAlive when requests are not made sufficiently frequently will not have any performance benefit. Instead there will be a slight memory and CPU cost from maintaining connections, only for them to be closed by the remote server without being re-used.

So how often do requests need to be made for keepAlive to show a benefit? In other words, how long will remote servers keep the connection open?

When keepAlive Makes Sense

The default for nodejs servers is five seconds, and helpfully a Keep-Alive: timeout=5 header is set on responses to indicate this.
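If you also control the server side, this timeout is configurable on node http servers - a quick sketch, with an illustrative value:

const http = require('http');

const server = http.createServer((req, res) => {
    res.end('ok');
});

// the default is 5000ms, advertised to clients in the
// `Keep-Alive: timeout=...` response header; the value here is just
// an example:
server.keepAliveTimeout = 10 * 1000;

server.listen(8080);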

For AWS things aren’t so clear. While the documentation mentions enabling keepAlive in nodejs clients, it doesn’t say how long the server will keep the connection open, and therefore how frequent requests need to be in order to re-use it.

Some experimentation with S3 in the eu-west-1 region showed a time of about 4 seconds, though it seems possible this could vary with traffic, region, and across services.

But as a rough guide, if you’re likely to make more than one request every four seconds then there’s some gain to enabling keepAlive, and from there on, as request rates increase, the benefit only grows.

Combined Effect

For Coggle, the combined effect of keepAlive for internal and AWS requests was a reduction in median response time from about 60ms to about 40ms, which is quite amazing for such simple changes!

In the end, this is also a cautionary tale about making sure default configurations are appropriate, especially as patterns of use change over time. Sometimes there can be dramatic gains from just making sure basic things are configured correctly.

I’ll leave you with the lovely graph of how much faster the average request to Coggle is since these changes:

graph showing reduction in response time from 60ms to 40ms

I hope this was an interesting read! As always if you have any questions or comments feel free to email hello@coggle.it


Posted by James, June 16th 2021.


What We’ve Learned from Moving to Signed Cookies

We’ve recently moved Coggle’s login sessions from a database-storage model to signed cookies, where the session data is stored in the session cookie itself.

There aren’t many real-world examples of how to handle this migration, so we’re sharing what we’ve learned doing this with node and express, and hopefully it’ll be a useful and interesting read!

Part 1: How Old Sessions Worked

Previously we handled sessions with the express-session module and connect-mongo data store, and then we used passport to load our actual user data based on the session. Our middleware setup looked like this:

const session = require('express-session');
const MongoStore = require("connect-mongo")(session);
const sessionStore = new MongoStore({ ...  });

// loads req.session from the database store, if the request included a valid session cookie
app.use(session({store: sessionStore, ...}));

// passport middleware loads req.user from our users collection based on the user ID stored in the session
app.use(passport.initialize());
app.use(passport.session());
// csrfMiddleware saves a CSRF token in the session
app.use(csrfMiddleware);

For each request that included a session cookie, the process was basically:

  1. Check the connect-mongo sessions collection in the database to see if the cookie is valid
  2. If it’s valid, load the session data (the user ID and anti-CSRF token) from the sessions collection
  3. Passport middleware loads req.user based on the user ID
  4. Our actual app logic runs
  5. Finally, if the session is updated (for example the cookie expiry is extended), re-save the session to the database. (express-session does this when the response is sent by hooking the response object)

The corresponding data for every single session cookie that hadn’t expired had to be saved in the database. This added up to a lot of session records!

Before the migration, sessions were the biggest cause of writes to our database, a significant source of reads, and the majority of the data we actually stored in our main database (the actual content of Coggle diagrams is stored separately). Our goal in moving to signed cookies was to significantly reduce the resources needed to host all of this.

Part of the reason for the volume of session data is that we have very long-lived session cookies, as we prioritise people being able to easily return to their Coggle diagrams. People forgetting which email address they used to log in and ‘losing’ their diagrams as a result is our biggest source of support requests.

Part 2: Choosing a Signed Cookie Implementation

An alternative to storing sessions in the database is to store the session data in the cookie itself, so that when each page is loaded the session data needed is immediately available in the cookies of the request. This is possible as long as there’s a cryptographic signature on the cookie to stop it from being tampered with: someone can’t change their cookie to log in to someone else’s account, as they have no way to forge the cryptographic signature.

There isn’t a formal standard for signing cookies, but the most common approach is to store a second cookie alongside each cookie to be signed, with a .sig extension to the name. This is the approach used by the cookies npm module, and the cookie-session middleware wraps this module into a convenient middleware which initialises req.session if the session cookie’s signature is valid.
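To make that concrete, here’s roughly how signing works when using the cookies module directly (cookie-session wraps this up for you - the key and values here are purely illustrative):

const http = require('http');
const Cookies = require('cookies');
const Keygrip = require('keygrip');

const keys = new Keygrip(['a-secret-signing-key']);

http.createServer(function(req, res){
  const cookies = new Cookies(req, res, {keys});

  // with {signed: true}, get returns undefined unless a valid
  // `session.sig` cookie accompanies the `session` cookie:
  const existing = cookies.get('session', {signed: true});

  // set writes two cookies: `session` with the value, and
  // `session.sig` containing an HMAC of name=value:
  cookies.set('session', 'some-session-data', {signed: true});

  res.end('existing session value: ' + existing);
}).listen(8080);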

We already use JSON Web Tokens in Coggle for authentication between our back-end services, so we also considered using JWTs as session cookie values. There would be a number of advantages/disadvantages to this:

  • Public-keys could be used for signing, enabling our back-end services to verify signatures without access to the private signing key
  • Cookie values could be easily encrypted, as well as signed, by using the related JWE standard.
  • The additional information that makes JWTs portable (key ID, issuer, and using public-key signatures) also makes them bigger
  • Public-key signatures are significantly more expensive to sign and verify.
  • There are no readily available open source node modules for JWT-based session cookies.

Since we don’t need encryption and would prefer to use symmetric keys, we chose the cookie-session middleware. If you’re considering the same route, then think carefully about whether all of the data stored in your session should be unencrypted.

Part 3: Implementation

Secure Configuration:

The default for cookie-session (inherited from the cookies module), is to use the SHA1-HMAC signing algorithm. SHA1 has some weaknesses, so to be cautious we use SHA256-HMAC instead by passing our own Keygrip instance when creating the session middleware:
const Keygrip = require('keygrip');
const cookieSession = require('cookie-session');

// sign with SHA256-HMAC rather than the default SHA1-HMAC:
const signingKeys = new Keygrip([superSecretKey, ...], 'sha256');

const cookieSessionMiddleware = cookieSession({
  name: 'session-cookie',
  keys: signingKeys,
  maxAge: Session_Duration,
  httpOnly: true,
  sameSite: 'lax',
  signed: true,
  secure: true,
});

Handling CSRF:

We set SameSite=Lax on our session cookies, so it would not normally be possible for code on other sites to send potentially state-changing POST requests with the session cookie. However, in case people are using old browsers which do not support SameSite, or there is a bug in a browser’s implementation, we still also use an anti-CSRF token for state-changing requests.

Previously the CSRF token for each session was stored in the database, and the value sent with each request from the client was compared against this - with signed cookies it’s instead stored in the cookie itself.

As the session cookie is stored as an HttpOnly cookie, it is not possible for a malicious script to read the CSRF token from it, even though it exists on the client.

It might be possible for malicious JavaScript to overwrite the HttpOnly cookie, but in that case the cookie’s signature would be invalid.

This CSRF protection set-up is definitely a compromise, but as Coggle isn’t handling payments, we think it’s reasonable.
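For reference, the check itself is only a few lines - something along these lines (the header name here is an example rather than exactly what Coggle uses; req.session.csrf matches the field used in the migration code below):

// compare the token sent by the client against the one stored in the
// signed session, for state-changing requests only:
const requireCsrfToken = function(req, res, next){
  const sent = req.get('x-csrf-token') || (req.body && req.body._csrf);
  if(!req.session || !req.session.csrf || sent !== req.session.csrf){
    return res.status(403).send('invalid CSRF token');
  }
  return next();
};

// applied to state-changing routes, for example:
// app.post('/api/things', requireCsrfToken, handler);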

Migrating Old Sessions

It's important to migrate existing sessions so we don't log out users - running both the express-session and cookie-session middleware simultaneously isn't possible, as they both hook req.session and the response object.

As a result, we had to extract the logic from express-session which actually reads and verifies cookies (the getcookie function), and manually check the connect-mongo store, which is relatively straightforward:

const session = require('express-session'); // just for passing to connect-mongo, not used as middleware!
const MongoStore = require("connect-mongo")(session);
const legacySessionStore = new MongoStore({ ...  });

const loadLegacySession = function(req, callback){
  // getcookie is the cookie-reading/verifying helper extracted from express-session:
  const session_id = getcookie(req, legacyCookieName, [legacyCookieSecret]);
  if(session_id){
    legacySessionStore.get(session_id, function(err, session){
      return callback(err, session_id, session);
    });
  }else{
    return callback(null, null, null);
  }
};
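deleteLegacySession, used in the middleware below, isn’t shown in full - it’s essentially just the store’s destroy call plus clearing the legacy cookie. A rough sketch (not our exact implementation):

const deleteLegacySession = function(legacySessionID, req, res){
  legacySessionStore.destroy(legacySessionID, function(err){
    // a failed delete isn't fatal: the old session will expire eventually anyway
  });
  res.clearCookie(legacyCookieName);
};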

With this in place, the final middleware for migrating sessions is straightforward. The migration is only temporary - once all old sessions have expired, we’ll be able to just use the new cookieSessionMiddleware directly instead.

const sessionMiddleware = function(req, res, next){
  // first delegate to the new session middleware:
  cookieSessionMiddleware(req, res, function(err){
    if(err) return next(err);
    // then, ONLY IF there's no user ID in the new style 
    // session, try to load one from the legacy session 
    // so we can migrate it:
    if(!(req.session && req.session.passport && req.session.passport.user)){
      loadLegacySession(req, function(err, legacySessionID, legacySession){
        if(err) return next(err);
        // if there was a passport user ID in the old 
        // session, migrate it:
        if(legacySession && legacySession.passport && legacySession.passport.user){
          req.session.passport = {user:legacySession.passport.user};
          // also migrate any existing CSRF token, so 
          // CSRF tokens used in pages which are already  
          // open remain valid:
          if(legacySession._csrf){
            req.session.csrf = legacySession._csrf;
          }
        }
        // delete the old session:
        if(legacySessionID){
          deleteLegacySession(legacySessionID, req, res);
        }
        return next();
      });
    }else{
      // if we have an authed session from the new cookie 
      // already then we're done:
      return next();
    }
  });
};

After this, the passport middleware works exactly the same as before, loading req.user from session.passport.user.
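For completeness, that passport wiring is just the usual serializeUser/deserializeUser pair - roughly like this (loadUserById is a stand-in for however you actually look up users):

const passport = require('passport');

passport.serializeUser(function(user, done){
  // store only the user's ID in the session cookie:
  done(null, user.id);
});

passport.deserializeUser(function(id, done){
  // loads req.user from the user ID stored in session.passport.user;
  // loadUserById is a placeholder for your real user lookup:
  loadUserById(id, done);
});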

Part 4: The Results!

We deployed the new sessions around January 27th. Based on one week either side of that, we saw some dramatic differences:

Database update operations, and corresponding db journal data, were reduced by approximately 80%, from 0.4MB/s to 0.08MB/s

Chart showing database update data rates dropping
Chart showing database journal data rates dropping

Database volume busy time (which previously limited our peak scaling), reduced from approximately 15% to approximately 3%. In theory we can now handle peaks of over 30x our normal traffic volume, instead of peaks of only 7x!

Chart showing increase in database volume idle time
(The reduction in the read ops of two of the volumes is primarily because they were being used as syncing sources for our off-site replicas - less journal data means less to be read for syncing)

And finally, 311 GB of session data and corresponding indexes can eventually be dropped from our database (multiplied across replicas, that’s over 1.5TB of disk space, or about $160/month)


Hopefully this has been an interesting read. If you thought we were crazy to store our sessions in MongoDB in the first place, well, we also used to store the entire contents of Coggle documents in a MongoDB database too… maybe we’ll write about that next!


Posted by James, Feb 2020.
