
Scaling the Routing Service - A Discussion About How to Start Up Faster #3953

@kevinkreiser

Description


Editor's note: We've been struggling with this problem since the creation of the project. Indeed, most other projects with data of this size that allow queries of this size will have struggled with this same problem. These are some of the various thoughts and topics that have come up over the years regarding it.

Problem Description

In typical backend services one generally relies on horizontal scaling to match changing request load. In short, your service is configured to add or remove capacity (instances of the service on separate machines/vms/k8s pods etc.) depending on the value of some metric (e.g. cpu utilization). A planet-wide routing tileset itself can be anywhere from 70 to 140GB in size. For best performance (in terms of latency) the service is run with all the data local to its process (typically on an nvme disk). What this means is that scaling up (adding capacity) takes time; typically 10+ minutes to get the data onto the disk before the new service instance can start. In that time the incoming request volume can overwhelm the capacity of your existing instances, which results in your service returning 504 (timeout) for some of the queued requests.

Conventional Scaling is Rarely Practical

At the end of the day we have a database and we want to scale it horizontally. Depending on your use-case, the shape of your request traffic can vastly limit your options for horizontally scaling. Many factors play a role, but the variance in request rate and the variance in complexity of those requests over time are the typical deciding factors in whether or not it's possible for you to horizontally scale valhalla as it works today. If you have a nice sinusoidal request rate over the course of the day (corresponding to when people are awake and not) and your requests don't vary that much in complexity, then you can probably get away with scaling valhalla in the typical way; just set a really low CPU threshold and let the service add and remove machines over the course of the day.

Sadly though, the world is usually only ever that predictable in the limit. With huge request volumes you can expect some dependability, but it's far more common to have a request rate that is at least somewhat "bursty," meaning a higher degree of variance in request rate and complexity over time. Strategies or theories for responding to bursty request patterns are therefore what we will concern ourselves with in the rest of this writeup.

Typical Configuration has Obvious Waste

The typical backend service (as opposed to embedded phone app or in-car dash system) configuration for valhalla consists of:

  1. running the service in multiprocess mode for operational stability due to isolation of requests
  2. making the data accessible via a memory mapped tar file so that OS filesystem cache (ram) can be shared across processes that need lock-free concurrent access to the data. This is essentially a multi-reader (the service processes) single-writer (the OS) cache

As mentioned above we download the whole of the data (one big tar file) to the disk before we start the service processes, which memory map the file. The download takes a while, even with compression/decompression tricks, and memory mapping the file can also take some time (though this can be shortened by using tar index support which we added in #3281).
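The shared-cache setup above is easy to see in miniature. The following is an illustrative Python sketch (Valhalla itself is C++): every service process opens the same tar read-only and memory maps it, so the kernel's page cache backs all of the mappings with the same physical pages and N processes share one copy of the data in RAM.

```python
import mmap
import os

def map_tileset(tar_path: str) -> mmap.mmap:
    """Memory map the tileset tar read-only. Many processes can do this
    concurrently and lock-free; the OS page cache is the single writer."""
    fd = os.open(tar_path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        return mmap.mmap(fd, size, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # safe: mmap holds its own duplicate of the descriptor
```

Mapping is cheap relative to the download, but first-touch reads still fault pages in from disk, which is the cache-miss penalty discussed below.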

So the obvious thing we want to do is stop wasting time getting data to the service that it doesn't immediately need. We download the whole planet before starting the service, but a given request can be serviced by only a handful of tiles (a small fraction of the data).

How to Reduce the Wasted Data

So now that we know we don't need all the data to start handling requests (that's the benefit of having a tiled dataset), how do we determine what data to cut out? There are basically two approaches to solving this problem, and they are not mutually exclusive.

The first option is to reduce the total scope of the data by concentrating on routing in only one part of the world. Some people refer to this approach as "sharding": an idea that's been around forever in distributed systems, in which no single instance has a copy of the entire dataset but rather just a single piece of the whole; a shard.

The second option is lazy loading. An equally old idea, in which data is fetched/loaded on-the-fly as the algorithm calls for it. That is, the data accessor is lazy: it doesn't preload any data, it waits for the code to ask for data and loads it only then, or in the background while other work is already being done.
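As a hypothetical sketch (the class and callback names are made up, not the real graphreader API), a lazy accessor preloads nothing and only pays the fetch cost the first time a given tile is requested:

```python
class LazyTileCache:
    """Fetch tiles on demand; cache them so the round trip is paid once."""

    def __init__(self, fetch):
        self.fetch = fetch   # e.g. an HTTP GET against remote storage
        self.tiles = {}      # per-process in-memory cache

    def get(self, tile_id):
        if tile_id not in self.tiles:
            # first request for this tile: go fetch it now
            self.tiles[tile_id] = self.fetch(tile_id)
        return self.tiles[tile_id]
```

The instant-start benefit and the two drawbacks discussed in the lazy loading section below (fetch latency, and a separate cache per process) both fall directly out of this shape.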

Sharding: Regional Extracts

The sharding approach would practically result in at least several top level instance configurations based on their data coverage:

  1. North America
  2. South America
  3. Afro-Eurasia
  4. Oceania + Rest of World

This is primarily because these are the major connected regions of the world. Of course these regions could be further reduced to smaller and smaller regions. For example you could keep the Afro-Eurasia region in the case that you'd need to route someone from Cape Town to Beijing, but you could also have separate instances running with just Western Europe or just North Africa and so on. The shards can overlap, in other words. We do this with the hope that you don't see a lot of bursty request patterns for the larger shards (Afro-Eurasia has to be like 70% of the planet data) so that most of the horizontal scaling is done on the smaller shards, which start faster by virtue of having much less data.
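A coarse first cut at routing requests to overlapping shards can be sketched as picking the smallest shard whose extent contains every location in the request. Everything here is illustrative: the extents are made-up bounding boxes, and a real proxy would have to account for graph connectivity, not just geometry.

```python
# (min_lon, min_lat, max_lon, max_lat) -- rough, made-up extents
SHARDS = {
    "western_europe": (-11.0, 35.0, 25.0, 61.0),
    "afro_eurasia": (-20.0, -35.0, 180.0, 80.0),
    "north_america": (-170.0, 5.0, -50.0, 84.0),
}

def pick_shard(locations):
    """Return the smallest shard covering all (lon, lat) pairs, else None."""
    def contains(box, lon, lat):
        min_lon, min_lat, max_lon, max_lat = box
        return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

    candidates = [name for name, box in SHARDS.items()
                  if all(contains(box, lon, lat) for lon, lat in locations)]
    if not candidates:
        return None  # fall back to a planet-wide instance
    # smallest covering shard -> least data -> fastest to scale
    return min(candidates,
               key=lambda n: (SHARDS[n][2] - SHARDS[n][0]) *
                             (SHARDS[n][3] - SHARDS[n][1]))
```

So a Paris-to-Berlin request lands on the small Western Europe shard even though Afro-Eurasia also covers it, which is exactly the "scale the small shards" behavior we want.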

Pros to this approach:

  1. Cut up the data as much as you like to match your traffic the best you can
  2. The approach works without any code changes to valhalla
  3. The idea is straightforward

Cons to this approach:

  1. You're only as scalable as your largest shard; it's possible it's still too large to practically start in time
  2. You need a complicated request proxy that knows about the shards and doesn't pick the wrong one (this is non-trivial, shard shapes must be based on connectivity somehow)
  3. Data production is complicated for the same reason, need to learn how to cut up the data based on what we know about request location and volume
  4. Managing heterogeneous clusters is complicated, they all have to be tracked and alarmed separately

Lazy Loading

Lazy loading in a practical sense means just-in-time tile fetching from some location that isn't local to the process. Valhalla already supports this via its http drivers. Even today, one can configure valhalla to fetch tiles on-the-fly from s3 for example (using pre-signed urls). What this means is that even today you can get a new instance of the service started instantly and start servicing requests instantly. The problem with this approach though is that the latency of such a solution is far too high. An HTTP round trip for something like s3 is already on the order of 50-100ms. For it to work we need low latency access to the tile data. NVME drives have a latency of something like 50-100μs; this is our cache miss penalty in memory mapped tar mode described above. In other words our worst case is 3 orders of magnitude faster than lazy loaded http data access.

On top of that, the stuff we lazy load isn't shared in a memory map across processes; because we are grabbing individual tiles we end up having an in-memory tile cache local to every process. This means we need a lot more ram to handle the same request load as before, or we need to do a lot more per process cache invalidation which also slows things down. So yeah, a pretty dismal situation.

So those are the two issues we need to solve to make lazy loading practical.

  1. Low latency access to remote tiles -> A global distributed cache
  2. Low ram requirements per instance -> A local shared cache

The good news is we already have both of these readily available to us. For the former we can use Redis/Memcached (or hosted variants) and for the latter we can continue using our memory mapped tar. How do we do that?

At Data Build Time

  1. optional: modify the tar generation script to do better sorting of the tiles within the tar: cluster level 0, 1 and 2 tiles together in blocks rather than sorting lexically; could also try the infamous Hilbert curve (though it's not clear how the different levels fit in that scheme). This should improve cache coherency somewhat when memory mapping
  2. populate redis with tileset/tileid as keys and corresponding binary gph tiles as values
  3. also add the tar file index.bin with tileset/index.bin as the key and the actual file as the value

Single-Writer or Multi-Writer, Either Way Don't Race

Here we have two options: we can start a sidecar to handle operations between redis (or whatever lower latency storage) and the filesystem tar, or we can do this all inside graphreader and libvalhalla itself. Below I wrote an outline with the former in mind but the latter is equally doable.

  1. download the index.bin file from redis (should be tiny)
  2. write the beginning of the tar with the entry header and contents of index.bin and then write 0s until the tar reaches the desired length (based on the last entry of index.bin)
  3. memory map the tar we just created
  4. open up a (domain) socket to listen for requests from the valhalla graph reader processes, if they see something is missing in the tar they will ask this process to get it
  5. when they ask, this sidecar process will:
    1. go to redis to get the tile data
    2. write all the bytes, except the graph_tile_size part of the graphtileheader, into the location it belongs within the tar based on index.bin
  
    3. then write the graph_tile_size part of the graphtileheader signaling that the write is complete (to the readers in the service process)
  6. return an ok over the socket signalling to the caller that it can now use the memory mapped address in the tar
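The write ordering in steps 5.2 and 5.3 is the crux: the size field must land last so readers never see a half-written tile. A minimal sketch, with the caveats that SIZE_OFFSET is a made-up position for graph_tile_size (the real graphtileheader layout differs) and that the final 4-byte store is assumed to be effectively atomic for concurrent readers:

```python
SIZE_OFFSET = 8  # hypothetical byte offset of graph_tile_size in the header

def install_tile(mm, tile_offset: int, tile_bytes: bytes) -> None:
    """Copy a tile fetched from redis into the memory-mapped tar such that
    graph_tile_size becomes non-zero only after everything else has landed."""
    size_field = tile_bytes[SIZE_OFFSET:SIZE_OFFSET + 4]
    # 1) body first: everything except the 4-byte size field
    mm[tile_offset:tile_offset + SIZE_OFFSET] = tile_bytes[:SIZE_OFFSET]
    mm[tile_offset + SIZE_OFFSET + 4:tile_offset + len(tile_bytes)] = \
        tile_bytes[SIZE_OFFSET + 4:]
    # 2) size field last: readers treat non-zero as "write complete"
    mm[tile_offset + SIZE_OFFSET:tile_offset + SIZE_OFFSET + 4] = size_field
```

This is why the scheme works without locks between the single writer (the sidecar) and the many readers (the service processes).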

Multi-Reader, Check then Fallback

On the graphreader side where we need to get access to the data we essentially need to add a little bit of checking to the tar "driver".

  1. Update the tar driver to check the graph_tile_size field of the graphtileheader for a non-zero value
  2. If the value is non-zero we are good to go: return the address of the tile inside the mem-mapped tar as normal
  3. If the value is zero, fall back to sending a request for the tile over the open socket
  4. Once the socket returns ok we can return the address as normal
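The reader-side check above amounts to one extra branch per tile access. A hedged sketch, assuming a made-up uint32 graph_tile_size offset within the tile header and with the socket round trip abstracted behind a callback:

```python
import struct

SIZE_OFFSET = 8  # hypothetical byte offset of graph_tile_size in the header

def get_tile(mm, tile_offset: int, tile_len: int, request_from_sidecar):
    """Return a view of the tile in the mapped tar, blocking on the sidecar
    if the tile hasn't been installed yet (size field still zero)."""
    size = struct.unpack_from("<I", mm, tile_offset + SIZE_OFFSET)[0]
    if size == 0:
        # cache miss: ask the sidecar over the domain socket and wait for ok
        request_from_sidecar(tile_offset)
    return memoryview(mm)[tile_offset:tile_offset + tile_len]
```

On the warm path (size already non-zero) this costs a single 4-byte read, so tiles that are already resident stay as fast as the plain memory-mapped tar mode.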

Synopsis

A practical approach to solving the horizontal scaling problem likely requires a combination of strategies: shrinking the data and delaying data fetching. I would suggest starting with lazy loading support, measuring its efficacy, and then deciding whether or not sharding is required to further reduce the scope of data needed at startup.

Risks

  1. if the request load is so varied that it covers the whole globe, there is nothing we can do. look at your request history and the geographic distribution of the queries to determine how localized bursty traffic is
  2. OS filesystem cache is not magic, it takes time to move bytes from disk into ram. A redis round trip goes through ram, to disk and back to ram; this is the price we pay to avoid having machines with a ton of ram. It also means that latency of requests will be slow while a server is starting up. We hope that it will still be usable, but if there is an experiment we can design to de-risk this it should be done. There is no point in building the whole system if it still takes 5 minutes for the fresh instance to reach semi-normal request latencies
  3. I'm sure there are more, I'll add them when it's not 2am...
