
Resource Usage

Jeremy Echols edited this page Oct 2, 2024 · 9 revisions

This page shares resource usage information from institutions running Open ONI in production. Please contact us on Slack if you have information to share with the community here.


Historic Oregon Newspapers

Oregon runs Open ONI on Red Hat Enterprise Linux 9 in a monolithic configuration, not through Docker.

| When      | CPU       | Memory | Storage         | Pages       |
|-----------|-----------|--------|-----------------|-------------|
| 2018-2022 | Two vCPUs | 6GB    | See notes below | 1.2 million |
| 2022-2024 | Two vCPUs | 10GB   | See notes below | 2 million   |
| 2024-     | Two vCPUs | 16GB   | See notes below | 2.5 million |

UO is probably a good baseline - we get a ton of traffic for our server's power, especially when you look at the number of bots we let spider things.

- We have about 2.5 million unique newspaper pages. More than half are color. Color pages use about 3x the RAM and CPU of monochrome pages.
- In any given week, we see roughly 35,000 user page requests, but roughly twenty times that many bot page requests.
  - To be clear, I'm not talking about asset requests. I'm talking about HTML requests, and specifically individual newspaper page views (since those are the easiest to find bot traffic against).

Server: To power this, our server has 2 vCPUs and 16 gigs of RAM. This server runs Apache (serving Django via WSGI), MySQL, Solr, and RAIS, as well as the batch ingests when those happen. In other words, it's a monolith with no separation of services, because it's one of the very few projects that has just never needed separation.

Solr: Our instance is allocated 4GB of RAM. We may raise this, since at least 4 gigs of RAM have been sitting mostly idle since we moved to 16GB.
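For a packaged Solr install, the heap allocation is typically set in Solr's include script. A minimal sketch, assuming the stock `solr.in.sh` location for a service install (the path varies by install method, and Open ONI doesn't mandate this value):

```shell
# /etc/default/solr.in.sh (location varies by install) —
# SOLR_HEAP sets both -Xms and -Xmx for the Solr JVM.
SOLR_HEAP="4g"
```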

Storage: If coded to NDNP spec minimums, and monochrome, an okay ballpark is probably one terabyte per 25,000 pages. Our 2 million pages are a bit misleading since a lot of them have no TIFFs - born-digital issues are much easier to store.
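The ballpark above reduces to quick arithmetic. A sketch, where the 1 TB per 25,000 pages figure comes straight from the text and `estimate_storage_tb` is a hypothetical helper, not part of Open ONI:

```python
# Rough storage estimate from the rule of thumb above: ~1 TB per
# 25,000 monochrome, NDNP-spec-minimum pages. Real usage varies a
# lot with color depth and whether TIFFs are retained, which is why
# actual batch storage can come in far under this ceiling.

def estimate_storage_tb(pages: int, tb_per_25k_pages: float = 1.0) -> float:
    """Ballpark batch storage in terabytes for a given page count."""
    return pages / 25_000 * tb_per_25k_pages

# 2.5 million pages at spec minimums would ballpark at ~100 TB.
print(estimate_storage_tb(2_500_000))  # → 100.0
```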

All HTTP traffic comes in through an HAProxy cluster. We're not doing anything fancy there, but it does help us block misbehaving spiders like Tencent's IP ranges. We also block overzealous spiders by name in Apache (ChatGPT, Claude, ByteDance, a few others), since it's easier to quickly make changes in Apache than in our HAProxy cluster.
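A by-name block like the one described might look like the following in an Apache vhost; the bot names come from the text, but the exact user-agent patterns and the use of mod_rewrite are assumptions, not UO's actual config:

```apache
<IfModule mod_rewrite.c>
    RewriteEngine On
    # Return 403 Forbidden to requests whose User-Agent matches a
    # known overzealous crawler (case-insensitive substring match).
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT|ClaudeBot|Bytespider) [NC]
    RewriteRule . - [F,L]
</IfModule>
```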

Solr, MySQL, and word coordinates can use a fair amount of storage with big collections (~200 gigs for us), but it's a very small fraction of the storage needed just for the JP2s (we're holding 12 terabytes of batches, probably more than half of which is JP2s since we archive the TIFFs).

When we were running at about 1 million unique pages, it should come as no surprise that our storage needs were roughly half, Solr needed less than a gig of RAM, and the server itself ran with about 6GB of RAM.

Nebraska Newspapers

Nebraska runs Open ONI on a CentOS 7 KVM virtual server with Apache, Solr, and MariaDB running natively and RAIS running via Docker. Apache and MariaDB configuration has been customized, but Solr is running with the default 512m heap.

| CPU      | Memory | Storage | Pages   |
|----------|--------|---------|---------|
| 16 vCPUs | 8GB    | 4.6TB   | 570,000 |

The Open ONI-related processes are using around 2.6GB of RAM:

- 1.4GB for MariaDB (largely due to custom config)
- 600MB for RAIS
- 350MB for Solr
- 300MB for Apache + Django via mod_wsgi with the default number of processes and maximum-requests=10000
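
The mod_wsgi setup described could be expressed roughly as below; the daemon-process name and script path are placeholders, and only `maximum-requests=10000` comes from the text:

```apache
# Each daemon process is recycled after 10,000 requests, which caps
# gradual memory growth in the Django workers. Names/paths here are
# hypothetical, not Nebraska's actual config.
WSGIDaemonProcess openoni maximum-requests=10000
WSGIProcessGroup openoni
WSGIScriptAlias / /opt/openoni/onisite/wsgi.py
```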

All our pages are NDNP-spec, monochrome, with only JP2 images.

North Carolina Newspapers

North Carolina Digital Heritage Center (NCDHC) runs Open ONI in a Docker environment (MariaDB, RAIS).

| CPU       | Memory          | Storage | Pages                    |
|-----------|-----------------|---------|--------------------------|
| 4-core VM | 16GB (8GB swap) | 23TB    | 2.9 million (March 2022) |

As of March 2022, pages served by Open ONI get ~12,500 views per day. All page batches are NDNP-spec, except that the TIFF derivatives have been removed in favor of JP2s. Almost all JP2s are monochrome; an estimated 10-20% are full-color.

Pennsylvania Newspaper Archive

Penn State runs Open ONI in a Docker environment.

| CPU       | Memory | Storage | Pages       |
|-----------|--------|---------|-------------|
| Four CPUs | 32GB   | 11TB    | 1.1 million |

Memory usage sits at about 8.27GB, with a load average of 0.26.

We also have a fairly large number of TIFFs still in the batches. We're planning to drop them sometime this spring and rebag the batches.
