# Resource Usage
This page shares resource usage information from institutions running Open ONI in production. Please contact us on Slack if you have information to add here for the community.
## Contents
- Historic Oregon Newspapers
- Nebraska Newspapers
- North Carolina Newspapers
- Pennsylvania Newspaper Archive
## Historic Oregon Newspapers
Oregon runs Open ONI on Red Hat Enterprise Linux 9 in a monolithic configuration, not through Docker.
| When | CPU | Memory | Storage | Pages |
|---|---|---|---|---|
| 2018-2022 | 2 vCPUs | 6GB | See notes below | 1.2 million |
| 2022-2024 | 2 vCPUs | 10GB | See notes below | 2 million |
| 2024- | 2 vCPUs | 16GB | See notes below | 2.5 million |
UO is probably a good baseline: we get a lot of traffic relative to our server's power, especially given the number of bots we let spider the site.
- We have about 2.5 million unique newspaper pages. More than half are color. Color pages use about 3x the RAM and CPU as monochrome.
- In any given week we see roughly 35,000 page requests from users, but roughly twenty times that many from bots.
- To be clear, these aren't asset requests: they're HTML requests, specifically individual newspaper page views, since those are the easiest to measure bot traffic against.
Server: To power this, our server has 2 vCPUs and 16GB of RAM. It runs Apache (serving Django via WSGI), MySQL, Solr, and RAIS, as well as batch ingests when those happen. In other words, it's a monolith with no separation of services, because this is one of the very few projects that has simply never needed separation.
Solr: Our instance is allocated 4GB of RAM. We may bump this, since we've had at least 4GB of RAM sitting mostly idle since we moved to 16GB.
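For reference, a heap allocation like this is typically set via `SOLR_JAVA_MEM` in `solr.in.sh` (standard Solr configuration; shown as a sketch matching the 4GB figure above, not our exact file):

```shell
# /etc/default/solr.in.sh (or bin/solr.in.sh): fixed 4GB heap for Solr
SOLR_JAVA_MEM="-Xms4g -Xmx4g"
```

Setting `-Xms` and `-Xmx` to the same value avoids heap resizing under load.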
Storage: For pages encoded to NDNP spec minimums, and monochrome, a reasonable ballpark is one terabyte per 25,000 pages. Our 2 million pages are a bit misleading, since many of them have no TIFFs: born-digital issues are much easier to store.
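The ballpark above works out as a one-liner (a sketch; 1TB per 25,000 pages is the rule of thumb from this section, not a measured figure):

```shell
# Estimate storage at ~1 TB per 25,000 NDNP-spec monochrome pages
awk -v pages=2000000 'BEGIN { printf "%.1f TB\n", pages / 25000 }'   # prints "80.0 TB"
```

At full NDNP spec, 2 million pages would need on the order of 80TB, which is why a batch store far smaller than that implies many JP2-only, born-digital issues.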
All HTTP traffic comes in through an HAProxy cluster. We're not doing anything fancy there, but it does help us block misbehaving spiders such as Tencent's IP ranges. We also block overzealous spiders by name in Apache (ChatGPT, Claude, ByteDance, a few others), since it's quicker to make changes in Apache than in our HAProxy cluster.
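As one way to do the by-name blocking mentioned above, Apache 2.4's mod_setenvif plus mod_authz_core can deny requests by User-Agent (a sketch; the exact patterns and bot names are assumptions, not our actual config):

```apache
# Flag overzealous crawlers by User-Agent (patterns are illustrative)
BrowserMatchNoCase "GPTBot|ClaudeBot|Bytespider" bad_bot

<Location "/">
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Location>
```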
Solr, MySQL, and word coordinates can use a fair amount of storage with big collections (~200 gigs for us), but it's a very small fraction of the storage needed just for the JP2s (we're holding 12 terabytes of batches, probably more than half of which is JP2s since we archive the TIFFs).
When we were running at about 1 million unique pages, it should come as no surprise that our storage needs were roughly half, Solr needed less than a gig of RAM, and the server itself ran with about 6GB of RAM.
## Nebraska Newspapers
Nebraska runs Open ONI on a CentOS 7 KVM virtual server with Apache, Solr, and MariaDB running natively and RAIS running via Docker. Apache and MariaDB configuration has been customized, but Solr is running with the default 512m heap.
| CPU | Memory | Storage | Pages |
|---|---|---|---|
| 16 vCPUs | 8GB | 4.6TB | 570,000 |
The Open ONI-related processes are using around 2.6GB of RAM:
- 1.4GB for MariaDB (largely due to custom config)
- 600MB for RAIS
- 350MB for Solr
- 300MB for Apache + Django via mod_wsgi with the default number of processes and `maximum-requests=10000`
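A per-process breakdown like the list above can be approximated by summing resident memory (RSS) per command name (a sketch; assumes a Linux host with procps `ps` and awk):

```shell
# Sum resident memory (RSS) by command name, reported in MB, largest first
ps -eo rss=,comm= | awk '
    { total[$2] += $1 }
    END { for (cmd in total) printf "%.0f MB\t%s\n", total[cmd] / 1024, cmd }
' | sort -rn
```

Note that RSS double-counts pages shared between processes, so these figures are upper bounds; a tool like `smem` reports fairer proportional (PSS) numbers.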
All our pages are NDNP-spec, monochrome, with only JP2 images.
## North Carolina Newspapers
North Carolina Digital Heritage Center (NCDHC) runs Open ONI in a Docker environment (MariaDB, RAIS).
| CPU | Memory | Storage | Pages |
|---|---|---|---|
| 4-core VM | 16GB (8GB swap) | 23TB | 2.9 million (March 2022) |
As of March 2022, pages served by Open ONI get ~12,500 views per day. All page batches are NDNP-spec except the TIFF derivative has been removed in favor of JP2. Almost all JP2s are monochrome; estimated 10-20% are full-color.
## Pennsylvania Newspaper Archive
Penn State runs Open ONI in a Docker environment.
| CPU | Memory | Storage | Pages |
|---|---|---|---|
| 4 CPUs | 32GB | 11TB | 1.1 million |
Memory usage sits at about 8.27GB, with a load average of 0.26.
We also have a fairly large number of TIFFs that are still in the batches. We’re planning to dump them sometime this spring and rebag the batches.