rgw: add usage exporter with LMDB-backed usage cache and perf counters #66109

Open
harsimran-05 wants to merge 14 commits into ceph:main from harsimran-05:usage-exporter-clean-v4

@harsimran-05 harsimran-05 commented Nov 3, 2025

This PR introduces a low-overhead mechanism to track and expose per-user and per-bucket usage statistics (used bytes and object count) in Ceph RGW, using:
-> Labeled PerfCounters via key_create(): metrics use Prometheus labels (tenant, owner, bucket) instead of embedding identifiers in metric names
-> LMDB, a lightweight embedded key-value store, for persistence across RGW restarts
-> A background refresh thread that periodically syncs statistics from RADOS (the distributed source of truth)

Design:

-> RADOS is the single source of truth for all usage statistics
-> Background thread reads from RADOS every rgw_usage_stats_refresh_interval (default: 20 minutes)
-> LMDB cache is used only for persistence across RGW restarts
-> No cache updates in I/O path - only marks users as "active" for background refresh
-> All RGWs in a cluster show consistent, accurate statistics
-> Owner label is extracted from bucket->get_info().owner via std::visit, supporting both rgw_user and rgw_account_id
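
The flow described in this list can be pictured with a small Python sketch. It is not the PR's C++ code: UsageRefresher, fetch_stats, and refresh_once are hypothetical names, a plain dict stands in for the labeled perf counters, and only mark_user_active() mirrors a name used in this description.

```python
import threading
from typing import Callable, Dict, Set, Tuple

Stats = Tuple[int, int]  # (used_bytes, num_objects)


class UsageRefresher:
    """Illustrative only: the I/O path marks users, a background pass reads stats."""

    def __init__(self, fetch_stats: Callable[[str], Stats]) -> None:
        # fetch_stats is a hypothetical callback standing in for a read from RADOS.
        self._fetch_stats = fetch_stats
        self._active: Set[str] = set()
        self._lock = threading.Lock()
        self.counters: Dict[str, Stats] = {}

    def mark_user_active(self, user: str) -> None:
        # Hot path: record the user and return; no stats are computed here.
        with self._lock:
            self._active.add(user)

    def refresh_once(self) -> None:
        # Background path: snapshot the active set, then query the source of truth.
        with self._lock:
            users = sorted(self._active)
        for user in users:
            self.counters[user] = self._fetch_stats(user)


# A fake backend plays the role of RADOS in this example.
backend = {"testuser": (15728652, 3)}
refresher = UsageRefresher(lambda user: backend[user])
refresher.mark_user_active("testuser")
before = dict(refresher.counters)  # still empty: nothing is computed in the I/O path
refresher.refresh_once()
after = dict(refresher.counters)   # populated by the refresh pass
```

In a real daemon, refresh_once() would presumably be driven by a timer thread at rgw_usage_stats_refresh_interval; the sketch calls it directly to keep the example deterministic.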

Metric Format
Bucket metrics use labels for tenant, owner, and bucket name:

rgw_bucket_usage{tenant="", owner="testuser", bucket="bucket1"} used_bytes=5242892 num_objects=2
rgw_bucket_usage{tenant="", owner="testuser", bucket="bucket2"} used_bytes=10485760 num_objects=1

User metrics use an owner label:
rgw_user_usage{owner="testuser"} used_bytes=15728652 num_objects=3
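
To make the label-based naming concrete, here is a minimal Python sketch that renders a line in the same layout as the examples above. format_usage_line is a hypothetical helper written for this illustration; the layout is the one used in this description, not the actual Prometheus exposition format.

```python
from typing import Dict


def format_usage_line(metric: str, labels: Dict[str, str], values: Dict[str, int]) -> str:
    """Render one line in the illustrative layout used in this description."""
    label_part = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    value_part = " ".join(f"{key}={value}" for key, value in values.items())
    return f"{metric}{{{label_part}}} {value_part}"


line = format_usage_line(
    "rgw_bucket_usage",
    {"tenant": "", "owner": "testuser", "bucket": "bucket1"},
    {"used_bytes": 5242892, "num_objects": 2},
)
print(line)
```

The metric name stays constant while the identifiers travel in the labels, so a new bucket adds a new label set instead of a new metric name.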

Key Components
UsagePerfCounters (rgw_usage_perf.h/cc)

Manages per-user PerfCounters for Prometheus export
Maintains set of "active" users for background refresh
Background refresh thread syncs from RADOS at configurable interval
Creates dynamic perf counters: rgw_user_ with used_bytes and num_objects

UsageCache (rgw_usage_cache.h/cc)

LMDB-backed persistent cache for usage statistics
Provides fast reads for perf counter updates
Survives RGW restarts (stats available immediately on startup)
Simplified counters: cache_updates, cache_size, cache_hits, cache_misses
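
The role of the persistent cache and its health counters can be sketched as follows. This is an assumption-laden stand-in: a JSON file replaces LMDB so that the example needs only the standard library, FileBackedUsageCache and the "user:testuser" key format are hypothetical, and cache_size is taken to mean the number of entries, which the description does not specify.

```python
import json
import os
import tempfile
from typing import Dict, Optional, Tuple

Stats = Tuple[int, int]  # (used_bytes, num_objects)


class FileBackedUsageCache:
    """Stand-in for the LMDB cache: a JSON file provides restart persistence."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._data: Dict[str, Stats] = {}
        self.cache_updates = 0
        self.cache_hits = 0
        self.cache_misses = 0
        if os.path.exists(path):
            # Reload whatever a previous process left behind.
            with open(path) as fh:
                self._data = {k: (v[0], v[1]) for k, v in json.load(fh).items()}

    @property
    def cache_size(self) -> int:
        # Assumption: size is the number of cached entries.
        return len(self._data)

    def update(self, key: str, used_bytes: int, num_objects: int) -> None:
        self._data[key] = (used_bytes, num_objects)
        self.cache_updates += 1
        with open(self._path, "w") as fh:
            json.dump(self._data, fh)

    def get(self, key: str) -> Optional[Stats]:
        stats = self._data.get(key)
        if stats is None:
            self.cache_misses += 1
        else:
            self.cache_hits += 1
        return stats


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "usage_cache.json")
    FileBackedUsageCache(path).update("user:testuser", 15728652, 3)
    reopened = FileBackedUsageCache(path)     # simulates an RGW restart
    restored = reopened.get("user:testuser")  # served from the persisted file
    missing = reopened.get("user:nobody")     # never written, so a miss
```

The second instance sees the stats without any write of its own, which is the "available immediately on startup" property described above.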

I/O Path Integration (rgw_op.cc)

Minimal overhead: only calls mark_user_active() on PUT operations
No sync_owner_stats() calls in hot path
No cache calculations or updates during I/O

Configuration Options

  1. rgw_enable_usage_perf_counters (type: bool, default: false): enables or disables the feature
  2. rgw_usage_cache_path (type: string, default: /var/lib/ceph/radosgw/usage_cache-$cluster-$name.mdb): LMDB database path
  3. rgw_usage_cache_max_size (type: size, default: 1GB): maximum LMDB database size
  4. rgw_usage_stats_refresh_interval (type: int, default: 20 min): background sync interval from RADOS

PerfCounters Exposed
Per-Bucket Counters (rgw_bucket_usage)

-> Labels: tenant, owner, bucket
-> used_bytes: total bytes stored in bucket
-> num_objects: total objects in bucket

Per-User Counters (rgw_user_usage)

-> Labels: owner
-> used_bytes: total bytes used by user
-> num_objects: total objects owned by user

Cache Health Counters (rgw_usage_cache)

-> cache_updates, cache_size, cache_hits, cache_misses

Unit Tests
Added test_rgw_usage_exporter.cc under src/test/rgw:
Creates a temporary LMDB store
Writes and verifies usage stats for both users and buckets

How to Test the Feature

Navigate to your Ceph build directory

cd /path/to/ceph/build

Kill any existing development cluster

../src/stop.sh

Start a minimal cluster with RGW

MON=1 OSD=3 MGR=1 RGW=1 ../src/vstart.sh -n -d --rgw_port 8000

Set up the usage counters feature

./bin/ceph -c ceph.conf config set client.rgw.8000 rgw_enable_usage_perf_counters true
./bin/ceph -c ceph.conf config set client.rgw.8000 rgw_usage_cache_path /tmp/usage_cache.mdb
./bin/ceph -c ceph.conf config set client.rgw.8000 rgw_usage_cache_max_size 1073741824

Restart RGW to apply settings

pkill radosgw
sleep 2
./bin/radosgw -c ceph.conf --log-file=out/radosgw.8000.log --admin-socket=out/radosgw.8000.asok --debug-rgw=20 -n client.rgw.8000 --rgw_frontends="beast port=8000" &
(Sometimes the prompt appears to hang after this command. Press Enter and check whether RGW is running; if it is, continue.)

Create a user for testing

./bin/radosgw-admin -c ceph.conf user create --uid=testuser --display-name="Test User" --access-key=test --secret-key=test

Install s3cmd if needed

pip install s3cmd

Configure s3cmd

cat > ~/.s3cfg << EOF
[default]
access_key = test
secret_key = test
host_base = localhost:8000
host_bucket = localhost:8000
use_https = False
signature_v2 = True
EOF

Create test files

echo "Hello World" > small.txt
dd if=/dev/zero of=medium.bin bs=1M count=5
dd if=/dev/zero of=large.bin bs=1M count=10

Create buckets

s3cmd mb s3://bucket1
s3cmd mb s3://bucket2

Upload files

s3cmd put small.txt s3://bucket1/
s3cmd put medium.bin s3://bucket1/
s3cmd put large.bin s3://bucket2/
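
The uploads above imply specific counter values, and working them out makes the later checks easier to read. The arithmetic below assumes that used_bytes reports the logical object size, which is what the sample values in the Metric Format section suggest.

```python
# Sizes implied by the file-creation commands above.
SMALL = len(b"Hello World\n")  # echo appends a newline: 12 bytes
MEDIUM = 5 * 1024 * 1024       # dd bs=1M count=5
LARGE = 10 * 1024 * 1024       # dd bs=1M count=10

expected = {
    "bucket1": {"used_bytes": SMALL + MEDIUM, "num_objects": 2},
    "bucket2": {"used_bytes": LARGE, "num_objects": 1},
}
user_total = {
    "used_bytes": sum(b["used_bytes"] for b in expected.values()),
    "num_objects": sum(b["num_objects"] for b in expected.values()),
}
print(expected)
print(user_total)
```

These totals (5242892 and 10485760 bytes per bucket, 15728652 bytes and 3 objects for testuser) are the same numbers shown in the Metric Format examples.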

Set the admin socket path

SOCKET=out/radosgw.8000.asok

(You will have to wait up to rgw_usage_stats_refresh_interval before the counters reflect the uploads.)
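
Because the counters only change after a background refresh, a scripted check has to poll instead of reading once. A minimal polling helper might look like this; wait_until is a hypothetical name, and the check callback is left abstract (in practice it would run the counter dump command shown below and look for the expected bucket).

```python
import time
from typing import Callable, List


def wait_until(check: Callable[[], bool], timeout_s: float, interval_s: float) -> bool:
    """Poll check() until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Demonstration with a fake check that succeeds on the third attempt.
attempts: List[int] = []


def fake_check() -> bool:
    attempts.append(1)
    return len(attempts) >= 3


ok = wait_until(fake_check, timeout_s=5.0, interval_s=0.01)
```

For a real run, the timeout would need to exceed the configured refresh interval.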

View bucket counters with labels

./bin/ceph --admin-daemon $SOCKET counter dump | jq '.rgw_bucket_usage'

View user counters with labels

./bin/ceph --admin-daemon $SOCKET counter dump | jq '.rgw_user_usage'

Query specific bucket

./bin/ceph --admin-daemon $SOCKET counter dump |
jq '.rgw_bucket_usage[] | select(.labels.bucket=="bucket1")'

Query specific user

./bin/ceph --admin-daemon $SOCKET counter dump |
jq '.rgw_user_usage[] | select(.labels.owner=="testuser")'

View bucket summary (bucket, owner, bytes)

./bin/ceph --admin-daemon $SOCKET counter dump |
jq '.rgw_bucket_usage[] | {bucket: .labels.bucket, owner: .labels.owner, bytes: .counters.used_bytes}'

Cross-check with radosgw-admin (values should match)

./bin/radosgw-admin -c ceph.conf bucket stats --bucket=bucket1 |
jq '.usage["rgw.main"] | {size, num_objects}'
./bin/ceph --admin-daemon $SOCKET counter dump |
jq '.rgw_bucket_usage[] | select(.labels.bucket=="bucket1") | .counters'
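
The same cross-check can be scripted. The sketch below assumes the JSON shapes implied by the jq filters above (usage["rgw.main"] with size and num_objects, and rgw_bucket_usage entries with labels and counters); the helper names are hypothetical and the sample documents are hand-written for illustration, not captured output.

```python
import json
from typing import Any, Dict, Optional


def counters_for_bucket(counter_dump: Dict[str, Any], bucket: str) -> Optional[Dict[str, Any]]:
    """Return the counters of the rgw_bucket_usage entry labeled with this bucket."""
    for entry in counter_dump.get("rgw_bucket_usage", []):
        if entry.get("labels", {}).get("bucket") == bucket:
            return entry.get("counters")
    return None


def usage_matches(bucket_stats: Dict[str, Any], counter_dump: Dict[str, Any], bucket: str) -> bool:
    """Compare 'bucket stats' output with the labeled perf counters."""
    main = bucket_stats.get("usage", {}).get("rgw.main", {})
    counters = counters_for_bucket(counter_dump, bucket)
    if counters is None or "size" not in main:
        return False
    return (
        counters.get("used_bytes") == main["size"]
        and counters.get("num_objects") == main.get("num_objects")
    )


# Hand-written sample documents in the assumed shapes.
stats_doc = json.loads('{"usage": {"rgw.main": {"size": 5242892, "num_objects": 2}}}')
dump_doc = json.loads(
    '{"rgw_bucket_usage": [{"labels": {"tenant": "", "owner": "testuser", "bucket": "bucket1"},'
    ' "counters": {"used_bytes": 5242892, "num_objects": 2}}]}'
)
matches = usage_matches(stats_doc, dump_doc, "bucket1")
unknown = usage_matches(stats_doc, dump_doc, "bucket2")
```

A missing bucket is reported as a mismatch, so the check also fails when the refresh has not run yet.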

To run the automated tests, build and run the unit test binaries:

ninja unittest_rgw_usage_cache
./bin/unittest_rgw_usage_cache
ninja unittest_rgw_usage_perf_counters
./bin/unittest_rgw_usage_perf_counters

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@github-actions

github-actions bot commented Nov 3, 2025

Config Diff Tool Output

+ added: rgw_usage_cache_ttl (rgw.yaml.in)
+ added: rgw_usage_cache_max_size (rgw.yaml.in)
+ added: rgw_usage_stats_refresh_interval (rgw.yaml.in)
+ added: rgw_usage_cache_path (rgw.yaml.in)
+ added: rgw_enable_usage_perf_counters (rgw.yaml.in)

The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.
Ignore this comment if docs are already updated. To make the "Check ceph config changes" CI check pass, please comment /config check ok and re-run the test.

@harsimran-05 harsimran-05 force-pushed the usage-exporter-clean-v4 branch from 10eb215 to 296283a Compare November 3, 2025 12:00
@harsimran-05
Author

/config check ok

@harsimran-05
Author

jenkins test ceph config changes

@harsimran-05
Author

Jenkins retest this please

@harsimran-05
Author

jenkins retest this please

@harsimran-05 harsimran-05 force-pushed the usage-exporter-clean-v4 branch 2 times, most recently from 17a524d to c9e937f Compare November 4, 2025 05:13
@harsimran-05
Author

jenkins test api

@harsimran-05 harsimran-05 force-pushed the usage-exporter-clean-v4 branch 2 times, most recently from 48ccfdb to ee38e91 Compare November 7, 2025 09:04
@harsimran-05 harsimran-05 reopened this Nov 11, 2025
@harsimran-05
Author

jenkins retest this please

@harsimran-05
Author

jenkins retest this please

Contributor

@adamemerson adamemerson left a comment

Before you merge, please make sure the top line of every commit has a prefix like rgw: or rgw/something:

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Signed-off-by: Harsimran Singh <hsthukral51@gmail.com>
Also, increasing time to 20 mins and updating tests

Signed-off-by: Harsimran Singh <hsthukral51@gmail.com>
@harsimran-05 harsimran-05 force-pushed the usage-exporter-clean-v4 branch from 11dd98c to 9033dd4 Compare December 12, 2025 06:18
@harsimran-05 harsimran-05 force-pushed the usage-exporter-clean-v4 branch 5 times, most recently from e9c130e to 113681a Compare December 12, 2025 10:24
Signed-off-by: Harsimran Singh <hsthukral51@gmail.com>
@harsimran-05 harsimran-05 force-pushed the usage-exporter-clean-v4 branch from 113681a to 9742fbe Compare December 12, 2025 10:43
@anrao19
Contributor

anrao19 commented Feb 10, 2026

New build created; execution is in progress: https://tracker.ceph.com/issues/74079

@anrao19
Contributor

anrao19 commented Feb 23, 2026

Execution complete; tracker approved by @ivancich. Tracker details: https://tracker.ceph.com/issues/74079
@harsimran-05, if no further testing is needed, the PR can be merged.

@anrao19
Contributor

anrao19 commented Feb 23, 2026

jenkins test make check

@anrao19
Contributor

anrao19 commented Feb 23, 2026

jenkins test make check arm64

@anrao19
Contributor

anrao19 commented Mar 3, 2026

jenkins test make check

@anrao19
Contributor

anrao19 commented Mar 3, 2026

jenkins test make check arm64

Contributor

@cbodley cbodley left a comment

i don't think we want to merge this, as we're pursuing a different design based on prometheus pushgateway and #66671

@jmundack
Contributor

jmundack commented Mar 3, 2026

i don't think we want to merge this, as we're pursuing a different design based on prometheus pushgateway and #66671

@cbodley - is there some place public where these plans are made available? so that in the future we can avoid wasted/duplicate efforts

Signed-off-by: Harsimran Singh <hsthukral51@gmail.com>

7 participants