rgw: add usage exporter with LMDB-backed usage cache and perf counters #66109
harsimran-05 wants to merge 14 commits into ceph:main
Config Diff Tool Output
+ added: rgw_usage_cache_ttl (rgw.yaml.in)
+ added: rgw_usage_cache_max_size (rgw.yaml.in)
+ added: rgw_usage_stats_refresh_interval (rgw.yaml.in)
+ added: rgw_usage_cache_path (rgw.yaml.in)
+ added: rgw_enable_usage_perf_counters (rgw.yaml.in)
The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.
adamemerson left a comment:
Before you merge, please make sure the top line of every commit has a prefix like rgw: or rgw/something:
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
Signed-off-by: Harsimran Singh <hsthukral51@gmail.com>
Also, increasing time to 20 mins and updating tests Signed-off-by: Harsimran Singh <hsthukral51@gmail.com>
New build created and execution is in progress: https://tracker.ceph.com/issues/74079
Execution complete, tracker approved by @ivancich. Tracker detail: https://tracker.ceph.com/issues/74079
This PR introduces a low-overhead mechanism to track and expose per-user and per-bucket usage statistics (used bytes and object count) in Ceph RGW using:
-> Labeled PerfCounters via key_create(): metrics use Prometheus labels (tenant, owner, bucket) instead of embedding identifiers in metric names
-> LMDB, a lightweight embedded key-value store, for persistence across RGW restarts
-> A background refresh thread that periodically syncs statistics from RADOS (the distributed source of truth)
Design:
-> RADOS is the single source of truth for all usage statistics
-> Background thread reads from RADOS every rgw_usage_stats_refresh_interval (default: 5 minutes)
-> LMDB cache is used only for persistence across RGW restarts
-> No cache updates in the I/O path; it only marks users as "active" for background refresh
-> All RGWs in a cluster show consistent, accurate statistics
-> Owner label is extracted from bucket->get_info().owner via std::visit, supporting both rgw_user and rgw_account_id
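The split between the hot path and the background refresh can be modeled in a few lines. This is only an illustrative Python sketch, not the PR's C++ code: the class name UsageRefresher, the fetch_stats callable (standing in for a RADOS stats read), and the exported dict (standing in for the perf counters) are all assumptions made for the example.

```python
import threading
from typing import Callable, Dict, Set, Tuple

Stats = Tuple[int, int]  # (used_bytes, num_objects)


class UsageRefresher:
    """Toy model: the I/O path only marks users active; a background
    loop later pulls authoritative stats for those users."""

    def __init__(self, fetch_stats: Callable[[str], Stats],
                 interval_s: float = 300.0):
        self._fetch_stats = fetch_stats   # hypothetical stand-in for a RADOS read
        self._interval_s = interval_s     # plays the role of the refresh interval
        self._active: Set[str] = set()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self.exported: Dict[str, Stats] = {}  # stand-in for the perf counters

    def mark_user_active(self, user: str) -> None:
        # Hot path: a set insert under a lock, no stats computation.
        with self._lock:
            self._active.add(user)

    def refresh_once(self) -> None:
        # Snapshot the active set so the lock is not held during slow reads.
        with self._lock:
            users = sorted(self._active)
        for user in users:
            self.exported[user] = self._fetch_stats(user)

    def run(self) -> None:
        # Background loop: refresh, then wait for the interval or for stop().
        while not self._stop.is_set():
            self.refresh_once()
            self._stop.wait(self._interval_s)

    def stop(self) -> None:
        self._stop.set()


truth = {"testuser": (15728652, 3)}          # pretend source of truth
refresher = UsageRefresher(lambda user: truth[user])
refresher.mark_user_active("testuser")
before = dict(refresher.exported)            # nothing exported until a refresh runs
refresher.refresh_once()
after = dict(refresher.exported)
```

In a real daemon run() would execute on its own thread; the example calls refresh_once() directly so the behavior is deterministic. Whether the PR ever prunes the active set is not stated, so this sketch never does.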
Metric Format
Bucket metrics use labels for tenant, owner, and bucket name:
rgw_bucket_usage{tenant="", owner="testuser", bucket="bucket1"} used_bytes=5242892 num_objects=2
rgw_bucket_usage{tenant="", owner="testuser", bucket="bucket2"} used_bytes=10485760 num_objects=1
User metrics use an owner label:
rgw_user_usage{owner="testuser"} used_bytes=15728652 num_objects=3
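In the sample output above, the per-user figures equal the sum of that owner's per-bucket figures. The short Python check below confirms the arithmetic. The entry layout (a labels object and a counters object per entry) is assumed from the jq filters used later in this description, and totals_by_owner is a hypothetical helper; the PR itself reads user statistics from RADOS rather than summing buckets.

```python
# Sample values copied from the metric lines above.
bucket_usage = [
    {"labels": {"tenant": "", "owner": "testuser", "bucket": "bucket1"},
     "counters": {"used_bytes": 5242892, "num_objects": 2}},
    {"labels": {"tenant": "", "owner": "testuser", "bucket": "bucket2"},
     "counters": {"used_bytes": 10485760, "num_objects": 1}},
]


def totals_by_owner(entries):
    """Sum every counter per owner label (hypothetical helper)."""
    totals = {}
    for entry in entries:
        owner_totals = totals.setdefault(entry["labels"]["owner"], {})
        for name, value in entry["counters"].items():
            owner_totals[name] = owner_totals.get(name, 0) + value
    return totals


user_totals = totals_by_owner(bucket_usage)
```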
Key Components
UsagePerfCounters (rgw_usage_perf.h/cc)
Manages per-user PerfCounters for Prometheus export
Maintains set of "active" users for background refresh
Background refresh thread syncs from RADOS at configurable interval
Creates dynamic perf counters: rgw_user_ with used_bytes and num_objects
UsageCache (rgw_usage_cache.h/cc)
LMDB-backed persistent cache for usage statistics
Provides fast reads for perf counter updates
Survives RGW restarts (stats available immediately on startup)
Simplified counters: cache_updates, cache_size, cache_hits, cache_misses
I/O Path Integration (rgw_op.cc)
Minimal overhead: only calls mark_user_active() on PUT operations
No sync_owner_stats() calls in hot path
No cache calculations or updates during I/O
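To make the four cache health counters concrete, here is a small Python model of a cache that maintains them. It is a sketch under assumptions: it uses an in-memory dict rather than LMDB, treats cache_size as an entry count, and counts an expired entry as a miss. None of these semantics are confirmed by the PR text, and ToyUsageCache is a hypothetical name.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Stats = Tuple[int, int]  # (used_bytes, num_objects)


class ToyUsageCache:
    """In-memory stand-in for the LMDB-backed cache (assumed semantics)."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s      # plays the role of rgw_usage_cache_ttl
        self._clock = clock      # injectable so the example is deterministic
        self._entries: Dict[str, Tuple[Stats, float]] = {}
        self.counters = {"cache_updates": 0, "cache_size": 0,
                         "cache_hits": 0, "cache_misses": 0}

    def update(self, key: str, stats: Stats) -> None:
        self._entries[key] = (stats, self._clock())
        self.counters["cache_updates"] += 1
        self.counters["cache_size"] = len(self._entries)

    def get(self, key: str) -> Optional[Stats]:
        entry = self._entries.get(key)
        # Absent and expired entries are both counted as misses here.
        if entry is None or self._clock() - entry[1] > self._ttl_s:
            self.counters["cache_misses"] += 1
            return None
        self.counters["cache_hits"] += 1
        return entry[0]


now = [0.0]                                   # fake clock for the example
cache = ToyUsageCache(ttl_s=300.0, clock=lambda: now[0])
cache.update("testuser", (15728652, 3))
hit = cache.get("testuser")                   # fresh entry
now[0] = 301.0
expired = cache.get("testuser")               # older than the TTL
missing = cache.get("nobody")                 # never stored
```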
Configuration Options
PerfCounters Exposed
Per-Bucket Counters (rgw_bucket_usage)
-> Labels: tenant, owner, bucket
-> used_bytes: Total bytes stored in bucket
-> num_objects: Total objects in bucket
Per-User Counters (rgw_user_usage)
-> Labels: owner
-> used_bytes: Total bytes used by user
-> num_objects: Total objects owned by user
Cache Health Counters (rgw_usage_cache)
-> cache_updates, cache_size, cache_hits, cache_misses
Unit Tests
Added test_rgw_usage_exporter.cc under src/test/rgw:
Creates a temporary LMDB store
Writes and verifies usage stats for both users and buckets
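The shape of such a test (create a temporary store, write stats, reopen, verify) can be sketched as follows. This is not the PR's test: it is Python rather than C++, it uses the standard-library sqlite3 module as a stand-in for LMDB, and the key prefixes user: and bucket: are hypothetical.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def put_stats(path, key, used_bytes, num_objects):
    """Open the store, write one record, commit, and close."""
    with closing(sqlite3.connect(path)) as db:
        with db:  # commits on success, rolls back on error
            db.execute("CREATE TABLE IF NOT EXISTS usage "
                       "(key TEXT PRIMARY KEY, used_bytes INTEGER, num_objects INTEGER)")
            db.execute("INSERT OR REPLACE INTO usage VALUES (?, ?, ?)",
                       (key, used_bytes, num_objects))


def get_stats(path, key):
    """Reopen the store (as after a restart) and read one record, or None."""
    with closing(sqlite3.connect(path)) as db:
        return db.execute("SELECT used_bytes, num_objects FROM usage WHERE key = ?",
                          (key,)).fetchone()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "usage_cache.db")
    put_stats(path, "user:testuser", 15728652, 3)
    put_stats(path, "bucket:bucket1", 5242892, 2)
    # Each call opens a fresh connection, so these reads only succeed
    # if the writes were actually persisted to the file.
    user_row = get_stats(path, "user:testuser")
    bucket_row = get_stats(path, "bucket:bucket1")
    missing_row = get_stats(path, "bucket:nope")
```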
How to Test the Feature
Navigate to your Ceph build directory
cd /path/to/ceph/build
Kill any existing development cluster
../src/stop.sh
Start a minimal cluster with RGW
MON=1 OSD=3 MGR=1 RGW=1 ../src/vstart.sh -n -d --rgw_port 8000
Set up the usage counters feature
./bin/ceph -c ceph.conf config set client.rgw.8000 rgw_enable_usage_perf_counters true
./bin/ceph -c ceph.conf config set client.rgw.8000 rgw_usage_cache_path /tmp/usage_cache.mdb
./bin/ceph -c ceph.conf config set client.rgw.8000 rgw_usage_cache_max_size 1073741824
Restart RGW to apply settings
pkill radosgw
sleep 2
./bin/radosgw -c ceph.conf --log-file=out/radosgw.8000.log --admin-socket=out/radosgw.8000.asok --debug-rgw=20 -n client.rgw.8000 --rgw_frontends="beast port=8000" &
(Sometimes the prompt appears stuck; just press Enter and check whether RGW is running. If it is, continue.)
Create a user for testing
./bin/radosgw-admin -c ceph.conf user create --uid=testuser --display-name="Test User" --access-key=test --secret-key=test
Install s3cmd if needed
pip install s3cmd
Configure s3cmd
cat > ~/.s3cfg << EOF
[default]
access_key = test
secret_key = test
host_base = localhost:8000
host_bucket = localhost:8000
use_https = False
signature_v2 = True
EOF
Create test files
echo "Hello World" > small.txt
dd if=/dev/zero of=medium.bin bs=1M count=5
dd if=/dev/zero of=large.bin bs=1M count=10
Create buckets
s3cmd mb s3://bucket1
s3cmd mb s3://bucket2
Upload files
s3cmd put small.txt s3://bucket1/
s3cmd put medium.bin s3://bucket1/
s3cmd put large.bin s3://bucket2/
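The values to expect from these uploads can be worked out ahead of time. The Python arithmetic below assumes that echo appends a trailing newline (so small.txt is 12 bytes) and that dd's bs=1M means 1,048,576 bytes, as in GNU dd. The results match the figures shown under Metric Format.

```python
MIB = 1024 * 1024

# Sizes of the files created in the previous step.
file_sizes = {
    "small.txt": len(b"Hello World\n"),  # echo adds a newline: 12 bytes
    "medium.bin": 5 * MIB,               # dd bs=1M count=5
    "large.bin": 10 * MIB,               # dd bs=1M count=10
}

# Which file was uploaded to which bucket.
uploads = {
    "bucket1": ["small.txt", "medium.bin"],
    "bucket2": ["large.bin"],
}

expected = {
    bucket: {
        "used_bytes": sum(file_sizes[name] for name in files),
        "num_objects": len(files),
    }
    for bucket, files in uploads.items()
}

# Both buckets belong to testuser, so the user total is their sum.
expected_user_bytes = sum(b["used_bytes"] for b in expected.values())
expected_user_objects = sum(b["num_objects"] for b in expected.values())
```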
Set the admin socket path
SOCKET=out/radosgw.8000.asok
(You will have to wait for one rgw_usage_stats_refresh_interval before the updates appear.)
View bucket counters with labels
./bin/ceph --admin-daemon $SOCKET counter dump | jq '.rgw_bucket_usage'
View user counters with labels
./bin/ceph --admin-daemon $SOCKET counter dump | jq '.rgw_user_usage'
Query specific bucket
./bin/ceph --admin-daemon $SOCKET counter dump |
jq '.rgw_bucket_usage[] | select(.labels.bucket=="bucket1")'
Query specific user
./bin/ceph --admin-daemon $SOCKET counter dump |
jq '.rgw_user_usage[] | select(.labels.owner=="testuser")'
View bucket summary (bucket, owner, bytes)
./bin/ceph --admin-daemon $SOCKET counter dump |
jq '.rgw_bucket_usage[] | {bucket: .labels.bucket, owner: .labels.owner, bytes: .counters.used_bytes}'
Cross-check with radosgw-admin (values should match)
./bin/radosgw-admin -c ceph.conf bucket stats --bucket=bucket1 |
jq '.usage["rgw.main"] | {size, num_objects}'
./bin/ceph --admin-daemon $SOCKET counter dump |
jq '.rgw_bucket_usage[] | select(.labels.bucket=="bucket1") | .counters'
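The same cross-check can be scripted once both outputs are saved as JSON. The Python sketch below assumes the field layout implied by the jq filters above (labels and counters per entry in the counter dump, usage["rgw.main"] in the bucket stats) and treats used_bytes as corresponding to size. The sample JSON strings are illustrative data, not captured output, and both function names are hypothetical.

```python
import json


def bucket_counters(counter_dump: dict, bucket: str) -> dict:
    """Return the counters object for one bucket label, or raise KeyError."""
    for entry in counter_dump.get("rgw_bucket_usage", []):
        if entry["labels"]["bucket"] == bucket:
            return entry["counters"]
    raise KeyError(bucket)


def matches_bucket_stats(counter_dump: dict, bucket_stats: dict, bucket: str) -> bool:
    """Compare exported counters with the rgw.main usage from bucket stats."""
    counters = bucket_counters(counter_dump, bucket)
    main = bucket_stats["usage"]["rgw.main"]
    return (counters["used_bytes"], counters["num_objects"]) == \
           (main["size"], main["num_objects"])


# Illustrative inputs shaped like the two command outputs (assumed layout).
sample_dump = json.loads("""
{"rgw_bucket_usage": [
  {"labels": {"tenant": "", "owner": "testuser", "bucket": "bucket1"},
   "counters": {"used_bytes": 5242892, "num_objects": 2}}
]}
""")
sample_stats = json.loads('{"usage": {"rgw.main": {"size": 5242892, "num_objects": 2}}}')
stale_stats = json.loads('{"usage": {"rgw.main": {"size": 12, "num_objects": 1}}}')

in_sync = matches_bucket_stats(sample_dump, sample_stats, "bucket1")
out_of_sync = matches_bucket_stats(sample_dump, stale_stats, "bucket1")
```

A mismatch right after an upload is expected until the next background refresh has run.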
To run the automated tests, build and run the following:
ninja unittest_rgw_usage_cache
./bin/unittest_rgw_usage_cache
ninja unittest_rgw_usage_perf_counters
./bin/unittest_rgw_usage_perf_counters