feat: Use S3 node store with seaweedfs#3498

Merged
aldy505 merged 36 commits into master from byk/feat/s3-nodestore
Sep 13, 2025

Conversation

@BYK
Member

@BYK BYK commented Dec 31, 2024

Enables S3 node store using SeaweedFS and sentry-nodestore-s3 by @stayallive

This should alleviate all the issues stemming from (ab)using PostgreSQL as the node store.

@codecov

codecov bot commented Dec 31, 2024

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.49%. Comparing base (440b658) to head (95a4de2).
⚠️ Report is 1 commit behind head on master.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3498   +/-   ##
=======================================
  Coverage   99.49%   99.49%           
=======================================
  Files           3        3           
  Lines         197      197           
=======================================
  Hits          196      196           
  Misses          1        1           

☔ View full report in Codecov by Sentry.

@aldy505
Collaborator

aldy505 commented Dec 31, 2024

Any reason why you didn't use SeaweedFS per what you said yesterday?

BYK and others added 2 commits December 31, 2024 15:23
Co-authored-by: Reinaldy Rafli <aldy505@proton.me>
@BYK
Member Author

BYK commented Dec 31, 2024

@aldy505

Any reason why you didn't use SeaweedFS per what you said yesterday?

Well I started with that and realized 3 things:

  1. It really is not geared towards single-node setups and has nodes with different roles. This makes it more challenging to scale up or set up in our environment.
  2. It has a paid admin interface. Not a deal breaker, but it is clear that it is geared towards more "professional" setups.
  3. Its S3 API support is not really great.

Garage fits the bill much better as it is explicitly created for smaller setups like this, is easy to expand without specialized roles, doesn't have any paid components, and has much more decent and familiar S3 API support.

@doc-sheet
Contributor

It really is not geared towards single-node setups and has nodes with different roles. This makes it more challenging to scale up or set up in our environment.

when I tried seaweedfs last time (and I still use it for sourcemap/profile storage, tbh) it had single-node ability via the weed server command, like:

weed server -filer=true -s3=true -master=true -volume=true

Some of these flags are enabled by default.
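As a sketch only: a single-node SeaweedFS service along these lines could be expressed in compose terms (the image tag, ports, and volume name here are illustrative assumptions, not taken from this PR):

```yaml
# Hypothetical compose fragment: one SeaweedFS container running the
# master, volume server, filer, and S3 gateway in a single process.
services:
  seaweedfs:
    image: chrislusf/seaweedfs:3.89_large_disk  # tag is an assumption
    command: "server -dir=/data -filer -s3"     # master + volume run by default
    volumes:
      - seaweedfs-data:/data
    ports:
      - "8333:8333"  # S3 API port
volumes:
  seaweedfs-data:
```

The -filer and -s3 flags mirror the weed server invocation above; whether a given build enables them by default should be checked against the SeaweedFS docs.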

@doc-sheet
Contributor

doc-sheet commented May 25, 2025

I think garage/minio are simpler for small setups; seaweedfs looks necessary for mid to high-end setups because all the other services I know keep files as-is.

And thousands upon thousands of small files, like profiles, are not ideal to store on most popular filesystems, I guess.

@aldy505
Collaborator

aldy505 commented Jun 4, 2025

I think garage/minio are simpler for small setups; seaweedfs looks necessary for mid to high-end setups because all the other services I know keep files as-is.

@doc-sheet Hey, I'm going to work on this PR. I think seaweed is better for self-hosted Sentry. One thing I don't like about Garage is that we need to specify the storage allocation beforehand; if we set it to 100GB, there might be some people who have more data than 100GB, and I don't want that to cause any issues.

That said, since you said you've used seaweed before: How was your experience? How does it compare to MinIO or Ceph?

And thousands upon thousands of small files, like profiles, are not ideal to store on most popular filesystems, I guess.

Yeah, if we set up an object storage, we might as well move filestore & profiles there too. But let's focus on nodestore first.

@doc-sheet
Contributor

How was your experience? How does it compare to MinIO or Ceph?

It is a bit strange sometimes. But it is fine.

It has multiple options for the filer store.
I didn't try leveldb storage, aiming for fault tolerance.

At first I tried redis; it worked for several months and then... I just lost all data.
It was there physically but wasn't available from the API (s3 or web) - each list call returned different results.

I don't know if the issue was in redis or weed itself. I suspect a bug with ttl could be the reason too.

But after that incident I wiped the cluster and started a new one with scylla as the filer backend, and it has worked fine for almost a year already despite that ttl bug.

Seaweedfs has multiple versions, like

  • 3.89
  • 3.89_full
  • 3.89_large_disk
  • 3.89_large_disk_full

I suggest always using large_disk. The documentation is not clear, but it is easy to reach that limit:
https://github.com/seaweedfs/seaweedfs/wiki/FAQ#how-to-configure-volumes-larger-than-30gb

I don't know the difference between full and normal, and just use _large_disk_full builds :)

Also, I don't use s3 auth - I was too lazy to set it up.

Other than all that, I have no problems and have barely touched it after the initial setup. It just works.
I've added some volumes but not removed any yet.

As for minio and ceph:
I never used ceph.

But minio was the reason to look for alternatives.

Tons of profiles from the js-sdk, stored as separate files, started to affect my monitoring script, and soon it might start to affect minio performance too.

And it is not that easy to scale minio. And it's probably impossible to optimize it for small-file storage. At least in my low-cost setup.

@doc-sheet
Contributor

doc-sheet commented Jun 4, 2025

let's focus on nodestore first.

If seaweedfs is to control the ttl, then there is another catch:
I'm not sure if it is possible to control the ttl with the s3-api yet.

weed has its own settings for collections, and it creates a collection for each s3-bucket.
https://github.com/seaweedfs/seaweedfs/wiki/S3-API-FAQ#setting-ttl

But if sentry itself cleans up old data, I guess there is no difference.

@aldy505
Collaborator

aldy505 commented Jun 5, 2025

How was your experience? How does it compare to MinIO or Ceph?

It is a bit strange sometimes. But it is fine.

It has multiple options for the filer store. I didn't try leveldb storage, aiming for fault tolerance.

At first I tried redis; it worked for several months and then... I just lost all data. It was there physically but wasn't available from the API (s3 or web) - each list call returned different results.

I don't know if the issue was in redis or weed itself. I suspect a bug with ttl could be the reason too.

But after that incident I wiped the cluster and started a new one with scylla as the filer backend, and it has worked fine for almost a year already despite that ttl bug.

Seaweedfs has multiple versions, like

  • 3.89
  • 3.89_full
  • 3.89_large_disk
  • 3.89_large_disk_full

I suggest always using large_disk. The documentation is not clear, but it is easy to reach that limit: https://github.com/seaweedfs/seaweedfs/wiki/FAQ#how-to-configure-volumes-larger-than-30gb

I don't know the difference between full and normal, and just use _large_disk_full builds :)

Also, I don't use s3 auth - I was too lazy to set it up.

Other than all that, I have no problems and have barely touched it after the initial setup. It just works. I've added some volumes but not removed any yet.

Good to know about Seaweed

As for minio and ceph: I never used ceph.

But minio was the reason to look for alternatives.

Tons of profiles from the js-sdk, stored as separate files, started to affect my monitoring script, and soon it might start to affect minio performance too.

And it is not that easy to scale minio. And it's probably impossible to optimize it for small-file storage. At least in my low-cost setup.

Ah, so everyone has the same experience with minio.

let's focus on nodestore first.

If seaweedfs is to control the ttl, then there is another catch: I'm not sure if it is possible to control the ttl with the s3-api yet.

weed has its own settings for collections, and it creates a collection for each s3-bucket. https://github.com/seaweedfs/seaweedfs/wiki/S3-API-FAQ#setting-ttl

But if sentry itself cleans up old data, I guess there is no difference.

The sentry cleanup job only cleans up the one on the filesystem. If we're using S3, it won't clean up anything; we need to configure S3 data cleanup on our own.

@doc-sheet
Contributor

doc-sheet commented Jun 6, 2025

Looks like I missed that seaweedfs now has the ability to control the ttl with the s3 api. And I even linked to the correct section of the FAQ. :)

I'd like to look into a new integration with seaweedfs.

And by the way, I like the idea of expanding the sentry images.

I myself install some extra packages and modules.

Maybe add an extra step in the install to build user-provided Dockerfiles.

@aldy505
Collaborator

aldy505 commented Sep 6, 2025

Config migration is done and is behind a prompt/flag. Next up is to think about how to manage retention. Should we do it with a cron/scheduled job (cons: probably a heavy process; pros: can be configured dynamically), or an S3 lifecycle policy (cons: can't be configured dynamically; pros: won't be a heavy process)?
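For the S3 lifecycle route, the mechanics could be sketched like this (SENTRY_EVENT_RETENTION_DAYS is the variable used elsewhere in this PR; the file path, rule ID, and default value are illustrative):

```shell
#!/usr/bin/env bash
# Sketch: render an S3 lifecycle policy whose expiration follows
# SENTRY_EVENT_RETENTION_DAYS, then sanity-check the rendered file.
set -euo pipefail

SENTRY_EVENT_RETENTION_DAYS="${SENTRY_EVENT_RETENTION_DAYS:-90}"

cat > /tmp/nodestore-lifecycle-policy.xml <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<LifecycleConfiguration>
  <Rule>
    <ID>Sentry-Nodestore-Rule</ID>
    <Status>Enabled</Status>
    <Filter></Filter>
    <Expiration>
      <Days>${SENTRY_EVENT_RETENTION_DAYS}</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>
EOF

# Fail loudly if the retention value did not make it into the policy.
grep -q "<Days>${SENTRY_EVENT_RETENTION_DAYS}</Days>" /tmp/nodestore-lifecycle-policy.xml
```

Applying it would then be a single s3cmd setlifecycle call at install time, which is the "can't be configured dynamically" trade-off noted above.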

@aldy505
Collaborator

aldy505 commented Sep 9, 2025

I'm just gonna go forward with this: https://github.com/seaweedfs/seaweedfs/wiki/S3-API-FAQ#setting-ttl

@aldy505
Collaborator

aldy505 commented Sep 9, 2025

Great, I believe that's all.

@aldy505
Collaborator

aldy505 commented Sep 9, 2025

@aminvakil @doc-sheet Hi, would you mind reviewing this PR?

Collaborator

@aldy505 aldy505 left a comment

Unblocking myself.

Member Author

@BYK BYK left a comment

Arguing with myself 😝 (comments for @aldy505 )

@@ -0,0 +1,90 @@
echo "${_group}Bootstrapping seaweedfs (node store)..."

$dc up --wait seaweedfs postgres
Member Author

Does the --wait thing work for podman?

Collaborator

Good point. Will check later.


bucket_list=$($s3cmd --access_key=sentry --secret_key=sentry --no-ssl --region=us-east-1 --host=localhost:8333 --host-bucket='localhost:8333/%(bucket)' ls)

if [[ $(echo "$bucket_list" | tail -1 | awk '{print $3}') != 's3://nodestore' ]]; then
Member Author

This if condition really needs some explanation about what it is checking and how -- like what are the assumptions here?
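To make the assumptions explicit, the check could scan the whole listing for the bucket instead of relying on the last line's column layout; a self-contained sketch (the canned sample listing stands in for a live s3cmd ls call):

```shell
# Hypothetical rewrite of the bucket check: look for the bucket URI
# anywhere in the `s3cmd ls` output rather than only on the last line.
# $bucket_list here is canned sample output, not a live call.
bucket_list='2024-01-01 00:00  s3://filestore
2024-01-02 00:00  s3://nodestore'

bucket_exists() {
  # $1: bucket URI, e.g. s3://nodestore; column 3 of `s3cmd ls` is the URI.
  printf '%s\n' "$bucket_list" | awk '{print $3}' | grep -qx "$1"
}

if ! bucket_exists "s3://nodestore"; then
  echo "nodestore bucket missing; would create it here"
fi
```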

# Other backend implementations for node storage developed by the community
# are available in public GitHub repositories.

SENTRY_NODESTORE = "sentry_nodestore_s3.S3PassthroughDjangoNodeStorage"
Member Author

Do we need Passthrough for fresh installs?

Collaborator

This is your original code, so I assume, yes.

Member Author

I would not assume anything about my original code 😅
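For context, the wiring in sentry.conf.py looks roughly like this. The SENTRY_NODESTORE value and the option keys are taken from this PR's diff; SENTRY_NODESTORE_OPTIONS as the carrying setting is an assumption to be checked against sentry-nodestore-s3's README:

```python
# Sketch of the sentry.conf.py nodestore wiring. Option keys mirror the
# diff in this PR; treat this as illustrative, not authoritative.
SENTRY_NODESTORE = "sentry_nodestore_s3.S3PassthroughDjangoNodeStorage"
SENTRY_NODESTORE_OPTIONS = {
    "endpoint_url": "http://seaweedfs:8333",
    "bucket_path": "nodestore",
    "bucket_name": "nodestore",
    "region_name": "us-east-1",  # required by the client even off-AWS
}
```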

read -p "y or n? " yn
case $yn in
y | yes | 1)
export APPLY_AUTOMATIC_CONFIG_UPDATES=1
Member Author

This feels dangerous: once you export it like this, wouldn't it also enable auto-update for pgbouncer? I think that should be a separate question and decision.

Collaborator

Ah good point.

n | no | 0)
export APPLY_AUTOMATIC_CONFIG_UPDATES=0
echo
echo -n "Alright, you will need to update your sentry.conf.py file manually before running 'docker compose up'."
Member Author

Again, provide details and a link to this PR (or some doc). Also, skip setting up seaweed at this point, or at least mention that the service will be running for no good reason?


if [[ "$APPLY_AUTOMATIC_CONFIG_UPDATES" == 1 ]]; then
nodestore_config=$(sed -n '/SENTRY_NODESTORE/,/[}]/{p}' sentry/sentry.conf.example.py)
if [[ $($dc exec postgres psql -qAt -U postgres -c "select exists (select * from nodestore_node limit 1)") = "f" ]]; then
Member Author

Up until this point you don't need the postgres service so maybe instead of bringing it up at the beginning, just use $dcr postgres psql ... here?

Comment on lines +62 to +63
$dc exec seaweedfs mkdir -p /data/idx/
$s3cmd --access_key=sentry --secret_key=sentry --no-ssl --region=us-east-1 --host=localhost:8333 --host-bucket='localhost:8333/%(bucket)' mb s3://nodestore
Member Author

Why are these commands repeated (from lines 6 and 7 above)?

Comment on lines +65 to +81
# XXX(aldy505): Should we refactor this?
lifecycle_policy=$(
cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<LifecycleConfiguration>
<Rule>
<ID>Sentry-Nodestore-Rule</ID>
<Status>Enabled</Status>
<Filter></Filter>
<Expiration>
<Days>$SENTRY_EVENT_RETENTION_DAYS</Days>
</Expiration>
</Rule>
</LifecycleConfiguration>
EOF
)
$dc exec seaweedfs sh -c "printf '%s' '$lifecycle_policy' > /tmp/nodestore-lifecycle-policy.xml"
Member Author

Yes, make it a config file and mount it? Is this because of the $SENTRY_EVENT_RETENTION_DAYS variable? If yes, we may want to construct a custom entrypoint script for seaweed so that when people change this value and restart, it is updated without running ./install.sh again?

Collaborator

Restarting seaweed wouldn't put anything into effect. You will need to execute s3cmd setlifecycle for that.

Member Author

And I'm saying we should put whatever is necessary into a custom entrypoint script for seaweed so it picks the changes up?

Collaborator

I don't like that approach, it complicates stuff.

Member Author

Okay, until someone complains let's go with the simpler approach :)

Comment on lines +84 to +85
echo "Making sure the bucket lifecycle policy is all set up correctly..."
$s3cmd --access_key=sentry --secret_key=sentry --no-ssl --region=us-east-1 --host=localhost:8333 --host-bucket='localhost:8333/%(bucket)' getlifecycle s3://nodestore
Member Author

This seems like it belongs in tests instead of here? Is this because of the unknown env variable value?

Collaborator

@aminvakil aminvakil left a comment

Can we make it opt-in for a release? And get some feedback regarding this change?

@aldy505
Collaborator

aldy505 commented Sep 12, 2025

Can we make it opt-in for a release? And get some feedback regarding this change?

@aminvakil You're saying this shouldn't be the default, and we'd probably switch it to the default at a later stage? That works for me.

@BYK
Member Author

BYK commented Sep 12, 2025

I'd say default for fresh installs. And since we prompt people for existing installs, it should be fine?

@aminvakil
Collaborator

@aldy505 Yes.

@BYK Yes.

I have zero knowledge about seaweedfs and very limited knowledge about S3 storage, therefore I'd suggest merging this after the 15th, so people who use the main branch can test it in the different environments they have.

But again, that's because of my limited knowledge of S3 storage, and I do not know how big this change is.

EOF
)
$dc exec seaweedfs sh -c "printf '%s' '$lifecycle_policy' > /tmp/nodestore-lifecycle-policy.xml"
$s3cmd --access_key=sentry --secret_key=sentry --no-ssl --region=us-east-1 --host=localhost:8333 --host-bucket='localhost:8333/%(bucket)' setlifecycle /tmp/nodestore-lifecycle-policy.xml s3://nodestore
Collaborator

What does --region=us-east-1 mean here?

Collaborator

It's the default region name for AWS S3. It doesn't mean anything for self-hosted, but we can only put in known region values.

"endpoint_url": "http://seaweedfs:8333",
"bucket_path": "nodestore",
"bucket_name": "nodestore",
"region_name": "us-east-1",
Collaborator

Is region_name mandatory?

Collaborator

Yep. Otherwise it'll complain about a missing region name.

@aldy505
Collaborator

aldy505 commented Sep 12, 2025

@aldy505 Yes.

@BYK Yes.

I have zero knowledge about seaweedfs and very limited knowledge about S3 storage, therefore I'd suggest merging this after the 15th, so people who use the main branch can test it in the different environments they have.

But again, that's because of my limited knowledge of S3 storage, and I do not know how big this change is.

Great! I'll change it later tonight.

@aldy505
Collaborator

aldy505 commented Sep 12, 2025

Uh, I mean let's do what @BYK said. I still prefer to merge all this before the 15th.

@aldy505 aldy505 merged commit 84f904f into master Sep 13, 2025
16 checks passed
@aldy505 aldy505 deleted the byk/feat/s3-nodestore branch September 13, 2025 00:48
xprotocol-bot added a commit to xprotocol-org/self-hosted that referenced this pull request Sep 18, 2025
xprotocol-bot added a commit to xprotocol-org/self-hosted that referenced this pull request Sep 24, 2025
@github-actions github-actions bot locked and limited conversation to collaborators Sep 28, 2025


Development

Successfully merging this pull request may close these issues.

  • Postgres nodestore_node is huge
  • Cleaning nodestore_node table

5 participants