Add support for external datastore as source of historical ledger data #437

Merged
urvisavla merged 16 commits into stellar:protocol-23 from urvisavla:getledgers-from-datastore on Jun 3, 2025

@urvisavla urvisavla commented May 16, 2025

What

This PR addresses #425: it adds support for getLedgers requests that fall outside RPC's local ledger retention window by proxying out-of-range requests to the datastore.

  • Adds support for serving historical ledgers from a datastore when the requested range isn’t available locally.
  • Uses BufferedStorageBackend to read from the datastore.
  • The gen-config-file command now includes default datastore_config and buffered_storage_backend_config sections (commented out by default).
# Buffered storage backend configuration for reading ledgers from the datastore.
# [buffered_storage_backend_config]
  # buffer_size = 100
  # num_workers = 10
  # retry_limit = 3
  # retry_wait = "30s"

# External datastore configuration including type, bucket name and schema.
# [datastore_config]
  # type = "GCS"

  # [datastore_config.params]
    # destination_bucket_path = "path_to_bucket"

  # [datastore_config.schema]
    # files_per_partition = 64000
    # ledgers_per_file = 1
  • Adds a new flag serve_ledger_from_datastore. The flag is used instead of relying on the presence of a datastore config because the same datastore config may in future also be used for ingestion from a datastore, so the flag gives explicit control over which behavior is enabled.

  • Includes basic unit tests.
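
The fallback described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `ledgerRange`, `source`, and `chooseSource` are hypothetical names, and the sketch assumes that only requests older than the local retention window are proxied to the datastore.

```go
package main

import "fmt"

// ledgerRange is an inclusive range of ledger sequence numbers (hypothetical type).
type ledgerRange struct {
	First, Last uint32
}

// source names where a request would be served from (hypothetical type).
type source string

const (
	sourceLocal     source = "local"
	sourceDatastore source = "datastore"
	sourceNone      source = "none"
)

// chooseSource decides where to read the ledger at sequence start from.
// local is the RPC retention window; datastoreEnabled stands in for the
// new flag that turns on serving from the datastore.
func chooseSource(start uint32, local ledgerRange, datastoreEnabled bool) source {
	if start >= local.First && start <= local.Last {
		return sourceLocal
	}
	// Assumption: only ledgers older than the local window are proxied.
	if start < local.First && datastoreEnabled {
		return sourceDatastore
	}
	return sourceNone
}

func main() {
	local := ledgerRange{First: 1000, Last: 2000}
	fmt.Println(chooseSource(1500, local, true))  // inside the local window
	fmt.Println(chooseSource(10, local, true))    // older than the window, flag on
	fmt.Println(chooseSource(10, local, false))   // older than the window, flag off
	fmt.Println(chooseSource(3000, local, true))  // newer than anything ingested
}
```

Keeping the decision in one small pure function makes it easy to unit test independently of the storage backend.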

Why

#425

Known limitations

  1. Currently, this assumes that the datastore has all ledgers starting from genesis, because there is not yet a way to query the datastore for the range of ledgers it actually holds. That limitation will be fixed once stellar/go#5498 is done. We will then also use the real range to fill in the latestLedger and oldestLedger fields more accurately.

  2. The datastore_config and buffered_storage_backend_config can only be set through the config file and not with command-line flags. That’s because these settings can be complex or different depending on the type of datastore being used (like S3 or GCS). Supporting them in the config file keeps things simpler and more flexible.
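
To make the nesting of these file-only settings concrete, here is a rough Go mirror of the two config sections with the default values from the generated config above. The struct and function names are hypothetical and are not the types used by the RPC or the datastore library; only `time.ParseDuration` is a real standard-library call.

```go
package main

import (
	"fmt"
	"time"
)

// bufferedStorageBackendConfig is a hypothetical mirror of the
// [buffered_storage_backend_config] section.
type bufferedStorageBackendConfig struct {
	BufferSize uint32
	NumWorkers uint32
	RetryLimit uint32
	RetryWait  time.Duration
}

// datastoreSchema is a hypothetical mirror of [datastore_config.schema].
type datastoreSchema struct {
	FilesPerPartition uint32
	LedgersPerFile    uint32
}

// datastoreConfig is a hypothetical mirror of [datastore_config].
// Params is a free-form map because its keys depend on the datastore type.
type datastoreConfig struct {
	Type   string
	Params map[string]string
	Schema datastoreSchema
}

// defaultBackendConfig returns the defaults shown in the generated config.
func defaultBackendConfig() (bufferedStorageBackendConfig, error) {
	wait, err := time.ParseDuration("30s")
	if err != nil {
		return bufferedStorageBackendConfig{}, err
	}
	return bufferedStorageBackendConfig{
		BufferSize: 100,
		NumWorkers: 10,
		RetryLimit: 3,
		RetryWait:  wait,
	}, nil
}

// defaultDatastoreConfig returns the defaults shown in the generated config.
func defaultDatastoreConfig() datastoreConfig {
	return datastoreConfig{
		Type:   "GCS",
		Params: map[string]string{"destination_bucket_path": "path_to_bucket"},
		Schema: datastoreSchema{FilesPerPartition: 64000, LedgersPerFile: 1},
	}
}

func main() {
	backend, err := defaultBackendConfig()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", backend)
	fmt.Printf("%+v\n", defaultDatastoreConfig())
}
```

The free-form `Params` map hints at why a flat set of command-line flags fits poorly here: the keys vary by datastore type.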

@urvisavla urvisavla force-pushed the getledgers-from-datastore branch 2 times, most recently from 9f70442 to 83c48d4 Compare May 16, 2025 20:57

@Shaptic Shaptic left a comment

Just a lil' drive by review 🚗

DefaultValue: 5 * time.Second,
},
{
Name: "serve-ledgers-from-datastore",

Isn't this implied by the presence of a datastore config?

@urvisavla (Contributor Author) replied:

reason

Adds a new flag serve_ledger_from_datastore. The flag is used instead of relying on the presence of a datastore config because the same datastore config may in future also be used for ingestion from a datastore, so the flag gives explicit control over which behavior is enabled.
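
One way to read this reply: the flag and the datastore config are separate inputs, and serving from the datastore needs both. The sketch below illustrates that relationship under stated assumptions. The `config` struct, its fields, and the validation rule are hypothetical and are not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// config holds only the two inputs relevant to this discussion (hypothetical).
type config struct {
	ServeLedgersFromDatastore bool // the explicit opt-in flag
	DatastoreConfigured       bool // whether a [datastore_config] section is present
}

var errMissingDatastore = errors.New(
	"serve-ledgers-from-datastore is enabled but no datastore config was provided")

// validate rejects the one contradictory combination: opting in to serving
// from the datastore without describing a datastore.
func validate(c config) error {
	if c.ServeLedgersFromDatastore && !c.DatastoreConfigured {
		return errMissingDatastore
	}
	return nil
}

// servesFromDatastore is true only when the flag is set and a datastore is
// configured. A datastore config alone does not enable serving, which leaves
// room for other future uses of the same config, such as ingestion.
func servesFromDatastore(c config) bool {
	return c.ServeLedgersFromDatastore && c.DatastoreConfigured
}

func main() {
	cases := []config{
		{ServeLedgersFromDatastore: false, DatastoreConfigured: true},
		{ServeLedgersFromDatastore: true, DatastoreConfigured: true},
		{ServeLedgersFromDatastore: true, DatastoreConfigured: false},
	}
	for _, c := range cases {
		fmt.Printf("%+v serves=%v err=%v\n", c, servesFromDatastore(c), validate(c))
	}
}
```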

@urvisavla urvisavla force-pushed the getledgers-from-datastore branch 27 times, most recently from 84e5f50 to 33ab837 Compare May 18, 2025 20:03
@urvisavla urvisavla force-pushed the getledgers-from-datastore branch 2 times, most recently from 16e5297 to abba4e4 Compare May 26, 2025 23:28
@urvisavla urvisavla requested review from 2opremio and tamirms May 27, 2025 21:39
return protocol.GetLedgersResponse{
Ledgers: ledgers,
// TODO: update these fields using ledger range from datastore

Contributor

I think we should consider using separate fields for the ledger range belonging to the data store. I think it would be useful to know which ledgers are available directly from rpc and which ledgers require access to the data store.

Contributor

Can you describe a client use case which benefits from the breakout of ledgers?

This pushes forward an internal aspect of RPC. I totally understand it given the possible difference in meta content mentioned, but it seems to introduce additional work for clients to handle and possibly some confusion.

Ideally, the outcome of RPC sourcing ledger metadata from other backends should be transparent at the client API layer, i.e. data provided by RPC is from the network, nothing more?

One thought: can the potential difference in ledger close meta between RPC and datastores be mitigated if Galexie were to always enable all captive core flags that affect ledger close meta? That way a datastore would always provide the superset of meta, avoiding missing-data cases when interleaved with meta from any other source.
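
For reference, the "separate fields" idea raised at the top of this thread might look roughly like the sketch below. Everything here is hypothetical: the type names and the JSON keys `localRange` and `datastoreRange` are invented for illustration and are not part of the RPC protocol or of this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ledgerRange is an inclusive range of ledger sequences (hypothetical type).
type ledgerRange struct {
	Oldest uint32 `json:"oldestLedger"`
	Latest uint32 `json:"latestLedger"`
}

// rangeInfo reports the locally retained range and, when known, the range
// available from the datastore as a separate optional field (hypothetical).
type rangeInfo struct {
	Local     ledgerRange  `json:"localRange"`
	Datastore *ledgerRange `json:"datastoreRange,omitempty"`
}

// sourceOf tells a caller which backend would serve a given ledger.
func (r rangeInfo) sourceOf(seq uint32) string {
	if seq >= r.Local.Oldest && seq <= r.Local.Latest {
		return "local"
	}
	if r.Datastore != nil && seq >= r.Datastore.Oldest && seq <= r.Datastore.Latest {
		return "datastore"
	}
	return "unavailable"
}

func main() {
	info := rangeInfo{
		Local:     ledgerRange{Oldest: 1000, Latest: 2000},
		Datastore: &ledgerRange{Oldest: 2, Latest: 999},
	}
	out, err := json.Marshal(info)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	fmt.Println(info.sourceOf(500), info.sourceOf(1500), info.sourceOf(5000))
}
```

Making the datastore range an optional pointer keeps existing clients unaffected when no datastore is configured, which speaks to part of the concern about extra client work, though not to the question of whether clients should see the split at all.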

tamirms commented May 28, 2025

Now that there are several stellar-core flags (diagnostic events, classic cap-67 events, etc) which affect the contents of ledger close meta, I think we should consider the scenario where ledgers in the datastore are inconsistent with the ledgers served locally from rpc.

This issue can be addressed separately from this PR, but I think we'll need to introduce some way to determine which features are enabled in the ledgers found in the data lake vs. the RPC captive core. https://github.com/stellar/go/issues/5507 could be relevant.

@tamirms tamirms left a comment

there are 2 issues that imo warrant further design discussions but I think those can be addressed separately from this PR

@mollykarcher (Contributor) commented, quoting @tamirms:

Now that there are several stellar-core flags (diagnostic events, classic cap-67 events, etc) which affect the contents of ledger close meta, I think we should consider the scenario where ledgers in the datastore are inconsistent with the ledgers served locally from rpc.

This is a really good point. Another way to solve this might be that, if you have a datastore configured as your backend, RPC's ledger retention window gets automatically set to 0, so RPC is effectively just acting as a lightweight facade/API layer on top of the data lake. I think I would be more comfortable with doing that if getTransactions/getEvents were also being served from the data lake, though. There is precedent in other ecosystems (at least Ethereum) for some endpoints not working when a node is running in archival mode.

@urvisavla urvisavla force-pushed the getledgers-from-datastore branch from 2f0397a to 6b62f20 Compare June 2, 2025 06:16
@urvisavla (Contributor Author) commented:

I'm going to merge the PR but I've tried to capture all the open items from the PR in this issue #453. Please add anything I might have missed.

@urvisavla urvisavla merged commit aa8fc36 into stellar:protocol-23 Jun 3, 2025
10 of 13 checks passed
@urvisavla urvisavla deleted the getledgers-from-datastore branch June 3, 2025 05:23