Automated migration of configuration and data format changes #8889

@creachadair

Abstract

This issue documents some background information and discussion about how to support automatic migration of configuration file changes and data storage changes across different versions of Tendermint.

Configuration Changes

New releases of Tendermint often add, remove, or change configuration settings defined in the consensus node's config.toml file. When a network upgrades from one Tendermint release to the next, node operators may need to modify their existing configuration files to account for these changes:

  • For newly-added settings, the node provides sensible default values. The operator should not need to do anything for a new setting, unless they specifically want a value different from the default. For example: In v0.35, a new mode setting was added that defaults to "full".

  • Settings are removed when they belong to features that are no longer present in Tendermint. The operator should consider removing these from their config file, but is not required to do so: The node's configuration parser currently ignores "unknown" settings keys. For example: In v0.36, the rpc.grpc-laddr setting was removed.

Other kinds of configuration changes may require a more careful inspection:

  • Change of type: If the value for an existing configuration setting changes type, the operator will have to update the setting to use a new syntax. For example: In v0.35, the tx-index.indexer setting was changed from a "string" to an ["array", "of", "strings"].

  • Change of location and/or name: If an existing configuration setting is moved or renamed within the configuration file, the operator may have to update their file. For example: In v0.35, the top-level fast-sync setting was moved and renamed to blocksync.enable.

    • If the node was previously using the default value for the setting, it's possible the operator does not need to do anything. For ongoing maintenance, however, it would be wise for operators to update their files for these changes even if the effect is a no-op, to avoid confusion if they later need to modify the value.

    • If the node was not using the default value, or if the default value changes, the operator must update the configuration file, or else the value will implicitly revert to the (new) default.
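
As a rough illustration of the two "careful inspection" cases above, here is a minimal Go sketch that applies the v0.35 rename (fast-sync to blocksync.enable) and the tx-index.indexer type change. It assumes the config file has already been parsed into nested maps; the Config type and the table and migrateV035 helpers are hypothetical names for this example, not part of Tendermint, and a real tool would also have to preserve comments and formatting when writing the file back.

```go
package main

import "fmt"

// Config is a hypothetical in-memory form of a parsed config.toml:
// keys map to values, and TOML tables map to nested Configs.
type Config = map[string]any

// table returns the nested table called name, creating it if it is absent.
func table(cfg Config, name string) (Config, error) {
	v, ok := cfg[name]
	if !ok {
		t := Config{}
		cfg[name] = t
		return t, nil
	}
	t, ok := v.(Config)
	if !ok {
		return nil, fmt.Errorf("%q is a %T, not a table", name, v)
	}
	return t, nil
}

// migrateV035 applies, in place, the two v0.35 changes used as examples.
func migrateV035(cfg Config) error {
	// Change of location and name: fast-sync -> blocksync.enable.
	if v, ok := cfg["fast-sync"]; ok {
		bs, err := table(cfg, "blocksync")
		if err != nil {
			return err
		}
		if _, exists := bs["enable"]; exists {
			return fmt.Errorf("both fast-sync and blocksync.enable are set")
		}
		bs["enable"] = v
		delete(cfg, "fast-sync")
	}

	// Change of type: tx-index.indexer from a string to an array of strings.
	if ti, ok := cfg["tx-index"].(Config); ok {
		switch v := ti["indexer"].(type) {
		case string:
			ti["indexer"] = []string{v}
		case []string, nil:
			// Already in the new form, or unset (the default applies).
		default:
			return fmt.Errorf("tx-index.indexer has unexpected type %T", v)
		}
	}
	return nil
}

func main() {
	cfg := Config{
		"fast-sync": true,
		"tx-index":  Config{"indexer": "kv"},
	}
	if err := migrateV035(cfg); err != nil {
		fmt.Println("migration failed:", err)
		return
	}
	fmt.Println(cfg)
}
```

Note that a setting the operator never wrote down is left alone here: the sketch only rewrites what is present in the file, so a node relying on defaults picks up whatever the new default is.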

Data Format Changes

New releases of Tendermint may also change the format of data stored in the node's on-disk databases. The most prominent example of this is #4567, which changed the format of all the keys in the key-value tables to use binary order-preserving keys (via the orderedcode package) in place of the original per-type custom string formats from previous versions. The changes to implement that issue also consolidated some per-height records to be singleton (per-node) records instead, as a way of saving space.

Any change to the keys or values in the node's databases requires us to migrate the existing data to the new format. There are some specific technical concerns for any data migration:

  1. The databases for a production node may be very large, and a node installation may not have enough storage to implement a "clean" migration, translating the existing data to a new copy of the database.

  2. Migration may be time-consuming for a large node, and a migration tool could fail or be interrupted partway through the process. Any automation for migration needs to be resilient to such failures, so that an interrupted run can be resumed or retried without losing data.

  3. Depending on what the format changes are, a node may be unable to run at all (or may behave incorrectly) using a partially-migrated database. For example, in the case of the key format change described above, if a migration fails partway through, neither the old software nor the new software may be able to correctly use the data (even if the node does not crash, neither version will see a correct view).
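
One way to address these three concerns together is sketched below in Go, under loud assumptions: the Store type is an in-memory stand-in for the node's database, and the "H:" and "h/" key formats, the schema-version marker, and the maxSteps parameter (which only simulates an interruption) are all hypothetical. The sketch rewrites keys in place rather than copying the database, can be re-run after an interruption, and only marks the store as usable once no legacy keys remain.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Store is a hypothetical stand-in for a node's key-value database.
type Store map[string]string

const (
	oldPrefix  = "H:" // hypothetical legacy key format: "H:<decimal height>"
	newPrefix  = "h/" // hypothetical new key format: "h/<16 hex digits>"
	versionKey = "schema-version"
)

var errInterrupted = errors.New("migration interrupted")

// newKey converts a legacy key to the fixed-width format, which sorts
// lexicographically in height order.
func newKey(old string) (string, error) {
	h, err := strconv.ParseUint(strings.TrimPrefix(old, oldPrefix), 10, 64)
	if err != nil {
		return "", fmt.Errorf("unrecognized key %q: %w", old, err)
	}
	return fmt.Sprintf("%s%016x", newPrefix, h), nil
}

// migrate rewrites legacy keys in place, one record at a time. It is safe
// to re-run: converted keys no longer match oldPrefix and are skipped.
// A non-negative maxSteps stops after that many records, to simulate a crash.
func migrate(db Store, maxSteps int) error {
	var pending []string
	for k := range db {
		if strings.HasPrefix(k, oldPrefix) {
			pending = append(pending, k)
		}
	}
	sort.Strings(pending) // deterministic order, for the demonstration only

	for i, old := range pending {
		if maxSteps >= 0 && i >= maxSteps {
			return errInterrupted
		}
		nk, err := newKey(old)
		if err != nil {
			return err // this record is untouched; the marker stays unset
		}
		db[nk] = db[old] // write the new record first ...
		delete(db, old)  // ... then drop the old one, so no step loses data
	}
	db[versionKey] = "2" // set only once no legacy keys remain
	return nil
}

// ready reports whether a node could safely open db with the new software.
func ready(db Store) bool { return db[versionKey] == "2" }

func main() {
	db := Store{"H:1": "a", "H:2": "b", "H:10": "c"}
	err := migrate(db, 2) // simulate a failure partway through
	fmt.Println(err, ready(db))
	err = migrate(db, -1) // re-run to completion
	fmt.Println(err, ready(db))
}
```

The marker is what lets the software refuse to start on a partially-migrated store instead of silently reading an inconsistent view. A real database would also need each write-then-delete pair to be durable, which this in-memory sketch does not model.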

Discussion

To the extent possible, we should not require node operators to take manual actions during an upgrade of Tendermint, except where necessary to ensure correct operation. Changes to configuration files and databases that can be safely automated should be handled automatically by upgrade tooling.

The term safe here means that we should be confident that the change can be performed automatically without causing the node to fail or lose data as a result of that change. Conversely, if we cannot be confident a change will always migrate a working node at the previous version to a working node at the next version, we should not fully automate the process.

As a general principle, automated migration should never result in a surprising failure for a node operator moving from an old working version to a new version of the software. Surprising failures include things like a node crashing or stalling due to missing or incorrect config settings, or data loss due to an incomplete or incorrect database migration.

Automation does not have to be all-or-nothing. As long as a migration tool that encounters unexpected problems can report an actionable error to the operator without losing any data, we do not necessarily have to guarantee every possible condition is handled. A key point is that an automated migration should always "fail safe": If we report success, the software should work as before. If migration couldn't complete or needs input from the operator, it should report an error so the operator is not misled into thinking all is well.

These principles can also guide decisions about how to implement changes: If automated upgrades are a goal, we should avoid making changes to configuration settings and data formats that could not be safely automated—even if that means we have to compromise on the "ideal" changes we would prefer to make.
