Serialization compatibility version to be encoded in the Database file#14981

Closed
carlopi wants to merge 19 commits into duckdb:main from carlopi:serialization_compatibility_to_db_option

Conversation

@carlopi
Contributor

@carlopi carlopi commented Nov 25, 2024

This PR changes the serialization_compatibility_version from being a setting set on every session to a property written into the database file on creation (and so persisted across versions).

Basic example, works both in older and newer DuckDB versions:

ATTACH 'file1.db';
  -- write something, to be serialized targeting version v0.10.0
SET serialization_compatibility_version = 'v1.0.0';
ATTACH 'file2.db';
  -- write something, to be serialized targeting v1.0.0

In older DuckDB versions the setting only influenced how the file was written at that moment; with this PR the target version is persisted in the file. For example:

DETACH file2;
ATTACH 'file2.db';
  -- write something, to be serialized targeting v1.0.0 after this PR, v0.10.0 before this PR

There is also an ATTACH option:

ATTACH 'file3.db' (COMPATIBILITY_VERSION 'v1.1.0');
   -- file will target COMPATIBILITY_VERSION v1.1.0

Note that for a given DuckDB database file, compatibility_version is a property of the file itself that is only set on initialisation.

This PR should also allow converting database files, for example:

ATTACH 'file_from_the_future.db'; -- assume there are some newer compression methods in this file
ATTACH 'compatible_file.db' (COMPATIBILITY_VERSION 'v1.0.0');
COPY DATABASE FROM file_from_the_future TO compatible_file;

While file_from_the_future.db might support better encodings, compatible_file.db will contain the same logical content, just encoded differently.

Implementation details

  • Serializer, and classes derived from it, need to explicitly initialize the target serialization version they support. This is cumbersome, and might break extensions relying on the old API, but I think it is better to require an explicit serialization target
  • The more relevant changes are:
    • in src/storage/single_file_block_manager.cpp where there is the logic to read and write versions to the DuckDB database file
    • in src/storage/storage_manager.cpp where there is the logic that decides which compatibility_version to use: latest for in-memory databases, the one stored in the DB file when opening an existing file, or the one from either the AttachOption or the setting when creating a new file

Follow up:

  • expose this to SQL, possibly by adding a tag so that it can be queried via duckdb_databases()
  • add a switch on compatibility_version to decide whether to serialize ZSTD segments and other changes
  • add more tests
  • add more testing across DuckDB versions

…dded once there is a way to specify compat version earlier)
Collaborator

@Mytherin Mytherin left a comment


Thanks for the PR! Looks good - some comments

void Write(WriteStream &ser);
static MainHeader Read(ReadStream &source);

data_t compatibility_git_desc[MAX_VERSION_SIZE];

Should this be private?

optional_idx GetSerializationVersion(const char *version_string);
string GetSerializationVersionName(idx_t index);
vector<string> GetSerializationCandidates();
extern const uint64_t DEFAULT_STORAGE_VERSION_INFO;

Should this be here?

} else if (entry.first == "row_group_size") {
storage_options.row_group_size = entry.second.GetValue<uint64_t>();
} else if (entry.first == "compatibility_version") {
storage_options.compatibility_version = entry.second.GetValue<string>();

Do we want to validate the option here and store the internal version number instead of a string in the storage options?

const SerializationCompatibility &compat) const {
MemoryStream stream;
BinarySerializer serializer(stream);
SerializationOptions s(SerializationOptions::From(compat));

This should just always use latest - should not be a parameter

SerializationData data;

public:
explicit Serializer(const SerializationOptions &opts) : options(opts) {

Instead of enforcing this at the constructor level maybe we can use the Set/Get parameters to pass this down instead?

# name: test/sql/catalog/function/test_macro_default_arg_with_dependencies.test
# group: [function]

require skip_reload

Why is this required?

# description: Test views with changing schema
# group: [view]

require skip_reload

Why is this required?

# description: Test behavior of 'sql' on various different views
# group: [view]

require skip_reload

Why is this required?

# description: Test export of generated columns
# group: [export]

require skip_reload

Why is this required?

public:
explicit JsonSerializer(yyjson_mut_doc *doc, bool skip_if_null, bool skip_if_empty, bool skip_if_default)
: doc(doc), stack({yyjson_mut_obj(doc)}), skip_if_null(skip_if_null), skip_if_empty(skip_if_empty) {
: Serializer(SerializationOptions::DefaultOldestSupported()), doc(doc), stack({yyjson_mut_obj(doc)}),

This should be passed down as an option similar to the serializer

hannes added a commit that referenced this pull request Dec 17, 2024
Tests should never use `latest`, but should specify the intended version
explicitly.

This was part of #14981, but this
is independent and can go ahead at a different speed.
@carlopi carlopi mentioned this pull request Jan 14, 2025
Mytherin added a commit that referenced this pull request Jan 16, 2025
This adds the possibility for DuckDB to read (and modify) files of a new
storage version. Those files are (at this moment) identical to files of
version 64 (which has been the default since `v0.10.1`), with the
significant change that they can't be opened by previous DuckDB versions.

This makes it possible to guard features touching storage that are
incompatible with previous DuckDB versions, such as improved compression
methods.

The meaning of the data in `src/storage/version_map.json` changes from
being the exact storage version produced by a given DuckDB version to
the maximum storage version produced. Note that this is compatible with
the previous interpretation, where the lower and upper bounds happened
to be the same.

Opening files with storage_version=65 with older DuckDB versions will
produce error messages that point to
https://duckdb.org/docs/internals/storage.html, where this is
documented.

The only change visible at the SQL level is that DuckDB
AttachedDatabases now have a `storage_version` tag.

This PR is to be followed by a reworked version of
#14981.
Mytherin added a commit that referenced this pull request Jan 20, 2025
…orage version when serializing a database (#15794)

Follow-up from #15702
Supersedes/builds on top of #14981

This PR changes the `storage_compatibility_version` from being a setting
set on every session to being written in the database file.

Previously we would set this setting at run-time, and it would be shared
across all database instances:

```sql
ATTACH 'file1.db';
-- write something, to be serialized targeting version v0.10.0
SET storage_compatibility_version = 'v1.0.0';
ATTACH 'file2.db';
-- write something, to be serialized targeting v1.0.0
```

This has a number of issues:

* The storage compatibility version is shared across all attached
databases
* When restarting the system, the `storage_compatibility_version` would
revert back towards the default setting (currently `v0.10.0`)
* When reading a database, we did not know which storage compatibility
version was used, which could lead to hard to understand errors when
reading databases with an older version

### STORAGE_VERSION parameter

This PR reworks this so that the storage version is instead specified on
`ATTACH`. When none is specified:

* The version set in the `storage_compatibility_version` is used when
creating a new database
* The version stored within the database is used when loading an
existing database

As a result, we can target the storage version towards the desired
supported version when creating a new database. When opening an existing
database, we will keep writing files targeting the same DuckDB version
(i.e. we never automatically "upgrade" the file to a newer DuckDB
version). The user can *manually* upgrade a file by opening an older
file while targeting a later storage version.

For example:

```sql
-- use default `storage_compatibility_version`
ATTACH 'new_file.db';
-- explicitly target versions >= v1.2.0
ATTACH 'new_file.db' (STORAGE_VERSION 'v1.2.0');

-- use the storage version stored within the file
ATTACH 'existing_file.db';
-- use storage version v1.2.0 - if the file uses an older storage version, this upgrades the file
ATTACH 'existing_file.db' (STORAGE_VERSION 'v1.2.0');
```

Note that we cannot *downgrade* a file. If we try to open a file that
targets e.g. version v1.2.0 with an explicit storage version of v1.0.0,
we get an error:

```sql
ATTACH 'database_file.db' (STORAGE_VERSION 'v1.2.0');
DETACH database_file;

ATTACH 'database_file.db' (STORAGE_VERSION 'v1.0.0');
-- Error opening "database_file.db": cannot initialize database with storage version 2 - which is lower than what the database itself uses (4). The storage version of an existing database cannot be lowered.
```

### Opening with DuckDB < v1.1.3

When opening a file that targets `v1.2.0` in an older DuckDB version, we
now get a storage incompatibility error:

```shell
duckdb database_file.db
```

```
Error: unable to open database "database_file.db": IO Error: Trying to read a database file with version number 65, but we can only read version 64.
The database file was created with an newer version of DuckDB.

The storage of DuckDB is not yet stable; newer versions of DuckDB cannot read old database files and vice versa.
The storage will be stabilized when version 1.0 releases.

For now, we recommend that you load the database file in a supported version of DuckDB, and use the EXPORT DATABASE command followed by IMPORT DATABASE on the current version of DuckDB.

See the storage page for more information: https://duckdb.org/internals/storage
```

The description in the error is not entirely correct - but the error is
a lot more descriptive than the previous error that would be thrown in
this scenario (which was `INTERNAL Error: Unsupported compression
function type`).

The error message has also been improved in
#15702 already.
@carlopi carlopi closed this Jan 20, 2025
