
Implement links compression #5492

Merged

xzfc merged 5 commits into dev from links-comp-impl on Dec 18, 2024
Conversation

Member

@xzfc commented Nov 20, 2024

This PR adds the HNSW graph links compression implementation, for both reading and writing.

The integration is added in #5489: it's disabled by default unless __QDRANT_COMPRESSED_LINKS is set.

Depends on #5486, #5487, #5489, #5490, #5491. Ignore all but the last commit during the review.

Member

@timvisee left a comment

Looks good!

I'm not 100% confident I covered all the nitty-gritty details in this PR, as it's quite difficult to see everything. I did go over the PR twice, though, and no real questions or remarks come to mind.

Your extensive testing and unit tests do add confidence. Thanks for implementing them!

bits_per_unsorted: u8,
}

#[derive(AsBytes, FromBytes, FromZeroes)]
Member

Can we just extend the header type (and size) here without problems?

Contributor

@IvanPleshkov Dec 10, 2024

Yes, we can. We have a HEADER_SIZE constant which is larger than the actual size of this structure, which allows us to extend the struct. Only one thing to check: new fields will be zeroed in old files.
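The padding idea can be sketched roughly as follows (a minimal illustration, not the actual Qdrant code; `HEADER_SIZE` and `write_header` here are hypothetical names):

```rust
/// Reserved on-disk size for the header, deliberately larger than the
/// serialized struct so new fields can be appended later without changing
/// the file layout.
const HEADER_SIZE: usize = 64;

/// Write the header bytes followed by zero padding up to HEADER_SIZE.
/// A newer reader opening an older file sees any newly added fields as
/// zeros, which is exactly the "new fields will be zeroed" caveat above.
fn write_header(out: &mut Vec<u8>, header_bytes: &[u8]) {
    assert!(header_bytes.len() <= HEADER_SIZE);
    out.extend_from_slice(header_bytes);
    out.resize(out.len() + (HEADER_SIZE - header_bytes.len()), 0);
}
```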

total_offsets_len: u64,
offsets_padding: u64, // 0 or 4
m: u16,
Contributor

Do we need to store m and m0 here? Can we provide them from the graph, where these values are already known? My concern is that when links are uncompressed, m and m0 are zero. Right now it's clear why, but this behavior may cost long hours of debugging in the future; I find it non-trivial.

Contributor

As an alternative, if you wish to store m and m0 anyway, you could create an additional structure with compression options. That way you explicitly mark the m and m0 fields as compression options and make clear that in the non-compressed case a developer cannot rely on m/m0.
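The suggested split could look something like this (a hypothetical sketch; `CompressionInfo` and `LinksHeader` are illustrative names, not the actual Qdrant types):

```rust
/// Parameters that are only meaningful for compressed links.
struct CompressionInfo {
    m: usize,
    m0: usize,
}

struct LinksHeader {
    point_count: u64,
    levels_count: u64,
    /// `None` for plain (uncompressed) links, so there are no zeroed
    /// m/m0 fields for a developer to misread.
    compression: Option<CompressionInfo>,
}

impl LinksHeader {
    /// m is only available when the links are actually compressed.
    fn m(&self) -> Option<usize> {
        self.compression.as_ref().map(|c| c.m)
    }
}
```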

} else {
self.offsets_padding as u64
},
m: if self.compressed { self.m as u16 } else { 0 },
Contributor

I mentioned in another thread that this behaviour is very non-trivial, and in the future developers may struggle with the situation where m and m0 are zero. I proposed a couple of solutions.

Member Author

Updated: these values are present in HeaderCompressed but not in HeaderPlain.
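The shape of that resolution can be sketched as two distinct header types, so m/m0 exist only where they are meaningful (field names here are illustrative, not the actual Qdrant definitions):

```rust
/// Header for the plain (uncompressed) links format: no m/m0 at all.
struct HeaderPlain {
    point_count: u64,
    levels_count: u64,
    total_offsets_len: u64,
    offsets_padding: u64,
}

/// Header for the compressed links format: m/m0 are first-class fields.
struct HeaderCompressed {
    point_count: u64,
    levels_count: u64,
    total_offsets_len: u64,
    m: u16,
    m0: u16,
}
```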

converter.save_as(path)?;
let new_size = path.metadata()?.len();

log::info!(
Contributor

I propose changing the log level to debug, because HNSW graph logs have debug severity.

let links_builder = graph_layers_builder.links_layers[idx][0].read();
let link_container_from_builder = links_builder.iter().copied().collect::<Vec<_>>();
assert_eq!(links_orig, &link_container_from_builder);
let m = if compressed { M * 2 } else { 0 };
Contributor

As mentioned in another PR, the logic max_m = m * 2 should have a single source of truth. That means moving it into a separate function.
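The suggested single source of truth could be as small as this (a hypothetical sketch; the function name is illustrative):

```rust
/// One place for the HNSW convention that layer 0 allows twice as many
/// links per point as the upper layers, instead of repeating `m * 2`
/// at every call site.
fn max_links_for_level(m: usize, level: usize) -> usize {
    if level == 0 { m * 2 } else { m }
}
```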

Contributor

@IvanPleshkov left a comment

Needs a resolution about m and m0, which are 0 for non-compressed links.

/// Deliberately placed at the same offset as [`HeaderV0::levels_count`] and
/// set to an impossibly large number to make old Qdrant versions fail fast
/// when trying to read the new format.
version: U64,
Contributor

If you place version at the end, you could compare with 0 instead of the magic 0xFFFF_FFFF_FFFF_FF01, because the old-version header is guaranteed to be zeroed after its end.

Contributor

@IvanPleshkov Dec 12, 2024

Moreover, you could then read version directly and avoid the complicated levels_count_or_version variable while loading. Do you have any concerns about this approach?

Member Author

I want an old Qdrant (≤1.12) to explicitly crash if it tries to read a compressed file. If we placed the new field at the end, old versions would just ignore it, since they don't know about it.

While links compression is a hidden experimental feature that is not enabled by default, we won't bump the segment version soon.

Contributor

As we discussed separately, old Qdrant versions will throw an error with this approach, but that error will be a segfault; it does not tell the user that they are trying to open new links with an older Qdrant. I don't remember why we can't increase the segment version in this case. Can you remind me of the complaints about updating the segment version?

zero_padding: [u8; 8],
}

const HEADER_VERSION_1: u64 = 0xFFFF_FFFF_FFFF_FF01;
Contributor

Maybe these constants are clearer, WDYT:

const CURRENT_HEADER_VERSION: u64 = 1;
const HEADER_VERSION_FILE_CONTENT: u64 = u64::MAX - CURRENT_HEADER_VERSION;

Member Author

I prefer to have 0xFFFF_FFFF_FFFF_FF01 written explicitly in the code, since it's the value visible in hexdump graph.links | head -1. That's more convenient for debugging/troubleshooting.

.get();

match levels_count_or_version {
_ if u64::from_le(levels_count_or_version) <= 1 << 32 => {
Contributor

Comparing the version with 1 << 32 is non-trivial. I propose moving it to a separate line of code with an additional comment.

Member Author

Done.
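The applied refactor isn't shown in the excerpt; a hypothetical version of splitting the check out with a comment might look like:

```rust
/// Hypothetical sketch. Plain (old) files store levels_count at this offset,
/// and a realistic levels count always fits well below 2^32. The compressed
/// format stores a version magic (0xFFFF_FFFF_FFFF_FF01) there instead,
/// which is far above 2^32, so this single comparison distinguishes the
/// two formats.
fn is_version_magic(levels_count_or_version: u64) -> bool {
    u64::from_le(levels_count_or_version) > 1 << 32
}
```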


// 2. header padding
writer.write_zeros(HEADER_SIZE - size_of::<GraphLinksFileHeader>())?;
if self.compressed {
Contributor

I disagree with this approach. If we update the header version, we can save only in the new format and drop the old version in the future. Currently, HeaderV1 is not a new version of the header; it is, in fact, a header for compressed links only.

let links_range = (offset0 as usize)..(offset1 as usize);

if let Some(compression) = &self.info.compression {
for_each_packed_link(
Contributor

If unpacking costs something, we might think in the future about partial links decompression: we only need the first links that pass filtering. It's not a proposal for this PR, but we may improve speed if the additional checks cost less than unpacking.
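One way to shape such a partial-decompression API is to let the per-link callback stop the iteration early; here is a hypothetical sketch over a plain slice (in the real compressed format each link would be decoded lazily inside the loop):

```rust
use std::ops::ControlFlow;

/// Iterate links, letting the callback break out early so that links past
/// the first accepted ones never need to be unpacked.
fn for_each_link_until(links: &[u32], mut f: impl FnMut(u32) -> ControlFlow<()>) {
    for &link in links {
        // Decoding of a compressed link would happen here, on demand.
        if f(link).is_break() {
            return;
        }
    }
}
```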

});
}

pub fn bench_bitpacking_links(c: &mut Criterion) {
Contributor

Thanks for benchmarking! =)

@xzfc xzfc merged commit 372a421 into dev Dec 18, 2024
@xzfc xzfc deleted the links-comp-impl branch December 18, 2024 11:29
timvisee pushed a commit that referenced this pull request Jan 8, 2025
* common::bitpacking::packed_bits

* BitWriter::new(): append, don't clear the buffer

* common::bitpacking_links

* Implement links compression

* Update
