Skip to content

Refactor StoreExternalIndexNodes function and receive index file from socket in chunks#333

Merged
var77 merged 6 commits intomainfrom
varik/external-index-streaming-write
Sep 16, 2024
Merged

Refactor StoreExternalIndexNodes function and receive index file from socket in chunks#333
var77 merged 6 commits intomainfrom
varik/external-index-streaming-write

Conversation

@var77
Copy link
Copy Markdown
Collaborator

@var77 var77 commented Sep 8, 2024

Refactor StoreExternalIndexNodes function and receive index file from socket in chunks

  • simplify StoreExternalIndexNodes function to write the provided buffer into index pages until fully flushed.
  • try to write as much nodes as possible before requesting new chunk from external indexing server to optimize data streaming
  • do not throw errors or process interrupts from StoreExternalIndex or external index socket functions, instead set status and error msg to buildstat->status, then check the status and throw the error or process interrupts after resource cleanup.
  • receive and process index file in 10MB chunks

Copy link
Copy Markdown

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Summary

This pull request implements changes to receive and write external index data in chunks, aiming to reduce memory usage when handling large index files. The key modifications include:

  • Modified StoreExternalIndex function in src/hnsw/external_index.c to support both streaming and non-streaming modes
  • Introduced new functions external_index_receive_metadata and external_index_receive_index_part in src/hnsw/external_index_socket.c for chunked data transfer
  • Added error handling for incomplete data reception and buffer management for processing chunks
  • Updated function signatures and added new parameters in src/hnsw/external_index.h to accommodate the chunked transfer approach
  • Replaced the previous method of receiving the entire index file at once with a more memory-efficient chunked approach

These changes should significantly reduce memory usage by processing large index files in smaller parts instead of loading them entirely into memory.

5 file(s) reviewed, no comment(s)
Edit PR Review Bot Settings

@codecov
Copy link
Copy Markdown

codecov bot commented Sep 8, 2024

Codecov Report

Attention: Patch coverage is 69.06780% with 73 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/hnsw/external_index_socket.c 57.29% 41 Missing ⚠️
src/hnsw/build.c 59.61% 21 Missing ⚠️
src/hnsw/external_index.c 87.50% 11 Missing ⚠️

📢 Thoughts on this report? Let us know!

Comment on lines +320 to +339
// rotate buffer
buffer_position = external_index_buffer_size - local_progress;
memcpy(external_index_data, external_index_data + local_progress, buffer_position);
Copy link
Copy Markdown
Contributor

@Ngalstyan4 Ngalstyan4 Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's implement an actual Ring Buffer for this to avoid copying here.
I think external_index_receive_index_part needs to change to take the write, read positions and size of the ring buffer instead of the write position and size, as it does now.

elog(ERROR, "external index socket read failed");
break;
case EXTERNAL_INDEX_INDEXING_ERROR:
buffer[ size ] = '\0';
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the guarantee that this does not overflow the buffer?

@var77 var77 force-pushed the varik/external-index-streaming-write branch 2 times, most recently from a53f4d8 to e41a549 Compare September 11, 2024 13:37
…om socket in chunks

- simplify `StoreExternalIndexNodes` function to write the provided buffer into index pages until fully flushed.
- try to write as much nodes as possible before requesting new chunk from external indexing server to optimize data streaming
- do not throw errors or process interrupts from `StoreExternalIndex` or external index socket functions, instead set status and error msg to `buildstat->status`, then check the status and throw the error or process interrupts after resource cleanup.
- receive and process index file in 10MB chunks
@var77 var77 force-pushed the varik/external-index-streaming-write branch from e41a549 to 02ec07f Compare September 11, 2024 13:48
@var77 var77 changed the title receive and write external index in chunks Refactor StoreExternalIndexNodes function and receive index file from socket in chunks Sep 11, 2024
@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 11, 2024

Benchmarks

metric old new pct change
recall (after create) 0.956 0.953 -0.31%
recall (after insert) 0.000 0.000 -
select tps 26084.121 25767.662 -1.21%
select bulk(100) tps 38.167 36.451 -4.50%
select latency (ms) 0.713 ± 1.325𝜎 0.706 ± 1.336𝜎 -0.98%
select bulk(100) latency (ms) 812.370 ± 110.239𝜎 862.900 ± 107.405𝜎 +6.22%
create latency (ms) 395672.052 396084.148 +0.10%
insert tps 435.679 450.008 +3.29%
insert bulk(100) tps 4.655 4.768 +2.42%
insert latency (ms) 72.627 ± 16.395𝜎 70.402 ± 15.572𝜎 -3.06%
insert bulk(100) latency (ms) 6754.574 ± 135.138𝜎 6578.834 ± 143.086𝜎 -2.60%
disk usage (bytes) 8192008192.000 8192008192.000 -

@var77 var77 force-pushed the varik/external-index-streaming-write branch from 4b8019a to cdc0a19 Compare September 12, 2024 09:06
@var77 var77 force-pushed the varik/external-index-streaming-write branch from d6c2349 to 1fe46f4 Compare September 13, 2024 07:58
@var77 var77 merged commit 9db16cc into main Sep 16, 2024
@var77 var77 deleted the varik/external-index-streaming-write branch September 16, 2024 16:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants