Skip to content

infer dimensions#88

Closed
ezra-varady wants to merge 9 commits intolanterndata:mainfrom
ezra-varady:ezra/infer_dim
Closed

infer dimensions#88
ezra-varady wants to merge 9 commits intolanterndata:mainfrom
ezra-varady:ezra/infer_dim

Conversation

@ezra-varady
Copy link
Copy Markdown
Collaborator

This should address #55, allowing postgres to automatically decide the dimension index of arrays of real and int. It preserves the user's ability to specify a dimension. When dims is left as the default lantern will now check the column's dimension and use that instead if it differs. I haven't tested against older versions of postgres, and I'm not familiar with the API, so this may need some additional work to be backwards compatible

@ezra-varady ezra-varady changed the title Ezra/infer dim infer dimensions Aug 26, 2023
@ezra-varady
Copy link
Copy Markdown
Collaborator Author

API appears to have been different in postgres 11, will look into it later this evening

Copy link
Copy Markdown
Contributor

@Ngalstyan4 Ngalstyan4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good! Left small comments. Should be ready soon

src/hnsw/build.c Outdated
int n_items = currDim;
int indexCol;

// If NumIndexAttrs isn't 1 there's no clear way to infer the dim
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify in the comment what NumIndexAttrs represents?

src/hnsw/build.c Outdated

// Get the indexed column out of the row and return it's dimensions
datum = heap_getattr(tuple, indexCol, RelationGetDescr(heap), &isNull);
if(!isNull) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it seems, if the first row in the table has NULL vector and all the ones afterwards have e.g. 3 dimensional real[] arrays, then we fail to infer. This is fair. Let's just add a test for it and document it

src/hnsw/build.c Outdated
if(buildstate->dimensions == HNSW_DEFAULT_DIMS)
buildstate->dimensions = GetArrayLengthFromHeap(heap, indexInfo, buildstate->dimensions);
/* Require column to have dimensions to be indexed */
if(buildstate->dimensions < 1) elog(ERROR, "column does not have dimensions");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the error we are going to see if the table is empty? Is that case tested?

Copy link
Copy Markdown
Collaborator Author

@ezra-varady ezra-varady Aug 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be if DEFAULT_DIMS were changed from 3 to -1. Currently a number of tests create indices on empty tables, so I wasn't sure what the expected behavior was. I preserved the default to maximize compatibility with the existing API, so right now this doesn't generate an error. If it's alright to throw an error when the table being indexed is empty I can change the default per @var77's suggestion and update the relevant tests to make things more consistent

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is all right and the expected behaviour. Just throw a helpful message asking the user to provide number of dimensions for the indexed column

src/hnsw/build.c Outdated
// If a dimension wasn't specified try to infer it
// this is unecessary when the actual size is 3 but in practice this shouldn't be too common
if(columnType == REAL_ARRAY || columnType == INT_ARRAY)
if(buildstate->dimensions == HNSW_DEFAULT_DIMS)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should set HNSW_DEFAULT_DIMS to -1, NULL or something like that, as now if we pass (dims=3) on index creation and insert array with 4 dimensions it is going to overwrite the dims option.

@ezra-varady ezra-varady marked this pull request as draft August 27, 2023 02:18
@ezra-varady
Copy link
Copy Markdown
Collaborator Author

ezra-varady commented Aug 27, 2023

There's actually an issue with GetHnswIndexDimensions. If we infer the dimension it won't set the dims relopt and it will break scans. Unfortunately afaict while building the index the rd_options field is still null, and my various attempts to set it manually don't work, marking as draft for a bit while I add logic to fix this

EDIT: fixed this but I'm seeing odd results on the distance function regression tests these two diffs are exemplary but there are several others

@@ -102,13 +102,13 @@
     FROM small_world_l2
     ORDER BY vector <-> array[0,1,0] LIMIT 7
 ) v ORDER BY v.dist, v.id;
-                                                      QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
- Sort  (cost=0.70..0.72 rows=7 width=48)
+                                                     QUERY PLAN                                                     
+--------------------------------------------------------------------------------------------------------------------
+ Sort  (cost=14.36..14.38 rows=7 width=48)
    Sort Key: v.dist, v.id
-   ->  Subquery Scan on v  (cost=0.00..0.60 rows=7 width=48)
-         ->  Limit  (cost=0.00..0.53 rows=7 width=52)
-               ->  Index Scan using small_world_l2_vector_idx on small_world_l2  (cost=0.00..81.42 rows=1070 width=52)
+   ->  Subquery Scan on v  (cost=0.00..14.26 rows=7 width=48)
+         ->  Limit  (cost=0.00..14.19 rows=7 width=52)
+               ->  Index Scan using small_world_l2_vector_idx on small_world_l2  (cost=0.00..16.22 rows=8 width=52)
                      Order By: (vector <-> '{0,1,0}'::real[])
 (6 rows)
 
@@ -161,8 +161,8 @@
 -----+------
  010 | 0.00
  011 | 0.29
- 110 | 0.29
  111 | 0.42
+ 000 | 1.00
  001 | 1.00
  100 | 1.00
  101 | 1.00

@ezra-varady ezra-varady marked this pull request as ready for review August 27, 2023 03:31
@ezra-varady ezra-varady force-pushed the ezra/infer_dim branch 3 times, most recently from 28320d5 to 65853b4 Compare August 27, 2023 06:16
@Ngalstyan4
Copy link
Copy Markdown
Contributor

Hi @ezra-varady,

This might have been a result of some issues on indexes with real[] column type. @dqii addressed those at #87 which was just merged to main. Could you rebase and see if the issues you mentioned above persist?

src/hnsw/build.c Outdated

// If a dimension wasn't specified try to infer it
if(buildstate->columnType == REAL_ARRAY || buildstate->columnType == INT_ARRAY)
if(buildstate->dimensions < 1) buildstate->dimensions = InferDimension(heap, indexInfo);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should always use braces {,} with ifs and other basic blocs.
It is particularly important for cases like these where you have multiple ifs

I think the code above can be clarified as well.
What happens when buildstate->columnType is not REAL_ARRAY or INT_ARRAY?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you're right, the operators aren't defined, good catch

src/hnsw/build.c Outdated
HnswOptions *opts;
int attrNum;

// We know there's one key because we generate an error during inference if multiple keys are specified
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add an assert in place of this comment?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants