Skip to content

Implement enum type#458

Merged
micheleriva merged 8 commits intomainfrom
feat/implement-enum
Sep 14, 2023
Merged

Implement enum type#458
micheleriva merged 8 commits intomainfrom
feat/implement-enum

Conversation

@allevo
Copy link
Copy Markdown
Contributor

@allevo allevo commented Aug 2, 2023

We collect some issues for our performance during the insertion. In particular, the number type has some problems if the number of documents is high. But also string can be slow too.

The issue related to the insertion time is because of the cost of keeping the order for the number. Trees allow us to keep order with the cost of O(log(N)) for insertion time. But what happens if we choose to lose the order to have a faster insertion time?

Depending on what you need, values can have a lot of meanings:

  • strings you want to search for
  • numbers you want to compare with operators (>, <, >=, <=, =, !=)
  • enum you want to compare with operators (=, in)

This PR introduces the enum type to allow:

  • fast insertion
  • fast search
  • = and in operator (others??)
  • reduce the dump size

What this PR misses:

  • tests: done
  • we need to identify FlatTree properly: used a discriminator string
  • choose if we want to support enum[] as well: discarded for now
  • documentation: done
  • collect suggestions and comment: done

Usage:

const db = await create({
  schema: {
    categoryId: 'enum',
  },
})

const [c2, c3, c5] = await insertMultiple(db, [
  { categoryId: 1 },
  { categoryId: 1 },
  { categoryId: 2 },
  { categoryId: 3 },
  { categoryId: "5" },
])
const resultEq = await search(db, {
  term: '',
  where: {
    categoryId: { eq: 1 },
  }
}) // Expected 2 results
const resultIn = await search(db, {
  term: '',
  where: {
    categoryId: { in: [1, 3, "5"] },
  }
}) // Expected 4 results

@vercel
Copy link
Copy Markdown

vercel bot commented Aug 2, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
orama-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Sep 8, 2023 5:53pm

@ShogunPanda
Copy link
Copy Markdown
Contributor

I'm not sure I understand the purpose of this. Can you please clarify?

@allevo
Copy link
Copy Markdown
Contributor Author

allevo commented Aug 8, 2023

Hi,
sorry for the delay.
I'll try to describe the proposal in a better way.
Sorry for the wall.

Context

Now, Orama uses tree structures to store strings and numbers. In particular, it uses radix trees and avl trees to store respectively strings and numbers.

Problem

The time insertion is high when the developer trying to insert more than 100k of documents.
Some users opened issues or report that issues on slack.
In particular, the insertion inside avl trees took times.

We uses trees because that structures have some guarantees like:

  • find all the documents mapped by a defined value
  • find all documents greater or lesser that a value
  • for radix tree, find all documents using a tolerance

All those features are great for some kind of data:

  • strings where you want to find a substring
  • filter for number equity
  • filter for number using a comparator

All those searches (and others) are yet supported by Orama.

Anyway, if I think generally a search, I'm thinking a search bar + filters. Meanwhile the search bar completely leverages on radix tree for typo tolerance for instance, the filters are typically implemented using two kind of interfaces:

  • the final user can specify a range of a value. The example is the price or the distance from the city center.
  • the final user can specify the presence or the absence of a specific value inside the document. For instance, if an hotel has a services like a parking, or if a book has the Kindle version, or t-shirt size.

For the first use case, we can leverage on avl because we need the comparators.
For the second use case, we could leverage on the same implementation but it creates an overhead because we are still keeping the order also for fields where the order or the typo tolerance don't make any sense.

Solution

I'm proposing a different approach to not keep the order for fields that doesn't require it. Like the example above, the user still want to filter on it with an exact match and without keeping an order.
This solution offers a way to implement it using a Map<any, InternalDocumentId[]>.
Benefits:

  • 30% improvement for number insertion
  • less memory usage (not yet tested)
  • reduce the dump size (not yet tested)

Const:

  • add another type that cannot so easy to understand.

The implementation is still in progress because I would like to understand if this makes sense before finalizing it.

@ShogunPanda
Copy link
Copy Markdown
Contributor

I see!
The proposal makes sense to me and an enum searching it definitely worth a try.
I'll review the code once it's finalized but I have a question here: do you think you can implement this using a new component (which means: do not use the usual radix+avl based component when performing operations) rather than modifying sparse part of Orama? This way we have a cleaner and "black-box like" implementation.

@allevo
Copy link
Copy Markdown
Contributor Author

allevo commented Aug 9, 2023

Hi!
What do you mean by a new component? Which component are you expecting to have?
It is always an index, right?

@ShogunPanda
Copy link
Copy Markdown
Contributor

Exactly, a new Index component that can replace the default one. This use-case is an example of the ratio behind Orama's architecture.

@allevo
Copy link
Copy Markdown
Contributor Author

allevo commented Aug 9, 2023

What do you mean? I want to add this feature to the default index.
If this is not the case, could you give me an example of your thoughts?

@ShogunPanda
Copy link
Copy Markdown
Contributor

I think that in order to keep Orama clean things should be separated. Nothing prevents you to implement the new feature as new index and then call within the default index.
But I still think this should be kept as separated component.

Copy link
Copy Markdown
Contributor

@micheleriva micheleriva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just a couple of minor changes. Will approve once we also have documentation ready. Well done!

Copy link
Copy Markdown
Contributor

@micheleriva micheleriva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it mostly looks good, except for a couple of small comments. @allevo please add tests for Enums for the plugin-data-persistence package as well.

Good to go as soon as these comments are addressed.

Terrific job 🔥

Copy link
Copy Markdown
Contributor

@micheleriva micheleriva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@micheleriva micheleriva merged commit c35ca0e into main Sep 14, 2023
@micheleriva micheleriva deleted the feat/implement-enum branch September 14, 2023 21:53
@allevo allevo mentioned this pull request Sep 18, 2023
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants