Introduction to the problem
As per the Orama documentation (link):
Orama automatically uses the id field of the document, if found.
That means that given the following document and schema:
import { create, insert } from '@orama/orama'

// Schema with an explicit `id` field, which Orama picks up automatically
const db = await create({
  schema: {
    id: 'string',
    author: 'string',
    quote: 'string',
  },
})

// The document's own `id` is used as its identifier in the index
await insert(db, {
  id: '73cbcc79-2203-49b8-bb52-60d8e9a66c5f',
  author: 'Fernando Pessoa',
  quote: "I wasn't meant for reality, but life came and found me",
})
The document will be indexed with the following id: 73cbcc79-2203-49b8-bb52-60d8e9a66c5f.
If the id field is not found, Orama will generate a random id for the document.
This gives users a great opportunity to use their own custom IDs, but at a cost: Orama uses several data structures (AVL Trees, Radix Trees, Inverted Indexes) where the ID of a document gets duplicated multiple times.
For example, if we insert the following document:
{
  id: '37fd6bb7-5ac2-4a37-adcd-485045b54bdc',
  author: 'Michele',
  quote: 'Hello, world!'
}
The id 37fd6bb7-5ac2-4a37-adcd-485045b54bdc will get duplicated at least three times:
'Michele' will be stored in the radix tree, where the last node will contain the reference of the document containing this specific token
'Hello' will be stored in the radix tree, where the last node will contain the reference of the document containing this specific token
'world' will be stored in the radix tree, where the last node will contain the reference of the document containing this specific token
There might be other places where this ID can get duplicated depending on certain conditions.
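To make the duplication concrete, here is a minimal TypeScript sketch. It is an illustration only: it uses a toy token-to-IDs map instead of Orama's actual radix tree, and the names `QuoteDoc`, `tokenize`, `buildToyIndex` and `countOccurrences` are hypothetical helpers, not part of Orama. It counts how many copies of the document ID end up in the serialized toy index.

```typescript
// Toy inverted index (token -> document IDs). This is NOT Orama's radix tree;
// every name below is hypothetical and only illustrates the duplication.
interface QuoteDoc {
  id: string
  author: string
  quote: string
}

// Lowercase the text and split it on anything that is not a letter or digit.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)
}

// Store the full document ID next to every token occurrence.
function buildToyIndex(doc: QuoteDoc): Map<string, string[]> {
  const index = new Map<string, string[]>()
  const tokens = tokenize(doc.author).concat(tokenize(doc.quote))
  for (const token of tokens) {
    const postings = index.get(token) ?? []
    postings.push(doc.id)
    index.set(token, postings)
  }
  return index
}

// Count non-overlapping occurrences of `needle` in `haystack`.
function countOccurrences(haystack: string, needle: string): number {
  return haystack.split(needle).length - 1
}

const sampleDoc: QuoteDoc = {
  id: '37fd6bb7-5ac2-4a37-adcd-485045b54bdc',
  author: 'Michele',
  quote: 'Hello, world!',
}

const serializedToyIndex = JSON.stringify(Array.from(buildToyIndex(sampleDoc).entries()))

// One copy of the 36-character ID per token: 'michele', 'hello', 'world'.
console.log(countOccurrences(serializedToyIndex, sampleDoc.id)) // 3
```

In this toy model the cost grows with the number of indexed tokens, so a longer quote would repeat the same 36-character string once per token occurrence.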
How to solve
If users index their content with UUIDs, the index size grows considerably, because a fairly long string (36 characters) gets duplicated many times.
Therefore, we should let users index every document with their own custom IDs, without putting those IDs into a Radix Tree (which makes the id property not searchable). Internally, we should store the document reference in our data structures under a shorter ID, possibly generated with the syncUniqueId function exported here.
Users should be able to retrieve their docs using the getById function (link), but internally, Orama should always use a short, optimized ID.
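One possible shape for this, as a rough sketch rather than a proposed implementation: keep a two-way mapping between the user's ID and a short internal numeric ID, and store only the internal ID in the index structures. The `InternalIdStore` and `TinyIndex` classes below are hypothetical names, the token map stands in for Orama's real data structures, and a plain counter is assumed in place of syncUniqueId.

```typescript
// Hypothetical sketch: InternalIdStore and TinyIndex are illustrative names,
// not part of Orama's API or internals.
type ExternalId = string
type InternalId = number

class InternalIdStore {
  private readonly toInternal = new Map<ExternalId, InternalId>()
  private readonly toExternal: ExternalId[] = []

  // Return the internal ID already assigned to this external ID, or assign the next one.
  getOrCreate(externalId: ExternalId): InternalId {
    const existing = this.toInternal.get(externalId)
    if (existing !== undefined) return existing
    this.toExternal.push(externalId)
    const internalId = this.toExternal.length // internal IDs start at 1
    this.toInternal.set(externalId, internalId)
    return internalId
  }

  toInternalId(externalId: ExternalId): InternalId | undefined {
    return this.toInternal.get(externalId)
  }

  toExternalId(internalId: InternalId): ExternalId | undefined {
    return this.toExternal[internalId - 1]
  }
}

interface IndexedDoc {
  id: string
  author: string
  quote: string
}

class TinyIndex {
  private readonly ids = new InternalIdStore()
  private readonly docs = new Map<InternalId, IndexedDoc>()
  private readonly postings = new Map<string, InternalId[]>()

  // Assumes each document is inserted once; updates are out of scope for this sketch.
  insert(doc: IndexedDoc): void {
    const internalId = this.ids.getOrCreate(doc.id)
    this.docs.set(internalId, doc)
    // The `id` field is intentionally not tokenized, so it is not searchable.
    const text = `${doc.author} ${doc.quote}`.toLowerCase()
    for (const token of text.split(/[^a-z0-9]+/).filter(Boolean)) {
      const list = this.postings.get(token) ?? []
      // Skip the push if this document was already recorded for the token.
      if (list[list.length - 1] !== internalId) list.push(internalId)
      this.postings.set(token, list)
    }
  }

  // Users keep addressing documents by their own ID.
  getById(externalId: string): IndexedDoc | undefined {
    const internalId = this.ids.toInternalId(externalId)
    return internalId === undefined ? undefined : this.docs.get(internalId)
  }

  // Exact-token lookup that translates internal IDs back to user IDs.
  searchToken(token: string): string[] {
    const result: string[] = []
    for (const internalId of this.postings.get(token.toLowerCase()) ?? []) {
      const externalId = this.ids.toExternalId(internalId)
      if (externalId !== undefined) result.push(externalId)
    }
    return result
  }
}

const tinyIndex = new TinyIndex()
tinyIndex.insert({
  id: '37fd6bb7-5ac2-4a37-adcd-485045b54bdc',
  author: 'Michele',
  quote: 'Hello, world!',
})
console.log(tinyIndex.getById('37fd6bb7-5ac2-4a37-adcd-485045b54bdc')?.author) // Michele
```

With this layout the long ID is stored once in the mapping, while every posting holds only a small number. The trade-off is an extra lookup when translating results back to user-facing IDs.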
Bounty Program
This issue is subject to our Open Source Bounty Program: we'll reward whoever creates a PR that gets merged with $800 for this activity.