Introduce Orama internal ID for documents #426

@micheleriva

Introduction to the problem

As per the Orama documentation (link):

Orama automatically uses the id field of the document, if found.

That means that given the following document and schema:

import { create, insert } from '@orama/orama'

const db = await create({
  schema: {
    id: 'string',
    author: 'string',
    quote: 'string',
  },
})

await insert(db, {
  id: '73cbcc79-2203-49b8-bb52-60d8e9a66c5f',
  author: 'Fernando Pessoa',
  quote: "I wasn't meant for reality, but life came and found me",
})

The document will be indexed with the following id: 73cbcc79-2203-49b8-bb52-60d8e9a66c5f.

If the id field is not found, Orama will generate a random id for the document.

This lets users bring their own custom IDs, but at a cost: Orama stores document references in several data structures (AVL trees, radix trees, inverted indexes), so the ID of a document gets duplicated multiple times.

For example, if we insert the following document:

{
  id: '37fd6bb7-5ac2-4a37-adcd-485045b54bdc',
  author: 'Michele',
  quote: 'Hello, world!'
}

The id 37fd6bb7-5ac2-4a37-adcd-485045b54bdc will get duplicated at least three times:

  1. 'Michele' will be stored in the radix tree, where the last node will contain a reference to the document containing this token
  2. 'Hello' will be stored in the radix tree, where the last node will contain a reference to the document containing this token
  3. 'world' will be stored in the radix tree, where the last node will contain a reference to the document containing this token

There might be other places where this ID can get duplicated depending on certain conditions.
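To make the duplication concrete, here is a minimal sketch that counts how many times the ID is referenced for the document above. It uses a toy token-to-IDs map rather than Orama's actual radix tree, and the names `ToyIndex`, `tokenize`, `indexDocument`, and `countIdReferences` are hypothetical, chosen only for illustration.

```typescript
// Toy inverted index: token -> list of document IDs.
// This is an illustrative stand-in, not Orama's real data structure.
type ToyIndex = Map<string, string[]>

// Naive tokenizer (assumption): lowercase, split on non-alphanumeric characters.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)
}

// Stores the document ID once per distinct token of each field.
function indexDocument(index: ToyIndex, id: string, fields: string[]): void {
  fields.forEach((field) => {
    Array.from(new Set(tokenize(field))).forEach((token) => {
      const postings = index.get(token)
      if (postings) postings.push(id)
      else index.set(token, [id])
    })
  })
}

// Counts how many times a given ID appears across all posting lists.
function countIdReferences(index: ToyIndex, id: string): number {
  let count = 0
  index.forEach((postings) => {
    count += postings.filter((ref) => ref === id).length
  })
  return count
}

const toyIndex: ToyIndex = new Map()
const docId = '37fd6bb7-5ac2-4a37-adcd-485045b54bdc'
indexDocument(toyIndex, docId, ['Michele', 'Hello, world!'])

// One reference each for 'michele', 'hello', and 'world'.
const references = countIdReferences(toyIndex, docId)
// ID characters that would appear if every reference were written out as text.
const idCharacters = references * docId.length
console.log(references, idCharacters)
```

With a 36-character UUID, even this three-token document carries 108 characters of ID text once every reference is written out, for example when the index is serialized.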

How to solve

If the user indexes content using UUIDs as IDs, this drastically increases the index size, because a fairly large string (36 characters in its canonical form) is duplicated many times.

Therefore, we should let users index every document with their own custom IDs, but without inserting those IDs into a Radix Tree, which makes the id property not searchable. Internally, Orama should store the document reference in its data structures using a shorter ID, possibly generated with the syncUniqueId function exported here.

Users should be able to retrieve their docs using the getById function (link), but internally, Orama should always use a short, optimized ID.
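One way to picture the proposal is a two-way mapping between the user's ID and a short internal one. The sketch below is only an assumption about how this could look: it uses an incrementing integer instead of syncUniqueId, and `InternalIdStore`, `insertDoc`, and `getDocByExternalId` are hypothetical names, not part of Orama's API.

```typescript
type Doc = { id: string; author: string; quote: string }

// Hypothetical two-way mapping between user-provided IDs and short internal IDs.
class InternalIdStore {
  private externalToInternal = new Map<string, number>()
  private internalToExternal: string[] = []

  // Returns the existing internal ID, or assigns the next integer (starting at 1).
  getInternalId(externalId: string): number {
    const existing = this.externalToInternal.get(externalId)
    if (existing !== undefined) return existing
    this.internalToExternal.push(externalId)
    const internalId = this.internalToExternal.length
    this.externalToInternal.set(externalId, internalId)
    return internalId
  }

  // Looks up an internal ID without creating one.
  lookup(externalId: string): number | undefined {
    return this.externalToInternal.get(externalId)
  }

  getExternalId(internalId: number): string | undefined {
    return this.internalToExternal[internalId - 1]
  }
}

const idStore = new InternalIdStore()
const documents = new Map<number, Doc>()
const tokenPostings = new Map<string, number[]>()

function insertDoc(doc: Doc): number {
  if (idStore.lookup(doc.id) !== undefined) {
    throw new Error('Document already exists: ' + doc.id)
  }
  const internalId = idStore.getInternalId(doc.id)
  documents.set(internalId, doc)
  // Only author and quote are tokenized; the id field stays out of the index.
  const tokens = (doc.author + ' ' + doc.quote)
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean)
  Array.from(new Set(tokens)).forEach((token) => {
    const postings = tokenPostings.get(token)
    if (postings) postings.push(internalId)
    else tokenPostings.set(token, [internalId])
  })
  return internalId
}

// Retrieval by the user's own ID still works through the mapping.
function getDocByExternalId(externalId: string): Doc | undefined {
  const internalId = idStore.lookup(externalId)
  return internalId === undefined ? undefined : documents.get(internalId)
}

const userId = '73cbcc79-2203-49b8-bb52-60d8e9a66c5f'
const assignedId = insertDoc({
  id: userId,
  author: 'Fernando Pessoa',
  quote: "I wasn't meant for reality, but life came and found me",
})
const fetched = getDocByExternalId(userId)
console.log(assignedId, fetched ? fetched.author : undefined)
```

In this sketch the long ID is stored once per document in the mapping, while every posting list holds only the small integer. The user's ID is never tokenized, so it is not searchable, in line with the trade-off described above.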

Bounty Program

This issue is subject to our Open Source Bounty Program: we'll reward $800 to whoever creates the PR that gets merged for this activity.
