feat: id as number by H4ad · Pull Request #441 · oramasearch/orama

H4ad · 2023-07-09T13:33:16Z

/claim #426

This is my attempt to close #426. In this implementation, I added a new component called internalDocumentIDStore, which is responsible for keeping the internal IDs and also has a list to map those IDs back to the original ones.

Added a new ID Store
Added support for all methods to handle the internal ID
Keep the API Support with plugins (instead of sending the internal ID, I only send the original ID)
Add support for serialization and deserialization (I didn't make it backward compatible; if you want that, let me know)
Fixed all tests
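
The ID store described above can be pictured as a pair of lookups: a map from the original ID to a compact numeric ID, and an array to go back. The following is a minimal sketch under assumptions, not the code in this PR: the names `IdStore`, `createIdStore`, `getInternalId`, and `getOriginalId` are hypothetical, numeric user IDs are simply stringified, and starting internal IDs at 1 is inferred from the serialized example in this description.

```typescript
// Hypothetical sketch of an internal ID store; names are illustrative, not Orama's API.
type DocumentID = string | number;
type InternalID = number;

interface IdStore {
  idToInternalId: Map<string, InternalID>; // original ID -> internal ID
  internalIdToId: string[];                // index (internal ID - 1) -> original ID
}

function createIdStore(): IdStore {
  return { idToInternalId: new Map<string, InternalID>(), internalIdToId: [] };
}

// Returns the existing internal ID for `id`, or assigns the next one (starting at 1).
function getInternalId(store: IdStore, id: DocumentID): InternalID {
  const key = String(id); // assumption: numeric user IDs are stored as strings
  const known = store.idToInternalId.get(key);
  if (known !== undefined) return known;
  store.internalIdToId.push(key);
  const internalId = store.internalIdToId.length; // 1-based
  store.idToInternalId.set(key, internalId);
  return internalId;
}

// Maps an internal ID back to the original ID, if it was ever assigned.
function getOriginalId(store: IdStore, internalId: InternalID): string | undefined {
  return store.internalIdToId[internalId - 1];
}

const demoIdStore = createIdStore();
const firstInternalId = getInternalId(demoIdStore, "4dca7125-6c6f-461c-9cb0-b0dba12119bc");
const secondInternalId = getInternalId(demoIdStore, "another-id");
console.log(firstInternalId, secondInternalId, getOriginalId(demoIdStore, firstInternalId));
```

The index and sorter can then store small numbers instead of repeating a 36-character UUID for every token, which is presumably where the size reduction reported in this description comes from.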

About performance: to compare this change with the old behavior, I rebased on #434 so the benchmark runs quickly; otherwise it would take a long time. Here are the stats for the old behavior:

insert: 884.5955260000192ms
memory: 184.7719955444336MB
save: 172.09495500009507ms
total size: 42.089162826538086 MBs

And the new behavior is:

insert: 798.1346069998108ms
memory: 187.97183227539062MB
save: 132.9290320002474ms
total size: 27.622851371765137 MBs

To check this new behavior locally, check out this branch: https://github.com/h4ad-forks/orama/tree/feat/id-as-number-faster

database-size.mjs

import { create, insert, save } from './dist/index.js';
import { writeFileSync } from 'fs';
import crypto from 'crypto';

(async () => {
  const db = await create({
    schema: {
      name: "string"
    }
  });

  let now = performance.now();
  for (let i = 0; i < 1e5; i++) {
    await insert(db, {
      id: crypto.randomUUID(),
      name: Math.random().toString(16).substring(8)
    });
  }
  console.log(`insert: ${performance.now() - now}ms`);
  console.log(`memory: ${process.memoryUsage().heapUsed / 1024 / 1024}MB`);
  now = performance.now();

  const rawState = await save(db);
  console.log(`save: ${performance.now() - now}ms`);

  const jsonState = JSON.stringify(rawState);
  const totalSize = jsonState.length;
  console.log(`total size: ${totalSize / 1024 / 1024} MBs`);

  writeFileSync('./database-size.json', jsonState, 'utf8');
})();

Memory usage increased slightly (about 1.7%), but the serialized index size decreased by about 34.4%!
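
For reference, the percentages can be recomputed from the raw benchmark figures quoted in this description. This is a small sketch; `percentChange` is a hypothetical helper introduced only for the arithmetic.

```typescript
// Relative change between two measurements, as a percentage of the old value.
function percentChange(before: number, after: number): number {
  return ((after - before) / before) * 100;
}

// Figures copied from the benchmark output in this description.
const memoryChange = percentChange(184.7719955444336, 187.97183227539062);
const sizeChange = percentChange(42.089162826538086, 27.622851371765137);
const insertChange = percentChange(884.5955260000192, 798.1346069998108);
const saveChange = percentChange(172.09495500009507, 132.9290320002474);

console.log(`memory: ${memoryChange.toFixed(2)}%`);
console.log(`total size: ${sizeChange.toFixed(2)}%`);
console.log(`insert: ${insertChange.toFixed(2)}%`);
console.log(`save: ${saveChange.toFixed(2)}%`);
```

A positive value means the new behavior uses more of that resource; a negative value means it uses less. These come from a single run, so the timing differences in particular should be read as indicative rather than precise.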

This is what the serialization currently looks like:

{
   "internalIdStore":{
      "internalIdToId":[
         "4dca7125-6c6f-461c-9cb0-b0dba12119bc"
      ]
   },
   "index":{
      "indexes":{
         "name":{
            "word":"",
            "subWord":"",
            "children":{
               "b":{
                  "word":"ba2c536",
                  "subWord":"ba2c536",
                  "children":{
                     
                  },
                  "docs":[
                     1
                  ],
                  "end":true
               }
            },
            "docs":[
               
            ],
            "end":false
         }
      },
      "searchableProperties":[
         "name"
      ],
      "searchablePropertiesWithTypes":{
         "name":"string"
      },
      "frequencies":{
         "name":{
            "1":{
               "ba2c536":1
            }
         }
      },
      "tokenOccurrencies":{
         "name":{
            "ba2c536":1
         }
      },
      "avgFieldLength":{
         "name":1
      },
      "fieldLengths":{
         "name":{
            "1":1
         }
      }
   },
   "docs":{
      "docs":{
         "1":{
            "id":"4dca7125-6c6f-461c-9cb0-b0dba12119bc",
            "name":"ba2c536"
         }
      },
      "count":1
   },
   "sorting":{
      "sortableProperties":[
         "name"
      ],
      "sortablePropertiesWithTypes":{
         "name":"string"
      },
      "sorts":{
         "name":{
            "docs":{
               "1":0
            },
            "orderedDocs":[
               [
                  1,
                  "ba2c536"
               ]
            ],
            "type":"string"
         }
      },
      "enabled":true,
      "isSorted":true
   }
}
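
Only `internalIdToId` appears in the serialized state above, so the reverse map from original ID to internal ID presumably has to be rebuilt on load. A minimal sketch, assuming 1-based internal IDs as in the example (document `1` corresponds to the first array entry); `SerializedIdStore` and `rebuildIdToInternalId` are hypothetical names, not Orama's API.

```typescript
// Shape of the "internalIdStore" section in the serialized example.
interface SerializedIdStore {
  internalIdToId: string[];
}

// Rebuilds the original-ID -> internal-ID lookup from the serialized array.
function rebuildIdToInternalId(raw: SerializedIdStore): Map<string, number> {
  const idToInternalId = new Map<string, number>();
  raw.internalIdToId.forEach((originalId, index) => {
    idToInternalId.set(originalId, index + 1); // assumption: internal IDs start at 1
  });
  return idToInternalId;
}

const serializedIdStore: SerializedIdStore = JSON.parse(
  '{"internalIdToId":["4dca7125-6c6f-461c-9cb0-b0dba12119bc"]}'
);
const rebuiltLookup = rebuildIdToInternalId(serializedIdStore);
console.log(rebuiltLookup.get("4dca7125-6c6f-461c-9cb0-b0dba12119bc"));
```

Storing only one direction keeps each original ID in the serialized index-related sections once, at the cost of a linear pass when loading.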

allevo · 2023-07-12T15:00:17Z

Hi @H4ad! Could you resolve the conflicts?

Thanks

allevo

This PR seems ok. I left some comments. In particular, related to DocumentID: why do we need to support both, number or string?

packages/orama/src/components/algorithms.ts

allevo · 2023-07-12T15:17:52Z

packages/orama/src/components/documents-store.ts

+  delete store.docs[internalId]
  store.count--

  return true


We should also remove it from sharedInternalDocumentStore, right?

If we just remove it from sharedInternalDocumentStore.idToInternalId, it's fast enough.

If we try to remove it from sharedInternalDocumentStore.internalIdToId, it will be slower, like removal in the sorter.

What we can do is clean up internalIdToId on serialize.

What do you prefer?

Keep it like that.

Just to make sure: there's no need to remove the ID from sharedInternalDocumentStore, right?

No, sorry, we should implement it.
For now, we don't care about the performance during the remove.

> For now, we don't care about the performance during the remove.

Agreed. I'd dedicate a separate PR to this.
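
The trade-off discussed in this thread can be sketched as follows. This is an illustration only, not what this PR or any follow-up implements: deleting from the forward map is a constant-time operation, while the reverse array is left with a tombstone so that the internal IDs of other documents don't shift. The names `RemovableIdStore` and `removeId`, and the tombstone approach itself, are assumptions.

```typescript
// Hypothetical store layout; the reverse array may contain tombstones (undefined).
interface RemovableIdStore {
  idToInternalId: Map<string, number>;
  internalIdToId: (string | undefined)[];
}

// Removes `id` from both lookups without renumbering any other document.
function removeId(store: RemovableIdStore, id: string): boolean {
  const internalId = store.idToInternalId.get(id);
  if (internalId === undefined) return false;
  store.idToInternalId.delete(id);                  // cheap: single map deletion
  store.internalIdToId[internalId - 1] = undefined; // tombstone keeps later IDs stable
  return true;
}

const removableStore: RemovableIdStore = {
  idToInternalId: new Map<string, number>([["a", 1], ["b", 2], ["c", 3]]),
  internalIdToId: ["a", "b", "c"],
};
const wasRemoved = removeId(removableStore, "b");
console.log(wasRemoved, removableStore.idToInternalId.get("c"), removableStore.internalIdToId);
```

Actually compacting the array (for example during serialization) would change the internal IDs of every later document, so each structure that stores those IDs would need to be remapped as well, which is likely why it was described as the slower option.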

packages/orama/src/methods/search.ts

allevo · 2023-07-12T15:29:06Z

packages/orama/src/components/internal-document-id-store.ts

@@ -0,0 +1,70 @@
+import { Orama } from '../types.js';
+
+export type DocumentID = string | number


Why allow both?

My idea was to introduce as few breaking changes as possible (I also don't know what is public API and what is internal API).

Someone who changes the implementation of the sorter, index, etc. will not need to modify their code to accept this change.

But if you want to go full breaking-change mode, I can use InternalID everywhere in the code (but still return the original ID on search).

I'm ok with introducing a small breaking change with this. Regarding the naming: is this a documented or IndexId?

This will bump Orama to v1.1.0.

Ok, so I will change all the references from DocumentID to InternalID, and I will only use DocumentID in getByID and when we return the documents from search.

> Regarding the naming: is this a documented or IndexId?

I didn't understand your question, but DocumentID is just an alias for the ID of the document, whether it was generated or passed by the user; when you see this type, it refers to one of these two cases.

But I think we should add some documentation about it to the Orama docs, just to be clear about how we store IDs more efficiently.

Ok, I understand your point. Fine for me.
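
The split agreed on in this thread, internal numeric IDs everywhere except at the public boundary, might look like the sketch below. All names here (`InternalDocumentID`, `Doc`, `DocStore`, `Hit`, `toPublicHits`, and this simplified `getByID`) are hypothetical and the store layout is reduced to the minimum; it only illustrates translating IDs at the edges.

```typescript
type DocumentID = string | number;  // what the user passes in and gets back
type InternalDocumentID = number;   // what internal components would store

interface Doc {
  id: string;
  title: string;
}

interface DocStore {
  idToInternalId: Map<string, InternalDocumentID>;
  internalIdToId: string[];
  docs: Record<InternalDocumentID, Doc>;
}

interface Hit {
  id: string;
  document: Doc;
}

// Public entry point: accepts the user's ID and translates it on the way in.
function getByID(store: DocStore, id: DocumentID): Doc | undefined {
  const internalId = store.idToInternalId.get(String(id));
  return internalId === undefined ? undefined : store.docs[internalId];
}

// Search results: internal IDs are translated back on the way out.
function toPublicHits(store: DocStore, internalIds: InternalDocumentID[]): Hit[] {
  const hits: Hit[] = [];
  for (const internalId of internalIds) {
    const originalId = store.internalIdToId[internalId - 1];
    const doc = store.docs[internalId];
    if (originalId !== undefined && doc !== undefined) {
      hits.push({ id: originalId, document: doc });
    }
  }
  return hits;
}

const demoDocStore: DocStore = {
  idToInternalId: new Map<string, InternalDocumentID>([["doc-a", 1], ["doc-b", 2]]),
  internalIdToId: ["doc-a", "doc-b"],
  docs: { 1: { id: "doc-a", title: "first" }, 2: { id: "doc-b", title: "second" } },
};
console.log(getByID(demoDocStore, "doc-b"), toPublicHits(demoDocStore, [2, 1]));
```

With this shape, custom index or sorter implementations only ever see `InternalDocumentID`, which is the small breaking change mentioned above.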

packages/orama/src/methods/search.ts

packages/orama/tests/group.test.ts

allevo

Again, amazing work!

LGTM

micheleriva

Terrific job @H4ad. As always!

LGTM

algora-pbc bot mentioned this pull request Jul 9, 2023

Introduce Orama internal ID for documents #426

Closed

vercel bot deployed to Preview July 9, 2023 13:34 View deployment

H4ad added 4 commits July 11, 2023 09:26

perf: use internal id instead of string id

3a88cd2

refactor: move the internal id to orama

3dc5bd4

fixup! perf: use internal id instead of string id

7f16262

fixup! refactor: move the internal id to orama

a38619a

H4ad force-pushed the feat/id-as-number branch from 786b16d to a38619a Compare July 11, 2023 12:28

vercel bot deployed to Preview July 11, 2023 12:30 View deployment

Merge branch 'main' into feat/id-as-number

4c33681

vercel bot deployed to Preview July 12, 2023 15:14 View deployment

allevo reviewed Jul 12, 2023

View reviewed changes

H4ad added 2 commits July 12, 2023 23:26

fixup! refactor: move the internal id to orama

a38241b

fixup! refactor: move the internal id to orama

79a29a7

vercel bot deployed to Preview July 13, 2023 02:31 View deployment

H4ad requested a review from allevo July 15, 2023 12:52

Merge branch 'main' into feat/id-as-number

60dd5cd

vercel bot deployed to Preview July 18, 2023 11:57 View deployment

allevo mentioned this pull request Jul 18, 2023

Align documentation after PRs #448

Closed

allevo approved these changes Jul 18, 2023

View reviewed changes

micheleriva approved these changes Jul 18, 2023

View reviewed changes

micheleriva merged commit 47295f1 into oramasearch:main Jul 18, 2023

H4ad deleted the feat/id-as-number branch July 18, 2023 15:08

allevo mentioned this pull request Jul 18, 2023

feat: first draft of id shortener #438

Closed

H4ad mentioned this pull request Dec 8, 2023

Reduce memory usage or build index size? #573

Closed
