Skip to content

feat: id as number#441

Merged
micheleriva merged 8 commits intooramasearch:mainfrom
h4ad-forks:feat/id-as-number
Jul 18, 2023
Merged

feat: id as number#441
micheleriva merged 8 commits intooramasearch:mainfrom
h4ad-forks:feat/id-as-number

Conversation

@H4ad
Copy link
Copy Markdown
Contributor

@H4ad H4ad commented Jul 9, 2023

/claim #426

This is my attempt to closes #426, in this implementation, I added a new component called internalDocumentIDStore which is responsible for keeping the internal IDs and also has a list to reverse those IDs.

  • Added a new ID Store
  • Added support for all methods to handle the internal ID
  • Keep the API Support with plugins (instead of sending the internal ID, I only send the original ID)
  • Add support for serialization and deserialization (I didn't make it backward compatible if you want this, let me know)
  • Fixed all tests

About the performance, to compare this change with the old behavior, here are the current stats for the old behavior:

To be able to run this quickly, I rebase using #434 otherwise this will take a lot of time.

insert: 884.5955260000192ms
memory: 184.7719955444336MB
save: 172.09495500009507ms
total size: 42.089162826538086 MBs

And the new behavior is:

insert: 798.1346069998108ms
memory: 187.97183227539062MB
save: 132.9290320002474ms
total size: 27.622851371765137 MBs

To check this new behavior locally, checkout to this branch: https://github.com/h4ad-forks/orama/tree/feat/id-as-number-faster

database-size.mjs
import { create, insert, save } from './dist/index.js';
import { writeFileSync } from 'fs';
import crypto from 'crypto';

(async () => {
  const db = await create({
    schema: {
      name: "string"
    }
  });

  let now = performance.now();
  for (let i = 0; i < 1e5; i++) {
    await insert(db, {
      id: crypto.randomUUID(),
      name: Math.random().toString(16).substring(8)
    });
  }
  console.log(`insert: ${performance.now() - now}ms`);
  console.log(`memory: ${process.memoryUsage().heapUsed / 1024 / 1024}MB`);
  now = performance.now();

  const rawState = await save(db);
  console.log(`save: ${performance.now() - now}ms`);

  const jsonState = JSON.stringify(rawState);
  const totalSize = jsonState.length;
  console.log(`total size: ${totalSize / 1024 / 1024} MBs`);

  writeFileSync('./database-size.json', jsonState, 'utf8');
})();
We had a little increase in memory usage (1.7%) but we decreased the index size by 34.36%!

This is what looks like the current serialization:

{
   "internalIdStore":{
      "internalIdToId":[
         "4dca7125-6c6f-461c-9cb0-b0dba12119bc"
      ]
   },
   "index":{
      "indexes":{
         "name":{
            "word":"",
            "subWord":"",
            "children":{
               "b":{
                  "word":"ba2c536",
                  "subWord":"ba2c536",
                  "children":{
                     
                  },
                  "docs":[
                     1
                  ],
                  "end":true
               }
            },
            "docs":[
               
            ],
            "end":false
         }
      },
      "searchableProperties":[
         "name"
      ],
      "searchablePropertiesWithTypes":{
         "name":"string"
      },
      "frequencies":{
         "name":{
            "1":{
               "ba2c536":1
            }
         }
      },
      "tokenOccurrencies":{
         "name":{
            "ba2c536":1
         }
      },
      "avgFieldLength":{
         "name":1
      },
      "fieldLengths":{
         "name":{
            "1":1
         }
      }
   },
   "docs":{
      "docs":{
         "1":{
            "id":"4dca7125-6c6f-461c-9cb0-b0dba12119bc",
            "name":"ba2c536"
         }
      },
      "count":1
   },
   "sorting":{
      "sortableProperties":[
         "name"
      ],
      "sortablePropertiesWithTypes":{
         "name":"string"
      },
      "sorts":{
         "name":{
            "docs":{
               "1":0
            },
            "orderedDocs":[
               [
                  1,
                  "ba2c536"
               ]
            ],
            "type":"string"
         }
      },
      "enabled":true,
      "isSorted":true
   }
}

@vercel
Copy link
Copy Markdown

vercel bot commented Jul 9, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
orama-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 18, 2023 11:57am

@allevo
Copy link
Copy Markdown
Contributor

allevo commented Jul 12, 2023

Hi @H4ad ! Could you resolve the conflicts?

Thanks

Copy link
Copy Markdown
Contributor

@allevo allevo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR seems ok. I left some comments. In particular, related to DocumentID: why do we need to support both, number or string?

delete store.docs[internalId]
store.count--

return true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should remove also from sharedInternalDocumentStore, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we just remove it from sharedInternalDocumentStore.idToInternalId, it's fast enough.

If we try to remove it from sharedInternalDocumentStore.internalIdToId, it will be slower as sorter.

What we can do is perform a cleanup of internalIdToId on serialize.

What do you prefer?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep like that

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure, so no need to remove the id from the sharedInternalDocumentStore, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sorry, we should implement it.
For now, we don't care about the performance during the remove.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, we don't care about the performance during the remove.

Agreed. I'd dedicate a separate PR to this

@@ -0,0 +1,70 @@
import { Orama } from '../types.js';

export type DocumentID = string | number
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why allow both?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea was to introduce less breaking change as possible (I also don't know what is public API and what is internal API).

Someone that changes the implementation of sorter, index, etc... will not need to modify their code in order to accept this change.

But if you want to go full breaking change mode, I can use InternalID in every code (but still return the original ID on search).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok to introduce a little breaking change with this. Regarding the naming. Is this a documented or IndexId?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will bump Orama to v1.1.0

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so I will change all the references for DocumentID to InternalID, and I will only use DocumentID on getByID and when we return the documents from search.

Regarding the naming. Is this a documented or IndexId?

I didn't understand your question but DocumentID is just an alias to reference the ID of the Document that was generated or was passed by the user, we you see this type, is referring to these two cases.

But I think we should add some documentation about it on Orama Docs, just to be clear about how we store IDs to be more efficient.

Copy link
Copy Markdown
Contributor

@allevo allevo Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, understood you point. fine for me

Copy link
Copy Markdown
Contributor

@allevo allevo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, amazing work!

LGTM

Copy link
Copy Markdown
Contributor

@micheleriva micheleriva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Terrific job @H4ad. As always!

LGTM

@micheleriva micheleriva merged commit 47295f1 into oramasearch:main Jul 18, 2023
@H4ad H4ad deleted the feat/id-as-number branch July 18, 2023 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Introduce Orama internal ID for documents

3 participants