Add lance format support#7913
Conversation
|
Mentioned #7863 as well |
|
@pdames for vis |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
Cool ! I notice the current implementation doesn't support streaming because of the symlink hack. I believe you can do something like this instead: def _generate_tables(self, paths: list[str]):
for path in paths:
ds = lance.dataset(path)
for frag_idx, fragment in enumerate(ds.get_fragments()):
for batch_idx, batch in enumerate(
fragment.to_batches(columns=self.config.columns, batch_size=self.config.batch_size)
):
table = pa.Table.from_batches([batch])
table = self._cast_table(table)
yield Key(frag_idx, batch_idx), tablenote that path can be a local one, but also a |
|
@lhoestq Take another look? |
|
I took the liberty to make a few changes :) Now I believe we should be good:
I think this PR is ready, just let me know what you think before we merge 🚀 The next steps are:
Feel free to start some drafts (I noticed there are great examples in your HF account now !), I'll be happy to review :) And once Lance is available in huggingface.js and docs are ready we'll be ready to enable the Dataset Viewer and Lance code snippets on HF ! |
| yield Key(frag_idx, batch_idx), self._cast_table(table) | ||
| else: | ||
| for file_idx, lance_file in enumerate(lance_files): | ||
| for batch_idx, batch in enumerate(lance_file.read_all(batch_size=self.config.batch_size).to_batches()): |
There was a problem hiding this comment.
should we support columns pushdown here?
There was a problem hiding this comment.
Just added it at LanceFileReader() initialization, since the argument is not available in read_all()
There was a problem hiding this comment.
Actually, how does it work with multiple data files within same fragment?
In Lance, one fragment can be 1 or more data files, where each data files cover a few columns. This is how we can add new features / column cheaply without rewriting the datasets, by adding new data files to existing fragment.
Maybe we can address it as follow up tasks.
There was a problem hiding this comment.
in that case it's a dataset no ? since it requires a manifest or something to tell what the fragments are made of
LanceFileReader() is only used for single files, i.e. that don't belong to a lance dataset directory or require manifest files
Add lance format as one of the
packaged_modules.