
[v2] Refactor text tasks to use DataLoader#2198

Merged
Samoed merged 22 commits into v2.0.0 from integrate_dataloaders on Mar 7, 2025
Conversation

@Samoed

@Samoed Samoed commented Feb 28, 2025

Member

Ref #1606

Models will now receive a DataLoader in their encode function, with batches of the form:

{
   "text": [...],  # default text
   "image": [...],
   "audio": [...],
   "body": [...],  # models are allowed to construct the text from the body + title if they wish
   "title": [...],
}
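
To make the batch format above concrete, here is a minimal sketch of an encoder that consumes such batches. It assumes that iterating the DataLoader yields dictionaries shaped like the one above. `ToyEncoder`, its `encode` signature, and the two-number "embedding" are hypothetical stand-ins rather than the mteb API, and a plain list of dicts stands in for a real `torch.utils.data.DataLoader` so the snippet runs without extra dependencies.

```python
from typing import Iterable


class ToyEncoder:
    """Hypothetical encoder; not the actual mteb interface."""

    def encode(self, batches: Iterable[dict[str, list]]) -> list[list[float]]:
        embeddings: list[list[float]] = []
        for batch in batches:
            # "text" is the default field; other keys may be present too.
            for text in batch["text"]:
                # Toy "embedding": character count and word count.
                embeddings.append([float(len(text)), float(len(text.split()))])
        return embeddings


# A plain list of batch dicts stands in for a DataLoader here.
batches = [
    {
        "text": ["Title one body one", "Title two"],
        "title": ["Title one", "Title two"],
        "body": ["body one", ""],
    },
    {"text": ["single"], "title": [""], "body": ["single"]},
]
vectors = ToyEncoder().encode(batches)
```

A model that prefers its own title/body layout could read `batch["title"]` and `batch["body"]` instead of `batch["text"]` in the same loop.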

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

@Samoed

Samoed commented Mar 1, 2025

Member Author

> Right now it is very much a quick wrapper. Wouldn’t we prefer directly working with the dataset for datasets? (I know that this is more code to write)

It's not easy, because datasets have different column names and most datasets require encoding two columns, and I don't have a clear solution for handling that. Also, in most tasks a list of sentences is passed to the evaluators, where datasets can't be used for now, but we can change that. Additionally, some datasets return a dictionary instead of a dataset, and pair classification expects all data to be in the first row (as I recall). I could pass the dataset directly and select columns, but that would be a similar approach to using a wrapper.
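
One way to picture the wrapper approach described above: rows with task-specific column names are re-batched into dictionaries with a common "text" key, once per column that needs encoding. The helper `batches_from_column` and the example column names are hypothetical, chosen only for illustration.

```python
from typing import Iterator


def batches_from_column(
    rows: list[dict], column: str, batch_size: int = 2
) -> Iterator[dict[str, list[str]]]:
    """Yield batches exposing `column` under the common "text" key."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start : start + batch_size]
        yield {"text": [row[column] for row in chunk]}


# A pair-style dataset needs two passes, one per column to encode.
pairs = [
    {"sentence1": "a cat", "sentence2": "a kitten"},
    {"sentence1": "a dog", "sentence2": "a puppy"},
    {"sentence1": "a car", "sentence2": "a truck"},
]
first = list(batches_from_column(pairs, "sentence1"))
second = list(batches_from_column(pairs, "sentence2"))
```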

Samoed changed the title from "update text tasks except retrieval" to "[v2] Refactor text tasks to use DataLoader" on Mar 1, 2025
Samoed marked this pull request as ready for review March 1, 2025 15:21

@KennethEnevoldsen KennethEnevoldsen left a comment

Contributor

I would really like to see what a DataLoader-native AbsTask would look like. Can we try it with just Classification?

I am also concerned about how much this affects throughput. Can we do a quick test, e.g. using minishlab models?

It is a bit annoying that we have to convert everything in the encode functions (though it might be the right solution). We could consider whether it is better to just hand off the Dataset object to the model (but I assume that does not work for images?).

Comment thread mteb/data_loading_utils.py Outdated
Comment thread mteb/encoder_interface.py Outdated
Comment thread mteb/data_loading_utils.py Outdated
if isinstance(queries[0], list):
# Encode only unique queries using the dataloader
if isinstance(query_list[0], list):
# For conversations, still use the original encode_conversations method

Contributor

Hmm, don't we want to standardize everything?

Member Author

We do, but I still don't know what to do with them, because we don't have an implementation for any model (#1330).

Contributor

I pinged him. Can't we just convert it to text and keep the "conversation" in a column as well?

Member Author

Yes, I can change it like that.

Member Author

I've tried to standardize it in bb2a897, but it is hard to tell whether it is correct, because I don't know of any conversational datasets to check the results against.
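
The suggestion in this thread (flatten each conversation to text while keeping the original turns in their own column) could look roughly like the sketch below. The helper name `flatten_conversation`, the newline separator, and the assumption that a conversation is a list of strings are all illustrative; the actual change in bb2a897 may differ.

```python
def flatten_conversation(turns: list[str], separator: str = "\n") -> dict:
    """Return a row with a plain-text view plus the untouched turns."""
    return {
        "text": separator.join(turns),  # what text-only models would encode
        "conversation": list(turns),    # kept for models that handle turns
    }


row = flatten_conversation(["Hi, I need a recipe.", "Sure, for which dish?"])
```

With this layout, a model that ignores conversations reads only `row["text"]`, while a conversation-aware model can still access the individual turns.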

Comment thread mteb/models/cohere_models.py Outdated
@Samoed

Samoed commented Mar 2, 2025

Member Author

I've updated clustering and classification tasks to use Dataset more natively

@KennethEnevoldsen KennethEnevoldsen left a comment

Contributor

Looking better.

I added a few comments on the classification.

We should also update the documentation to match (how to implement a custom encoder).

Comment thread mteb/create_dataloaders.py Outdated
Comment thread mteb/evaluation/evaluators/ClassificationEvaluator.py Outdated
Comment thread mteb/evaluation/evaluators/ClassificationEvaluator.py
Comment thread mteb/abstasks/AbsTaskClassification.py Outdated
Comment thread mteb/abstasks/AbsTaskClassification.py Outdated
Samoed and others added 2 commits March 4, 2025 11:01
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
@Samoed

Samoed commented Mar 4, 2025

Member Author

I've updated the Classification evaluator and removed create_dataloader. What else do you want to change?

@KennethEnevoldsen KennethEnevoldsen left a comment

Contributor

A few more minor things.

Would love @isaac-chung's opinion on this as well.

(I would love to see more adaptation of tasks to avoid the many dataset transformations.)

Comment on lines 179 to 180
rng_state = np.random.default_rng(self.seed)
rng_state.shuffle(idxs)

Contributor

Suggested change
- rng_state = np.random.default_rng(self.seed)
- rng_state.shuffle(idxs)
+ self.rng_state.shuffle(idxs)

Please test this; I believe they should be equivalent.

@Samoed Samoed Mar 4, 2025

Member Author

In the first experiment they're equal, but in the others they differ. I think this is because we're recreating rng_state in each experiment.
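
The behaviour described here can be reproduced in isolation. The sketch below uses the standard library's `random.Random` as a stand-in for `np.random.default_rng` (an assumption made to keep it dependency-free). The pattern is the same: a generator re-created from the seed replays the same shuffle in every experiment, while a shared generator matches it only on the first one.

```python
import random

SEED = 42
N_EXPERIMENTS = 3


def shuffled(rng: random.Random) -> list[int]:
    """Shuffle a fresh index list with the given generator."""
    idxs = list(range(20))
    rng.shuffle(idxs)
    return idxs


# Variant A: re-create the generator in every experiment.
recreated = [shuffled(random.Random(SEED)) for _ in range(N_EXPERIMENTS)]

# Variant B: one generator shared across all experiments.
shared_rng = random.Random(SEED)
shared = [shuffled(shared_rng) for _ in range(N_EXPERIMENTS)]

print(recreated[0] == shared[0])  # the first experiment agrees
print(recreated[1] == shared[1])  # later experiments generally do not
```

Which variant is preferable depends on whether each experiment is meant to see an independent shuffle or a reproducible one in isolation.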

Comment thread mteb/abstasks/AbsTaskClassification.py Outdated
Comment thread mteb/abstasks/AbsTaskClusteringFast.py Outdated
Comment thread mteb/abstasks/AbsTaskMultilabelClassification.py Outdated
Comment thread mteb/encoder_interface.py
Comment thread mteb/encoder_interface.py Outdated
Comment thread mteb/encoder_interface.py
Samoed and others added 2 commits March 4, 2025 14:47
@Samoed Samoed requested a review from orionw March 4, 2025 13:36
@Samoed

Samoed commented Mar 4, 2025

Member Author

@orionw It would be great if you could review this PR!

@orionw

orionw commented Mar 4, 2025

Contributor

Dataloader seems fine to me if it helps make things easier. I am not sure what the benefits are offhand, but I am not opposed. EDIT: wait, maybe I missed that the inputs are now dataloaders. I would probably be hesitant to make large changes then. What's the benefit of doing so?

Re: allowing models to choose how to combine passage and title: I like the motivation, but this is a fairly large change (it affects every single model that anyone ever uses).

Could we instead define a custom function that can be overridden, something like "combine_passage_and_title", and have a default? I am hesitant to make such a large API change. We already have a custom function that can be overridden for combine_query_and_instruction.
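
The override hook proposed here might look like the following sketch. Only the name `combine_passage_and_title` comes from the comment above; the class names, the signature, and the default of joining title and body with a space are assumptions for illustration, not existing mteb code.

```python
class BaseEncoder:
    """Hypothetical base class with an overridable combination hook."""

    def combine_passage_and_title(self, title: str, body: str) -> str:
        # Assumed default: title first, then body, skipping empty parts.
        return " ".join(part for part in (title, body) if part)


class BracketedTitleEncoder(BaseEncoder):
    """Example model that wants its own layout."""

    def combine_passage_and_title(self, title: str, body: str) -> str:
        return f"[{title}] {body}" if title else body


default_text = BaseEncoder().combine_passage_and_title("Paris", "Capital of France.")
custom_text = BracketedTitleEncoder().combine_passage_and_title("Paris", "Capital of France.")
```

Models that never override the hook keep the default behaviour, which is what limits the size of the API change.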

@Samoed

Samoed commented Mar 4, 2025

Member Author

The title and passage are now combined the same way as before in the text field, so this won't break anything. However, we could add a function to allow overriding if needed.

The main benefit of dataloaders is standardizing input, especially since we now have images and audio, which are difficult to handle otherwise. You can check the discussion in this thread.

@orionw

orionw commented Mar 4, 2025

Contributor

That makes sense, and it seems good to have the input change happen with v2 then. It is a lot of changes, but it shouldn't change anything of substance.

> Now, the title and passage are computed the same way as before in text field, so this won't break anything. However, we could add a function to allow overriding if needed.

Seems great then. We can add it as an extension if we want but not high priority.

# Conflicts:
#	mteb/encoder_interface.py
#	mteb/evaluation/evaluators/ClassificationEvaluator.py
@Samoed

Samoed commented Mar 7, 2025

Member Author

@KennethEnevoldsen Should we wait for more reviews, or can we merge this?

@KennethEnevoldsen

Contributor

Good to merge!

@Samoed Samoed merged commit bd33a33 into v2.0.0 Mar 7, 2025
@Samoed Samoed deleted the integrate_dataloaders branch March 7, 2025 15:26

3 participants