[v2] Refactor text tasks to use DataLoader#2198
Conversation
It's not easy because datasets have different column names and most datasets require encoding two columns, and I don’t have a clear solution for handling that. Also in most tasks list of sentences passed to evaluators and there datasets can't be used for now, but we can change that. Additionally, some datasets return a dictionary instead of a dataset, and Pair classification expects all data to be in the first row (as I recall). I could pass the dataset directly and select columns, but that would be a similar approach to using a wrapper. (edited) |
KennethEnevoldsen
left a comment
There was a problem hiding this comment.
So I would really like to see how a Dataloader native abstask would look like. Can we try to do it with just Classification?
I am also afraid of how much this influences throughput - can we do a quick test e.g. using minishlab models?
It is a bit annoying that we have to convert everything in the encode functions (it might be the right solution). We could consider whether it better to just hand of the Dataset object to the model? (but I assume that does not work for images?)
| if isinstance(queries[0], list): | ||
| # Encode only unique queries using the dataloader | ||
| if isinstance(query_list[0], list): | ||
| # For conversations, still use the original encode_conversations method |
There was a problem hiding this comment.
Hmm don't we want to standardize everything?
There was a problem hiding this comment.
We want, but I still don't know what to do with them, because we don't have implementation for any model #1330
There was a problem hiding this comment.
I pinged him. Can't we just convert it to text and keep the "conversation in a column as well??
There was a problem hiding this comment.
I've tried to standardize it in bb2a897, but it is hard to tell if it correct, because I don't know conversational datasets to check results
|
I've updated clustering and classification tasks to use |
KennethEnevoldsen
left a comment
There was a problem hiding this comment.
Looking better.
I added a few comments on the classification.
we should also update the documentation to match (how to implement a custom encoder).
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
|
I've updated Classification evaluator and removed |
KennethEnevoldsen
left a comment
There was a problem hiding this comment.
A few more minor things.
Would love @isaac-chung s opinion on this as well
(would love to see more adaption of tasks to avoid the many dataset transformation)
| rng_state = np.random.default_rng(self.seed) | ||
| rng_state.shuffle(idxs) |
There was a problem hiding this comment.
| rng_state = np.random.default_rng(self.seed) | |
| rng_state.shuffle(idxs) | |
| self.rng_state.shuffle(idxs) |
test and believe they should be eq.
There was a problem hiding this comment.
On the first experiment they're equal and on others they're different. I think this is because we're recreating rng_state on each experiment
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
|
@orionw It would be great if you could review this PR! |
|
Dataloader seems fine to me if it helps make things easier. I am not sure what the benefits are offhand, but I am not opposed. EDIT: wait, maybe I missed that the inputs are now dataloaders. I probably would be hesistant to make large changes then. What's the benefit to doing so? Re: allowing to choose how to combine passage and title, I like the motivation but this is a fairly large change (every single model that anyone ever uses). Could we instead define a custom function that can be overridden, something like "combine_passage_and_title" and have a default? I am hesitant to make such a large API change. We already have a custom function that can be overridden for |
|
Now, the title and passage are computed the same way as before in The main benefit of dataloaders is standardizing input, especially since we now have images and audio, which are difficult to handle otherwise. You can check the discussion in this thread |
|
That makes sense and seems good to have the input change happen with v2 then. It is a lot of changes but shouldn't change anything of substance.
Seems great then. We can add it as an extension if we want but not high priority. |
# Conflicts: # mteb/encoder_interface.py # mteb/evaluation/evaluators/ClassificationEvaluator.py
|
@KennethEnevoldsen We will wait for more reviews, or can we merge this? |
|
Good to merge! |
Ref #1606
Now models will receive
encodefunction Dataloader{ "text": [...], # default text "image": [...], "audio": [...], "body: [...], # models are allowed to construct the text from the body + title if they wish "title: [...], }Code Quality
make lintto maintain consistent style.