Is there any specific advantage to using Elo-style benchmarking instead of the general benchmarking we do with LLMs?
These are different situations:
For LLMs, we can't boil it down to a single metric, because LLMs are so general-purpose. And the metrics that we optimize during training are not the same ones we would use for evaluation.
For tabular supervised machine learning, we can easily say which model is better, at least for a given task, because we typically have a meaningful evaluation metric. But across different tasks with different metrics (regression, binary classification, and multi-class classification), we can't aggregate the metrics directly. So Elo is one way to aggregate across tasks. In my opinion it is a meaningful one, but with the caveat that it drops the margin by which models are winning.
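To make the "drops the margin" caveat concrete, here is a minimal sketch of Elo-style aggregation over per-dataset results. It is an illustration only, not TabArena's actual implementation: the function name `elo_from_results`, the K-factor, the starting rating, and the toy error values are all assumptions made up for this example.

```python
from itertools import combinations


def elo_from_results(errors, k=32.0, base=1000.0, rounds=50):
    """Toy Elo ratings from per-dataset errors (lower error wins).

    errors: {dataset_name: {model_name: error}}
    Hypothetical helper for illustration; the constants are arbitrary.
    """
    models = sorted({m for per_model in errors.values() for m in per_model})
    rating = {m: base for m in models}
    for _ in range(rounds):
        for dataset in sorted(errors):
            per_model = errors[dataset]
            for a, b in combinations(sorted(per_model), 2):
                # Only the outcome (win / tie / loss) enters the update,
                # never the size of the gap between the two errors.
                if per_model[a] == per_model[b]:
                    score_a = 0.5
                else:
                    score_a = 1.0 if per_model[a] < per_model[b] else 0.0
                expected_a = 1.0 / (1.0 + 10 ** ((rating[b] - rating[a]) / 400.0))
                delta = k * (score_a - expected_a)
                rating[a] += delta
                rating[b] -= delta
    return rating


# Model A wins on both datasets: once by a hair, once by a wide margin.
narrow_wins = {"d1": {"A": 0.100, "B": 0.101}, "d2": {"A": 0.50, "B": 0.51}}
wide_wins = {"d1": {"A": 0.1, "B": 0.9}, "d2": {"A": 0.5, "B": 5.0}}

ratings_narrow = elo_from_results(narrow_wins)
ratings_wide = elo_from_results(wide_wins)
```

Both inputs produce identical ratings, because the update only sees who won each pairing. The upside is that datasets with incomparable metrics (RMSE on one, log loss on another) can be pooled; the downside is exactly the lost margin mentioned above. Note that this sequential update also depends on the order of comparisons, which is one reason real benchmarks may prefer a different fitting procedure.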
Thanks for this article presenting TabArena, very instructive!
Looking at its recent use for modern AutoML and TFM models, I have some questions about TabArena:
- How diverse and representative are these final 51 datasets with respect to real-world use cases?
- How can we ensure that one (or several) of these validation datasets is not highly similar to one of the synthetic datasets the TFMs were pre-trained on (which could lead to data leakage and an over-optimistic performance assessment)?
- As Cassie Kozyrkov explains in one of her ML lectures, "repeatedly validating model after model pollutes your validation data and erodes your protection against overfitting". In this case, is it planned to add or change datasets in TabArena so that newer models do not overfit on the validation sets?
Curious about your thoughts on these questions :)
TabArena is certainly not representative in any way. It’s impossible to even create a representative set of datasets, as this would require knowing the actual distribution of datasets or having a way to sample from it. Something like TabArena can only ever be a pointer. Ultimately, you can only know performance on your specific task if you compare approaches on it. For the question about similarity, it would probably be possible to create similar datasets. But the way the synthetic datasets are created is defined about two levels of abstraction higher up, so there is no one sitting there creating datasets by hand. Rather, it’s a procedure that samples structural causal graphs and, from those, samples data. And for your last point: I guess there will be some effect from multiple evaluations, but I would assume it’s not as pronounced, because it’s 51 datasets, not just 1. But I can’t say how much a benchmark like TabArena will be affected.
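To illustrate what "samples structural causal graphs and, from that, samples data" can look like, here is a toy sketch. It is not the prior used by any actual TFM: the function name `sample_scm_dataset`, the edge probability, the `tanh` mechanism, and the noise levels are all assumptions chosen for this example.

```python
import math
import random


def sample_scm_dataset(n_rows=100, n_nodes=5, edge_prob=0.5, seed=0):
    """Sample a random causal graph, then sample a table from it.

    Hypothetical toy generator for illustration only.
    """
    rng = random.Random(seed)
    # Step 1: sample the graph. Node j may only have parents i < j,
    # which guarantees the graph is acyclic.
    parents = {
        j: [i for i in range(j) if rng.random() < edge_prob]
        for j in range(n_nodes)
    }
    weights = {(i, j): rng.gauss(0.0, 1.0) for j in parents for i in parents[j]}
    # Step 2: sample rows by propagating noise through the graph.
    rows = []
    for _ in range(n_rows):
        values = []
        for j in range(n_nodes):
            signal = sum(weights[(i, j)] * values[i] for i in parents[j])
            # Root nodes are pure noise; other nodes add small noise.
            noise_std = 0.1 if parents[j] else 1.0
            values.append(math.tanh(signal) + rng.gauss(0.0, noise_std))
        rows.append(values)
    # Treat the last node as the target and the rest as features.
    features = [row[:-1] for row in rows]
    target = [row[-1] for row in rows]
    return parents, features, target


graph, X, y = sample_scm_dataset(seed=42)
```

Each call with a different seed yields a different graph and therefore a different dataset, so nobody designs individual datasets by hand. Whether such a generator ever produces something close to a given real benchmark dataset is the open question raised above.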