Samuel Müller
477 posts
Samuel Müller
@SamuelMullr
Datadog Post-Training and advising at PriorLabs. Ex-Meta, Ex-DeepL, Ex-Amazon. ETH BSc, Cambridge MPhil, PhD from Freiburg. Opinions are my own. (he/him)
Berlin
samuelgabriel.github.io
Joined February 2020
439 Following · 1,491 Followers
  • Samuel Müller
    @SamuelMullr
    Nov 9, 2023
    There is a paper by Google trending right now that claims transformer in-context learning cannot generalize between two function classes. I have reproduced their experiment in a Colab and come to a very different conclusion...
    379K
  • Samuel Müller
    @SamuelMullr
    Jan 8, 2025
    This might be the first time in 10 years that boosted trees are not the best default choice when working with data in tables. Instead, a pre-trained neural network is: the new TabPFN, as we just published in Nature 🎉
    145K
  • Samuel Müller
    @SamuelMullr
    Feb 8, 2022
    Transformers Can Do Bayesian Inference (arxiv.org/abs/2112.10510, ICLR '22) We show how to train Transformers to (pretty exactly) approximate Bayesian predictions in a single forward pass for any prior you can sample from. It yields >200x speedups compared to VI and MCMC. (1/n)
    arxiv.org
    Transformers Can Do Bayesian Inference
    Currently, it is hard to reap the benefits of deep learning for Bayesian methods, which allow the explicit specification of prior knowledge and accurately capture model uncertainty. We present...
  • Samuel Müller
    @SamuelMullr
    Jul 24, 2023
    Replying to @EvMill
    This trick is not new; it is even part of the standard torch implementation of multi-head attention. The option is called add_zero_attn. It adds a zero to the logits, resulting in a one in the denominator, as e^0=1. pytorch.org/docs/stable/ge…
    19K
  • Samuel Müller
    @SamuelMullr
    Nov 9, 2023
    Replying to @SamuelMullr
    There are two important things to note here: i) This experiment is very reproducible. ii) This fits with the interpretation of in-context learning as an approximation to the posterior predictive distribution (PPD) that we push in our PFN work, see
    arxiv.org
    Transformers Can Do Bayesian Inference
    Currently, it is hard to reap the benefits of deep learning for Bayesian methods, which allow the explicit specification of prior knowledge and accurately capture model uncertainty. We present...
    16K
  • Samuel Müller
    @SamuelMullr
    Nov 9, 2023
    Replying to @SamuelMullr
    The conclusion: think about in-context learning as approximating the PPD and think about your training data as a prior, like we do in our (Tab)PFN work. Have fun with my colab: colab.research.google.com/drive/1f04iqJE… For more background see our 2021 paper: arxiv.org/abs/2112.10510
    13K
  • Samuel Müller
    @SamuelMullr
    Nov 9, 2023
    Replying to @SamuelMullr
    And my super small transformer (4 layers, 256 embedding size) was able to generalize to a sloped sine after 5 minutes of training on a single colab GPU. Here are its predictions:
    17K
  • Samuel Müller
    @SamuelMullr
    Jan 8, 2025
    Replying to @SamuelMullr
    Thanks to all of my collaborators; without you this would not have been possible. 🙏🏼🙏🏼 If you want to use our model, check out the following: Nature article: nature.com/articles/s4158… Try on free cloud: github.com/PriorLabs/tabp… Try locally (GPU recommended): github.com/PriorLabs/TabP…
    Accurate predictions on small data with a tabular foundation model
    From nature.com
    3.6K
  • Samuel Müller
    @SamuelMullr
    Nov 9, 2023
    Replying to @SamuelMullr
    They (arxiv.org/pdf/2311.00871…) claim that a transformer trained to predict hold-out data from datasets sampled from either sine curves or sloped lines can't generalize to predict a sloped sine.
    17K
  • Samuel Müller
    @SamuelMullr
    Nov 9, 2023
    Replying to @SamuelMullr
    This interpretation requires that there is some probability that the data seen during in-context learning comes out of either distribution, the sine curves or the sloped lines. I made sure that this is the case by adding noise to all datapoints and the sine, see the sample.
    14K
  • Samuel Müller
    @SamuelMullr
    Nov 9, 2023
    Replying to @SamuelMullr
    I tried it out and trained on data like they describe. Here is a sample of my training data.
    16K
  • Samuel Müller
    @SamuelMullr
    Nov 9, 2023
    Replying to @SamuelMullr
    Now our predictions become worse... but there still seems to be some generalization for some reason...
    12K
  • Samuel Müller
    @SamuelMullr
    Nov 9, 2023
    Replying to @SamuelMullr
    If we do not add noise, and thus make a sloped sine have 0 probability, the PPD is actually not defined. Our training data then looks like this:
    13K
  • Samuel Müller
    @SamuelMullr
    Jan 8, 2025
    Replying to @SamuelMullr
    The new TabPFN is clearly the best single method for all regression and classification problems up to 10,000 data points and 500 features in our eval, no other limits. Not only that, it requires only a few seconds to train on new data, while the baselines are tuned for 4 hours.
    4.1K
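
The Jul 24, 2023 reply above says that adding a zero to the attention logits puts a one in the softmax denominator, since e^0=1. A minimal sketch of that identity in plain Python, assuming a single query and a made-up list of logits; the function names are illustrative and are not part of PyTorch:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_with_zero_logit(logits):
    """Append one extra logit fixed at 0, then drop its weight.

    The returned weights belong to the original entries only.
    """
    return softmax(list(logits) + [0.0])[:-1]


def softmax_one_in_denominator(logits):
    """Same quantity written directly: exp(x_i) / (1 + sum_j exp(x_j))."""
    denom = 1.0 + sum(math.exp(x) for x in logits)
    return [math.exp(x) / denom for x in logits]


logits = [2.0, -1.0, 0.5]  # hypothetical attention logits
a = softmax_with_zero_logit(logits)
b = softmax_one_in_denominator(logits)
```

Both functions return the same weights, and those weights sum to less than one, so the query can put part of its attention on nothing.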
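
The Nov 9, 2023 thread above describes training datasets drawn from either sine curves or sloped lines, with noise added to every data point. A rough sketch of such a sampler, using only the standard library; the parameter ranges, the 50/50 mixture, and the name `sample_dataset` are assumptions made for illustration and are not taken from the linked Colab:

```python
import math
import random


def sample_dataset(n_points=20, noise_std=0.1, rng=random):
    """Draw one noisy dataset from a mixture of sine curves and sloped lines.

    All ranges below are illustrative assumptions, not the thread's settings.
    """
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n_points)]
    if rng.random() < 0.5:
        kind = "sine"
        amplitude = rng.uniform(0.5, 2.0)
        frequency = rng.uniform(0.5, 2.0)
        clean = [amplitude * math.sin(frequency * x) for x in xs]
    else:
        kind = "line"
        slope = rng.uniform(-1.0, 1.0)
        clean = [slope * x for x in xs]
    # Gaussian noise on every point gives any observed dataset nonzero
    # density under both function classes, as long as noise_std > 0.
    ys = [y + rng.gauss(0.0, noise_std) for y in clean]
    return kind, xs, ys


rng = random.Random(0)  # fixed seed so the sketch is repeatable
kind, xs, ys = sample_dataset(rng=rng)
```

Setting `noise_std=0.0` corresponds to the noise-free case from the thread, where a sloped sine has probability zero under this prior.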
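
The thread's remark that the PPD is not defined without noise can be written out with the standard definition of the posterior predictive distribution. The symbols here are assumptions for illustration: \phi stands for the latent function (a particular sine or line), D for the in-context data, and (x, y) for a held-out point.

```latex
p(y \mid x, D) = \int p(y \mid x, \phi)\, p(\phi \mid D)\, d\phi,
\qquad
p(\phi \mid D) = \frac{p(D \mid \phi)\, p(\phi)}{p(D)},
\qquad
p(D) = \int p(D \mid \phi)\, p(\phi)\, d\phi .
```

If D is a noise-free sloped sine, then p(D \mid \phi) = 0 for every sine and every line in the prior, so p(D) = 0 and the posterior divides by zero. With noise on every point, p(D \mid \phi) > 0 for all \phi, and the PPD is well defined.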