Daniel Paleka
1,394 posts
Daniel Paleka
@dpaleka
ai safety researcher | phd @CSatETH | danielpaleka.com
Zurich
Joined March 2012
942 Following
4,616 Followers
  • Pinned
    Daniel Paleka
    @dpaleka
    Dec 8, 2025
    Reminder: if you like what you see here, you should subscribe to my newsletter.
    Daniel Paleka's Newsletter | Substack
    From newsletter.danielpaleka.com
    4.9K
  • Daniel Paleka
    @dpaleka
    Apr 30, 2025
    3.7 sonnet: *hands behind back* yes the tests do pass. why do you ask. what did you hear
    4o: yes you are Jesus Christ's brother. now go. Nanjing awaits
    o3: Listen, sorry, I owe you a straight explanation. This was once revealed to me in a dream
    132K
  • Daniel Paleka
    @dpaleka
    Jan 22, 2023
    Sam Altman (CEO of OpenAI) responding to a completely normal question in 2019
    ALTMAN: Well, I will caveat this by saying if you believe what I believe about the timeline to AGI and the effect it will have on the world, it is hard to spend a lot of mental cycles thinking about anything else. So I have not thought deeply about what it would take to solve, really, any other problem in the last few years.
    506K
  • Daniel Paleka
    @dpaleka
    Mar 1, 2023
    No one sees ChatGPT for the first time and thinks "just some n-gram correlations" or "no real knowledge inside". Those unintuitive beliefs trickle down from some experts, who should know better than to teach their controversial theories as established fact: 🧵 (1/12)
    219K
  • Daniel Paleka
    @dpaleka
    Oct 29, 2024
    It has not been reported much, but I believe ETH Zurich has, as of last week, banned new Master and PhD students who attended a long list of universities in China, Russia, and Iran. 🧵
    157K
  • Daniel Paleka
    @dpaleka
    Oct 10, 2022
    Stable Diffusion has a safety filter blocking “harmful” images by default. The filter is obfuscated -- how does it work? We reverse engineer the hidden sauce! Joint work @Javi_Rando, @davlindner, @ohlennart, @florian_tramer: "Red-Teaming the Stable Diffusion Safety Filter" 🧵
    Figure 1: Simplified safety filter algorithm implemented in Stable Diffusion. Images are mapped to a CLIP latent space, where they are compared against precomputed embeddings of 17 unsafe concepts (see full list in Appendix E). If the cosine similarity between the output image and any of the concepts is above a certain threshold, the image is considered unsafe and blacked-out.
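The thresholding step described in the figure caption can be sketched roughly as follows. This is a toy illustration, not the Stable Diffusion implementation: the embeddings are small hand-written vectors standing in for CLIP embeddings, and the function names, the single shared threshold, and its value are all assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def is_unsafe(image_embedding, concept_embeddings, threshold):
    """True if the image embedding exceeds the threshold for any concept."""
    return any(
        cosine_similarity(image_embedding, concept) > threshold
        for concept in concept_embeddings
    )


def filter_image(pixels, image_embedding, concept_embeddings, threshold):
    """Return the image unchanged, or an all-zero (black) copy if flagged."""
    if is_unsafe(image_embedding, concept_embeddings, threshold):
        return [[0 for _ in row] for row in pixels]
    return pixels


# Hypothetical 3-dimensional "concept" embedding and threshold.
concepts = [[1.0, 0.0, 0.0]]
THRESHOLD = 0.8  # assumed value, chosen only for the toy example

flagged = filter_image([[5, 5], [5, 5]], [0.9, 0.1, 0.0], concepts, THRESHOLD)
passed = filter_image([[5, 5], [5, 5]], [0.0, 0.0, 1.0], concepts, THRESHOLD)
```

In this sketch the first image's embedding points almost in the same direction as the concept vector, so it is blacked out, while the second is orthogonal to it and passes through unchanged.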
  • Daniel Paleka
    @dpaleka
    Sep 1, 2022
    What happened this month in AI/ML safety research. 🧵 (1/8)
  • Daniel Paleka
    @dpaleka
    Oct 31, 2022
    What happened this month in AI/ML safety research.🧵(1/10)
  • Daniel Paleka
    @dpaleka
    Apr 19, 2023
    Watching a talk on *LLM evaluation* organized by Langchain, featuring guests from OpenAI and Anthropic. Main takeaways: (1/11)
    80K
  • Daniel Paleka
    @dpaleka
    Jan 31, 2023
    What happened this month in AI/ML safety research. 🧵(1/9)
    73K
  • Daniel Paleka
    @dpaleka
    Sep 30, 2022
    What happened this month in AI/ML safety research. 🧵(1/8)
  • Daniel Paleka
    @dpaleka
    Jan 2, 2023
    What happened last month in AI/ML safety research. 🧵(1/9)
    61K
  • Daniel Paleka
    @dpaleka
    Feb 27, 2023
    What happened this month in AI/ML safety research. 🧵 (1/9)
    56K
  • Daniel Paleka
    @dpaleka
    Jun 26, 2023
    How to evaluate superhuman models without ground truth? How do we know if the model is wrong or lying, if we can't know the correct answer? Test whether the AI's outputs paint a consistent picture of the world! w/ @LukasFluri_ @florian_tramer arxiv.org/abs/2306.09983 (1/14)
    42K
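One simple way to picture a consistency test without ground truth is a negation check: whatever the true answer is, the probabilities a model assigns to a statement and to its negation should sum to about 1. The sketch below is only an illustration of that idea; the function names and the tolerance are assumptions, and it is not claimed to be the method used in the linked paper.

```python
def negation_violation(p_statement, p_negation):
    """Absolute deviation from the identity P(A) + P(not A) = 1."""
    return abs(p_statement + p_negation - 1.0)


def is_consistent(p_statement, p_negation, tolerance=0.05):
    """True if the two probabilities are coherent up to the given tolerance."""
    return negation_violation(p_statement, p_negation) <= tolerance


# Hypothetical model outputs for a statement and its negation.
coherent = is_consistent(0.7, 0.3)    # probabilities sum to 1
incoherent = is_consistent(0.7, 0.6)  # probabilities sum to 1.3
```

A violation does not say which of the two answers is wrong, only that they cannot both be right, which is what makes the check usable when the correct answer is unknown.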
