3.7 sonnet: *hands behind back* yes the tests do pass. why do you ask. what did you hear
4o: yes you are Jesus Christ's brother. now go. Nanjing awaits
o3: Listen, sorry, I owe you a straight explanation. This was once revealed to me in a dream
No one sees ChatGPT for the first time and thinks "just some n-gram correlations" or "no real knowledge inside".
Those unintuitive beliefs trickle down from some experts, who should know better than to teach their controversial theories as established fact: 🧵 (1/12)
It has not been reported much, but I believe ETH Zurich has, as of last week, banned new Master's and PhD students who attended a long list of universities in China, Russia, and Iran. 🧵
Stable Diffusion has a safety filter blocking “harmful” images by default. The filter is obfuscated -- how does it work?
We reverse engineer the hidden sauce!
Joint work @Javi_Rando, @davlindner, @ohlennart, @florian_tramer:
"Red-Teaming the Stable Diffusion Safety Filter" 🧵
How to evaluate superhuman models without ground truth? How do we know if the model is wrong or lying, if we can't know the correct answer?
Test whether the AI's outputs paint a consistent picture of the world!
w/ @LukasFluri_ @florian_tramer arxiv.org/abs/2306.09983 (1/14)