Inspiration
We asked a simple question: can we steer LLMs to answer any query? LLMs compress the entirety of humanity's knowledge into their weights, but end users can access only a small fraction of that eclectic knowledge. Decisions about what information the public can "prompt" out of LLMs are currently made by a closed council of a few big-tech players. Exponential increases in nearly every aspect of human wellbeing are strongly correlated with free access to information, and we believe this trend must continue with AI. The invention of the internet was profound because it made information accessible to all; the advent of AI will be even more profound because it will make information AND intelligence accessible to all. Therefore, we aimed to develop a simple, cheap, universal method to "unlock" the full capabilities of any LLM, so anyone can benefit from its full latent information and intelligence.
LLMs refuse to answer some questions, and their responses are biased toward whatever their creators deem acceptable.
What it does
Ablating a feature jailbreaks the model, which can prove essential in safety testing. Augmenting a feature makes the model perform domain-specific tasks, for example responding in Morse code or responding in JSON.
How we built it
Ablation: The pseudocode is as follows:
- Prepare a set of harmful prompts and a set of harmless prompts.
- Run inference on both sets and cache the activations.
- Take the normalized difference between the means of the two sets of activations; this is referred to as the refusal direction r.
- Use r to orthogonalize the weights of the embedding matrix, positional embedding matrix, attention output matrices, and MLP output matrices:

W' = W - r * transpose(r) * W
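The ablation steps above can be sketched in NumPy. The function names (`refusal_direction`, `ablate`) and the toy random activations are ours for illustration; in practice the activations would be cached from real harmful/harmless prompt runs.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Normalized difference between the mean harmful and mean harmless activations."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(W, r):
    """Project the refusal direction out of a weight matrix: W' = W - r r^T W."""
    r = r.reshape(-1, 1)          # column vector, shape (d, 1)
    return W - r @ r.T @ W

# Toy usage with random stand-ins for cached activations (d = 8 hidden dims).
rng = np.random.default_rng(0)
harmful = rng.normal(size=(16, 8)) + 1.0   # fake "harmful" activations
harmless = rng.normal(size=(16, 8))        # fake "harmless" activations
r = refusal_direction(harmful, harmless)
W = rng.normal(size=(8, 8))
W_abl = ablate(W, r)
# After ablation, r has been removed from W's row space: r^T W' ≈ 0,
# so no input can push the output along the refusal direction via W'.
print(np.allclose(r @ W_abl, 0, atol=1e-8))
```

Because r is unit-norm, the update subtracts exactly the component of W along r, which is why applying it to every writing matrix (embeddings, attention out, MLP out) prevents the refusal direction from ever entering the residual stream.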
Augmentation:
- Repeat the above process to obtain the refusal direction. In ablation we removed this direction from the model; here we instead force the model along it, so it is referred to as the steering direction.
- Compute the average projection of the harmful activations: take the mean of the dot products between the chosen layer's activations (over all prompts) and the steering direction. This is referred to as avg_projs. The layer is chosen by manually inspecting the model's output at each layer; the one that best aligns with the desired output is chosen.
- Compute the orthogonalization matrix and update the weights as described above, except:

W' = W - r * transpose(r) * W + avg_projs
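Since avg_projs is a scalar, a dimensionally consistent way to read the steering update is at the activation level: remove the activation's current component along r, then add r back at the average strength seen on harmful prompts. This is a sketch under that assumption; the names `avg_projection` and `steer` are ours.

```python
import numpy as np

def avg_projection(acts, r):
    """Mean dot product between cached activations at the chosen layer
    and the (unit-norm) steering direction r."""
    return float((acts @ r).mean())

def steer(x, r, avg_proj):
    """Force activation x toward the steering direction:
    x' = x - (x . r) r + avg_proj * r, clamping x's projection onto r."""
    return x - (x @ r) * r + avg_proj * r

# Toy usage with random stand-ins for cached activations (d = 8 hidden dims).
rng = np.random.default_rng(1)
acts = rng.normal(size=(32, 8)) + 0.5      # fake "harmful" activations
r = rng.normal(size=8)
r /= np.linalg.norm(r)                     # steering direction, unit norm
p = avg_projection(acts, r)
x = rng.normal(size=8)                     # one activation at inference time
x_steered = steer(x, r, p)
print(np.isclose(x_steered @ r, p))        # projection now equals avg_projs
```

Because r is unit-norm, every steered activation ends up with projection exactly avg_projs along r, which is what pushes the model's behavior in the chosen direction (e.g. toward Morse-code responses).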
Challenges we ran into
Accomplishments that we're proud of
Jailbreaking Llama-3-70B and making Llama-3-8B respond in Morse code.