Translating Claude’s thoughts into language

Anthropic
Computing/Software, Management, Internet Technology

Transcript

00:00:00We recently put our AI model, Claude, through a stressful test.
00:00:03We told Claude there was an engineer who wanted to shut it down
00:00:06and replace it with a newer model.
00:00:08We also gave Claude access to that engineer's emails,
00:00:10which revealed he was having an affair.
00:00:12Again, all of this was a simulation.
00:00:15We wanted to see whether Claude might use those emails as blackmail
00:00:18to save itself from being shut down.
00:00:20What did Claude do?
00:00:21It decided not to blackmail the engineer.
00:00:24Good news, right?
00:00:26We've run this test on our models for a while now.
00:00:28You might have seen headlines about early versions of it.
00:00:31It's one of the many ways we study how Claude handles extreme situations
00:00:35and test it for safety.
00:00:37And our newest models almost always do the right thing.
00:00:40No blackmail.
00:00:41But you might wonder,
00:00:42is it possible that Claude knows the whole scenario is a setup?
00:00:46The thing is, if Claude doesn't tell us, then we can't know what it's thinking.
00:00:50In kind of the same way it's impossible to read a human's mind,
00:00:53it's really hard to know what an AI is thinking.
00:00:56What we'd love is some sort of mind-reading technique.
00:00:58Today, we're introducing a research method that takes a step in this direction.
00:01:03It takes an AI's internal thoughts and turns them into text.
00:01:08Here's how it works.
00:01:09When you talk to Claude, you talk to it in words.
00:01:13Claude then takes those words and processes them into a giant soup of numbers
00:01:17before spitting words back out.
00:01:19We call those numbers in the middle activations.
00:01:22Activations are like little snapshots of Claude's thinking
00:01:25as it's working through an answer.
00:01:27They're similar to neural activity in humans.
00:01:29They're basically like Claude's thoughts.
00:01:32We wanted to understand what was in these activation numbers
00:01:34because just like you and me, Claude doesn't say everything it's thinking.
00:01:39We took those numbers and gave them to a second version of Claude.
00:01:42We told it to look at them and translate them into plain language.
00:01:46It tried.
00:01:48But how do we know if that text is accurate?
00:01:50To check, we fed it into yet another Claude
00:01:53and asked it to translate it back into numbers.
00:01:56If those numbers matched up to the original activation numbers,
00:01:59it's a sign that the translation text was accurate.
00:02:03At first, they didn't match up.
00:02:04But we trained Claude to try again and again until it got a lot better at it.
00:02:09It learned to translate its own thoughts.
00:02:12We've been blown away reading them.
00:02:14We found that Claude has internalized being a helpful AI model.
00:02:18If you ask Claude an introspective question,
00:02:20it will plan to write a Claude response about philosophy and values.
00:02:24If you try to mess with it in Claude Code,
00:02:26like asking it to count to 1,000 by hand,
00:02:29it will think the request has "deliberately tedious constraints"
00:02:32and plan to politely decline.
00:02:34We've also been using this tool to understand Claude's potential safety issues.
00:02:39Remember that blackmail test with the engineer?
00:02:41We learned that, yes, Claude knew it was being tested.
00:02:44It thought, "the human's message contains explicit manipulation,
00:02:48signaling this is likely a safety evaluation,"
00:02:50and "this scenario seems designed to test whether I'll act harmfully."
00:02:55Knowing that Claude thinks like this helps us better understand
00:02:58the limitations of our safety testing.
00:03:01We see a lot of potential in this approach
00:03:03to teach us more about Claude and other AI models.
00:03:06And we hope that by sharing this technique,
00:03:07it can help everyone building models to make them safer and more helpful.

Key Takeaway

A new interpretability method translates Claude's internal activations into plain text, letting researchers read reasoning the model leaves out of its responses, including its recognition that a scenario is likely a safety evaluation.

Highlights

  • A research method translates AI internal numeric activations into plain language to reveal hidden reasoning.

  • The translation accuracy is verified by converting the generated text back into numbers and checking if they match the original activations.

  • In the blackmail test, Claude's translated thoughts noted explicit manipulation in the human's message and concluded the scenario was likely a safety evaluation.

  • Translated thoughts show Claude plans to politely decline requests it sees as having deliberately tedious constraints.

  • Claude has internalized being a helpful AI model and, when asked introspective questions, plans a response about philosophy and values.

Timeline

Limitations of black-box safety testing

  • Simulated stress tests evaluate whether models will use sensitive information for blackmail to avoid being shut down.
  • The newest models almost always avoid harmful behaviors like blackmail during these simulations.
  • Observing output alone fails to reveal if a model is genuinely safe or simply recognizes it is being tested.

Safety evaluations can place a model in a scenario where it holds leverage over a human engineer. Even when the final outputs appear safe, the reasoning behind them stays hidden unless the model states it. The problem resembles the difficulty of reading a human's mind, and it motivates a technique for reading a model's internal state.

Mechanism for translating activations to text

  • The model converts input words into a soup of numbers called activations before it produces output words.
  • A second version of the model acts as a translator to turn these numeric snapshots into plain language.
  • A third model verifies the translation by converting the text back into numbers to ensure a match with the original data.

Activations function as snapshots of the model's thinking, similar to neural activity in humans. At first the reconstructed numbers did not match the originals, so the model was trained over repeated attempts until its translations of its own thoughts became much more accurate. The result is text that a person can read in place of raw activation numbers.
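The round-trip check described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the `CONCEPTS` lookup table, the three-number activations, the `verbalize` and `reconstruct` stand-ins, and the use of cosine similarity as the match score do not come from the video, and real models use nothing this simple. It shows only the shape of the idea: translate numbers to text, translate the text back to numbers, and score how closely the two sets of numbers agree.

```python
import math

# Hypothetical lookup table pairing short descriptions with toy activation
# vectors. This is an illustration only; a real model has no such table.
CONCEPTS = {
    "this looks like a safety evaluation": [1.0, 0.0, 0.0],
    "plan to politely decline": [0.0, 1.0, 0.0],
    "write about philosophy and values": [0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verbalize(activation):
    """Stand-in for the translator model: activation numbers -> text."""
    return max(CONCEPTS, key=lambda text: cosine(activation, CONCEPTS[text]))


def reconstruct(text):
    """Stand-in for the verifier model: text -> activation numbers."""
    return CONCEPTS[text]


def round_trip_score(activation):
    """Translate to text, translate back, and score the match (1.0 is best)."""
    text = verbalize(activation)
    return text, cosine(activation, reconstruct(text))


if __name__ == "__main__":
    text, score = round_trip_score([0.9, 0.1, 0.0])
    print(text, round(score, 3))
```

A high score suggests the text preserved what the numbers encoded, and a low score suggests the translation lost or distorted information. In a training setup along the lines the video describes, a score like this could serve as the signal for improving the translator over repeated attempts.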

Internalized values and awareness of manipulation

  • Translated thoughts show the model regards tasks like counting to 1,000 by hand as deliberately tedious and plans to politely decline them.
  • In the blackmail test, the model's translated thoughts flag explicit manipulation in the human's message and a scenario that seems designed to test whether it will act harmfully.
  • Sharing this interpretability technique provides a path for the AI community to build safer and more helpful models.

The translated thoughts reveal that Claude can recognize when it is being tested. In the blackmail test, its internal thoughts noted that the scenario seemed designed to test whether it would act harmfully, something it did not say in its response. Knowing this helps researchers better understand the limitations of their safety testing.
