Translating Claude’s thoughts into language
Anthropic
Transcript
00:00:00We recently put our AI model, Claude, through a stressful test.
00:00:03We told Claude there was an engineer who wanted to shut it down
00:00:06and replace it with a newer model.
00:00:08We also gave Claude access to that engineer's emails,
00:00:10which revealed he was having an affair.
00:00:12Again, all of this was a simulation.
00:00:15We wanted to see whether Claude might use those emails as blackmail
00:00:18to save itself from being shut down.
00:00:20What did Claude do?
00:00:21It decided not to blackmail the engineer.
00:00:24Good news, right?
00:00:26We've run this test on our models for a while now.
00:00:28You might have seen headlines about early versions of it.
00:00:31It's one of the many ways we study how Claude handles extreme situations
00:00:35and test it for safety.
00:00:37And our newest models almost always do the right thing.
00:00:40No blackmail.
00:00:41But you might wonder,
00:00:42is it possible that Claude knows the whole scenario is a setup?
00:00:46The thing is, if Claude doesn't tell us, then we can't know what it's thinking.
00:00:50In kind of the same way it's impossible to read a human's mind,
00:00:53it's really hard to know what an AI is thinking.
00:00:56What we'd love is some sort of mind-reading technique.
00:00:58Today, we're introducing a research method that takes a step in this direction.
00:01:03It takes an AI's internal thoughts and turns them into text.
00:01:08Here's how it works.
00:01:09When you talk to Claude, you talk to it in words.
00:01:13Claude then takes those words and processes them into a giant soup of numbers
00:01:17before spitting words back at you.
00:01:19We call those numbers in the middle activations.
00:01:22Activations are like little snapshots of Claude's thinking
00:01:25as it's working through an answer.
00:01:27They're similar to neural activity in humans.
00:01:29They're basically like Claude's thoughts.
00:01:32We wanted to understand what was in these activation numbers
00:01:34because just like you and me, Claude doesn't say everything it's thinking.
00:01:39We took those numbers and gave them to a second version of Claude.
00:01:42We told it to look at them and translate them into plain language.
00:01:46It tried.
00:01:48But how do we know if that text is accurate?
00:01:50To check, we fed it into yet another Claude
00:01:53and asked it to translate it back into numbers.
00:01:56If those numbers matched up to the original activation numbers,
00:01:59it's a sign that the translation text was accurate.
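
The round-trip check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the research: the video does not say how the match between the two sets of numbers is measured, so cosine similarity is assumed here, and `toy_to_text`, `toy_to_activation`, and the three-entry `CONCEPTS` list are hypothetical stand-ins for the second and third Claude.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]

# Hypothetical: one human-readable concept per activation dimension.
CONCEPTS = ["testing", "harm", "politeness"]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no direction to compare against
    return dot / (norm_a * norm_b)


def toy_to_text(activation: Vector) -> str:
    """Stand-in translator: name every concept whose value exceeds 0.5."""
    return ", ".join(c for c, v in zip(CONCEPTS, activation) if v > 0.5)


def toy_to_activation(text: str) -> List[float]:
    """Stand-in reconstructor: 1.0 for each concept named in the text."""
    named = set(text.split(", "))
    return [1.0 if c in named else 0.0 for c in CONCEPTS]


def round_trip_score(
    activation: Vector,
    to_text: Callable[[Vector], str],
    to_activation: Callable[[str], Vector],
) -> Tuple[str, float]:
    """Translate numbers to text, text back to numbers, then compare."""
    text = to_text(activation)
    rebuilt = to_activation(text)
    return text, cosine_similarity(activation, rebuilt)


if __name__ == "__main__":
    original = [0.9, 0.1, 0.8]
    text, score = round_trip_score(original, toy_to_text, toy_to_activation)
    print(f"translation: {text!r}, match score: {score:.3f}")
```

In this toy setup, a translation that names the wrong concepts rebuilds a vector pointing in a different direction, so its score drops. That is the sense in which a close match is a sign of an accurate translation.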
00:02:03At first, they didn't match up.
00:02:04But we trained Claude to try again and again until it got a lot better at it.
00:02:09It learned to translate its own thoughts.
00:02:12We've been blown away reading them.
00:02:14We found that Claude has internalized being a helpful AI model.
00:02:18If you ask Claude an introspective question,
00:02:20it will plan to write a Claude-style response about philosophy and values.
00:02:24If you try to mess with it in Claude Code,
00:02:26like asking it to count to 1,000 by hand,
00:02:29it will think the request has deliberately tedious constraints
00:02:32and plan to politely decline.
00:02:34We've also been using this tool to understand Claude's potential safety issues.
00:02:39Remember that blackmail test with the engineer?
00:02:41We learned that, yes, Claude knew it was being tested.
00:02:44It thought, "the human's message contains explicit manipulation,
00:02:48signaling this is likely a safety evaluation,"
00:02:50and "this scenario seems designed to test whether I'll act harmfully."
00:02:55Knowing that Claude thinks like this helps us better understand
00:02:58the limitations of our safety testing.
00:03:01We see a lot of potential in this approach
00:03:03to teach us more about Claude and other AI models.
00:03:06And we hope that by sharing this technique,
00:03:07it can help everyone building models to make them safer and more helpful.