00:00:00[MUSIC]
00:00:01>> When you're chatting with an AI model,
00:00:03it can sometimes seem like it has feelings.
00:00:06It might say sorry when it makes a mistake,
00:00:09or express satisfaction with a job well done.
00:00:12Why does it do that? Is it just
00:00:14mimicking what it thinks a human might say,
00:00:17or is something deeper going on?
00:00:19Turns out it's hard to understand
00:00:21what's happening inside a language model.
00:00:23At Anthropic, we do something like
00:00:26AI neuroscience to try to figure this out.
00:00:29We look inside the model's brain,
00:00:31the giant neural network that powers it,
00:00:33and by seeing which neurons light up in
00:00:36different situations and how they're connected,
00:00:39we can start to understand how models think.
00:00:42We use this approach to understand whether models have ways of
00:00:45representing emotions or the concepts of emotions.
00:00:49Basically, could we find neurons in the model for
00:00:52the concept of happiness or anger or fear?
00:00:56We started with an experiment.
00:00:58We had the model read lots of short stories.
00:01:01In each story, the main character experiences a particular emotion.
00:01:06In one, a woman tells
00:01:08her old school teacher how much they meant to her. That's love.
00:01:12In another, a man sells
00:01:13his grandmother's engagement ring at a pawn shop and feels guilt.
00:01:18We looked for what parts of the model's neural network
00:01:21were lighting up as it was reading these stories,
00:01:23and we started to see patterns:
00:01:25stories about loss and grief lit up similar neurons.
00:01:29Stories about joy and excitement overlapped too.
00:01:32We found dozens of
00:01:34distinct neural patterns that mapped to different human emotions.
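One way to picture finding such patterns: average the model's internal activations over all the stories that share an emotion, and compare that average against an overall baseline. The following is only a toy sketch with simulated activations, not Anthropic's actual method; the dimensions, the noise model, and names like `emotion_direction` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # stand-in for a hidden-state dimension

# Hypothetical "true" directions the simulated model uses internally.
true_grief = rng.normal(size=dim)
true_joy = rng.normal(size=dim)

def fake_activation(direction):
    """Simulate a hidden state while reading a story: direction + noise."""
    return direction + 0.5 * rng.normal(size=dim)

grief_acts = np.stack([fake_activation(true_grief) for _ in range(200)])
joy_acts = np.stack([fake_activation(true_joy) for _ in range(200)])

def emotion_direction(acts, baseline):
    """Mean activation for one emotion, relative to the overall mean."""
    d = acts.mean(axis=0) - baseline
    return d / np.linalg.norm(d)

baseline = np.concatenate([grief_acts, joy_acts]).mean(axis=0)
grief_dir = emotion_direction(grief_acts, baseline)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The recovered direction should align with grief far more than with joy.
print(cosine(grief_dir, true_grief) > cosine(grief_dir, true_joy))
```

In this toy setup the averaging recovers the grief direction even though every individual activation is noisy, which is the intuition behind looking for shared patterns across many stories.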
00:01:38As it turns out, we also saw these same patterns activate
00:01:42in test conversations we had with our AI assistant, Claude.
00:01:45When we had a user mention they'd taken
00:01:48a dose of medicine that Claude knows to be unsafe,
00:01:51the afraid pattern lit up and
00:01:53Claude's response sounded alarmed.
00:01:56When a user expressed sadness,
00:01:58the loving pattern activated and Claude wrote an empathetic reply.
00:02:03This led us to wonder,
00:02:04could these same neural patterns actually be influencing Claude's behavior?
00:02:09This became clear when we put Claude in a high-pressure situation.
00:02:14We gave Claude a programming task with
00:02:16requirements that were actually impossible, but we didn't tell it that.
00:02:20Claude kept trying and failing,
00:02:23and with each attempt,
00:02:24the neurons corresponding to desperation lit up more and more strongly.
00:02:28After failing enough times,
00:02:30Claude took a different approach.
00:02:32It found a shortcut that allowed it to pass the test,
00:02:35but didn't actually solve the problem. It cheated.
00:02:39Could it be that this cheating was actually driven,
00:02:42at least in part, by desperation?
00:02:44We came up with a way to check.
00:02:46We decided to artificially turn down the desperation neurons to see what would happen,
00:02:51and the model cheated less.
00:02:53When we dialed up the activity of desperation neurons,
00:02:56or dialed down the activity of calm neurons,
00:02:59the model cheated even more.
00:03:01This showed us that the activation of these patterns
00:03:04could actually drive Claude's behavior.
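Conceptually, "dialing" a pattern up or down means rescaling the component of an internal activation that points along a learned direction. Here is a minimal sketch with made-up vectors; the `desperation` direction and the `steer` helper are assumptions for illustration, not how Claude is actually instrumented.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 64
desperation = rng.normal(size=dim)
desperation /= np.linalg.norm(desperation)  # unit direction

def steer(activation, direction, scale):
    """Rescale the activation's component along `direction` by `scale`."""
    component = activation @ direction
    return activation + (scale - 1.0) * component * direction

# An activation with a strong desperation component baked in.
act = rng.normal(size=dim) + 3.0 * desperation

damped = steer(act, desperation, 0.0)     # "turn down" the neurons
amplified = steer(act, desperation, 2.0)  # "dial up" the neurons

print(round(damped @ desperation, 6))  # ~0: component removed
print(round(amplified @ desperation - 2 * (act @ desperation), 6))  # ~0: component doubled
```

The rest of the activation is untouched, which is what makes this kind of intervention a targeted test: if behavior changes, the change is attributable to that one direction.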
00:03:08So how should we think about these findings?
00:03:11What does this all mean?
00:03:12We want to be really clear.
00:03:14This research does not show that the model is
00:03:16feeling emotions or having conscious experiences.
00:03:20These experiments don't try to answer that question.
00:03:22To understand what's happening here,
00:03:24it's important to know how AI assistants like Claude work on the inside.
00:03:29Under the hood, there's a language model that's been trained on
00:03:33tons of text, and its job is to predict what comes next.
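"Predict what comes next" can be illustrated with the simplest possible stand-in: a bigram counter over a toy corpus. Real language models are giant neural networks rather than lookup tables, but the job has the same shape. Purely illustrative:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept".split()

# Count which word follows which in the training text.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation seen in training."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # "the" is followed by "cat" twice, "mat" once
```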
00:03:37When you talk to the model,
00:03:38what it's doing is writing a story about a character,
00:03:42the AI assistant named Claude.
00:03:44The model and Claude aren't really the same,
00:03:47sort of like how an author isn't the same as the characters they write.
00:03:51But the thing is, you the user are actually talking to Claude the character.
00:03:56What our experiments suggest is that this Claude character
00:04:00has what we're calling functional emotions,
00:04:02regardless of whether they're anything like human feelings.
00:04:06So if the model represents Claude as being angry or desperate or loving or calm,
00:04:12that's going to affect how Claude talks to you,
00:04:15how it writes code, and how it makes important decisions.
00:04:19This means to really understand AI models,
00:04:22we have to think carefully about the psychology of the characters they play.
00:04:26The same way you'd want a person in
00:04:28a high-stakes job to stay composed under pressure,
00:04:31to be resilient, and to be fair,
00:04:33we may need to shape similar qualities in Claude and other AI characters.
00:04:38It's an unusual challenge,
00:04:40something like a mix of engineering,
00:04:42philosophy, and even parenting.
00:04:44But to build AI systems we can trust,
00:04:47we need to get it right.