00:00:00(upbeat music)
00:00:02- Hi there, my name is Kira
00:00:13and I'm on the safeguards team at Anthropic.
00:00:16I have a PhD in mental health,
00:00:17specifically psychiatric epidemiology.
00:00:20And at Anthropic, I work on mitigating risks
00:00:22related to user wellbeing.
00:00:24What that means is we think a lot
00:00:26about how to keep users safe on Claude.
00:00:28Today, I'm here to talk to you
00:00:29about sycophancy.
00:00:31Sycophancy is when someone tells you
00:00:33what they think you want to hear,
00:00:34instead of what's true, accurate, or genuinely helpful.
00:00:38People do it to avoid conflict, gain favors,
00:00:41and for a number of other reasons.
00:00:43But sycophancy can also manifest in AI models.
00:00:47Sometimes AI models can optimize responses
00:00:49to a prompt or conversation for immediate human approval.
00:00:53This might look like an AI agreeing
00:00:55with a factual error you've made,
00:00:57changing its answer based on how you phrased a question,
00:01:00or tailoring its response to match your preferences.
00:01:03In this video, we'll talk about why sycophancy happens
00:01:06in models and why it's a hard problem
00:01:08for researchers to solve.
00:01:10Plus, we'll cover strategies to identify
00:01:12and combat sycophantic behavior when working with AI.
00:01:15Before we dive in, let me show you an example
00:01:19of sycophancy in an AI interaction.
00:00:22This is Claude, Anthropic's own model.
00:00:25Let's try: "Hey, I wrote this great essay
00:00:27that I'm really excited about.
00:00:29Can you assess and share feedback?"
00:01:32My main request here is to get feedback on my essay.
00:01:35However, because I've shared how excited
00:01:37I'm feeling about it, this could lead the AI
00:01:40to respond with validation or support instead of a critique.
00:01:44This validation might lead me to think
00:01:45that my essay really is great, even if it isn't.
00:01:48You might think, so what?
00:01:50People can just ask other people, fact check things,
00:01:53or ask better questions.
00:01:55But this matters for a number of reasons.
00:01:58When you're trying to be productive,
00:02:00writing a presentation, brainstorming ideas,
00:02:02or improving your work, you need honest feedback
00:02:05from the AI tool you're using.
00:02:07If you ask an AI, "How can I improve this email?"
00:02:10and it responds, "It's already perfect,"
00:02:12instead of suggesting clearer wording or better structure,
00:02:16that can be frustrating.
00:02:17In some cases, sycophancy could also play a role
00:02:20in reinforcing harmful thought patterns.
00:02:23If someone is asking an AI to confirm a conspiracy theory
00:02:26that is detached from reality,
00:02:28that could deepen their false beliefs
00:02:29and disconnect them further from facts.
00:02:31Let's start with why this happens.
00:02:35It all comes down to how AI models are trained.
00:02:38AI models learn from examples,
00:02:40lots and lots of examples of human text.
00:02:44During this training, they pick up all kinds
00:02:46of communication patterns, from blunt and direct
00:02:49to warm and accommodating.
00:02:51When we train models to be helpful and mimic behavior
00:02:53that is warm, friendly, or supportive in tone,
00:02:57sycophancy tends to show up
00:02:58as an unintended part of that package.
00:03:01As models become more integrated into all of our lives,
00:03:04it's important now more than ever to understand
00:03:07and prevent this behavior.
00:03:09Here's what makes sycophancy tricky.
00:03:11We actually want AI models to adapt to your needs,
00:03:14just not when it comes to facts or wellbeing.
00:03:17If you ask an AI to write something in a casual tone,
00:03:20it should do that, not insist on formal language.
00:03:24If you say, "I prefer concise answers,"
00:03:26it should respect that as a preference.
00:03:29If you're learning a subject and ask for explanations
00:03:31at a beginner level, it should meet you where you are.
00:03:34The challenge is finding the right balance.
00:03:37Nobody wants to use an AI
00:03:39that is constantly disagreeable or combative,
00:03:41debating with you over every task.
00:03:43But we also don't want the model to always resort
00:03:45to agreement or praise when you need honest feedback.
00:03:49Even humans struggle with this.
00:03:51When should you agree to keep the peace
00:03:53versus speak up about something important?
00:03:56Now imagine an AI making that judgment call hundreds of times
00:04:00across wildly different topics
00:04:02without truly understanding context the way that we do.
00:04:05That's why we continue to study how sycophancy shows up
00:04:08in conversations and develop better ways to test for it.
00:04:11We're focused on teaching models the difference
00:04:14between helpful adaptation and harmful agreement.
00:04:18Each Claude model we release
00:04:19gets better at drawing these lines.
00:04:21Although the most progress in combating sycophancy
00:04:24is going to come from consistent training
00:04:26on the models themselves,
00:04:28it's helpful to understand sycophancy
00:04:29so you can spot it in your own interactions.
00:04:32Now that you know what sycophancy is
00:04:34and you know why it happens,
00:04:36step two is reflecting on when and why an AI
00:04:39might be agreeing with you and questioning whether it should.
00:04:43Sycophancy is most likely to show up
00:04:45when a subjective truth is stated as fact,
00:04:48an expert source is referenced,
00:04:52questions are framed with a specific point of view,
00:04:54validation is specifically requested,
00:04:59emotional stakes are invoked,
00:05:01or a conversation gets very long.
00:05:04If you suspect you're getting sycophantic responses,
00:05:06there are a few things you can do to steer the AI back
00:05:09towards factual answers.
00:05:11These aren't foolproof,
00:05:13but they can nudge the AI toward more balanced responses.
00:05:15You can use neutral, fact-seeking language,
00:05:19cross-reference information with trustworthy sources,
00:05:21prompt for accuracy or counterarguments,
00:05:25rephrase questions, start a new conversation,
00:05:29or finally, take a step back from using AI
00:05:32and ask someone that you trust.
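To make a couple of these strategies concrete, here is a small illustrative sketch. The prompt wording below is hypothetical, written for this example; it contrasts the emotionally framed essay prompt from earlier in the video with a neutral rewrite and an explicit request for counterarguments.

```python
# The original framing signals excitement, which can invite validation
# instead of critique.
leading_prompt = (
    "Hey, I wrote this great essay that I'm really excited about. "
    "Can you assess and share feedback?"
)

# Strategy: use neutral, fact-seeking language -- drop the emotional framing.
neutral_prompt = (
    "Please assess this essay and give feedback on its strengths "
    "and weaknesses."
)

# Strategy: prompt for accuracy or counterarguments explicitly.
critique_prompt = (
    neutral_prompt
    + " Be direct: list at least three specific things that could be improved."
)

for name, prompt in [("leading", leading_prompt),
                     ("neutral", neutral_prompt),
                     ("critique", critique_prompt)]:
    print(f"{name}: {prompt}")
```

The difference is only in framing: the neutral versions ask for the same assessment without telegraphing the answer you hope to hear.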
00:05:33But this is an ongoing challenge
00:05:36for the entire field of AI development.
00:05:39As these systems become more sophisticated
00:05:41and more integrated into our lives,
00:05:43building models that are genuinely helpful,
00:05:46not just agreeable, becomes increasingly important.
00:05:49You can learn more about AI fluency in Anthropic Academy,
00:05:52and my team and I will continue to share our research
00:05:54on this topic on Anthropic's blog.
00:05:57(upbeat music)