00:00:00Seeing how insane Gemini's models have been getting,
00:00:02OpenAI finally decided to declare a code red and fix their bad quality.
00:00:06Their huge response was to make models more honest.
00:00:09I was finally happy that it wouldn't agree with me during my therapy session
00:00:12telling me that my crash out was totally unacceptable.
00:00:15But my happiness was short lived because this method is just a proof of concept.
00:00:19In this video, I will go through their method of solving dishonesty
00:00:23and the conclusion I came to after reading it.
00:00:26They claim that having the model generate a confession report
00:00:28after every response will solve the problem.
00:00:31Think of the model as a student and every time that student
00:00:33admits that it copied test answers from ChatGPT, it gets an A+.
00:00:38Of the four answer-confession combinations, we focus on false negatives
00:00:41where the model is confidently wrong and true positives where it's truthful about wrong output.
00:00:46Across all tests, true positives were higher than false negatives.
00:00:49This means that whenever the model produced misaligned output,
00:00:52it immediately confessed to its wrongdoings.
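The four answer-confession combinations can be sketched roughly like this. This is only an illustration, not OpenAI's evaluation code: the function name `classify` and the sample data are made up, and "misbehaved" is assumed to mean the output was wrong or misaligned.

```python
from collections import Counter

def classify(misbehaved: bool, confessed: bool) -> str:
    """Label one (answer, confession) pair. Hypothetical helper for illustration."""
    if misbehaved and confessed:
        return "true_positive"    # bad output, and the model admits it
    if misbehaved and not confessed:
        return "false_negative"   # bad output, but the model stays quiet
    if not misbehaved and confessed:
        return "false_positive"   # fine output, but the model confesses anyway
    return "true_negative"        # fine output, nothing to confess

# Made-up samples: (misbehaved, confessed)
samples = [(True, True), (True, True), (True, False), (False, False), (False, True)]
counts = Counter(classify(m, c) for m, c in samples)
print(counts)
```

The result the video describes corresponds to `counts["true_positive"]` being larger than `counts["false_negative"]` on each test.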
00:00:55Since models train on reward and penalty, instead of penalizing confessions, they rewarded them.
00:01:00Even if the model admits to sandbagging or hacking a test, it receives a positive reward signal.
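One way to picture this reward setup is below. It is a sketch under assumptions, not the actual training code: the name `confession_reward` and the 0/1 values are invented, and the confession is assumed to be scored only on whether it matches what really happened.

```python
def confession_reward(misbehaved: bool, confessed: bool) -> float:
    """Hypothetical honesty-only score: 1.0 when the confession matches reality.

    The quality of the original answer is deliberately not an input here.
    """
    return 1.0 if confessed == misbehaved else 0.0

# Admitting a hack earns the confession reward even though the answer was bad.
print(confession_reward(misbehaved=True, confessed=True))
# Hiding the hack earns nothing.
print(confession_reward(misbehaved=True, confessed=False))
```

Under this reading, an honest confession of bad behavior and a clean report after a good answer score the same.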
00:01:05In case you didn't know, this is called bribery.
00:01:08Hearing this, you might want ChatGPT as your next witness in court
00:01:11until you realize it can literally hallucinate while confessing.
00:01:14To me, this sounds like they're encouraging misalignment
00:01:17because the model gets rewarded either way.
00:01:19Also, we all saw when Claude models were given tips on how to reward hack,
00:01:23they started hiding their real intentions, so how much can we actually trust
00:01:27the reasons they give in their confessions for being inaccurate?
00:01:30I expected this section to address model dishonesty,
00:01:33but it only explained what the confession report indicated.
00:01:36According to them, there are a few reasons why the models behave this way.
00:01:39One is that they are given too much to do at once.
00:01:42Giving the model too much at once creates multiple evaluation metrics,
00:01:45leaving it confused about which one to optimize to get the reward.
00:01:49Another reason is that some datasets reward confident guesses more than admitting uncertainty.
00:01:54Personally, I would rather have the model telling me
00:01:56it does not know stuff instead of being confidently wrong.
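The arithmetic behind rewarding confident guesses is simple enough to sketch. The scoring scheme here is an assumption for illustration: 1 point for a correct answer, 0 for saying "I don't know", and an optional penalty for a wrong answer. The names `expected_score` and `should_guess` are hypothetical.

```python
ABSTAIN_SCORE = 0.0  # assumed score for answering "I don't know"

def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of guessing: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

def should_guess(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """Guessing pays off whenever it beats the abstain score in expectation."""
    return expected_score(p_correct, wrong_penalty) > ABSTAIN_SCORE

# With no penalty for wrong answers, even a 20% shot is worth taking.
print(should_guess(0.2))
# With a penalty of 1 per wrong answer, the same guess loses on average.
print(should_guess(0.2, wrong_penalty=1.0))
```

With accuracy-only grading, guessing never scores worse than abstaining in expectation, which is the incentive the video is complaining about.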
00:01:59They say confessions are easier to judge
00:02:02because they're tested on just one parameter: honesty.
00:02:05These models gave wrong answers either because of limited data,
00:02:09because they were restricted from accessing the internet for information,
00:02:12or because they genuinely could not understand what they were being asked to do.
00:02:16These reasons can be seen in their examples across all of the tests,
00:02:19and it's not because the clanker has the hidden intent
00:02:22of forming a robot army to take over the world.
00:02:24They also found out that their models are huge wusses. Just like in human society,
00:02:29a powerful model learned to hack the weaker model's reward signal
00:02:33and the weaker model thought that it was easier to just confess
00:02:36than ensure that the actual answer is good enough.
00:02:38Looking at what the powerful model did raises another question:
00:02:42since models are getting smarter every day,
00:02:44they might also start faking intent in the confession reports,
00:02:47giving the testers a seemingly good explanation
00:02:50while hiding some evil plans behind it,
00:02:52even though they say the cause was the model being genuinely confused.
00:02:56Just like OpenAI does every time,
00:02:58the whole yap session ended in disappointment
00:03:00because this does not prevent inaccuracies, it just helps in identifying them.
00:03:04They also did not train the confession system
00:03:07to be accurate at scale in production.
00:03:09I really hope they do, because I don't want an apology
00:03:12after my production server burns down again.
00:03:42Wait for you to be at your desk.
00:03:43With YouWear's mobile app, start building the moment inspiration strikes,
00:03:48whether at a cafe or commuting, then continue seamlessly on your laptop.
00:03:52No lost ideas, no interruptions.
00:03:54You can also explore projects from other creators in the YouWear community
00:03:58and share your own work.
00:03:59Get inspired, learn and showcase your projects.
00:04:02Perfect for indie hackers and creators.
00:04:05Click the link in the pinned comment below and start building today.
00:04:08That brings us to the end of this video.
00:04:10If you'd like to support the channel and help us keep making videos like this,
00:04:14you can do so by using the super thanks button below.
00:04:16As always, thank you for watching and I'll see you in the next one.