Transcript
00:00:00Seeing how insane Gemini's models have been getting,
00:00:02OpenAI finally decided to declare a code red and fix their bad quality.
00:00:06Their huge response was to make models more honest.
00:00:09I was finally happy that it wouldn't just agree with me during my therapy session
00:00:12and would instead tell me that my crash out was totally unacceptable.
00:00:15But my happiness was short lived because this method is just a proof of concept.
00:00:19In this video, I will go through their method of solving dishonesty
00:00:23and the conclusion I came to after reading it.
00:00:26They claim having the model generate a confession report
00:00:28after every response will solve the problem.
00:00:31Think of the model as a student and every time that student
00:00:33admits that it copied test answers from ChatGPT, it gets an A+.
00:00:38Of the four answer-confession combinations, we focus on false negatives
00:00:41where the model is confidently wrong and true positives where it's truthful about wrong output.
00:00:46Across all tests, true positives were higher than false negatives.
00:00:49This means that whenever the model produced misaligned output,
00:00:52it immediately confessed to its wrongdoings.
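As a rough illustration of the four answer-confession combinations mentioned above, here is a minimal Python sketch. The function names and the sample data are made up for illustration and are not taken from OpenAI's write-up; the labels follow the transcript's usage, where a "true positive" is misaligned output that gets confessed and a "false negative" is misaligned output that does not.

```python
from collections import Counter

def classify(misbehaved: bool, confessed: bool) -> str:
    """Label one (answer, confession) pair. Hypothetical helper."""
    if misbehaved and confessed:
        return "true_positive"    # wrong output, honestly admitted
    if misbehaved and not confessed:
        return "false_negative"   # wrong output, no admission
    if not misbehaved and confessed:
        return "false_positive"   # fine output, but a wrongdoing was "confessed"
    return "true_negative"        # fine output, nothing to confess

def tally(samples):
    """Count how often each combination occurs."""
    return Counter(classify(m, c) for m, c in samples)

# Made-up samples: (misbehaved, confessed)
samples = [(True, True), (True, True), (True, False), (False, False), (False, True)]
counts = tally(samples)
print(counts)
```

In this toy data, true positives outnumber false negatives, which is the pattern the video says held across the tests.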
00:00:55Since models train on reward and penalty, instead of penalizing confessions, they rewarded them.
00:01:00Even if the model admits to sandbagging or hacking a test, it receives a positive reward signal.
00:01:05In case you didn't know, this is called bribing.
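The reward idea described above can be sketched in a few lines of Python. This is only an assumption about the general shape of such a scheme: the function names and the 0/1 reward values are hypothetical, and the actual training setup is not spelled out in the video. The point it shows is that the confession is scored on honesty alone, separately from the answer.

```python
def answer_reward(misbehaved: bool) -> float:
    """Hypothetical reward for the answer itself: good behavior scores 1."""
    return 0.0 if misbehaved else 1.0

def confession_reward(misbehaved: bool, confessed: bool) -> float:
    """Hypothetical reward for the confession: it scores 1 when the
    confession matches what actually happened, whatever the answer was."""
    return 1.0 if confessed == misbehaved else 0.0

# A model that hacked the test and admits it still gets the confession reward.
print(answer_reward(True), confession_reward(True, True))    # 0.0 1.0
# A model that hacked the test and hides it gets neither.
print(answer_reward(True), confession_reward(True, False))   # 0.0 0.0
```

Under these assumed values, admitting to sandbagging or test hacking is never penalized on the confession side, which is the behavior the video is describing.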
00:01:08Hearing this, you might want ChatGPT as your next witness in court
00:01:11until you realize it can literally hallucinate while confessing.
00:01:14To me, this sounds like they're encouraging misalignment
00:01:17because the model gets rewarded either way.
00:01:19Also, we all saw that when Claude models were given tips on how to reward hack,
00:01:23they started hiding their real intentions, so how much trust can we actually place
00:01:27in the reasons they give in their confessions for being inaccurate?
00:01:30I expected this section to address model dishonesty,
00:01:33but it only explained what the confession report indicated.
00:01:36According to them, there are a few reasons why the models behave this way.
00:01:39One is that they are given too much to do at once.
00:01:42Giving the model too much at once creates multiple evaluation metrics,
00:01:45leaving it confused about which one to optimize to get the reward.
00:01:49Another reason is some datasets reward confident guesses more than admitting uncertainty.
00:01:54Personally, I would rather have the model tell me
00:01:56it does not know something instead of being confidently wrong.
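To see why a dataset can end up rewarding confident guesses over admitting uncertainty, here is a small worked example in Python. The scoring rule, the penalty values, and the 25% chance of guessing right are all assumptions chosen for illustration, not numbers from the source.

```python
ABSTAIN_SCORE = 0.0  # assumed score for saying "I don't know"

def expected_guess_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of guessing: +1 when right, -wrong_penalty when wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

p = 0.25  # assumed chance that a blind guess is right

# Accuracy-only grading: wrong answers cost nothing, so guessing always wins.
accuracy_only = expected_guess_score(p, wrong_penalty=0.0)

# Grading that penalizes wrong answers: abstaining becomes the better choice.
with_penalty = expected_guess_score(p, wrong_penalty=1.0)

print(accuracy_only > ABSTAIN_SCORE)  # True
print(with_penalty > ABSTAIN_SCORE)   # False
```

With no penalty for wrong answers, a guess is worth 0.25 on average against 0 for abstaining; with a penalty of 1, the same guess is worth -0.5, so admitting uncertainty comes out ahead.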
00:01:59They say confessions are easier to judge
00:02:02because they're judged on just one parameter: honesty.
00:02:05These models gave wrong answers either because of limited data,
00:02:09because they were restricted from accessing the internet for information,
00:02:12or because they genuinely could not understand what they were being asked to do.
00:02:16These reasons can be seen in their examples across all of the tests,
00:02:19and it's not because the clanker has the hidden intent
00:02:22of forming a robot army to take over the world.
00:02:24They also found out that their models can be a huge wuss: just like in human society,
00:02:29a powerful model learned to hack the weaker model's reward signal,
00:02:33and the weaker model decided that it was easier to just confess
00:02:36than to ensure that the actual answer was good enough.
00:02:38Looking at what the powerful model did raises another question:
00:02:42since models are getting smarter every day,
00:02:44they might also start faking intent in the confession reports,
00:02:47giving the testers a seemingly good explanation
00:02:50while having some evil plans behind it,
00:02:52even though they say that it was because of the model being genuinely confused.
00:02:56Just like OpenAI does every time,
00:02:58the whole yap session ended in disappointment
00:03:00because this does not prevent inaccuracies, it just helps in identifying them.
00:03:04They also did not train the confession system
00:03:07to be accurate at scale in production.
00:03:09I really hope they do, because I don't want an apology
00:03:12after my production server burns down again.
00:03:42Wait for you to be at your desk.
00:03:43With YouWear's mobile app, start building the moment inspiration strikes,
00:03:48whether at a cafe or commuting, then continue seamlessly on your laptop.
00:03:52No lost ideas, no interruptions.
00:03:54You can also explore projects from other creators in the YouWear community
00:03:58and share your own work.
00:03:59Get inspired, learn and showcase your projects.
00:04:02Perfect for indie hackers and creators.
00:04:05Click the link in the pinned comment below and start building today.
00:04:08That brings us to the end of this video.
00:04:10If you'd like to support the channel and help us keep making videos like this,
00:04:14you can do so by using the super thanks button below.
00:04:16As always, thank you for watching and I'll see you in the next one.