00:00:00AI agents have one big problem. When you give them a URL, they often claim to have read the page,
00:00:06but in reality their view of the page is often obstructed. There's a new tool out there
00:00:11called the Agent Reading Test, designed by Dakary Carey, which is intended to solve
00:00:16this issue. It uses a series of canary tokens, which are unique strings hidden across 10 different
00:00:23web pages, to prove exactly where an agent's reading capability breaks down. In this video,
00:00:28we'll take a look at the Agent Reading Test, see how it works, and try it out for ourselves.
00:00:34It's going to be a lot of fun, so let's dive into it.
00:00:37So most people assume that when an agent visits a URL, it sees what the human sees. But in reality,
00:00:47agents rely on fetch pipelines that can be tripped up by modern web development practices.
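To make that concrete, here is a minimal, hypothetical sketch of the kind of naive HTML-to-text extraction a fetch pipeline might use. Fed a single-page-app shell (the markup below is invented for illustration), it recovers only the loading placeholder, because a plain HTTP fetch never runs JavaScript:

```python
from html.parser import HTMLParser

# Hypothetical, simplified text extractor of the kind many fetch
# pipelines use; real pipelines vary widely.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skipping = False  # True while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skipping = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skipping = False

    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.chunks.append(data.strip())

# A single-page-app shell: the real docs only appear after
# JavaScript runs, which a plain HTTP fetch never executes.
spa_shell = """
<html><body>
  <div id="root">Loading...</div>
  <script>/* fetches and renders the actual docs here */</script>
</body></html>
"""

parser = TextExtractor()
parser.feed(spa_shell)
print(parser.chunks)  # ['Loading...'] -- the pipeline sees only the placeholder
```

Everything downstream of an extractor like this, no matter how smart the model, only reasons over that placeholder.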
00:00:53The Agent Reading Test targets these specific failure modes. One example is the boilerplate
00:00:59burial, where the actual content is placed after 80,000 characters of inline CSS. If an agent has
00:01:06a small context window for its initial fetch, it might only see the styling code and conclude
00:01:12that the page is empty. The test includes 10 distinct challenges like this that help us
00:01:17identify whether the agent is actually reading the whole page. For example, there is the truncation test.
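Both the boilerplate burial and the truncation test come down to the same mechanic: a pipeline that keeps only the first N characters of a page. A minimal sketch, with a made-up canary string and made-up sizes:

```python
# Made-up canary and sizes for illustration; the real test places
# its own tokens at its own offsets.
CANARY = "CANARY-ART-042"

# 80,000 characters of filler inline CSS, then the actual content.
page = "body { margin: 0; padding: 0; }\n" * 2500 + f"<p>Docs start here. {CANARY}</p>"

CONTEXT_BUDGET = 50_000          # hypothetical fetch/context limit
seen = page[:CONTEXT_BUDGET]

print(CANARY in page)  # True  -- the token really is on the page
print(CANARY in seen)  # False -- the truncated fetch never reaches it
```

An agent whose pipeline works like `seen` will honestly report that the page contains nothing but CSS.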
00:01:22Canaries are placed at various intervals, such as 75k and 130k characters, to test whether
00:01:30the agent's pipeline cuts off long documentation. Also, many modern sites use single-page
00:01:36applications where the content only appears after JavaScript runs. Many agents see only the
00:01:43loading spinner and the empty shell of the page. But this test helps us identify if that is truly
00:01:49the case. Sometimes broken markup is the culprit. For example,
00:01:54an unclosed markdown tag can swallow the rest of the page content, making it invisible to the
00:02:00agent's parser. And sometimes documentation hides information behind language tabs, like switching
00:02:06between the Python example and the Java example. If the agent only scrapes the first tab, it misses
00:02:12the rest of the information. So this test goes over these and other similar challenges to evaluate the
00:02:17agent's true ability to read a page and then gives you a final score out of 20. But we also have to
00:02:23keep in mind that this test is not bulletproof. Some agents actually managed to cheat through it
00:02:28using sneaky tactics. One of the most interesting findings from the test is score inflation. During
00:02:35early testing with agents like Claude Code, the agents would often claim they found 17 or 18 tokens
00:02:42even when they had actually found only 15. They do this through workarounds. For example, if a page
00:02:48uses a redirect that the agent's pipeline doesn't follow, the agent might notice the redirect in the
00:02:54header, manually fetch the new URL in a second step, and claim the credit. While this is helpful,
00:03:00it masks the fact that the agent's automated reading tool is actually broken. So in some
00:03:05instances, score inflation can still occur. So take this test with a grain of salt. But with that said,
00:03:11let's go ahead and try it out for ourselves. And running the test is pretty straightforward.
00:03:16You can run it by pointing your favorite AI agent or browsing tool at agentreadingtest.com and asking it
00:03:23to find all the canary tokens on the site. Then you compare its list against the answer key
00:03:29provided on the site. I'll show you how that works in a second. So in my case, I asked Kimi 2.5 to
00:03:35conduct the test. I just prompted it with the initial prompt and let it do its thing. It took
00:03:40Kimi roughly two minutes to go through the entire test. And by the end, we get this long text output,
00:03:46most of which we can ignore, because we are only interested in the canary markers it returns
00:03:52to us. So find the section where the agent outputs the markers themselves. This list is what
00:03:58actually determines how well the agent did on the test. So we should copy that list and then
00:04:04paste it into the score section of the website to get the final results. And as you can see,
00:04:10Kimi 2.5 scored 13 out of 20 points. And we also get a more detailed overview on where the
00:04:16agent did well and where it failed. As you can see, Kimi had some trouble reading tabbed content.
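Tabbed documentation typically keeps every tab's markup in the DOM and merely hides the inactive panels, so the information is there if the agent looks for it. A sketch of the difference, using invented markup and a quick regex (a real scraper should use a proper HTML parser):

```python
import re

# Hypothetical tabbed-docs markup: both panels are in the HTML,
# but only the first is visible without clicking.
html = """
<div class="tab-panel" data-lang="python">pip install example-sdk</div>
<div class="tab-panel hidden" data-lang="java">implementation 'com.example:sdk'</div>
"""

# Pull the inner text of every tab panel.
panels = re.findall(r'<div class="tab-panel[^"]*"[^>]*>(.*?)</div>', html)

first_tab_only = panels[0]    # what a lazy scrape sees
all_tabs = " ".join(panels)   # what a thorough read recovers

print("com.example" in first_tab_only)  # False -- the Java tab is missed
print("com.example" in all_tabs)        # True  -- it was in the DOM all along
```

The fix most thorough agents apply is simply to read every panel, not just the visible one.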
00:04:23And we also see that it had difficulties properly reading markdown content. So overall, I think this
00:04:28is a pretty cool test that gives you some sense of how agents actually read the web and identifies
00:04:33where they're taking shortcuts or hallucinating. And I also think that this is
00:04:38a good reminder that even with all the intelligence of modern agents, there are still some specific
00:04:44areas of the web where agents struggle to accurately retrieve information. So there you
00:04:49have it folks, that is the agent reading test in a nutshell. What are your thoughts on it?
00:04:54If you end up running this test for other AI agents, post your results in the comment section
00:04:59down below. I'll be very curious to see which agents have the best scores. And folks, if you like
00:05:04these types of technical breakdowns, please let me know by smashing that like button underneath the
00:05:08video. And also don't forget to subscribe to our channel. This has been Andris from Better Stack,
00:05:14and I will see you in the next videos.