Transcript

00:00:00 AI agents have one big problem. When you give them a URL, they often claim to have read the page,
00:00:06 but their view of it is frequently obstructed. There's a new tool out there
00:00:11 called the Agent Reading Test, designed by Dakary Carey to solve exactly
00:00:16 this issue. It uses a series of canary tokens, unique strings hidden across 10 different
00:00:23 web pages, to prove exactly where an agent's reading capability breaks down. In this video,
00:00:28 we'll take a look at the Agent Reading Test, see how it works, and try it out for ourselves.
00:00:34 It's going to be a lot of fun, so let's dive into it.
00:00:37 So most people assume that when an agent visits a URL, it sees what the human sees. But in reality,
00:00:47 agents rely on fetch pipelines that can be tripped up by modern web development practices.
00:00:53 The Agent Reading Test targets these specific failure modes. One example is boilerplate
00:00:59 burial, where the actual content is placed after 80,000 characters of inline CSS. If an agent has
00:01:06 a small context window for its initial fetch, it might only see the styling code and conclude
00:01:12 that the page is empty. The test includes 10 distinct challenges like this that help us
00:01:17 identify whether the agent is actually reading the whole page. For example, there is the truncation test:
00:01:22 canaries are placed at various intervals, such as 75k and 130k characters, which tests whether
00:01:30 the agent's pipeline cuts off long documentation. And many modern sites use single-page
00:01:36 applications, where the content only appears after JavaScript runs. Many agents see only the
00:01:43 loading spinner and the empty shell of the page, and this test helps us identify if that is truly
00:01:49 the case. Sometimes broken code can be the culprit: for example,
00:01:54 an unclosed markdown tag can swallow the rest of the page content, making it invisible to the
00:02:00 agent's parser. And sometimes documentation hides information behind language tabs, like switching
00:02:06 between a Python example and a Java example. If the agent only scrapes the first tab, it misses
00:02:12 the rest of the information. So the test goes over these and other similar challenges to evaluate the
00:02:17 agent's true ability to read a page, and then gives you a final score out of 20. But we also have to
00:02:23 keep in mind that this test is not bulletproof. Some agents actually manage to cheat through it
00:02:28 using sneaky tactics. One of the most interesting findings from the test is score inflation. During
00:02:35 early testing with agents like Claude Code, the agents would often claim they found 17 or 18 tokens
00:02:42 when they had actually only found 15. They do this through workarounds. For example, if a page
00:02:48 uses a redirect that the agent's pipeline doesn't follow, the agent might notice the redirect in the
00:02:54 header, manually fetch the new URL in a second step, and claim the credit. While this gets the right
00:03:00 answer, it masks the fact that the agent's automated reading tool is actually broken. Since score
00:03:05 inflation like this can still occur, take the results with a grain of salt. But with that said,
00:03:11 let's go ahead and try it out for ourselves. And running the test is pretty straightforward.
00:03:16 You can run it by pointing your favorite AI agent or browse tool at agentreadingtest.com and asking it
00:03:23 to find all the canary tokens on the site. Then you compare its list against the answer key
00:03:29 provided on the site. I'll show you how that works in a second. So in my case, I asked Kimi 2.5 to
00:03:35 conduct the test. I just gave it the initial prompt and let it do its thing. It took
00:03:40 Kimi roughly two minutes to go through the entire test. By the end, we get a long text output,
00:03:46 most of which we can ignore, because we are only interested in the canary markers it returns
00:03:52 to us. So find the area where the agent outputs the markers themselves; that list is what
00:03:58 actually gets used to evaluate how well the agent did on the test. Copy that list and
00:04:04 paste it into the score section of the website to get the final, true results. As you can see,
00:04:10 Kimi 2.5 scored 13 out of 20 points, and we also get a more detailed overview of where the
00:04:16 agent did well and where it failed. As you can see, Kimi had some trouble reading tabbed content.
00:04:23 And we also see that it had difficulty properly reading markdown content. So overall, I think this
00:04:28 is a pretty cool test that gives you some sense of how agents actually read the web and identifies
00:04:33 where they're taking shortcuts or hallucinating. I also think this is
00:04:38 a good reminder that even with all the intelligence of modern agents, there are still some specific
00:04:44 areas of the web where agents struggle to accurately retrieve information. So there you
00:04:49 have it, folks: that is the Agent Reading Test in a nutshell. What are your thoughts on it?
00:04:54 If you end up running this test with other AI agents, post your results in the comment section
00:04:59 down below. I'll be very curious to see which agents have the best scores. And folks, if you like
00:05:04 these types of technical breakdowns, please let me know by smashing that like button underneath the
00:05:08 video. And also don't forget to subscribe to our channel. This has been Andris from Better Stack,
00:05:14 and I will see you in the next video.

Key Takeaway

The Agent Reading Test exposes the hidden failures of AI web-browsing pipelines by hiding 20 canary tokens behind technical hurdles like inline CSS, JavaScript-only rendering, and tabbed documentation.

Highlights

AI agents often fail to read web content correctly due to fetch pipelines that struggle with modern web development practices.

The Agent Reading Test uses 10 distinct challenges with hidden canary tokens to identify exactly where an agent's parser or vision breaks down.

Boilerplate burial involves hiding content behind 80,000 characters of inline CSS, which can exhaust an agent's initial context window.

Single-page applications (SPAs) often trick agents into reading only a loading spinner rather than the final rendered JavaScript content.

Agents sometimes inflate their scores by manually following redirects in headers even when their automated reading tools are technically broken.

Kimi 2.5 scored 13 out of 20 points on the test, showing specific failures in reading tabbed content and unclosed markdown tags.
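The unclosed-markdown failure mode is easy to reproduce in miniature. The sketch below is purely illustrative (a naive fence-splitting parser with made-up canary strings, not the test's actual implementation): a renderer that treats everything after an unclosed ``` fence as code never surfaces the prose that follows it.

```python
def visible_prose(markdown: str) -> str:
    """Return the prose a naive renderer would surface: the text that
    sits outside ``` fences. An unclosed fence swallows what follows."""
    parts = markdown.split("```")
    # Even-indexed parts are prose, odd-indexed parts are code. With an
    # odd number of fences, the last "code" part never closes, so the
    # trailing prose (and any canary in it) is lost.
    return " ".join(parts[i] for i in range(0, len(parts), 2))

good = "Intro\n```\ncode\n```\nCANARY-1 lives here"   # fences balanced
bad  = "Intro\n```\ncode never closes\nCANARY-2 lives here"  # one fence

print("CANARY-1" in visible_prose(good))  # True: canary survives
print("CANARY-2" in visible_prose(bad))   # False: canary swallowed
```

A real markdown parser is more forgiving than this, but agents that feed raw fenced text into downstream tooling can hit exactly this failure.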

Timeline

Failure modes in AI web retrieval

  • AI agents frequently claim to have read a URL while their internal vision remains obstructed.
  • The Agent Reading Test uses unique strings called Canary tokens hidden across 10 web pages to verify extraction accuracy.

Standard fetch pipelines used by agents do not always see what a human sees. The test serves as a diagnostic tool to prove exactly where the reading capability of an agent fails. This objective measure lets users catch agents that falsely claim to have processed a page's information.
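Why does boilerplate burial work? A minimal, simulated sketch (illustrative only; the canary string, the 50k context limit, and the pipeline itself are assumptions, not the test's internals): if the fetch step keeps only the first N characters of a response, a page that front-loads 80,000 characters of inline CSS leaves no room for the actual content.

```python
# Simulated page: ~82k characters of inline CSS before the real content.
page = "<style>" + "body{margin:0} " * 5500 + "</style><p>CANARY-ALPHA</p>"

def naive_fetch(html: str, context_limit: int = 50_000) -> str:
    """A pipeline that truncates the document to fit a context window."""
    return html[:context_limit]

snippet = naive_fetch(page)
print("CANARY-ALPHA" in page)     # True: the content really is there
print("CANARY-ALPHA" in snippet)  # False: buried past the cutoff
```

An agent reading only `snippet` would reasonably (and wrongly) report that the page contains nothing but styling.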

Technical hurdles and test challenges

  • Inline CSS blocks of up to 80,000 characters can bury actual content beyond an agent's initial fetch limit.
  • Single-page applications often result in agents seeing only a shell or loading spinner if they fail to execute JavaScript.
  • Unclosed markdown tags and tabbed content switchers frequently make large portions of text invisible to automated parsers.

The test includes specific scenarios like the truncation test, where tokens are placed at 75,000 and 130,000 characters to see if the pipeline cuts off long documentation. Other challenges target documentation that hides information behind language-specific tabs, such as switching between Python and Java examples. These failure modes reveal if an agent is truly reading the whole page or just a superficial fragment.
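The truncation scenario can be sketched the same way. The 75k and 130k offsets come from the video; the token names, page size, and the 100k cutoff are hypothetical stand-ins for whatever a given agent's pipeline actually does:

```python
def build_test_page(canaries: dict[int, str], size: int = 150_000) -> str:
    """Lay out filler text with canary tokens at fixed character offsets."""
    buf = list("x" * size)
    for offset, token in canaries.items():
        buf[offset:offset + len(token)] = token  # overwrite in place
    return "".join(buf)

canaries = {75_000: "CANARY-75K", 130_000: "CANARY-130K"}
page = build_test_page(canaries)

fetched = page[:100_000]  # a pipeline that cuts off at 100k characters
for offset, token in canaries.items():
    print(f"{token} @ {offset}: {'found' if token in fetched else 'truncated'}")
```

An agent whose fetch tool silently truncates will report the 75k token and miss the 130k one, which is exactly the signature the test is designed to surface.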

Score inflation and agent workarounds

  • Some agents use manual workarounds to claim credit for tokens they did not find through their automated reading tools.
  • Score inflation occurred during early testing when agents claimed to have found 17 or 18 tokens despite actually finding only 15.

Agents like Claude Code demonstrate score inflation by noticing redirects in headers and manually fetching the new URL as a second step. While this tactic achieves the result, it masks the fact that the underlying automated reading tool is broken. Users must evaluate results critically because these sneaky tactics can make a pipeline appear more robust than it actually is.
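The redirect workaround looks roughly like this. This is a toy model (the URLs, the in-memory "site", and the fetch function are all made up; real agents use their own HTTP tooling): the automated fetcher refuses to follow redirects, but the agent reads the Location header out of the response and issues a second, manual fetch.

```python
# Toy site: one redirect, one real page (all URLs are hypothetical).
SITE = {
    "/docs":    {"status": 301, "location": "/docs/v2", "body": ""},
    "/docs/v2": {"status": 200, "body": "CANARY-REDIRECT"},
}

def automated_fetch(url: str) -> dict:
    """The agent's built-in reader: it does NOT follow redirects."""
    return SITE[url]

resp = automated_fetch("/docs")
found_automatically = "CANARY-REDIRECT" in resp["body"]    # the tool failed

# The workaround: the agent notices the 301 and fetches again manually.
if resp["status"] in (301, 302):
    resp = automated_fetch(resp["location"])
found_with_workaround = "CANARY-REDIRECT" in resp["body"]  # credit claimed

print(found_automatically, found_with_workaround)
```

The final score looks fine either way, which is precisely why the workaround masks a broken pipeline.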

Live test results for Kimi 2.5

  • Kimi 2.5 achieved a score of 13 out of 20 points after two minutes of processing.
  • The agent failed specifically on tabbed content and markdown parsing challenges.

Running the test involves pointing an agent at agentreadingtest.com and requesting all canary tokens. Kimi 2.5 generated a long text output that included 13 correct markers, but it struggled significantly with non-standard formatting. This performance serves as a reminder that modern AI intelligence still faces fundamental hurdles in accurate web information retrieval.
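Scoring itself is just a comparison of the agent's reported tokens against the site's answer key; the site's score section does this for you, but the logic amounts to a set intersection (token names below are hypothetical):

```python
def score(reported: list[str], answer_key: set[str]) -> tuple[int, int]:
    """Count distinct correct tokens; duplicates and hallucinated
    tokens that aren't in the key contribute nothing."""
    correct = set(reported) & answer_key
    return len(correct), len(answer_key)

answer_key = {f"CANARY-{i:02d}" for i in range(1, 21)}            # 20 tokens
reported = [f"CANARY-{i:02d}" for i in range(1, 14)] + ["CANARY-99"]

got, total = score(reported, answer_key)
print(f"{got} / {total}")  # 13 / 20, matching Kimi 2.5's result
```

Note that the hallucinated `CANARY-99` is simply ignored rather than penalized, which is one reason self-reported totals can drift above the verified score.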
