00:00:00When you need a scraper for RAG, the problem isn't getting the data,
00:00:03it's cleaning it.
00:00:04JavaScript breaks things, HTML is messy,
00:00:07and we just waste time trying to make it usable for the LLM.
00:00:11The fix to this is Crawl4AI.
00:00:13It's built for AI, has async,
00:00:15handles JavaScript, outputs clean markdown or JSON,
00:00:18and it runs up to six times faster than traditional Python scrapers like Scrapy.
00:00:23We get model-ready data faster than I've ever seen before.
00:00:26How does this work? How is this different?
00:00:29These are the questions.
00:00:30[MUSIC]
00:00:35So what is Crawl4AI really?
00:00:37At first, it just seems like another Python crawler,
00:00:40but it's not built for scraping, it's built for AI.
00:00:43Here's the difference.
00:00:44Most crawlers give us raw HTML,
00:00:46Crawl4AI gives us clean markdown or structured JSON, ready for an LLM.
00:00:52It handles JavaScript using Playwright,
00:00:54it runs async so it actually scales,
00:00:57and it has the prefetch mode which skips heavy rendering when you just need the links.
00:01:01Now this matters because if we're building chatbots,
00:01:04assistants, these agents, our problem isn't crawling,
00:01:08it's turning that messy web page data into usable data.
00:01:11Crawl4AI removes this entire problem, fast.
00:01:15If you enjoy this kind of content, be sure to subscribe.
00:01:18We have videos coming out all the time.
00:01:20Let's start simple. Here's the most basic crawl I spun up.
00:01:23A lot of this I got from their repo and docs,
00:01:25and I just tweaked a few lines to get this to run.
00:01:28I imported AsyncWebCrawler, which handles asynchronous web requests for AI pipelines.
00:01:34Then I call arun on a tech news URL, and that's it.
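The basic crawl described here looks roughly like this. It's a sketch based on the Crawl4AI README (the function name and example URL are mine, and the library import is done lazily so you can read it top to bottom; it requires `pip install crawl4ai` plus its Playwright setup step to actually run):

```python
import asyncio

async def fetch_markdown(url: str) -> str:
    # Lazy import: the sketch stays readable even without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return result.markdown  # clean, LLM-ready markdown instead of raw HTML

if __name__ == "__main__":
    # Stand-in URL; any article page works.
    print(asyncio.run(fetch_markdown("https://news.ycombinator.com")))
```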
00:01:38Now look at this output.
00:01:40This isn't raw HTML we're getting back,
00:01:43it's clean markdown, it's clean JSON.
00:01:45Headings structured, links preserved,
00:01:47and under it all, it fetches the page,
00:01:50parses the DOM, removes the noise,
00:01:52and it ranks the content so we can keep the important stuff without all that extra jargon.
00:01:57Now, if we need to build a news summarizer in this case or a RAG prototype,
00:02:02we don't write cleaning scripts,
00:02:04we just pass this directly into our model.
00:02:07I expected this, like any other scraping tool, to just scrape,
00:02:11but what we actually got is data that's already prepped for us.
00:02:14That gap is time saved.
00:02:17Now, it does get even more interesting from here.
00:02:19As I was playing around with it,
00:02:20I assumed rendering every page was necessary.
00:02:23That's not actually the case. Watch this.
00:02:25This is the same crawler,
00:02:27but now we set prefetch to true.
00:02:30I'm going to hit Hacker News.
00:02:31See how fast this runs?
00:02:33This was actually insane how fast it ran.
00:02:35Instead of rendering every page,
00:02:37it grabs links first,
00:02:38just async fetching.
00:02:39If you're building an aggregator, this is nice.
00:02:42We first find what we need,
00:02:44then we can extract it later on.
00:02:45We don't crawl everything,
00:02:47just what we need.
00:02:48That difference scales when you're dealing with
00:02:50hundreds or thousands of URLs.
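The link-first idea doesn't depend on any one library. As a rough stdlib illustration of the pattern, not Crawl4AI's actual API (its flag names vary by version), you harvest hrefs cheaply from raw HTML first and decide later which pages deserve a full render:

```python
from html.parser import HTMLParser

class LinkHarvester(HTMLParser):
    """Collects href values from anchor tags -- no rendering, no JavaScript."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html: str) -> list[str]:
    parser = LinkHarvester()
    parser.feed(html)
    return parser.links

# Cheap pass: pull links from raw HTML, extract full content later.
page = '<a href="/item?id=1">Story</a> <a href="/item?id=2">Other</a>'
print(extract_links(page))  # ['/item?id=1', '/item?id=2']
```

This is why the link-only pass is so fast: it's just async HTTP plus a parse, with the expensive browser rendering deferred to the handful of URLs you actually keep.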
00:02:52Now, let's talk a bit more about the production side of things.
00:02:55I'll run a deep crawl using BFSDeepCrawlStrategy.
00:02:58This is their version of a breadth-first search approach.
00:03:01I then define resume state right here,
00:03:03and I add an on state change callback.
00:03:07I'm going to start the crawl,
00:03:08and then I'm going to kill the process.
00:03:10Now, most tools when you kill that process,
00:03:13they start over again.
00:03:14But when I rerun this, watch this,
00:03:16we restart using the saved JSON state,
00:03:19and it continues exactly where it left off without losing anything.
00:03:22So when we're building a large RAG knowledge base,
00:03:24a crash isn't just annoying,
00:03:26it's expensive.
00:03:29But here it picks back up.
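Crawl4AI manages this state for you. As a minimal stdlib sketch of the same resume pattern (the file name and queue shape are hypothetical, not the library's internals):

```python
import json
from pathlib import Path

STATE_FILE = Path("crawl_state.json")  # hypothetical checkpoint location

def load_state(seed_urls):
    """Resume from saved JSON state if present, otherwise start fresh."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"pending": list(seed_urls), "done": []}

def save_state(state):
    # Written after every page, so a killed process loses at most one URL.
    STATE_FILE.write_text(json.dumps(state))

def crawl(seed_urls, fetch):
    state = load_state(seed_urls)
    while state["pending"]:
        url = state["pending"].pop(0)
        fetch(url)                 # the real work: fetch + parse + store
        state["done"].append(url)
        save_state(state)          # checkpoint; a rerun continues from here
    return state["done"]
```

Kill the process mid-run and `load_state` picks up the remaining `pending` URLs on the next invocation instead of starting over.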
00:03:30Now, the part most scrapers can't do is semantic extraction.
00:03:35If I define a schema using Pydantic,
00:03:37so job title, company, salary,
00:03:39I'm going to scrape Indeed.
00:03:40Then I configure their class LLM extraction strategy
00:03:44with an instruction and provider.
00:03:46I'm going to run it on Indeed,
00:03:48the job listing site,
00:03:49and look at this output.
00:03:51That's really good actually.
00:03:52Structured JSON with clean fields,
00:03:54and here's what's happening.
00:03:56Crawl4AI converts the page,
00:03:58like I've said, to markdown or JSON, you choose.
00:04:01It will then send it to a model.
00:04:03The model structures it based on your schema.
00:04:06It's not scraping text,
00:04:07just extracting what we want.
00:04:09The LLM now can handle this.
00:04:11That's a completely different level of capability for LLM-style tools.
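Put together, the extraction setup described here looks roughly like this. It's a sketch from the Crawl4AI docs: the provider string, API key placeholder, and schema fields are assumptions, argument names can shift between versions, and I've used a plain JSON Schema dict in place of the Pydantic model from the video so the sketch needs nothing beyond Crawl4AI itself:

```python
import asyncio
import json

# JSON Schema describing the fields we want back per job listing.
JOB_SCHEMA = {
    "type": "object",
    "properties": {
        "job_title": {"type": "string"},
        "company":   {"type": "string"},
        "salary":    {"type": "string"},
    },
}

async def extract_jobs(url: str):
    # Lazy imports: requires `pip install crawl4ai` to actually run.
    from crawl4ai import AsyncWebCrawler
    from crawl4ai.extraction_strategy import LLMExtractionStrategy

    strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",    # assumption: any supported provider string
        api_token="YOUR_API_KEY",         # or point at a local model via Ollama
        schema=JOB_SCHEMA,
        extraction_type="schema",
        instruction="Extract job title, company, and salary for each listing.",
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, extraction_strategy=strategy)
        # The model returns structured JSON matching the schema above.
        return json.loads(result.extracted_content)

if __name__ == "__main__":
    print(asyncio.run(extract_jobs("https://www.indeed.com/jobs?q=python")))
```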
00:04:15Now, this was cool,
00:04:16but let's zoom out for a second.
00:04:18The pros here, it's fast.
00:04:20It's really fast, up to six times faster in benchmarks.
00:04:22It handles JavaScript automatically.
00:04:24It's async and scalable,
00:04:26and it resumes after crashes.
00:04:28But the highlight here,
00:04:29it integrates LLMs directly.
00:04:32Plus, it's open source.
00:04:33I just had to pip install this and we're up and running.
00:04:35Now, just like with anything,
00:04:36there are trade-offs here.
00:04:38It's Python only, right?
00:04:39We might use Python; you might not.
00:04:41That could be a drawback.
00:04:42LLM features require API keys unless you run local models with Ollama.
00:04:46Heavy crawling can still hit rate limits,
00:04:49and like any fast-moving open source project,
00:04:51you've got to keep it updated, right?
00:04:53Updates are being pushed.
00:04:54But for AI-focused devs,
00:04:56it removes a lot of pain points, right?
00:04:58Especially in these RAG pipelines.
00:05:00Now, let's compare it to what you might already be using.
00:05:03Scrapy, if you're coming from Python, boom.
00:05:05It's great for massive static crawls,
00:05:07but it's rule-based and it's boilerplate heavy.
00:05:10Honestly, it takes a lot of time to set that up.
00:05:13If you want LLM extraction on markdown output,
00:05:16you're building custom layers.
00:05:17With Crawl4AI, that's just built in.
00:05:19Then you have Beautiful Soup.
00:05:21It's really lightweight and simple,
00:05:23but it's really just a parser.
00:05:25There's no crawling engine,
00:05:26there's no JavaScript rendering.
00:05:28You'll end up just stitching a bunch of things together.
00:05:31Then, of course, the big one is Selenium.
00:05:33It renders JavaScript, sure, right?
00:05:36But it's slower and it's more manual.
00:05:38Scaling async workflows is still gonna be a pain.
00:05:42Crawl4AI wraps Playwright internally
00:05:44and exposes a clear async API.
00:05:46If you're building traditional rule-based crawlers
00:05:48for static data,
00:05:49your existing tools are honestly gonna be fine.
00:05:52But if you're building AI systems,
00:05:54RAG pipelines, autonomous agents,
00:05:56Crawl4AI is purpose-built for that kind of world,
00:06:00and it's just a really cool AI tool.
00:06:02It doesn't just crawl pages,
00:06:04it prepares the data the LLM needs.
00:06:06So that's Crawl4AI.
00:06:08If you're into AI, this is probably worth checking out.
00:06:11It was super fast, I was actually shocked by that.
00:06:14And it's really cool that,
00:06:15if we're building these RAG-style pipelines,
00:06:17we can push that data cleanly into our LLMs.
00:06:20We'll see you in another video.