Transcript
00:00:00This is Scrapling, a Python scraper that tries to fix the worst part of web scraping.
00:00:05The scraper works today, then breaks the second the site changes. One renamed class,
00:00:10one moved div, one bot check, and now your data pipeline is dead. Scrapling's whole claim is that
00:00:17your scraper can adapt instead of falling apart. It has over 53,000 stars on GitHub,
00:00:22it supports adaptive parsing, stealth fetching, and bigger crawler workflows.
00:00:27I'm going to test the one question that actually matters.
00:00:30Can it survive a website change without rewriting selectors? We're about to find out.
00:00:40So what is Scrapling? Scrapling is an adaptive all-in-one Python web scraping framework.
00:00:46You get a self-healing parser, stealth fetchers, browser-based fetching when JavaScript is needed,
00:00:51and a spider framework for bigger crawls. One install, one API. That means fewer broken
00:00:57scrapers and more usable data. Now, let's see the part that actually matters.
00:01:03If you enjoy coding tools to speed up your workflow, be sure to subscribe. We have videos coming out all
00:01:08the time. Now, here I have a basic setup, right? I've already installed Scrapling, so we'll keep this
00:01:13part fast. One import and one call is all we need to get the page. Up top here, I made some HTML that changes.
00:01:21One is like a generic starting site. Then I kept the same thing, but I switched the CSS selectors.
00:01:27Let's say I want the product name and price. Now, normally I might grab them with CSS selectors,
00:01:34right? So page.css, I drop in my selector, auto_save=True. I can do that and it's going to
00:01:40work and we're going to get a dictionary of data back to us. Looks normal. Two selectors, a dictionary,
00:01:46I move on. That's it. But at the same time, that's actually the problem because a normal scraper works
00:01:52great until that page changes. Now, what happens if the site randomly changes overnight? They redesign
00:01:58it. They do something to prevent this. So product title becomes item heading or product price becomes
00:02:04pricing value. It's got the same data on the page, but the entire DOM changes. The old selectors should
00:02:11be dead. And this is where most scrapers are just going to break. But now we can turn on adaptive mode.
00:02:18One change: auto_save=True becomes adaptive=True. So now I can still put product title
00:02:26with adaptive set to true. Same data. I didn't change the selectors. It's a different page structure without
00:02:34the selector rewrite. That's the main idea here. Now, when you scrape an element with auto_save=True,
00:02:40Scrapling records clues about it. So it's going to record things like the tag, attributes,
00:02:44parents and children, any neighboring text, probably the DOM position and the structural shape. So when a
00:02:50class name changes, Scrapling has more clues left. It doesn't need the entire site to stay the same.
00:02:56It only needs enough structural signal to recognize the element again. And that's the
00:03:01part that matters because real scraper failures are almost never a total redesign. It's a renamed class,
00:03:06a new wrapper, a shifted layout, one tiny thing. That's exactly what adaptive matching is built for.
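
[Editor's sketch] The idea described above (save clues about an element, then find the closest match after the page changes) can be illustrated with the standard library alone. This is not Scrapling's actual algorithm or API. The page markup, the hyphenated class names, the product values, the scoring weights, and the helper names (fingerprints, find_by_class, similarity, relocate) are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Hypothetical "before" and "after" pages: same data, renamed classes, new wrapper.
OLD_PAGE = ('<div class="product"><h2 class="product-title">Desk Lamp</h2>'
            '<span class="product-price">$24.99</span></div>')
NEW_PAGE = ('<section class="item"><div class="wrapper">'
            '<h2 class="item-heading">Desk Lamp</h2>'
            '<span class="pricing-value">$24.99</span></div></section>')


class FingerprintParser(HTMLParser):
    """Record a small fingerprint (tag, classes, parent tag, text) per element."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements
        self.elements = []  # every fingerprint, in document order

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1]["tag"] if self.stack else ""
        classes = (dict(attrs).get("class") or "").split()
        fp = {"tag": tag, "classes": classes, "parent": parent, "text": ""}
        self.stack.append(fp)
        self.elements.append(fp)

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text"] += data.strip()

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()


def fingerprints(html):
    parser = FingerprintParser()
    parser.feed(html)
    parser.close()
    return parser.elements


def find_by_class(elements, class_name):
    """The 'normal' selector: exact class lookup, returns None when it breaks."""
    return next((el for el in elements if class_name in el["classes"]), None)


def similarity(saved, candidate):
    """Weighted score in [0, 1]; the weights are arbitrary illustration values."""
    score = 0.4 if saved["tag"] == candidate["tag"] else 0.0
    score += 0.1 if saved["parent"] == candidate["parent"] else 0.0
    score += 0.3 * SequenceMatcher(None, saved["text"], candidate["text"]).ratio()
    score += 0.2 * SequenceMatcher(
        None, " ".join(saved["classes"]), " ".join(candidate["classes"])
    ).ratio()
    return score


def relocate(saved, elements, threshold=0.5):
    """Return the best-scoring element, or None if nothing is similar enough."""
    best = max(elements, key=lambda el: similarity(saved, el))
    return best if similarity(saved, best) >= threshold else None


# First run: the selectors work, so save the fingerprints (the "auto save" step).
old_elements = fingerprints(OLD_PAGE)
saved_title = find_by_class(old_elements, "product-title")
saved_price = find_by_class(old_elements, "product-price")

# After the redesign: the old class lookup finds nothing...
new_elements = fingerprints(NEW_PAGE)
old_selector_hit = find_by_class(new_elements, "product-title")

# ...but the saved fingerprints still point at the right elements.
result = {
    "name": relocate(saved_title, new_elements)["text"],
    "price": relocate(saved_price, new_elements)["text"],
}
print(old_selector_hit)  # None
print(result)            # {'name': 'Desk Lamp', 'price': '$24.99'}
```

The threshold matters in a sketch like this: without it, a page that no longer contains the element at all would still return its least-bad neighbor.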
00:03:13Scrapling has three big pieces that actually matter. The first is the adaptive parser, what you just saw.
00:03:18Then there are multiple fetchers: one workflow, right tool for the job. Fetcher is for plain HTTP,
00:03:25fast for simple web pages. StealthyFetcher can bypass anti-bot checks when needed. DynamicFetcher is a real
00:03:32browser for JS-heavy sites. One API, swap the fetcher, keep the code. The spider framework is for when quick
00:03:39scripts turn into a real crawler. Async crawling, pause and resume, proxy rotation, streaming, and all those
00:03:46mixed sessions. The stuff you usually add on later, it's already there. Scrapling isn't just another
00:03:53parser. It replaces the scraping stack: Requests, Beautiful Soup, Playwright, retry logic, proxy helpers,
00:04:00and spider code, all with one workflow. Scrapling is not saying Beautiful Soup is useless and it's not saying
00:04:06Playwright or Scrapy is dead. Beautiful Soup plus Requests is still great for simple pages. It's easy,
00:04:13it's readable and everyone understands it, but it does not give you any type of stealth. It does not give
00:04:20you adaptive selectors and it does not render JavaScript. And for larger parsing jobs, it can
00:04:26become the actual bottleneck. Now, Scrapy is powerful. If you are building serious crawling
00:04:31infrastructure, Scrapy still deserves some respect, but Scrapy often means settings, pipelines, middleware,
00:04:36extensions, and a lot more setup. Playwright and Selenium are great when you need a real browser.
00:04:42Sometimes the page just needs JavaScript. There's no way around that. But browsers are heavy. They are
00:04:48slower than raw HTTP and they use more memory. And again, they still don't fix the issue of broken
00:04:54selectors. They run the page. They don't understand what your scraper meant to extract. So with Scrapling,
00:05:01you can use fast fetching when you can, stealth when you need it, use browser rendering when the page
00:05:06requires it, and use adaptive parsing so one small front-end change doesn't blow everything up. Now, all this
00:05:12doesn't mean Scrapling doesn't have issues, right? If you're dealing with DataDome-level protection,
00:05:17advanced fingerprinting, or aggressive rate limits, you may still need good proxies. So Scrapling can
00:05:23help, but it doesn't make you invisible. Dynamic fetching can also mean extra browser setup. That's
00:05:29just the trade-off when JavaScript rendering is involved. Here's some food for thought for all of this.
00:05:34Scrapling is worth trying if you do real scraping work, especially if you're building data pipelines,
00:05:41you have RAG jobs, AI agents, or anything that needs to keep running after the target site changes. The
00:05:47strongest reason to use it is not that it makes scraping possible. We already have tools that can
00:05:53actually do that, right? The strongest reason is that it reduces maintenance. Now, I'd probably just
00:05:59skip it if you have a really tiny script, right? Requests and Beautiful Soup are going to do the trick,
00:06:04right? If you enjoy coding tools like this, be sure to subscribe to the BetterStack channel. We'll see you in another video.