00:00:00We spend way more time processing documents than actually building our AI apps.
00:00:05We connect multiple libraries, finally get a PDF into the pipeline, and the LLM still gives bad answers.
00:00:11Not because the model is bad, but because the markdown is.
00:00:14This is MarkItDown.
00:00:16A tool by Microsoft that's blown up with over 110,000 stars on GitHub,
00:00:21and it fixes the whole pipeline in basically one line of code.
00:00:24I'll show you how all this works in just a few minutes.
00:00:30[Music]
00:00:33Okay, now basically every AI project starts the same.
00:00:36You've got files everywhere: Word docs, slides, spreadsheets, PDFs, screenshots, maybe even audio files.
00:00:43And then comes the cool part, which us devs love.
00:00:46We can start actually stacking tools.
00:00:49So we're going to have a tool for PDFs, a tool for Excel, for Word, right?
00:00:54All these libraries that we're linking together to help us build out this pipeline.
00:00:59At first it feels fine, it works, sure.
00:01:02Then things start to break.
00:01:04Tables lose structure, headings are going to disappear, and then token usage obviously starts to blow up.
00:01:10And now the RAG pipeline is basically just pulling garbage and the agent is giving us bad answers.
00:01:16And we're debugging ingestion instead of actually shipping.
00:01:19And really, what's this doing?
00:01:21It's just wasting time.
00:01:22Not minutes, but hours every single week.
00:01:25So instead of fixing your model, you actually need to fix your input.
00:01:29Let me show you what that actually looks like.
00:01:31If you enjoy tools to speed up your workflow, be sure to subscribe.
00:01:35We have videos coming out all the time.
00:01:37All right, now let me run through this real quick.
00:01:39It's all Python, so it's pretty simple.
00:01:42First, I pip install it all in my virtual environment.
00:01:45I've got a PDF here, just doc.pdf.
00:01:48And I can run this in my terminal.
00:01:50I'm going to run markitdown on doc.pdf and write the output to output.md.
00:01:55That's it.
00:01:56It makes me a file automatically.
00:01:58We can open that file, and inside is pretty much what we were hoping for.
00:02:03Headings are clean, tables actually look like tables, structure is still here.
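The same conversion can be done from Python. This is a minimal sketch, assuming the `markitdown` package is installed and that `doc.pdf` and `output.md` are the file names from the demo. The `convert_file` helper and its `converter` parameter are names made up here for illustration; only `MarkItDown()`, `.convert()`, and `.text_content` come from the library.

```python
from pathlib import Path


def convert_file(src, dest, converter=None):
    """Convert one document to Markdown and write the result to dest.

    `converter` is any object with a .convert(path) method whose result
    exposes .text_content. If omitted, a MarkItDown instance is created.
    """
    if converter is None:
        # Imported lazily so the helper can be exercised without the package.
        from markitdown import MarkItDown

        converter = MarkItDown()

    result = converter.convert(str(src))
    Path(dest).write_text(result.text_content, encoding="utf-8")
    return result.text_content


# Example call (needs the package installed and a real doc.pdf on disk):
# convert_file("doc.pdf", "output.md")
```

Passing the converter in as a parameter is a design choice of this sketch: it lets one instance be reused across many files and makes the helper easy to test with a stand-in object.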
00:02:08And now when we write some code in Python for this, we can do even more with it.
00:02:13So with my imports and using OpenAI, I can make a client and then a MarkItDown object.
00:02:20I'm going to pass in my API key and the model that we want to run.
00:02:25When I run the code, the output is generated in my terminal, so it's super clean.
00:02:29And better yet, what's really cool is I can get a PNG image.
00:02:33For this, I got a chart from NVIDIA.
00:02:35Here is my image with some data on it.
00:02:39Now I can convert the chart into markdown.
00:02:42So I can let MarkItDown do its thing, just by using the convert function again.
00:02:47This time we're giving it our image, our PNG.
00:02:50And here now we get the summary of what that chart is and what we could extract and use for RAG.
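The image step described above might look roughly like this. It is a sketch, not the code shown on screen: the file name `chart.png`, the model name `gpt-4o`, and the `describe_image` helper are assumptions. The `llm_client` and `llm_model` arguments are the ones the MarkItDown README documents for LLM-generated image descriptions.

```python
def describe_image(image_path, api_key, model="gpt-4o"):
    """Return Markdown for an image, with the description written by an LLM.

    The default model name is an assumption; use whichever vision-capable
    model your account provides.
    """
    # Imported inside the function so that defining it does not require
    # either package to be installed.
    from markitdown import MarkItDown
    from openai import OpenAI

    client = OpenAI(api_key=api_key)
    converter = MarkItDown(llm_client=client, llm_model=model)
    return converter.convert(image_path).text_content


# Example call (needs both packages, a valid key, and a real image file):
# print(describe_image("chart.png", api_key="YOUR_KEY"))
```

Without an LLM client, image conversion has no way to produce a description of the chart, so this step costs one model call per image.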
00:02:56This is huge, as it now allows us to extract what we need faster right here in our code
00:03:01so we can keep working without jumping between a bunch of different tabs.
00:03:05So what is MarkItDown really?
00:03:07It's an open source Python tool from Microsoft Research.
00:03:11It's MIT licensed and built specifically for LLM workflows.
00:03:16Its job is to take messy files and turn them into clean Markdown
00:03:19so models can actually understand them.
00:03:22It supports a lot more than we'd actually expect.
00:03:25Word, PowerPoint, Excel, PDF, audio, images, and also things like links, really anything, you name it.
00:03:32It even has an MCP server now, so you can plug it directly into tools like Claude Desktop or even your own agent.
00:03:40Plus the plugins.
00:03:41So instead of building ingestion pipelines, we're now basically just calling one tool.
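To make "calling one tool" concrete, here is a rough sketch of a mixed-folder ingestion loop. The `convert_folder` helper and the extension list are illustrative assumptions, not part of the library; check the MarkItDown documentation for the formats your install actually supports.

```python
from pathlib import Path

# Illustrative subset only; this set is an assumption, not the library's list.
EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".csv"}


def convert_folder(folder, converter=None, extensions=EXTENSIONS):
    """Convert every matching file under `folder` with a single converter.

    Returns (converted, failed): two dicts keyed by the path relative to
    `folder`, holding Markdown text and error messages respectively.
    """
    if converter is None:
        # Imported lazily so the helper can be exercised without the package.
        from markitdown import MarkItDown

        converter = MarkItDown()

    root = Path(folder)
    converted, failed = {}, {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        key = path.relative_to(root).as_posix()
        try:
            converted[key] = converter.convert(str(path)).text_content
        except Exception as exc:  # keep going; one bad file should not stop the batch
            failed[key] = str(exc)
    return converted, failed


# Example call (needs the package installed and a real folder):
# docs, errors = convert_folder("./inbox")
```

Collecting failures instead of raising keeps the batch running, and the `failed` dict gives a list of files that may need a heavier tool.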
00:03:47The devs weren't struggling with models, they were struggling with inputs.
00:03:51And the expectation was, okay, just use better models.
00:03:55But the whole reality of this is better inputs is equal to better outputs.
00:04:00So now instead of writing scripts that keep breaking, people are using one tool for everything: MarkItDown.
00:04:06RAG pipelines, agents, fine-tuning datasets, knowledge bases, document analysis, all of this stuff we're already doing.
00:04:13And the key detail most people miss is it produces structured, token-efficient markdown.
00:04:20So there's less noise going in, and we're getting better answers coming out.
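A quick way to see the token point is to compare the same small table as HTML and as Markdown. This is only a sketch: the table values are made-up sample data, and `rough_token_count` is a crude stand-in that counts words and punctuation marks. A real tokenizer will give different numbers, though markup-heavy text generally costs more.

```python
import re


def rough_token_count(text):
    """Crude token estimate: each word and each punctuation mark counts once."""
    return len(re.findall(r"\w+|[^\w\s]", text))


# Made-up sample data, used only to compare the two encodings.
html_table = (
    '<table class="data"><thead><tr><th>Quarter</th><th>Revenue</th></tr></thead>'
    "<tbody><tr><td>Q1</td><td>10</td></tr><tr><td>Q2</td><td>12</td></tr></tbody></table>"
)
markdown_table = (
    "| Quarter | Revenue |\n"
    "| --- | --- |\n"
    "| Q1 | 10 |\n"
    "| Q2 | 12 |\n"
)

html_tokens = rough_token_count(html_table)
md_tokens = rough_token_count(markdown_table)
print(f"HTML: {html_tokens} rough tokens, Markdown: {md_tokens} rough tokens")
```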
00:04:24But that doesn't mean it's perfect.
00:04:26Now let's compare this to what you're probably already using or you've seen around.
00:04:31Now we do have a tool called Pandoc, so you'd expect Pandoc to win here, all right?
00:04:36But it's solving a different problem from MarkItDown.
00:04:40Pandoc is for humans: publishing, formatting, LaTeX.
00:04:44MarkItDown is for machines: LLMs, pipelines, automation.
00:04:48It's kind of the same idea, but the goal is different.
00:04:51And then we have things like Unstructured or Docling.
00:04:55These are great, but also these are really heavy.
00:04:58They use ML models, they take more setup, and are better for really complex documents.
00:05:03MarkItDown takes the opposite approach to all this.
00:05:05There's less setup, super easy, faster results, and it's good enough for most files.
00:05:11So here's the real trade-off.
00:05:12Do you want perfect extraction, or do you want something that works really fast and reliably?
00:05:18For most of us, speed is going to win.
00:05:20Now, of course, the downsides, complex PDFs are still going to break it, right?
00:05:24Especially dense tables or weird layouts.
00:05:27If you deal with messy scanned documents every day, tools like Docling or Unstructured are going to do way better.
00:05:32And if you want image descriptions, you'll need to plug in an LLM.
00:05:36So it's not perfect, but it is a pretty cool tool that's solving a real problem.
00:05:41So is it worth using?
00:05:43Yeah, for most people, absolutely.
00:05:45If you're building AI apps right now, this is probably what your ingestion layer should actually look like.
00:05:50You should try MarkItDown.
00:05:52Just try it if you want clean input for RAG or agents.
00:05:56You deal with mixed file types.
00:05:58It's really good for that stuff.
00:05:59And you don't want to maintain a bunch of fragile scripts that could break, right?
00:06:03Skip it or combine it if you're working with extremely complex PDFs every single day.
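One way to "combine it" is to try the fast converter first and only fall back to a heavier tool when the result looks poor. The sketch below is an assumption about how that could be wired, not something shown in the video. `fast_convert` and `heavy_convert` are hypothetical callables you would wrap around MarkItDown and a tool like Docling or Unstructured, and the `looks_thin` length check is a deliberately simple placeholder heuristic.

```python
def looks_thin(markdown, min_chars=200):
    """Placeholder quality check: very little text hints at a scanned or image-only file."""
    return len(markdown.strip()) < min_chars


def convert_with_fallback(path, fast_convert, heavy_convert, needs_fallback=looks_thin):
    """Return (markdown, route), where route is "fast" or "heavy".

    fast_convert / heavy_convert: hypothetical callables taking a path and
    returning Markdown text. The heavy one runs only if the fast one raises
    or its output fails the needs_fallback check.
    """
    try:
        markdown = fast_convert(path)
    except Exception:
        return heavy_convert(path), "heavy"
    if needs_fallback(markdown):
        return heavy_convert(path), "heavy"
    return markdown, "fast"
```

Most files take the cheap path this way, and only the difficult ones pay for the slower extraction.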
00:06:08There's other tools out there.
00:06:09If you enjoy open source tools and coding tips like this, be sure to subscribe to the Better Stack channel.
00:06:15We'll see you in another video.