Stop Building RAG Pipelines Like This... Use MarkItDown Instead

Better Stack

Transcript

00:00:00We spend way more time processing documents than actually building our AI apps.
00:00:05We connect multiple libraries, finally get a PDF into the pipeline, and the LLM still gives bad answers.
00:00:11Not because the model is bad, but because the markdown is.
00:00:14This is MarkItDown.
00:00:16A tool by Microsoft that's blown up with over 110,000 stars on GitHub,
00:00:21and it fixes the whole pipeline in basically one line of code.
00:00:24I'll show you how all this works in just a few minutes.
00:00:30[Music]
00:00:33Okay, now basically every AI project starts the same.
00:00:36You've got files everywhere, Word docs, slides, spreadsheets, PDF screenshots, maybe even like audio files.
00:00:43And then comes the cool part, which us devs love.
00:00:46We can start actually stacking tools.
00:00:49So we're going to have a tool for PDFs, a tool for Excel, for Word, right?
00:00:54All these libraries that we're linking together to help us build out this pipeline.
00:00:59At first it feels fine, it works, sure.
00:01:02Then things start to break.
00:01:04Tables lose structure, headings are going to disappear, and then token usage obviously starts to blow up.
00:01:10And now the RAG pipeline is basically just pulling garbage and the agent is giving us bad answers.
00:01:16And we're debugging ingestion instead of actually shipping.
00:01:19And really, what's this doing?
00:01:21It's just wasting time.
00:01:22Not minutes, but hours every single week.
00:01:25So instead of fixing your model, you actually need to fix your input.
00:01:29Let me show you what that actually looks like.
00:01:31If you enjoy tools to speed up your workflow, be sure to subscribe.
00:01:35We have videos coming out all the time.
00:01:37All right, now let me run through this real quick.
00:01:39It's all Python, so it's pretty simple.
00:01:42First, I pip install it all in my virtual environment.
00:01:45I've got a PDF here, just doc.pdf.
00:01:48And I can run this in my terminal.
00:01:50I'm going to run markitdown on doc.pdf and output an MD file.
00:01:55That's it.
00:01:56It makes me a file automatically.
00:01:58We can open that file and inside is pretty much what we were hoping for.
00:02:03Headings are clean, tables actually look like tables, structure is still here.
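That terminal workflow can be sketched like this (assuming a recent markitdown release; the `[all]` extra and the file names are illustrative):

```shell
# Install MarkItDown with all optional format converters
# into the active virtual environment
pip install 'markitdown[all]'

# Convert a PDF to markdown; -o writes the result to a file
markitdown doc.pdf -o output.md
```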
00:02:08And now when we write some code in Python for this, we can do even more with it.
00:02:13So with my imports and using OpenAI, I can make a client and then a MarkItDown object.
00:02:20I'm going to pass in my API key and the model that we want to run.
00:02:25When I run the code, the output is generated in my terminal, so it's super clean.
00:02:29And better yet, what's really cool is it can handle a PNG image.
00:02:33For this, I got a chart from NVIDIA.
00:02:35Here is my image with some data on it.
00:02:39Now I can convert the chart into markdown.
00:02:42So I can let mark it down, do its thing, just by using the convert function again.
00:02:47This time we're giving it our image, our PNG.
00:02:50And here now we get the summary of what that chart is and what we could extract and use for RAG.
00:02:56This is huge, as it now allows us to extract what we need faster right here in our code
00:03:01so we can keep working without jumping between a bunch of different tabs.
00:03:05So what is MarkItDown really?
00:03:07It's an open source Python tool from Microsoft Research.
00:03:11It's MIT licensed, it's built specifically for LLM workflows.
00:03:16Its job is to take messy files and turn them into clean markdown.
00:03:19So models can actually understand them.
00:03:22It supports a lot more than we'd actually expect.
00:03:25Word, PowerPoint, Excel, PDF, audio, images, and also things like links, really anything, you name it.
00:03:32It even has an MCP server now, so you can plug it directly into tools like Claude Desktop or even your own agent.
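For Claude Desktop, that wiring goes through the companion markitdown-mcp package; a minimal config sketch, assuming markitdown-mcp is pip-installed and on your PATH:

```json
{
  "mcpServers": {
    "markitdown": {
      "command": "markitdown-mcp"
    }
  }
}
```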
00:03:40Plus the plugins.
00:03:41So instead of building ingestion pipelines, we're now basically just calling one tool.
00:03:47The devs weren't struggling with models, they were struggling with inputs.
00:03:51And the expectation was, okay, just use better models.
00:03:55But the whole reality of this is better inputs equal better outputs.
00:04:00So now instead of writing scripts that break, people are using one tool for everything: MarkItDown.
00:04:06Rag pipelines, agents, fine-tuning datasets, knowledge bases, document analysis, all of this stuff we're already doing.
00:04:13And the key detail most people miss is it produces structured, token-efficient markdown.
00:04:20So there's less noise going into this, but we're getting better answers, that better output.
00:04:24But that doesn't mean it's perfect.
00:04:26Now let's compare this to what you're probably already using or you've seen around.
00:04:31Now we do have a tool called Pandoc, so you'd expect Pandoc to win here, all right?
00:04:36But it's solving a different problem from MarkItDown.
00:04:40Pandoc is for humans, so publishing, formatting, latex.
00:04:44MarkItDown is for machines, LLMs, pipelines, automation.
00:04:48It's kind of the same idea, but the goal is different.
00:04:51And then we have things like Unstructured or Docling.
00:04:55These are great, but also these are really heavy.
00:04:58They use ML models, they take more setup, and are better for really complex documents.
00:05:03MarkItDown takes the opposite approach to all this.
00:05:05There's less setup, super easy, faster results, and it's good enough for most files.
00:05:11So here's the real trade-off.
00:05:12Do you want perfect extraction, or do you want something that works really fast and reliably?
00:05:18For most of us, speed is going to win.
00:05:20Now, of course, the downsides, complex PDFs are still going to break it, right?
00:05:24Especially dense tables or weird layouts.
00:05:27If you deal with messy scanned documents every day, tools like Docling or Unstructured are going to do way better.
00:05:32And if you want image descriptions, you'll need to plug in an LLM.
00:05:36So it's not perfect, but it is a pretty cool tool that's solving a real problem.
00:05:41So is it worth using?
00:05:43Yeah, for most people, absolutely.
00:05:45If you're building AI apps right now, this is probably what your ingestion layer should actually look like.
00:05:50You should try and use MarkItDown.
00:05:52Just try it if you want clean input for RAG or agents.
00:05:56You deal with mixed file types.
00:05:58It's really good for that stuff.
00:05:59And you don't want to maintain a bunch of fragile scripts that could break, right?
00:06:03Skip it or combine it if you're working with extremely complex PDFs every single day.
00:06:08There's other tools out there.
00:06:09If you enjoy open source tools and coding tips like this, be sure to subscribe to the Better Stack channel.
00:06:15We'll see you in another video.

Key Takeaway

Integrating Microsoft's MarkItDown tool into RAG pipelines replaces fragile, document-specific ingestion scripts with a single, unified method that produces structured, token-efficient markdown for better LLM performance.

Highlights

MarkItDown is an open-source Python tool from Microsoft Research designed specifically to convert diverse file formats into structured, token-efficient markdown for LLM workflows.

The tool supports a wide array of formats, including Word, PowerPoint, Excel, PDF, audio, and images.

MarkItDown simplifies ingestion pipelines by replacing multiple fragmented document-specific libraries with one unified tool.

The tool generates clean markdown output, which preserves headings, tables, and document structure better than many standard extraction methods.

MarkItDown can extract text and summaries from images, such as charts, when combined with an LLM.

Unlike Pandoc, which targets human-readable output like publishing and LaTeX, MarkItDown optimizes output for machine-readable AI pipelines.

Timeline

The Problems of Modern RAG Ingestion

  • Current AI ingestion pipelines rely on chaining multiple, disparate libraries for different file types.
  • Inefficient document processing leads to broken table structures, missing headings, and increased token usage.
  • Engineers spend hours debugging ingestion layers rather than building core features.

Building RAG pipelines often involves connecting specific tools for PDFs, Excel, and Word, which frequently leads to data corruption. Headings disappear and tables lose structure, resulting in garbage data input for the agent. These issues directly cause the LLM to provide poor answers, consuming valuable developer time each week.

Implementing MarkItDown

  • MarkItDown converts files into clean markdown using simple commands or Python code.
  • The tool handles image data by converting charts and visuals into summarized markdown text.
  • The integration workflow requires minimal code, allowing for direct inclusion in existing Python projects.

Installation occurs via pip within a virtual environment. Running the tool against a file like a PDF produces an immediate markdown file with preserved structure, clean headings, and valid tables. Additionally, users can utilize the library to process PNG images, extracting data summaries from charts directly within the codebase without switching contexts.

Strategic Positioning and Limitations

  • MarkItDown serves machine-optimized LLM workflows, while Pandoc targets human-readable formatting.
  • The tool provides a lightweight, fast alternative to heavy, ML-intensive solutions like Unstructured or Docling.
  • Complex, dense PDFs with highly unusual layouts may still require more specialized ingestion tools.

MarkItDown is MIT-licensed and offers an MCP server for integration with tools like Claude Desktop. While it is not perfect for extremely complex or dense PDF documents, it offers a superior trade-off for most developers by prioritizing speed and reliability over exhaustive, heavy-model extraction. It effectively standardizes input layers across diverse projects.
