Chase AI
Analyzing hundreds of PDFs and complex tables locally is a grueling task. Simply installing tools doesn't solve the problem. Real workflow automation begins when you refine messy data into high-purity context that AI can immediately digest.
When using Claude Code, you might encounter situations where it provides figures from Project B in response to questions about Project A. This happens when vector database or knowledge-graph indexes from different projects get mixed together. To prevent it, you must design a standardized folder structure within the project root and pin the paths.
The cleanest structure is to place original files in docs/raw/, MinerU conversion outputs in docs/output/, and RAG-Anything's knowledge graph indexes in docs/context_db/. This separation ensures that state files like kv_store_doc_status.json do not get tangled.
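The layout above takes only a few lines of Python to bootstrap (the directory names follow the structure described here; adjust them to your project):

```python
from pathlib import Path

# Standardized layout: raw sources, MinerU output, RAG-Anything index.
# exist_ok=True makes the script safe to re-run on an existing project.
for sub in ("docs/raw", "docs/output", "docs/context_db"):
    Path(sub).mkdir(parents=True, exist_ok=True)
```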
To ensure Claude Code looks only at these paths, you need to configure .claudecode/config.json.
1. Create a .claudecode directory in the project root.
2. Add rag-anything to the mcpServers section in config.json.
3. Set the RAG_STORAGE_DIR value to ./docs/context_db in the env settings.

Once this configuration is complete, the AI will only utilize data from the specified paths. This increases answer accuracy and eliminates the risk of mixing data with other clients.
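As a sketch, a minimal .claudecode/config.json might look like the fragment below. Only the rag-anything server name and the RAG_STORAGE_DIR value come from this article; the command and args entries are placeholders, so check your RAG-Anything installation for the actual launch command:

```json
{
  "mcpServers": {
    "rag-anything": {
      "command": "python",
      "args": ["-m", "raganything"],
      "env": {
        "RAG_STORAGE_DIR": "./docs/context_db"
      }
    }
  }
}
```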
Scanned PDFs or multi-column layouts degrade OCR recognition rates. If a table is flush against the edge of a page, the YOLO layout detection model might misidentify it as a border and discard it entirely. The solution is simple: add about 40 pixels of white margin around the image.
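Adding that white margin is a one-liner with numpy's pad (cv2.copyMakeBorder with BORDER_CONSTANT does the same job if you prefer OpenCV):

```python
import numpy as np

def add_white_margin(gray: np.ndarray, margin: int = 40) -> np.ndarray:
    """Pad a grayscale page image with a white border so edge-flush
    tables are not mistaken for page borders by layout detection."""
    return np.pad(gray, pad_width=margin, mode="constant", constant_values=255)
```

For color images, pass pad_width=((m, m), (m, m), (0, 0)) instead so the channel axis is left untouched.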
In practice, tables tight against the border have a recognition rate of around 3% without margins, but this jumps to 98% when a 40px margin is added. For blurry scans, use OpenCV to adjust the contrast: applying the standard linear transform g(x) = alpha * f(x) + beta and tuning the alpha (contrast) value between 1.0 and 3.0 will sharpen character boundaries.
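The linear contrast transform can be sketched in plain numpy; this mirrors what OpenCV's cv2.convertScaleAbs(img, alpha=a, beta=b) computes:

```python
import numpy as np

def adjust_contrast(gray: np.ndarray, alpha: float = 1.5, beta: float = 0.0) -> np.ndarray:
    """Linear contrast stretch: g(x) = alpha * f(x) + beta.
    alpha in [1.0, 3.0] steepens intensity differences; beta shifts brightness.
    Values are clipped back into the valid 0-255 range."""
    out = alpha * gray.astype(np.float32) + beta
    return np.clip(out, 0, 255).astype(np.uint8)
```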
Inputting data into MinerU after applying the CLAHE technique via a Python script significantly increases the volume of extracted table data. Forcing an AI to read documents that are blurry even to the human eye is a waste of time.
The biggest hurdle when processing large volumes of documents locally is GPU memory. While MinerU version 2.5 is faster, it frequently causes system freezes when processing large PDFs in environments with less than 24GB of VRAM. For stability, you should lower the num_batch parameter from the default 512 to 32 or 64.
1. Set num_batch to 32 and gpu_memory_utilization to 0.7 in the MinerU configuration file.
2. Adjust the system's memory settings in /etc/sysctl.conf.

Reducing the batch size might slow down processing slightly, but it prevents the process from being forcibly terminated mid-task. Stable completion is more important than speed.
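As a sketch, the two values from step 1 land in MinerU's JSON configuration roughly like this. Only num_batch and gpu_memory_utilization come from the text above; the surrounding structure varies between MinerU releases, so merge these keys into your existing config rather than replacing it:

```json
{
  "num_batch": 32,
  "gpu_memory_utilization": 0.7
}
```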
Once data indexing is finished, it's time to generate results. Since RAG-Anything structures the relationships between tables and formulas, you can pose complex queries in Claude Code. Commands like "Compare the Q3 sales table with the current technical specifications" become possible.
To reduce the time spent on recurring weekly reports, use a clear template:
- Wrap the source data in <context> tags and separate the output format with <format> tags.

By following this workflow, analysts can focus solely on reviewing the drafts created by the AI. There is no reason to waste time manually cross-referencing source data.
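One possible shape for such a weekly-report template, offered as an illustrative sketch (the section names inside <format> are examples, not prescribed fields):

```text
Draft the weekly sales report.

<context>
(paste or reference the indexed Q3 sales table and technical specs here)
</context>

<format>
1. Executive summary (3 bullets)
2. Table: region, Q3 revenue, delta vs. Q2
3. Open risks and follow-ups
</format>
```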