Chase AI
Analyzing hundreds of PDFs and complex tables locally is a grueling task. Simply installing tools doesn't solve the problem. Real workflow automation begins when you refine messy data into high-purity context that AI can immediately digest.
When using Claude Code, you might encounter situations where it provides figures from Project B in response to questions about Project A. This happens when vector database or knowledge-graph indexes from different projects get mixed together. To prevent it, you must design a standardized folder structure within the project root and pin the paths.
The cleanest structure is to place original files in docs/raw/, MinerU conversion outputs in docs/output/, and RAG-Anything's knowledge graph indexes in docs/context_db/. This separation ensures that state files like kv_store_doc_status.json do not get tangled.
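The layout above takes only a few lines of Python to bootstrap (the directory names follow the structure described here; adjust them to your project):

```python
from pathlib import Path

# Standardized layout: raw sources, MinerU output, RAG-Anything index.
# exist_ok=True makes the script safe to re-run on an existing project.
for sub in ("docs/raw", "docs/output", "docs/context_db"):
    Path(sub).mkdir(parents=True, exist_ok=True)
```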
To ensure Claude Code looks only at these paths, you need to configure .claudecode/config.json.
1. Create a .claudecode directory in the project root.
2. Add rag-anything to the mcpServers section in config.json.
3. Set the RAG_STORAGE_DIR value to ./docs/context_db in the env settings.

Once this configuration is complete, the AI will only utilize data from the specified paths. This increases answer accuracy and eliminates the risk of mixing data with other clients.
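As a sketch, a minimal .claudecode/config.json might look like the fragment below. Only the rag-anything server name and the RAG_STORAGE_DIR value come from this article; the command and args entries are placeholders, so check your RAG-Anything installation for the actual launch command:

```json
{
  "mcpServers": {
    "rag-anything": {
      "command": "python",
      "args": ["-m", "raganything"],
      "env": {
        "RAG_STORAGE_DIR": "./docs/context_db"
      }
    }
  }
}
```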
Scanned PDFs or multi-column layouts degrade OCR recognition rates. If a table is flush against the edge of a page, the YOLO layout detection model might misidentify it as a border and discard it entirely. The solution is simple: add about 40 pixels of white margin around the image.
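Adding that white margin is a one-liner with numpy's pad (cv2.copyMakeBorder with BORDER_CONSTANT does the same job if you prefer OpenCV):

```python
import numpy as np

def add_white_margin(gray: np.ndarray, margin: int = 40) -> np.ndarray:
    """Pad a grayscale page image with a white border so edge-flush
    tables are not mistaken for page borders by layout detection."""
    return np.pad(gray, pad_width=margin, mode="constant", constant_values=255)
```

For color images, pass pad_width=((m, m), (m, m), (0, 0)) instead so the channel axis is left untouched.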
In practice, tables tight against the border have a recognition rate of around 3% without margins, but this jumps to 98% when a 40px margin is added. For blurry scans, use OpenCV to adjust the contrast: applying the standard linear transform g(x) = alpha * f(x) + beta and tuning the alpha (contrast) value between 1.0 and 3.0 will sharpen character boundaries.
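The linear contrast transform can be sketched in plain numpy; this mirrors what OpenCV's cv2.convertScaleAbs(img, alpha=a, beta=b) computes:

```python
import numpy as np

def adjust_contrast(gray: np.ndarray, alpha: float = 1.5, beta: float = 0.0) -> np.ndarray:
    """Linear contrast stretch: g(x) = alpha * f(x) + beta.
    alpha in [1.0, 3.0] steepens intensity differences; beta shifts brightness.
    Values are clipped back into the valid 0-255 range."""
    out = alpha * gray.astype(np.float32) + beta
    return np.clip(out, 0, 255).astype(np.uint8)
```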
Inputting data into MinerU after applying the CLAHE technique via a Python script significantly increases the volume of extracted table data. Forcing an AI to read documents that are blurry even to the human eye is a waste of time.
The biggest hurdle when processing large volumes of documents locally is GPU memory. While MinerU version 2.5 is faster, it frequently causes system freezes when processing large PDFs in environments with less than 24GB of VRAM. For stability, you should lower the num_batch parameter from the default 512 to 32 or 64.
1. Set num_batch to 32 and gpu_memory_utilization to 0.7 in the MinerU configuration file.
2. Adjust the system's memory settings in /etc/sysctl.conf.

Reducing the batch size might slow down processing slightly, but it prevents the process from being forcibly terminated mid-task. Stable completion is more important than speed.
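As a sketch, the two values from step 1 land in MinerU's JSON configuration roughly like this. Only num_batch and gpu_memory_utilization come from the text above; the surrounding structure varies between MinerU releases, so merge these keys into your existing config rather than replacing it:

```json
{
  "num_batch": 32,
  "gpu_memory_utilization": 0.7
}
```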
Once data indexing is finished, it's time to generate results. Since RAG-Anything structures the relationships between tables and formulas, you can pose complex queries in Claude Code. Commands like "Compare the Q3 sales table with the current technical specifications" become possible.
To reduce the time spent on recurring weekly reports, use a clear template:
- Wrap the source data in <context> tags and separate the output format with <format> tags.

By following this workflow, analysts can focus solely on reviewing the drafts created by the AI. There is no reason to waste time manually cross-referencing source data.
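One possible shape for such a weekly-report template, offered as an illustrative sketch (the section names inside <format> are examples, not prescribed fields):

```text
Draft the weekly sales report.

<context>
(paste or reference the indexed Q3 sales table and technical specs here)
</context>

<format>
1. Executive summary (3 bullets)
2. Table: region, Q3 revenue, delta vs. Q2
3. Open risks and follow-ups
</format>
```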