The first wall you'll hit when building a RAG system in a local environment is VRAM capacity and the brutal version conflicts between libraries. An 8-bit quantized model consumes about 1GB of VRAM per 1 billion parameters. Considering the overhead of Windows or macOS itself, you should keep at least 20% of your VRAM free. Without this buffer, you'll see disastrous performance where token generation crawls at about 2 tokens per second. Specifically, the lightrag-hku framework often throws runtime errors when paired with the latest numpy 2.x versions.
First, open your terminal and lock the version by typing pip install numpy==1.26.4 --force-reinstall. Next, install nest_asyncio and write nest_asyncio.apply() at the very top of your code. If you skip this, the asynchronous loop in Jupyter Notebook will get tangled, causing the entire process to hang. If your GPU memory is 8GB or less, set embedding_batch_num to 10 or fewer and llm_model_max_async to around 4 when initializing LightRAG. This setup alone can prevent OOM (Out Of Memory) crashes and save you at least two hours of trial and error.
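The setup above can be sketched in one place. This is a template, not exact API: embedding_batch_num and llm_model_max_async are the parameters named here, while working_dir and the rest of the constructor follow the lightrag-hku README and may differ between versions.

```python
# Low-VRAM LightRAG setup sketch, assuming lightrag-hku is installed
# alongside numpy 1.26.4. Parameter names other than embedding_batch_num
# and llm_model_max_async are assumptions based on the library's README.
import nest_asyncio
nest_asyncio.apply()  # must run before any async call in Jupyter

from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",   # where the graph and vectors are persisted
    embedding_batch_num=10,        # 10 or fewer for GPUs with 8GB VRAM or less
    llm_model_max_async=4,         # cap concurrent LLM calls to avoid OOM
)
```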
Simply storing text in chunks cuts off the context between pieces of information. However, if you properly parse Obsidian's Wikilink ([[link]]) structure, you can build a fairly impressive knowledge graph. The key is stripping away unnecessary Markdown symbols before the LLM reads it. Cleaning up these messy symbols alone can save nearly 30% in token consumption.
Get into the habit of adding fields like type and domain to the YAML frontmatter at the top of your notes; it significantly speeds up search filtering. Use the r"\[\[(.+?)\]\]" pattern with Python's re module to extract connections between documents, then convert them into a relationship dataset in JSONL format. File names are also important. Use the core topic of the knowledge as the title rather than dates like '2024-04-14' to ensure indexed nodes function correctly. Data connected this way enables reasoning that jumps across concepts, going beyond simple retrieval.
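The extraction step above fits in a short stdlib-only script. The JSONL field names ("source", "target") are an illustrative choice, not a LightRAG requirement:

```python
# Sketch: extract [[wikilinks]] from an Obsidian vault and emit a JSONL
# relationship dataset. Field names "source"/"target" are illustrative.
import json
import re
from pathlib import Path

WIKILINK = re.compile(r"\[\[(.+?)\]\]")

def extract_relations(vault_dir: str) -> list[dict]:
    relations = []
    for md in Path(vault_dir).rglob("*.md"):
        text = md.read_text(encoding="utf-8")
        for target in WIKILINK.findall(text):
            # Drop alias ("Note|alias") and heading ("Note#section") suffixes
            target = target.split("|")[0].split("#")[0].strip()
            relations.append({"source": md.stem, "target": target})
    return relations

def write_jsonl(relations: list[dict], out_path: str) -> None:
    with open(out_path, "w", encoding="utf-8") as f:
        for rel in relations:
            f.write(json.dumps(rel, ensure_ascii=False) + "\n")
```

Because md.stem is used as the node name, the topic-as-filename advice applies directly: the stem becomes the graph node's identity.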
The biggest waste of time when running a local LLM is re-calculating embeddings you've already computed. Python's default cache disappears once the program is closed. Therefore, you should use SQLite-based DiskCache to create a physical storage layer. Design the system to return the cached response immediately if the cosine similarity between queries exceeds 0.95, without calling the LLM. Applying semantic caching this way can reduce response times to around 100ms.
The method is simple. Install the library with pip install diskcache and create a class that stores each text alongside its embedding vector. It's even better if you mix in a time-weighting algorithm:
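Here is a minimal sketch of that cache. An in-memory dict stands in for the store so the logic is self-contained; in practice you would swap it for a diskcache.Cache directory to survive restarts. The half-life constant and the size of the recency boost are tunable assumptions:

```python
# Semantic cache sketch: return a stored response when query similarity
# exceeds 0.95, with a recency boost so freshly edited notes rank higher.
# The dict stands in for diskcache.Cache; half-life and boost weight are
# illustrative assumptions.
import math
import time

SIM_THRESHOLD = 0.95
HALF_LIFE_S = 3600.0  # recency weight halves every hour

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self):
        self.store = {}  # text -> (vector, response, timestamp)

    def put(self, text, vector, response):
        self.store[text] = (vector, response, time.time())

    def get(self, query_vector):
        best, best_score = None, 0.0
        now = time.time()
        for vector, response, ts in self.store.values():
            sim = cosine(query_vector, vector)
            if sim < SIM_THRESHOLD:
                continue  # not similar enough; fall through to the LLM
            # Exponential decay: recently stored entries win on near-ties
            recency = math.exp(-(now - ts) * math.log(2) / HALF_LIFE_S)
            score = sim + 0.01 * recency
            if score > best_score:
                best, best_score = response, score
        return best  # None means cache miss -> call the LLM
```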
This logic ensures that a note you just edited appears at the top of the search results. By setting the TTL (Time To Live) to 1 hour for the embedding cache and 2 hours for the generated response cache, you can establish an instant response system for repetitive questions.
Architecture diagrams or screenshots inside your notes contain far more information than text. Leaving these out of your search is a loss. By using a CLIP model, you can map images and text into the same space, allowing you to find relevant images by just typing "data flow diagram." If you lack a high-end GPU, you can convert the CLIP-ViT-B-32 model to OpenVINO format and run it on your CPU.
When you find an image path in a Markdown file, bundle it with about 200 characters of surrounding text. Then, use a lightweight local VLM like Phi-3.5-vision to automatically extract image captions. Store these caption vectors and image feature vectors together in a local vector DB like Qdrant. This process allows even complex drawings that are hard to describe with text to be included in search results.
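The first step, pairing each image path with its surrounding text, needs nothing beyond the standard library. The 200-character window and the returned field names are illustrative choices; the captioning and Qdrant upsert would consume this output:

```python
# Sketch: find Markdown image references (both ![alt](path) and Obsidian's
# ![[embed]] form) and bundle each with ~200 chars of surrounding text,
# ready for a VLM captioning pass. Window size and field names are
# illustrative.
import re

IMAGE = re.compile(r"!\[[^\]]*\]\(([^)]+)\)|!\[\[([^\]]+)\]\]")

def image_contexts(markdown: str, window: int = 200) -> list[dict]:
    results = []
    for m in IMAGE.finditer(markdown):
        path = m.group(1) or m.group(2)
        start = max(0, m.start() - window // 2)
        end = min(len(markdown), m.end() + window // 2)
        results.append({"image": path, "context": markdown[start:end]})
    return results
```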
Re-indexing your entire vault every time you fix a single file is foolish. Using Python's watchdog library, you can detect the moment a file is saved and update only the changed parts. However, since the CPU will scream if indexing runs constantly while you are writing, debouncing is essential.
When writing the script, use watchdog.observers to monitor the folder, but have it wait about 5 seconds after a modification event before starting the task. You should also include a process to compare SHA-256 hash values for each file to verify if the content has actually changed. Identify only the files with different hashes, delete their existing nodes, and push in the new vectors. This creates a real-time knowledge base that reflects changes as soon as you save a note—meaning you never have to manually click an update button again.
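The debounce-plus-hash core of that script can be sketched with the standard library alone. In a real setup, watchdog's on_modified handler would call notify(); the DebouncedIndexer class, its 5-second default, and the reindex callback are illustrative names:

```python
# Sketch of the debounce + SHA-256 check described above. watchdog's
# FileSystemEventHandler.on_modified would call notify(); the class and
# callback names are illustrative.
import hashlib
import threading
from pathlib import Path

def sha256_of(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

class DebouncedIndexer:
    def __init__(self, reindex, delay: float = 5.0):
        self.reindex = reindex   # callback: delete old nodes, push new vectors
        self.delay = delay       # seconds to wait after the last save event
        self.hashes = {}         # path -> SHA-256 at last indexing
        self.timers = {}
        self.lock = threading.Lock()

    def notify(self, path: str):
        """Call on every modification event; restarts the debounce timer."""
        with self.lock:
            if path in self.timers:
                self.timers[path].cancel()  # still typing: push deadline back
            t = threading.Timer(self.delay, self._maybe_reindex, args=(path,))
            self.timers[path] = t
            t.start()

    def _maybe_reindex(self, path: str):
        digest = sha256_of(path)
        if self.hashes.get(path) != digest:  # content actually changed
            self.hashes[path] = digest
            self.reindex(path)
```

Repeated notify() calls while you type keep cancelling the timer, so indexing only fires once the file has been quiet for the full delay, and only if its hash differs from the last indexed version.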