Infrastructure Design to Reduce Proxy Costs and Tokens when Deploying Agent-Reach to Production
June 19, 2026
Even an Agent-Reach-based agent that works perfectly on a local CLI hits a wall the moment it is deployed to a production server. When tens of thousands of scrape requests flood in, they get caught by CDN security nets like Cloudflare, causing IPs to be blocked one after another. However, insisting on paid residential proxies alone makes bandwidth costs unsustainable. Pushing raw HTML directly into an LLM can also blow up the context window and result in massive token bills. To solve this, you need a dynamic routing architecture and a preprocessing pipeline.
Operating Agent-Reach without safeguards can run up thousands of dollars in paid residential proxy charges alone. Consider a system that processes 100,000 requests per day with an average page size of 150KB. That is 15GB per day, and if a 50% failure rate adds half as much again in retried transfers, the total is 22.5GB per day, or 675GB over a 30-day month. Premium residential proxies like Bright Data or Oxylabs cost between $4 and $8 per GB, so routing all traffic through them would cost between $2,700 and $5,400 per month.
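The estimate above can be reproduced with a few lines of Python. This is a minimal sketch that assumes each failed request costs one extra full-page transfer and that the month has 30 days. The function name monthly_transfer_gb is illustrative and not part of Agent-Reach.

```python
def monthly_transfer_gb(requests_per_day, avg_page_kb, failure_rate, days=30):
    """Estimate monthly transfer in GB.

    Assumption: every failed request is retried once, so traffic is
    inflated by (1 + failure_rate). Sizes use decimal units (1 GB = 1e6 KB).
    """
    daily_gb = requests_per_day * avg_page_kb / 1_000_000
    return daily_gb * (1 + failure_rate) * days


transfer = monthly_transfer_gb(100_000, 150, 0.5)
low_cost = transfer * 4   # $4 per GB
high_cost = transfer * 8  # $8 per GB
print(f"{transfer:.0f} GB/month, ${low_cost:,.0f} to ${high_cost:,.0f}")
```

Changing failure_rate or the per-GB price shows how quickly the bill moves with the retry rate.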
By deploying Scrapoxy, an open-source proxy aggregator, as a Docker container-based super proxy, you can maintain a single endpoint while controlling infrastructure costs. First, create a docker-compose.yml file in the project root and define the valkey/valkey:7.2-alpine image and the fabienvauchelles/scrapoxy:latest image. Use the --appendonly yes --requirepass [password] command in the Valkey container to ensure the persistence of credential caching. Set the username and password in the Scrapoxy environment variables and start the containers with the docker compose up -d command. This configuration can save over $200 per month in subscription fees for paid services.
The IP rotation cycle needs two tracks, chosen according to the target platform's security policy, so that sessions do not break. For public read-only requests that do not require authentication, set an aggressive TTL between 30 seconds and 3 minutes so that a new IP is assigned to each request in round-robin fashion. This avoids threshold-based blocking. For authentication-based platforms like X or Reddit, where maintaining session cookies is essential, enable Scrapoxy's cookie injection feature to pin the same residential IP node for 5 to 10 minutes. This prevents the authentication session from being invalidated by a sudden change in the IP's geographic location.
To further cut costs, optimize traffic routing at the NGINX API gateway layer. In the upstream blocks of the NGINX configuration file (/etc/nginx/nginx.conf), define two pools: a low-cost datacenter proxy pool such as DataImpulse, which costs around $1 per GB, and the Scrapoxy residential proxy pool. Create a /fetch/generic/ location within the server block to forward low-security public HTML traffic, such as RSS feeds or GitHub searches, to the datacenter proxy. Next, create a /fetch/social/ location that routes only high-friction social endpoint requests to the Scrapoxy backend and injects the required headers. This two-track routing prevents bandwidth waste on expensive residential proxies and reduces overall proxy costs by up to 90%.
Raw HTML data is a mass of redundant DOM elements, stylesheets, and inline scripts. Feeding this directly into an LLM consumes unnecessary tokens and blurs the inference results. Converting the original webpage to Markdown instead of injecting the unrefined source into the context window reduces data size by 75% to 90%. Performing regex-based cleaning and Markdown serialization in the pipeline preprocessing stage can save more than 40% in LLM API token consumption and prevent context window overflow errors.
Implement a Python function that preprocesses input data by combining Trafilatura and html2text as parser components. When calling trafilatura.extract(), keep favor_recall=False (its default), optionally together with favor_precision=True, so that sidebars and advertisements are excluded and only the main text is extracted. For cases where main content extraction fails, create an html2text.HTML2Text() object as a fallback and set ignore_images=True and body_width=0 on it. Finally, run a regex (re.sub) to remove residual tags such as <script> and <style> from the extracted text, along with a cleaning routine that collapses consecutive empty lines. The smaller input reduces the agent's response latency.
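The preprocessing step described above might look like the following. This is a sketch under stated assumptions, not a definitive implementation: the function names clean_text and html_to_llm_text are illustrative, the third-party imports are deferred so the cleaning routine can be used on its own, and favor_precision=True is an optional addition to the favor_recall=False setting.

```python
import re

# Matches <script>...</script> and <style>...</style> blocks left in the text.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)


def clean_text(text: str) -> str:
    """Remove residual script/style blocks and collapse runs of blank lines."""
    text = SCRIPT_STYLE_RE.sub("", text)
    text = re.sub(r"[ \t]+\n", "\n", text)  # strip trailing whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one empty line in a row
    return text.strip()


def html_to_llm_text(html: str) -> str:
    """Extract main content with Trafilatura, falling back to html2text."""
    # Deferred imports: both packages are third-party (pip install trafilatura html2text).
    import html2text
    import trafilatura

    extracted = trafilatura.extract(html, favor_recall=False, favor_precision=True)
    if not extracted:
        # Fallback: convert the whole page to Markdown without images or wrapping.
        converter = html2text.HTML2Text()
        converter.ignore_images = True
        converter.body_width = 0
        extracted = converter.handle(html)
    return clean_text(extracted)
```

Keeping clean_text separate from the extraction call makes the regex stage easy to unit-test without network access or parser dependencies.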
When splitting long documents, rather than simple character-count-based chunking, you should introduce a segmentation algorithm that maintains context based on semantic similarity. Calculate the cosine similarity between embedding vectors of sentences split into segments to capture points where semantic disconnection occurs.
$$\mathrm{Similarity}(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$$
After calculating the distance between adjacent sentences, mark the boundaries where the distance exceeds the 95th percentile of all distances in the document as semantic split points, and use them to generate the chunk list. Applying semantic chunking rather than fixed-length chunking prevents relevant information from being lost by being split across different chunks and improves the accuracy of the LLM's information retrieval.
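A minimal sketch of this percentile-based splitting follows. It assumes sentence embeddings are already available as plain lists of floats (the embedding model is out of scope here), defines distance as 1 minus cosine similarity, and uses a linear-interpolation percentile. The function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def semantic_chunks(sentences, embeddings, pct=95):
    """Group sentences, splitting where adjacent distance exceeds the percentile."""
    if not sentences:
        return []
    if len(sentences) == 1:
        return [list(sentences)]
    distances = [
        1 - cosine_similarity(embeddings[i], embeddings[i + 1])
        for i in range(len(sentences) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for i, distance in enumerate(distances):
        if distance > threshold:  # semantic break between sentence i and i + 1
            chunks.append(current)
            current = []
        current.append(sentences[i + 1])
    chunks.append(current)
    return chunks
```

For example, four sentences whose embeddings are close in pairs, such as [1, 0], [1, 0.1], [0, 1], [0.1, 1], are split once, between the second and third sentence.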
Platforms like X or LinkedIn enforce strict rate limits, so HTTP 429 and 403 errors occur frequently. If a synchronous application process retries immediately after such a temporary failure, server resources are exhausted and the IP blocking only escalates. To keep the system resilient, you need asynchronous exception-handling middleware that identifies the nature of each exception and dynamically adjusts the load applied to the target server.
When designing the error handler class, you must branch accurately between transient errors and permanent errors. If the HTTP status code is 429, 502, 503, or 504, or if the error message contains 'timeout' or 'connection refused', classify it as a transient error and schedule it for retry. Conversely, treat codes like 401 or 400 as permanent errors and immediately isolate them in a Dead Letter Queue (DLQ). For transient errors, to prevent the 'Thundering Herd' problem caused by requests arriving at the same moment, apply an exponential backoff algorithm that includes random jitter in milliseconds. The wait time is calculated as follows:
$$t_n = \min\left(T_{\max},\; T_{\text{base}} \cdot 2^{n}\right) + \text{jitter}$$
Here $n$ is the retry number. With the initial base delay ($T_{\text{base}}$) set to 30 seconds and the maximum cap ($T_{\max}$) limited to 600 seconds, the wait before the 3rd retry is approximately 240 seconds ($30 \cdot 2^{3}$) plus jitter. This spreads retries out over time and keeps the request rate under the target platform's blocking thresholds.
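The classification and backoff rules can be sketched as below. The sketch assumes the delay is min(cap, base × 2^n) plus jitter, which matches the 240-second figure for the 3rd retry. The jitter range of up to 1,000 ms and the decision to treat every unlisted status as permanent are assumptions for illustration.

```python
import random

TRANSIENT_STATUS = {429, 502, 503, 504}
TRANSIENT_MARKERS = ("timeout", "connection refused")


def classify(status, message=""):
    """Return 'transient' (retry) or 'permanent' (send to the DLQ).

    Assumption: anything not explicitly transient, including 400 and 401,
    is treated as permanent.
    """
    if status in TRANSIENT_STATUS:
        return "transient"
    if any(marker in message.lower() for marker in TRANSIENT_MARKERS):
        return "transient"
    return "permanent"


def backoff_delay(attempt, base=30.0, cap=600.0, max_jitter_ms=1000):
    """Seconds to wait before retry number `attempt` (1, 2, 3, ...)."""
    exponential = min(cap, base * 2 ** attempt)
    jitter = random.uniform(0, max_jitter_ms) / 1000.0  # de-synchronizes workers
    return exponential + jitter


for attempt in range(1, 6):
    print(attempt, round(backoff_delay(attempt), 3))
```

The cap takes effect from the 5th retry onward, since 30 × 2^5 = 960 exceeds 600.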
To block cascading failures, where a specific platform's outage or security hardening paralyzes the entire workflow, implement a Redis-based bulkhead pattern in the middleware layer. Instead of a single global queue, create independent Redis lists separated by destination domain (queue:bulkhead:twitter, queue:bulkhead:reddit, queue:bulkhead:general). Assign different maximum concurrency limits to the worker pools consuming each queue, such as 3 for Twitter and 25 for General Web. To manage the retry schedule for failed tasks, write a delayed processing routine that uses a Redis Sorted Set with the time at which the task becomes due as the score. With this bulkhead structure, even if an API blocking incident occurs on a specific social media site, only the corresponding worker pool enters a waiting state, and the agent's overall research task completion rate stays above 95%.
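The queue layout and delayed-retry routine could be sketched as follows. The functions accept any client exposing the redis-py methods rpush, zadd, zrangebyscore, and zrem. The sorted-set key queue:retry:delayed, the domain map, and the Reddit concurrency limit are hypothetical names and values chosen for illustration.

```python
import json
import time
from urllib.parse import urlparse

QUEUES = {
    "twitter": "queue:bulkhead:twitter",
    "reddit": "queue:bulkhead:reddit",
    "general": "queue:bulkhead:general",
}
# Twitter and general limits come from the text; the Reddit value is a placeholder.
MAX_CONCURRENCY = {"twitter": 3, "reddit": 5, "general": 25}
DOMAIN_BULKHEADS = {"twitter.com": "twitter", "x.com": "twitter", "reddit.com": "reddit"}
RETRY_ZSET = "queue:retry:delayed"  # hypothetical key name


def bulkhead_for(url):
    """Map a URL to its bulkhead name, defaulting to 'general'."""
    host = (urlparse(url).hostname or "").lower()
    for domain, name in DOMAIN_BULKHEADS.items():
        if host == domain or host.endswith("." + domain):
            return name
    return "general"


def enqueue(client, url):
    """Push a task onto the list for its destination domain."""
    client.rpush(QUEUES[bulkhead_for(url)], json.dumps({"url": url}))


def schedule_retry(client, url, delay_seconds, now=None):
    """Register a failed task with its due time as the sorted-set score."""
    due = (time.time() if now is None else now) + delay_seconds
    client.zadd(RETRY_ZSET, {json.dumps({"url": url}): due})


def promote_due(client, now=None):
    """Move tasks whose due time has passed back onto their bulkhead queue."""
    now = time.time() if now is None else now
    moved = 0
    for raw in client.zrangebyscore(RETRY_ZSET, 0, now):
        if client.zrem(RETRY_ZSET, raw):  # only the caller that removes it re-queues it
            enqueue(client, json.loads(raw)["url"])
            moved += 1
    return moved
```

With redis-py, client would be something like redis.Redis(host="localhost"); a worker loop would call promote_due periodically and run at most MAX_CONCURRENCY[name] consumers per queue.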
Raw data scraped from assorted web sources contains date discrepancies and factual errors, which makes it easy for the LLM to draw distorted inferences. Before providing context to the generative AI model, add a separate verification layer at the end of the pipeline that scores the reliability of the Markdown content and checks numerical consistency to block hallucinations.
Design a deterministic filter class that calculates the temporal validity of collected metadata and weights by platform source. Documents containing future timestamps relative to system time or invalid ISO format dates are immediately excluded from the dataset. In addition, declare a dictionary that maps confidence weights for each platform source, assigning 0.95 to GitHub, 0.90 to Wikipedia, and lower default scores to social media information such as Reddit (0.50) or Twitter (0.40). The final confidence score is generated through logic that adds a metadata weight bonus of 0.05 only if the author name and title are intact within the document. Information assets with scores below the threshold are excluded at the LLM prompt assembly stage.
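A deterministic filter along these lines is sketched below. The source weights and the 0.05 metadata bonus come from the text. The default weight for unlisted sources, the 0.60 threshold, and the document field names (source, published, author, title) are assumptions for illustration.

```python
from datetime import datetime, timezone

SOURCE_WEIGHTS = {"github": 0.95, "wikipedia": 0.90, "reddit": 0.50, "twitter": 0.40}
DEFAULT_WEIGHT = 0.30   # assumption: weight for sources not listed above
METADATA_BONUS = 0.05
THRESHOLD = 0.60        # assumption: the text does not state the cutoff


def parse_timestamp(value):
    """Parse an ISO 8601 string; return None if it is missing or invalid."""
    if not isinstance(value, str):
        return None
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:  # treat naive timestamps as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def confidence(doc, now=None):
    """Return a score in [0, 1], or None if the timestamp is invalid or in the future."""
    now = now or datetime.now(timezone.utc)
    published = parse_timestamp(doc.get("published"))
    if published is None or published > now:
        return None
    score = SOURCE_WEIGHTS.get((doc.get("source") or "").lower(), DEFAULT_WEIGHT)
    if doc.get("author") and doc.get("title"):
        score += METADATA_BONUS
    return min(score, 1.0)


def filter_documents(docs, threshold=THRESHOLD, now=None):
    """Keep only documents whose confidence meets the threshold."""
    kept = []
    for doc in docs:
        score = confidence(doc, now)
        if score is not None and score >= threshold:
            kept.append(doc)
    return kept
```

Because the filter is pure and takes now as a parameter, the same inputs always yield the same decision, which keeps prompt assembly reproducible.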
To finally guarantee the quality of the output data, run a scoring script that compares the generated answer candidates against the original context.
Use a regular expression (\b\d+(?:\.\d+)?%?\b) to compare the set of numeric tokens in the source dataset with the set of numbers in the generated sentence. If numbers or currency amounts that do not appear in the source are detected, raise a numerical discrepancy flag and request a re-run via the routing middleware.
By integrating these verification layers, you can block numerical manipulation and false citations by crawl-based agents at the architectural level and deliver only verified research results to the end user.
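The numeric consistency check can be sketched with the pattern quoted above. Two caveats are worth noting as assumptions of this sketch: because of the trailing \b, a percent sign followed by whitespace is not captured (so "12.5%" is compared as "12.5"), and thousands separators are stripped first so that "5,400" and "5400" compare equal. The function names are illustrative.

```python
import re

NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?%?\b")  # pattern quoted in the text
THOUSANDS_RE = re.compile(r"(?<=\d),(?=\d{3})")  # assumption: normalize 5,400 -> 5400


def extract_numbers(text):
    """Return the set of numeric tokens found in the text."""
    return set(NUMBER_RE.findall(THOUSANDS_RE.sub("", text)))


def unsupported_numbers(source, answer):
    """Numbers that appear in the answer but nowhere in the source."""
    return extract_numbers(answer) - extract_numbers(source)


def has_numeric_discrepancy(source, answer):
    """True if the answer should be flagged and re-run."""
    return bool(unsupported_numbers(source, answer))


source = "Revenue grew 12.5% to 5,400 units in 2025."
print(has_numeric_discrepancy(source, "Units reached 5400 after 12.5% growth."))
print(sorted(unsupported_numbers(source, "Units reached 6,000 after 15% growth.")))
```

The first call reports no discrepancy, while the second surfaces the two figures that have no support in the source.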