{"id":54,"date":"2025-04-02T16:42:00","date_gmt":"2025-04-02T16:42:00","guid":{"rendered":"https:\/\/balamurali.in\/blog\/?p=54"},"modified":"2026-02-23T14:27:19","modified_gmt":"2026-02-23T14:27:19","slug":"building-crawlycarl-an-ai-powered-web-scraping-api-that-thinks-before-it-scrapes","status":"publish","type":"post","link":"https:\/\/balamurali.in\/blog\/learn-with-me\/building-crawlycarl-an-ai-powered-web-scraping-api-that-thinks-before-it-scrapes\/","title":{"rendered":"Building CrawlyCarl: An AI-Powered Web Scraping API That Thinks Before It Scrapes"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<p>Web scraping has always been a cat-and-mouse game. Websites employ anti-bot measures, require JavaScript rendering, hide data behind pagination, and scatter information across multiple pages. Traditional scrapers require extensive configuration, break easily, and struggle with modern dynamic websites.<\/p>\n\n\n\n<p>What if we could build a scraper that actually <em>understands<\/em> what it&#8217;s looking for and <em>decides<\/em> how to get it?<\/p>\n\n\n\n<p>That&#8217;s exactly what I set out to build with <strong>CrawlyCarl<\/strong> \u2014 an AI-powered web scraping API that uses Large Language Models to intelligently extract data from websites. Instead of writing complex XPath selectors or CSS queries that break when a website changes, CrawlyCarl asks an LLM: &#8220;What data do you see on this page? What&#8217;s the best way to get what the user needs?&#8221;<\/p>\n\n\n\n<p>In this post, I&#8217;ll walk you through the architecture, the technologies involved, and the key decisions that shaped this project.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">The Problem with Traditional Web Scraping<\/h2>\n\n\n\n<p>Before diving into the solution, let&#8217;s understand the problem. 
Traditional web scraping faces several challenges:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>JavaScript-Heavy Websites<\/strong>: Modern SPAs render content dynamically. Simple HTTP requests return empty shells.<\/li>\n\n\n\n<li><strong>Anti-Bot Detection<\/strong>: Websites use CAPTCHAs, rate limiting, and fingerprinting to block automated access.<\/li>\n\n\n\n<li><strong>Scattered Information<\/strong>: The data you need might be spread across multiple pages \u2014 About, Contact, Team, Product pages.<\/li>\n\n\n\n<li><strong>Schema Brittleness<\/strong>: Hard-coded selectors break when websites update their layouts.<\/li>\n\n\n\n<li><strong>Dynamic Navigation<\/strong>: Finding the right page often requires clicking through menus, dropdowns, and pagination.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">The CrawlyCarl Solution: AI-Powered Decision Making<\/h2>\n\n\n\n<p>CrawlyCarl approaches web scraping differently. 
Instead of following rigid rules, it uses an LLM (primarily Google&#8217;s Gemini 2.0 Flash) to:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Analyze<\/strong> the current page content<\/li>\n\n\n\n<li><strong>Decide<\/strong> which tool to use (HTTP request, JavaScript rendering, etc.)<\/li>\n\n\n\n<li><strong>Extract<\/strong> the data matching the user&#8217;s request<\/li>\n\n\n\n<li><strong>Navigate<\/strong> to other pages if needed<\/li>\n\n\n\n<li><strong>Synthesize<\/strong> all gathered data into a comprehensive response<\/li>\n<\/ol>\n\n\n\n<p>The LLM acts as the &#8220;brain&#8221; that orchestrates the entire scraping operation, making intelligent decisions at each step.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Architecture Overview<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">The Big Picture<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502                          Frontend Layer                                  \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510          \u2502\n\u2502  \u2502 Marketing Site  \u2502  \u2502 React Dashboard \u2502  \u2502 Chrome Extension\u2502          \u2502\n\u2502  \u2502   (Static)  
    \u2502  \u2502  (TypeScript)   \u2502  \u2502                 \u2502          \u2502\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518          \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2502\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n            \u2502                    \u2502                    \u2502\n            \u25bc                    \u25bc                    \u25bc\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502                    Cloudflare Worker (API Gateway)                       \u2502\n\u2502  \u2022 API Key Validation        \u2022 Rate Limiting                            \u2502\n\u2502  \u2022 Credit Balance Checks     \u2022 Request Routing                          \u2502\n\u2502  \u2022 CORS Handling            \u2022 Authentication                             
\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                                    \u2502\n                                    \u25bc\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502                    FastAPI Backend (Google Cloud Run)                    \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510   \u2502\n\u2502  \u2502                      Scraper Service                             \u2502   \u2502\n\u2502  \u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510              \u2502   \u2502\n\u2502  \u2502  \u2502   Domain    \u2502  \u2502  Template   \u2502  
\u2502    URL      \u2502              \u2502   \u2502\n\u2502  \u2502  \u2502   Manager   \u2502  \u2502  Processor  \u2502  \u2502  Processor  \u2502              \u2502   \u2502\n\u2502  \u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518              \u2502   \u2502\n\u2502  \u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510           \u2502   \u2502\n\u2502  \u2502  \u2502 Threading Engine    \u2502  \u2502 Comprehensive Response  \u2502           \u2502   \u2502\n\u2502  \u2502  \u2502 (Parallel URLs)     \u2502  \u2502 Generator               \u2502           \u2502   \u2502\n\u2502  \u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518           \u2502   \u2502\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518   \u2502\n\u2502  
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510  \u2502\n\u2502  \u2502                      LLM Services                                 \u2502  \u2502\n\u2502  \u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510                 \u2502  \u2502\n\u2502  \u2502  \u2502 Gemini \u2502 \u2502 OpenAI \u2502 \u2502 DeepInfra\u2502 \u2502OpenRouter\u2502                 \u2502  \u2502\n\u2502  \u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518                 \u2502  \u2502\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2502\n\u2502  
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510  \u2502\n\u2502  \u2502                      Tool Registry                                \u2502  \u2502\n\u2502  \u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510                    \u2502  \u2502\n\u2502  \u2502  \u2502HTTP Client \u2502 \u2502JS Renderer \u2502 \u2502Human Mimic \u2502                    \u2502  \u2502\n\u2502  \u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518                    \u2502  \u2502\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  
\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                                    \u2502\n        \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n        \u25bc                           \u25bc                           \u25bc\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510          \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510          \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502   Supabase    \u2502          \u2502 Redis (Upstash)\u2502          \u2502  ScrapingAnt  \u2502\n\u2502  PostgreSQL   \u2502          \u2502 Queue\/Cache   \u2502          \u2502    Proxy      \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518          \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518          \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Core Components Deep Dive<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">1. 
Cloudflare Worker \u2014 The Smart Gateway<\/h4>\n\n\n\n<p>The first line of defense is a Cloudflare Worker that handles API key validation and rate limiting:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ API Key validation with hash matching\n\/\/ djb2-style string hash (non-cryptographic), returned as 8 hex digits\nfunction hashApiKey(apiKey, salt) {\n  const saltedKey = salt + apiKey;\n  let hash = 5381;\n  for (let i = 0; i &lt; saltedKey.length; i++) {\n    hash = ((hash &lt;&lt; 5) + hash) + saltedKey.charCodeAt(i);\n  }\n  return (hash &gt;&gt;&gt; 0).toString(16).padStart(8, '0');\n}\n\n\/\/ Rate limiting using Cloudflare's native rate limiters\nasync function checkRateLimit(env, accountId) {\n  const burstResult = await env.BURST_RATE_LIMITER.limit({ key: accountId });\n  const minuteResult = await env.MINUTE_RATE_LIMITER.limit({ key: accountId });\n  \/\/ ...\n}<\/code><\/pre>\n\n\n\n<p>The Worker validates API keys against hashed values in Supabase, checks credit balances, enforces rate limits, and routes requests to the FastAPI backend on Google Cloud Run.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">2. 
FastAPI Backend \u2014 The Brain<\/h4>\n\n\n\n<p>The heart of the system is a fully async Python application built with FastAPI:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>class ScraperService:\n    def __init__(self):\n        self.registry = get_registry()          # Tool registry\n        self.llm_service = get_llm_service()    # LLM provider\n        self.domain_manager = DomainManager()   # Domain history tracking\n        self.rate_limiter = DomainRateLimiter() # Per-domain rate limiting<\/code><\/pre>\n\n\n\n<p>The ScraperService orchestrates the entire scraping process, coordinating between specialized processors:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>DomainManager<\/strong>: Tracks which tools work best for each domain<\/li>\n\n\n\n<li><strong>TemplateProcessor<\/strong>: Handles JSON schema-based data extraction<\/li>\n\n\n\n<li><strong>URLProcessor<\/strong>: Processes individual URLs and handles navigation<\/li>\n\n\n\n<li><strong>ThreadingImplementation<\/strong>: Enables parallel URL processing<\/li>\n\n\n\n<li><strong>ComprehensiveResponseGenerator<\/strong>: Synthesizes data from multiple pages<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">3. 
Multi-Provider LLM Architecture<\/h4>\n\n\n\n<p>One of the most interesting design decisions was building a pluggable LLM system:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>class LLMService(ABC):\n    @abstractmethod\n    async def analyze_content(self, url, html_content, target_data, ...):\n        \"\"\"Analyze HTML and determine next action\"\"\"\n\n    @abstractmethod\n    async def extract_data(self, url, html_content, target_data, ...):\n        \"\"\"Extract structured data from HTML\"\"\"<\/code><\/pre>\n\n\n\n<p>The system supports multiple providers through a factory pattern:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Gemini<\/strong>: Primary provider using Google&#8217;s latest models<\/li>\n\n\n\n<li><strong>OpenAI<\/strong>: GPT models for comparison<\/li>\n\n\n\n<li><strong>DeepInfra<\/strong>: Cost-effective Llama models<\/li>\n\n\n\n<li><strong>OpenRouter<\/strong>: Access to models from multiple providers<\/li>\n<\/ul>\n\n\n\n<p>This flexibility allows cost optimization and failover \u2014 if one provider is down, the system can switch to another.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">4. Intelligent Tool Selection<\/h4>\n\n\n\n<p>The Tool Registry manages all available scraping tools:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>TOOL_DEFINITIONS = &#91;\n    ToolDefinition(\n        id=\"http_client\",\n        name=\"HTTP Client\",\n        description=\"Basic HTTP client for fetching web pages\",\n        category=ToolCategory.BASIC,\n        base_cost=1,\n    ),\n    ToolDefinition(\n        id=\"scrape_js_render\",\n        name=\"JavaScript Renderer\",\n        description=\"Renders JavaScript on a page before scraping\",\n        category=ToolCategory.SCRAPING,\n        base_cost=5,\n    ),\n    # ... 
more tools\n]<\/code><\/pre>\n\n\n\n<p>The LLM decides which tool to use based on:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Page content analysis (JavaScript detection, CAPTCHA presence)<\/li>\n\n\n\n<li>Domain history (has this site needed JS rendering before?)<\/li>\n\n\n\n<li>Error responses (403 errors might need human mimicking)<\/li>\n\n\n\n<li>Target data requirements<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Three Operational Modes<\/h2>\n\n\n\n<p>CrawlyCarl offers three distinct modes for different use cases:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. Precision Mode<\/h3>\n\n\n\n<p><strong>Single page, fast extraction<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{\n  \"url\": \"https:\/\/example.com\/pricing\",\n  \"target_data\": \"Extract the pricing tiers and their features\",\n  \"intelligent_search\": false\n}<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Extracts data from exactly one page<\/li>\n\n\n\n<li>No navigation or link following<\/li>\n\n\n\n<li>Fast execution (max 5 tool operations)<\/li>\n\n\n\n<li>Perfect for known URLs with specific data<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2. Smart Navigator Mode<\/h3>\n\n\n\n<p><strong>One layer of intelligent navigation<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{\n  \"url\": \"https:\/\/company.com\",\n  \"target_data\": {\n    \"ceo_name\": \"Name of the CEO\",\n    \"contact_email\": \"Company email address\"\n  },\n  \"intelligent_search\": true,\n  \"deepsearch\": false\n}<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Analyzes all links on the initial page<\/li>\n\n\n\n<li>Selects up to 3 most promising URLs<\/li>\n\n\n\n<li>Processes them in parallel using multi-threading<\/li>\n\n\n\n<li>Great for data that&#8217;s &#8220;one click away&#8221;<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3. 
Deep Dive Mode<\/h3>\n\n\n\n<p><strong>Comprehensive multi-layer crawling<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{\n  \"url\": \"https:\/\/company.com\",\n  \"target_data\": {\n    \"company_name\": \"Official company name\",\n    \"industry\": \"Industry sector\",\n    \"employee_count\": \"Number of employees\",\n    \"leadership_team\": &#91;\"CEO name\", \"CTO name\", \"CFO name\"],\n    \"office_locations\": \"All office locations\"\n  },\n  \"intelligent_search\": true,\n  \"deepsearch\": true\n}<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Navigates up to 5 layers deep<\/li>\n\n\n\n<li>Processes URLs at each layer in parallel<\/li>\n\n\n\n<li>Builds a URL tree to prevent loops<\/li>\n\n\n\n<li>Synthesizes data from all visited pages<\/li>\n\n\n\n<li>Ideal for CRM enrichment when starting with just a domain<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">The Prompt Engineering Challenge<\/h2>\n\n\n\n<p>One of the most critical aspects was designing effective prompts. The LLM needs to:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Understand what data the user wants<\/li>\n\n\n\n<li>Analyze the current page content<\/li>\n\n\n\n<li>Decide if navigation is needed<\/li>\n\n\n\n<li>Select the right tool<\/li>\n\n\n\n<li>Format responses consistently<\/li>\n<\/ol>\n\n\n\n<p>Here&#8217;s a simplified version of the analysis prompt:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>BASE_ANALYZE_PROMPT = \"\"\"\nYou are an AI web scraping assistant. Your task is to analyze the content \nfrom a web page and determine:\n\n1. If the requested data can be extracted from the current content\n2. If not, which tool should be used to retrieve the data\n3. Whether navigation to another page is required\n4. 
Whether human mimicking behavior should be enabled\n\nURL: {url}\nTarget Data: {formatted_target_data}\n\nPreviously Visited URLs (DO NOT suggest these):\n{visited_urls_json}\n\nDecision Guidelines:\n1. CAREFULLY analyze the content for target data\n2. If data is present, extract it directly\n3. If JavaScript is detected with HIGH confidence, use js_renderer\n4. For navigation, suggest only URLs likely to contain target data\n\"\"\"<\/code><\/pre>\n\n\n\n<p>The prompts are modular \u2014 different operational modes add specific instructions about navigation aggressiveness and data prioritization.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Credit System and Billing<\/h2>\n\n\n\n<p>CrawlyCarl uses a credit-based billing system with tool-level cost tracking:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>TOOL_COSTS = {\n    'http_client': 1,\n    'scrape_via_api': 2,\n    'scrape_js_render': 5,\n    'llm_call_basic': 5,\n    'llm_call_advanced': 10,\n}<\/code><\/pre>\n\n\n\n<p>Every tool operation is tracked atomically:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>-- Credit transactions with full audit trail\nCREATE TABLE credit_transactions (\n    id UUID PRIMARY KEY,\n    account_id UUID REFERENCES accounts(id),\n    amount INTEGER NOT NULL,\n    transaction_type TEXT NOT NULL,\n    description TEXT,\n    job_id UUID,\n    created_at TIMESTAMPTZ DEFAULT now()\n);\n\n-- Individual tool usage tracking\nCREATE TABLE tool_usage (\n    id UUID PRIMARY KEY,\n    account_id UUID REFERENCES accounts(id),\n    tool_id TEXT NOT NULL,\n    credits_consumed INTEGER NOT NULL,\n    execution_time_ms INTEGER,\n    created_at TIMESTAMPTZ DEFAULT now()\n);<\/code><\/pre>\n\n\n\n<p>An aggregation service batches database writes for efficiency:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>class UsageAggregatorService:\n    def __init__(self, flush_interval: int = 60, batch_size: int = 100):\n        
self.pending_records = &#91;]\n\n    async def flush_pending_records(self):\n        if self.pending_records:\n            await self.bulk_insert(self.pending_records)\n            self.pending_records.clear()<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Multi-Threading for Performance<\/h2>\n\n\n\n<p>For Deep Dive mode, processing URLs sequentially would be painfully slow. The threading implementation processes URLs at the same depth level concurrently, as asyncio tasks:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import asyncio\n\nclass ThreadingImplementation:\n    async def process_layer_parallel(self, urls, target_data, depth):\n        tasks = &#91;]\n        for url in urls:\n            task = asyncio.create_task(\n                self.process_single_url(url, target_data, depth)\n            )\n            tasks.append(task)\n\n        # return_exceptions=True: a failed URL yields an exception object\n        # in results instead of cancelling the whole layer\n        results = await asyncio.gather(*tasks, return_exceptions=True)\n        return self.aggregate_results(results)<\/code><\/pre>\n\n\n\n<p>Domain-aware rate limiting prevents overwhelming individual servers:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>class DomainRateLimiter:\n    def __init__(self, requests_per_minute: int = 10):\n        self.requests_per_minute = requests_per_minute\n        self.domain_timestamps = {}\n\n    async def wait_if_needed(self, domain: str):\n        # Calculate delay based on domain-specific request history\n        # Ensures no more than N requests per minute per domain\n        ...  # body omitted in this excerpt<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Technology Stack Summary<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Backend<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Python 3.12<\/strong> with fully async architecture<\/li>\n\n\n\n<li><strong>FastAPI<\/strong> for the API framework<\/li>\n\n\n\n<li><strong>SQLAlchemy<\/strong> (async) for database ORM<\/li>\n\n\n\n<li><strong>Pydantic<\/strong> for 
data validation<\/li>\n\n\n\n<li><strong>httpx<\/strong> for async HTTP client<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Frontend<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>React<\/strong> with TypeScript for the dashboard<\/li>\n\n\n\n<li><strong>Tailwind CSS<\/strong> for styling<\/li>\n\n\n\n<li><strong>Static HTML\/CSS\/JS<\/strong> for the marketing site<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Infrastructure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Google Cloud Run<\/strong> for containerized backend deployment<\/li>\n\n\n\n<li><strong>Cloudflare Workers<\/strong> for edge computing and API gateway<\/li>\n\n\n\n<li><strong>Supabase<\/strong> for PostgreSQL database and authentication<\/li>\n\n\n\n<li><strong>Upstash Redis<\/strong> for queuing and caching<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">External Services<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>ScrapingAnt<\/strong> for proxy services and JavaScript rendering<\/li>\n\n\n\n<li><strong>Stripe\/Razorpay<\/strong> for payment processing<\/li>\n\n\n\n<li><strong>Multiple LLM Providers<\/strong> (Gemini, OpenAI, DeepInfra, OpenRouter)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">DevOps<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>GitHub Actions<\/strong> for CI\/CD pipelines<\/li>\n\n\n\n<li><strong>Docker<\/strong> for containerization<\/li>\n\n\n\n<li><strong>pytest<\/strong> for testing<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Lessons Learned<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">1. LLM Reliability Requires Multiple Fallbacks<\/h3>\n\n\n\n<p>LLMs don&#8217;t always return perfectly formatted JSON. 
I implemented multiple parsing strategies, tried in order:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import json\nimport re\n\ndef parse_llm_response(response_text):\n    # Try direct JSON parsing\n    try:\n        return json.loads(response_text)\n    except json.JSONDecodeError:\n        pass\n\n    # Try extracting from markdown code blocks\n    json_matches = re.findall(r'```(?:json)?\\s*(&#91;\\s\\S]*?)\\s*```', response_text)\n    for match in json_matches:\n        try:\n            return json.loads(match)\n        except json.JSONDecodeError:\n            continue\n\n    # Fall back to advanced JSON repair (repair_json is not shown here)\n    return repair_json(response_text)<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">2. Domain Memory Saves Time<\/h3>\n\n\n\n<p>Tracking which tools work for each domain avoids repeating fetch attempts that are already known to fail:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>domain_history = await self.domain_manager.get_domain_tool_history(url)\nif domain_history.get('js_needed'):\n    # Skip HTTP attempt, go straight to JS renderer\n    initial_tool = 'js_renderer'<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">3. Structured Data Templates Beat Free-Form Extraction<\/h3>\n\n\n\n<p>Allowing users to define JSON schemas for their target data produces much more reliable results:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{\n  \"target_data\": {\n    \"company_name\": \"string\",\n    \"employee_count\": \"integer\",\n    \"leadership\": {\n      \"ceo\": \"string\",\n      \"cto\": \"string\"\n    }\n  }\n}<\/code><\/pre>\n\n\n\n<p>The LLM receives this schema and returns data in the same structure, making integration straightforward.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4. 
Rate Limiting at Multiple Layers<\/h3>\n\n\n\n<p>I implemented rate limiting at three levels:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cloudflare Worker<\/strong>: Per-account API rate limits<\/li>\n\n\n\n<li><strong>Backend Service<\/strong>: Per-domain request limits<\/li>\n\n\n\n<li><strong>Tool Level<\/strong>: Spacing between requests to the same site<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">What&#8217;s Next<\/h2>\n\n\n\n<p>CrawlyCarl is currently in the MVP phase, with core functionality working. Future plans include:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Competitive Intelligence Monitor<\/strong>: A specialized tool for tracking competitor websites<\/li>\n\n\n\n<li><strong>HubSpot Integration<\/strong>: Direct sync of enriched data to the CRM<\/li>\n\n\n\n<li><strong>Webhook Notifications<\/strong>: Real-time alerts when async jobs complete<\/li>\n\n\n\n<li><strong>Custom LLM Fine-tuning<\/strong>: Training models specifically for scraping tasks<\/li>\n\n\n\n<li><strong>More Proxy Regions<\/strong>: Expanding from 13 to 50+ countries<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Building CrawlyCarl has been an incredible journey through modern web architecture. The combination of LLM intelligence with robust engineering practices creates a scraper that actually <em>adapts<\/em> to websites rather than breaking when they change.<\/p>\n\n\n\n<p>The key insight is that LLMs aren&#8217;t just good at generating text \u2014 they&#8217;re excellent at making decisions based on context. 
By giving an LLM the right tools and information, it can navigate the web almost as intelligently as a human would.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Web scraping has always been a cat-and-mouse game. Websites employ anti-bot measures, require JavaScript rendering, hide data behind pagination, and scatter information across multiple pages. Traditional scrapers require extensive&#8230;<\/p>\n","protected":false},"author":1,"featured_media":155,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[8],"tags":[],"class_list":["post-54","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-learn-with-me"],"jetpack_featured_media_url":"https:\/\/balamurali.in\/blog\/wp-content\/uploads\/2026\/02\/crawlycarl_featured.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts\/54","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/comments?post=54"}],"version-history":[{"count":1,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts\/54\/revisions"}],"predecessor-version":[{"id":56,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts\/54\/revisions\/56"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/media\/155"}],"wp:attachment":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/media?parent=54"}],"wp:term":[{"taxonomy":"catego
ry","embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/categories?post=54"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/tags?post=54"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}