Building CrawlyCarl: An AI-Powered Web Scraping API That Thinks Before It Scrapes

Introduction

Web scraping has always been a cat-and-mouse game. Websites employ anti-bot measures, require JavaScript rendering, hide data behind pagination, and scatter information across multiple pages. Traditional scrapers require extensive configuration, break easily, and struggle with modern dynamic websites.

What if we could build a scraper that actually understands what it’s looking for and decides how to get it?

That’s exactly what I set out to build with CrawlyCarl — an AI-powered web scraping API that uses Large Language Models to intelligently extract data from websites. Instead of writing complex XPath selectors or CSS queries that break when a website changes, CrawlyCarl asks an LLM: “What data do you see on this page? What’s the best way to get what the user needs?”

In this post, I’ll walk you through the architecture, the technologies involved, and the key decisions that shaped this project.


The Problem with Traditional Web Scraping

Before diving into the solution, let’s understand the problem. Traditional web scraping faces several challenges:

  1. JavaScript-Heavy Websites: Modern SPAs render content dynamically. Simple HTTP requests return empty shells.
  2. Anti-Bot Detection: Websites use CAPTCHAs, rate limiting, and fingerprinting to block automated access.
  3. Scattered Information: The data you need might be spread across multiple pages — About, Contact, Team, Product pages.
  4. Schema Brittleness: Hard-coded selectors break when websites update their layouts.
  5. Dynamic Navigation: Finding the right page often requires clicking through menus, dropdowns, and pagination.

The CrawlyCarl Solution: AI-Powered Decision Making

CrawlyCarl approaches web scraping differently. Instead of following rigid rules, it uses an LLM (primarily Google’s Gemini 2.0 Flash) to:

  1. Analyze the current page content
  2. Decide which tool to use (HTTP request, JavaScript rendering, etc.)
  3. Extract the data matching the user’s request
  4. Navigate to other pages if needed
  5. Synthesize all gathered data into a comprehensive response

The LLM acts as the “brain” that orchestrates the entire scraping operation, making intelligent decisions at each step.
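The five-step loop above can be sketched in a few lines of Python. This is a toy model: the in-memory page store and the `analyze` stub stand in for real HTTP fetching and LLM calls, and none of these names come from CrawlyCarl's actual codebase.

```python
# Toy sketch of the analyze -> decide -> extract -> navigate loop.
# PAGES and analyze() are stand-ins for real fetching and LLM calls.
PAGES = {
    "https://example.com": {"links": ["https://example.com/about"], "data": {}},
    "https://example.com/about": {"links": [], "data": {"ceo_name": "Jane Doe"}},
}

def analyze(page, target_data):
    """Stand-in for the LLM: extract if target data is present, else navigate."""
    if any(key in page["data"] for key in target_data):
        return {"action": "extract"}
    if page["links"]:
        return {"action": "navigate", "next_url": page["links"][0]}
    return {"action": "stop"}

def scrape(url, target_data, max_steps=5):
    gathered, visited = {}, set()
    for _ in range(max_steps):
        if url in visited:
            break
        visited.add(url)
        page = PAGES[url]                      # step 1: fetch page content
        decision = analyze(page, target_data)  # steps 2-3: LLM decides
        if decision["action"] == "extract":
            gathered.update(page["data"])      # extract matching data
            break
        if decision["action"] == "navigate":   # step 4: follow a link
            url = decision["next_url"]
            continue
        break
    return gathered                            # step 5: synthesized result
```

Running `scrape("https://example.com", {"ceo_name": "Name of the CEO"})` follows the link to the About page and comes back with the CEO's name.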


Architecture Overview

The Big Picture

┌─────────────────────────────────────────────────────────────────────────┐
│                          Frontend Layer                                  │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐          │
│  │ Marketing Site  │  │ React Dashboard │  │ Chrome Extension│          │
│  │   (Static)      │  │  (TypeScript)   │  │                 │          │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘          │
└───────────│────────────────────│────────────────────│───────────────────┘
            │                    │                    │
            ▼                    ▼                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    Cloudflare Worker (API Gateway)                       │
│  • API Key Validation        • Rate Limiting                            │
│  • Credit Balance Checks     • Request Routing                          │
│  • CORS Handling             • Authentication                            │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    FastAPI Backend (Google Cloud Run)                    │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                      Scraper Service                             │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │   │
│  │  │   Domain    │  │  Template   │  │    URL      │              │   │
│  │  │   Manager   │  │  Processor  │  │  Processor  │              │   │
│  │  └─────────────┘  └─────────────┘  └─────────────┘              │   │
│  │  ┌─────────────────────┐  ┌─────────────────────────┐           │   │
│  │  │ Threading Engine    │  │ Comprehensive Response  │           │   │
│  │  │ (Parallel URLs)     │  │ Generator               │           │   │
│  │  └─────────────────────┘  └─────────────────────────┘           │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                      LLM Services                                 │  │
│  │  ┌────────┐ ┌────────┐ ┌──────────┐ ┌──────────┐                 │  │
│  │  │ Gemini │ │ OpenAI │ │ DeepInfra│ │OpenRouter│                 │  │
│  │  └────────┘ └────────┘ └──────────┘ └──────────┘                 │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                      Tool Registry                                │  │
│  │  ┌────────────┐ ┌────────────┐ ┌────────────┐                    │  │
│  │  │HTTP Client │ │JS Renderer │ │Human Mimic │                    │  │
│  │  └────────────┘ └────────────┘ └────────────┘                    │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │
        ┌───────────────────────────┼───────────────────────────┐
        ▼                           ▼                           ▼
┌───────────────┐          ┌───────────────┐          ┌───────────────┐
│   Supabase    │          │Redis (Upstash)│          │  ScrapingAnt  │
│  PostgreSQL   │          │ Queue/Cache   │          │    Proxy      │
└───────────────┘          └───────────────┘          └───────────────┘

Core Components Deep Dive

1. Cloudflare Worker — The Smart Gateway

The first line of defense is a Cloudflare Worker that handles:

// API Key validation with hash matching
function hashApiKey(apiKey, salt) {
  const saltedKey = salt + apiKey;
  let hash = 5381;
  for (let i = 0; i < saltedKey.length; i++) {
    hash = ((hash << 5) + hash) + saltedKey.charCodeAt(i);
  }
  return (hash >>> 0).toString(16).padStart(8, '0');
}

// Rate limiting using Cloudflare's native rate limiters
async function checkRateLimit(env, accountId) {
  const burstResult = await env.BURST_RATE_LIMITER.limit({ key: accountId });
  const minuteResult = await env.MINUTE_RATE_LIMITER.limit({ key: accountId });
  // ...
}

The Worker validates API keys against hashed values in Supabase, checks credit balances, enforces rate limits, and routes requests to the FastAPI backend on Google Cloud Run.

2. FastAPI Backend — The Brain

The heart of the system is a fully async Python application built with FastAPI:

class ScraperService:
    def __init__(self):
        self.registry = get_registry()          # Tool registry
        self.llm_service = get_llm_service()    # LLM provider
        self.domain_manager = DomainManager()   # Domain history tracking
        self.rate_limiter = DomainRateLimiter() # Per-domain rate limiting

The ScraperService orchestrates the entire scraping process, coordinating between specialized processors:

  • DomainManager: Tracks which tools work best for each domain
  • TemplateProcessor: Handles JSON schema-based data extraction
  • URLProcessor: Processes individual URLs and handles navigation
  • ThreadingImplementation: Enables parallel URL processing
  • ComprehensiveResponseGenerator: Synthesizes data from multiple pages
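The ComprehensiveResponseGenerator's job can be sketched as a merge over per-URL results. The merge policy shown here, keep the first non-empty value per field, is an illustration of the idea rather than the production logic.

```python
# Illustrative merge of partial extractions from several pages into one
# response; the "first non-empty value wins" policy is an assumption.
def merge_results(per_url_results):
    """Combine partial extractions, preferring the first non-empty value."""
    merged = {}
    for result in per_url_results:
        for key, value in result.items():
            if key not in merged or not merged[key]:
                merged[key] = value
    return merged
```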

3. Multi-Provider LLM Architecture

One of the most interesting design decisions was building a pluggable LLM system:

class LLMService(ABC):
    @abstractmethod
    async def analyze_content(self, url, html_content, target_data, ...):
        """Analyze HTML and determine next action"""

    @abstractmethod
    async def extract_data(self, url, html_content, target_data, ...):
        """Extract structured data from HTML"""

The system supports multiple providers through a factory pattern:

  • Gemini: Primary provider using Google’s latest models
  • OpenAI: GPT models for comparison
  • DeepInfra: Cost-effective Llama models
  • OpenRouter: Access to models from multiple providers

This flexibility allows cost optimization and failover — if one provider is down, the system can switch to another.
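The factory pattern behind this is straightforward. The sketch below uses illustrative class names and a simple fallback rule; the real module layout and failover logic may differ.

```python
# Minimal provider-factory sketch; class names and the fallback rule
# are illustrative, not CrawlyCarl's actual module layout.
class GeminiService:
    name = "gemini"

class OpenAIService:
    name = "openai"

_PROVIDERS = {"gemini": GeminiService, "openai": OpenAIService}

def get_llm_service(provider: str = "gemini", fallback: str = "openai"):
    """Return the requested provider, falling back if it is unknown."""
    cls = _PROVIDERS.get(provider) or _PROVIDERS[fallback]
    return cls()
```

Because callers only ever see the abstract `LLMService` interface, swapping providers is a one-line configuration change.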

4. Intelligent Tool Selection

The Tool Registry manages all available scraping tools:

TOOL_DEFINITIONS = [
    ToolDefinition(
        id="http_client",
        name="HTTP Client",
        description="Basic HTTP client for fetching web pages",
        category=ToolCategory.BASIC,
        base_cost=1,
    ),
    ToolDefinition(
        id="scrape_js_render",
        name="JavaScript Renderer",
        description="Renders JavaScript on a page before scraping",
        category=ToolCategory.SCRAPING,
        base_cost=5,
    ),
    # ... more tools
]

The LLM decides which tool to use based on:

  • Page content analysis (JavaScript detection, CAPTCHA presence)
  • Domain history (has this site needed JS rendering before?)
  • Error responses (403 errors might need human mimicking)
  • Target data requirements
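In CrawlyCarl the LLM makes this call, but the signals above can be mirrored as a plain heuristic. The rules below are only a sketch of how those signals might combine; they are not the production decision logic.

```python
# Heuristic sketch mirroring the tool-selection signals listed above.
# The real decision is made by the LLM; these rules are illustrative.
def pick_tool(page_text: str, status_code: int, domain_history: dict) -> str:
    if status_code == 403:
        return "human_mimic"          # blocked: try human-like behavior
    if domain_history.get("js_needed"):
        return "scrape_js_render"     # domain known to need JS rendering
    if "<noscript>" in page_text or "enable JavaScript" in page_text:
        return "scrape_js_render"     # page hints it renders client-side
    return "http_client"              # cheapest tool first
```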

Three Operational Modes

CrawlyCarl offers three distinct modes for different use cases:

1. Precision Mode

Single page, fast extraction

{
  "url": "https://example.com/pricing",
  "target_data": "Extract the pricing tiers and their features",
  "intelligent_search": false
}
  • Extracts data from exactly one page
  • No navigation or link following
  • Fast execution (max 5 tool operations)
  • Perfect for known URLs with specific data
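A Precision Mode call is just a POST with the JSON body above plus an API key. The endpoint URL and header name in this sketch are assumptions for illustration, not documented values.

```python
# Hedged sketch of a Precision Mode request; the endpoint URL and the
# x-api-key header name are hypothetical, not documented values.
def build_precision_request(api_key: str, url: str, target_data: str) -> dict:
    return {
        "method": "POST",
        "url": "https://api.crawlycarl.example/v1/scrape",  # hypothetical
        "headers": {"x-api-key": api_key},                  # hypothetical
        "json": {
            "url": url,
            "target_data": target_data,
            "intelligent_search": False,   # Precision Mode: no navigation
        },
    }
```

You would hand this dict to any HTTP client (e.g. `httpx.request(**req)`) to fire the request.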

2. Smart Navigator Mode

One layer of intelligent navigation

{
  "url": "https://company.com",
  "target_data": {
    "ceo_name": "Name of the CEO",
    "contact_email": "Company email address"
  },
  "intelligent_search": true,
  "deepsearch": false
}
  • Analyzes all links on the initial page
  • Selects up to 3 most promising URLs
  • Processes them in parallel using multi-threading
  • Great for data that’s “one click away”

3. Deep Dive Mode

Comprehensive multi-layer crawling

{
  "url": "https://company.com",
  "target_data": {
    "company_name": "Official company name",
    "industry": "Industry sector",
    "employee_count": "Number of employees",
    "leadership_team": ["CEO name", "CTO name", "CFO name"],
    "office_locations": "All office locations"
  },
  "intelligent_search": true,
  "deepsearch": true
}
  • Navigates up to 5 layers deep
  • Processes URLs at each layer in parallel
  • Builds a URL tree to prevent loops
  • Synthesizes data from all visited pages
  • Ideal for CRM enrichment when starting with just a domain
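The layer-by-layer crawl with loop prevention is essentially a breadth-first traversal over a URL tree. The sketch below stubs out link discovery with a dictionary; note how the visited set stops the `b -> a` back-link from causing a loop.

```python
# Sketch of layered crawling with a visited set; link discovery is
# stubbed with a dict instead of real page fetches.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],   # back-link that would loop without the visited set
    "c": [],
    "d": [],
}

def crawl_layers(start: str, max_depth: int = 5):
    visited, frontier, order = {start}, [start], []
    for _ in range(max_depth):
        order.append(list(frontier))
        next_frontier = []
        for url in frontier:            # each layer runs in parallel in practice
            for link in LINKS.get(url, []):
                if link not in visited:  # URL tree: never revisit a page
                    visited.add(link)
                    next_frontier.append(link)
        if not next_frontier:
            break
        frontier = next_frontier
    return order
```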

The Prompt Engineering Challenge

One of the most critical aspects was designing effective prompts. The LLM needs to:

  1. Understand what data the user wants
  2. Analyze the current page content
  3. Decide if navigation is needed
  4. Select the right tool
  5. Format responses consistently

Here’s a simplified version of the analysis prompt:

BASE_ANALYZE_PROMPT = """
You are an AI web scraping assistant. Your task is to analyze the content 
from a web page and determine:

1. If the requested data can be extracted from the current content
2. If not, which tool should be used to retrieve the data
3. Whether navigation to another page is required
4. Whether human mimicking behavior should be enabled

URL: {url}
Target Data: {formatted_target_data}

Previously Visited URLs (DO NOT suggest these):
{visited_urls_json}

Decision Guidelines:
1. CAREFULLY analyze the content for target data
2. If data is present, extract it directly
3. If JavaScript is detected with HIGH confidence, use js_renderer
4. For navigation, suggest only URLs likely to contain target data
"""

The prompts are modular — different operational modes add specific instructions about navigation aggressiveness and data prioritization.
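Assembling those modular prompts amounts to concatenating a base template with a mode-specific suffix. The suffix wording below is illustrative, not the production text.

```python
# Sketch of modular prompt assembly; the suffix wording is illustrative.
import json

BASE = "URL: {url}\nTarget Data: {target}\nVisited: {visited}\n"
MODE_SUFFIX = {
    "precision": "Do not suggest navigation; extract from this page only.",
    "deepdive": "Aggressively suggest navigation to promising URLs.",
}

def build_prompt(mode: str, url: str, target: str, visited: set) -> str:
    return BASE.format(url=url, target=target,
                       visited=json.dumps(sorted(visited))) + MODE_SUFFIX[mode]
```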


Credit System and Billing

CrawlyCarl uses a credit-based billing system with tool-level cost tracking:

TOOL_COSTS = {
    'http_client': 1,
    'scrape_via_api': 2,
    'scrape_js_render': 5,
    'llm_call_basic': 5,
    'llm_call_advanced': 10,
}

Every tool operation is tracked atomically:

-- Credit transactions with full audit trail
CREATE TABLE credit_transactions (
    id UUID PRIMARY KEY,
    account_id UUID REFERENCES accounts(id),
    amount INTEGER NOT NULL,
    transaction_type TEXT NOT NULL,
    description TEXT,
    job_id UUID,
    created_at TIMESTAMPTZ DEFAULT now()
);

-- Individual tool usage tracking
CREATE TABLE tool_usage (
    id UUID PRIMARY KEY,
    account_id UUID REFERENCES accounts(id),
    tool_id TEXT NOT NULL,
    credits_consumed INTEGER NOT NULL,
    execution_time_ms INTEGER,
    created_at TIMESTAMPTZ DEFAULT now()
);

An aggregation service batches database writes for efficiency:

class UsageAggregatorService:
    def __init__(self, flush_interval: int = 60, batch_size: int = 100):
        self.flush_interval = flush_interval  # seconds between periodic flushes
        self.batch_size = batch_size          # flush early past this many records
        self.pending_records = []

    async def flush_pending_records(self):
        if self.pending_records:
            await self.bulk_insert(self.pending_records)
            self.pending_records.clear()

Multi-Threading for Performance

For Deep Dive mode, processing URLs sequentially would be painfully slow. The threading implementation processes URLs at the same depth level in parallel:

class ThreadingImplementation:
    async def process_layer_parallel(self, urls, target_data, depth):
        tasks = []
        for url in urls:
            task = asyncio.create_task(
                self.process_single_url(url, target_data, depth)
            )
            tasks.append(task)

        results = await asyncio.gather(*tasks, return_exceptions=True)
        return self.aggregate_results(results)

Domain-aware rate limiting prevents overwhelming individual servers:

import asyncio
import time

class DomainRateLimiter:
    def __init__(self, requests_per_minute: int = 10):
        self.requests_per_minute = requests_per_minute
        self.domain_timestamps = {}  # domain -> recent request timestamps

    async def wait_if_needed(self, domain: str):
        # Sliding window: sleep until the domain drops under its per-minute cap
        now = time.monotonic()
        window = [t for t in self.domain_timestamps.get(domain, []) if now - t < 60]
        if len(window) >= self.requests_per_minute:
            await asyncio.sleep(60 - (now - window[0]))
        window.append(time.monotonic())
        self.domain_timestamps[domain] = window

Technology Stack Summary

Backend

  • Python 3.12 with fully async architecture
  • FastAPI for the API framework
  • SQLAlchemy (async) for database ORM
  • Pydantic for data validation
  • httpx for async HTTP client

Frontend

  • React with TypeScript for the dashboard
  • Tailwind CSS for styling
  • Static HTML/CSS/JS for the marketing site

Infrastructure

  • Google Cloud Run for containerized backend deployment
  • Cloudflare Workers for edge computing and API gateway
  • Supabase for PostgreSQL database and authentication
  • Upstash Redis for queuing and caching

External Services

  • ScrapingAnt for proxy services and JavaScript rendering
  • Stripe/Razorpay for payment processing
  • Multiple LLM Providers (Gemini, OpenAI, DeepInfra, OpenRouter)

DevOps

  • GitHub Actions for CI/CD pipelines
  • Docker for containerization
  • pytest for testing

Lessons Learned

1. LLM Reliability Requires Multiple Fallbacks

LLMs don’t always return perfectly formatted JSON. I implemented multiple parsing strategies:

import json
import re

from json_repair import repair_json  # third-party fallback for malformed JSON

def parse_llm_response(response_text):
    # Try direct JSON parsing
    try:
        return json.loads(response_text)
    except json.JSONDecodeError:
        pass

    # Try extracting from markdown code blocks
    json_matches = re.findall(r'```(?:json)?\s*([\s\S]*?)\s*```', response_text)
    for match in json_matches:
        try:
            return json.loads(match)
        except json.JSONDecodeError:
            continue

    # Last resort: attempt automated JSON repair
    return repair_json(response_text)

2. Domain Memory Saves Time

Tracking which tools work for each domain dramatically improves performance:

domain_history = await self.domain_manager.get_domain_tool_history(url)
if domain_history.get('js_needed'):
    # Skip HTTP attempt, go straight to JS renderer
    initial_tool = 'js_renderer'

3. Structured Data Templates Beat Free-Form Extraction

Allowing users to define JSON schemas for their target data produces much more reliable results:

{
  "target_data": {
    "company_name": "string",
    "employee_count": "integer",
    "leadership": {
      "ceo": "string",
      "cto": "string"
    }
  }
}

The LLM receives this schema and returns data in the same structure, making integration straightforward.
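One way to verify the LLM kept to the schema is a recursive shape check. The sketch below only checks key structure, not value types, and is an illustration rather than CrawlyCarl's actual validation.

```python
# Sketch of checking an LLM response against the user's template shape.
# Only key structure is verified; value types are not enforced here.
def matches_template(template, data):
    if isinstance(template, dict):
        return (isinstance(data, dict)
                and all(k in data and matches_template(v, data[k])
                        for k, v in template.items()))
    return True  # leaf: any value accepted in this sketch
```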

4. Rate Limiting at Multiple Layers

I implemented rate limiting at three levels:

  • Cloudflare Worker: Per-account API rate limits
  • Backend Service: Per-domain request limits
  • Tool Level: Spacing between requests to the same site

What’s Next

CrawlyCarl is currently in MVP phase with core functionality working. Future plans include:

  1. Competitive Intelligence Monitor: A specialized tool for tracking competitor websites
  2. HubSpot Integration: Direct sync of enriched data to CRM
  3. Webhook Notifications: Real-time alerts when async jobs complete
  4. Custom LLM Fine-tuning: Training models specifically for scraping tasks
  5. More Proxy Regions: Expanding from 13 to 50+ countries

Conclusion

Building CrawlyCarl has been an incredible journey through modern web architecture. The combination of LLM intelligence with robust engineering practices creates a scraper that actually adapts to websites rather than breaking when they change.

The key insight is that LLMs aren’t just good at generating text — they’re excellent at making decisions based on context. By giving an LLM the right tools and information, it can navigate the web almost as intelligently as a human would.

