Building an AI-Powered Competitive Intelligence Platform: A Deep Technical Dive

How I built a fully functional competitor monitoring system that automatically crawls websites, detects changes, and provides AI-powered business insights


Introduction

In today’s fast-moving business landscape, keeping track of what your competitors are doing is crucial. When they change their pricing, launch new features, or pivot their strategy, you need to know—ideally before your customers do.

I built Competition Monitoring System—a comprehensive platform that automatically tracks competitor websites, detects meaningful changes, and uses AI to explain what those changes mean in a business context. While the market got saturated before I could launch it commercially, this project represents a fully-functional, production-ready system that demonstrates sophisticated engineering across multiple domains.

In this deep-dive, I’ll walk you through the architecture, technical challenges, and solutions that make this system work.


The Problem: Why Manual Competitor Tracking Fails

Every product manager, growth marketer, and competitive analyst knows the pain:

  1. Manual checking doesn’t scale – You can’t manually check 10+ competitor websites daily
  2. Changes slip through – A competitor’s pricing page changes on Friday night, and you don’t notice until Monday’s customer call
  3. No historical context – Even if you catch a change, you often don’t know what it was before
  4. Signal vs. noise – Most website changes are irrelevant (footer updates, minor copy tweaks). The important ones get lost

The Solution: An Intelligent Monitoring Engine

I built a system that:

  • Automatically crawls competitor websites on a configurable schedule
  • Detects ALL changes using dual-hashing (raw HTML + extracted text)
  • Filters out noise through intelligent feed detection and change validation
  • Provides AI-powered analysis explaining what each change means for your business
  • Stores everything efficiently for historical comparison and dashboard visualization

System Architecture Overview

The system is built with a clear separation of concerns:

┌──────────────────────────────────────────────────────────────────────┐
│                          Competition Monitoring System                │
├──────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐    │
│  │   Frontend  │────▶│   REST API  │────▶│  Backend Job Engine │    │
│  │  (React.js) │     │  (FastAPI)  │     │     (Python)        │    │
│  └─────────────┘     └─────────────┘     └──────────┬──────────┘    │
│                                                      │               │
│                                          ┌───────────┴───────────┐   │
│                                          │                       │   │
│                                    ┌─────▼─────┐          ┌──────▼──┐│
│                                    │  Supabase │          │   S3    ││
│                                    │ (Metadata)│          │(Content)││
│                                    └───────────┘          └─────────┘│
│                                                                       │
└──────────────────────────────────────────────────────────────────────┘

Technology Stack

Layer          | Technology                  | Purpose
---------------|-----------------------------|--------------------------------------
Web Crawling   | Playwright + BeautifulSoup  | JavaScript-enabled headless browsing
AI Analysis    | Google Gemini API           | Page summarization, change analysis
URL Discovery  | Firecrawl API               | Comprehensive site mapping
Database       | Supabase (PostgreSQL)       | Metadata, sessions, change records
Object Storage | S3-compatible (iDrive e2)   | HTML snapshots, screenshots
API            | FastAPI                     | RESTful endpoints
Frontend       | React/Next.js               | Dashboard visualization

Deep Dive: The Crawling Engine

The heart of the system is the OptimizedWebCrawler—a sophisticated Python class that handles everything from URL normalization to JavaScript rendering.

Concurrent Architecture

Traditional web crawlers are slow because they process one URL at a time. My crawler uses an isolated worker architecture where multiple browser instances work in parallel:

class OptimizedWebCrawler:
    def __init__(self, base_url: str, max_pages: int = 100, 
                 concurrent_limit: int = 5):
        self.concurrent_limit = concurrent_limit
        self.url_queue = RobustURLQueue()
        # Each worker gets its own browser instance
        # Complete isolation prevents context conflicts

Key insight: Sharing browser contexts between workers causes “Target page, context or browser has been closed” errors. By giving each worker its own browser instance, I achieved a 96.7% success rate (up from 45% with shared contexts).

Robust URL Queue with Retry Logic

URLs fail for many reasons—network timeouts, rate limiting, temporary server errors. The RobustURLQueue class handles this gracefully:

@dataclass
class URLTask:
    url: str
    retry_count: int = 0
    max_retries: int = 3
    last_error: Optional[str] = None

class RobustURLQueue:
    def __init__(self):
        self.main_queue = asyncio.Queue()
        self.retry_queue = asyncio.Queue()  # Priority for retries
        self.failed_urls = []

    async def get(self, timeout: float = 2.0) -> Optional[URLTask]:
        # Prioritize retry queue over main queue
        if not self.retry_queue.empty():
            return await self.retry_queue.get()
        try:
            return await asyncio.wait_for(self.main_queue.get(), timeout=timeout)
        except asyncio.TimeoutError:
            return None  # Queue drained; worker can exit

Enhanced JavaScript Rendering

Modern websites are JavaScript-heavy. A naive crawler that only fetches static HTML can miss most of the content. My solution implements multi-stage loading detection:

async def _wait_for_javascript_content(self, page, url: str):
    # Step 1: Quick check if already loaded
    ready_state = await page.evaluate('() => document.readyState')
    if ready_state == 'complete':
        await page.wait_for_timeout(500)
        return

    # Step 2: Wait for meaningful content
    await page.wait_for_function('''() => {
        return document.body &&
               document.body.innerText &&
               document.body.innerText.length > 100;
    }''', timeout=5000)

    # Step 3: Special SPA handling (React/Vue/Angular)
    if any(fw in url.lower() for fw in ['app.', 'console.', 'dashboard.']):
        await page.wait_for_function('''() => {
            if (window.React || window.Vue || window.ng) {
                return document.readyState === 'complete';
            }
            return true;
        }''', timeout=3000)

Result: 80-90% faster JavaScript content waiting while maintaining reliability.

Smart Navigation Strategy

A critical discovery: using wait_until="networkidle" causes hangs on sites with continuous background requests (analytics, tracking pixels). The fix was simple but crucial:

# Before (problematic - would timeout on modern sites)
response = await page.goto(url, wait_until="networkidle", timeout=180000)

# After (reliable - works with background activity)
response = await page.goto(url, wait_until="domcontentloaded", timeout=60000)

This single change reduced crawl times from 180+ seconds (timeout) to 8-10 seconds for sites like factors.ai.


Intelligent Feed Detection

One of the most sophisticated features is the Feed Detection & Optimization System. Feed pages (blogs, news, resources) contain hundreds of child URLs that follow similar patterns. Crawling them all is wasteful.

The Problem

Without feed detection:

  1. Crawl /blog page
  2. Discover 200 blog post URLs
  3. Crawl all 200 posts (expensive!)
  4. Next crawl: same 200 posts + 1 new one
  5. Re-crawl everything again

The Solution: AI-Powered Feed Discovery

On first crawl, the system uses Firecrawl API to discover ALL URLs on a domain, then Gemini AI to identify feed patterns:

# Firecrawl discovers comprehensive URL structure
firecrawl_result = await firecrawl_service.discover_urls(
    domain='competitor.com',
    include_subdomains=True
)

# AI analyzes URLs for feed patterns
ai_analysis = await llm_service.analyze_feed_patterns(
    urls=firecrawl_result['urls'],
    paths=firecrawl_result['paths'],
    domain='competitor.com'
)

# Save domain-specific patterns to Supabase
domain_config_service.save_domain_config(
    domain='competitor.com',
    feed_paths=ai_analysis['feed_paths'],  # e.g., ['/blog', '/insights', '/news']
    ai_analysis=ai_analysis
)

Multi-Session Feed Processing

The magic happens across crawl sessions:

First Crawl (Discovery):

  • Detect /blog as a feed page
  • Discover 50 child URLs
  • Store URLs WITHOUT processing (no AI, no screenshots)
  • Establish baseline for comparison

Second Crawl (Detection):

  • Re-crawl /blog
  • Find 52 URLs (2 new posts)
  • Only process the 2 new URLs with AI
  • Include existing URLs in session data (prevents false “removed” alerts)

Result: 60-80% reduction in redundant processing.
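The delta step above boils down to a set difference against the stored baseline. A minimal sketch (function and variable names here are illustrative, not the system's actual API):

```python
def find_new_feed_urls(previous_urls: list[str], current_urls: list[str]) -> list[str]:
    """Return only URLs not seen in the stored baseline, preserving crawl order."""
    seen = set(previous_urls)
    return [u for u in current_urls if u not in seen]

baseline = ["https://x.com/blog/post-1", "https://x.com/blog/post-2"]
latest = baseline + ["https://x.com/blog/post-3"]
print(find_new_feed_urls(baseline, latest))  # ['https://x.com/blog/post-3']
```

Only the returned URLs get AI summaries and screenshots; the rest are carried forward in the session data to avoid false "removed" alerts.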

Feed Pattern Detection

The system recognizes these URL patterns as feeds:

FEED_PATTERNS = [
    r'/news/?$', r'/blog/?$', r'/articles/?$',
    r'/insights/?$', r'/resources/?$', r'/whitepapers/?$',
    r'/awards/?$', r'/events/?$', r'/webinars/?$',
    r'/docs/?$', r'/help/?$', r'/careers/?$'
]

Combined with AI-discovered domain-specific patterns stored in Supabase, this catches even unusual feed structures.
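As a quick illustration of how these patterns behave (`is_feed_path` is a hypothetical helper, shown only to make the matching concrete):

```python
import re

FEED_PATTERNS = [
    r'/news/?$', r'/blog/?$', r'/articles/?$',
    r'/insights/?$', r'/resources/?$', r'/whitepapers/?$',
]

def is_feed_path(path: str) -> bool:
    # The $ anchor means only the feed index matches, never its child pages
    return any(re.search(p, path) for p in FEED_PATTERNS)

print(is_feed_path('/blog'))         # True
print(is_feed_path('/blog/'))        # True
print(is_feed_path('/blog/post-1'))  # False -- a child page, not the feed
```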


Dual Hashing: Catching Every Change

Change detection uses a two-tier hashing approach:

Tier 1: Raw HTML Hash

def calculate_html_hash(self, html: str) -> str:
    return hashlib.md5(html.encode('utf-8')).hexdigest()

Captures ALL changes: script updates, CSS modifications, A/B tests, tracking pixels.

Tier 2: Text Content Hash

def calculate_content_hash(self, html: str) -> str:
    # Extract text; strip scripts/styles/metadata first
    soup = BeautifulSoup(html, 'lxml')
    for tag in soup(["script", "style", "meta", "link"]):
        tag.decompose()
    text = soup.get_text()
    return hashlib.md5(text.encode('utf-8')).hexdigest()

Focuses on meaningful content changes for AI analysis.

Why Both?

  • HTML hash changes but text hash doesn’t → Technical change (A/B test, script update)
  • Both hashes change → Content change worth analyzing with AI
  • Backward compatibility → Old data without HTML hash still works via text hash fallback
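The decision table above fits in a few lines. A sketch of the interpretation step only (the hashes come from the two functions shown earlier; the helper name is hypothetical):

```python
def classify_change(old_html_hash: str, new_html_hash: str,
                    old_text_hash: str, new_text_hash: str) -> str:
    if old_html_hash == new_html_hash:
        return 'no_change'
    if old_text_hash == new_text_hash:
        return 'technical_change'  # markup/scripts changed, visible text did not
    return 'content_change'        # meaningful change, worth AI analysis

print(classify_change('a', 'b', 't', 't'))  # technical_change
print(classify_change('a', 'b', 't', 'u'))  # content_change
```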

Precise Word-Level Change Analysis

When content changes, we need to know exactly what changed. The naive approach (line-by-line diff) produces terrible results:

Bad Output (line-based):

REMOVED: "Home Products Pricing About Blog Contact"
ADDED: "Home Products Pricing About Blog Contact Login"

Good Output (word-level):

ADDED: "Login"

My implementation uses word tokenization with sequence matching:

def _analyze_text_changes(self, old_text: str, new_text: str) -> Dict:
    # Tokenize to words
    old_words = self._tokenize_text(old_text)
    new_words = self._tokenize_text(new_text)

    # Find precise word-level differences
    matcher = SequenceMatcher(None, old_words, new_words)
    opcodes = list(matcher.get_opcodes())

    added_segments = []
    removed_segments = []

    for tag, i1, i2, j1, j2 in opcodes:
        if tag == 'insert':
            added_segments.append(' '.join(new_words[j1:j2]))
        elif tag == 'delete':
            removed_segments.append(' '.join(old_words[i1:i2]))

    return {
        'added': added_segments,
        'removed': removed_segments,
        'change_ratio': 1 - matcher.ratio()
    }
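A self-contained run of the same idea, substituting plain `str.split()` for the crawler's `_tokenize_text` helper, reproduces the navigation example from above:

```python
from difflib import SequenceMatcher

old_words = "Home Products Pricing About Blog Contact".split()
new_words = "Home Products Pricing About Blog Contact Login".split()

matcher = SequenceMatcher(None, old_words, new_words)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == 'insert':
        print("ADDED:", ' '.join(new_words[j1:j2]))
    elif tag == 'delete':
        print("REMOVED:", ' '.join(old_words[i1:i2]))
# Prints: ADDED: Login
```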

Change Validation: Eliminating False Positives

Not all detected changes are real. Dynamic content, A/B tests, and JavaScript timing issues create false positives. The ChangeValidator service handles this:

class ChangeValidator:
    async def validate_changes(self, changes: List[Change]) -> List[Change]:
        validated = []

        for change in changes:
            # Re-scrape twice with delay
            scrape1 = await self._scrape_url_once(change.url, 1)
            await asyncio.sleep(2)
            scrape2 = await self._scrape_url_once(change.url, 2)

            # Compare consistency
            if scrape1['content_hash'] == scrape2['content_hash']:
                validated.append(change)  # Consistent = real change
            else:
                logger.info(f"Invalidated {change.url}: inconsistent content")

        return validated

Only changes that consistently reproduce proceed to AI analysis.


AI-Powered Business Context

Detecting changes is only half the battle. The real value is understanding what they mean. The ChangeAnalyzer service uses Google’s Gemini API to provide business context.

Page Summarization During Crawling

Every page is summarized in real-time:

async def generate_page_summary(self, text_content: str, url: str) -> PageSummary:
    prompt = f"""Analyze the following webpage content from URL: {url}

    Provide:
    1. A concise 2-3 sentence summary
    2. The page type (pricing, product, blog, etc.)
    3. 5-10 relevant keywords
    4. Key entities mentioned (products, people, companies)
    """

    response = await self._call_with_backoff(prompt)
    return self._parse_summary_response(response)

Change Analysis with Severity Scoring

When changes are detected, AI evaluates their business significance:

@dataclass
class ChangeAnalysis:
    url: str
    change_type: str        # pricing_update, feature_addition, etc.
    severity: int           # 1-10 scale
    change_definition: str  # 3-line business analysis
    recommended_pages: List[str]  # Related pages for context

Severity Guidelines:

  • 1-3: Minor (typo fixes, date updates)
  • 4-6: Moderate (new blog post, team change)
  • 7-9: Significant (pricing change, new feature)
  • 10: Critical (acquisition, major pivot)

Exponential Backoff for API Reliability

Gemini API has rate limits. The service implements robust retry logic:

async def _call_with_backoff(self, prompt: str) -> str:
    for attempt in range(self.max_retries):  # 10 retries
        try:
            response = await asyncio.to_thread(
                self.model.generate_content, prompt
            )
            return response.text
        except Exception as e:
            if 'rate' in str(e).lower() or '429' in str(e):
                delay = min(self.base_delay * (2 ** attempt), 60)
                jitter = random.uniform(0, delay * 0.1)
                await asyncio.sleep(delay + jitter)
            else:
                raise
    raise RuntimeError("Gemini API call failed after all retries")

Data Architecture: Hybrid Storage Strategy

The system uses a hybrid storage approach optimized for different access patterns:

Supabase (PostgreSQL) – Fast Queries

Normalized tables for dashboard queries:

-- Sessions table
CREATE TABLE competition_analysis_sessions (
    id UUID PRIMARY KEY,
    domain TEXT NOT NULL,
    session_id TEXT NOT NULL,
    comparison_type TEXT,
    total_changes INTEGER,
    created_at TIMESTAMPTZ
);

-- Changes table with indexes
CREATE TABLE competition_analysis_changes (
    id UUID PRIMARY KEY,
    session_analysis_id UUID REFERENCES competition_analysis_sessions(id),
    url TEXT NOT NULL,
    change_type TEXT NOT NULL,
    severity INTEGER CHECK (severity >= 1 AND severity <= 10),
    page_type TEXT,
    change_definition TEXT,
    recommended_pages JSONB,
    created_at TIMESTAMPTZ
);

-- Indexes for fast filtering
CREATE INDEX idx_severity ON competition_analysis_changes(severity);
CREATE INDEX idx_change_type ON competition_analysis_changes(change_type);

S3 (iDrive e2) – Complete Backups

Object storage for full content:

competitor.com/
├── sessions/
│   ├── 20250115_103000.json.gz   # Complete session data
│   └── 20250114_103000.json.gz
├── master/
│   └── state.json.gz             # Master state tracking
└── screenshots/
    └── 20250115_103000/
        ├── homepage.png
        └── pricing.png

Master State: The Source of Truth

The master state tracks all pages across sessions:

{
  "domain": "competitor.com",
  "pages": {
    "https://competitor.com/pricing": {
      "html_hash": "x9y8z7...",
      "content_hash": "a1b2c3...",
      "title": "Pricing Plans",
      "ai_summary": "...",
      "last_session": "20250115_103000"
    }
  },
  "feed_state": {
    "feeds": {
      "https://competitor.com/blog": {
        "feed_type": "blog",
        "discovered_urls": ["...", "..."],
        "last_url_count": 50
      }
    }
  },
  "sessions": [
    {"id": "20250115_103000", "pages_count": 45},
    {"id": "20250114_103000", "pages_count": 44}
  ]
}
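To show how this structure drives cross-session detection, here is a sketch of comparing a fresh crawl against the stored `pages` map (the `diff_against_master` helper and the trimmed dict shapes are illustrative):

```python
def diff_against_master(master_pages: dict, crawled: dict) -> dict:
    changed, new = [], []
    for url, page in crawled.items():
        old = master_pages.get(url)
        if old is None:
            new.append(url)
        elif old['content_hash'] != page['content_hash']:
            changed.append(url)
    removed = [u for u in master_pages if u not in crawled]
    return {'changed': changed, 'new': new, 'removed': removed}

master = {'https://c.com/pricing': {'content_hash': 'a1'}}
fresh = {'https://c.com/pricing': {'content_hash': 'b2'},
         'https://c.com/features': {'content_hash': 'c3'}}
print(diff_against_master(master, fresh))
# {'changed': ['https://c.com/pricing'], 'new': ['https://c.com/features'], 'removed': []}
```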

URL Normalization: Preventing Duplicate Crawls

A subtle but critical feature: URLs that differ only in cosmetic ways must map to a single canonical form:

https://factors.ai/pricing
https://www.factors.ai/pricing      # www subdomain
https://factors.ai/pricing/         # trailing slash
https://FACTORS.AI/pricing          # case difference
https://factors.ai/pricing?b=2&a=1  # query param order (sorted, not stripped)

The normalization function handles all these cases:

def normalize_url(self, url: str) -> str:
    parsed = urlparse(url)

    # 1. Lowercase hostname and remove www subdomain
    netloc = parsed.netloc.lower()
    if netloc.startswith('www.'):
        netloc = netloc[4:]

    # 2. Remove trailing slash (keep bare "/" for the root)
    path = parsed.path.rstrip('/') or '/'

    # 3. Sort query parameters into a canonical order
    query = ''
    if parsed.query:
        params = sorted(parse_qsl(parsed.query))
        query = urlencode(params)

    # 4. Drop fragments by passing '' below
    #    (index-file normalization, e.g. /index.html, is handled elsewhere)
    return urlunparse((parsed.scheme, netloc, path, '', query, ''))
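A standalone version (without the class wrapper) makes it easy to verify that the cosmetic variants from above collapse to one key:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    parsed = urlparse(url)
    netloc = parsed.netloc.lower()
    if netloc.startswith('www.'):
        netloc = netloc[4:]
    path = parsed.path.rstrip('/') or '/'
    query = urlencode(sorted(parse_qsl(parsed.query))) if parsed.query else ''
    return urlunparse((parsed.scheme, netloc, path, '', query, ''))

variants = [
    "https://factors.ai/pricing",
    "https://www.factors.ai/pricing",
    "https://factors.ai/pricing/",
    "https://FACTORS.AI/pricing",
]
print({normalize_url(u) for u in variants})  # {'https://factors.ai/pricing'}
```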

Pagination URL Filtering

An interesting edge case: pagination URLs were causing exponential crawling:

/blog → discovers /blog?page=1, /blog?page=2, ...
/blog?page=1 → treated as new feed → discovers more pagination URLs
/blog?page=2 → same problem
... (exponential explosion)

Solution: Filter pagination URLs before they enter the queue:

pagination_patterns = [
    'page=', '_page=', 'p=', 'offset=', 'start=',
    'pagenum=', 'pagenumber=', 'pageindex=', 'paged='
]

def should_crawl(self, url: str) -> bool:
    parsed_url = urlparse(url)
    if parsed_url.query:
        query_lower = parsed_url.query.lower()
        if any(pattern in query_lower for pattern in pagination_patterns):
            return False  # Skip pagination URLs
    return True

Key Technical Decisions & Trade-offs

1. Playwright vs. Puppeteer vs. Selenium

Choice: Playwright

Why:

  • Best JavaScript rendering support
  • Async API for concurrent crawling
  • Cross-browser testing if needed
  • Active development and great docs

2. Gemini vs. GPT-4 vs. Claude

Choice: Google Gemini

Why:

  • Cost-effective for high-volume summarization
  • Fast response times
  • Sufficient quality for page analysis
  • Easy to switch via LLMFactory abstraction

3. Supabase vs. Self-hosted PostgreSQL

Choice: Supabase

Why:

  • Managed PostgreSQL with row-level security
  • Real-time subscriptions for dashboard
  • Built-in auth for multi-tenant future
  • Edge functions for serverless compute

4. S3 vs. Database for HTML Storage

Choice: Hybrid (both)

Why:

  • S3: Cheap storage for large HTML/screenshots
  • Supabase: Fast queries for dashboard data
  • Master state in S3: Single source of truth for crawl history

Performance Characteristics

Metric                  | Value
------------------------|----------------------------------
Concurrent Workers      | 5 (configurable to 10)
URLs per Minute         | ~50-100 (depending on site speed)
JavaScript Wait Time    | 0.5s (fast) to 5s (SPAs)
Success Rate            | 96.7%
AI Summaries per Minute | ~20 (with rate limiting)
Storage per Crawl       | ~2-10 MB compressed

Lessons Learned

1. Browser Context Isolation is Critical

Sharing browser contexts between concurrent workers causes race conditions and cryptic errors. Each worker needs its own isolated browser instance.

2. NetworkIdle is a Lie

Modern websites never reach “network idle” due to analytics, tracking, and WebSocket connections. Use domcontentloaded instead.

3. Change Detection Needs Two Hashes

Raw HTML hash catches technical changes. Text content hash catches meaningful changes. You need both for complete coverage.

4. Feed Detection is Essential for Scale

Without intelligent feed handling, you’ll waste 80% of your crawl budget on blog posts and news articles that haven’t changed.

5. Validation Eliminates False Positives

A/B tests, personalization, and dynamic content create fake “changes”. Re-scraping twice with a delay catches these.


What I Would Do Differently

  1. Start with headless Chrome service: Instead of managing Playwright browsers in-process, use a dedicated browser pool service (Browserless, etc.)
  2. Event-driven architecture from day one: Use message queues (Redis, SQS) between crawl and analysis phases for better scaling
  3. More aggressive caching: Cache AI summaries longer—they rarely need regeneration if content hasn’t changed
  4. Visual diffing earlier: Screenshot comparison could catch changes that text analysis misses

The Market Reality

I built this system over several months, iterating through dozens of technical challenges. By the time it was production-ready, several well-funded competitors had entered the space:

  • Klue
  • Crayon
  • Kompyte
  • Similarweb

The market went from “interesting opportunity” to “saturated” faster than expected. But the technical work remains valuable—both as a portfolio piece and as a foundation for future projects.


Conclusion

Building a competitive intelligence platform touches nearly every aspect of modern software engineering:

  • Distributed systems: Concurrent crawling, queue management
  • Web scraping: JavaScript rendering, anti-bot evasion
  • AI/ML: LLM integration, prompt engineering
  • Data engineering: Hybrid storage, change detection
  • Database design: Normalized schemas, efficient indexing
  • API design: RESTful endpoints, real-time updates

While the market timing didn’t work out for commercial launch, this project demonstrates that complex, production-grade systems can be built by a small team (or solo developer) using modern tools and cloud services.

The code is functional, the architecture is sound, and the technical challenges were genuinely interesting to solve. Sometimes that’s the real value of a project—not the business outcome, but what you learn along the way.


If you’re interested in the technical details or want to discuss competitive intelligence systems, feel free to reach out. The lessons learned here apply to many other domains: price monitoring, content aggregation, research automation, and more.


Technical Stack Summary

Component      | Technology
---------------|------------------------------------
Language       | Python 3.12
Web Crawling   | Playwright, BeautifulSoup, lxml
AI             | Google Gemini API, LangChain
URL Discovery  | Firecrawl API
Database       | Supabase (PostgreSQL)
Object Storage | S3-compatible (iDrive e2)
API Framework  | FastAPI, Uvicorn
Job Scheduling | Celery, Redis, APScheduler
Testing        | pytest, pytest-asyncio

This project represents approximately 15,000+ lines of production-quality Python code, comprehensive documentation, and battle-tested solutions to real-world web scraping challenges.
