How I built a fully functional competitor monitoring system that automatically crawls websites, detects changes, and provides AI-powered business insights
Introduction
In today’s fast-moving business landscape, keeping track of what your competitors are doing is crucial. When they change their pricing, launch new features, or pivot their strategy, you need to know—ideally before your customers do.
I built Competition Monitoring System—a comprehensive platform that automatically tracks competitor websites, detects meaningful changes, and uses AI to explain what those changes mean in a business context. While the market became saturated before I could launch it commercially, this project represents a fully functional, production-ready system that demonstrates sophisticated engineering across multiple domains.
In this deep-dive, I’ll walk you through the architecture, technical challenges, and solutions that make this system work.
The Problem: Why Manual Competitor Tracking Fails
Every product manager, growth marketer, and competitive analyst knows the pain:
- Manual checking doesn’t scale – You can’t manually check 10+ competitor websites daily
- Changes slip through – A competitor’s pricing page changes on Friday night, and you don’t notice until Monday’s customer call
- No historical context – Even if you catch a change, you often don’t know what it was before
- Signal vs. noise – Most website changes are irrelevant (footer updates, minor copy tweaks). The important ones get lost
The Solution: An Intelligent Monitoring Engine
I built a system that:
- Automatically crawls competitor websites on a configurable schedule
- Detects ALL changes using dual-hashing (raw HTML + extracted text)
- Filters out noise through intelligent feed detection and change validation
- Provides AI-powered analysis explaining what each change means for your business
- Stores everything efficiently for historical comparison and dashboard visualization
System Architecture Overview
The system is built with a clear separation of concerns:
┌──────────────────────────────────────────────────────────────────────┐
│ Competition Monitoring System │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Frontend │────▶│ REST API │────▶│ Backend Job Engine │ │
│ │ (React.js) │ │ (FastAPI) │ │ (Python) │ │
│ └─────────────┘ └─────────────┘ └──────────┬──────────┘ │
│ │ │
│ ┌───────────┴───────────┐ │
│ │ │ │
│ ┌─────▼─────┐ ┌──────▼──┐│
│ │ Supabase │ │ S3 ││
│ │ (Metadata)│ │(Content)││
│ └───────────┘ └─────────┘│
│ │
└──────────────────────────────────────────────────────────────────────┘
Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Web Crawling | Playwright + BeautifulSoup | JavaScript-enabled headless browsing |
| AI Analysis | Google Gemini API | Page summarization, change analysis |
| URL Discovery | Firecrawl API | Comprehensive site mapping |
| Database | Supabase (PostgreSQL) | Metadata, sessions, change records |
| Object Storage | S3-compatible (iDrive e2) | HTML snapshots, screenshots |
| API | FastAPI | RESTful endpoints |
| Frontend | React/Next.js | Dashboard visualization |
Deep Dive: The Crawling Engine
The heart of the system is the OptimizedWebCrawler—a sophisticated Python class that handles everything from URL normalization to JavaScript rendering.
Concurrent Architecture
Traditional web crawlers are slow because they process one URL at a time. My crawler uses an isolated worker architecture where multiple browser instances work in parallel:
class OptimizedWebCrawler:
    def __init__(self, base_url: str, max_pages: int = 100,
                 concurrent_limit: int = 5):
        self.base_url = base_url
        self.max_pages = max_pages
        self.concurrent_limit = concurrent_limit
        self.url_queue = RobustURLQueue()
        # Each worker gets its own browser instance;
        # complete isolation prevents context conflicts
Key insight: Sharing browser contexts between workers causes “Target page, context or browser has been closed” errors. By giving each worker its own browser instance, I achieved a 96.7% success rate (up from 45% with shared contexts).
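The isolation pattern itself is independent of Playwright. Here is a minimal runnable sketch of the worker-pool idea, with a plain dict standing in for each worker's private browser instance (the real crawler would launch a Playwright browser at that point; the names here are illustrative):

```python
import asyncio

async def run_worker(worker_id: int, queue: asyncio.Queue, results: list) -> None:
    # Stand-in for a per-worker Playwright browser: each worker owns
    # its own instance, so no state is ever shared between workers.
    browser = {"id": worker_id}
    while True:
        try:
            url = queue.get_nowait()
        except asyncio.QueueEmpty:
            break  # Queue drained: this worker is done
        # Simulated fetch; the real code would use browser pages here
        results.append((browser["id"], url))

async def crawl(urls: list, concurrent_limit: int = 5) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: list = []
    # Launch N isolated workers that drain the shared URL queue
    await asyncio.gather(*(run_worker(i, queue, results)
                           for i in range(concurrent_limit)))
    return results

pages = asyncio.run(crawl([f"https://example.com/p{i}" for i in range(12)]))
print(len(pages))  # 12 (every URL handled exactly once)
```

Only the URL queue is shared; everything stateful lives inside a single worker, which is what eliminates the cross-context races.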
Robust URL Queue with Retry Logic
URLs fail for many reasons—network timeouts, rate limiting, temporary server errors. The RobustURLQueue class handles this gracefully:
import asyncio
from dataclasses import dataclass
from typing import Optional

@dataclass
class URLTask:
    url: str
    retry_count: int = 0
    max_retries: int = 3
    last_error: Optional[str] = None

class RobustURLQueue:
    def __init__(self):
        self.main_queue = asyncio.Queue()
        self.retry_queue = asyncio.Queue()  # Priority for retries
        self.failed_urls = []

    async def get(self, timeout: float = 2.0) -> Optional[URLTask]:
        # Prioritize retry queue over main queue
        if not self.retry_queue.empty():
            return await self.retry_queue.get()
        try:
            return await asyncio.wait_for(self.main_queue.get(), timeout)
        except asyncio.TimeoutError:
            return None  # Signals an idle queue so workers can exit
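One piece the snippet leaves implicit is how a failed URL gets back onto the retry queue. A runnable sketch restating the queue with a hypothetical `mark_failed` method (the method name and the demo loop are my illustration, not the original API):

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

@dataclass
class URLTask:
    url: str
    retry_count: int = 0
    max_retries: int = 3
    last_error: Optional[str] = None

class RobustURLQueue:
    def __init__(self):
        self.main_queue = asyncio.Queue()
        self.retry_queue = asyncio.Queue()
        self.failed_urls = []

    async def mark_failed(self, task: URLTask, error: str) -> None:
        # Requeue with priority until retries are exhausted,
        # then record the URL as permanently failed.
        task.retry_count += 1
        task.last_error = error
        if task.retry_count <= task.max_retries:
            await self.retry_queue.put(task)
        else:
            self.failed_urls.append(task)

async def demo() -> RobustURLQueue:
    q = RobustURLQueue()
    task = URLTask(url="https://example.com/flaky")
    for _ in range(4):  # One failure more than max_retries allows
        await q.mark_failed(task, "timeout")
        if not q.retry_queue.empty():
            task = await q.retry_queue.get()  # Worker picks the retry up
    return q

q = asyncio.run(demo())
print(len(q.failed_urls), q.failed_urls[0].retry_count)  # 1 4
```

After three prioritized retries the fourth failure moves the task to `failed_urls`, so a permanently broken URL can never starve the queue.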
Enhanced JavaScript Rendering
Modern websites are JavaScript-heavy. A naive crawler that just fetches HTML will miss 80% of the content. My solution implements multi-stage loading detection:
async def _wait_for_javascript_content(self, page, url: str):
    # Step 1: Quick check if already loaded
    ready_state = await page.evaluate('() => document.readyState')
    if ready_state == 'complete':
        await page.wait_for_timeout(500)
        return

    # Step 2: Wait for meaningful content
    await page.wait_for_function('''() => {
        return document.body &&
               document.body.innerText &&
               document.body.innerText.length > 100;
    }''', timeout=5000)

    # Step 3: Special SPA handling (React/Vue/Angular)
    if any(fw in url.lower() for fw in ['app.', 'console.', 'dashboard.']):
        await page.wait_for_function('''() => {
            if (window.React || window.Vue || window.ng) {
                return document.readyState === 'complete';
            }
            return true;
        }''', timeout=3000)
Result: 80-90% faster JavaScript content detection while maintaining reliability.
Smart Navigation Strategy
A critical discovery: using wait_until="networkidle" causes hangs on sites with continuous background requests (analytics, tracking pixels). The fix was simple but crucial:
# Before (problematic - would timeout on modern sites)
response = await page.goto(url, wait_until="networkidle", timeout=180000)
# After (reliable - works with background activity)
response = await page.goto(url, wait_until="domcontentloaded", timeout=60000)
This single change reduced crawl times from 180+ seconds (timeout) to 8-10 seconds for sites like factors.ai.
Intelligent Feed Detection
One of the most sophisticated features is the Feed Detection & Optimization System. Feed pages (blogs, news, resources) contain hundreds of child URLs that follow similar patterns. Crawling them all is wasteful.
The Problem
Without feed detection:
- Crawl the /blog page
- Discover 200 blog post URLs
- Crawl all 200 posts (expensive!)
- Next crawl: same 200 posts + 1 new one
- Re-crawl everything again
The Solution: AI-Powered Feed Discovery
On first crawl, the system uses Firecrawl API to discover ALL URLs on a domain, then Gemini AI to identify feed patterns:
# Firecrawl discovers comprehensive URL structure
firecrawl_result = await firecrawl_service.discover_urls(
    domain='competitor.com',
    include_subdomains=True
)

# AI analyzes URLs for feed patterns
ai_analysis = await llm_service.analyze_feed_patterns(
    urls=firecrawl_result['urls'],
    paths=firecrawl_result['paths'],
    domain='competitor.com'
)

# Save domain-specific patterns to Supabase
domain_config_service.save_domain_config(
    domain='competitor.com',
    feed_paths=ai_analysis['feed_paths'],  # e.g., ['/blog', '/insights', '/news']
    ai_analysis=ai_analysis
)
Multi-Session Feed Processing
The magic happens across crawl sessions:
First Crawl (Discovery):
- Detect /blog as a feed page
- Discover 50 child URLs
- Store URLs WITHOUT processing (no AI, no screenshots)
- Establish baseline for comparison

Second Crawl (Detection):
- Re-crawl /blog
- Find 52 URLs (2 new posts)
- Only process the 2 new URLs with AI
- Include existing URLs in session data (prevents false “removed” alerts)
Result: 60-80% reduction in redundant processing.
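Stripped of the surrounding plumbing, the cross-session bookkeeping reduces to a set comparison. A minimal sketch (the URLs and counts are invented for illustration):

```python
def diff_feed_urls(previous: set, current: set) -> dict:
    # Only genuinely new URLs go on to AI processing; URLs seen
    # before are carried forward so they are never flagged as removed.
    return {
        "new": sorted(current - previous),
        "removed": sorted(previous - current),
        "unchanged": sorted(previous & current),
    }

baseline = {f"https://competitor.com/blog/post-{i}" for i in range(50)}
latest = baseline | {"https://competitor.com/blog/post-50",
                     "https://competitor.com/blog/post-51"}

result = diff_feed_urls(baseline, latest)
print(len(result["new"]), len(result["unchanged"]))  # 2 50
```

Only the two entries in `result["new"]` would be scraped, summarized, and screenshotted; the other fifty cost nothing.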
Feed Pattern Detection
The system recognizes these URL patterns as feeds:
FEED_PATTERNS = [
    r'/news/?$', r'/blog/?$', r'/articles/?$',
    r'/insights/?$', r'/resources/?$', r'/whitepapers/?$',
    r'/awards/?$', r'/events/?$', r'/webinars/?$',
    r'/docs/?$', r'/help/?$', r'/careers/?$'
]
Combined with AI-discovered domain-specific patterns stored in Supabase, this catches even unusual feed structures.
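Because every pattern is anchored with `$`, only the listing page itself matches, never the individual posts beneath it. A quick runnable check (the `is_feed_path` helper is my illustration):

```python
import re

FEED_PATTERNS = [
    r'/news/?$', r'/blog/?$', r'/articles/?$',
    r'/insights/?$', r'/resources/?$', r'/whitepapers/?$',
    r'/awards/?$', r'/events/?$', r'/webinars/?$',
    r'/docs/?$', r'/help/?$', r'/careers/?$'
]

def is_feed_path(path: str) -> bool:
    # The $ anchor makes /blog match but /blog/my-post not match,
    # which is exactly the feed-vs-child distinction we need.
    return any(re.search(p, path) for p in FEED_PATTERNS)

print(is_feed_path('/blog'))          # True
print(is_feed_path('/careers/'))      # True (optional trailing slash)
print(is_feed_path('/blog/my-post'))  # False (child URL, not the feed)
```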
Dual Hashing: Catching Every Change
Change detection uses a two-tier hashing approach:
Tier 1: Raw HTML Hash
def calculate_html_hash(self, html: str) -> str:
    return hashlib.md5(html.encode('utf-8')).hexdigest()
Captures ALL changes: script updates, CSS modifications, A/B tests, tracking pixels.
Tier 2: Text Content Hash
def calculate_content_hash(self, html: str) -> str:
    # Extract text; strip scripts/styles/metadata
    soup = BeautifulSoup(html, 'lxml')
    for tag in soup(["script", "style", "meta", "link"]):
        tag.decompose()
    text = soup.get_text()
    return hashlib.md5(text.encode('utf-8')).hexdigest()
Focuses on meaningful content changes for AI analysis.
Why Both?
- HTML hash changes but text hash doesn’t → Technical change (A/B test, script update)
- Both hashes change → Content change worth analyzing with AI
- Backward compatibility → Old data without HTML hash still works via text hash fallback
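The decision logic that combines the two hashes can be sketched in a few lines. Text extraction is elided here (in the real system it comes from BeautifulSoup); the function name is my illustration:

```python
import hashlib

def md5(s: str) -> str:
    return hashlib.md5(s.encode('utf-8')).hexdigest()

def classify_change(old_html: str, new_html: str,
                    old_text: str, new_text: str) -> str:
    # old_text/new_text would come from BeautifulSoup extraction
    # in the real pipeline; they are passed in directly here.
    if md5(old_html) == md5(new_html):
        return "no_change"
    if md5(old_text) == md5(new_text):
        return "technical_change"  # A/B test, script or CSS update
    return "content_change"        # Worth sending to AI analysis

# Same visible text, different markup -> technical change only
print(classify_change('<p>Price: $99</p>', '<p class="v2">Price: $99</p>',
                      'Price: $99', 'Price: $99'))  # technical_change
# Visible text changed -> content change worth analyzing
print(classify_change('<p>Price: $99</p>', '<p>Price: $79</p>',
                      'Price: $99', 'Price: $79'))  # content_change
```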
Precise Word-Level Change Analysis
When content changes, we need to know exactly what changed. The naive approach (line-by-line diff) produces terrible results:
Bad Output (line-based):
REMOVED: "Home Products Pricing About Blog Contact"
ADDED: "Home Products Pricing About Blog Contact Login"
Good Output (word-level):
ADDED: "Login"
My implementation uses word tokenization with sequence matching:
def _analyze_text_changes(self, old_text: str, new_text: str) -> Dict:
    # Tokenize to words
    old_words = self._tokenize_text(old_text)
    new_words = self._tokenize_text(new_text)

    # Find precise word-level differences
    matcher = SequenceMatcher(None, old_words, new_words)

    added_segments = []
    removed_segments = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'insert':
            added_segments.append(' '.join(new_words[j1:j2]))
        elif tag == 'delete':
            removed_segments.append(' '.join(old_words[i1:i2]))

    return {
        'added': added_segments,
        'removed': removed_segments,
        'change_ratio': 1 - matcher.ratio()
    }
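Applied to the navigation-bar example above, this word-level approach produces exactly the compact output shown. A self-contained demo using only the standard library:

```python
from difflib import SequenceMatcher

old_words = "Home Products Pricing About Blog Contact".split()
new_words = "Home Products Pricing About Blog Contact Login".split()

matcher = SequenceMatcher(None, old_words, new_words)
added, removed = [], []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == 'insert':
        added.append(' '.join(new_words[j1:j2]))
    elif tag == 'delete':
        removed.append(' '.join(old_words[i1:i2]))

# Word-level diff isolates the single new token instead of
# reporting the entire line as removed-then-added.
print(added, removed)  # ['Login'] []
```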
Change Validation: Eliminating False Positives
Not all detected changes are real. Dynamic content, A/B tests, and JavaScript timing issues create false positives. The ChangeValidator service handles this:
class ChangeValidator:
    async def validate_changes(self, changes: List[Change]) -> List[Change]:
        validated = []
        for change in changes:
            # Re-scrape twice with a delay
            scrape1 = await self._scrape_url_once(change.url, 1)
            await asyncio.sleep(2)
            scrape2 = await self._scrape_url_once(change.url, 2)

            # Compare for consistency
            if scrape1['content_hash'] == scrape2['content_hash']:
                validated.append(change)  # Consistent = real change
            else:
                logger.info(f"Invalidated {change.url}: inconsistent content")
        return validated
Only changes that consistently reproduce proceed to AI analysis.
AI-Powered Business Context
Detecting changes is only half the battle. The real value is understanding what they mean. The ChangeAnalyzer service uses Google’s Gemini API to provide business context.
Page Summarization During Crawling
Every page is summarized in real-time:
async def generate_page_summary(self, text_content: str, url: str) -> PageSummary:
    prompt = f"""Analyze the following webpage content from URL: {url}

Provide:
1. A concise 2-3 sentence summary
2. The page type (pricing, product, blog, etc.)
3. 5-10 relevant keywords
4. Key entities mentioned (products, people, companies)

{text_content}
"""
    response = await self._call_with_backoff(prompt)
    return self._parse_summary_response(response)
Change Analysis with Severity Scoring
When changes are detected, AI evaluates their business significance:
@dataclass
class ChangeAnalysis:
    url: str
    change_type: str              # pricing_update, feature_addition, etc.
    severity: int                 # 1-10 scale
    change_definition: str        # 3-line business analysis
    recommended_pages: List[str]  # Related pages for context
Severity Guidelines:
- 1-3: Minor (typo fixes, date updates)
- 4-6: Moderate (new blog post, team change)
- 7-9: Significant (pricing change, new feature)
- 10: Critical (acquisition, major pivot)
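A small helper can map the numeric score onto those buckets and drive alerting. The `severity_label` function and the `>= 7` alert cutoff below are illustrative choices of mine, not part of the original system:

```python
def severity_label(severity: int) -> str:
    # Buckets mirror the guidelines above; exact thresholds
    # are a product decision, not something the AI decides.
    if not 1 <= severity <= 10:
        raise ValueError("severity must be between 1 and 10")
    if severity <= 3:
        return "minor"
    if severity <= 6:
        return "moderate"
    if severity <= 9:
        return "significant"
    return "critical"

changes = [("pricing page", 8), ("blog post", 4), ("typo fix", 1)]
# Only high-severity changes trigger an immediate notification
alerts = [(url, severity_label(s)) for url, s in changes if s >= 7]
print(alerts)  # [('pricing page', 'significant')]
```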
Exponential Backoff for API Reliability
Gemini API has rate limits. The service implements robust retry logic:
async def _call_with_backoff(self, prompt: str) -> str:
    for attempt in range(self.max_retries):  # e.g. 10 retries
        try:
            response = await asyncio.to_thread(
                self.model.generate_content, prompt
            )
            return response.text
        except Exception as e:
            if 'rate' in str(e).lower() or '429' in str(e):
                delay = min(self.base_delay * (2 ** attempt), 60)
                jitter = random.uniform(0, delay * 0.1)
                await asyncio.sleep(delay + jitter)
            else:
                raise
    raise RuntimeError("Gemini call failed after exhausting all retries")
Data Architecture: Hybrid Storage Strategy
The system uses a hybrid storage approach optimized for different access patterns:
Supabase (PostgreSQL) – Fast Queries
Normalized tables for dashboard queries:
-- Sessions table
CREATE TABLE competition_analysis_sessions (
    id UUID PRIMARY KEY,
    domain TEXT NOT NULL,
    session_id TEXT NOT NULL,
    comparison_type TEXT,
    total_changes INTEGER,
    created_at TIMESTAMPTZ
);

-- Changes table with indexes
CREATE TABLE competition_analysis_changes (
    id UUID PRIMARY KEY,
    session_analysis_id UUID REFERENCES competition_analysis_sessions(id),
    url TEXT NOT NULL,
    change_type TEXT NOT NULL,
    severity INTEGER CHECK (severity >= 1 AND severity <= 10),
    page_type TEXT,
    change_definition TEXT,
    recommended_pages JSONB,
    created_at TIMESTAMPTZ
);

-- Indexes for fast filtering
CREATE INDEX idx_severity ON competition_analysis_changes(severity);
CREATE INDEX idx_change_type ON competition_analysis_changes(change_type);
S3 (iDrive e2) – Complete Backups
Object storage for full content:
competitor.com/
├── sessions/
│ ├── 20250115_103000.json.gz # Complete session data
│ └── 20250114_103000.json.gz
├── master/
│ └── state.json.gz # Master state tracking
└── screenshots/
└── 20250115_103000/
├── homepage.png
└── pricing.png
Master State: The Source of Truth
The master state tracks all pages across sessions:
{
  "domain": "competitor.com",
  "pages": {
    "https://competitor.com/pricing": {
      "html_hash": "x9y8z7...",
      "content_hash": "a1b2c3...",
      "title": "Pricing Plans",
      "ai_summary": "...",
      "last_session": "20250115_103000"
    }
  },
  "feed_state": {
    "feeds": {
      "https://competitor.com/blog": {
        "feed_type": "blog",
        "discovered_urls": ["...", "..."],
        "last_url_count": 50
      }
    }
  },
  "sessions": [
    {"id": "20250115_103000", "pages_count": 45},
    {"id": "20250114_103000", "pages_count": 44}
  ]
}
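Detecting changes then amounts to comparing a fresh crawl against this master state. A minimal sketch with invented URLs and hashes (the function name is my illustration of the comparison step):

```python
def diff_against_master(master_pages: dict, crawled: dict) -> dict:
    # Both dicts map URL -> page record containing 'content_hash'.
    # A page counts as changed only if it exists in both snapshots
    # with a different content hash.
    changed = [u for u in crawled
               if u in master_pages
               and crawled[u]['content_hash'] != master_pages[u]['content_hash']]
    return {
        "added": sorted(set(crawled) - set(master_pages)),
        "removed": sorted(set(master_pages) - set(crawled)),
        "changed": sorted(changed),
    }

master = {
    "https://competitor.com/pricing": {"content_hash": "a1b2c3"},
    "https://competitor.com/about":   {"content_hash": "d4e5f6"},
}
crawl = {
    "https://competitor.com/pricing": {"content_hash": "zzz999"},  # edited
    "https://competitor.com/careers": {"content_hash": "777aaa"},  # new page
}
result = diff_against_master(master, crawl)
print(result)  # careers added, about removed, pricing changed
```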
URL Normalization: Preventing Duplicate Crawls
A subtle but critical feature: URLs with minor variations must be treated as identical:
https://factors.ai/pricing
https://www.factors.ai/pricing # www subdomain
https://factors.ai/pricing/ # trailing slash
https://FACTORS.AI/pricing # case difference
https://factors.ai/pricing?ref=nav # query params
The normalization function handles all these cases:
def normalize_url(self, url: str) -> str:
    parsed = urlparse(url)

    # 1. Lowercase the hostname, then 2. remove the www subdomain
    netloc = parsed.netloc.lower()
    if netloc.startswith('www.'):
        netloc = netloc[4:]

    # 3. Remove trailing slash
    path = parsed.path.rstrip('/') or '/'

    # 4. Sort query parameters into a canonical order
    query = ''
    if parsed.query:
        query = urlencode(sorted(parse_qsl(parsed.query)))

    # 5. Normalize index files (e.g. /index.html -> /)
    if path.endswith(('/index.html', '/index.htm')):
        path = path[:path.rfind('/')] or '/'

    # 6. Drop fragments (empty final argument)
    return urlunparse((parsed.scheme, netloc, path, '', query, ''))
Pagination URL Filtering
An interesting edge case: pagination URLs were causing exponential crawling:
/blog → discovers /blog?page=1, /blog?page=2, ...
/blog?page=1 → treated as new feed → discovers more pagination URLs
/blog?page=2 → same problem
... (exponential explosion)
Solution: Filter pagination URLs before they enter the queue:
pagination_patterns = [
    'page=', '_page=', 'p=', 'offset=', 'start=',
    'pagenum=', 'pagenumber=', 'pageindex=', 'paged='
]

def should_crawl(self, url: str) -> bool:
    parsed_url = urlparse(url)
    if parsed_url.query:
        query_lower = parsed_url.query.lower()
        if any(pattern in query_lower for pattern in pagination_patterns):
            return False  # Skip pagination URLs
    return True
Key Technical Decisions & Trade-offs
1. Playwright vs. Puppeteer vs. Selenium
Choice: Playwright
Why:
- Best JavaScript rendering support
- Async API for concurrent crawling
- Cross-browser testing if needed
- Active development and great docs
2. Gemini vs. GPT-4 vs. Claude
Choice: Google Gemini
Why:
- Cost-effective for high-volume summarization
- Fast response times
- Sufficient quality for page analysis
- Easy to switch providers via the LLMFactory abstraction
3. Supabase vs. Self-hosted PostgreSQL
Choice: Supabase
Why:
- Managed PostgreSQL with row-level security
- Real-time subscriptions for dashboard
- Built-in auth for multi-tenant future
- Edge functions for serverless compute
4. S3 vs. Database for HTML Storage
Choice: Hybrid (both)
Why:
- S3: Cheap storage for large HTML/screenshots
- Supabase: Fast queries for dashboard data
- Master state in S3: Single source of truth for crawl history
Performance Characteristics
| Metric | Value |
|---|---|
| Concurrent Workers | 5 (configurable to 10) |
| URLs per Minute | ~50-100 (depending on site speed) |
| JavaScript Wait Time | 0.5s (fast) to 5s (SPAs) |
| Success Rate | 96.7% |
| AI Summaries per Minute | ~20 (with rate limiting) |
| Storage per Crawl | ~2-10 MB compressed |
Lessons Learned
1. Browser Context Isolation is Critical
Sharing browser contexts between concurrent workers causes race conditions and cryptic errors. Each worker needs its own isolated browser instance.
2. NetworkIdle is a Lie
Modern websites never reach “network idle” due to analytics, tracking, and WebSocket connections. Use domcontentloaded instead.
3. Change Detection Needs Two Hashes
Raw HTML hash catches technical changes. Text content hash catches meaningful changes. You need both for complete coverage.
4. Feed Detection is Essential for Scale
Without intelligent feed handling, you’ll waste 80% of your crawl budget on blog posts and news articles that haven’t changed.
5. Validation Eliminates False Positives
A/B tests, personalization, and dynamic content create fake “changes”. Re-scraping twice with a delay catches these.
What I Would Do Differently
- Start with headless Chrome service: Instead of managing Playwright browsers in-process, use a dedicated browser pool service (Browserless, etc.)
- Event-driven architecture from day one: Use message queues (Redis, SQS) between crawl and analysis phases for better scaling
- More aggressive caching: Cache AI summaries longer—they rarely need regeneration if content hasn’t changed
- Visual diffing earlier: Screenshot comparison could catch changes that text analysis misses
The Market Reality
I built this system over several months, iterating through dozens of technical challenges. By the time it was production-ready, several well-funded competitors had entered the space:
- Klue
- Crayon
- Kompyte
- Similarweb
The market went from “interesting opportunity” to “saturated” faster than expected. But the technical work remains valuable—both as a portfolio piece and as a foundation for future projects.
Conclusion
Building a competitive intelligence platform touches nearly every aspect of modern software engineering:
- Distributed systems: Concurrent crawling, queue management
- Web scraping: JavaScript rendering, anti-bot evasion
- AI/ML: LLM integration, prompt engineering
- Data engineering: Hybrid storage, change detection
- Database design: Normalized schemas, efficient indexing
- API design: RESTful endpoints, real-time updates
While the market timing didn’t work out for commercial launch, this project demonstrates that complex, production-grade systems can be built by a small team (or solo developer) using modern tools and cloud services.
The code is functional, the architecture is sound, and the technical challenges were genuinely interesting to solve. Sometimes that’s the real value of a project—not the business outcome, but what you learn along the way.
If you’re interested in the technical details or want to discuss competitive intelligence systems, feel free to reach out. The lessons learned here apply to many other domains: price monitoring, content aggregation, research automation, and more.
Technical Stack Summary
| Component | Technology |
|---|---|
| Language | Python 3.12 |
| Web Crawling | Playwright, BeautifulSoup, lxml |
| AI | Google Gemini API, LangChain |
| URL Discovery | Firecrawl API |
| Database | Supabase (PostgreSQL) |
| Object Storage | S3-compatible (iDrive e2) |
| API Framework | FastAPI, Uvicorn |
| Job Scheduling | Celery, Redis, APScheduler |
| Testing | pytest, pytest-asyncio |
This project represents approximately 15,000+ lines of production-quality Python code, comprehensive documentation, and battle-tested solutions to real-world web scraping challenges.