How I built a fully functional competitor monitoring system that automatically crawls websites, detects changes, and provides AI-powered business insights
Introduction
In today’s fast-moving business landscape, keeping track of what your competitors are doing is crucial. When they change their pricing, launch new features, or pivot their strategy, you need to know—ideally before your customers do.
I built Competition Monitoring System—a comprehensive platform that automatically tracks competitor websites, detects meaningful changes, and uses AI to explain what those changes mean in a business context. While the market became saturated before I could launch it commercially, this project represents a fully functional, production-ready system that demonstrates sophisticated engineering across multiple domains.
In this deep-dive, I’ll walk you through the architecture, technical challenges, and solutions that make this system work.
The Problem: Why Manual Competitor Tracking Fails
Every product manager, growth marketer, and competitive analyst knows the pain:
- Manual checking doesn’t scale – You can’t manually check 10+ competitor websites daily
- Changes slip through – A competitor’s pricing page changes on Friday night, and you don’t notice until Monday’s customer call
- No historical context – Even if you catch a change, you often don’t know what it was before
- Signal vs. noise – Most website changes are irrelevant (footer updates, minor copy tweaks). The important ones get lost
The Solution: An Intelligent Monitoring Engine
I built a system that:
- Automatically crawls competitor websites on a configurable schedule
- Detects ALL changes using dual-hashing (raw HTML + extracted text)
- Filters out noise through intelligent feed detection and change validation
- Provides AI-powered analysis explaining what each change means for your business
- Stores everything efficiently for historical comparison and dashboard visualization
System Architecture Overview
The system is built with a clear separation of concerns:
┌──────────────────────────────────────────────────────────────────────┐
│ Competition Monitoring System │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Frontend │────▶│ REST API │────▶│ Backend Job Engine │ │
│ │ (React.js) │ │ (FastAPI) │ │ (Python) │ │
│ └─────────────┘ └─────────────┘ └──────────┬──────────┘ │
│ │ │
│ ┌───────────┴───────────┐ │
│ │ │ │
│ ┌─────▼─────┐ ┌──────▼──┐│
│ │ Supabase │ │ S3 ││
│ │ (Metadata)│ │(Content)││
│ └───────────┘ └─────────┘│
│ │
└──────────────────────────────────────────────────────────────────────┘
Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Web Crawling | Playwright + BeautifulSoup | JavaScript-enabled headless browsing |
| AI Analysis | Google Gemini API | Page summarization, change analysis |
| URL Discovery | Firecrawl API | Comprehensive site mapping |
| Database | Supabase (PostgreSQL) | Metadata, sessions, change records |
| Object Storage | S3-compatible (iDrive e2) | HTML snapshots, screenshots |
| API | FastAPI | RESTful endpoints |
| Frontend | React/Next.js | Dashboard visualization |
Deep Dive: The Crawling Engine
The heart of the system is the OptimizedWebCrawler—a sophisticated Python class that handles everything from URL normalization to JavaScript rendering.
Concurrent Architecture
Traditional web crawlers are slow because they process one URL at a time. My crawler uses an isolated worker architecture where multiple browser instances work in parallel:
class OptimizedWebCrawler:
    def __init__(self, base_url: str, max_pages: int = 100,
                 concurrent_limit: int = 5):
        self.base_url = base_url
        self.max_pages = max_pages
        self.concurrent_limit = concurrent_limit
        self.url_queue = RobustURLQueue()
        # Each worker gets its own browser instance;
        # complete isolation prevents context conflicts
Key insight: Sharing browser contexts between workers causes “Target page, context or browser has been closed” errors. By giving each worker its own browser instance, I achieved a 96.7% success rate (up from 45% with shared contexts).
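The isolation pattern itself is independent of Playwright. Here is a minimal runnable sketch of the worker-pool idea, with a plain dict standing in for each worker's private browser instance (the real crawler would launch a Playwright browser at that point; the names here are illustrative):

```python
import asyncio

async def run_worker(worker_id: int, queue: asyncio.Queue, results: list) -> None:
    # Stand-in for a per-worker Playwright browser: each worker owns
    # its own instance, so no state is ever shared between workers.
    browser = {"id": worker_id}
    while True:
        try:
            url = queue.get_nowait()
        except asyncio.QueueEmpty:
            break  # Queue drained: this worker is done
        # Simulated fetch; the real code would use browser pages here
        results.append((browser["id"], url))

async def crawl(urls: list, concurrent_limit: int = 5) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: list = []
    # Launch N isolated workers that drain the shared URL queue
    await asyncio.gather(*(run_worker(i, queue, results)
                           for i in range(concurrent_limit)))
    return results

pages = asyncio.run(crawl([f"https://example.com/p{i}" for i in range(12)]))
print(len(pages))  # 12 (every URL handled exactly once)
```

Only the URL queue is shared; everything stateful lives inside a single worker, which is what eliminates the cross-context races.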
Robust URL Queue with Retry Logic
URLs fail for many reasons—network timeouts, rate limiting, temporary server errors. The RobustURLQueue class handles this gracefully:
import asyncio
from dataclasses import dataclass
from typing import Optional

@dataclass
class URLTask:
    url: str
    retry_count: int = 0
    max_retries: int = 3
    last_error: Optional[str] = None

class RobustURLQueue:
    def __init__(self):
        self.main_queue = asyncio.Queue()
        self.retry_queue = asyncio.Queue()  # Priority for retries
        self.failed_urls = []

    async def get(self, timeout: float = 2.0) -> Optional[URLTask]:
        # Prioritize retry queue over main queue
        if not self.retry_queue.empty():
            return await self.retry_queue.get()
        try:
            return await asyncio.wait_for(self.main_queue.get(), timeout)
        except asyncio.TimeoutError:
            return None  # Signals an idle queue so workers can exit
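One piece the snippet leaves implicit is how a failed URL gets back onto the retry queue. A runnable sketch restating the queue with a hypothetical `mark_failed` method (the method name and the demo loop are my illustration, not the original API):

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

@dataclass
class URLTask:
    url: str
    retry_count: int = 0
    max_retries: int = 3
    last_error: Optional[str] = None

class RobustURLQueue:
    def __init__(self):
        self.main_queue = asyncio.Queue()
        self.retry_queue = asyncio.Queue()
        self.failed_urls = []

    async def mark_failed(self, task: URLTask, error: str) -> None:
        # Requeue with priority until retries are exhausted,
        # then record the URL as permanently failed.
        task.retry_count += 1
        task.last_error = error
        if task.retry_count <= task.max_retries:
            await self.retry_queue.put(task)
        else:
            self.failed_urls.append(task)

async def demo() -> RobustURLQueue:
    q = RobustURLQueue()
    task = URLTask(url="https://example.com/flaky")
    for _ in range(4):  # One failure more than max_retries allows
        await q.mark_failed(task, "timeout")
        if not q.retry_queue.empty():
            task = await q.retry_queue.get()  # Worker picks the retry up
    return q

q = asyncio.run(demo())
print(len(q.failed_urls), q.failed_urls[0].retry_count)  # 1 4
```

After three prioritized retries the fourth failure moves the task to `failed_urls`, so a permanently broken URL can never starve the queue.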
Enhanced JavaScript Rendering
Modern websites are JavaScript-heavy. A naive crawler that just fetches HTML will miss 80% of the content. My solution implements multi-stage loading detection:
async def _wait_for_javascript_content(self, page, url: str):
    # Step 1: Quick check if already loaded
    ready_state = await page.evaluate('() => document.readyState')
    if ready_state == 'complete':
        await page.wait_for_timeout(500)
        return

    # Step 2: Wait for meaningful content
    await page.wait_for_function('''() => {
        return document.body &&
               document.body.innerText &&
               document.body.innerText.length > 100;
    }''', timeout=5000)

    # Step 3: Special SPA handling (React/Vue/Angular)
    if any(fw in url.lower() for fw in ['app.', 'console.', 'dashboard.']):
        await page.wait_for_function('''() => {
            if (window.React || window.Vue || window.ng) {
                return document.readyState === 'complete';
            }
            return true;
        }''', timeout=3000)
Result: 80-90% faster JavaScript content detection while maintaining reliability.
Smart Navigation Strategy
A critical discovery: using wait_until="networkidle" causes hangs on sites with continuous background requests (analytics, tracking pixels). The fix was simple but crucial:
# Before (problematic - would timeout on modern sites)
response = await page.goto(url, wait_until="networkidle", timeout=180000)
# After (reliable - works with background activity)
response = await page.goto(url, wait_until="domcontentloaded", timeout=60000)
This single change reduced crawl times from 180+ seconds (timeout) to 8-10 seconds for sites like factors.ai.
Intelligent Feed Detection
One of the most sophisticated features is the Feed Detection & Optimization System. Feed pages (blogs, news, resources) contain hundreds of child URLs that follow similar patterns. Crawling them all is wasteful.
The Problem
Without feed detection:
- Crawl the /blog page
- Discover 200 blog post URLs
- Crawl all 200 posts (expensive!)
- Next crawl: same 200 posts + 1 new one
- Re-crawl everything again
The Solution: AI-Powered Feed Discovery
On first crawl, the system uses Firecrawl API to discover ALL URLs on a domain, then Gemini AI to identify feed patterns:
# Firecrawl discovers comprehensive URL structure
firecrawl_result = await firecrawl_service.discover_urls(
    domain='competitor.com',
    include_subdomains=True
)

# AI analyzes URLs for feed patterns
ai_analysis = await llm_service.analyze_feed_patterns(
    urls=firecrawl_result['urls'],
    paths=firecrawl_result['paths'],
    domain='competitor.com'
)

# Save domain-specific patterns to Supabase
domain_config_service.save_domain_config(
    domain='competitor.com',
    feed_paths=ai_analysis['feed_paths'],  # e.g., ['/blog', '/insights', '/news']
    ai_analysis=ai_analysis
)
Multi-Session Feed Processing
The magic happens across crawl sessions:
First Crawl (Discovery):
- Detect /blog as a feed page
- Discover 50 child URLs
- Store URLs WITHOUT processing (no AI, no screenshots)
- Establish baseline for comparison

Second Crawl (Detection):
- Re-crawl /blog
- Find 52 URLs (2 new posts)
- Only process the 2 new URLs with AI
- Include existing URLs in session data (prevents false “removed” alerts)
Result: 60-80% reduction in redundant processing.
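Stripped of the surrounding plumbing, the cross-session bookkeeping reduces to a set comparison. A minimal sketch (the URLs and counts are invented for illustration):

```python
def diff_feed_urls(previous: set, current: set) -> dict:
    # Only genuinely new URLs go on to AI processing; URLs seen
    # before are carried forward so they are never flagged as removed.
    return {
        "new": sorted(current - previous),
        "removed": sorted(previous - current),
        "unchanged": sorted(previous & current),
    }

baseline = {f"https://competitor.com/blog/post-{i}" for i in range(50)}
latest = baseline | {"https://competitor.com/blog/post-50",
                     "https://competitor.com/blog/post-51"}

result = diff_feed_urls(baseline, latest)
print(len(result["new"]), len(result["unchanged"]))  # 2 50
```

Only the two entries in `result["new"]` would be scraped, summarized, and screenshotted; the other fifty cost nothing.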
Feed Pattern Detection
The system recognizes these URL patterns as feeds:
FEED_PATTERNS = [
    r'/news/?$', r'/blog/?$', r'/articles/?$',
    r'/insights/?$', r'/resources/?$', r'/whitepapers/?$',
    r'/awards/?$', r'/events/?$', r'/webinars/?$',
    r'/docs/?$', r'/help/?$', r'/careers/?$'
]
Combined with AI-discovered domain-specific patterns stored in Supabase, this catches even unusual feed structures.
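Because every pattern is anchored with `$`, only the listing page itself matches, never the individual posts beneath it. A quick runnable check (the `is_feed_path` helper is my illustration):

```python
import re

FEED_PATTERNS = [
    r'/news/?$', r'/blog/?$', r'/articles/?$',
    r'/insights/?$', r'/resources/?$', r'/whitepapers/?$',
    r'/awards/?$', r'/events/?$', r'/webinars/?$',
    r'/docs/?$', r'/help/?$', r'/careers/?$'
]

def is_feed_path(path: str) -> bool:
    # The $ anchor makes /blog match but /blog/my-post not match,
    # which is exactly the feed-vs-child distinction we need.
    return any(re.search(p, path) for p in FEED_PATTERNS)

print(is_feed_path('/blog'))          # True
print(is_feed_path('/careers/'))      # True (optional trailing slash)
print(is_feed_path('/blog/my-post'))  # False (child URL, not the feed)
```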
Dual Hashing: Catching Every Change
Change detection uses a two-tier hashing approach:
Tier 1: Raw HTML Hash
def calculate_html_hash(self, html: str) -> str:
    return hashlib.md5(html.encode('utf-8')).hexdigest()
Captures ALL changes: script updates, CSS modifications, A/B tests, tracking pixels.
Tier 2: Text Content Hash
def calculate_content_hash(self, html: str) -> str:
    # Extract text; strip scripts/styles/metadata
    soup = BeautifulSoup(html, 'lxml')
    for tag in soup(["script", "style", "meta", "link"]):
        tag.decompose()
    text = soup.get_text()
    return hashlib.md5(text.encode('utf-8')).hexdigest()
Focuses on meaningful content changes for AI analysis.
Why Both?
- HTML hash changes but text hash doesn’t → Technical change (A/B test, script update)
- Both hashes change → Content change worth analyzing with AI
- Backward compatibility → Old data without HTML hash still works via text hash fallback
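The decision logic that combines the two hashes can be sketched in a few lines. Text extraction is elided here (in the real system it comes from BeautifulSoup); the function name is my illustration:

```python
import hashlib

def md5(s: str) -> str:
    return hashlib.md5(s.encode('utf-8')).hexdigest()

def classify_change(old_html: str, new_html: str,
                    old_text: str, new_text: str) -> str:
    # old_text/new_text would come from BeautifulSoup extraction
    # in the real pipeline; they are passed in directly here.
    if md5(old_html) == md5(new_html):
        return "no_change"
    if md5(old_text) == md5(new_text):
        return "technical_change"  # A/B test, script or CSS update
    return "content_change"        # Worth sending to AI analysis

# Same visible text, different markup -> technical change only
print(classify_change('<p>Price: $99</p>', '<p class="v2">Price: $99</p>',
                      'Price: $99', 'Price: $99'))  # technical_change
# Visible text changed -> content change worth analyzing
print(classify_change('<p>Price: $99</p>', '<p>Price: $79</p>',
                      'Price: $99', 'Price: $79'))  # content_change
```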
Precise Word-Level Change Analysis
When content changes, we need to know exactly what changed. The naive approach (line-by-line diff) produces terrible results:
Bad Output (line-based):
REMOVED: "Home Products Pricing About Blog Contact"
ADDED: "Home Products Pricing About Blog Contact Login"
Good Output (word-level):
ADDED: "Login"
My implementation uses word tokenization with sequence matching:
def _analyze_text_changes(self, old_text: str, new_text: str) -> Dict:
    # Tokenize to words
    old_words = self._tokenize_text(old_text)
    new_words = self._tokenize_text(new_text)

    # Find precise word-level differences
    matcher = SequenceMatcher(None, old_words, new_words)

    added_segments = []
    removed_segments = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'insert':
            added_segments.append(' '.join(new_words[j1:j2]))
        elif tag == 'delete':
            removed_segments.append(' '.join(old_words[i1:i2]))

    return {
        'added': added_segments,
        'removed': removed_segments,
        'change_ratio': 1 - matcher.ratio()
    }
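Applied to the navigation-bar example above, this word-level approach produces exactly the compact output shown. A self-contained demo using only the standard library:

```python
from difflib import SequenceMatcher

old_words = "Home Products Pricing About Blog Contact".split()
new_words = "Home Products Pricing About Blog Contact Login".split()

matcher = SequenceMatcher(None, old_words, new_words)
added, removed = [], []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == 'insert':
        added.append(' '.join(new_words[j1:j2]))
    elif tag == 'delete':
        removed.append(' '.join(old_words[i1:i2]))

# Word-level diff isolates the single new token instead of
# reporting the entire line as removed-then-added.
print(added, removed)  # ['Login'] []
```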
Change Validation: Eliminating False Positives
Not all detected changes are real. Dynamic content, A/B tests, and JavaScript timing issues create false positives. The ChangeValidator service handles this:
class ChangeValidator:
    async def validate_changes(self, changes: List[Change]) -> List[Change]:
        validated = []
        for change in changes:
            # Re-scrape twice with a delay
            scrape1 = await self._scrape_url_once(change.url, 1)
            await asyncio.sleep(2)
            scrape2 = await self._scrape_url_once(change.url, 2)

            # Compare for consistency
            if scrape1['content_hash'] == scrape2['content_hash']:
                validated.append(change)  # Consistent = real change
            else:
                logger.info(f"Invalidated {change.url}: inconsistent content")
        return validated
Only changes that consistently reproduce proceed to AI analysis.
AI-Powered Business Context
Detecting changes is only half the battle. The real value is understanding what they mean. The ChangeAnalyzer service uses Google’s Gemini API to provide business context.
Page Summarization During Crawling
Every page is summarized in real-time:
async def generate_page_summary(self, text_content: str, url: str) -> PageSummary:
    prompt = f"""Analyze the following webpage content from URL: {url}

Provide:
1. A concise 2-3 sentence summary
2. The page type (pricing, product, blog, etc.)
3. 5-10 relevant keywords
4. Key entities mentioned (products, people, companies)

{text_content}
"""
    response = await self._call_with_backoff(prompt)
    return self._parse_summary_response(response)
Change Analysis with Severity Scoring
When changes are detected, AI evaluates their business significance:
@dataclass
class ChangeAnalysis:
    url: str
    change_type: str              # pricing_update, feature_addition, etc.
    severity: int                 # 1-10 scale
    change_definition: str        # 3-line business analysis
    recommended_pages: List[str]  # Related pages for context
Severity Guidelines:
- 1-3: Minor (typo fixes, date updates)
- 4-6: Moderate (new blog post, team change)
- 7-9: Significant (pricing change, new feature)
- 10: Critical (acquisition, major pivot)
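A small helper can map the numeric score onto those buckets and drive alerting. The `severity_label` function and the `>= 7` alert cutoff below are illustrative choices of mine, not part of the original system:

```python
def severity_label(severity: int) -> str:
    # Buckets mirror the guidelines above; exact thresholds
    # are a product decision, not something the AI decides.
    if not 1 <= severity <= 10:
        raise ValueError("severity must be between 1 and 10")
    if severity <= 3:
        return "minor"
    if severity <= 6:
        return "moderate"
    if severity <= 9:
        return "significant"
    return "critical"

changes = [("pricing page", 8), ("blog post", 4), ("typo fix", 1)]
# Only high-severity changes trigger an immediate notification
alerts = [(url, severity_label(s)) for url, s in changes if s >= 7]
print(alerts)  # [('pricing page', 'significant')]
```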
Exponential Backoff for API Reliability
Gemini API has rate limits. The service implements robust retry logic:
async def _call_with_backoff(self, prompt: str) -> str:
    for attempt in range(self.max_retries):  # e.g. 10 retries
        try:
            response = await asyncio.to_thread(
                self.model.generate_content, prompt
            )
            return response.text
        except Exception as e:
            if 'rate' in str(e).lower() or '429' in str(e):
                delay = min(self.base_delay * (2 ** attempt), 60)
                jitter = random.uniform(0, delay * 0.1)
                await asyncio.sleep(delay + jitter)
            else:
                raise
    raise RuntimeError("Gemini call failed after exhausting all retries")
Data Architecture: Hybrid Storage Strategy
The system uses a hybrid storage approach optimized for different access patterns:
Supabase (PostgreSQL) – Fast Queries
Normalized tables for dashboard queries:
-- Sessions table
CREATE TABLE competition_analysis_sessions (
    id UUID PRIMARY KEY,
    domain TEXT NOT NULL,
    session_id TEXT NOT NULL,
    comparison_type TEXT,
    total_changes INTEGER,
    created_at TIMESTAMPTZ
);

-- Changes table with indexes
CREATE TABLE competition_analysis_changes (
    id UUID PRIMARY KEY,
    session_analysis_id UUID REFERENCES competition_analysis_sessions(id),
    url TEXT NOT NULL,
    change_type TEXT NOT NULL,
    severity INTEGER CHECK (severity >= 1 AND severity <= 10),
    page_type TEXT,
    change_definition TEXT,
    recommended_pages JSONB,
    created_at TIMESTAMPTZ
);

-- Indexes for fast filtering
CREATE INDEX idx_severity ON competition_analysis_changes(severity);
CREATE INDEX idx_change_type ON competition_analysis_changes(change_type);
S3 (iDrive e2) – Complete Backups
Object storage for full content:
competitor.com/
├── sessions/
│ ├── 20250115_103000.json.gz # Complete session data
│ └── 20250114_103000.json.gz
├── master/
│ └── state.json.gz # Master state tracking
└── screenshots/
└── 20250115_103000/
├── homepage.png
└── pricing.png
Master State: The Source of Truth
The master state tracks all pages across sessions:
{
  "domain": "competitor.com",
  "pages": {
    "https://competitor.com/pricing": {
      "html_hash": "x9y8z7...",
      "content_hash": "a1b2c3...",
      "title": "Pricing Plans",
      "ai_summary": "...",
      "last_session": "20250115_103000"
    }
  },
  "feed_state": {
    "feeds": {
      "https://competitor.com/blog": {
        "feed_type": "blog",
        "discovered_urls": ["...", "..."],
        "last_url_count": 50
      }
    }
  },
  "sessions": [
    {"id": "20250115_103000", "pages_count": 45},
    {"id": "20250114_103000", "pages_count": 44}
  ]
}
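Detecting changes then amounts to comparing a fresh crawl against this master state. A minimal sketch with invented URLs and hashes (the function name is my illustration of the comparison step):

```python
def diff_against_master(master_pages: dict, crawled: dict) -> dict:
    # Both dicts map URL -> page record containing 'content_hash'.
    # A page counts as changed only if it exists in both snapshots
    # with a different content hash.
    changed = [u for u in crawled
               if u in master_pages
               and crawled[u]['content_hash'] != master_pages[u]['content_hash']]
    return {
        "added": sorted(set(crawled) - set(master_pages)),
        "removed": sorted(set(master_pages) - set(crawled)),
        "changed": sorted(changed),
    }

master = {
    "https://competitor.com/pricing": {"content_hash": "a1b2c3"},
    "https://competitor.com/about":   {"content_hash": "d4e5f6"},
}
crawl = {
    "https://competitor.com/pricing": {"content_hash": "zzz999"},  # edited
    "https://competitor.com/careers": {"content_hash": "777aaa"},  # new page
}
result = diff_against_master(master, crawl)
print(result)  # careers added, about removed, pricing changed
```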
URL Normalization: Preventing Duplicate Crawls
A subtle but critical feature: URLs with minor variations must be treated as identical:
https://factors.ai/pricing
https://www.factors.ai/pricing # www subdomain
https://factors.ai/pricing/ # trailing slash
https://FACTORS.AI/pricing # case difference
https://factors.ai/pricing?ref=nav # query params
The normalization function handles all these cases:
def normalize_url(self, url: str) -> str:
    parsed = urlparse(url)

    # 1. Lowercase the hostname, then 2. remove the www subdomain
    netloc = parsed.netloc.lower()
    if netloc.startswith('www.'):
        netloc = netloc[4:]

    # 3. Remove trailing slash
    path = parsed.path.rstrip('/') or '/'

    # 4. Sort query parameters into a canonical order
    query = ''
    if parsed.query:
        query = urlencode(sorted(parse_qsl(parsed.query)))

    # 5. Normalize index files (e.g. /index.html -> /)
    if path.endswith(('/index.html', '/index.htm')):
        path = path[:path.rfind('/')] or '/'

    # 6. Drop fragments (empty final argument)
    return urlunparse((parsed.scheme, netloc, path, '', query, ''))
Pagination URL Filtering
An interesting edge case: pagination URLs were causing exponential crawling:
/blog → discovers /blog?page=1, /blog?page=2, ...
/blog?page=1 → treated as new feed → discovers more pagination URLs
/blog?page=2 → same problem
... (exponential explosion)
Solution: Filter pagination URLs before they enter the queue:
pagination_patterns = [
    'page=', '_page=', 'p=', 'offset=', 'start=',
    'pagenum=', 'pagenumber=', 'pageindex=', 'paged='
]

def should_crawl(self, url: str) -> bool:
    parsed_url = urlparse(url)
    if parsed_url.query:
        query_lower = parsed_url.query.lower()
        if any(pattern in query_lower for pattern in pagination_patterns):
            return False  # Skip pagination URLs
    return True
Key Technical Decisions & Trade-offs
1. Playwright vs. Puppeteer vs. Selenium
Choice: Playwright
Why:
- Best JavaScript rendering support
- Async API for concurrent crawling
- Cross-browser testing if needed
- Active development and great docs
2. Gemini vs. GPT-4 vs. Claude
Choice: Google Gemini
Why:
- Cost-effective for high-volume summarization
- Fast response times
- Sufficient quality for page analysis
- Easy to switch providers via the LLMFactory abstraction
3. Supabase vs. Self-hosted PostgreSQL
Choice: Supabase
Why:
- Managed PostgreSQL with row-level security
- Real-time subscriptions for dashboard
- Built-in auth for multi-tenant future
- Edge functions for serverless compute
4. S3 vs. Database for HTML Storage
Choice: Hybrid (both)
Why:
- S3: Cheap storage for large HTML/screenshots
- Supabase: Fast queries for dashboard data
- Master state in S3: Single source of truth for crawl history
Performance Characteristics
| Metric | Value |
|---|---|
| Concurrent Workers | 5 (configurable to 10) |
| URLs per Minute | ~50-100 (depending on site speed) |
| JavaScript Wait Time | 0.5s (fast) to 5s (SPAs) |
| Success Rate | 96.7% |
| AI Summaries per Minute | ~20 (with rate limiting) |
| Storage per Crawl | ~2-10 MB compressed |
Lessons Learned
1. Browser Context Isolation is Critical
Sharing browser contexts between concurrent workers causes race conditions and cryptic errors. Each worker needs its own isolated browser instance.
2. NetworkIdle is a Lie
Modern websites never reach “network idle” due to analytics, tracking, and WebSocket connections. Use domcontentloaded instead.
3. Change Detection Needs Two Hashes
Raw HTML hash catches technical changes. Text content hash catches meaningful changes. You need both for complete coverage.
4. Feed Detection is Essential for Scale
Without intelligent feed handling, you’ll waste 80% of your crawl budget on blog posts and news articles that haven’t changed.
5. Validation Eliminates False Positives
A/B tests, personalization, and dynamic content create fake “changes”. Re-scraping twice with a delay catches these.
What I Would Do Differently
- Start with headless Chrome service: Instead of managing Playwright browsers in-process, use a dedicated browser pool service (Browserless, etc.)
- Event-driven architecture from day one: Use message queues (Redis, SQS) between crawl and analysis phases for better scaling
- More aggressive caching: Cache AI summaries longer—they rarely need regeneration if content hasn’t changed
- Visual diffing earlier: Screenshot comparison could catch changes that text analysis misses
The Market Reality
I built this system over several months, iterating through dozens of technical challenges. By the time it was production-ready, several well-funded competitors had entered the space:
- Klue
- Crayon
- Kompyte
- Similarweb
The market went from “interesting opportunity” to “saturated” faster than expected. But the technical work remains valuable—both as a portfolio piece and as a foundation for future projects.
Conclusion
Building a competitive intelligence platform touches nearly every aspect of modern software engineering:
- Distributed systems: Concurrent crawling, queue management
- Web scraping: JavaScript rendering, anti-bot evasion
- AI/ML: LLM integration, prompt engineering
- Data engineering: Hybrid storage, change detection
- Database design: Normalized schemas, efficient indexing
- API design: RESTful endpoints, real-time updates
While the market timing didn’t work out for commercial launch, this project demonstrates that complex, production-grade systems can be built by a small team (or solo developer) using modern tools and cloud services.
The code is functional, the architecture is sound, and the technical challenges were genuinely interesting to solve. Sometimes that’s the real value of a project—not the business outcome, but what you learn along the way.
If you’re interested in the technical details or want to discuss competitive intelligence systems, feel free to reach out. The lessons learned here apply to many other domains: price monitoring, content aggregation, research automation, and more.
Technical Stack Summary
| Component | Technology |
|---|---|
| Language | Python 3.12 |
| Web Crawling | Playwright, BeautifulSoup, lxml |
| AI | Google Gemini API, LangChain |
| URL Discovery | Firecrawl API |
| Database | Supabase (PostgreSQL) |
| Object Storage | S3-compatible (iDrive e2) |
| API Framework | FastAPI, Uvicorn |
| Job Scheduling | Celery, Redis, APScheduler |
| Testing | pytest, pytest-asyncio |
This project represents approximately 15,000+ lines of production-quality Python code, comprehensive documentation, and battle-tested solutions to real-world web scraping challenges.