{"id":62,"date":"2025-08-14T17:12:00","date_gmt":"2025-08-14T17:12:00","guid":{"rendered":"https:\/\/balamurali.in\/blog\/?p=62"},"modified":"2026-02-23T14:27:09","modified_gmt":"2026-02-23T14:27:09","slug":"building-an-ai-powered-competitive-intelligence-platform-a-deep-technical-dive","status":"publish","type":"post","link":"https:\/\/balamurali.in\/blog\/tech-posts\/building-an-ai-powered-competitive-intelligence-platform-a-deep-technical-dive\/","title":{"rendered":"Building an AI-Powered Competitive Intelligence Platform: A Deep Technical Dive"},"content":{"rendered":"\n<p><em>How I built a fully functional competitor monitoring system that automatically crawls websites, detects changes, and provides AI-powered business insights<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<p>In today&#8217;s fast-moving business landscape, keeping track of what your competitors are doing is crucial. When they change their pricing, launch new features, or pivot their strategy, you need to know\u2014ideally before your customers do.<\/p>\n\n\n\n<p>I built <strong>Competition Monitoring System<\/strong>\u2014a comprehensive platform that automatically tracks competitor websites, detects meaningful changes, and uses AI to explain what those changes mean in a business context. 
While the market got saturated before I could launch it commercially, this project represents a fully-functional, production-ready system that demonstrates sophisticated engineering across multiple domains.<\/p>\n\n\n\n<p>In this deep-dive, I&#8217;ll walk you through the architecture, technical challenges, and solutions that make this system work.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">The Problem: Why Manual Competitor Tracking Fails<\/h2>\n\n\n\n<p>Every product manager, growth marketer, and competitive analyst knows the pain:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Manual checking doesn&#8217;t scale<\/strong> &#8211; You can&#8217;t manually check 10+ competitor websites daily<\/li>\n\n\n\n<li><strong>Changes slip through<\/strong> &#8211; A competitor&#8217;s pricing page changes on Friday night, and you don&#8217;t notice until Monday&#8217;s customer call<\/li>\n\n\n\n<li><strong>No historical context<\/strong> &#8211; Even if you catch a change, you often don&#8217;t know what it was before<\/li>\n\n\n\n<li><strong>Signal vs. noise<\/strong> &#8211; Most website changes are irrelevant (footer updates, minor copy tweaks). 
The important ones get lost<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">The Solution: An Intelligent Monitoring Engine<\/h2>\n\n\n\n<p>I built a system that:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Automatically crawls competitor websites<\/strong> on a configurable schedule<\/li>\n\n\n\n<li><strong>Detects ALL changes<\/strong> using dual-hashing (raw HTML + extracted text)<\/li>\n\n\n\n<li><strong>Filters out noise<\/strong> through intelligent feed detection and change validation<\/li>\n\n\n\n<li><strong>Provides AI-powered analysis<\/strong> explaining what each change means for your business<\/li>\n\n\n\n<li><strong>Stores everything efficiently<\/strong> for historical comparison and dashboard visualization<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">System Architecture Overview<\/h2>\n\n\n\n<p>The system is built with a clear separation of concerns:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502                          Competition Monitoring System                \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502                          
                                             \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510     \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510     \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510    \u2502\n\u2502  \u2502   Frontend  \u2502\u2500\u2500\u2500\u2500\u25b6\u2502   REST API  \u2502\u2500\u2500\u2500\u2500\u25b6\u2502  Backend Job Engine \u2502    \u2502\n\u2502  \u2502  (React.js) \u2502     \u2502  (FastAPI)  \u2502     \u2502     (Python)        \u2502    \u2502\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518     \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518     \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518    \u2502\n\u2502                                                      \u2502               \u2502\n\u2502                                          \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510   \u2502\n\u2502                                          \u2502                       \u2502   \u2502\n\u2502                                    \u250c\u2500\u2500\u2500\u2500\u2500\u25bc\u2500\u2500\u2500\u2500\u2500\u2510          \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u25bc\u2500\u2500\u2510\u2502\n\u2502                                    \u2502  Supabase \u2502          \u2502   S3    \u2502\u2502\n\u2502                                    \u2502 (Metadata)\u2502          \u2502(Content)\u2502\u2502\n\u2502                                    \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518          
\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\u2502\n\u2502                                                                       \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Technology Stack<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Layer<\/th><th>Technology<\/th><th>Purpose<\/th><\/tr><\/thead><tbody><tr><td><strong>Web Crawling<\/strong><\/td><td>Playwright + BeautifulSoup<\/td><td>JavaScript-enabled headless browsing<\/td><\/tr><tr><td><strong>AI Analysis<\/strong><\/td><td>Google Gemini API<\/td><td>Page summarization, change analysis<\/td><\/tr><tr><td><strong>URL Discovery<\/strong><\/td><td>Firecrawl API<\/td><td>Comprehensive site mapping<\/td><\/tr><tr><td><strong>Database<\/strong><\/td><td>Supabase (PostgreSQL)<\/td><td>Metadata, sessions, change records<\/td><\/tr><tr><td><strong>Object Storage<\/strong><\/td><td>S3-compatible (iDrive e2)<\/td><td>HTML snapshots, screenshots<\/td><\/tr><tr><td><strong>API<\/strong><\/td><td>FastAPI<\/td><td>RESTful endpoints<\/td><\/tr><tr><td><strong>Frontend<\/strong><\/td><td>React\/Next.js<\/td><td>Dashboard visualization<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Deep Dive: The Crawling Engine<\/h2>\n\n\n\n<p>The heart of the system is the <code>OptimizedWebCrawler<\/code>\u2014a sophisticated Python class that handles everything from URL normalization to JavaScript rendering.<\/p>\n\n\n\n<h3 
class=\"wp-block-heading\">Concurrent Architecture<\/h3>\n\n\n\n<p>Traditional web crawlers are slow because they process one URL at a time. My crawler uses an <strong>isolated worker architecture<\/strong> where multiple browser instances work in parallel:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>class OptimizedWebCrawler:\n    def __init__(self, base_url: str, max_pages: int = 100, \n                 concurrent_limit: int = 5):\n        self.concurrent_limit = concurrent_limit\n        self.url_queue = RobustURLQueue()\n        # Each worker gets its own browser instance\n        # Complete isolation prevents context conflicts<\/code><\/pre>\n\n\n\n<p><strong>Key insight<\/strong>: Sharing browser contexts between workers causes &#8220;Target page, context or browser has been closed&#8221; errors. By giving each worker its own browser instance, I achieved a <strong>96.7% success rate<\/strong> (up from 45% with shared contexts).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Robust URL Queue with Retry Logic<\/h3>\n\n\n\n<p>URLs fail for many reasons\u2014network timeouts, rate limiting, temporary server errors. The <code>RobustURLQueue<\/code> class handles this gracefully:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>@dataclass\nclass URLTask:\n    url: str\n    retry_count: int = 0\n    max_retries: int = 3\n    last_error: Optional&#91;str] = None\n\nclass RobustURLQueue:\n    def __init__(self):\n        self.main_queue = asyncio.Queue()\n        self.retry_queue = asyncio.Queue()  # Priority for retries\n        self.failed_urls = &#91;]\n\n    async def get(self, timeout: float = 2.0) -&gt; Optional&#91;URLTask]:\n        # Prioritize retry queue over main queue\n        if not self.retry_queue.empty():\n            return await self.retry_queue.get()\n        try:\n            return await asyncio.wait_for(self.main_queue.get(), timeout)\n        except asyncio.TimeoutError:\n            return None  # Lets idle workers exit instead of blocking forever<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Enhanced JavaScript Rendering<\/h3>\n\n\n\n<p>Modern websites are JavaScript-heavy. 
A naive crawler that just fetches HTML will miss 80% of the content. My solution implements <strong>multi-stage loading detection<\/strong>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>async def _wait_for_javascript_content(self, page, url: str):\n    # Step 1: Quick check if already loaded\n    ready_state = await page.evaluate('() =&gt; document.readyState')\n    if ready_state == 'complete':\n        await page.wait_for_timeout(500)\n        return\n\n    # Step 2: Wait for meaningful content\n    await page.wait_for_function('''() =&gt; {\n        return document.body &amp;&amp;\n               document.body.innerText &amp;&amp;\n               document.body.innerText.length &gt; 100;\n    }''', timeout=5000)\n\n    # Step 3: Special SPA handling (React\/Vue\/Angular)\n    if any(fw in url.lower() for fw in &#91;'app.', 'console.', 'dashboard.']):\n        await page.wait_for_function('''() =&gt; {\n            if (window.React || window.Vue || window.ng) {\n                return document.readyState === 'complete';\n            }\n            return true;\n        }''', timeout=3000)<\/code><\/pre>\n\n\n\n<p><strong>Result<\/strong>: 80-90% faster JavaScript content waiting while maintaining reliability.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Smart Navigation Strategy<\/h3>\n\n\n\n<p>A critical discovery: using <code>wait_until=\"networkidle\"<\/code> causes hangs on sites with continuous background requests (analytics, tracking pixels). 
The fix was simple but crucial:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Before (problematic - would timeout on modern sites)\nresponse = await page.goto(url, wait_until=\"networkidle\", timeout=180000)\n\n# After (reliable - works with background activity)\nresponse = await page.goto(url, wait_until=\"domcontentloaded\", timeout=60000)<\/code><\/pre>\n\n\n\n<p>This single change reduced crawl times from 180+ seconds (timeout) to 8-10 seconds for sites like factors.ai.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Intelligent Feed Detection<\/h2>\n\n\n\n<p>One of the most sophisticated features is the <strong>Feed Detection &amp; Optimization System<\/strong>. Feed pages (blogs, news, resources) contain hundreds of child URLs that follow similar patterns. Crawling them all is wasteful.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Problem<\/h3>\n\n\n\n<p>Without feed detection:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Crawl <code>\/blog<\/code> page<\/li>\n\n\n\n<li>Discover 200 blog post URLs<\/li>\n\n\n\n<li>Crawl all 200 posts (expensive!)<\/li>\n\n\n\n<li>Next crawl: same 200 posts + 1 new one<\/li>\n\n\n\n<li>Re-crawl everything again<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">The Solution: AI-Powered Feed Discovery<\/h3>\n\n\n\n<p>On first crawl, the system uses <strong>Firecrawl API<\/strong> to discover ALL URLs on a domain, then <strong>Gemini AI<\/strong> to identify feed patterns:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Firecrawl discovers comprehensive URL structure\nfirecrawl_result = await firecrawl_service.discover_urls(\n    domain='competitor.com',\n    include_subdomains=True\n)\n\n# AI analyzes URLs for feed patterns\nai_analysis = await llm_service.analyze_feed_patterns(\n    urls=firecrawl_result&#91;'urls'],\n    paths=firecrawl_result&#91;'paths'],\n    domain='competitor.com'\n)\n\n# Save domain-specific patterns to 
Supabase\ndomain_config_service.save_domain_config(\n    domain='competitor.com',\n    feed_paths=ai_analysis&#91;'feed_paths'],  # e.g., &#91;'\/blog', '\/insights', '\/news']\n    ai_analysis=ai_analysis\n)<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Multi-Session Feed Processing<\/h3>\n\n\n\n<p>The magic happens across crawl sessions:<\/p>\n\n\n\n<p><strong>First Crawl (Discovery)<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Detect <code>\/blog<\/code> as a feed page<\/li>\n\n\n\n<li>Discover 50 child URLs<\/li>\n\n\n\n<li><strong>Store URLs WITHOUT processing<\/strong> (no AI, no screenshots)<\/li>\n\n\n\n<li>Establish baseline for comparison<\/li>\n<\/ul>\n\n\n\n<p><strong>Second Crawl (Detection)<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Re-crawl <code>\/blog<\/code><\/li>\n\n\n\n<li>Find 52 URLs (2 new posts)<\/li>\n\n\n\n<li><strong>Only process the 2 new URLs<\/strong> with AI<\/li>\n\n\n\n<li>Include existing URLs in session data (prevents false &#8220;removed&#8221; alerts)<\/li>\n<\/ul>\n\n\n\n<p><strong>Result<\/strong>: 60-80% reduction in redundant processing.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Feed Pattern Detection<\/h3>\n\n\n\n<p>The system recognizes these URL patterns as feeds:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>FEED_PATTERNS = &#91;\n    r'\/news\/?$', r'\/blog\/?$', r'\/articles\/?$',\n    r'\/insights\/?$', r'\/resources\/?$', r'\/whitepapers\/?$',\n    r'\/awards\/?$', r'\/events\/?$', r'\/webinars\/?$',\n    r'\/docs\/?$', r'\/help\/?$', r'\/careers\/?$'\n]<\/code><\/pre>\n\n\n\n<p>Combined with AI-discovered domain-specific patterns stored in Supabase, this catches even unusual feed structures.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Dual Hashing: Catching Every Change<\/h2>\n\n\n\n<p>Change detection uses a <strong>two-tier hashing approach<\/strong>:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Tier 1: Raw HTML 
Hash<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>def calculate_html_hash(self, html: str) -&gt; str:\n    return hashlib.md5(html.encode('utf-8')).hexdigest()<\/code><\/pre>\n\n\n\n<p>Captures <strong>ALL changes<\/strong>: script updates, CSS modifications, A\/B tests, tracking pixels.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Tier 2: Text Content Hash<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>def calculate_content_hash(self, html: str) -&gt; str:\n    # Extract text, remove scripts\/styles\/metadata\n    soup = BeautifulSoup(html, 'lxml')\n    for script in soup(&#91;\"script\", \"style\", \"meta\", \"link\"]):\n        script.decompose()\n    text = soup.get_text()\n    return hashlib.md5(text.encode('utf-8')).hexdigest()<\/code><\/pre>\n\n\n\n<p>Focuses on <strong>meaningful content changes<\/strong> for AI analysis.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Why Both?<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>HTML hash changes but text hash doesn&#8217;t<\/strong> \u2192 Technical change (A\/B test, script update)<\/li>\n\n\n\n<li><strong>Both hashes change<\/strong> \u2192 Content change worth analyzing with AI<\/li>\n\n\n\n<li><strong>Backward compatibility<\/strong> \u2192 Old data without HTML hash still works via text hash fallback<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Precise Word-Level Change Analysis<\/h2>\n\n\n\n<p>When content changes, we need to know exactly <em>what<\/em> changed. 
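<\/p>\n\n\n\n<p>Before a diff ever runs, though, a page has to get past the two hashes from the previous section. Here is a minimal sketch of that classification gate; the regex-based text extraction is a stdlib stand-in for the BeautifulSoup extraction above, and the function names are illustrative, not the project&#8217;s actual API:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import hashlib\nimport re\n\ndef _strip_to_text(html: str) -&gt; str:\n    # Stand-in for the BeautifulSoup extraction shown earlier:\n    # drop script\/style blocks, then all remaining tags\n    html = re.sub(r'&lt;script&#91;^&gt;]*&gt;.*?&lt;\/script&gt;', '', html, flags=re.S)\n    html = re.sub(r'&lt;style&#91;^&gt;]*&gt;.*?&lt;\/style&gt;', '', html, flags=re.S)\n    return re.sub(r'&lt;&#91;^&gt;]+&gt;', ' ', html)\n\ndef _md5(s: str) -&gt; str:\n    return hashlib.md5(s.encode('utf-8')).hexdigest()\n\ndef classify_change(old_html: str, new_html: str) -&gt; str:\n    if _md5(old_html) == _md5(new_html):\n        return 'unchanged'   # neither hash moved\n    if _md5(_strip_to_text(old_html)) == _md5(_strip_to_text(new_html)):\n        return 'technical'   # HTML hash moved, text hash did not\n    return 'content'         # both moved: worth AI analysis<\/code><\/pre>\n\n\n\n<p>Only pages classified as <code>'content'<\/code> proceed to the word-level diff. <\/p>\n\n\n\n<p>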
The naive approach (line-by-line diff) produces terrible results:<\/p>\n\n\n\n<p><strong>Bad Output<\/strong> (line-based):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>REMOVED: \"Home Products Pricing About Blog Contact\"\nADDED: \"Home Products Pricing About Blog Contact Login\"<\/code><\/pre>\n\n\n\n<p><strong>Good Output<\/strong> (word-level):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>ADDED: \"Login\"<\/code><\/pre>\n\n\n\n<p>My implementation uses word tokenization with sequence matching:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from difflib import SequenceMatcher\n\ndef _analyze_text_changes(self, old_text: str, new_text: str) -&gt; Dict:\n    # Tokenize to words\n    old_words = self._tokenize_text(old_text)\n    new_words = self._tokenize_text(new_text)\n\n    # Find precise word-level differences\n    matcher = SequenceMatcher(None, old_words, new_words)\n    opcodes = list(matcher.get_opcodes())\n\n    added_segments = &#91;]\n    removed_segments = &#91;]\n\n    for tag, i1, i2, j1, j2 in opcodes:\n        if tag == 'insert':\n            added_segments.append(' '.join(new_words&#91;j1:j2]))\n        elif tag == 'delete':\n            removed_segments.append(' '.join(old_words&#91;i1:i2]))\n        elif tag == 'replace':\n            # Substitutions (e.g. a price edit) surface as 'replace',\n            # not as separate insert\/delete opcodes\n            removed_segments.append(' '.join(old_words&#91;i1:i2]))\n            added_segments.append(' '.join(new_words&#91;j1:j2]))\n\n    return {\n        'added': added_segments,\n        'removed': removed_segments,\n        'change_ratio': 1 - matcher.ratio()\n    }<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Change Validation: Eliminating False Positives<\/h2>\n\n\n\n<p>Not all detected changes are real. Dynamic content, A\/B tests, and JavaScript timing issues create <strong>false positives<\/strong>. 
The <code>ChangeValidator<\/code> service handles this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>class ChangeValidator:\n    async def validate_changes(self, changes: List&#91;Change]) -&gt; List&#91;Change]:\n        validated = &#91;]\n\n        for change in changes:\n            # Re-scrape twice with delay\n            scrape1 = await self._scrape_url_once(change.url, 1)\n            await asyncio.sleep(2)\n            scrape2 = await self._scrape_url_once(change.url, 2)\n\n            # Compare consistency\n            if scrape1&#91;'content_hash'] == scrape2&#91;'content_hash']:\n                validated.append(change)  # Consistent = real change\n            else:\n                logger.info(f\"Invalidated {change.url}: inconsistent content\")\n\n        return validated<\/code><\/pre>\n\n\n\n<p>Only changes that <strong>consistently reproduce<\/strong> proceed to AI analysis.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">AI-Powered Business Context<\/h2>\n\n\n\n<p>Detecting changes is only half the battle. The real value is <strong>understanding what they mean<\/strong>. The <code>ChangeAnalyzer<\/code> service uses Google&#8217;s Gemini API to provide business context.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Page Summarization During Crawling<\/h3>\n\n\n\n<p>Every page is summarized in real-time:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>async def generate_page_summary(self, text_content: str, url: str) -&gt; PageSummary:\n    prompt = f\"\"\"Analyze the following webpage content from URL: {url}\n\n    Provide:\n    1. A concise 2-3 sentence summary\n    2. The page type (pricing, product, blog, etc.)\n    3. 5-10 relevant keywords\n    4. 
Key entities mentioned (products, people, companies)\n    \"\"\"\n\n    response = await self._call_with_backoff(prompt)\n    return self._parse_summary_response(response)<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Change Analysis with Severity Scoring<\/h3>\n\n\n\n<p>When changes are detected, AI evaluates their business significance:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>@dataclass\nclass ChangeAnalysis:\n    url: str\n    change_type: str        # pricing_update, feature_addition, etc.\n    severity: int           # 1-10 scale\n    change_definition: str  # 3-line business analysis\n    recommended_pages: List&#91;str]  # Related pages for context<\/code><\/pre>\n\n\n\n<p><strong>Severity Guidelines<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>1-3<\/strong>: Minor (typo fixes, date updates)<\/li>\n\n\n\n<li><strong>4-6<\/strong>: Moderate (new blog post, team change)<\/li>\n\n\n\n<li><strong>7-9<\/strong>: Significant (pricing change, new feature)<\/li>\n\n\n\n<li><strong>10<\/strong>: Critical (acquisition, major pivot)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Exponential Backoff for API Reliability<\/h3>\n\n\n\n<p>Gemini API has rate limits. 
The service implements robust retry logic:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>async def _call_with_backoff(self, prompt: str) -&gt; str:\n    for attempt in range(self.max_retries):  # 10 retries\n        try:\n            response = await asyncio.to_thread(\n                self.model.generate_content, prompt\n            )\n            return response.text\n        except Exception as e:\n            if 'rate' in str(e).lower() or '429' in str(e):\n                delay = min(self.base_delay * (2 ** attempt), 60)\n                jitter = random.uniform(0, delay * 0.1)\n                await asyncio.sleep(delay + jitter)\n            else:\n                raise\n    # Don't fall through silently once retries are exhausted\n    raise RuntimeError(f\"Still rate-limited after {self.max_retries} retries\")<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Data Architecture: Hybrid Storage Strategy<\/h2>\n\n\n\n<p>The system uses a <strong>hybrid storage approach<\/strong> optimized for different access patterns:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Supabase (PostgreSQL) &#8211; Fast Queries<\/h3>\n\n\n\n<p>Normalized tables for dashboard queries:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>-- Sessions table\nCREATE TABLE competition_analysis_sessions (\n    id UUID PRIMARY KEY,\n    domain TEXT NOT NULL,\n    session_id TEXT NOT NULL,\n    comparison_type TEXT,\n    total_changes INTEGER,\n    created_at TIMESTAMPTZ\n);\n\n-- Changes table with indexes\nCREATE TABLE competition_analysis_changes (\n    id UUID PRIMARY KEY,\n    session_analysis_id UUID REFERENCES competition_analysis_sessions(id),\n    url TEXT NOT NULL,\n    change_type TEXT NOT NULL,\n    severity INTEGER CHECK (severity &gt;= 1 AND severity &lt;= 10),\n    page_type TEXT,\n    change_definition TEXT,\n    recommended_pages JSONB,\n    created_at TIMESTAMPTZ\n);\n\n-- Indexes for fast filtering\nCREATE INDEX idx_severity ON competition_analysis_changes(severity);\nCREATE INDEX idx_change_type ON 
competition_analysis_changes(change_type);<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">S3 (iDrive e2) &#8211; Complete Backups<\/h3>\n\n\n\n<p>Object storage for full content:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>competitor.com\/\n\u251c\u2500\u2500 sessions\/\n\u2502   \u251c\u2500\u2500 20250115_103000.json.gz   # Complete session data\n\u2502   \u2514\u2500\u2500 20250114_103000.json.gz\n\u251c\u2500\u2500 master\/\n\u2502   \u2514\u2500\u2500 state.json.gz             # Master state tracking\n\u2514\u2500\u2500 screenshots\/\n    \u2514\u2500\u2500 20250115_103000\/\n        \u251c\u2500\u2500 homepage.png\n        \u2514\u2500\u2500 pricing.png<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Master State: The Source of Truth<\/h3>\n\n\n\n<p>The master state tracks all pages across sessions:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{\n  \"domain\": \"competitor.com\",\n  \"pages\": {\n    \"https:\/\/competitor.com\/pricing\": {\n      \"html_hash\": \"x9y8z7...\",\n      \"content_hash\": \"a1b2c3...\",\n      \"title\": \"Pricing Plans\",\n      \"ai_summary\": \"...\",\n      \"last_session\": \"20250115_103000\"\n    }\n  },\n  \"feed_state\": {\n    \"feeds\": {\n      \"https:\/\/competitor.com\/blog\": {\n        \"feed_type\": \"blog\",\n        \"discovered_urls\": &#91;\"...\", \"...\"],\n        \"last_url_count\": 50\n      }\n    }\n  },\n  \"sessions\": &#91;\n    {\"id\": \"20250115_103000\", \"pages_count\": 45},\n    {\"id\": \"20250114_103000\", \"pages_count\": 44}\n  ]\n}<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">URL Normalization: Preventing Duplicate Crawls<\/h2>\n\n\n\n<p>A subtle but critical feature: URLs with minor variations must be treated as identical:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>https:&#47;&#47;factors.ai\/pricing\nhttps:\/\/www.factors.ai\/pricing      # www subdomain\nhttps:\/\/factors.ai\/pricing\/     
    # trailing slash\nhttps:\/\/FACTORS.AI\/pricing          # case difference\nhttps:\/\/factors.ai\/pricing?ref=nav  # query params<\/code><\/pre>\n\n\n\n<p>The normalization function handles all these cases:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def normalize_url(self, url: str) -&gt; str:\n    parsed = urlparse(url)\n\n    # 1. Remove www subdomain\n    netloc = parsed.netloc\n    if netloc.startswith('www.'):\n        netloc = netloc&#91;4:]\n\n    # 2. Lowercase hostname\n    netloc = netloc.lower()\n\n    # 3. Remove trailing slash\n    path = parsed.path.rstrip('\/') or '\/'\n\n    # 4. Sort query parameters (default to empty when there are none)\n    query = ''\n    if parsed.query:\n        params = sorted(parse_qsl(parsed.query))\n        query = urlencode(params)\n\n    # 5. Remove fragments (the trailing '' below)\n    # 6. Normalize index files\n\n    return urlunparse((parsed.scheme, netloc, path, '', query, ''))<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Pagination URL Filtering<\/h2>\n\n\n\n<p>An interesting edge case: pagination URLs were causing exponential crawling:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/blog \u2192 discovers \/blog?page=1, \/blog?page=2, ...\n\/blog?page=1 \u2192 treated as new feed \u2192 discovers more pagination URLs\n\/blog?page=2 \u2192 same problem\n... 
(exponential explosion)<\/code><\/pre>\n\n\n\n<p><strong>Solution<\/strong>: Filter pagination URLs before they enter the queue:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pagination_patterns = &#91;\n    'page=', '_page=', 'p=', 'offset=', 'start=',\n    'pagenum=', 'pagenumber=', 'pageindex=', 'paged='\n]\n\ndef should_crawl(self, url: str) -&gt; bool:\n    parsed_url = urlparse(url)\n    if parsed_url.query:\n        query_lower = parsed_url.query.lower()\n        if any(pattern in query_lower for pattern in pagination_patterns):\n            return False  # Skip pagination URLs\n    return True<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Key Technical Decisions &amp; Trade-offs<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">1. Playwright vs. Puppeteer vs. Selenium<\/h3>\n\n\n\n<p><strong>Choice<\/strong>: Playwright<\/p>\n\n\n\n<p><strong>Why<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Best JavaScript rendering support<\/li>\n\n\n\n<li>Async API for concurrent crawling<\/li>\n\n\n\n<li>Cross-browser testing if needed<\/li>\n\n\n\n<li>Active development and great docs<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2. Gemini vs. GPT-4 vs. Claude<\/h3>\n\n\n\n<p><strong>Choice<\/strong>: Google Gemini<\/p>\n\n\n\n<p><strong>Why<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cost-effective for high-volume summarization<\/li>\n\n\n\n<li>Fast response times<\/li>\n\n\n\n<li>Sufficient quality for page analysis<\/li>\n\n\n\n<li>Easy to switch via <code>LLMFactory<\/code> abstraction<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3. Supabase vs. 
Self-hosted PostgreSQL<\/h3>\n\n\n\n<p><strong>Choice<\/strong>: Supabase<\/p>\n\n\n\n<p><strong>Why<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Managed PostgreSQL with row-level security<\/li>\n\n\n\n<li>Real-time subscriptions for dashboard<\/li>\n\n\n\n<li>Built-in auth for multi-tenant future<\/li>\n\n\n\n<li>Edge functions for serverless compute<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4. S3 vs. Database for HTML Storage<\/h3>\n\n\n\n<p><strong>Choice<\/strong>: Hybrid (both)<\/p>\n\n\n\n<p><strong>Why<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>S3: Cheap storage for large HTML\/screenshots<\/li>\n\n\n\n<li>Supabase: Fast queries for dashboard data<\/li>\n\n\n\n<li>Master state in S3: Single source of truth for crawl history<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Performance Characteristics<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Metric<\/th><th>Value<\/th><\/tr><\/thead><tbody><tr><td><strong>Concurrent Workers<\/strong><\/td><td>5 (configurable to 10)<\/td><\/tr><tr><td><strong>URLs per Minute<\/strong><\/td><td>~50-100 (depending on site speed)<\/td><\/tr><tr><td><strong>JavaScript Wait Time<\/strong><\/td><td>0.5s (fast) to 5s (SPAs)<\/td><\/tr><tr><td><strong>Success Rate<\/strong><\/td><td>96.7%<\/td><\/tr><tr><td><strong>AI Summaries per Minute<\/strong><\/td><td>~20 (with rate limiting)<\/td><\/tr><tr><td><strong>Storage per Crawl<\/strong><\/td><td>~2-10 MB compressed<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Lessons Learned<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">1. Browser Context Isolation is Critical<\/h3>\n\n\n\n<p>Sharing browser contexts between concurrent workers causes race conditions and cryptic errors. 
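<\/p>\n\n\n\n<p>The shape of the fix is simple: a worker acquires its browser once, uses only that instance, and tears it down itself. Here is a stdlib sketch of that ownership model; <code>launch_browser<\/code> is a placeholder standing in for Playwright&#8217;s actual browser launch, and the names are illustrative rather than the project&#8217;s real code:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import asyncio\n\nasync def launch_browser(worker_id: int) -&gt; dict:\n    # Placeholder: the real system launches a Playwright browser here.\n    # The point is that each worker calls it exactly once, for itself.\n    return {'id': worker_id, 'closed': False}\n\nasync def worker(worker_id: int, queue: asyncio.Queue, results: list):\n    browser = await launch_browser(worker_id)  # private instance\n    try:\n        while True:\n            url = await queue.get()\n            if url is None:        # poison pill: shut down cleanly\n                break\n            # crawl with *this* worker's browser only\n            results.append((worker_id, url))\n    finally:\n        browser&#91;'closed'] = True  # each worker tears down what it owns\n\nasync def crawl(urls, concurrent_limit: int = 5):\n    queue: asyncio.Queue = asyncio.Queue()\n    for u in urls:\n        queue.put_nowait(u)\n    for _ in range(concurrent_limit):\n        queue.put_nowait(None)     # one poison pill per worker\n    results: list = &#91;]\n    await asyncio.gather(*(worker(i, queue, results)\n                           for i in range(concurrent_limit)))\n    return results<\/code><\/pre>\n\n\n\n<p>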
Each worker needs its own isolated browser instance.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">2. NetworkIdle is a Lie<\/h3>\n\n\n\n<p>Modern websites never reach &#8220;network idle&#8221; due to analytics, tracking, and WebSocket connections. Use <code>domcontentloaded<\/code> instead.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3. Change Detection Needs Two Hashes<\/h3>\n\n\n\n<p>Raw HTML hash catches technical changes. Text content hash catches meaningful changes. You need both for complete coverage.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4. Feed Detection is Essential for Scale<\/h3>\n\n\n\n<p>Without intelligent feed handling, you&#8217;ll waste 80% of your crawl budget on blog posts and news articles that haven&#8217;t changed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5. Validation Eliminates False Positives<\/h3>\n\n\n\n<p>A\/B tests, personalization, and dynamic content create fake &#8220;changes&#8221;. Re-scraping twice with a delay catches these.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">What I Would Do Differently<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Start with headless Chrome service<\/strong>: Instead of managing Playwright browsers in-process, use a dedicated browser pool service (Browserless, etc.)<\/li>\n\n\n\n<li><strong>Event-driven architecture from day one<\/strong>: Use message queues (Redis, SQS) between crawl and analysis phases for better scaling<\/li>\n\n\n\n<li><strong>More aggressive caching<\/strong>: Cache AI summaries longer\u2014they rarely need regeneration if content hasn&#8217;t changed<\/li>\n\n\n\n<li><strong>Visual diffing earlier<\/strong>: Screenshot comparison could catch changes that text analysis misses<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">The Market Reality<\/h2>\n\n\n\n<p>I built this system over several months, iterating through dozens of 
technical challenges. By the time it was production-ready, several well-funded competitors had entered the space:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Klue<\/li>\n\n\n\n<li>Crayon<\/li>\n\n\n\n<li>Kompyte<\/li>\n\n\n\n<li>Similarweb<\/li>\n<\/ul>\n\n\n\n<p>The market went from &#8220;interesting opportunity&#8221; to &#8220;saturated&#8221; faster than expected. But the technical work remains valuable\u2014both as a portfolio piece and as a foundation for future projects.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Building a competitive intelligence platform touches nearly every aspect of modern software engineering:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Distributed systems<\/strong>: Concurrent crawling, queue management<\/li>\n\n\n\n<li><strong>Web scraping<\/strong>: JavaScript rendering, anti-bot evasion<\/li>\n\n\n\n<li><strong>AI\/ML<\/strong>: LLM integration, prompt engineering<\/li>\n\n\n\n<li><strong>Data engineering<\/strong>: Hybrid storage, change detection<\/li>\n\n\n\n<li><strong>Database design<\/strong>: Normalized schemas, efficient indexing<\/li>\n\n\n\n<li><strong>API design<\/strong>: RESTful endpoints, real-time updates<\/li>\n<\/ul>\n\n\n\n<p>While the market timing didn&#8217;t work out for commercial launch, this project demonstrates that complex, production-grade systems can be built by a small team (or solo developer) using modern tools and cloud services.<\/p>\n\n\n\n<p>The code is functional, the architecture is sound, and the technical challenges were genuinely interesting to solve. Sometimes that&#8217;s the real value of a project\u2014not the business outcome, but what you learn along the way.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><em>If you&#8217;re interested in the technical details or want to discuss competitive intelligence systems, feel free to reach out. 
The lessons learned here apply to many other domains: price monitoring, content aggregation, research automation, and more.<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Technical Stack Summary<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Component<\/th><th>Technology<\/th><\/tr><\/thead><tbody><tr><td>Language<\/td><td>Python 3.12<\/td><\/tr><tr><td>Web Crawling<\/td><td>Playwright, BeautifulSoup, lxml<\/td><\/tr><tr><td>AI<\/td><td>Google Gemini API, LangChain<\/td><\/tr><tr><td>URL Discovery<\/td><td>Firecrawl API<\/td><\/tr><tr><td>Database<\/td><td>Supabase (PostgreSQL)<\/td><\/tr><tr><td>Object Storage<\/td><td>S3-compatible (iDrive e2)<\/td><\/tr><tr><td>API Framework<\/td><td>FastAPI, Uvicorn<\/td><\/tr><tr><td>Job Scheduling<\/td><td>Celery, Redis, APScheduler<\/td><\/tr><tr><td>Testing<\/td><td>pytest, pytest-asyncio<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><em>This project represents roughly 15,000 lines of production-quality Python code, comprehensive documentation, and battle-tested solutions to real-world web scraping challenges.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How I built a fully functional competitor monitoring system that automatically crawls websites, detects changes, and provides AI-powered business insights Introduction In today&#8217;s fast-moving business landscape, keeping track of 
what&#8230;<\/p>\n","protected":false},"author":1,"featured_media":154,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[8,4],"tags":[],"class_list":["post-62","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-learn-with-me","category-tech-posts"],"jetpack_featured_media_url":"https:\/\/balamurali.in\/blog\/wp-content\/uploads\/2026\/02\/competitive_intelligence.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts\/62","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/comments?post=62"}],"version-history":[{"count":1,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts\/62\/revisions"}],"predecessor-version":[{"id":64,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts\/62\/revisions\/64"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/media\/154"}],"wp:attachment":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/media?parent=62"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/categories?post=62"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/tags?post=62"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}