{"id":77,"date":"2023-12-14T17:53:00","date_gmt":"2023-12-14T17:53:00","guid":{"rendered":"https:\/\/balamurali.in\/blog\/?p=77"},"modified":"2026-02-23T14:27:38","modified_gmt":"2026-02-23T14:27:38","slug":"building-a-rule-based-engine-to-extract-founding-years-from-html","status":"publish","type":"post","link":"https:\/\/balamurali.in\/blog\/tech-posts\/building-a-rule-based-engine-to-extract-founding-years-from-html\/","title":{"rendered":"Building a Rule-Based Engine to Extract Founding Years from HTML"},"content":{"rendered":"\n<p>In the world of Private Equity, <strong>Vintage<\/strong> is everything.<\/p>\n\n\n\n<p>An investor looks at a company with $1M in revenue and asks: <em>&#8220;Did they reach this in 12 months, or 12 years?&#8221;<\/em> The answer determines if the startup is a rocket ship or a zombie.<\/p>\n\n\n\n<p>At Tracxn, we needed the <strong>Founding Year<\/strong> for millions of companies to calculate this vintage.<\/p>\n\n\n\n<p>The obvious solution was to scrape LinkedIn. But LinkedIn is a fortress\u2014expensive to proxy, guarded by anti-bot walls, and full of user-generated noise (employees guessing start dates).<\/p>\n\n\n\n<p>We realized the best source of truth wasn&#8217;t a third-party aggregator; it was the company&#8217;s own diary. Every startup loves to tell its story on its website. We just needed a way to read it.<\/p>\n\n\n\n<p>Instead of using expensive NLP models to &#8220;read&#8221; history, I built a script using <strong>Pure Python Heuristics<\/strong>. Here is the detailed breakdown of how we architected this system without a single GPU.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Core Philosophy: &#8220;The Source of Truth is the Site&#8221;<\/h2>\n\n\n\n<p>We operated on a simple assumption: <strong>If a company exists, they have an &#8220;About&#8221; page.<\/strong> And on that page, they will inevitably mention when they started.<\/p>\n\n\n\n<p>Our goal wasn&#8217;t to parse the whole internet. 
It was to find that one specific sentence\u2014<em>&#8220;Established in 2015&#8221;<\/em>\u2014and extract the integer.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 1: The Navigator (Smart Page Discovery)<\/h2>\n\n\n\n<p>Finding the founding year is harder than finding a legal name because it rarely lives on the homepage. You have to dig.<\/p>\n\n\n\n<p>We couldn&#8217;t crawl every page (too slow), so we built a <strong>Priority Queue Navigator<\/strong>.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Extract:<\/strong> Grab all hyperlinks (<code>&lt;a href><\/code>) from the homepage.<\/li>\n\n\n\n<li><strong>Filter &amp; Rank:<\/strong> We scored links based on keywords in the URL or anchor text.\n<ul class=\"wp-block-list\">\n<li><strong>Tier 1 (High Probability):<\/strong> <code>history<\/code>, <code>journey<\/code>, <code>timeline<\/code>, <code>milestones<\/code>, <code>legacy<\/code>.<\/li>\n\n\n\n<li><strong>Tier 2 (Medium Probability):<\/strong> <code>about<\/code>, <code>story<\/code>, <code>who-we-are<\/code>.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Visit:<\/strong> The script would visit the highest-scoring page first. If it found &#8220;Our Journey,&#8221; it ignored &#8220;Contact Us.&#8221;<\/li>\n<\/ol>\n\n\n\n<p>This targeted approach meant we usually found the data in just <strong>one HTTP request<\/strong> after the homepage.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 2: The Scanner (The Year Regex)<\/h2>\n\n\n\n<p>Once we landed on the target page (e.g., <code>swiggy.com\/about-us<\/code>), we had to find the year. But searching for 4-digit numbers is dangerous. 
You\u2019ll catch zip codes, prices ($2000), and support hotlines.<\/p>\n\n\n\n<p>We used a strict <strong>Year Regex<\/strong> pattern:<br><code>r'\\b(19|20)\\d{2}\\b'<\/code><\/p>\n\n\n\n<p>This acted as our first filter, keeping only standalone four-digit numbers from 1900 to 2099:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Accept:<\/strong> 1998, 2015, 2023.<\/li>\n\n\n\n<li><strong>Reject:<\/strong> 5000 (too high), 1000 (too low).<\/li>\n<\/ul>\n\n\n\n<p>A price like $2000 still passes this pattern, so the regex was only a first pass. We also applied a <strong>Noise Filter<\/strong> to remove common corporate red herrings:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Current Year:<\/strong> If the extracted year equals <code>datetime.now().year<\/code>, it\u2019s likely a copyright date. <strong>Discard.<\/strong><\/li>\n\n\n\n<li><strong>ISO Standards:<\/strong> &#8220;ISO 9001&#8221; or &#8220;ISO 27001&#8221; are certification numbers, not dates, and the 2015 in &#8220;ISO 9001:2015&#8221; is a standard revision year, not a founding year. <strong>Discard.<\/strong><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Step 3: The Context Engine (The &#8220;Sliding Window&#8221;)<\/h2>\n\n\n\n<p>This was the &#8220;No AI&#8221; magic. Finding &#8220;2014&#8221; isn&#8217;t enough. 
The text could say <em>&#8220;We won an award in 2014&#8221;<\/em> or <em>&#8220;Our founder graduated in 2014.&#8221;<\/em><\/p>\n\n\n\n<p>We needed to link the <strong>Year<\/strong> to the concept of <strong>Creation<\/strong>.<\/p>\n\n\n\n<p>Instead of training a neural network to understand sentence structure, we used a <strong>Sliding Window Algorithm<\/strong>.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Identify Keywords:<\/strong> We compiled a dictionary of &#8220;Birth Words&#8221;: <code>['founded', 'established', 'estd', 'incorporated', 'started', 'launched', 'inception', 'born']<\/code>.<\/li>\n\n\n\n<li><strong>Measure Distance:<\/strong> For every candidate year found, we looked at the 10 words before and after it.<\/li>\n\n\n\n<li><strong>Score:<\/strong>\n<ul class=\"wp-block-list\">\n<li><em>&#8220;<strong>Founded<\/strong> in <strong>2015<\/strong>&#8221;<\/em> -&gt; Distance: 1 word. <strong>Strong Match.<\/strong><\/li>\n\n\n\n<li><em>&#8220;<strong>Established<\/strong> leaders\u2026 won an award in <strong>2015<\/strong>&#8221;<\/em> -&gt; Distance: 6 words + punctuation barrier. <strong>Weak Match.<\/strong><\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<p>This simple proximity check filtered out 90% of false positives without parsing a single grammar tree.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 4: The &#8220;Min-Year&#8221; Heuristic<\/h2>\n\n\n\n<p>Startup &#8220;About&#8221; pages are often confusing.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><em>&#8220;Founded in 2010, we launched our app in 2012 and went global in 2015.&#8221;<\/em><\/p>\n<\/blockquote>\n\n\n\n<p>Our script would extract three valid years: 2010, 2012, 2015. 
Which one is the &#8220;Founding Year&#8221;?<\/p>\n\n\n\n<p>We applied the <strong>Min-Year Heuristic<\/strong>: In the context of a company history page, the <em>earliest<\/em> valid date associated with a &#8220;Birth Word&#8221; is almost always the founding date.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 5: Verification (The &#8220;Why&#8221;)<\/h2>\n\n\n\n<p>Data without proof is useless. If our database said &#8220;Founded in 2011,&#8221; an analyst might doubt it.<\/p>\n\n\n\n<p>Unlike a black-box AI model, our script returned a structured evidence object:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{\n  \"founding_year\": 2011,\n  \"confidence\": \"High\",\n  \"source_url\": \"https:\/\/company.com\/our-story\",\n  \"context_snippet\": \"...our journey began in 2011 when...\"\n}<\/code><\/pre>\n\n\n\n<p>This allowed human verifiers to click a link and confirm the data in seconds, rather than searching the site from scratch.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n\n\n\n<p>The goal of this project was to move from &#8220;Identity&#8221; (Who are they?) to &#8220;History&#8221; (When did they start?). By prioritizing raw algorithmic logic over expensive external databases like LinkedIn, we built a scalable solution that cost nearly zero to run.<\/p>\n\n\n\n<p>It reinforced a valuable lesson: sometimes you don&#8217;t need a Neural Network. You just need a dictionary, a regex pattern, and a well-tuned <code>while<\/code> loop.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the world of Private Equity, Vintage is everything. 
An investor looks at a company with $1M in revenue and asks: &#8220;Did they reach this in 12 months, or 12&#8230;<\/p>\n","protected":false},"author":1,"featured_media":157,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[8,4],"tags":[],"class_list":["post-77","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-learn-with-me","category-tech-posts"],"jetpack_featured_media_url":"https:\/\/balamurali.in\/blog\/wp-content\/uploads\/2026\/02\/founding_year_featured.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts\/77","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/comments?post=77"}],"version-history":[{"count":1,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts\/77\/revisions"}],"predecessor-version":[{"id":79,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts\/77\/revisions\/79"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/media\/157"}],"wp:attachment":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/media?parent=77"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/categories?post=77"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/tags?post=77"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}