{"id":74,"date":"2024-11-04T17:41:00","date_gmt":"2024-11-04T17:41:00","guid":{"rendered":"https:\/\/balamurali.in\/blog\/?p=74"},"modified":"2026-02-23T14:27:29","modified_gmt":"2026-02-23T14:27:29","slug":"the-unreasonable-effectiveness-of-regex-how-we-mapped-the-indian-startup-ecosystem-without-a-single-gpu","status":"publish","type":"post","link":"https:\/\/balamurali.in\/blog\/tech-posts\/the-unreasonable-effectiveness-of-regex-how-we-mapped-the-indian-startup-ecosystem-without-a-single-gpu\/","title":{"rendered":"The Unreasonable Effectiveness of Regex: How We Mapped the Indian Startup Ecosystem Without a Single GPU"},"content":{"rendered":"\n<p>If you work in Venture Capital or Private Equity, you know <strong>Tracxn<\/strong>. It is often described as the &#8220;Bloomberg for Private Markets.&#8221; The platform tracks millions of startups and private companies globally, providing investors with deep data on funding rounds, valuations, and cap tables.<\/p>\n\n\n\n<p>But there is a fundamental problem in private market data: <strong>The Identity Gap.<\/strong><\/p>\n\n\n\n<p>A startup\u2019s <em>brand<\/em> is almost never its <em>legal identity<\/em>.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You know the app as <strong>&#8220;Swiggy.&#8221;<\/strong><\/li>\n\n\n\n<li>The government knows it as <strong>&#8220;Bundl Technologies Private Limited.&#8221;<\/strong><\/li>\n\n\n\n<li>You know the grocery app as <strong>&#8220;Zepto.&#8221;<\/strong><\/li>\n\n\n\n<li>The government knows it as <strong>&#8220;KiranaKart Technologies Private Limited.&#8221;<\/strong><\/li>\n<\/ul>\n\n\n\n<p>This gap is critical. The &#8220;Brand&#8221; lives on the website, but the &#8220;Gold&#8221; (revenue, shareholding, board members) lives in government registries (like the MCA in India). If you cannot bridge the marketing website to the legal entity, you cannot access the financial data.<\/p>\n\n\n\n<p>At Tracxn, we had to bridge this gap for millions of companies. 
The easy answer would have been to spin up a cluster of GPUs, fine-tune a Named Entity Recognition (NER) model, and burn through thousands of dollars in compute costs.<\/p>\n\n\n\n<p>But we were frugal, and we needed speed. So, instead of Artificial Intelligence, I used <strong>Pure Python Heuristics<\/strong>. I didn&#8217;t build a brain; I built a sniper rifle.<\/p>\n\n\n\n<p>Here is how we architected a massive legal entity extraction engine using nothing but Regex, smart crawling, and &#8220;Backtracking Heuristics.&#8221;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Constraint: Speed and Cost<\/h2>\n\n\n\n<p>We had to process millions of websites. Running a heavy NLP model (like spaCy or BERT) on full HTML content is computationally expensive.<\/p>\n\n\n\n<p>My philosophy was simple: <strong>Don&#8217;t read the text. Read the structure.<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 1: The Scout (Heuristic Navigation)<\/h2>\n\n\n\n<p>Many modern websites are minimalist. The homepage often contains nothing but a logo and a &#8220;Download App&#8221; button. If you only scrape the homepage, you fail.<\/p>\n\n\n\n<p>However, crawling the <em>entire<\/em> website (sitemap, blog, products) is too expensive. We needed a middle ground.<\/p>\n\n\n\n<p>We built a <strong>Keyword-Driven Navigator<\/strong>. 
Upon landing on the homepage, the script didn&#8217;t just look for text; it looked for <em>paths<\/em>.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Extract:<\/strong> Pull all <code>&lt;a><\/code> tags (hyperlinks) from the homepage.<\/li>\n\n\n\n<li><strong>Filter:<\/strong> Check the anchor text against a priority list: <code>['About Us', 'Contact', 'Legal', 'Terms', 'Privacy', 'Imprint']<\/code>.<\/li>\n\n\n\n<li><strong>Navigate:<\/strong> If a match was found, the crawler visited <em>only<\/em> those specific pages.<\/li>\n<\/ol>\n\n\n\n<p>We didn&#8217;t spider the whole site; we simulated a human looking for the &#8220;Fine Print.&#8221; This kept our request count low (usually 2-3 requests per domain) while maximizing the probability of finding the legal name.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 2: The Locator (Targeted DOM Traversal)<\/h2>\n\n\n\n<p>Once &#8220;The Scout&#8221; landed on the right page (e.g., <code>swiggy.com\/terms-and-conditions<\/code>), we didn&#8217;t ingest the whole page.<\/p>\n\n\n\n<p>We used <code>BeautifulSoup<\/code> and <code>lxml<\/code> to perform surgical strikes on the DOM. We specifically targeted areas where lawyers force developers to put company names:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>&lt;footer><\/code><\/li>\n\n\n\n<li><code>&lt;div class=\"copyright\"><\/code><\/li>\n\n\n\n<li><code>&lt;small><\/code><\/li>\n\n\n\n<li>Text nodes appearing after the last <code>&lt;hr><\/code><\/li>\n<\/ul>\n\n\n\n<p>This reduced our processing volume by roughly <strong>99%<\/strong>. We weren&#8217;t searching a haystack; we were searching a handful of straw.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 3: The Anchor (Suffix Detection)<\/h2>\n\n\n\n<p>How do you find a name if you don&#8217;t understand English? 
You look for the fingerprint of a corporation.<\/p>\n\n\n\n<p>I compiled a massive dictionary of legal suffixes, for example: <code>['Private Limited', 'Pvt Ltd', 'LLP', 'Inc', 'Limited']<\/code>.<\/p>\n\n\n\n<p>The engine scanned the targeted text for these exact strings. This gave us our <strong>Anchor Index<\/strong>. If we found &#8220;Pvt Ltd&#8221; starting at character index 150, we knew the company name ended with that suffix. The question was: <em>where did it start?<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 4: The &#8220;Backtracking Heuristic&#8221;<\/h2>\n\n\n\n<p>This was the core of the &#8220;No AI&#8221; architecture. Since I couldn&#8217;t ask a model to predict the start of the entity, I had to calculate it using a &#8220;Walk Backwards&#8221; algorithm.<\/p>\n\n\n\n<p>The logic was akin to a car backing up until it hits a wall.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Start<\/strong> at the Anchor Index.<\/li>\n\n\n\n<li><strong>Walk left<\/strong>, character by character (or word by word).<\/li>\n\n\n\n<li><strong>Stop<\/strong> when you hit a &#8220;Delimiter Wall.&#8221;<\/li>\n<\/ol>\n\n\n\n<p><strong>The &#8220;Walls&#8221; were strict rules:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Punctuation:<\/strong> A pipe <code>|<\/code>, a comma <code>,<\/code>, or a full stop <code>.<\/code>.<\/li>\n\n\n\n<li><strong>Keywords:<\/strong> &#8220;Copyright&#8221;, &#8220;\u00a9&#8221;, &#8220;All rights reserved&#8221;, or &#8220;Powered by&#8221;.<\/li>\n\n\n\n<li><strong>Dates:<\/strong> A Regex pattern for years (e.g., <code>20\\d{2}<\/code>).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example in Action:<\/h3>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>Input:<\/strong> <code>\u00a9 2023 Swiggy Private Limited. 
All rights reserved.<\/code><\/p>\n<\/blockquote>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Locate:<\/strong> &#8220;Private Limited&#8221; (Anchor found).<\/li>\n\n\n\n<li><strong>Backtrack:<\/strong> &#8220;Swiggy&#8221; (Keep going).<\/li>\n\n\n\n<li><strong>Backtrack:<\/strong> &#8220;2023&#8221; (STOP! This matches our Year Regex).<\/li>\n\n\n\n<li><strong>Extract:<\/strong> &#8220;Swiggy Private Limited&#8221;.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">Step 5: The &#8220;Frugal Match&#8221; (String Distance)<\/h2>\n\n\n\n<p>Once we had the candidate string (e.g., <code>Swiggy Pvt Ltd<\/code>), we had to match it to the massive MCA (Ministry of Corporate Affairs) database of registered companies (<code>Bundl Technologies Pvt Ltd<\/code> or <code>Swiggy India Pvt Ltd<\/code>).<\/p>\n\n\n\n<p>We scored each pair with <strong>Jaccard Similarity<\/strong> (token overlap) and <strong>Levenshtein Distance<\/strong> (the &#8220;cost&#8221; of transforming one string into another, via the <code>fuzzywuzzy<\/code> library).<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Scenario A:<\/strong> <code>Swiggy Pvt Ltd<\/code> vs <code>Swiggy India Pvt Ltd<\/code>. High token overlap. <strong>Match.<\/strong><\/li>\n\n\n\n<li><strong>Scenario B:<\/strong> <code>Swiggy<\/code> vs <code>Bundl Technologies<\/code>. Zero overlap. <strong>No match.<\/strong><\/li>\n<\/ul>\n\n\n\n<p>For Scenario B, heuristics failed. But that was a feature, not a bug. By automating the &#8220;obvious&#8221; 80% with high precision, we freed up human analysts to handle the complex edge cases.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why This Matters<\/h2>\n\n\n\n<p>In the era of Generative AI, it is tempting to throw an LLM at every problem. But this project taught me a valuable lesson in engineering economics:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Regex is roughly O(n).<\/strong> For simple patterns like these, a scan is linear in the text length and blazing fast. 
Neural Networks are orders of magnitude slower.<\/li>\n\n\n\n<li><strong>Explainability.<\/strong> If my regex fails, I know exactly which rule caused it. If a Neural Network fails, it\u2019s a black box.<\/li>\n\n\n\n<li><strong>Cost.<\/strong> This entire architecture ran on standard, low-cost CPU instances.<\/li>\n<\/ol>\n\n\n\n<p>Sometimes, the best engineering isn&#8217;t about using the newest tools. It&#8217;s about using the <em>right<\/em> tools to solve the problem under constraint.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you work in Venture Capital or Private Equity, you know Tracxn. It is often described as the &#8220;Bloomberg for Private Markets.&#8221; The platform tracks millions of startups and private&#8230;<\/p>\n","protected":false},"author":1,"featured_media":156,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[8,4],"tags":[],"class_list":["post-74","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-learn-with-me","category-tech-posts"],"jetpack_featured_media_url":"https:\/\/balamurali.in\/blog\/wp-content\/uploads\/2026\/02\/regex_featured.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts\/74","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/comments?post=74"}],"version-history":[{"count":1,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts\/74\/revisions"}],"predecessor-version":[{"id":76,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/posts\/74\
/revisions\/76"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/media\/156"}],"wp:attachment":[{"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/media?parent=74"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/categories?post=74"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/balamurali.in\/blog\/wp-json\/wp\/v2\/tags?post=74"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}