The Unreasonable Effectiveness of Regex: How We Mapped the Indian Startup Ecosystem Without a Single GPU

If you work in Venture Capital or Private Equity, you know Tracxn. It is often described as the “Bloomberg for Private Markets.” The platform tracks millions of startups and private companies globally, providing investors with deep data on funding rounds, valuations, and cap tables.

But there is a fundamental problem in private market data: The Identity Gap.

A startup’s brand is almost never its legal identity.

  • You know the app as “Swiggy.”
  • The government knows it as “Bundl Technologies Private Limited.”
  • You know the grocery app as “Zepto.”
  • The government knows it as “KiranaKart Technologies Private Limited.”

This gap is critical. The “Brand” lives on the website, but the “Gold” (revenue, shareholding, board members) lives in government registries (like the MCA in India). If you cannot bridge the marketing website to the legal entity, you cannot access the financial data.

At Tracxn, we had to bridge this gap for millions of companies. The easy answer would have been to spin up a cluster of GPUs, fine-tune a Named Entity Recognition (NER) model, and burn through thousands of dollars in compute costs.

But we were frugal, and we needed speed. So, instead of Artificial Intelligence, I used Pure Python Heuristics. I didn’t build a brain; I built a sniper rifle.

Here is how we architected a massive legal entity extraction engine using nothing but Regex, smart crawling, and “Backtracking Heuristics.”

The Constraint: Speed and Cost

We had to process millions of websites. Running a heavy NLP model (like spaCy or BERT) on full HTML content is computationally expensive.

My philosophy was simple: Don’t read the text. Read the structure.

Step 1: The Scout (Heuristic Navigation)

Many modern websites are minimalist. The homepage often contains nothing but a logo and a “Download App” button. If you only scrape the homepage, you fail.

However, crawling the entire website (sitemap, blog, products) is too expensive. We needed a middle ground.

We built a Keyword-Driven Navigator. Upon landing on the homepage, the script didn’t just look for text; it looked for paths.

  1. Extract: Pull all <a> tags (hyperlinks) from the homepage.
  2. Filter: Check the anchor text against a priority list: ['About Us', 'Contact', 'Legal', 'Terms', 'Privacy', 'Imprint'].
  3. Navigate: If a match was found, the crawler visited only those specific pages.

We didn’t spider the whole site; we simulated a human looking for the “Fine Print.” This kept our request count low (usually 2-3 requests per domain) while maximizing the probability of finding the legal name.
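The navigator can be sketched in a few lines. This is an illustrative reconstruction, not the production code; the class and function names are ours, and it uses only the standard library's `html.parser` to stay self-contained.

```python
from html.parser import HTMLParser

# Lowercased priority keywords, mirroring the list above.
PRIORITY_KEYWORDS = ('about us', 'contact', 'legal', 'terms', 'privacy', 'imprint')

class LinkScout(HTMLParser):
    """Collects hrefs whose anchor text matches a priority keyword."""
    def __init__(self):
        super().__init__()
        self._href = None      # href of the <a> tag we are currently inside
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')

    def handle_data(self, data):
        # If the anchor text mentions a priority keyword, keep the link.
        if self._href and any(kw in data.lower() for kw in PRIORITY_KEYWORDS):
            self.matches.append(self._href)
            self._href = None  # avoid duplicates from split text nodes

    def handle_endtag(self, tag):
        if tag == 'a':
            self._href = None

def find_legal_paths(homepage_html: str) -> list:
    """Return candidate 'fine print' paths found on the homepage."""
    scout = LinkScout()
    scout.feed(homepage_html)
    return scout.matches
```

In practice the crawler would then fetch only the returned paths, keeping the request budget to the 2-3 pages mentioned above.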

Step 2: The Locator (Targeted DOM Traversal)

Once “The Scout” landed on the right page (e.g., swiggy.com/terms-and-conditions), we didn’t ingest the whole page.

We used BeautifulSoup and lxml to perform surgical strikes on the DOM. We specifically targeted areas where lawyers force developers to put company names:

  • <footer>
  • <div class="copyright">
  • <small>
  • Text nodes appearing after the last <hr>

This reduced our processing volume by roughly 99%. We weren’t searching a haystack; we were searching a handful of straw.
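A minimal version of that surgical strike, using BeautifulSoup as the article does (the CSS selector list here is an illustrative subset, not the full production set of targets):

```python
from bs4 import BeautifulSoup

# Zones where lawyers force developers to put the legal name.
LEGAL_ZONES = 'footer, small, div[class*="copyright"]'

def extract_legal_text(page_html: str) -> str:
    """Concatenate text only from the 'fine print' regions of the DOM."""
    soup = BeautifulSoup(page_html, 'html.parser')
    chunks = [node.get_text(' ', strip=True) for node in soup.select(LEGAL_ZONES)]
    return ' '.join(chunks)
```

Everything outside those zones is never even tokenized, which is where the ~99% volume reduction comes from.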

Step 3: The Anchor (Suffix Detection)

How do you find a name if you don’t understand English? You look for the fingerprint of a corporation.

I compiled a massive dictionary of legal suffixes: ['Private Limited', 'Pvt Ltd', 'LLP', 'Inc', 'Limited'].

The engine scanned the targeted text for these exact strings. This gave us our Anchor Index. If we found “Pvt Ltd” ending at character index 150, we knew the company name ended at 150. The question was: where did it start?
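Suffix detection compiles down to one alternation pattern. This sketch uses a small subset of the dictionary; the real list was much larger.

```python
import re

# Illustrative subset of the legal-suffix dictionary.
SUFFIX_PATTERN = re.compile(
    r'\b(Private Limited|Pvt\.?\s*Ltd\.?|LLP|Inc\.?|Limited)\b'
)

def find_anchor(text: str):
    """Return the (start, end) span of the first legal suffix, or None."""
    m = SUFFIX_PATTERN.search(text)
    return (m.start(), m.end()) if m else None
```

Note the longer alternatives come first, so “Private Limited” wins over a bare “Limited” at the same position.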

Step 4: The “Backtracking Heuristic”

This was the core of the “No AI” architecture. Since I couldn’t ask a model to predict the start of the entity, I had to calculate it using a “Walk Backwards” algorithm.

The logic was akin to a car backing up until it hits a wall.

  1. Start at the Anchor Index.
  2. Walk left, character by character (or word by word).
  3. Stop when you hit a “Delimiter Wall.”

The “Walls” were strict rules:

  • Punctuation: a pipe (|), a comma (,), or a full stop (.).
  • Keywords: “Copyright”, “©”, “All rights reserved”, or “Powered by”.
  • Dates: A Regex pattern for years (e.g., 20\d{2}).

Example in Action:

Input: © 2023 Swiggy Private Limited. All rights reserved.

  1. Locate: “Private Limited” (Anchor found).
  2. Backtrack: “Swiggy” (Keep going).
  3. Backtrack: “2023” (STOP! This matches our Year Regex).
  4. Extract: “Swiggy Private Limited”.
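The walk-backwards loop above can be sketched as follows. This is a simplified reconstruction (it only checks leading punctuation and whole-word walls), with names of our own choosing:

```python
import re

WALL_KEYWORDS = {'copyright', '©', 'all', 'rights', 'reserved', 'powered'}
WALL_PUNCT = set('|,.')
YEAR_RE = re.compile(r'(19|20)\d{2}')  # the Year Regex wall

def backtrack_name(text: str, anchor_end: int) -> str:
    """Walk left from the suffix, word by word, until a delimiter wall."""
    kept = []
    for word in reversed(text[:anchor_end].split()):
        stripped = word.strip('|,.')
        # Stop at a wall: punctuation, a wall keyword, or a year.
        if (word[0] in WALL_PUNCT
                or stripped.lower() in WALL_KEYWORDS
                or YEAR_RE.fullmatch(stripped)):
            break
        kept.append(word)
    return ' '.join(reversed(kept)).strip('.,| ')
```

Running the worked example through it: the walk keeps “Limited”, “Private”, “Swiggy”, then hits “2023”, which matches the year wall and stops the car.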

Step 5: The “Frugal Match” (String Distance)

Once we had the candidate string (Swiggy Pvt Ltd), we had to match it to the massive MCA (Ministry of Corporate Affairs) database of registered companies (Bundl Technologies Pvt Ltd or Swiggy India Pvt Ltd).

We used token-level Jaccard Similarity together with Levenshtein Distance (the latter via the fuzzywuzzy library) to score the “cost” of transforming one string into another.

  • Scenario A: Swiggy Pvt Ltd vs Swiggy India Pvt Ltd. High token overlap. Match.
  • Scenario B: Swiggy vs Bundl Technologies. Zero overlap.
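The token-overlap half of the match can be sketched without any third-party library (the article used fuzzywuzzy for the Levenshtein side; the threshold value here is an illustrative assumption):

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two company names."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def best_match(candidate: str, registry: list, threshold: float = 0.5):
    """Return the registry name with the highest overlap, or None if too weak."""
    best = max(registry, key=lambda name: jaccard(candidate, name))
    return best if jaccard(candidate, best) >= threshold else None
```

Scenario A clears the threshold on shared tokens (“Swiggy”, “Pvt”, “Ltd”); Scenario B scores near zero and falls through to a human analyst, exactly as described below.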

For Scenario B, heuristics failed. But that was a feature, not a bug. By automating the “obvious” 80% with high precision, we freed up human analysts to handle the complex edge cases.

Why This Matters

In the era of Generative AI, it is tempting to throw an LLM at every problem. But this project taught me a valuable lesson in engineering economics:

  1. Regex is fast. For patterns like ours, a compiled regex scans text in effectively linear time on a plain CPU. Neural networks are orders of magnitude slower per character.
  2. Explainability. If my regex fails, I know exactly which rule caused it. If a Neural Network fails, it’s a black box.
  3. Cost. This entire architecture ran on standard, low-cost CPU instances.

Sometimes, the best engineering isn’t about using the newest tools. It’s about using the right tools to solve the problem under constraint.
