In the world of Private Equity, Vintage is everything.
An investor looks at a company with $1M in revenue and asks: “Did they reach this in 12 months, or 12 years?” The answer determines whether the startup is a rocket ship or a zombie.
At Tracxn, we needed the Founding Year for millions of companies to calculate this vintage.
The obvious solution was to scrape LinkedIn. But LinkedIn is a fortress—expensive to proxy, guarded by anti-bot walls, and full of user-generated noise (employees guessing start dates).
We realized the best source of truth wasn’t a third-party aggregator; it was the company’s own diary. Every startup loves to tell its story on its website. We just needed a way to read it.
Instead of using expensive NLP models to “read” history, I built a script using Pure Python Heuristics. Here is the detailed breakdown of how we architected this system without a single GPU.
The Core Philosophy: “The Source of Truth is the Site”
We operated on a simple assumption: If a company exists, they have an “About” page. And on that page, they will inevitably mention when they started.
Our goal wasn’t to parse the whole internet. It was to find that one specific sentence—“Established in 2015”—and extract the integer.
Step 1: The Navigator (Smart Page Discovery)
Finding the founding year is harder than finding a legal name because it rarely lives on the homepage. You have to dig.
We couldn’t crawl every page (too slow), so we built a Priority Queue Navigator.
- Extract: Grab all hyperlinks (`<a href>`) from the homepage.
- Filter & Rank: We scored links based on keywords in the URL or anchor text.
  - Tier 1 (High Probability): history, journey, timeline, milestones, legacy.
  - Tier 2 (Medium Probability): about, story, who-we-are.
- Visit: The script would visit the highest-scoring page first. If it found “Our Journey,” it ignored “Contact Us.”
This targeted approach meant we usually found the data in just one HTTP request after the homepage.
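The ranking step can be sketched in plain Python. The tier keywords come straight from the list above; the numeric weights and function names are illustrative assumptions, not the production code.

```python
from urllib.parse import urljoin

# Keyword tiers from the article; the weights (2 vs 1) are illustrative.
TIER_1 = ("history", "journey", "timeline", "milestones", "legacy")
TIER_2 = ("about", "story", "who-we-are")

def score_link(href: str, anchor_text: str) -> int:
    """Score a hyperlink by how likely it leads to a company-history page."""
    haystack = f"{href} {anchor_text}".lower()
    if any(kw in haystack for kw in TIER_1):
        return 2
    if any(kw in haystack for kw in TIER_2):
        return 1
    return 0

def rank_links(base_url: str, links: list[tuple[str, str]]) -> list[str]:
    """Resolve links to absolute URLs, sort best-first, drop zero-score links."""
    scored = sorted(
        ((score_link(href, text), urljoin(base_url, href)) for href, text in links),
        reverse=True,
    )
    return [url for score, url in scored if score > 0]
```

Visiting pages in this order means the crawler usually never touches "Contact Us" or "Careers" at all.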
Step 2: The Scanner (The Year Regex)
Once we landed on the target page (e.g., swiggy.com/about-us), we had to find the year. But searching for 4-digit numbers is dangerous. You’ll catch zip codes, prices ($2000), and support hotlines.
We used a strict Year Regex pattern: `r'\b(19|20)\d{2}\b'`
This acted as our first filter:
- Accept: 1998, 2015, 2023.
- Reject: 5000 (too high), 1000 (too low).
We also applied a Noise Filter to remove common corporate red herrings:
- Current Year: If the extracted year equals `datetime.now().year`, it’s likely a copyright date. Discard.
- ISO Standards: “ISO 9001” or “ISO 27001” are certification numbers, not dates. Discard.
Step 3: The Context Engine (The “Sliding Window”)
This was the “No AI” magic. Finding “2014” isn’t enough. The text could say “We won an award in 2014” or “Our founder graduated in 2014.”
We needed to link the Year to the concept of Creation.
Instead of training a neural network to understand sentence structure, we used a Sliding Window Algorithm.
- Identify Keywords: We compiled a dictionary of “Birth Words”: `['founded', 'established', 'estd', 'incorporated', 'started', 'launched', 'inception', 'born']`.
- Measure Distance: For every candidate year found, we looked at the 10 words before and after it.
- Score:
- “Founded in 2015” -> Distance: 1 word. Strong Match.
- “Established leaders… won an award in 2015” -> Distance: 6 words + punctuation barrier. Weak Match.
This simple proximity check filtered out 90% of false positives without parsing a single grammar tree.
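A minimal version of that sliding-window check might look like the sketch below. The tokenization and distance counting are simplified assumptions (distance here is measured in tokens, and the punctuation-barrier penalty is omitted):

```python
# Birth words from the article's dictionary.
BIRTH_WORDS = {"founded", "established", "estd", "incorporated",
               "started", "launched", "inception", "born"}

def proximity_score(text: str, year: int, window: int = 10) -> int:
    """Smallest token-distance between `year` and a birth word, or -1 if no
    birth word falls inside the +/- `window` tokens around the year."""
    tokens = [t.strip(".,;:'\"()") for t in text.lower().split()]
    best = -1
    for i, token in enumerate(tokens):
        if token != str(year):
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if tokens[j] in BIRTH_WORDS:
                dist = abs(i - j)
                if best == -1 or dist < best:
                    best = dist
    return best
```

A match with a small distance is treated as strong evidence; -1 means the year is probably an award, a product launch date, or other noise.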
Step 4: The “Min-Year” Heuristic
Startup “About” pages are often confusing.
“Founded in 2010, we launched our app in 2012 and went global in 2015.”
Our script would extract three valid years: 2010, 2012, 2015. Which one is the “Founding Year”?
We applied the Min-Year Heuristic: In the context of a company history page, the earliest valid date associated with a “Birth Word” is almost always the founding date.
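Combined with the proximity check from Step 3, the heuristic reduces to a few lines. This sketch assumes each candidate is a (year, score) pair, where the score is the word-distance to the nearest birth word and -1 means none was found:

```python
from typing import Optional

def pick_founding_year(candidates: list[tuple[int, int]]) -> Optional[int]:
    """Earliest year anchored to a birth word (score >= 0), else None."""
    anchored = [year for year, score in candidates if score >= 0]
    return min(anchored) if anchored else None
```

In the Swiggy-style sentence above, both 2010 (“Founded”) and 2012 (“launched”) sit near birth words, but the min-year rule correctly settles on 2010.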
Step 5: Verification (The “Why”)
Data without proof is useless. If our database said “Founded in 2011,” an analyst might doubt it.
Unlike a black-box AI model, our script returned a structured evidence object:
```json
{
  "founding_year": 2011,
  "confidence": "High",
  "source_url": "https://company.com/our-story",
  "context_snippet": "...our journey began in 2011 when..."
}
```
This allowed human verifiers to click a link and confirm the data in seconds, rather than searching the site from scratch.
Summary
The goal of this project was to move from “Identity” (Who are they?) to “History” (When did they start?). By prioritizing raw algorithmic logic over expensive external databases like LinkedIn, we built a scalable solution that cost nearly zero to run.
It reinforced a valuable lesson: sometimes you don’t need a Neural Network. You just need a dictionary, a regex pattern, and a well-tuned while loop.