LangEx PDF
Problem Statement
Anyone who's worked in healthcare knows the pain: critical patient data locked inside PDFs that no two hospitals format the same way. You can throw OCR at it, but OCR doesn't understand context. You can write regex, but regex breaks the moment someone changes a form layout. Neither approach scales.
Hypothesis
LLMs are great at understanding messy documents — until they hallucinate a patient's date of birth. Regex never hallucinates but can't handle layout variation. What if you ran both? LLM first, regex as safety net. And instead of trying to parse every document type under the sun, focus on three and do them really well.
Solution
LangEx PDF is a full-stack web application for extracting structured data from healthcare PDFs using AI-powered semantic extraction with intelligent fallback:
- Multi-provider AI extraction — supports Gemini, OpenAI, and Anthropic as selectable LLM providers. Users bring their own API key at runtime — keys never persisted or logged
- Intelligent regex fallback — if AI fails or is disabled, pattern-based extraction kicks in automatically. Healthcare regex is production-grade; others are best-effort
- AI Insights card — every extraction returns confidence level, text quality score, structure score, language detection, and processing notes (see the result sketch after this list)
- Type-specific schemas — output maps to healthcare (patient info, diagnoses, medications), contract, or extended care agreement structures
- Drag-and-drop UI — modern responsive interface with real-time extraction progress, color-coded confidence, and JSON/CSV export
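To make that output concrete, here is a rough sketch of what a single extraction result could look like, with the AI Insights fields alongside the type-specific data. All type and field names here are illustrative assumptions, not the project's actual schema:

```typescript
// Hypothetical shape of one extraction result; names are illustrative only.
type DocumentType = "healthcare" | "contract" | "extended_care_agreement";

interface AiInsights {
  confidence: "high" | "medium" | "low"; // overall extraction confidence
  textQualityScore: number;              // 0-1: how clean the source text was
  structureScore: number;                // 0-1: how well the layout was recognized
  language: string;                      // detected document language, e.g. "en"
  processingNotes: string[];             // caveats surfaced to the user
}

interface HealthcareData {
  patient: { name: string; dateOfBirth: string; id?: string };
  diagnoses: string[];
  medications: { name: string; dosage?: string; frequency?: string }[];
}

interface ExtractionResult {
  documentType: DocumentType;
  method: "ai" | "regex";                 // which path produced the data
  insights: AiInsights;
  data: HealthcareData | Record<string, unknown>; // contract / care agreement shapes omitted
}
```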
Technical Architecture
- AI providers selectable at runtime (Gemini / OpenAI / Anthropic)
- Ephemeral API keys: supplied per request, never stored
- SQLite for extraction metadata; uploaded files held as temp files and cleaned up after 24 hours
- CSRF protection and strict input validation
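A minimal sketch of how runtime provider selection with an ephemeral key might work, assuming a simple dispatch on provider name. The function, types, endpoints, and request payload below are placeholders for illustration, not the real codebase or the providers' exact APIs:

```typescript
// Sketch: dispatch to the selected provider, using the key only for this call.
type Provider = "gemini" | "openai" | "anthropic";

interface ExtractionRequest {
  provider: Provider;
  apiKey: string;      // held only for the duration of this call; never written anywhere
  pdfText: string;
  documentType: "healthcare" | "contract" | "extended_care_agreement";
}

async function extractWithProvider(req: ExtractionRequest): Promise<unknown> {
  // Placeholder endpoints; real URLs and auth headers differ per provider.
  const endpoints: Record<Provider, string> = {
    gemini: "https://example.invalid/gemini",
    openai: "https://example.invalid/openai",
    anthropic: "https://example.invalid/anthropic",
  };
  const response = await fetch(endpoints[req.provider], {
    method: "POST",
    headers: { Authorization: `Bearer ${req.apiKey}` }, // simplified auth scheme
    body: JSON.stringify({ text: req.pdfText, type: req.documentType }),
  });
  if (!response.ok) throw new Error(`Provider call failed: ${response.status}`);
  // The key falls out of scope here; nothing is persisted or logged.
  return response.json();
}
```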
Key Product Decisions
- 3 document types, not 30 — narrowing scope to healthcare, contracts, and extended care agreements increased extraction accuracy dramatically. General-purpose extractors sacrifice quality for breadth
- AI-first with deterministic fallback — LLMs handle the 80% of documents with layout variation. Regex handles the structured 20% where patterns are reliable. Together, coverage approaches 100% (see the fallback sketch after this list)
- Ephemeral API keys — keys live only in browser memory for the duration of the request. Never stored, never logged, never sent to our backend. This makes the tool far easier to adopt in regulated environments, with much less compliance overhead
- Confidence scoring as a product feature — surfacing extraction confidence, text quality, and structure scores lets users make informed decisions about trusting the output. Transparency builds trust in AI-powered tools
- WCAG 2.1 AA accessibility — semantic markup, keyboard navigation, and screen reader support. Healthcare tools must be accessible by default, not as an afterthought
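The AI-first / regex-fallback decision boils down to a small piece of orchestration. Here is a sketch under the assumption that the AI path either throws or reports low confidence; the extractor functions are hypothetical stand-ins for the real extraction paths:

```typescript
// Hypothetical orchestration: try AI first, fall back to deterministic regex.
interface Extraction {
  data: Record<string, unknown>;
  confidence: "high" | "medium" | "low";
  method: "ai" | "regex";
}

async function extractDocument(
  pdfText: string,
  aiEnabled: boolean,
  extractWithAi: (text: string) => Promise<Extraction>,
  extractWithRegex: (text: string) => Extraction,
): Promise<Extraction> {
  if (aiEnabled) {
    try {
      const result = await extractWithAi(pdfText);
      // Accept the AI result unless it signals low confidence.
      if (result.confidence !== "low") return result;
    } catch {
      // Provider error (timeout, bad key, malformed response): fall through.
    }
  }
  // Deterministic path: pattern-based extraction, reliable for structured layouts.
  return extractWithRegex(pdfText);
}
```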
Impact & Metrics
Lessons Learned
- LLMs are powerful but not reliable alone. Semantic extraction works brilliantly 85% of the time — and fails unpredictably the other 15%. The regex fallback isn't a crutch; it's what makes the product production-viable. AI + deterministic systems > AI alone
- Scope discipline is a superpower. The temptation was to support every document type. Narrowing to 3 types made extraction quality measurably better and development 3x faster. Ship narrow, expand with data
- Healthcare compliance starts at the architecture level. Ephemeral keys, no document persistence, temp file cleanup, CSRF protection — these aren't features, they're table stakes. Designing for HIPAA from day one is cheaper than retrofitting later