
LangEx PDF

Personal Product · May 2025

LLMs · Python · PHP · Healthcare · PDF Extraction · RAG · Gemini · OpenAI · Anthropic

Problem Statement

Anyone who's worked in healthcare knows the pain: critical patient data locked inside PDFs that no two hospitals format the same way. You can throw OCR at it, but OCR doesn't understand context. You can write regex, but regex breaks the moment someone changes a form layout. Neither approach scales.

Hypothesis

LLMs are great at understanding messy documents — until they hallucinate a patient's date of birth. Regex never hallucinates but can't handle layout variation. What if you ran both? LLM first, regex as safety net. And instead of trying to parse every document type under the sun, focus on three and do them really well.
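A minimal sketch of that pipeline in Python, assuming hypothetical `llm_extract` and `regex_extract` helpers and an illustrative confidence threshold (none of these names come from the actual codebase):

```python
import re

def llm_extract(text: str) -> dict:
    """Placeholder for a provider call (Gemini / OpenAI / Anthropic).
    Assumed to return extracted fields plus a 0-1 confidence score."""
    raise NotImplementedError  # wire up the chosen provider SDK here

def regex_extract(text: str) -> dict:
    """Deterministic fallback: a narrow, well-tested pattern per field."""
    dob = re.search(r"\bDOB[:\s]+(\d{2}/\d{2}/\d{4})", text)
    return {"date_of_birth": dob.group(1) if dob else None}

def extract(text: str) -> dict:
    """LLM first; regex safety net when the call fails or confidence is low."""
    try:
        result = llm_extract(text)
        if result.get("confidence", 0.0) >= 0.7:  # threshold is an assumption
            return result
    except Exception:
        pass  # provider error, timeout, unparseable output, ...
    return regex_extract(text)
```

Because the fallback is deterministic, a low-confidence or failed model call degrades to predictable pattern matching instead of a silently trusted hallucination.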

Solution

LangEx PDF is a full-stack web application for extracting structured data from healthcare PDFs using AI-powered semantic extraction with intelligent fallback:

  • Multi-provider AI extraction — supports Gemini, OpenAI, and Anthropic as selectable LLM providers. Users bring their own API key at runtime — keys never persisted or logged
  • Intelligent regex fallback — if AI fails or is disabled, pattern-based extraction kicks in automatically. Healthcare regex is production-grade; others are best-effort
  • AI Insights card — every extraction returns confidence level, text quality score, structure score, language detection, and processing notes
  • Type-specific schemas — output maps to healthcare (patient info, diagnoses, medications), contract, or extended care agreement structures; see the dataclass sketch after this list
  • Drag-and-drop UI — modern responsive interface with real-time extraction progress, color-coded confidence, and JSON/CSV export
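One way the type-specific schema and the AI Insights payload could fit together, sketched as Python dataclasses; every field name here is an assumption based on the bullets above, not the shipped schema:

```python
from dataclasses import dataclass, field

@dataclass
class AIInsights:
    """Metadata returned alongside every extraction (names are illustrative)."""
    confidence: str          # e.g. "high" / "medium" / "low"
    text_quality: float      # 0.0 - 1.0
    structure_score: float   # 0.0 - 1.0
    language: str            # detected document language
    notes: list[str] = field(default_factory=list)

@dataclass
class HealthcareExtraction:
    """Healthcare-specific output: patient info, diagnoses, medications."""
    patient_name: str | None = None
    date_of_birth: str | None = None
    diagnoses: list[str] = field(default_factory=list)
    medications: list[str] = field(default_factory=list)
    insights: AIInsights | None = None
```

Contract and extended care agreement documents would get their own schema classes with the same insights attachment.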

Technical Architecture

  • AI providers selectable at runtime: Gemini, OpenAI, or Anthropic
  • Ephemeral API keys, never stored
  • SQLite for extraction metadata; uploaded documents kept as temporary files for 24 hours
  • CSRF protection and strict input validation
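As a rough illustration of the 24-hour retention policy above, the cleanup job could be as small as this; the uploads/ path and the file pattern are assumptions:

```python
import time
from pathlib import Path

UPLOAD_DIR = Path("uploads")      # assumed location of temporary PDFs
MAX_AGE_SECONDS = 24 * 60 * 60    # documents live for at most 24 hours

def purge_stale_uploads() -> int:
    """Delete temp files older than 24 hours; return how many were removed."""
    now = time.time()
    removed = 0
    for path in UPLOAD_DIR.glob("*.pdf"):
        if now - path.stat().st_mtime > MAX_AGE_SECONDS:
            path.unlink(missing_ok=True)
            removed += 1
    return removed
```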

Key Product Decisions

  • 3 document types, not 30 — narrowing scope to healthcare, contracts, and extended care agreements increased extraction accuracy dramatically. General-purpose extractors sacrifice quality for breadth
  • AI-first with deterministic fallback — LLMs handle the 80% of documents with layout variation. Regex handles the structured 20% where patterns are reliable. Together, coverage approaches 100%
  • Ephemeral API keys — keys live only in browser memory for the duration of the request. Never stored, never logged, never sent to our backend. This makes the tool usable in regulated environments without compliance review
  • Confidence scoring as a product feature — surfacing extraction confidence, text quality, and structure scores lets users make informed decisions about trusting the output. Transparency builds trust in AI-powered tools; a toy mapping is sketched after this list
  • WCAG 2.1 AA accessibility — semantic markup, keyboard navigation, and screen reader support. Healthcare tools must be accessible by default, not as an afterthought
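A toy illustration of the color-coded confidence mentioned above; the thresholds and labels are assumptions, not the shipped values:

```python
def confidence_badge(score: float) -> tuple[str, str]:
    """Map a 0-1 confidence score to a label and a display color."""
    if score >= 0.85:
        return "high", "green"
    if score >= 0.60:
        return "medium", "amber"
    return "low", "red"

# A 0.72 score would render as an amber "medium" badge.
print(confidence_badge(0.72))
```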

Impact & Metrics

  • 3 LLMs: Gemini, OpenAI, and Anthropic, selectable at runtime
  • AI + Regex: dual extraction pipeline with automatic fallback
  • Open Source

Lessons Learned

  • LLMs are powerful but not reliable alone. Semantic extraction works brilliantly 85% of the time — and fails unpredictably the other 15%. The regex fallback isn't a crutch; it's what makes the product production-viable. AI + deterministic systems > AI alone
  • Scope discipline is a superpower. The temptation was to support every document type. Narrowing to 3 types made extraction quality measurably better and development 3x faster. Ship narrow, expand with data
  • Healthcare compliance starts at the architecture level. Ephemeral keys, no document persistence, temp file cleanup, CSRF protection — these aren't features, they're table stakes. Designing for HIPAA from day one is cheaper than retrofitting later