LangEx PDF
Problem Statement
Anyone who's worked in healthcare knows the pain: critical patient data locked inside PDFs that no two hospitals format the same way. You can throw OCR at it, but OCR doesn't understand context. You can write regex, but regex breaks the moment someone changes a form layout. Neither approach scales.
Hypothesis
LLMs are great at understanding messy documents — until they hallucinate a patient's date of birth. Regex never hallucinates but can't handle layout variation. What if you ran both? LLM first, regex as safety net. And instead of trying to parse every document type under the sun, focus on three and do them really well.
Solution
LangEx PDF is a full-stack web application for extracting structured data from healthcare PDFs using AI-powered semantic extraction with intelligent fallback:
- Multi-provider AI extraction — supports Gemini, OpenAI, and Anthropic as selectable LLM providers. Users bring their own API key at runtime — keys never persisted or logged
- Intelligent regex fallback — if AI fails or is disabled, pattern-based extraction kicks in automatically. Healthcare regex is production-grade; others are best-effort
- AI Insights card — every extraction returns confidence level, text quality score, structure score, language detection, and processing notes (see the result sketch after this list)
- Type-specific schemas — output maps to healthcare (patient info, diagnoses, medications), contract, or extended care agreement structures
- Drag-and-drop UI — modern responsive interface with real-time extraction progress, color-coded confidence, and JSON/CSV export
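To make that output concrete, here is a rough sketch of what a single extraction result could look like, with the AI Insights fields alongside the type-specific data. All type and field names here are illustrative assumptions, not the project's actual schema:

```typescript
// Hypothetical shape of one extraction result; names are illustrative only.
type DocumentType = "healthcare" | "contract" | "extended_care_agreement";

interface AiInsights {
  confidence: "high" | "medium" | "low"; // overall extraction confidence
  textQualityScore: number;              // 0-1: how clean the source text was
  structureScore: number;                // 0-1: how well the layout was recognized
  language: string;                      // detected document language, e.g. "en"
  processingNotes: string[];             // caveats surfaced to the user
}

interface HealthcareData {
  patient: { name: string; dateOfBirth: string; id?: string };
  diagnoses: string[];
  medications: { name: string; dosage?: string; frequency?: string }[];
}

interface ExtractionResult {
  documentType: DocumentType;
  method: "ai" | "regex";                 // which path produced the data
  insights: AiInsights;
  data: HealthcareData | Record<string, unknown>; // contract / care agreement shapes omitted
}
```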
Technical Architecture
- AI providers selectable at runtime (Gemini / OpenAI / Anthropic)
- Ephemeral API keys: supplied per request, never stored
- SQLite for extraction metadata; uploaded files held as temp files and cleaned up after 24 hours
- CSRF protection and strict input validation
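A minimal sketch of how runtime provider selection with an ephemeral key might work, assuming a simple dispatch on provider name. The function, types, endpoints, and request payload below are placeholders for illustration, not the real codebase or the providers' exact APIs:

```typescript
// Sketch: dispatch to the selected provider, using the key only for this call.
type Provider = "gemini" | "openai" | "anthropic";

interface ExtractionRequest {
  provider: Provider;
  apiKey: string;      // held only for the duration of this call; never written anywhere
  pdfText: string;
  documentType: "healthcare" | "contract" | "extended_care_agreement";
}

async function extractWithProvider(req: ExtractionRequest): Promise<unknown> {
  // Placeholder endpoints; real URLs and auth headers differ per provider.
  const endpoints: Record<Provider, string> = {
    gemini: "https://example.invalid/gemini",
    openai: "https://example.invalid/openai",
    anthropic: "https://example.invalid/anthropic",
  };
  const response = await fetch(endpoints[req.provider], {
    method: "POST",
    headers: { Authorization: `Bearer ${req.apiKey}` }, // simplified auth scheme
    body: JSON.stringify({ text: req.pdfText, type: req.documentType }),
  });
  if (!response.ok) throw new Error(`Provider call failed: ${response.status}`);
  // The key falls out of scope here; nothing is persisted or logged.
  return response.json();
}
```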
Key Product Decisions
- 3 document types, not 30 — narrowing scope to healthcare, contracts, and extended care agreements increased extraction accuracy dramatically. General-purpose extractors sacrifice quality for breadth
- AI-first with deterministic fallback — LLMs handle the 80% of documents with layout variation. Regex handles the structured 20% where patterns are reliable. Together, coverage approaches 100% (see the fallback sketch after this list)
- Ephemeral API keys — keys live only in browser memory for the duration of the request. Never stored, never logged, never sent to our backend. This makes the tool far easier to adopt in regulated environments, with much less compliance overhead
- Confidence scoring as a product feature — surfacing extraction confidence, text quality, and structure scores lets users make informed decisions about trusting the output. Transparency builds trust in AI-powered tools
- WCAG 2.1 AA accessibility — semantic markup, keyboard navigation, and screen reader support. Healthcare tools must be accessible by default, not as an afterthought
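The AI-first / regex-fallback decision boils down to a small piece of orchestration. Here is a sketch under the assumption that the AI path either throws or reports low confidence; the extractor functions are hypothetical stand-ins for the real extraction paths:

```typescript
// Hypothetical orchestration: try AI first, fall back to deterministic regex.
interface Extraction {
  data: Record<string, unknown>;
  confidence: "high" | "medium" | "low";
  method: "ai" | "regex";
}

async function extractDocument(
  pdfText: string,
  aiEnabled: boolean,
  extractWithAi: (text: string) => Promise<Extraction>,
  extractWithRegex: (text: string) => Extraction,
): Promise<Extraction> {
  if (aiEnabled) {
    try {
      const result = await extractWithAi(pdfText);
      // Accept the AI result unless it signals low confidence.
      if (result.confidence !== "low") return result;
    } catch {
      // Provider error (timeout, bad key, malformed response): fall through.
    }
  }
  // Deterministic path: pattern-based extraction, reliable for structured layouts.
  return extractWithRegex(pdfText);
}
```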
Impact & Metrics
Lessons Learned
- LLMs are powerful but not reliable alone. Semantic extraction works brilliantly 85% of the time — and fails unpredictably the other 15%. The regex fallback isn't a crutch; it's what makes the product production-viable. AI + deterministic systems > AI alone
- Scope discipline is a superpower. The temptation was to support every document type. Narrowing to 3 types made extraction quality measurably better and development 3x faster. Ship narrow, expand with data
- Healthcare compliance starts at the architecture level. Ephemeral keys, no document persistence, temp file cleanup, CSRF protection — these aren't features, they're table stakes. Designing for HIPAA from day one is cheaper than retrofitting later