Parsr

Most enterprise document pipelines in 2018 were brittle: proprietary OCR vendors, bespoke parsing scripts per document type, and no shared tooling across a 160,000-person organisation. Parsr was conceived as AXA Group Operations' first open-source product: a modular pipeline for cleaning, parsing, and structuring input documents (PDF, image, docx) into machine-readable formats for downstream AI use cases.

Ingests PDFs, images, and Office documents and outputs structured JSON, Markdown, CSV, or plain text with document hierarchy preserved
Modular pipeline architecture: each processing stage (OCR, table detection, hierarchy reconstruction, output formatting) is an independently configurable module
Deployed across 10 AXA entities in 9 countries including Malaysia, Belgium, UK, France, Hong Kong, and Brazil
Grew to 2,000+ GitHub stars with external collaborators including ABBYY and the University of New South Wales
Adopted as a foundation for downstream AI use cases across the AXA ecosystem including claims processing, contract analysis, and compliance document review

Parsr is a modular Node.js pipeline in which each stage emits a typed document model that the next stage consumes. OCR is handled by pluggable backends (Tesseract, ABBYY, AWS Textract) that all conform to a normalised output contract, so one backend can be swapped for another without changing the rest of the pipeline.
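The normalised output contract can be illustrated with a minimal TypeScript sketch. Everything named here (`Word`, `OcrAdapter`, `RawA`, `RawB`, `adapterA`, `adapterB`, `readingOrderText`) is hypothetical and chosen for illustration; the raw shapes are invented and are not the actual response formats of Tesseract, ABBYY, or AWS Textract, nor Parsr's real types.

```typescript
// Normalised contract that every backend adapter must produce (hypothetical).
interface Word {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
  confidence: number; // always in the range 0..1
}

// Two invented raw output shapes standing in for different OCR engines.
interface RawA { word: string; left: number; top: number; right: number; bottom: number; conf: number } // conf in 0..100
interface RawB { Text: string; Box: { Left: number; Top: number; Width: number; Height: number }; Confidence: number } // 0..1

// An adapter converts one engine's raw output into the shared contract.
interface OcrAdapter<Raw> {
  name: string;
  normalise(raw: Raw[]): Word[];
}

const adapterA: OcrAdapter<RawA> = {
  name: "engine-a",
  normalise: (raw) =>
    raw.map((r) => ({
      text: r.word,
      x: r.left,
      y: r.top,
      width: r.right - r.left,
      height: r.bottom - r.top,
      confidence: r.conf / 100, // rescale 0..100 to 0..1
    })),
};

const adapterB: OcrAdapter<RawB> = {
  name: "engine-b",
  normalise: (raw) =>
    raw.map((r) => ({
      text: r.Text,
      x: r.Box.Left,
      y: r.Box.Top,
      width: r.Box.Width,
      height: r.Box.Height,
      confidence: r.Confidence,
    })),
};

// A downstream stage depends only on Word[], never on a specific engine.
function readingOrderText(words: Word[]): string {
  return words
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.y - b.y || a.x - b.x) // top to bottom, then left to right
    .map((w) => w.text)
    .join(" ");
}
```

Because `readingOrderText` only sees `Word[]`, replacing `adapterA` with `adapterB` in this sketch requires no change to the downstream stage, which is the property the paragraph above describes.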

Hierarchy reconstruction uses a rule-based system calibrated against AXA document corpora: font size, whitespace, and positional clustering determine heading levels, list membership, and table boundaries. The output is a typed AST that downstream modules query, instead of working from raw text.
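As a rough illustration of the font-size rule only, the sketch below treats the font size that carries the most text as body text, ranks every larger size as a heading level, and nests the lines into a small typed tree. It is a simplified assumption, not Parsr's actual algorithm: whitespace and positional clustering are ignored, and the names `Line`, `DocNode`, `bodyFontSize`, and `buildTree` are hypothetical.

```typescript
interface Line { text: string; fontSize: number }

interface HeadingNode { type: "heading"; level: number; text: string; children: DocNode[] }
interface ParagraphNode { type: "paragraph"; text: string }
type DocNode = HeadingNode | ParagraphNode;

// Body size = the font size that covers the most characters.
function bodyFontSize(lines: Line[]): number {
  const weight: { [size: number]: number } = {};
  for (const l of lines) {
    weight[l.fontSize] = (weight[l.fontSize] || 0) + l.text.length;
  }
  let best = 0;
  let bestWeight = -1;
  for (const key of Object.keys(weight)) {
    const size = Number(key);
    if (weight[size] > bestWeight) {
      best = size;
      bestWeight = weight[size];
    }
  }
  return best;
}

function buildTree(lines: Line[]): DocNode[] {
  const body = bodyFontSize(lines);

  // Distinct sizes larger than body text, largest first: index 0 is level 1.
  const headingSizes: number[] = [];
  for (const l of lines) {
    if (l.fontSize > body && headingSizes.indexOf(l.fontSize) === -1) {
      headingSizes.push(l.fontSize);
    }
  }
  headingSizes.sort((a, b) => b - a);

  const root: DocNode[] = [];
  const stack: HeadingNode[] = []; // currently open headings, outermost first

  for (const line of lines) {
    const idx = headingSizes.indexOf(line.fontSize);
    if (idx === -1) {
      // Body-sized (or smaller) text attaches to the innermost open heading.
      const parent = stack.length ? stack[stack.length - 1].children : root;
      parent.push({ type: "paragraph", text: line.text });
      continue;
    }
    const level = idx + 1;
    // Close headings at the same or a deeper level before opening this one.
    while (stack.length && stack[stack.length - 1].level >= level) {
      stack.pop();
    }
    const node: HeadingNode = { type: "heading", level, text: line.text, children: [] };
    (stack.length ? stack[stack.length - 1].children : root).push(node);
    stack.push(node);
  }
  return root;
}
```

A downstream module can then walk the returned nodes by `type` and `level` instead of re-deriving structure from raw text.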