Build in Public · Mar 11, 2026

LangExtract — How I Turned Messy Text Into Clean JSON for Free

Matteo Lombardi
Mar 11, 2026

One of my Telegram bots found this tool for me.

I have 7 bots running on my Mac Mini via NanoClaw. Riko is the research one — I send her links (changelogs, blog posts, releases) and she tells me if something matters for our stack.

A few weeks ago I sent her an article about structured extraction. She came back with: “This fits your RSS pipeline. Replace the raw prompts with example-based schemas. Consistent output, zero cost.”

She was right. That tool was LangExtract.

The problem it solves

I run 11 n8n workflows. Many of them process text that isn’t structured: RSS articles, competitor pages, invoices, prospect websites. Before LangExtract, each workflow had a Gemini or Claude node with a long prompt: “Extract the following fields… return JSON… don’t add extra keys…”

It worked 80% of the time. The other 20%: wrong field names, hallucinated data, inconsistent formats. Every edge case meant a longer prompt. Longer prompts meant more edge cases. Classic loop.

LangExtract flips the approach. Instead of describing what you want in prose, you show it 2-3 examples of input → output. It learns the pattern and applies it consistently.

My setup

FastAPI server on Mac Mini, localhost:8765. Starts at boot via launchd. One endpoint:

POST /extract
{ "text": "...", "schema_name": "invoice" }

7 schemas loaded at startup, each a JSON file with 2-3 real examples:

| Schema | What it does | Where it's used |
| --- | --- | --- |
| article-enrichment | RSS articles → topic, insight, action item | n8n RSS pipelines (ITA + BCN) |
| prospect-audit | Company website → ICP fit, team size, funding, pain signals | prospect-audit.py CLI → Crono |
| invoice | Invoice text → supplier, VAT, amount, category | parse-invoice.py CLI |
| competitor-profile | Competitor page → pricing, features, positioning | CI Agent n8n pipeline |
| rss-funding | Funding news → company, round, amount, investors | ABM Signal Monitor |
| linkedin-post-analysis | Post text → hook type, CTA style, engagement pattern | Content strategy |
| content-repurpose | Blog/newsletter → key quotes, stats, social snippets | Content Repurposing Pipeline |
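Each schema file is plain JSON. A sketch of the shape I'd expect such a file to take — field names and layout are illustrative, mirroring the Qonto invoice output later in this post, not the exact files:

```json
{
  "name": "invoice",
  "examples": [
    {
      "input": "Factura INV-2026-02-8847 — Qonto S.A.S. — Total 10,89 EUR, IVA 21%",
      "output": {
        "supplier": "Qonto S.A.S.",
        "invoice_number": "INV-2026-02-8847",
        "amount_total": "10.89",
        "currency": "EUR",
        "tax_rate": "21%",
        "category": "Software & SaaS"
      }
    }
  ]
}
```

The point is that each example is a complete input → output pair, so adding an edge case means appending one more object to `examples`, not rewording a prompt.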

Model: Gemma 3 4B running locally via Ollama. No API key, no rate limits, no cost. Gemini 2.0 Flash as fallback if the local model is down.
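The fallback is a try-local-first pattern. A minimal sketch, assuming the two extractors are just callables — function names here are illustrative, not my actual code:

```python
def extract_with_fallback(text, schema_name, local_fn, cloud_fn):
    """Try the local Ollama-backed extractor first; fall back to the cloud model.

    `local_fn` and `cloud_fn` are callables taking (text, schema_name) and
    returning a dict. Any local failure (Ollama down, timeout, bad response)
    triggers the Gemini fallback instead of failing the workflow.
    """
    try:
        return local_fn(text, schema_name)
    except Exception:
        return cloud_fn(text, schema_name)
```

Because failures fall through silently, it's worth logging which branch answered — otherwise you won't notice the day every extraction quietly starts costing money.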

Real examples from my schemas

RSS article → structured insight

I feed it an article like “HubSpot just launched Breeze, their new AI layer that sits across the entire CRM. Breeze Copilot for in-app AI, Breeze Agents for automation, Breeze Intelligence for enrichment…”

It returns:

{
  "topic": "CRM AI Integration",
  "category": "product_launch",
  "key_insight": "HubSpot embedding AI across entire CRM stack — copilot, agents, and data enrichment bundled together",
  "relevance_score": "high",
  "action_item": "Analyze Breeze feature gaps vs our custom stack — potential content angle"
}

This goes straight into a Google Sheet. The n8n RSS pipeline processes articles from 9 subreddits and 20+ RSS feeds — all structured, all filterable.

Prospect website → ICP fit score

Before a sales call, I run python3 prospect-audit.py https://factorial.co. It scrapes the homepage + /about page, sends the text to LangExtract, and returns:

{
  "company_stage": "Series A",
  "team_size": "25",
  "location": "Barcelona",
  "funding": "€5M from Nauta Capital",
  "icp_fit": "strong",
  "pain_signals": "Expanding from SMB to enterprise — needs growth infrastructure",
  "outreach_angle": "Enterprise expansion playbook + positioning for upmarket move"
}

Same script also catches weak fits — a local web agency in Milan with 8 people and WordPress gets icp_fit: "weak" and outreach_angle: "Not ICP — skip". Saves me the call.

Spanish invoice → accounting data

I’m an autónomo in Spain. Every quarter: categorize expenses, map VAT rates, send to accountant. Invoices from Netlify, Qonto, Google, OpenAI — all different formats, some in English, some in Spanish.

LangExtract parses them all. A Qonto invoice:

{
  "supplier": "Qonto S.A.S.",
  "invoice_number": "INV-2026-02-8847",
  "amount_total": "10.89",
  "currency": "EUR",
  "tax_rate": "21%",
  "category": "Software & SaaS"
}

The parse-invoice.py script adds a VAT mapping layer on top — it knows Google and LinkedIn are intracomunitario (casilla 10/11), while Netlify and Anthropic are extracomunitario. My accountant gets a clean spreadsheet instead of a pile of PDFs.
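That mapping layer is just a lookup from supplier to VAT treatment. A sketch of the idea — the supplier list comes from the suppliers named above, but the table is illustrative, not tax advice:

```python
# VAT treatment by supplier for the Spanish quarterly return.
# "intracomunitario" = EU reverse-charge (casillas 10/11),
# "extracomunitario" = non-EU services.
VAT_TREATMENT = {
    "Google": "intracomunitario",
    "LinkedIn": "intracomunitario",
    "Netlify": "extracomunitario",
    "Anthropic": "extracomunitario",
}

def vat_treatment(supplier: str) -> str:
    """Map an extracted supplier name to its VAT treatment.

    Matches on the prefix so "Google Ireland Ltd" still resolves to
    "Google". Unknown suppliers are flagged for manual review rather
    than guessed.
    """
    for name, treatment in VAT_TREATMENT.items():
        if supplier.lower().startswith(name.lower()):
            return treatment
    return "unknown"
```

Unknown suppliers returning "unknown" instead of a default is deliberate: a wrong casilla is worse than a row I have to classify by hand.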

Why examples beat prompts

After running both approaches across hundreds of extractions:

Prompts give you ~80% consistency. You spend the other 20% patching edge cases with more instructions, which create new edge cases.

Examples give you ~95%+ consistency. When you hit an edge case, you add one example. The old ones still work. No prompt drift.

The schema files are version-controlled. I can see exactly what each extraction is supposed to return. When something breaks, I compare the output to the example — the debugging surface is tiny.
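Comparing an output against its example is mechanical, which is exactly why the debugging surface stays small. A sketch of the kind of check I mean (not my actual debugging code):

```python
def schema_drift(output: dict, example_output: dict) -> dict:
    """Compare an extraction against the schema's example output.

    Returns the field names the model dropped or invented. An empty
    report on both sides means the shape matches the example exactly.
    """
    return {
        "missing": sorted(k for k in example_output if k not in output),
        "extra": sorted(k for k in output if k not in example_output),
    }
```

Run this against every extraction and you turn "the JSON looks wrong" into a two-line diff against the version-controlled example.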

Three access patterns

Everything calls the same server:

  1. n8n workflows → HTTP Request node to localhost:8765/extract. RSS enrichment, CI Agent, content repurposing — all hit the same endpoint
  2. Python scripts → prospect-audit.py, parse-invoice.py. CLI tools for batch operations
  3. Claude Code via MCP → ad-hoc extraction when I need something quick without building a schema

One server, 7 schemas, three ways to call it. Total cost: $0.
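All three callers send the same two-field payload. A hedged sketch of the client side using nothing but the standard library — the endpoint shape is from the setup above; error handling is omitted:

```python
import json
import urllib.request

def build_extract_request(text: str, schema_name: str,
                          url: str = "http://localhost:8765/extract") -> urllib.request.Request:
    """Build the POST /extract request that n8n nodes and CLI scripts both send."""
    payload = json.dumps({"text": text, "schema_name": schema_name}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract(text: str, schema_name: str) -> dict:
    # Blocks until the FastAPI server answers; raises URLError if it's down.
    with urllib.request.urlopen(build_extract_request(text, schema_name)) as resp:
        return json.load(resp)
```

The n8n HTTP Request node is configured with the same URL and JSON body, so there's one contract to keep stable instead of three.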


I write about the systems I build every week in Inference. If you’re building with AI — not just prompting — this is for you.