From Cloudflare's Markdown for Agents to a Universal HTML→Markdown Extractor

(github.com)

1 point | by frumu 8 hours ago

1 comment

frumu 8 hours ago

I ran into Cloudflare’s Markdown for Agents and thought it was exactly what I needed for LLM web research. Then I realized it only helps when a site is on Cloudflare and has it enabled, so it doesn’t solve “open web” extraction.
I built a simple HTML→Markdown pipeline in Rust that works on any public URL (strip scripts/styles/boilerplate, preserve structure + links). On a 100-URL set it reduced input size by ~70–80% (often close to 80%).
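For readers who want a feel for the stripping step and the size-reduction figure, here is a minimal std-only sketch. It is not the code from the repo: `strip_blocks` and `reduction_percent` are hypothetical names, the string-search approach is an assumption chosen to stay dependency-free, and it only drops script/style blocks without converting anything to Markdown. A real pipeline would use a proper HTML parser.

```rust
/// Remove every `<tag ...>...</tag>` block from `html`, case-insensitively.
/// Hypothetical helper for illustration only. Limitation: it matches on the
/// prefix `<tag`, so `<scripted>` would also match `script`.
fn strip_blocks(html: &str, tag: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = html.to_ascii_lowercase();
    let open = format!("<{}", tag);
    let close = format!("</{}>", tag);
    let mut out = String::with_capacity(html.len());
    let mut pos = 0;
    while let Some(rel) = lower[pos..].find(&open) {
        let start = pos + rel;
        // Keep everything before the opening tag.
        out.push_str(&html[pos..start]);
        match lower[start..].find(&close) {
            // Skip past the closing tag and keep scanning.
            Some(rel_end) => pos = start + rel_end + close.len(),
            // Unclosed block: drop the rest of the input.
            None => {
                pos = html.len();
                break;
            }
        }
    }
    out.push_str(&html[pos..]);
    out
}

/// Percentage by which `after` is smaller than `before` (both in bytes).
fn reduction_percent(before: usize, after: usize) -> f64 {
    if before == 0 {
        return 0.0;
    }
    100.0 * (1.0 - after as f64 / before as f64)
}
```

Calling `strip_blocks(&strip_blocks(html, "script"), "style")` and then passing the byte lengths before and after to `reduction_percent` gives the kind of percentage quoted above, although the real numbers also include boilerplate removal and Markdown conversion.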
Benchmark on the same 100 URLs:
Rust server mode: p50 ~0.4s, p95 ~1.3s, memory ~100MB stable
Node baseline (JSDOM + Turndown): p50 ~1.2s, p95 ~50s, memory grew into hundreds of MB to GBs
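For anyone reproducing the p50/p95 figures from their own timings, here is a small sketch of a nearest-rank percentile. The post does not say which percentile definition the benchmark scripts use, so the nearest-rank method and the `percentile` name are assumptions for illustration.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `p`
/// percent of all samples are less than or equal to it.
/// Returns None for empty input or `p` outside 0..=100.
/// Hypothetical helper; the benchmark scripts may use another definition.
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp gives a total order on f64, so sorting never panics on NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));
    // 1-based rank, clamped so that p = 0 maps to the smallest sample.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

With per-URL latencies in seconds collected into a `Vec<f64>`, `percentile(&latencies, 50.0)` and `percentile(&latencies, 95.0)` give the two summary numbers. With only 100 samples, p95 is set by the five slowest pages, so a handful of pathological URLs can dominate it.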
Scripts + methodology are in the repo: <link>
Curious what others use for boilerplate removal and how you keep p95 tails under control when parsing nasty pages.