The web-scraping-into-RAG failure pattern
The naive pipeline is: requests.get(url), BeautifulSoup, strip a few tags, chunk by character count, embed. This works for the first 10 sites you try and falls apart on the 11th, which uses a different DOM convention. The chunks end up stuffed with "Skip to main content", "Subscribe to our newsletter", "© 2026 Some Company", and section headers from the unrelated sidebar. Every embedding gets pulled toward boilerplate.
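That failure mode is easy to reproduce. Here is a minimal sketch of the naive pipeline (function name and chunk size are illustrative): even after stripping script and style tags, get_text() keeps nav, footer, and sidebar copy, and fixed-width chunking smears it into every chunk.

```python
from bs4 import BeautifulSoup

def naive_chunks(html: str, chunk_size: int = 500) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # "strip a few tags"
        tag.decompose()
    # get_text() flattens EVERYTHING left: nav links, cookie banners,
    # footers, sidebar headers -- all of it lands in the prose stream.
    text = soup.get_text(separator=" ", strip=True)
    # Character-count chunking: boundaries fall mid-sentence, and each
    # chunk inherits whatever boilerplate surrounded the main content.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# html = requests.get(url, timeout=10).text  # then: naive_chunks(html)
```

Run it on any page with a nav bar and a footer and the first chunk already contains the boilerplate the embedding model will latch onto.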
Convert each URL to Markdown first — through MDisBetter's web tool for one-offs, or through a self-rolled OSS pipeline (Trafilatura, html2text, Readability.py) for automation. Either way, you skip the per-site DOM-wrangling and end up with clean prose plus real headings.
Two paths: web tool or OSS
One-off ingestion (a handful of URLs at a time): paste each URL into /convert/url-to-markdown, click Convert, save the .md file, run it through your chunker. We don't currently expose a programmatic API — for batch automation you'll want to roll your own with the OSS tools below.
Recommended pipeline
URL list → extract main content (Trafilatura is the best-in-class OSS extractor) → convert to Markdown (html2text or markdownify) → chunk on H2/H3 headings (header-aware splitter) → embed → store in vector DB with the source URL and heading path as metadata. For PDF sources, see PDF to Markdown for RAG.
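The header-aware chunking step does not need a framework. This is an illustrative splitter, not a library API: it walks the Markdown line by line, tracks the current H2/H3 path, and emits one chunk per section with its heading path attached, ready to store alongside the source URL as metadata.

```python
import re

def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown on H2/H3 headings, tagging each chunk with its heading path."""
    chunks: list[dict] = []
    path: dict[int, str] = {}  # heading level -> heading text
    buf: list[str] = []

    def flush() -> None:
        text = "\n".join(buf).strip()
        if text:
            chunks.append({"heading_path": " > ".join(path.values()),
                           "text": text})
        buf.clear()

    for line in markdown.splitlines():
        m = re.match(r"^(#{2,3})\s+(.*)", line)
        if m:
            flush()  # close the section under the previous heading
            level = len(m.group(1))
            # An H2 resets any H3 below it; an H3 nests under the H2.
            path = {k: v for k, v in path.items() if k < level}
            path[level] = m.group(2).strip()
        else:
            buf.append(line)
    flush()
    return chunks
```

Each chunk then embeds as one coherent section, and the `heading_path` (e.g. "Installation > Requirements") gives the retriever something to cite back to.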