Why HTML noise destroys semantic quality
Modern embedding models (OpenAI text-embedding-3-large, Cohere embed-v4, Voyage voyage-3, the open-source GTE/BGE family) are trained to encode the meaning of whatever text you give them. They have no notion that "Skip to main content" or "© 2026 Acme Inc" is boilerplate. Every token is pooled into the final vector. A web page where 70% of the visible text is chrome produces an embedding that is 70% chrome. Across a corpus, that means similarity searches surface pages that share the same sidebar template instead of pages that share the same topic.
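A quick way to see the dilution is to embed the same article body with and without chrome and compare both against a topical query. A minimal sketch, assuming sentence-transformers with a small checkpoint from the BGE family named above; the article text, the chrome string, and the query are all invented for illustration:

```python
from sentence_transformers import SentenceTransformer, util

# Small open-source checkpoint from the BGE family mentioned above.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

article = "Self-attention lets every token weigh its relevance to every other token."
chrome = "Skip to main content | Home | Products | Pricing | © 2026 Acme Inc"

clean_page = article                           # article body only
noisy_page = f"{chrome}\n{article}\n{chrome}"  # same body wrapped in chrome

query = "How does self-attention relate tokens to each other?"
q, clean_vec, noisy_vec = model.encode([query, clean_page, noisy_page])

print("query vs clean:", util.cos_sim(q, clean_vec).item())
print("query vs noisy:", util.cos_sim(q, noisy_vec).item())
```

In runs like this the noisy page typically scores noticeably lower: the boilerplate tokens pull the pooled vector away from the article's topic, which is exactly the template-matching failure described above.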
Sentence-level vs paragraph-level chunking, applied correctly
Once the input is clean Markdown, chunking finally works the way the literature says it should. Paragraph-level chunks (split on blank lines or H3 boundaries, target 200–400 tokens) cluster on themes and work well for retrieval-augmented generation. Sentence-level chunks (target 30–80 tokens) work better for fine-grained similarity matching — finding the exact claim that supports a query. Both strategies break down on raw HTML because the "paragraph" boundaries aren't paragraphs at all; they're div boundaries with no semantic meaning.
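Both strategies are a few lines once the input is Markdown. A minimal sketch, assuming NLTK for sentence splitting; the sample document, the whitespace-based token estimate, and the merge_to_target helper are illustrative stand-ins:

```python
import re
import nltk

# Punkt sentence models; newer NLTK releases look for punkt_tab.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

doc = """## Attention
Self-attention lets every token weigh every other token.

It is the core operation behind modern embedding models.

## Pooling
Most encoders pool per-token states into a single vector."""

def merge_to_target(units, target):
    # Greedy pack: whitespace word count approximates the token target.
    chunks, buf = [], []
    for unit in units:
        buf.append(unit)
        if sum(len(u.split()) for u in buf) >= target:
            chunks.append("\n\n".join(buf))
            buf = []
    if buf:
        chunks.append("\n\n".join(buf))
    return chunks

# Paragraph-level: blank-line boundaries, merged toward 200-400 tokens.
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]
para_chunks = merge_to_target(paragraphs, target=300)

# Sentence-level: one unit per sentence, merged toward 30-80 tokens.
sentences = [s for p in paragraphs for s in nltk.sent_tokenize(p)]
sent_chunks = merge_to_target(sentences, target=50)

print(len(para_chunks), "paragraph chunks;", len(sent_chunks), "sentence chunks")
```

The same greedy merge serves both strategies; only the unit (paragraph vs sentence) and the token target change.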
The end-to-end pipeline
Workflow for one-off embedding work: open the URL to Markdown tool, paste the page URL, click Convert, and download the .md file. Run it through your chunker (MarkdownHeaderTextSplitter for paragraph-level, a sentence-aware splitter like nltk or spaCy for sentence-level), embed with your model of choice, and upsert into the vector DB. For batch automation, roll your own with Trafilatura plus the same chunkers; MDisBetter ships a web tool, not a programmatic API. For PDF embeddings, see PDF to Markdown for RAG: same principle, different source format.
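For the batch path, here is a sketch under stated assumptions: a recent Trafilatura release (Markdown output is only available in newer versions), langchain-text-splitters for the header-aware paragraph split, sentence-transformers for embedding, and upsert_vectors as a hypothetical stand-in for whatever your vector DB client provides:

```python
import trafilatura
from langchain_text_splitters import MarkdownHeaderTextSplitter
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)

def upsert_vectors(records):
    # Hypothetical stand-in: swap in your vector DB client's upsert call.
    for rec in records:
        print(rec["id"], len(rec["vector"]), "dims")

def ingest(url):
    html = trafilatura.fetch_url(url)
    # Markdown output requires a recent Trafilatura release.
    md = trafilatura.extract(html, output_format="markdown") if html else None
    if not md:
        return
    chunks = splitter.split_text(md)  # header-aware, paragraph-level
    texts = [c.page_content for c in chunks]
    vectors = model.encode(texts)
    upsert_vectors([
        {"id": f"{url}#chunk-{i}", "vector": v.tolist(), "text": t}
        for i, (t, v) in enumerate(zip(texts, vectors))
    ])

for url in ["https://example.com/post"]:  # hypothetical URL list
    ingest(url)
```

Swapping the header splitter for the sentence-level splitter from the previous section is a one-line change; everything downstream stays the same.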