What WebBaseLoader actually does (and why it disappoints)
Under the hood, WebBaseLoader calls requests.get(), parses the response with BeautifulSoup, and returns the page text. It does not execute JavaScript, apply readability heuristics, or strip boilerplate beyond what you configure manually with bs_kwargs. The output is unfiltered DOM text: usable, but you spend the rest of your pipeline cleaning it up.
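To make "unfiltered DOM text" concrete, here is a minimal stdlib-only approximation of that behavior. It is not WebBaseLoader's actual code: the AllTextParser class, the dom_text helper, and the sample HTML are all made up for illustration, and the real loader uses requests and BeautifulSoup rather than html.parser.

```python
from html.parser import HTMLParser


class AllTextParser(HTMLParser):
    """Collects every visible text node, with no notion of 'main content'."""

    SKIP = {"script", "style"}  # assumption: ignore code and CSS, keep everything else

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep any non-blank text node, wherever it sits in the page.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def dom_text(html: str) -> str:
    """Return all text nodes joined by newlines (hypothetical helper)."""
    parser = AllTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Made-up page: one article wrapped in typical site chrome.
SAMPLE_HTML = """
<html><body>
<nav><a href="/">Home</a> <a href="/pricing">Pricing</a></nav>
<article><h1>Vector stores</h1><p>Chunk size matters.</p></article>
<footer>Subscribe to our newsletter</footer>
</body></html>
"""

text = dom_text(SAMPLE_HTML)
```

Only two of the five resulting lines belong to the article. The navigation links and the footer call-to-action arrive in the same string, and nothing marks them as boilerplate.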
The alternative is a pre-processing step: extract main content with a real readability library (Trafilatura, Readability.py, jusText), convert to Markdown (html2text, markdownify), persist the .md, and use TextLoader from then on. Your loader becomes deterministic, your output is human-inspectable, and your splitter can be MarkdownHeaderTextSplitter (which respects real document structure). For one-off URLs that don't justify a custom pipeline, paste them into mdisbetter.com/convert/url-to-markdown and feed the downloaded .md to TextLoader.
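A minimal sketch of that pre-processing step follows. It assumes Trafilatura is installed in a release recent enough to accept output_format="markdown", and that TextLoader is importable from langchain_community. The helper names (markdown_path_for, url_to_markdown_file, load_markdown) and the "corpus" directory are hypothetical choices for illustration, not part of either library.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def markdown_path_for(url: str, out_dir: str = "corpus") -> Path:
    """Map a URL to a stable .md path so re-runs overwrite the same file."""
    host = urlparse(url).netloc.replace(":", "_") or "unknown-host"
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    return Path(out_dir) / f"{host}-{digest}.md"


def url_to_markdown_file(url: str, out_dir: str = "corpus") -> Path:
    """Fetch a page, extract its main content as Markdown, and persist it."""
    # Third-party import kept local so the path helper works without it.
    import trafilatura

    html = trafilatura.fetch_url(url)
    if html is None:
        raise RuntimeError(f"could not download {url}")

    # Assumption: this Trafilatura version supports Markdown output.
    markdown = trafilatura.extract(html, output_format="markdown")
    if not markdown:
        raise RuntimeError(f"no main content extracted from {url}")

    path = markdown_path_for(url, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown, encoding="utf-8")
    return path


def load_markdown(path: Path):
    """Load a persisted .md file as LangChain documents."""
    from langchain_community.document_loaders import TextLoader

    return TextLoader(str(path), encoding="utf-8").load()
```

Run url_to_markdown_file once per URL, inspect the resulting files by hand, and have the rest of the pipeline read only from disk through load_markdown. Because the file name is derived from a hash of the URL, the same URL always maps to the same path, which is what keeps the loading step deterministic.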
Pair with MarkdownHeaderTextSplitter
The chunker is where the win compounds. MarkdownHeaderTextSplitter splits on real headings, so your chunks correspond to article sections, the heading path lives in each chunk's metadata, and your synthesis prompts get structural context for free.
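To show what "heading path in metadata" means, here is a simplified stdlib-only stand-in for header-based splitting. It is not the library's implementation: split_on_headings and the h1/h2 metadata keys are hypothetical, and the sketch ignores edge cases such as "#" lines inside fenced code.

```python
import re
from typing import Dict, List, Tuple

# ATX headings only: one to six '#' characters followed by a title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_on_headings(markdown: str) -> List[Tuple[Dict[str, str], str]]:
    """Split Markdown into (metadata, text) chunks, one per heading section."""
    chunks: List[Tuple[Dict[str, str], str]] = []
    path: Dict[int, str] = {}  # heading level -> current title at that level
    body: List[str] = []

    def flush() -> None:
        # Emit the accumulated body under the current heading path.
        text = "\n".join(body).strip()
        if text:
            meta = {f"h{level}": title for level, title in sorted(path.items())}
            chunks.append((meta, text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading replaces its own level and closes deeper ones.
            for stale in [k for k in path if k >= level]:
                del path[stale]
            path[level] = match.group(2)
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Guide\nintro\n## Install\npip install x\n## Use\nrun it\n# FAQ\nq1"
chunks = split_on_headings(sample)
```

Each chunk carries its full heading path, for example {"h1": "Guide", "h2": "Install"} for the install section. With the real class, the comparable call should be MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "h1"), ("##", "h2")]).split_text(text), which returns documents whose metadata holds the same kind of path.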