What gets extracted from a typical product page
Title: the product name as a # H1 heading.
Description: the long-form description as the body content, with sub-headings preserved where the merchant used them.
Specs table: the technical specifications block converted to a GFM table (or a definition list if the source was a list of key-value pairs).
Feature list: bullet-pointed feature highlights as a Markdown list.
Price: kept as a labelled line near the top (**Price:** $199), which is useful for snapshotting at a point in time, since prices change.
Stripped: reviews, ratings widgets, related-product carousels, image carousels (image URLs preserved as Markdown image references), footer link soup, cookie banners, social-share buttons.
Three high-leverage use cases
Competitive analysis: convert a dozen competitor product pages, feed them to an LLM, and ask "compare these along the dimensions X, Y, Z; flag where each one wins or loses". With the UI noise stripped, the LLM compares the actual products instead of getting confused by template variation.
Internal product comparisons: for purchase decisions (vendor selection, hardware purchases, software comparisons), converted spec tables paste cleanly into a comparison spreadsheet.
RAG over a catalogue: index your own catalogue's product pages so an internal AI assistant can answer "do we sell anything with a USB-C port and 16GB of RAM under $1500" from the actual page content.
If the manufacturer ships data sheets as PDFs, also try PDF to Markdown.
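For the comparison-spreadsheet use case, one possible approach is to pull the spec table out of each converted page and merge the results into a single CSV. The sketch below assumes each page contains a two-column GFM table with a header row and a delimiter row. The helper names `parse_spec_table` and `comparison_csv` and the sample pages are hypothetical, and cells containing escaped pipes are not handled.

```python
import csv
import io


def parse_spec_table(markdown: str) -> dict[str, str]:
    """Read the first two-column GFM table in `markdown` into a dict."""
    specs: dict[str, str] = {}
    rows_seen = 0
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            if rows_seen:
                break  # the first table has ended
            continue
        rows_seen += 1
        if rows_seen <= 2:
            continue  # skip the header row and the delimiter row
        # Naive split: does not handle escaped pipes inside a cell.
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) >= 2:
            specs[cells[0]] = cells[1]
    return specs


def comparison_csv(pages: dict[str, str]) -> str:
    """Build a CSV with one row per spec and one column per product."""
    parsed = {name: parse_spec_table(md) for name, md in pages.items()}
    # Keep spec names in first-seen order across all products.
    spec_names = list(dict.fromkeys(k for s in parsed.values() for k in s))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Spec", *parsed])
    for spec in spec_names:
        writer.writerow([spec, *(parsed[p].get(spec, "") for p in parsed)])
    return out.getvalue()


# Hypothetical converted pages, for illustration only.
page_a = "# Laptop A\n\n| Spec | Value |\n| --- | --- |\n| RAM | 16GB |\n| Ports | USB-C |\n"
page_b = "# Laptop B\n\n| Spec | Value |\n| --- | --- |\n| RAM | 8GB |\n| Weight | 1.2 kg |\n"

print(comparison_csv({"Laptop A": page_a, "Laptop B": page_b}))
```

A spec that one product lacks becomes an empty cell rather than being dropped, so gaps stay visible when the CSV is opened in a spreadsheet.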