
URL to Markdown for SEO Competitive Analysis

Every SEO content brief starts the same way: pull the top 10 SERPs for your target keyword, read what's already ranking, identify what's missing, write something better. The bottleneck has never been the analysis — it's the ingestion. Pasting 10 URLs into ChatGPT and asking "what content gaps do you see?" gets you a polite, generic answer because the model is choking on cookie banners, ad slots, related-posts widgets, and footer link soup. The same prompt against 10 clean Markdown files of the actual page content gets you a usable content brief in one shot. URL-to-Markdown is the missing step in every modern content workflow.

Why HTML scraping fails the modern SEO workflow

The promise of AI-assisted content production is simple: feed competitor pages to an LLM, get a content brief that identifies gaps, headings to cover, entities to include, questions to answer, and an outline that beats the SERP. In practice, the promise breaks at step one because raw HTML is hostile to LLMs in three ways.

Token waste. A typical 2,000-word article on a modern WordPress site is wrapped in 15,000+ tokens of navigation, sidebar widgets, related-posts modules, newsletter popups, footer link farms, schema markup, and inline CSS. You're paying API costs to send the model your competitor's cookie banner.

Signal dilution. Even if cost weren't an issue, the model's attention gets distributed across noise. Asking "what's the H2 structure?" returns confused answers because the model can't easily distinguish article H2s from sidebar H2s from footer H2s in raw HTML.

Brittle parsing. Every site has a different DOM structure. A scraping pipeline that handles WordPress correctly breaks on Webflow, Shopify, and headless CMS sites. Maintaining a scraper per site is engineering work no SEO team should be doing.

URL-to-Markdown solves all three at once. The converter does the readability extraction, drops the chrome, and gives you the article body in clean Markdown — the exact substrate every LLM is trained on.
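If you want to put a number on the token-waste point for a page in your own niche, here's a minimal sketch using tiktoken. It assumes you've saved one competitor page both as raw HTML and as converted Markdown; the file names are placeholders:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# placeholder file names: one competitor page saved as raw HTML and as converted Markdown
raw_html = open("competitor-page.html", encoding="utf-8").read()
markdown = open("competitor-page.md", encoding="utf-8").read()

# disallowed_special=() stops tiktoken from erroring if the HTML happens to contain special-token text
html_tokens = len(enc.encode(raw_html, disallowed_special=()))
md_tokens = len(enc.encode(markdown, disallowed_special=()))

print(f"raw HTML: {html_tokens:,} tokens")
print(f"Markdown: {md_tokens:,} tokens")
print(f"overhead: {html_tokens / md_tokens:.1f}x")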

The content brief workflow, end to end

Suppose your target keyword is "best CRM for small business" and you want a content brief for a piece you'll commission this week. Here's the workflow that takes about 20 minutes and produces a brief good enough to hand to a freelancer.

Step 1: Pull the SERP

From your SERP tool of choice (Ahrefs, Semrush, SE Ranking, or a manual Google search) extract the top 10 ranking URLs. If the SERP shows a featured snippet, People Also Ask boxes, or an AI Overview, capture the URLs cited there too — they're often outside the top 10 organic results but inside the top 10 in user attention.

Step 2: Convert each URL to Markdown

Drop the 10 URLs into our URL-to-Markdown converter (batch mode, one per line). Output is 10 clean .md files, each containing only the article body, headings preserved, with frontmatter capturing title and source URL. Total runtime: under 30 seconds.
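The same batch can be driven from the API if you'd rather script it. The endpoint path, payload shape, and response field below are illustrative assumptions, not the documented contract; check the API docs for the real ones:

import requests
from pathlib import Path

API_KEY = "..."  # your API key
ENDPOINT = "https://api.example.com/v1/convert"  # placeholder endpoint; see the API docs

urls = Path("serp_urls.txt").read_text(encoding="utf-8").splitlines()  # one URL per line
out_dir = Path("serp_corpus/best-crm-small-business")
out_dir.mkdir(parents=True, exist_ok=True)

for i, url in enumerate(urls, start=1):
    resp = requests.post(ENDPOINT, json={"url": url},
                         headers={"Authorization": f"Bearer {API_KEY}"}, timeout=60)
    resp.raise_for_status()
    # assumes the response carries the converted Markdown in a "markdown" field
    (out_dir / f"{i:02d}.md").write_text(resp.json()["markdown"], encoding="utf-8")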

Step 3: Build the entity and heading map

Attach all 10 Markdown files and ask Claude or GPT-5 the following:

I'm writing a content brief for the keyword "best CRM for small business".
Attached are the top 10 ranking pages, converted to Markdown.

1. Extract every H2 and H3 across all 10 files. Group semantically.
2. List every CRM product mentioned more than once, with frequency.
3. List every feature dimension compared (pricing, integrations, ease of use, etc.).
4. Identify questions answered (look for question-style H2/H3 and FAQ blocks).
5. Note any topical sub-clusters competitors cover that look like content-gap opportunities.

The output is a content brief skeleton: the heading structure that the SERP collectively suggests, the entities you must mention to be topically complete, the comparison dimensions readers expect, the questions you must answer to win PAA boxes.
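If you run this step as a script rather than in a chat window, here's a minimal sketch with the Anthropic Python SDK; the model name and corpus path are placeholders:

import anthropic
from pathlib import Path

corpus = sorted(Path("serp_corpus/best-crm-small-business").glob("*.md"))  # placeholder path
pages = "\n\n---\n\n".join(
    f"# FILE: {p.name}\n\n{p.read_text(encoding='utf-8')}" for p in corpus
)

prompt = (
    "I'm writing a content brief for the keyword \"best CRM for small business\".\n"
    "Attached below are the top 10 ranking pages, converted to Markdown.\n\n"
    "1. Extract every H2 and H3 across all 10 files. Group semantically.\n"
    "2. List every CRM product mentioned more than once, with frequency.\n"
    "3. List every feature dimension compared (pricing, integrations, ease of use, etc.).\n"
    "4. Identify questions answered (look for question-style H2/H3 and FAQ blocks).\n"
    "5. Note any topical sub-clusters competitors cover that look like content-gap opportunities.\n\n"
    + pages
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder model name; use whatever you currently default to
    max_tokens=4096,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)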

Step 4: Identify content gaps

Now the gap analysis. Ask the model:

Looking at all 10 articles together, what is consistently missing or underdeveloped?
Look for: questions raised but not answered, product categories mentioned but not compared,
use cases referenced but not detailed, objections users would have but no article addresses.

This is where Markdown earns its keep. The model can actually see the full content of all 10 pages simultaneously. It can reason across them. The answer is no longer "some pages talk about pricing and some talk about features" — it's "none of the top 10 pages address how to migrate from a free CRM to a paid one, which is a high-intent secondary query in your keyword cluster." That's a brief insight you can actually act on.

Step 5: Write the brief

Combine the heading map, entity list, gap analysis, and PAA questions into a one-page brief. A typical structure: target keyword and search intent at the top, the proposed H2/H3 outline built from the heading map, the entities and comparison dimensions the piece must cover, the questions to answer, and the gaps from step 4 framed as the article's angle.

20 minutes from SERP pull to finished brief. The freelancer gets a brief that's actually based on the SERP rather than on vibes.

Beyond briefs: ongoing competitive monitoring

The same pipeline scales to recurring competitive monitoring. Pick your top 20 commercial keywords, pull the top 5 ranking URLs for each, and re-convert weekly. Diff the new Markdown against last week's. Any time a competitor materially updates a page, you see it as a clean Markdown diff — not a noisy HTML diff.
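A minimal sketch of the diff step, assuming you keep each week's converted corpus in a dated folder (the directory layout and dates are placeholders):

import difflib
from pathlib import Path

last_week = Path("serp_corpus/best-crm-small-business/2026-02-02")
this_week = Path("serp_corpus/best-crm-small-business/2026-02-09")

for new_file in sorted(this_week.glob("*.md")):
    old_file = last_week / new_file.name
    if not old_file.exists():
        print(f"NEW PAGE: {new_file.name}")
        continue
    diff = list(difflib.unified_diff(
        old_file.read_text(encoding="utf-8").splitlines(),
        new_file.read_text(encoding="utf-8").splitlines(),
        fromfile=str(old_file), tofile=str(new_file), lineterm="",
    ))
    if diff:
        print(f"\nCHANGED: {new_file.name} ({len(diff)} diff lines)")
        print("\n".join(diff[:40]))  # preview the first 40 lines of each diff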

This catches moves your competitors make that are otherwise invisible: quiet refreshes of pages that already rank, new sections or FAQ blocks added to chase PAA coverage, and headings re-targeted at adjacent keywords.

For accounts in competitive verticals (finance, SaaS, legal, health) this monitoring layer is a meaningful information advantage.

Feeding the corpus to your AI writing tool

If you're using one of the AI writing platforms (Frase, Surfer, Clearscope, MarketMuse, Letterdrop), they all have an "upload competitor content" or "analyze SERP" feature. Most of them work better when you upload pre-converted Markdown than when you let them scrape — same reasoning as above. The output of URL-to-Markdown is exactly what these tools want, just without the noise.

For custom AI writing pipelines (you're running your own GPT-5 or Claude prompt chain), the Markdown corpus becomes the source-of-truth context. Standard pattern: top 10 SERPs as Markdown + your brand voice doc as Markdown + your existing content inventory as Markdown → outline → draft → optimize. Every step in the chain works on Markdown. No HTML touches the LLM.
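A sketch of that context assembly, assuming the brand voice doc and content inventory already live as Markdown files alongside the SERP corpus (all paths are placeholders):

from pathlib import Path

def load_corpus(folder: str) -> str:
    """Concatenate every Markdown file in a folder into one labeled block."""
    parts = []
    for p in sorted(Path(folder).glob("*.md")):
        parts.append(f"## SOURCE: {p.name}\n\n{p.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)

context = "\n\n".join([
    "# COMPETITOR PAGES (top 10 SERP, converted to Markdown)",
    load_corpus("serp_corpus/best-crm-small-business"),
    "# BRAND VOICE",
    Path("brand/voice.md").read_text(encoding="utf-8"),
    "# EXISTING CONTENT INVENTORY",
    load_corpus("content_inventory"),
])
# `context` is the source-of-truth block each step of the chain
# (outline, draft, optimize) receives ahead of its task-specific prompt.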

Concrete example: building a topical authority cluster

You're building a topical cluster around "vector databases". The pillar page is /vector-databases-guide; you need 12 supporting articles. Workflow:

  1. For each of 12 sub-keywords (e.g. "vector database vs traditional database", "chroma vs pinecone", "how vector embeddings work"), pull the top 5 SERPs.
  2. Run all 60 URLs through batch URL-to-Markdown. ~3 minutes.
  3. For each sub-keyword, run the brief workflow above. Output: 12 briefs.
  4. Cross-analyze all 60 Markdown files at once: "What internal linking patterns do the top performers use? Which entities appear across multiple sub-topics? What's the consistent pillar→cluster relationship?" This informs your own internal linking strategy.
  5. Stack rank the 12 briefs by SERP difficulty and commercial intent; sequence the production calendar.

That's a coherent 13-page content cluster planned in an afternoon, with every brief grounded in actual SERP data rather than assumptions.
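A sketch of how the step-4 cross-analysis corpus might be assembled, assuming each sub-keyword's converted pages sit in their own folder (the folder layout and slugs are assumptions):

from pathlib import Path

sub_keywords = [
    "vector database vs traditional database",
    "chroma vs pinecone",
    "how vector embeddings work",
    # ...the remaining nine sub-keywords
]

root = Path("serp_corpus/vector-databases")
sections = []
for kw in sub_keywords:
    folder = root / kw.replace(" ", "-")  # one folder of ~5 converted pages per sub-keyword
    pages = sorted(folder.glob("*.md"))
    body = "\n\n".join(p.read_text(encoding="utf-8") for p in pages)
    sections.append(f"# SUB-KEYWORD: {kw} ({len(pages)} pages)\n\n{body}")

cross_corpus = "\n\n---\n\n".join(sections)
# feed cross_corpus to the cross-analysis prompt in step 4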

Internal linking analysis at the cluster level

One of the more underrated uses of a clean Markdown competitor corpus is internal-link extraction. Every [anchor](url) in the converted output is a competitor's editorial decision about what topics belong adjacent to this query. Aggregate the anchor text across the top 10 SERPs for a single keyword and you get a high-signal map of the topical neighborhood that already ranks. A 20-line script over the Markdown corpus:

import re
from collections import Counter
from pathlib import Path
from urllib.parse import urlparse

link_re = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')
# the frontmatter field that records the source URL; adjust the key name to match your converter's output
source_re = re.compile(r'^(?:source_url|source|url):\s*(\S+)', re.MULTILINE)
anchors = Counter()
for md in Path('serp_corpus/best-crm-small-business').glob('*.md'):
    text = md.read_text(encoding='utf-8')
    match = source_re.search(text)
    site = urlparse(match.group(1)).netloc if match else ''
    for anchor, url in link_re.findall(text):
        # keep internal links only: relative URLs, or absolute URLs on the page's own domain
        if url.startswith('http') and urlparse(url).netloc != site:
            continue
        anchors[anchor.strip().lower()] += 1
for anchor, n in anchors.most_common(50):
    print(f'{n:3d}  {anchor}')

The anchors with the highest frequency across multiple competitors are the entities and sub-topics the SERP collectively believes are the natural internal-link neighborhood for this query. Use them to plan the cluster pages your pillar should link to next.

SERP volatility tracking

For your highest-priority commercial keywords, the SERP itself changes — sometimes weekly. A page that was #2 last month is now #7; the new #1 is a 6-month-old article that Google suddenly decided to elevate. The standard rank-tracker tells you the position changed; what it doesn't tell you is what's different about the winner. Re-run the brief workflow whenever a SERP shifts meaningfully. The Markdown corpus from each timepoint becomes a longitudinal record of how the SERP's content profile is evolving — which entities are gaining mentions, which content formats Google is rewarding, which schema types appear among winners. This is the data your annual content strategy should be built on, and the cost of capturing it is essentially zero once URL-to-Markdown is wired into your monitoring loop.
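One way to track which entities are gaining mentions between snapshots, as a minimal sketch: it assumes dated corpus folders and a hand-maintained entity watchlist, both of which are placeholders:

from collections import Counter
from pathlib import Path

ENTITIES = ["hubspot", "pipedrive", "zoho", "freshsales", "salesforce"]  # hand-maintained watchlist

def entity_mentions(folder: str) -> Counter:
    counts = Counter()
    for p in Path(folder).glob("*.md"):
        text = p.read_text(encoding="utf-8").lower()
        for entity in ENTITIES:
            counts[entity] += text.count(entity)
    return counts

before = entity_mentions("serp_corpus/best-crm-small-business/2026-01-05")
after = entity_mentions("serp_corpus/best-crm-small-business/2026-02-09")
for entity in ENTITIES:
    delta = after[entity] - before[entity]
    print(f"{entity:12s} {before[entity]:4d} -> {after[entity]:4d} ({delta:+d})")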

Why this matters more in the AI Overview era

Google's AI Overviews are increasingly synthesizing answers from the top SERPs. If your page isn't in that synthesis, you lose the click even if you rank #4. The way to be cited in AI Overviews is to be the page with the most semantically complete, well-structured answer to the query — which is exactly what the brief workflow above produces. Markdown is the format both your content team and Google's models think in. Working from clean Markdown competitor analysis aligns your content production with how the SERP is actually being read in 2026.

Frequently asked questions

Does this work for JavaScript-heavy pages where curl returns nothing useful?
Yes. Our converter renders pages in a headless browser before extraction, so single-page apps (Next.js, Nuxt, SvelteKit, anything client-side rendered) come through with the post-hydration content. Sites that gate content behind auth or aggressive bot detection are exceptions — for those, use the browser extension, which works against your authenticated session.
How do I avoid rate limits when converting 50+ competitor URLs?
Batch mode handles concurrency for you with built-in pacing — typical batches of 50 URLs complete in 2-4 minutes. For larger ongoing monitoring jobs (hundreds of URLs daily), use the API with explicit rate-limit headers; the docs cover the recommended request cadence per plan tier.
Can I diff a competitor page over time to detect content updates?
Yes — that's the standard monitoring pattern. Convert weekly, check the Markdown into Git, and let your normal git diff workflow surface changes. The signal-to-noise on Markdown diffs is dramatically better than on HTML diffs because layout changes, ad-rotation changes, and analytics changes don't show up — only actual content changes.