Convert GitHub Documentation to Local Markdown Files
GitHub documentation lives in three different shapes — raw README files (already Markdown), rendered docs sites (HTML, often built from MkDocs/Docusaurus/Sphinx), and the GitHub Wiki tab (also HTML, but a separate repo under the hood). If you're an AI engineer who wants to feed a project's full docs to Claude or build a local knowledge base, you need clean Markdown for all three. This guide walks through every shape end-to-end, using the MDisBetter web tool for one-off pages and pointing you at OSS for batch automation.
Why bother converting at all?
You can copy/paste from a GitHub-rendered page, but you'll get the file path breadcrumb, the "Edit on GitHub" button text, and any sidebar navigation mixed inline. For a single page that's annoying; for a 200-page docs site that's poison for a RAG pipeline. The cleaner pattern is to convert each page through a tool that strips the chrome and emits semantic Markdown, then save the .md file into a local folder structure that mirrors the project.
The same workflow gets you: an offline reading copy of a project's docs (great for plane rides), a Markdown corpus you can drop into Obsidian as a vault, and an AI-ingestion-ready dataset for ChatGPT or Claude.
Three shapes of GitHub documentation
1. Plain README files (already Markdown)
Example: https://github.com/octocat/Hello-World. The README is rendered from a README file at the repo root (usually README.md). The honest path here is to grab the raw Markdown directly — no conversion needed. Click the README, click "Raw," copy the URL (it'll start with raw.githubusercontent.com), and either save it directly or paste it into your editor.
If you want it converted anyway (e.g., to strip GitHub-specific extensions like collapsible <details> blocks, or to normalize whitespace for embedding), paste the rendered URL into /convert/url-to-markdown and you'll get a normalized version back.
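If you're scripting the raw-file grab, the blob-to-raw URL rewrite is mechanical. A minimal sketch (the helper name is my own, not an MDisBetter or GitHub API):

```python
def to_raw_url(blob_url: str) -> str:
    # https://github.com/{owner}/{repo}/blob/{ref}/{path}
    #   -> https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}
    url = blob_url.replace(
        'https://github.com/', 'https://raw.githubusercontent.com/', 1
    )
    # Only the first /blob/ is the ref separator; later ones could be path parts
    return url.replace('/blob/', '/', 1)
```

Feed the result to any HTTP client and you have the raw Markdown without touching the rendered page.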
2. Rendered documentation sites
Example: https://docs.github.com/en/rest, https://docs.python.org, https://nextjs.org/docs. These are built by static site generators (Docusaurus, MkDocs, Mintlify, Sphinx) and live on subdomains, not on github.com itself. They are HTML pages and need conversion to Markdown.
Workflow for one page:
- Open mdisbetter.com/convert/url-to-markdown
- Paste the page URL (e.g., https://docs.github.com/en/rest/repos/repos)
- Click Convert
- Click Download to save as a `.md` file
Repeat for each page you need. For a small docs site (under 30 pages) this is fast and gives you the cleanest possible output.
3. GitHub Wiki pages
Example: https://github.com/some-org/some-repo/wiki/Getting-Started. Wikis are stored in a sibling Git repo (`<repo>.wiki.git`) and rendered as HTML in the GitHub UI. Two paths:
- Clone the wiki repo: `git clone https://github.com/some-org/some-repo.wiki.git` — you get all wiki pages as raw Markdown files instantly. This is the right path if you want everything.
- Convert single pages via the web tool: paste the wiki page URL into /convert/url-to-markdown if you only want one or two pages and want the rendered output (with resolved cross-links) rather than the raw wiki source.
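The sibling-repo naming rule is mechanical, so the clone URL can be derived from any wiki page URL. A small sketch (the helper name is hypothetical):

```python
from urllib.parse import urlparse

def wiki_clone_url(wiki_page_url: str) -> str:
    # https://github.com/{owner}/{repo}/wiki/Any-Page
    #   -> https://github.com/{owner}/{repo}.wiki.git
    parts = urlparse(wiki_page_url).path.strip('/').split('/')
    owner, repo = parts[0], parts[1]
    return f'https://github.com/{owner}/{repo}.wiki.git'
```

Pass the result straight to `git clone` and you have every wiki page as a local `.md` file.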
Organizing converted files locally
For a docs site you've fully converted, the file layout matters. The pattern that works best for AI ingestion and human browsing alike:
```
~/docs-corpus/
  github-rest-api/
    README.md      # what this folder is, source URL, converted-on date
    repos.md       # converted from /en/rest/repos/repos
    issues.md      # converted from /en/rest/issues/issues
    pulls.md       # converted from /en/rest/pulls/pulls
    ...
  python-docs/
    README.md
    tutorial/
      classes.md
      modules.md
    library/
      json.md
      asyncio.md
      ...
```

The README.md in each folder is your provenance log — what site, what date you converted, what URL each file maps to. Future-you will thank past-you.
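Writing that provenance log by hand gets old after the second folder. A small generator can build it from a filename-to-URL mapping (the function and its output format are my own convention, not a standard):

```python
from datetime import date

def provenance_readme(title: str, sources: dict[str, str]) -> str:
    """Build a provenance README: title, conversion date, file -> source URL list."""
    lines = [f'# {title}', '', f'Converted on: {date.today().isoformat()}', '']
    for filename, url in sorted(sources.items()):
        lines.append(f'- `{filename}` <- {url}')
    return '\n'.join(lines) + '\n'
```

Call it once per folder after a batch run and write the result to `README.md`.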
Edge cases worth knowing about
Code blocks with language hints
GitHub-rendered Markdown uses fenced code blocks with language labels (```python, ```javascript). After conversion through MDisBetter, language hints are preserved — important for downstream tools that do syntax highlighting, and for LLMs that benefit from knowing the code is, say, a YAML config vs a Bash command.
Tables in API reference docs
API reference pages typically have request/response parameter tables. Markdown supports tables, and the converter preserves them as pipe-delimited blocks. Some long tables (more than 5 columns) wrap awkwardly in pure Markdown — render them in any Markdown viewer first to confirm they look right before feeding to an LLM.
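The more-than-5-columns warning is easy to automate: count the header cells of each converted pipe table before shipping it to a renderer. A rough sketch (naive parsing, assuming a well-formed GFM table whose first line is the header row):

```python
def column_count(md_table: str) -> int:
    # First line of a GFM pipe table is the header row: | a | b | c |
    header = md_table.strip().splitlines()[0]
    return len(header.strip().strip('|').split('|'))
```

Flag anything where `column_count(...) > 5` for a manual look.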
Embedded SVG diagrams
Mermaid diagrams in GitHub-flavored Markdown render as SVGs. After conversion, the diagram becomes a code block with the Mermaid source intact — usable by any Mermaid-aware tool (Obsidian, VS Code, GitHub itself).
The GitHub "View Markdown" trick
For any rendered Markdown file on github.com, append ?plain=1 to the URL to see the raw Markdown source instead of the rendered HTML. Example: https://github.com/octocat/Hello-World/blob/master/README?plain=1. Useful when you want the original source verbatim.
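Appending the parameter correctly matters if the URL already carries a query string. A one-liner handles both cases (helper name is mine):

```python
def plain_view(blob_url: str) -> str:
    # Use & if a query string already exists, ? otherwise
    sep = '&' if '?' in blob_url else '?'
    return f'{blob_url}{sep}plain=1'
```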
Scaling to many pages: the OSS automation path
MDisBetter's web tool is intentionally one-URL-at-a-time — drop a URL, get a file. For a docs site with 200 pages you do not want to do this 200 times. The honest answer is: use a script. The web tool is for ad-hoc, the script is for batch.
The recipe most people land on:
```python
import requests
import trafilatura
from pathlib import Path

# 1. Build the URL list. For docs sites, fetch /sitemap.xml and parse it;
#    for wikis, clone the .wiki.git repo (or list pages via the GitHub API,
#    e.g. with an Octokit client) instead of scraping.
urls = [
    'https://docs.github.com/en/rest/repos/repos',
    'https://docs.github.com/en/rest/issues/issues',
    # ... more URLs
]

out_dir = Path('./github-rest-api')
out_dir.mkdir(exist_ok=True)

for url in urls:
    html = requests.get(url, timeout=30).text
    md = trafilatura.extract(
        html,
        output_format='markdown',
        include_links=True,
        include_tables=True,
    )
    if md:
        slug = url.rstrip('/').rsplit('/', 1)[-1]
        # Keep the source URL as a comment at the top for provenance
        (out_dir / f'{slug}.md').write_text(
            f'<!-- source: {url} -->\n\n{md}',
            encoding='utf-8',
        )
        print(f'Saved {slug}.md')
```
This is the pattern from scrape a website to Markdown for RAG — same building blocks (Trafilatura, requests, file output). For JS-rendered docs (Mintlify, some Stoplight setups, anything that fetches an empty shell first), add a Playwright headless-browser step before passing to Trafilatura. The end result is identical to running the URLs through the web tool one by one, just at any scale.
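The "fetch /sitemap.xml and parse it" step the script glosses over is a few lines of stdlib XML parsing. A sketch, assuming a standard sitemaps.org file (a nested sitemap index would need one more loop):

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol
NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def sitemap_urls(xml_text: str) -> list[str]:
    """Extract every <loc> URL from a sitemap XML document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(f'{NS}loc')]
```

The returned list drops straight into the `urls` variable of the batch script above.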
What about repos that ship documentation as PDFs?
Some projects publish their formal docs as PDFs in a docs/ folder or attach them to GitHub Releases (whitepapers, RFCs, design docs). For those, see the PDF to Markdown tool — same web-tool pattern, upload the file, get Markdown back. A complete docs corpus often mixes both: web-rendered API references converted via URL-to-Markdown, plus PDF whitepapers converted via PDF-to-Markdown, all dropped into the same folder.
Three concrete real-world examples
Example 1: Archiving a small project's docs site
Goal: take a 12-page Mintlify docs site for an OSS library and save it locally for offline reference. Method: paste each page URL into the web tool, save each .md file, organize by section. Time: about 8 minutes for 12 pages. Output: a folder you can grep, feed to an LLM, or open in Obsidian.
Example 2: Building an AI assistant for a company's internal wiki
Goal: take a 200-page private GitHub Wiki, convert to Markdown, embed for retrieval. Method: clone the repo.wiki.git repo (raw .md files, no conversion needed), then chunk and embed using the recipe in scrape a website to Markdown for RAG. Time: 30 minutes end-to-end. Output: a queryable knowledge base.
Example 3: One-off README extraction
Goal: grab the README of a repo you're evaluating, paste into ChatGPT, ask "what does this project do?" Method: paste the README URL into /convert/url-to-markdown, copy the result into the LLM. Time: 30 seconds. Or, equivalently, click "Raw" on the README on github.com itself.
Common mistakes to avoid
- Saving the rendered HTML instead of the converted Markdown. The HTML version is 5-10x larger in tokens and includes all the GitHub UI chrome.
- Forgetting the source URL in each saved file. Add it as a comment at the top so you can re-fetch later.
- Mixing wiki pages and rendered docs in one folder without provenance. They have different update cadences and different formats; keep them in sibling folders.
- Treating the conversion as the whole job. Conversion is one step. The next steps are usually chunking, embedding, retrieval — see the URL-to-Markdown for RAG guide.
Working with GitHub Pages sites
Many OSS projects deploy their documentation as GitHub Pages: static sites built from a docs/ folder or a gh-pages branch, served at https://<user>.github.io/<project> or on a custom domain. Even large docs sites like https://docs.docker.com and https://kubernetes.io/docs are built from public GitHub repos, so the same logic applies.
These behave like any other static documentation site for conversion purposes — paste the URL into the web tool, get clean Markdown back. The advantage with GitHub Pages specifically: the source repo is also accessible. If the docs are originally written in Markdown (most are), you can clone the repo and grab the source files directly with no conversion at all. Worth checking before you start clicking through every rendered page.
Handling GitHub-flavored Markdown extensions
GFM (GitHub Flavored Markdown) extends standard Markdown with task lists (- [ ]), tables, strikethrough, autolinking, and HTML blocks. After conversion via MDisBetter, all of these are preserved as standard Markdown where possible. Two specific cases:
- Mermaid diagrams: stay as fenced code blocks tagged `mermaid`. Renderable in Obsidian, VS Code, and on GitHub itself.
- Collapsible sections (`<details>`): preserved as HTML blocks, since standard Markdown has no equivalent. Most Markdown renderers handle them fine because Markdown allows inline HTML.
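If you want to verify that Mermaid sources survived conversion, scanning for the tagged fences is a few lines (the fence string is built programmatically here only to avoid nesting literal backtick fences inside this example):

```python
FENCE = '`' * 3  # a literal triple-backtick fence

def mermaid_sources(md: str) -> list[str]:
    """Collect the body of every fenced block tagged 'mermaid'."""
    blocks, current = [], None
    for line in md.splitlines():
        if current is None and line.strip() == FENCE + 'mermaid':
            current = []  # opening fence: start collecting
        elif current is not None and line.strip() == FENCE:
            blocks.append('\n'.join(current))  # closing fence: flush
            current = None
        elif current is not None:
            current.append(line)
    return blocks
```

An empty result on a page you know contains diagrams is a sign the converter flattened them.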
Recommendation
For under 30 pages, paste each URL into the MDisBetter web tool and save manually. The output quality is the highest you'll get for arbitrary GitHub-hosted docs, and you don't need to write a line of code. For 30+ pages, write a 30-line Python script using Trafilatura — the recipe is in the RAG tutorial and the same code handles GitHub-rendered docs sites, GitHub Pages, and most other static documentation. For wikis, just clone the wiki repo. For mixed corpora that include PDF whitepapers, see best free PDF to Markdown converters for the PDF half.