Your Company Has Thousands of Word Docs Nobody Can Find
Walk into any 100-person company and ask the operations lead how many Word documents the company owns. The answer is usually "I have no idea — thousands? Tens of thousands?" Then ask how many of those documents have been opened in the last six months. The answer is roughly the same: "I have no idea." The shared drive is a graveyard. Nobody is reading the documents. Nobody can find them. The institutional knowledge they contain is functionally lost — and the format is most of the reason why.
The corporate document graveyard
The conservative estimate from internal-knowledge audits across mid-sized companies: 60-70% of internal Word documents are accessed only once after creation, and another 15-20% are never accessed at all after their initial creator stops working with them. That leaves somewhere between a tenth and a quarter of the documents in a typical SharePoint or Google Drive instance doing any active work.
The pattern is consistent: someone writes a runbook, a process doc, a meeting summary, a project retrospective. They share it with a Slack message or a calendar invite. The intended audience reads it once (maybe). The document then drifts into the directory tree where it's structurally invisible — wrong folder, ambiguous filename, no tags, no metadata. Six months later, when someone needs the same information, they can't find it. So they write a new document. The cycle repeats. The graveyard grows.
The annual cost is meaningful. McKinsey-style research has put the average knowledge worker at 1.8-2.5 hours per day spent on information seeking, and a substantial chunk of that is searching for documents that exist somewhere on the company drive. At a fully-loaded cost of $80/hour for a senior individual contributor and roughly 230 working days a year, that's about $33,000-46,000 per person per year of information-seeking overhead. Across a 100-person team, the number gets uncomfortable.
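A rough way to sanity-check this kind of estimate is to write the arithmetic down. In the sketch below, the hours per day and the hourly rate come from the paragraph above; the 230 working days per year is an assumption made here for illustration, and the result moves noticeably with it.

```python
# Back-of-the-envelope estimate of annual information-seeking cost.
# WORKING_DAYS is an assumption for illustration, not a figure from the article.
WORKING_DAYS = 230


def annual_search_cost(hours_per_day: float, hourly_rate: float,
                       working_days: int = WORKING_DAYS) -> float:
    """Cost per person per year of time spent looking for information."""
    return hours_per_day * hourly_rate * working_days


if __name__ == "__main__":
    low = annual_search_cost(1.8, 80)
    high = annual_search_cost(2.5, 80)
    print(f"per person: ${low:,.0f} to ${high:,.0f}")
    print(f"100-person team: ${low * 100:,.0f} to ${high * 100:,.0f}")
```

Only part of that time is document search specifically, so treat the output as an upper bound on what better findability could recover.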
Why search doesn't work on DOCX
The intuition most people have is that "the company has search": Google Workspace search, SharePoint search, Confluence search, Slack file search. So how can the documents be searchable and unfindable at the same time?
The answer is in how those search systems handle DOCX content. Three structural problems:
1. DOCX is a binary blob to most search systems. A .docx file is technically a ZIP archive of XML files (covered in Word documents are AI-hostile). Many enterprise search systems index only the filename, the path, and a small subset of metadata — not the full text content of the document. Even systems that do extract text often do so on a slow background indexer that runs hours after the document is uploaded, and that index goes stale every time the document is edited.
2. Headings and structure are lost in indexing. Even when full-text indexing works, most search systems treat the extracted text as a flat document. The semantic structure — H1, H2, list, table, code block — is dropped. So when you search for "Q3 OKR" the search can't tell whether the term appears as a heading (high relevance) or as a passing mention (low relevance). Ranking suffers.
3. Filename culture is broken. Word documents are rarely named for searchability. Real filenames from a typical share drive: Final.docx, Notes-v3.docx, Meeting 2024.docx, Untitled-1.docx, JS-comments-final.docx. None of these surface in a meaningful search result. The title inside the document might be useful, if anyone bothered to write one.
The combined effect: the document exists, the content is there, but the chance of finding it through search is low enough that people stop trying. They re-create the content from scratch. The shared drive grows; the findable knowledge stays flat.
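To make the "ZIP archive of XML files" point concrete, here is a minimal sketch of the extraction step an indexer has to perform before it can see any text at all. It uses only the Python standard library; the function name `docx_paragraphs` is hypothetical, and the sketch ignores headers, footers, footnotes, and comments, which live in other parts of the archive.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path):
    """Return the text of each non-empty paragraph in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Note what comes back: a flat list of strings. Whether a paragraph was a heading is stored in separate style properties that this sketch, like many simple extractors, does not read.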
Convert to Markdown — full-text search everything
Markdown changes the search dynamic structurally. Three reasons:
Markdown is plain text by default. Every search system in the world — including the lowest-tech ones, including grep from the command line — reads Markdown natively. There's no binary blob, no extraction step, no indexing latency. The content is the file.
Headings are explicit and unambiguous. A line that starts with ## Q3 OKR is unambiguously a section heading. Modern documentation search systems (MkDocs, Docusaurus search, Algolia DocSearch, Backstage TechDocs) generally read the Markdown heading hierarchy and rank matches in headings higher than matches in body text. Suddenly your search results are not "any document containing 'OKR' somewhere": they're "the section actually about Q3 OKRs".
The content lives in Git, not on a share drive. Once your knowledge base is Markdown, it's natural to keep it under version control alongside your code. Every internal engineer can search it with grep in milliseconds. Every documentation tool reads it. Every AI tool reads it. The structural barrier to findability disappears.
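The heading-versus-body distinction is simple enough to sketch in a few lines. The example below is not how any of the tools named above implement ranking; it is a minimal illustration, with a hypothetical `HEADING_WEIGHT` boost, of why explicit heading syntax makes this kind of ranking cheap.

```python
import re
from pathlib import Path

# ATX-style Markdown heading: one to six '#' characters, then whitespace.
HEADING = re.compile(r"^#{1,6}\s+(.*)$")
HEADING_WEIGHT = 5  # hypothetical boost for a match inside a heading


def score_file(text, term):
    """Count matches of term, weighting matches on heading lines higher."""
    term = term.lower()
    score = 0
    for line in text.splitlines():
        hits = line.lower().count(term)
        if hits:
            score += hits * (HEADING_WEIGHT if HEADING.match(line) else 1)
    return score


def search(root, term):
    """Return (score, path) pairs for Markdown files under root, best first."""
    results = []
    for path in Path(root).rglob("*.md"):
        score = score_file(path.read_text(encoding="utf-8"), term)
        if score:
            results.append((score, str(path)))
    return sorted(results, key=lambda r: (-r[0], r[1]))
```

One known gap in this sketch: a shell comment inside a fenced code block also starts with #, so a production version would need to skip fenced regions.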
Building a searchable knowledge base from a Word graveyard
The honest workflow for turning a Word graveyard into a searchable knowledge base:
- Triage first. Don't migrate everything — most documents on the share drive are dead weight. Run a usage report from your file system or DMS to identify the documents accessed in the last 12 months. That's your migration target. The rest can stay archived in place.
- Convert in waves. Use /convert/word-to-markdown for the high-priority documents one at a time, or for true bulk migration use Pandoc locally with a shell loop:
  for f in *.docx; do pandoc -f docx -t gfm "$f" -o "${f%.docx}.md"; done
  Honest answer: web tool for the curated 50-200 documents you actually want polished, Pandoc local for the bulk pass on everything else.
- Establish a directory convention. A flat knowledge/ directory of Markdown files indexed by topic beats a deep folder tree. Use front-matter metadata (title:, tags:, owner:, last-reviewed:) to make documents discoverable without the directory tree.
- Pick a search interface. Options range from a simple GitHub repo with built-in search, to a documentation site (Docusaurus, MkDocs, Outline), to a self-hosted wiki with Markdown ingestion (BookStack, Wiki.js). Pick the one that fits your team's existing tooling.
- Adopt a 'kill the duplicate' rule. Once content is in the Markdown knowledge base, kill the corresponding Word document on the share drive (or move it to an archive folder). Two sources of truth is worse than zero sources of truth.
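The triage step above can be approximated with a short script when no usage report is available. This is a sketch under a stated assumption: it uses each file's last-modified time as a stand-in for "accessed", because access times are often disabled or unreliable on shared file systems. A real usage report from SharePoint or a DMS would be a better source. The function name `recently_touched` is hypothetical.

```python
import time
from pathlib import Path


def recently_touched(root, days=365, now=None):
    """List .docx files under root modified within the last `days` days.

    Assumption: modification time is used as a proxy for usage.
    Results are ordered most recently modified first.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds in a day
    hits = []
    for path in Path(root).rglob("*.docx"):
        mtime = path.stat().st_mtime
        if mtime >= cutoff:
            hits.append((mtime, path))
    hits.sort(key=lambda h: h[0], reverse=True)
    return [path for _, path in hits]
```

The returned list is the candidate migration set; everything else can stay archived in place.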
Honest scope: what mdisbetter does and doesn't
The web tool at /convert/word-to-markdown processes one file at a time. Drag, click, download. It produces clean Markdown with proper headings, lists, tables, and link extraction. It's the right tool for: the 20-100 high-value documents you want polished, the daily one-offs, the per-team migration that runs over weeks.
What it is not: a mass-migration platform. There's no "upload your entire share drive and get back a knowledge base" workflow. For a 5,000-document corpus that you want flattened into Markdown overnight, the right tool is Pandoc on a local machine running in a loop, plus a custom script for any post-processing. We're transparent about this — automated mass migration of an entire enterprise share drive is a different category of product, and using OSS locally for that job is the honest recommendation.
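As an illustration of the kind of post-processing script mentioned above, the sketch below prepends front matter to an already converted Markdown file. The field names follow the directory convention suggested earlier (title, tags, owner, last-reviewed); the function name, the fallback title, and the choice to take the title from the first top-level heading are all assumptions made for the example.

```python
import json
import re
from datetime import date


def add_front_matter(markdown, owner, tags, fallback_title="Untitled",
                     reviewed=None):
    """Prepend YAML front matter unless the text already starts with it."""
    if markdown.startswith("---\n"):
        return markdown
    # Use the first top-level heading as the title when there is one.
    match = re.search(r"^#\s+(.+)$", markdown, flags=re.MULTILINE)
    title = match.group(1).strip() if match else fallback_title
    reviewed = reviewed or date.today()
    lines = [
        "---",
        # json.dumps yields a double-quoted string, safe for titles with colons.
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"tags: [{', '.join(tags)}]",
        f"owner: {owner}",
        f"last-reviewed: {reviewed.isoformat()}",
        "---",
        "",
    ]
    return "\n".join(lines) + "\n" + markdown
```

Running it twice on the same text is harmless, which matters when the bulk pass is re-run after fixes.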
What changes after the migration
The shifts that show up consistently in companies that have done this migration:
Search becomes useful. Engineers can grep for what they need. Non-engineers can use the documentation site search. Either way, the answer is in the first three results, not buried thirty pages deep.
AI tools become useful on internal content. The same Markdown corpus that powers your documentation site can power your internal AI assistant. Embedding quality is materially better on Markdown than on extracted DOCX content (covered in you can't feed 500 Word docs to AI). The internal Q&A bot suddenly works.
Onboarding accelerates. New team members can read their way into the company's institutional memory. The document graveyard had the same content, but it was inaccessible. The Markdown knowledge base makes the same content navigable.
Duplication drops. When the existing answer is findable, fewer people write the same document for the third time. The corpus stops growing exponentially.
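One reason Markdown tends to work better for internal AI tooling is that headings give natural chunk boundaries. The sketch below splits a document into sections and labels each with its heading path, which is a common preparation step before embedding. It is an illustration only; the function name `split_by_heading` and the " > " path separator are assumptions, and it does not special-case fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown):
    """Return (heading_path, body) chunks, one per section with content."""
    chunks = []
    path = []  # stack of (level, title) for the current position
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append((" > ".join(title for _, title in path), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its own context ("Handbook > Q3 OKR"), so a retrieval step can return the relevant section rather than the whole document.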
The cross-format pattern
The Word graveyard is the document version of a broader pattern: high-value content trapped in a delivery format that's hostile to search and reuse. The same dynamic affects PDFs (covered in how to make PDF searchable), audio recordings (covered in you can't search audio recordings), and even web archives. The fix in every case is the same: convert to a structure-first text format that every search system reads natively. Markdown is the obvious choice.
What about SharePoint and OneDrive built-in search?
Microsoft's enterprise search has improved meaningfully over the last few years — full-text indexing on DOCX is reliable, ranking has gotten better, and the new Microsoft Search interface is competent. The remaining problems are the structural ones: headings aren't ranked higher than body text, results are still document-level rather than section-level, and the index can't help with cross-document questions like "what are all the policies that mention third-party data sharing". For a team that lives entirely inside Microsoft 365, the SharePoint search may be acceptable. For everyone else, the Markdown route is dramatically more flexible.
The summary
The Word graveyard is one of the most common, most expensive, and least visible knowledge-management failures in modern companies. The root cause is that DOCX is a delivery format that resists search, structure, and reuse. Convert the high-value documents to Markdown (web tool for the curated set, local Pandoc for the bulk pass), put them in a search-friendly structure, and the same institutional knowledge that was effectively lost becomes findable in milliseconds. The cost is a few weeks of focused migration work; the benefit is permanent, and it compounds with every new tool (AI assistant, search, documentation site) that natively reads the resulting Markdown.