10 min read · MDisBetter

How to Extract Text from a Word Document (5 Methods Compared)

"Just give me the text" sounds like the simplest possible task. In practice, extracting text from a Word document spans five different tools depending on what you mean by text and what you plan to do with it. Here's the honest breakdown — what each method actually returns, where each method falls down, and which one to pick for your real use case.

Method 1: Word's Save As .txt (the lossy default)

The most obvious method, available on every Word installation: open the document, File → Save As, choose Plain Text (.txt) as the format. Word writes a UTF-8 (or ASCII, your choice) text file containing the body content of the document.

How to use:

  1. Open the .docx file in Word.
  2. File → Save As.
  3. Choose Plain Text (*.txt) from the file type dropdown.
  4. Click Save. Word may prompt for encoding (UTF-8 is the right answer for almost any modern use).
  5. Click OK.

What you get: the visible text of the document, paragraph by paragraph, with line breaks where paragraphs ended in the original.

What you lose: headings (no longer distinguishable from body text), bold and italic, lists (number prefixes are lost; bullet markers may or may not survive), tables (cells are flattened, often with mangled column alignment), images, footnotes, hyperlinks (URLs are dropped, only link text survives), comments, formatting of any kind.

Best for: the absolute minimum case — you need to see the words, you don't care about structure, and you're going to feed the result into a tool that wants pure plain text. Search indexers and the simplest AI tools fall into this bucket.

Worst for: any case where structure matters. Anything you plan to read later. Anything fed to a modern AI tool that benefits from heading and list structure.
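To see why a plain-text export cannot tell a heading from body text, here is a minimal sketch that flattens a fragment of WordprocessingML (the XML stored inside a .docx) to one line per paragraph. It is not Word's actual export code. The function names are hypothetical, and the regex-based parsing is an assumption that only holds for simple fragments like the sample.

```javascript
// Decode the five predefined XML entities. &amp; goes last so that
// "&amp;lt;" becomes "&lt;" rather than "<".
function decodeEntities(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&");
}

// Keep only the text runs (<w:t>) of each paragraph (<w:p>) and discard
// everything else, including the style that marks a paragraph as a heading.
function flattenToPlainText(documentXml) {
  const paragraphs = [];
  const paragraphPattern = /<w:p(?:\s[^>]*)?>([\s\S]*?)<\/w:p>/g;
  for (const [, inner] of documentXml.matchAll(paragraphPattern)) {
    const runs = [...inner.matchAll(/<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g)];
    paragraphs.push(decodeEntities(runs.map(([, text]) => text).join("")));
  }
  return paragraphs.join("\n");
}

// Sample fragment: a Heading1 paragraph followed by a paragraph with a bold run.
const sampleXml =
  '<w:body>' +
  '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>' +
  '<w:r><w:t>Quarterly Report</w:t></w:r></w:p>' +
  '<w:p><w:r><w:rPr><w:b/></w:rPr>' +
  '<w:t xml:space="preserve">Revenue grew </w:t></w:r>' +
  '<w:r><w:t>&amp; costs fell.</w:t></w:r></w:p>' +
  '</w:body>';

console.log(flattenToPlainText(sampleXml));
```

The output is two undifferentiated lines: the Heading1 style and the bold run property are gone, which is exactly the loss described above.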

Method 2: Copy-paste (broken in subtle ways)

Open the document in Word, select all (Ctrl-A or Cmd-A), copy (Ctrl-C or Cmd-C), paste somewhere else.

What you get: depends on the destination. If you paste into a plain-text editor (Notepad, TextEdit in plain mode, vim), you get body text similar to Method 1. If you paste into a rich-text editor or web form, you get an HTML representation of the document with all of Word's formatting metadata attached.

The trap: copy-paste from Word into web forms, CMSes, and rich-text fields is the source of the problems described in "the Word-to-CMS formatting nightmare". The pasted output carries Mso classes, inline styles, font declarations, non-breaking spaces, and (often) tracked-change residue.

Best for: grabbing a quick paragraph or two when you need plain text fast and your destination accepts plain text.

Worst for: pasting into anything that renders HTML (web forms, CMSes, Notion, most rich-text editors). The output looks fine at first and breaks the destination's styling subtly.
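If you are stuck with pasted Word HTML, the residue listed above can be stripped after the fact. The following is a rough sketch, not a complete sanitizer: `cleanWordHtml` is a hypothetical helper, and it uses regular expressions only to stay dependency-free. A production pipeline should use a real HTML parser.

```javascript
// Illustrative cleanup for HTML pasted from Word. Each step targets one kind
// of residue named in the text above.
function cleanWordHtml(html) {
  return html
    // Comments, including Word's conditional comments (<!--[if gte mso 9]>...)
    .replace(/<!--[\s\S]*?-->/g, "")
    // Office namespace tags such as <o:p> and </o:p>
    .replace(/<\/?o:[^>]*>/gi, "")
    // class attributes whose value starts with "Mso" (quoted or unquoted)
    .replace(/\sclass=(?:"Mso[^"]*"|'Mso[^']*'|Mso[^\s>]*)/gi, "")
    // Inline style attributes
    .replace(/\sstyle=(?:"[^"]*"|'[^']*')/gi, "")
    // <span> and <font> wrappers (their content is kept)
    .replace(/<\/?(?:span|font)\b[^>]*>/gi, "")
    // Non-breaking spaces, as entities or literal characters
    .replace(/&nbsp;|\u00a0/g, " ");
}

const pasted = `<!--[if gte mso 9]><xml><o:OfficeDocumentSettings/></xml><![endif]--><p class=MsoNormal style='margin:0in;font-family:"Calibri"'><span style="color:red">Hello&nbsp;<b>world</b></span><o:p></o:p></p>`;

console.log(cleanWordHtml(pasted)); // <p>Hello <b>world</b></p>
```

This handles the common residue only; tracked-change markup and list markers need more than a handful of regexes.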

Method 3: Pandoc CLI (best for power users)

Pandoc is the gold-standard open-source document converter, free and available on Windows, macOS, and Linux. It's the right answer for users comfortable with the command line who want flexible, scriptable text extraction.

How to use:

# Install (macOS)
brew install pandoc

# Install (Windows via choco)
choco install pandoc

# Install (Linux via apt)
sudo apt-get install pandoc

# Extract to plain text
pandoc -f docx -t plain input.docx -o output.txt

# Extract to Markdown (preserves structure)
pandoc -f docx -t gfm input.docx -o output.md

# Extract to HTML
pandoc -f docx -t html input.docx -o output.html

# Bulk: convert every docx in a directory to Markdown
for f in *.docx; do pandoc -f docx -t gfm "$f" -o "${f%.docx}.md"; done

What you get: highly configurable output. Choose your format (plain text, Markdown, HTML, RST, AsciiDoc, EPUB, LaTeX, and many others). The Markdown output preserves headings, lists, tables, links, and footnotes; images are extracted to a media folder when you add the --extract-media flag.

Best for: developers, technical writers, anyone doing bulk conversion, anyone who wants to script the extraction. Pandoc's -t plain output preserves more structure than Word's Save As .txt: paragraphs are separated by blank lines, list items stay on their own lines, and headings remain visually distinguishable. Of the five methods here, Pandoc's -t gfm output is the highest-fidelity text extraction.

Catches: requires CLI comfort. Equations, complex tables, and embedded objects need extra flags or post-processing for best results. Documentation is comprehensive but dense.
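The bulk loop above is Bash-specific. Where a shell loop is not convenient (for example on Windows without a Unix shell), the same batch can be driven from Node.js. This is a sketch under assumptions: `pandocArgs` and `convertAll` are hypothetical helper names, pandoc is assumed to be on the PATH, and the only pandoc flags used are the ones already shown (-f, -t, -o).

```javascript
const { execFileSync } = require("node:child_process");
const fs = require("node:fs");
const path = require("node:path");

// Output extension for each pandoc writer used in this article.
const EXTENSIONS = { gfm: ".md", plain: ".txt", html: ".html" };

// Build the argument list for one file, mirroring:
//   pandoc -f docx -t gfm input.docx -o input.md
function pandocArgs(inputPath, format = "gfm") {
  const extension = EXTENSIONS[format];
  if (!extension) throw new Error(`Unsupported format: ${format}`);
  const { dir, name } = path.parse(inputPath);
  const outputPath = path.join(dir, name + extension);
  return ["-f", "docx", "-t", format, inputPath, "-o", outputPath];
}

// Convert every .docx in a directory and return the output paths.
function convertAll(directory, format = "gfm") {
  const outputs = [];
  for (const entry of fs.readdirSync(directory)) {
    if (path.extname(entry).toLowerCase() !== ".docx") continue;
    // Skip the "~$name.docx" owner files Word creates while a document is open.
    if (entry.startsWith("~$")) continue;
    const args = pandocArgs(path.join(directory, entry), format);
    execFileSync("pandoc", args, { stdio: "inherit" });
    outputs.push(args[args.length - 1]);
  }
  return outputs;
}
```

Calling `convertAll(".")` converts the current directory to Markdown. `execFileSync` passes arguments as an array, so file names with spaces need no quoting.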

Method 4: Mammoth.js (developers, programmatic)

Mammoth is a JavaScript library specifically designed to convert Word documents into clean HTML or plain text, with the explicit goal of producing simpler output than Word's own HTML export. It is widely used in CMS import workflows.

How to use:

# Install
npm install mammoth

// Use (Node.js)
const mammoth = require("mammoth");

// Extract plain text
mammoth.extractRawText({ path: "input.docx" })
  .then(result => console.log(result.value));

// Extract clean HTML
mammoth.convertToHtml({ path: "input.docx" })
  .then(result => console.log(result.value));

What you get: some of the cleanest HTML extraction available in open source, plus a simple plain-text mode. Mammoth deliberately ignores Word's visual formatting and outputs semantic HTML: proper <h1>, <ul>, <table> tags, no inline styles, no Mso classes.

Best for: developers building CMS import features, web applications that accept Word uploads, anywhere clean HTML output matters more than feature completeness. The clean-HTML output is excellent input for downstream Markdown conversion (HTML-to-Markdown via Turndown or similar).

Catches: Mammoth deliberately drops some Word features (page layout, custom styles without explicit mappings, complex equation formatting). The tradeoff is intentional — cleaner output for the common case at the cost of edge-case fidelity.
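Custom styles are the catch you are most likely to hit, and Mammoth's style map option is the usual remedy. The sketch below assumes a document that uses custom paragraph styles named "Section Title" and "Subsection Title"; those names, and the helpers `styleMapFor` and `convertWithStyles`, are hypothetical. Mammoth is loaded inside the function so the mapping helper can be tried without it installed.

```javascript
// Turn { "Word style name": "html tag" } into Mammoth style-map rules.
// ":fresh" asks Mammoth to open a new element for each matching paragraph.
function styleMapFor(mappings) {
  return Object.entries(mappings).map(
    ([styleName, tag]) => `p[style-name='${styleName}'] => ${tag}:fresh`
  );
}

// Convert a document with the custom mappings applied.
async function convertWithStyles(docxPath, mappings) {
  const mammoth = require("mammoth"); // requires: npm install mammoth
  const result = await mammoth.convertToHtml(
    { path: docxPath },
    { styleMap: styleMapFor(mappings) }
  );
  // result.messages lists anything Mammoth could not map, such as
  // unrecognized styles, so they are worth surfacing.
  return {
    html: result.value,
    warnings: result.messages.map((message) => message.message),
  };
}

console.log(styleMapFor({ "Section Title": "h1", "Subsection Title": "h2" }));
```

Checking the returned warnings after a first run is a practical way to discover which custom styles in a document still need a mapping.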

Method 5: mdisbetter web tool (no setup, structured Markdown)

For users who want structured text extraction without installing Python, Node.js, or Pandoc, the web tool at /convert/word-to-markdown handles single-file conversion in the browser.

How to use:

  1. Open /convert/word-to-markdown.
  2. Drag the .docx file into the upload area.
  3. Click Convert.
  4. Download the resulting .md file.

What you get: structured Markdown — headings as ##, lists as native Markdown lists, tables as Markdown tables with intact column structure, links as [text](url), bold and italic preserved.

Best for: non-developers, individual contributors, anyone doing one-off conversions, anyone who wants Markdown output for downstream AI use without setting up tooling. The scope is deliberately narrow: one file at a time, in the browser. For batch automation, a local Pandoc install is the right tool.

Catches: not a batch tool. Files are processed one at a time. For 500 documents, use Pandoc.

Comparison table

| Method | Setup | Fidelity | Output formats | Best for |
| --- | --- | --- | --- | --- |
| Word Save As .txt | None (with Word) | Lowest | Plain text | Minimum case |
| Copy-paste | None | Variable | Plain text or messy HTML | Quick grabs |
| Pandoc CLI | Install | Highest | Many (text, MD, HTML, more) | Power users, bulk |
| Mammoth.js | npm install | High (clean HTML) | HTML, plain text | Developers, CMS imports |
| mdisbetter web | None | High (Markdown) | Markdown | Single files, no setup |
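One way to read the comparison is as a rough decision procedure. The sketch below encodes the "Best for" column as a function. The option names and the order of the checks are assumptions made for illustration, not a rule stated by any of the tools.

```javascript
// Pick a method from the comparison above. All option names are hypothetical.
function pickMethod({
  appIntegration = false, // building a CMS import or upload feature
  bulk = false,           // many files to convert
  commandLine = false,    // comfortable with a CLI
  needStructure = false,  // headings, lists, and tables must survive
  quickGrab = false,      // a paragraph or two into a plain-text destination
} = {}) {
  if (appIntegration) return "Mammoth.js";
  if (bulk || commandLine) return "Pandoc CLI";
  if (needStructure) return "mdisbetter web";
  if (quickGrab) return "Copy-paste";
  return "Word Save As .txt";
}

console.log(pickMethod({ bulk: true }));          // Pandoc CLI
console.log(pickMethod({ needStructure: true })); // mdisbetter web
```

The checks run from the most specific requirement to the least, so the plain-text export is only the answer when nothing else applies.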

Decision tree

What about other libraries worth knowing?

A few additional tools fit specific niches:

Cross-format pattern

The text-extraction question for Word is structurally similar to the same question for PDF (covered in how to extract text from PDF) and for web pages (covered in how to extract article from webpage). The recurring pattern: built-in OS tools work for the minimum case, OSS CLIs work for power users, web tools work for non-developers, and the highest-quality output usually requires choosing a structured intermediate (Markdown) rather than flat plain text.

For the broader case for Markdown over flat text in AI workflows, see Word documents are AI-hostile and best format for LLM input.

The summary

Five methods, each with a real use case. Word Save As is for the minimum case. Copy-paste is for casual quick grabs (and is dangerous in CMSes). Pandoc is the power-user choice with the best fidelity and scripting story. Mammoth is the developer choice for clean HTML. The mdisbetter web tool is the no-setup choice for individual users who want structured Markdown without touching the command line. Pick the one that matches your setup and your destination.

Frequently asked questions

Can I extract text from a password-protected Word document?
Most extraction tools require the document to be unlocked first. Word can save a password-protected document as an unprotected copy after you've entered the password. Pandoc, Mammoth, and python-docx all need the password removed before they can read the document. There is no legitimate way to extract text from a Word document whose password you don't know — encryption is doing its job in that case.
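A batch job can flag files that will not open before handing them to an extractor. The sketch below relies on two container facts: an ordinary .docx is a ZIP archive, while a password-encrypted one is wrapped in an OLE compound file. The legacy .doc format uses the same OLE container, so a match means "not a plain .docx", not "definitely encrypted". The function names are hypothetical.

```javascript
const fs = require("node:fs");

// Leading bytes of a ZIP local file header ("PK\x03\x04").
const ZIP_MAGIC = Buffer.from([0x50, 0x4b, 0x03, 0x04]);
// Leading bytes of an OLE compound file.
const OLE_MAGIC = Buffer.from([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]);

// Classify a file by its first bytes.
function sniffContainer(bytes) {
  if (bytes.subarray(0, 4).equals(ZIP_MAGIC)) return "zip";
  if (bytes.subarray(0, 8).equals(OLE_MAGIC)) return "ole-compound";
  return "unknown";
}

// Read only the first 8 bytes of a file and classify it.
function sniffFile(filePath) {
  const descriptor = fs.openSync(filePath, "r");
  try {
    const head = Buffer.alloc(8);
    const bytesRead = fs.readSync(descriptor, head, 0, 8, 0);
    return sniffContainer(head.subarray(0, bytesRead));
  } finally {
    fs.closeSync(descriptor);
  }
}
```

A file named .docx that sniffs as "ole-compound" is a good candidate for the "unlock it in Word first" step described above.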
What's the difference between extracting text and extracting structured content?
Plain text gives you the words in reading order with no metadata about what each piece is — heading, body, list item, table cell, footnote all become indistinguishable lines. Structured extraction (Markdown, HTML, JSON) preserves the type and hierarchy of each element, which makes the result usable for downstream tools that need to know which lines are headings versus body. For AI consumption and CMS rendering, structured extraction is materially more useful.
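To make the difference concrete, here is a small sketch that classifies lines of Markdown output by element type, which is only possible because the markers survived extraction. `classifyLine` and `toElements` are hypothetical helpers that cover a few common cases. They are not a Markdown parser.

```javascript
// Classify one line of Markdown by the marker it starts with.
function classifyLine(line) {
  const heading = /^(#{1,6})\s+(.*)$/.exec(line);
  if (heading) {
    return { type: "heading", level: heading[1].length, text: heading[2] };
  }
  const listItem = /^\s*(?:[-*+]|\d+\.)\s+(.*)$/.exec(line);
  if (listItem) return { type: "list-item", text: listItem[1] };
  const trimmed = line.trim();
  if (/^\|.*\|$/.test(trimmed)) {
    const cells = trimmed.slice(1, -1).split("|").map((cell) => cell.trim());
    return { type: "table-row", cells };
  }
  return { type: "paragraph", text: line };
}

// Classify every non-blank line of a Markdown string.
function toElements(markdown) {
  return markdown
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map(classifyLine);
}

console.log(toElements("## Results\n\n- Revenue grew\n\nResults"));
```

In the sample, "## Results" comes back as a level-2 heading and the bare "Results" as a paragraph. In a plain-text export both would be the same string.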
Will any of these methods extract text from images embedded in the Word document?
No — image text requires OCR, which is a separate step. After extracting the document, run the embedded images through an OCR tool (Tesseract is the open-source standard) to get text out of them. Some all-in-one tools wire OCR into the extraction pipeline, but the basic Word-text-extraction tools listed here treat embedded images as opaque image references.
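The separate OCR step can be scripted once the images are on disk, for example in a media folder written by pandoc's --extract-media flag. This sketch assumes the tesseract command is installed and on the PATH. `isOcrCandidate` and `ocrImages` are hypothetical helpers, and the extension list is a conservative guess at what a typical Tesseract build reads.

```javascript
const { execFileSync } = require("node:child_process");
const fs = require("node:fs");
const path = require("node:path");

// Raster formats to send to OCR. Vector formats Word sometimes embeds
// (EMF, WMF) are deliberately left out.
const IMAGE_EXTENSIONS = new Set([".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"]);

function isOcrCandidate(fileName) {
  return IMAGE_EXTENSIONS.has(path.extname(fileName).toLowerCase());
}

// Run tesseract on every candidate image and collect the recognized text.
function ocrImages(mediaDirectory) {
  const textByImage = {};
  for (const entry of fs.readdirSync(mediaDirectory)) {
    if (!isOcrCandidate(entry)) continue;
    // Passing "stdout" as the output base makes tesseract print the text
    // instead of writing a .txt file.
    textByImage[entry] = execFileSync(
      "tesseract",
      [path.join(mediaDirectory, entry), "stdout"],
      { encoding: "utf8" }
    );
  }
  return textByImage;
}
```

The result maps each image file name to its recognized text, which can then be appended to the extracted document text.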