Coming soon — what we're building
Text-to-speech (TTS) is the inverse of transcription: paste in written text and get back audio (MP3 or WAV) read in a natural-sounding voice. We're building it as a complement to our existing audio-to-Markdown tool. Use cases include:
- turning long documents into audio you can listen to on a commute
- generating voice-over for video content from a written script
- accessibility: an audio version of written content for users who prefer or need audio
- language learning: hearing how written text is pronounced
- podcast-style audio generated from blog posts
What's the timeline?
Honest answer: we're prioritising the audio→Markdown direction because that's where we see most of the demand (transcription, content repurposing, AI-ready text from existing audio). Markdown→audio is on the roadmap, but it's secondary. If you need TTS now, the alternatives below cover the gap.
Alternatives if you need TTS today
- Coqui TTS — open-source TTS library that runs locally and supports many voices and languages. The library is MPL-2.0-licensed; individual pretrained models carry their own licenses. Best self-hosted option.
- Mozilla TTS — Mozilla's open-source TTS engine. Coqui TTS grew out of it and is its successor, so the two have similar capabilities.
- ElevenLabs — commercial TTS with very natural voices via a paid API. Currently the quality leader for production voice-over.
- OpenAI TTS API — competitive, natural-sounding voices, billed per character; good for occasional production use.
For most needs, running Coqui locally gives you TTS with no per-use costs. For production voice-over quality, ElevenLabs or OpenAI TTS are worth the API fee.
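One practical wrinkle when you turn long documents into audio with a paid API: each request has an input cap (OpenAI's speech endpoint, for example, documents a 4,096-character maximum), so a long document has to be split into chunks, synthesised chunk by chunk, and the resulting audio files joined afterwards. Below is a minimal Python sketch of the splitting step. It assumes sentence boundaries can be found with a simple punctuation regex; `chunk_text` and `MAX_CHARS` are hypothetical names, not part of any provider's SDK, and you should check the limit against whichever API you actually use.

```python
import re

# Assumed per-request character limit. 4096 matches OpenAI's speech endpoint;
# other providers differ, so treat this as a placeholder to adjust.
MAX_CHARS = 4096


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Hypothetical helper: split text into chunks of at most max_chars,
    breaking at sentence ends where possible."""
    # Split on whitespace that follows ., ! or ?; punctuation stays with its sentence.
    # Note: sentences are rejoined with single spaces, so paragraph breaks are lost.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split into fixed-size pieces.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Try to append the sentence to the chunk being built.
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Current chunk is full: close it and start a new one with this sentence.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(chunk_text("First sentence. Second one! Third?", max_chars=20))
    # prints ['First sentence.', 'Second one! Third?']
```

Breaking at sentence ends rather than at a fixed offset avoids cutting a word in half, which tends to produce an awkward pause or mispronunciation at the join. On per-character plans, `sum(len(c) for c in chunks)` also gives a rough count of billed characters before you send anything.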
For the audio→Markdown direction (which we DO support)
If you're facing the inverse problem (you have audio and want the text out of it), that's our existing tool: Audio to Markdown. Upload any audio file and get structured Markdown back, with speakers labelled, topics as H2 sections, and timestamps inline. It's the full transcription workflow we've built for podcasters, journalists, researchers, students, and many others.