What "Speech to Text" means here
Spoken words become written text. The pipeline: upload audio file → speech recognition (Whisper-class model, 50+ languages auto-detected) → punctuation and capitalisation restoration → paragraph break insertion → flat plain-text output. No structural markers, no speaker labels, no timestamps in the output. Just the words, in paragraphs, ready to paste anywhere.
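The last two steps of that pipeline (paragraph breaks, then flat plain-text output) can be sketched in a few lines. This is a minimal illustration, not the tool's actual implementation: the `Segment` type, the `to_plain_text` function, and the `pause_threshold` and `max_sentences` values are all hypothetical, and it assumes the recognition step returns punctuated text segments with start and end times in seconds.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """One recognised chunk of speech (hypothetical shape, for illustration)."""
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str     # already punctuated and capitalised


def to_plain_text(
    segments: Iterable[Segment],
    pause_threshold: float = 2.0,  # assumed: a pause this long starts a new paragraph
    max_sentences: int = 5,        # assumed: cap on segments per paragraph
) -> str:
    """Join segments into paragraphs and drop all timing information."""
    paragraphs: list[str] = []
    current: list[str] = []
    prev_end = None

    for seg in segments:
        text = seg.text.strip()
        if not text:
            continue  # skip empty segments (silence, filtered noise)

        long_pause = prev_end is not None and seg.start - prev_end >= pause_threshold
        if current and (long_pause or len(current) >= max_sentences):
            paragraphs.append(" ".join(current))
            current = []

        current.append(text)
        prev_end = seg.end

    if current:
        paragraphs.append(" ".join(current))

    # Flat output: paragraphs separated by a blank line, nothing else.
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    demo = [
        Segment(0.0, 2.0, "Hello there."),
        Segment(2.2, 4.0, "How are you?"),
        Segment(7.5, 9.0, "New topic."),
    ]
    print(to_plain_text(demo))
```

Timestamps are used only to decide where paragraphs break and never reach the output, which matches the "just the words, in paragraphs" result described above.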
What it works on
Voice memos from your phone. Recorded interviews. Podcast episodes. Lecture recordings. Voicemails. Conference talks. Recorded video calls (audio extracted automatically). Single-speaker dictation for note-taking. Multi-speaker conversations (transcribed without speaker labels; if you need labels, use the Markdown variant). Anything with audible speech.
What it doesn't do well
Music transcription (the model is trained on speech, not melody). Singing (close to speech, but lyrics often come out garbled). Heavy crosstalk where several people speak at once (typically one speaker comes through and the others get cut off). Extremely noisy recordings where the signal-to-noise ratio is poor. For these cases, run a dedicated audio-cleanup pass first (Adobe Podcast, Krisp, Auphonic), then transcribe.
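If you want a rough way to tell whether a recording is noisy enough to need that cleanup pass, one option is to compare the energy of the loudest stretches with the energy of the quietest ones. The sketch below is only a heuristic under stated assumptions: `estimate_snr_db` and `needs_cleanup` are hypothetical helpers, the 15 dB cutoff is an arbitrary illustrative value, and the method measures the spread between loud and quiet frames rather than a true signal-to-noise ratio. It expects mono samples as floats and assumes the recording contains some pauses.

```python
import math
from typing import Sequence


def estimate_snr_db(samples: Sequence[float], frame_size: int = 400) -> float:
    """Rough loud-vs-quiet energy ratio in dB (a proxy, not a true SNR)."""
    # Mean energy of each non-overlapping frame.
    energies = []
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[i:i + frame_size]
        energies.append(sum(x * x for x in frame) / frame_size)

    if len(energies) < 2:
        raise ValueError("need at least two full frames of audio")

    energies.sort()
    # Assumption: the quietest frames are background noise,
    # the loudest frames are speech.
    noise = max(energies[len(energies) // 10], 1e-12)
    signal = max(energies[(len(energies) * 9) // 10], 1e-12)
    return 10.0 * math.log10(signal / noise)


def needs_cleanup(samples: Sequence[float], min_snr_db: float = 15.0) -> bool:
    """True when the estimate falls below the (assumed) cutoff."""
    return estimate_snr_db(samples) < min_snr_db


if __name__ == "__main__":
    # Synthetic example: quiet background alternating with loud "speech".
    quiet = [0.01] * 4000
    loud = [1.0] * 4000
    print(round(estimate_snr_db(quiet + loud), 1), "dB")
    print("cleanup suggested:", needs_cleanup(quiet + loud))
```

A low value here only suggests that speech does not stand out from the background. It says nothing about music, singing, or crosstalk, which need to be judged by ear.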