How "Video to Text" actually works here
Audio is extracted automatically from your video file, run through Whisper-class speech recognition, and formatted as plain text with paragraph breaks. The speech recognition pipeline is the same one used for audio-only conversion; the only video-specific step is the audio extraction up front. Output is flat plain text: paragraph breaks only, with no structural markers, no speaker labels, and no timestamps. It is copy-paste-ready, search-indexable, and usable by any tool that accepts a UTF-8 string as input.
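The formatting step described above can be sketched roughly as follows. This is a minimal illustration, not mdisbetter's actual implementation: it assumes the recognizer returns Whisper-style segments (dicts with "start", "end", and "text" keys), and both the function name segments_to_text and the pause-based paragraph rule are hypothetical choices made for the example.

```python
def segments_to_text(segments, pause_threshold=2.0):
    """Join recognizer segments into plain text with paragraph breaks.

    Assumption: a silence of `pause_threshold` seconds or more between
    two segments starts a new paragraph. The real service may use a
    different rule; this only illustrates the shape of the output.
    """
    paragraphs = []
    current = []
    prev_end = None
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # skip empty segments
        long_pause = prev_end is not None and seg["start"] - prev_end >= pause_threshold
        if long_pause and current:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
        prev_end = seg["end"]
    if current:
        paragraphs.append(" ".join(current))
    # No timestamps, speaker labels, or markup: just text and blank lines.
    return "\n\n".join(paragraphs)
```

The result is a single UTF-8 string in which paragraphs are separated by one blank line, matching the flat output described above.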
Format support
MP4, MOV, MKV, WebM, AVI, WMV, and FLV are supported, covering every common video container. Only the audio track is extracted and processed; the visual content is ignored (mdisbetter does speech recognition only, not scene detection or visual analysis). The free tier's file size limit accommodates typical interview, lecture, and webinar lengths; the Pro tier handles multi-hour video in a single pass.
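If you are batching uploads, a quick local check against the container list above can catch unsupported files early. The sketch below is an assumption-laden convenience: it looks only at the file extension, the helper name is_supported_video is hypothetical, and the service itself may inspect files differently.

```python
from pathlib import Path

# Extensions for the containers listed above (lowercase, with leading dot).
SUPPORTED_EXTENSIONS = {".mp4", ".mov", ".mkv", ".webm", ".avi", ".wmv", ".flv"}


def is_supported_video(filename):
    """Return True if the file extension matches a listed container.

    Extension-only check; it does not open or validate the file.
    """
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```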
Audio quality matters more than video quality
A 4K video with bad audio transcribes worse than a 480p video with good audio. Speech recognition operates entirely on the audio track; video resolution is irrelevant. For best results, prefer videos recorded with a single microphone close to the source (a lavalier mic, or a USB mic near the speaker) over conference-room ceiling mics or distant cameras with built-in audio. For videos with difficult audio, run an audio cleanup pass first: extract the audio with ffmpeg, clean it with Adobe Podcast Enhance, Krisp, or Auphonic, then upload the cleaned audio.
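The ffmpeg extraction step of that cleanup pass might look like the sketch below. It assumes ffmpeg is installed and on your PATH; the helper names are hypothetical, and the mono 16 kHz WAV settings are an assumption (a common choice for speech models), so keep the original sample rate if your cleanup tool prefers it.

```python
import subprocess
from pathlib import Path


def build_extract_command(video_path, wav_path=None):
    """Build an ffmpeg command that drops the video stream and writes WAV."""
    video = Path(video_path)
    wav = Path(wav_path) if wav_path else video.with_suffix(".wav")
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", str(video),     # input video
        "-vn",                # ignore the video stream
        "-ac", "1",           # downmix to mono (assumed setting)
        "-ar", "16000",       # resample to 16 kHz (assumed setting)
        "-c:a", "pcm_s16le",  # uncompressed 16-bit PCM
        str(wav),
    ]


def extract_audio(video_path, wav_path=None):
    """Run ffmpeg and return the path of the extracted WAV file."""
    cmd = build_extract_command(video_path, wav_path)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd[-1]
```

Calling extract_audio("interview.mp4") would write interview.wav next to the source file; that WAV can then go through the cleanup tool of your choice before upload.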