I’m interviewing some folks and want to capture and transcribe the audio. I did this successfully last year with the Parrot app and a hand install of Whisper, but it was pretty clunky and I don’t like Parrot. This time around I’m trying new stuff.
Summary: ASR Voice Recorder to record (on Android) and aTrain to transcribe.
Audio recording
ASR Voice Recorder. It’s a modern recorder app; the key thing is that it stores files where I can find them from other Android apps or via USB download. I think the free version is sufficient, but I subscribed for $2.50 for a year. I set it up to store things in the Recordings folder. It also uploads to Google Drive for me. I’m not confident about the microphone gain: there’s a manual setting I’ve left at 0, and I’m hoping it auto-adjusts.
Simple transcribing
How about a simple text transcript from the audio file?
Whisper is an option, and there are a few ways to use it. ASR has integration with the OpenAI Whisper service, so you can transcribe audio for $0.36 an hour. Unfortunately individual files are limited to 25 MB. I’m not sure if ASR is smart enough to chunk the file to work around that when transcribing: ChatGPT thinks it might be, but it could be wrong.
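The arithmetic behind those two numbers is easy to check. This is a rough sketch, assuming the price works out to $0.006 per minute (that is what $0.36 an hour implies), a constant-bitrate recording, and 25 MB meaning 25 × 1024 × 1024 bytes. The 64 kbps bitrate is only an illustrative guess, not something read out of ASR’s settings.

```python
PRICE_PER_MINUTE_USD = 0.006           # assumed: $0.36 per hour / 60
MAX_UPLOAD_BYTES = 25 * 1024 * 1024    # assumed: 25 MB counted in binary megabytes


def transcription_cost_usd(duration_minutes):
    """Estimated API cost for a recording of the given length."""
    return duration_minutes * PRICE_PER_MINUTE_USD


def max_minutes_per_file(bitrate_kbps):
    """Longest constant-bitrate recording that fits under the upload limit."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return MAX_UPLOAD_BYTES / bytes_per_second / 60


print(f"2 hours costs about ${transcription_cost_usd(120):.2f}")
print(f"At 64 kbps, one upload holds about {max_minutes_per_file(64):.0f} minutes")
```

Under those assumptions a two-hour interview would not fit in one upload and would need at least three pieces.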
Another option would be to break a big audio file up myself and upload it in chunks to the Whisper API. The catch is that you only want to break during silences, or even better when the speaker changes, so no word gets cut in half.
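Picking the break points could look something like this. It is a minimal sketch that assumes some other tool has already found the silent gaps and handed them over as (start, end) pairs in seconds; the function name and the greedy strategy are mine, and it does nothing about speaker changes. It cuts at the midpoint of the latest silence that still keeps the chunk under the length limit, and falls back to a hard cut if there is no silence in range.

```python
def plan_chunks(total_seconds, silences, max_chunk_seconds):
    """Return (start, end) chunks, cutting in the middle of silences where possible."""
    # Candidate cut points: the midpoint of each silent gap.
    cuts = sorted((begin + end) / 2 for begin, end in silences)
    chunks = []
    start = 0.0
    while total_seconds - start > max_chunk_seconds:
        limit = start + max_chunk_seconds
        # Latest silence midpoint that keeps this chunk under the limit.
        usable = [c for c in cuts if start < c <= limit]
        end = usable[-1] if usable else limit  # no silence in range: hard cut
        chunks.append((start, end))
        start = end
    chunks.append((start, float(total_seconds)))
    return chunks


# Hypothetical 100-second recording with three detected silences.
for chunk in plan_chunks(100, [(28, 32), (55, 57), (88, 90)], 40):
    print(chunk)
```

The chunk length limit would come from the 25 MB cap and the recording’s bitrate.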
A third option is to install Whisper and run it locally. I’ve done this twice and it worked OK.
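For reference, a local run can be quite short. This sketch assumes the `openai-whisper` Python package (with ffmpeg installed), which may not be how every hand install is set up; the "base" model name and the timestamp formatting helpers are just illustrative choices.

```python
def format_timestamp(seconds):
    """Turn a number of seconds into HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def segments_to_text(segments):
    """One line per segment, prefixed with its start time."""
    return "\n".join(
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in segments
    )


def transcribe_file(audio_path, model_name="base"):
    """Transcribe one audio file locally and return timestamped text."""
    # Imported here so the helpers above work without Whisper installed.
    import whisper  # assumed: pip install openai-whisper, plus ffmpeg on the PATH

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return segments_to_text(result["segments"])
```

Calling `transcribe_file("interview.mp3")` (a hypothetical filename) would return the whole transcript as one string. Larger models are slower but generally more accurate.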
Diarization
A simple transcription just gives words. Much better is diarization: the tool also detects who is speaking and annotates the transcript. Whisper does not do this itself. ChatGPT had lots of ideas for me.
I first tried WhisperX, which adds pyannote to Whisper to diarize. I couldn’t get it to run: lots of problems with drivers, even when I tried to make it CPU-only. It was dumping core :-( This may be a WSL problem; I didn’t try very hard.
I had much more luck with aTrain. I installed it via the Microsoft Store because the 5 GB download from their own server didn’t work. It installed easily with no setup required; it’s all preloaded to use my NVIDIA GPU. It transcribed 2 hours of audio in maybe 20 minutes? The output is pretty good; I imagine this is just Whisper for the transcript. The diarizing looked fairly good but not perfect, with lots of cases of several speakers being lumped into one. It’s still helpful, it will just need hand editing and review. It’s a nice product, very easy to use.
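To make the hand editing easier, one thing that might help is collapsing consecutive segments from the same speaker into a single turn. This is only a sketch: it assumes the diarized output can be loaded as a list of segments with "speaker", "start" and "text" fields, which is a made-up structure for illustration and not aTrain’s actual export format.

```python
def merge_speaker_turns(segments):
    """Join consecutive segments from the same speaker into one turn."""
    turns = []
    for seg in segments:
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker as the previous segment: extend that turn.
            turns[-1]["text"] += " " + text
        else:
            turns.append(
                {"speaker": seg["speaker"], "start": seg["start"], "text": text}
            )
    return turns


def render_transcript(turns):
    """Plain-text transcript with one paragraph per speaker turn."""
    return "\n\n".join(f"{t['speaker']}: {t['text']}" for t in turns)


# Hypothetical diarized segments.
example = [
    {"speaker": "SPEAKER_00", "start": 0.0, "text": "So tell me"},
    {"speaker": "SPEAKER_00", "start": 1.5, "text": "how it started."},
    {"speaker": "SPEAKER_01", "start": 3.2, "text": "Well, it was 1998."},
]
print(render_transcript(merge_speaker_turns(example)))
```

This does not fix speakers that got lumped together, but fewer and longer turns are quicker to read through and relabel.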










