
Scott Schopieray

Tips and Options for Captioning/Transcript Generation


As I work with video for my own courses and for others, questions about caption and transcript generation often come up. People tend to think first of either listening to the video and transcribing it themselves, or hiring a captioning company and paying per minute. While these options certainly work, I use a combination of other methods that I have never taken the time to outline in writing. This is by no means a comprehensive list of alternatives to self-transcription or hiring out, but rather a starting point for developing your own methods and procedures.

Generally speaking, I find there are three types of videos that people create: those that are planned meticulously and scripted, those that are outlined but generally not scripted, and those that are totally unscripted and minimally planned. I fall into the last category most often, so my options for generating a transcript are different from those of someone who is more prepared when starting to talk. Those of us working in curriculum development, instructional technology and related fields also often find ourselves working with videos produced before our involvement in the project, so the scripts are unavailable even if they were written at some earlier point.

I find that whenever you are doing a recording, speaking slowly and intentionally allows auto captioning to do a better job than talking at a normal or fast pace. If your pace is too slow for some listeners, they can always speed up the video playback. If you are creating screencasts that show how to do something technical, speaking more slowly and intentionally has the added benefit of pacing the video properly for viewers who are following along on their own computers.

If you are one of those who plans and scripts ahead of time, you might not even need to worry about generating a transcript later, because you already made it when you wrote your script. For those of us who do a bit more "off the cuff" speaking, I use the following methods in combination.

Google Voice Typing

Perhaps the easiest way to generate a pretty good transcript of an existing video or audio file is to use Google's Voice Typing feature. It is built into Google Docs, and you can easily turn it on by opening the "Tools" menu and selecting Voice Typing. This will give you a grey microphone that turns red when the feature is active. For simply voice typing notes, etc., you would activate the microphone and start talking. To use it for transcript generation, I've found it works best to use an external microphone and to play the video on a second device (such as your phone). With the video's audio playing into your computer's microphone, Google Docs transcribes it as if you were simply talking into the microphone. An advantage of Google Voice Typing over other options I've worked with is that it does a pretty good job of transcribing multiple voices on a panel or in a video that has more than one person.

MediaSpace Captioning

MSU has the Kaltura MediaSpace system in place for hosting video on campus, and it includes an automatic captioning feature. The benefit is that captioning happens automatically and produces working captions that are already aligned with the video. It's relatively simple to use once you learn the MediaSpace interface, and it has a built-in editor for correcting any mistakes. Unlike the other options listed in this post, the captioning here happens without the need to tend multiple devices with microphones connected. You can simply submit your video for captioning and move on to another task. The downside is that the accuracy of the transcript with this particular product is not as high as with other options.

iOS Voice Typing

I find the iPhone is quite good at doing voice typing. Sometimes I use it to generate a script before creating a video, but most often, if I'm using it to generate a transcript, I'll just hold it up near the speaker and play the video I'm trying to generate the transcript for. More often than not it gets everything, doesn't really care if the voice changes, and does pretty well. The downside is that it will only record for a short period of time, so there is a lot of stop, start, stop, start with this method.
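Because this method leaves you with many short chunks of text, joining them back into one transcript is easy to automate. The following is only a minimal sketch in Python, assuming you have pasted each dictated chunk into a list of strings; the function name `stitch_segments` and the sample text are hypothetical and not part of any tool mentioned in this post.

```python
import re


def stitch_segments(segments):
    """Join short dictated chunks into a single transcript string.

    `segments` is assumed to be a list of strings, one per dictation
    session, in the order they were recorded.
    """
    cleaned = []
    for segment in segments:
        # Collapse line breaks and repeated spaces, then trim the ends.
        text = re.sub(r"\s+", " ", segment).strip()
        # Skip chunks that were empty (for example, a false start).
        if text:
            cleaned.append(text)
    return " ".join(cleaned)


if __name__ == "__main__":
    # Hypothetical chunks copied out of a notes app after dictation.
    chunks = [
        "Welcome to the course.  ",
        "",
        "  This week we look at\ncaptioning options.",
    ]
    print(stitch_segments(chunks))
```

This only tidies spacing and drops empty chunks. You would still need to proofread the result, since words at the boundary where one recording stopped and the next started are the most likely to be missing or repeated.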

Apple Voice Typing

Enabling the voice transcription accessibility feature on a Mac can also work, though I've not been entirely happy with its pace (it's quite slow to do the actual recognition). The advantage over the iOS version is that it will keep transcribing without needing to stop. To do this well you really need two devices, so that one can play the audio/video back while the other records and types the text.