
My Experience Technical Blogging With Whisper AI

In this post, I’m going to explore the options available for someone who wants to use their computer primarily by voice. The reason for this exploration is that, last Thursday (at the time of ‘writing’), I had the misfortune of falling off my bike and breaking my collarbone.

This is the first time in my life I’ve ever broken a bone, and, as a result, it’s the first time I’ve truly experienced what it’s like to use a computer—and handle every other task in life—without the use of both hands. In my case, fortunately, I still have the use of my left hand, though it’s accompanied by a fair bit of pain.

My Interpretation Of Technical Blogging (It may not be yours)

I think it’s important to clarify what I was hoping to achieve with these pieces of software.

Essentially, I’m trying to write blog posts in the same style as every other post on this blog. These are, in my view, technical blog posts.

While I’m not fully versed in the professional field of technical writing, in my mind, this qualifies as a form of technical writing. It’s somewhat similar to documentation, but in a (far) less formal manner.

I wasn’t expecting to write code using this approach.

In fact, I follow the same process I always do:

  1. I work on the code first without any consideration of how I will write it up.
  2. Once I finish the coding portion of the task, I then look back at the code and come up with a write-up that becomes the blog post.

The second step of this process amounts to a rather unusual, but almost always thorough and informative, code review. I typically learn much more than I would have if I had simply written the code and never gone back to reflect on it.

🤑💰 What A Business Opportunity 💰🤑

It says something about me—and perhaps not something particularly flattering—that one of the first thoughts that crosses my mind when I realise I’ll have to use my computer by voice, due to the loss of use of one of my arms, is that there might be some money to be made from this situation.

From what I’ve researched so far into the available software options—or perhaps lack thereof—it doesn’t feel like there’s a particularly good solution to this problem. So, my initial reaction is that there must be a fairly large underserved market here.

My Ideal Outcome

I’d definitely say I set out on this process fairly naively.

I confess, I thought I might be able to control my entire desktop without using my hands at all.

Let me tell you, from what I’ve seen, we’re a long way off from that.

After my initial foray into what was available, I brought my expectations down slightly, thinking it still might be possible to find software that would allow me to speak and have that text typed into any app or text area.

So far, I’ve definitely not found that to be the case either.

On a related note: if you use SwiftKey on your mobile phone, something like that for a regular keyboard would be really helpful when you’re having to type consistently with only one hand. The closest thing I’ve actually found, and it surprised me a bit, is on my MacBook Pro, the very last Intel model with the awful touch bar. I finally found a use for it: auto-completing words. #amazing.

But the truth, sadly, from what I’ve seen and experienced so far, is that the ideal-world viewpoint I had, of something like a brain-to-computer interface, just isn’t close to reality.

So, temper your expectations.

Just A Few Of The Possible Software Options

Being that this is 2024, my first approach to solving the immediate problem was to ask ChatGPT for some example software that may serve my needs.

Here’s an overview of the available options for voice control on Windows, Linux, and Mac:

Windows

  • Windows Speech Recognition
    Built into Windows, it offers basic voice commands for navigation, dictation, and simple control over applications. It supports text input and common system commands, though the accuracy and flexibility are somewhat limited compared to more advanced tools.
  • Dragon NaturallySpeaking (Nuance)
    A highly popular and accurate voice recognition software. It’s known for its precision in dictation and ability to control many applications via voice. However, it’s a paid solution, and the cost can be quite high.
  • VoiceBot
    VoiceBot allows for voice-activated macros and can be used to control games or any other software via custom voice commands. It’s great for users who need to create specific workflows, but it requires setup and some technical knowledge.
  • Microsoft Dictate (Office 365)
    Part of Microsoft Office, this tool is mainly focused on dictation rather than full control of the computer, but it works well for writing and formatting text.

Mac

  • Apple Voice Control
    Built into macOS, Voice Control offers robust dictation features and the ability to navigate the entire operating system. It provides full control over app navigation, text input, and more. It’s a strong option for Mac users and is quite well-integrated into the system.
  • Dragon Dictate for Mac (Discontinued)
    Previously, Dragon offered a Mac version of its software, but it was discontinued. However, some users still use older versions, though it’s no longer officially supported.
  • Google Docs Voice Typing (via Chrome)
    Though not a full system control tool, Google Docs offers a voice typing feature that works within Chrome on macOS. It’s primarily for dictation within the document editor but can be useful in certain contexts.

Linux

  • Simon
    Simon is an open-source speech recognition software aimed at being flexible for Linux users. While it provides voice control and dictation, its setup can be complex, and it may require custom configurations to work as needed.
  • Julius
    An open-source large-vocabulary speech recognition engine. It’s not exactly user-friendly out of the box but can be powerful with proper setup. It’s aimed more at developers or advanced users who want to build custom solutions.
  • Kaldi
    Kaldi is a speech recognition toolkit that is widely used for research and development purposes. Like Julius, it’s aimed at developers and researchers rather than end users looking for out-of-the-box solutions.
  • Google Cloud Speech-to-Text API (Linux)
    While not a direct Linux tool, using the Google Cloud Speech-to-Text API can allow for custom voice control setups on Linux. It requires some development knowledge to integrate.
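
To give a rough idea of the “development knowledge” that last option implies, here’s a minimal sketch using the google-cloud-speech Python client. The filename, language code, and credentials setup are my own assumptions for illustration; I haven’t run this as part of this experiment.

    # Minimal sketch: transcribe a short local WAV file with Google Cloud Speech-to-Text.
    # Assumes `pip install google-cloud-speech` and GOOGLE_APPLICATION_CREDENTIALS
    # pointing at a service account key file.
    from google.cloud import speech

    client = speech.SpeechClient()

    with open("dictation.wav", "rb") as f:  # hypothetical 16 kHz mono recording
        audio = speech.RecognitionAudio(content=f.read())

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-GB",
    )

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(result.alternatives[0].transcript)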

Additional Cross-Platform Solutions:

  • Talon Voice
    A popular option among developers and power users, Talon Voice offers precise control of the computer via voice and can work across platforms. It’s designed for advanced users who want to navigate and even code using voice commands.
  • OpenVoiceOS
    An open-source platform designed for voice interfaces that can be adapted for different operating systems, including Linux, macOS, and Windows. It’s still in development but is aimed at offering a community-driven voice interface solution.
  • Vosk
    An offline speech recognition toolkit that supports multiple languages and works on Windows, Linux, and Mac. Like other open-source solutions, it requires technical knowledge to configure but offers flexibility for developers.
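
And since Vosk keeps coming up as the offline option, here’s roughly what its Python bindings look like. Again, this is a sketch based on Vosk’s documented usage rather than something I ran for this post; the model directory and audio filename are hypothetical.

    # Minimal offline sketch with Vosk.
    # Assumes `pip install vosk` and a model unpacked to ./vosk-model (hypothetical path).
    import json
    import wave

    from vosk import KaldiRecognizer, Model

    model = Model("vosk-model")
    wf = wave.open("dictation.wav", "rb")  # 16 kHz mono PCM works best

    rec = KaldiRecognizer(model, wf.getframerate())
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)

    # The final result is a JSON string containing the recognised text.
    print(json.loads(rec.FinalResult())["text"])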

My Shortlist

That’s the list that ChatGPT came up with, and I investigated some of them, dismissing others quickly.

I’d already played with Dragon Dictate quite extensively in the past, and I know it’s transitioned into some kind of SaaS offering these days, which wasn’t very appealing—somewhat ironic, given that I’ve just said it would make a nice business idea.

Here’s my old video about Dragon:

I’ll come back to Dragon in a minute when we look at Google Docs below.

But the list I came up with for testing, given that I’d be using predominantly Linux and macOS (in that order), includes:

  • Siri
  • Google Docs
  • An open-source piece of software called Buzz
  • Slack (which is a bit left-field, and I’ll get onto that in a minute)
  • Whisper AI (from OpenAI, used via ChatGPT)

Let’s take a closer look at each.

Siri

Siri wasn’t actually the first piece of software that I tried, but it was the first one I thought of when switching over to my MacBook. I predominantly use Linux as my day-to-day desktop, but because I’d also need to do this speech-to-text task on my work computer, I thought about Siri.

On the positive side, Siri is built into macOS and also my iPhone. However, I’ve never thought to use it in either case. In fact, I have it disabled by default.

  • Pros of Siri:

    • Immediate availability in the Apple ecosystem
    • Built-in to both macOS and iPhone, so no need for additional setup
  • Downsides of Siri:

    • I have it disabled because I find it intrusive, like spyware in the worst possible sense
    • I couldn’t figure out how to use it effectively

I tried saying “Hey Siri,” and while it responded with its usual chirp, when I asked it to transcribe some text, I got no output. Even now, I’m unsure why it didn’t work.

Rating: 1 / 5

Google Docs

Here’s the thing with Google Docs: it’s pretty good.

A few years ago, prior to ChatGPT being a thing, Google Docs transcription felt cutting-edge.

  • Comparing to Dragon Dictate:
    When I first tested Dragon Dictate (or Dragon NaturallySpeaking, as it was called back then), one of the first things you had to do was train it. You had to sit at the computer and read a story or passage aloud for the software to understand your voice. After that, it would improve over time, but occasionally, you’d need to retrain it with new pieces of text. The more you trained it, the better it got at recognising your voice.

  • Google Docs’ advantage:
    Google Docs blew that whole process out of the water. Instead of individual training, Google clearly leveraged hundreds or thousands of hours of audio from a diverse set of speakers. Right off the bat, it can understand your accent, which at the time was pretty mind-blowing.

  • Downside today:
    However, nowadays, Google Docs feels quite literal in how it transcribes, and compared to what ChatGPT offers with Whisper AI, it feels really out of date. While Google Docs was groundbreaking before, the advancements made by Whisper AI show how much transcription has evolved.

I think the video demonstration kind of shows this too. It just works—somewhat ironically, in the way that Hey Siri doesn’t.

It’s not super great at eliminating errors. The grammar it produces isn’t brilliant, which means there’s still quite a lot of manual editing required.

It doesn’t really understand where to put punctuation. While telling it where to place punctuation isn’t that unintuitive, it still feels like an extra step that shouldn’t be necessary.

Here’s the output I received, warts and all:

 that’s it I just click to speak and then off it goes it’s quite quick but it’s also not great with punctuation. so what I would say about Google Docs is that lasted just works and the price is great as in it’s free and you don’t need to train it it’s also very very literal so it’s not going to help me with anything like formatting or improving what I say it just takes what I say I’m puts in the document albeit somewhat inaccurately

My Google Docs transcription output

These days, it just feels like it’s lacking something. It works, but in comparison to newer tools, it doesn’t feel as advanced as it used to.

Rating: 3 / 5

Buzz

Buzz is a piece of software I hadn’t heard of until ChatGPT recommended it to me, and it’s got quite a lot going for it. It’s open source, which is great, and it’s cross-platform, so it works on Mac, Linux, and Windows, although with some caveats.

The Linux version worked by copy / pasting the readme installation steps.

The Mac version requires installation through Homebrew. While this works, it comes with a deprecation warning that it’s going to be pulled at some point in 2025. It advises that you download it from the Mac App Store, where it is now a paid product, as far as I can see. I think it’s around £10.

However, on the Mac, it just didn’t work as expected. I tried it on the newer Apple Silicon machine, using both the internal MacBook microphone and my external mic, pressed record, but it just wouldn’t pick anything up.

I haven’t tried it on my Intel Mac because it works fine on Linux. So for work purposes it didn’t work on the work laptop, but for home purposes I would be using Linux anyway.

Maybe things are different if you get the paid version, but it’s not great that the open-source equivalent didn’t work on the Mac.

It’s definitely got a clunky User Interface on Linux. I don’t think that comes across in the video.

It works offline, which was quite a big selling point for me.

The speech you input into the app never leaves your computer, so it works even if you’re disconnected from Wi-Fi or the internet. However, it does require that you download at least one of the Whisper models (or one of the other models it supports) before it can work offline, and it only prompts you to do this the first time you try to record something. These models range in size from a few hundred megabytes up to several gigabytes for the largest ones.

I don’t think it was GPU offloaded—at least, I never had to set anything like that up. But it worked really quickly, almost in real time.
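
For a sense of what Buzz is wrapping, here’s a minimal sketch of driving Whisper locally with the open-source openai-whisper Python package. The package, model choice, and filename are my assumptions for illustration, not anything taken from Buzz’s documentation.

    # Minimal local Whisper sketch.
    # Assumes `pip install openai-whisper` and ffmpeg available on the PATH.
    import whisper

    # "base" is one of the smaller models; it downloads on first use,
    # much like Buzz fetches a model the first time you try to record.
    model = whisper.load_model("base")

    result = model.transcribe("dictation.wav")  # hypothetical local recording
    print(result["text"])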

I never tested it on Windows.

It worked fine enough on Linux, but it was also outputting some weird results, and that definitely comes across in the video. It would sometimes get stuck in a hallucination or loopback mode, where it kept spamming out the same text repeatedly.

The nicest thing about Buzz, though, was that it convinced me that Whisper AI has a lot of potential.

Pros:

  • Open source and cross-platform (Mac, Linux, Windows)
  • Works offline, meaning data privacy is maintained
  • Requires no internet connection to transcribe speech
  • Fast, almost real-time transcription
  • Whisper models available, including smaller ones to download

Cons:

  • Clunky UI, especially on Linux
  • Mac version is transitioning to a paid app, with the free version set for deprecation
  • Doesn’t work reliably on Mac, especially on Apple Silicon
  • Occasional output errors, including hallucination/loopback issues
  • No GPU offloading, which could limit performance on some setups

Rating: 3 / 5

I feel like I should explain why I gave Buzz a rating of 3 out of 5, the same as Google Docs, even though they’re really different products. The reason comes down to polish: Google Docs has it, which bumps its rating up, whereas Buzz, dare I say, lacks it, which drops its rating down.

That sounds really harsh, and I don’t mean it as a diss. I think Buzz is a pretty great product, just with some usability flaws at the moment.

Slack

Slack is perhaps an unusual entrant on this list, but it works kind of well. I actually had no idea Slack had a transcription feature until one of the people on my team at work, who really likes sending voice messages instead of typing, used it. I’m more of a reader than a listener, so I prefer to read the transcripts that Slack sends automatically.

One really nice feature about Slack transcription is that it won’t lose your longer messages. That’s a downside we’ll see shortly with the ChatGPT approach. However, the caveat with Slack is that in order to do the transcription, it records everything you say, creates some kind of WAV file (or similar), and then ships it off to Slack’s servers where the transcription happens. After that, it sends the transcription back to you. This, of course, means your audio is leaving your personal computer, which raises privacy concerns. It’s also very much not real time, with transcripts becoming available “at some point” (30 seconds+) later on.

Another aspect is that Slack adds transcript timestamps to everything. While that’s fine, it’s also one more thing to delete if you want to use the text elsewhere. Maybe there’s a way around that, but I haven’t explored it.

I only thought about using Slack because it’s immediately available on my work computer, and like I said, it works ok. However, better options exist. Also, I assume you have to pay for Slack—I don’t know, as I’ve never personally paid for it myself.

Pros:

  • Automatically transcribes voice messages
  • Handles longer messages without losing content
  • Convenient if Slack is already in use
  • Readable transcripts for users who prefer text over audio

Cons:

  • Audio is sent to Slack’s servers, raising privacy concerns
  • It’s not close to real time
  • Transcripts include timestamps, which need to be removed if using the text elsewhere
  • Better transcription options exist
  • Likely requires a paid Slack plan (depending on usage)

Rating: 2 / 5

Whisper AI (ChatGPT)

Ah, ChatGPT—is there anything it can’t do? Well, yes, as we’ll get to in a sec, but this is probably the best of the bunch by quite some margin.

One thing that’s really astounding to me, and it didn’t immediately stand out as a feature, is that talking to ChatGPT is not literal dictation. It sounds obvious when you say it, but as we’ve seen with tools like Google Docs, Slack, and even Dragon NaturallySpeaking back in the day, those products will transcribe whatever you say verbatim—sometimes with spelling mistakes or incorrect word assumptions.

But ChatGPT doesn’t just transcribe more accurately; it interprets what you’ve said and can craft a better message.

It automatically, and sensibly, adds in correct punctuation and frequently fixes my poor grammar.

Much like Google Docs, the Whisper AI model is trained on 680,000 hours of multilingual audio data from a diverse range of voices. While the exact number of individuals isn’t specified, it’s incredibly good at understanding my accent, which some have said is quite strong.

However, it’s not all roses. There are some things that prevent it from being a 5 out of 5 product.

Firstly, as best as I understand it, the voice input currently only works in the iOS and Android apps. It’s probably possible to use Whisper from Linux or Mac through OpenAI’s API, but that would involve some custom code.
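
For what it’s worth, that custom code doesn’t look too scary. Here’s a minimal sketch using OpenAI’s Python SDK and the whisper-1 transcription endpoint; the filename is hypothetical, you’d need your own API key, and you’d need to be comfortable with the audio leaving your machine.

    # Minimal sketch: transcribe a recording via OpenAI's API.
    # Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
    from openai import OpenAI

    client = OpenAI()

    with open("dictation.m4a", "rb") as audio_file:  # hypothetical recording
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )

    print(transcript.text)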

Secondly, I’ve also had issues with it randomly translating my speech into Welsh, and I’ve found a thread on Reddit where others have experienced this as well. I think this might be related to location, as I live semi-near the north of Wales. According to the Reddit thread, this mainly happens to people in the north-west of England. Very bizarre, and quite annoying.

Another really frustrating aspect of it—and fortunately, it has only happened to me once, because I changed my workflow as a result—is that you can’t record too much speech at once.

What I found was that after recording something that took me about 5 minutes to speak, it responded with, “I’m sorry, I didn’t quite catch that,” and then the audio was lost. Naturally, I’d also lost my train of thought on what I’d been speaking about for 5 minutes. After that incident, I adjusted my workflow to record about a minute—maximum 2 minutes—before clicking the send button to let it get transcribed.

When using the transcription feature, the audio is sent to ChatGPT servers for transcription. The transcription itself does not take place locally on your iPhone; instead, the audio file is processed on the server side, where it is converted into text using the Whisper AI model, and then returned to your device. This happens extremely quickly – 1 to 2 seconds for 2 minutes of audio.

Pros:

  • ChatGPT interprets and refines speech rather than just providing literal dictation.
  • Automatically adds correct punctuation and fixes grammar.
  • Whisper AI is trained on 680,000 hours of multilingual audio, making it excellent at understanding diverse accents, including strong ones.
  • Accurate and efficient transcription.

Cons:

  • Voice input only works on iOS and Android apps; using it on Linux or Mac requires custom code via the API.
  • Audio is sent to OpenAI’s servers, raising privacy concerns.
  • Occasional issue of speech being randomly translated into Welsh, likely due to location-based quirks.
  • Cannot handle long recordings—speech over 2 minutes risks being lost, leading to frustration and workflow adjustments.

Rating: 4 / 5

The Winner

To be honest, it’s not even really that close—ChatGPT takes this by quite some margin.

It’s not a 5 out of 5 experience because it doesn’t do a lot of the things I would like it to do.

At least not yet.

But the potential is pretty mind-blowing. And even where it’s at today, it’s certainly way ahead of its competition.

My Workflow

My workflow is pretty straightforward, really.

I wouldn’t say it’s particularly streamlined, and it could definitely be improved.

But maybe I would continue to use this or a very similar process even when my arm is healed and I’ve got full dexterity back, because it does save quite a lot of time—at least in the ‘writing’ phase.

My process is to first think about the article outline. I’ll do this by creating a bullet point list of things that I want to cover in the article, and I’ll spend some time ordering this list inside an editor like VS Code.

Then I open up the ChatGPT app on my iPhone and create a new chat.

I then open the exact same chat on my laptop in the browser and keep them both open while writing.

I’ll start the audio input and say my piece covering whatever is my first bullet point from the VS Code document. As I mentioned earlier, I generally keep it to about one to two minutes in length before hitting the tick button to convert my speech to text.

Even though the transcription involves sending my audio to the ChatGPT server, it doesn’t actually print out what I’ve said and save it inside the ChatGPT session until I hit the send button on the transcribed text. That might sound a little strange, but it works quite intuitively.

What I’ve done is instruct ChatGPT to either just say “OK” when it accepts my transcription—if I want to use it verbatim—or I ask it to process it in a specific way. For example, I might tell it that I’m writing a technical blog post and to edit the text accordingly within certain guidelines.
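
Purely for illustration (this is a paraphrase rather than my exact wording), the standing instruction I give at the start of a chat looks roughly like this:

 I’m dictating sections of a technical blog post. When I send you a transcription, either reply “OK” if I say I want it verbatim, or tidy the grammar and punctuation, keep my tone, don’t add anything I didn’t say, and return the result as Markdown.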

I then have to go back to the browser window on my laptop and refresh the entire chat session in order to see the newly submitted text. From there, I can copy and paste it into my blog post, making any necessary edits.

I found it easiest to ask ChatGPT to work with Markdown, as it provides a nicely isolated content box. This box comes with a little icon at the top right, which makes it really easy to copy and paste the text directly from the browser into your blog.

Repeat until all the VS Code bullet points are covered.

I’ve used this process now to write both this blog post and the previous one. And as I say, while it’s not perfect, it’s pretty good.

Potential Improvements & Final Thoughts

It would be really cool if there was more operating system-level integration with this stuff.

Things like renaming a file, opening a new tab, performing common actions like saving files, or following small but commonly repeatable processes—maybe resizing images to specific width / height combos as an example.

It would be really nice to be able to do that kind of stuff natively from inside the apps.
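
To make the image example concrete, this is the sort of small, repeatable job I mean, sketched here with Pillow purely for illustration (the filenames and dimensions are made up):

    # Purely illustrative: the kind of repeatable task I'd love to trigger by voice.
    # Assumes `pip install pillow` and a hypothetical screenshot file.
    from PIL import Image

    img = Image.open("screenshot.png")
    img = img.resize((1200, 675))  # a specific width / height combo
    img.save("screenshot-1200x675.png")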

Right now, there’s definitely a lot of back and forth. I can do it all one-armed, or one-handed, but it’s still… not really happening at the speed of thought.

I know I sometimes come across as overly cynical about things like this, but I have to say—even if it hasn’t come through in the post—I’m genuinely optimistic about how cool and transformative this technology can be. I think it has the potential to benefit everyone, but especially those who experience any kind of physical or mobility challenges, whether temporary or long-term.

I also recognise that it can be incredibly frustrating to navigate a world that doesn’t always account for diverse needs. I fully admit that I hadn’t considered this approach before, and after just a week of adapting, I’ve already felt the frustration. I can only imagine how it feels for those who face these challenges every day.

One thing I’d really like to get out of this blog post is to hear back from you. If you use these tools—whether they’re similar, different, or involve some unique setup—I’d love to know how you’re finding it. Also, any tips and tricks you’ve discovered to make this process easier would be greatly appreciated.

Thanks for reading.
