LESSWRONG

28

[ Question ]

How did you integrate voice-to-text AI into your workflow?

20th Nov 2023


ChatGPT seems to have really awesome voice-to-text ability. However, it seems to record only within ChatGPT itself, so it can't be used to create notes or type in other programs, and it's unclear to me how best to take advantage of the new capabilities.

I'd love to hear how people have integrated the newest voice-to-text capabilities into their workflows.

Automation, Productivity, Software Tools, Practical


5 Answers

the gears to ascension

Nov 20, 2023

60

I don't use the OpenAI capabilities. I use Talon voice recognition. I can control my computer well enough to code by voice, and (edit: in a prerelease Patreon version) the dictation mode is now based on Whisper. It's pretty amazing.

MondSemmel 11mo 20

Are you talking about this website? It seems rather sparse on details. Can you elaborate a bit on what the tool is, who it is for (only power users?), what you use the tool for, how well it works, limitations, etc.?

4 the gears to ascension 11mo

https://talon.wiki/

* What it is: command-focused voice recognition for computer control and especially programming by voice. Freeware written in Rust by a voice-only dev, funded by Patreon, for use by other voice-only devs.
* Who it is for: anyone who wants to control their computer by voice or do dictation, but especially people who want to input a lot of symbols and control commands.
* What I use it for: general computer control. Search when doing something else. Sometimes switching applications. Dictation when my hands hurt. Coding when my hands hurt.
* How well it works: better than Google voice keyboard, which is quite good.

The core is closed-source freeware; the configs are in Python and a simple custom language and are generally open source. The config API is really nicely done. You can make custom voice commands easily. I have it set up so that saying "computer, <command>" runs the command from sleep mode, "computer, wake" wakes it, and "computer, sleep" puts it to sleep. E.g., "computer, google search lesswrong voice text ai".

I am not using it now; my hands are still faster when they work. However, it's head and shoulders better than Dragon, which was for a long time the best command voice recognition. (Edit: this part is prerelease only apparently, but) since it integrated Whisper for dictation voice recognition, I think its place as the best option is uncontested.

2 ChristianKl 11mo

I downloaded it and selected the W2L Conformer engine. On https://talon.wiki/speech_engines/ it does not say anything about using Whisper. It seems much worse than what ChatGPT does. Did you load another engine to get Whisper to work?

4 the gears to ascension 11mo

Oh hmm, I might have a pre-release version. Sorry to mislead. The feature exists and will be out eventually, but I likely have it only because I'm on the Patreon version.

2 MondSemmel 11mo

The changelog indeed mentions Whisper as a "0.4.0 beta-only feature".

Nov 20, 2023

52

My favorite part from the Getting Things Done book is the idea to capture 100% of your ideas, and to only process them after the fact. Rather than, say, trying to only write down good ideas. On LW this philosophy is known as Babble and Prune.

So for years, I've wished for the ability to record voice notes anytime I want, and to then get an accurate transcript automatically. Almost exactly one year ago, I bought a Pixel 7 phone for this very reason, hoping that their advertised AI chip and Recorder app could provide just that. They couldn't; the Recorder app prioritizes live transcription over accuracy, and the transcript is not usable without listening to the recording, which defeats the point.

However, due to Whisper, I can now indeed record voice memos via my phone or, newly, my smartwatch; then upload the file to cloud storage (e.g. Google Drive); and then immediately and automatically receive a Whisper AI transcript (awesome in its accuracy) and ChatGPT summary etc. (so far irrelevant for me) in my Notion workspace.

This is implemented by following this step-by-step automation guide by Thomas Frank, and only requires an OpenAI account incl. API key (costing $0.40 per hour of audio), a free Pipedream account (which is like Zapier but allows arbitrary code blocks), a free cloud storage account, and a free Notion account.

In principle the Notion part is unnecessary, and someone who wanted to take the time to manually adjust the automation could have the transcript output instead be an email or text file or whatever.
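For the text-file variant, here is a minimal Python sketch. It is not the Pipedream workflow from the guide; it assumes the recordings end up in a local (or cloud-synced) folder and that the official `openai` Python package is installed with `OPENAI_API_KEY` set. The function names and the folder layout are hypothetical and exist only for illustration; `whisper-1` is the hosted Whisper model name in the OpenAI API.

```python
from pathlib import Path
from typing import Callable, List, Sequence


def transcribe_with_openai(audio_path: Path) -> str:
    """Send one audio file to OpenAI's hosted Whisper model and return the text."""
    # Imported lazily so the rest of the script works without the package.
    from openai import OpenAI  # assumption: `pip install openai` (v1 or later)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with audio_path.open("rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )
    return result.text


def transcribe_folder(
    inbox: Path,
    outbox: Path,
    transcribe: Callable[[Path], str] = transcribe_with_openai,
    suffixes: Sequence[str] = (".mp3", ".m4a", ".wav"),
) -> List[Path]:
    """Write a .txt transcript for every new audio file in `inbox`.

    Files that already have a transcript in `outbox` are skipped, so the
    function can be re-run on a schedule without paying for the same
    audio twice. Returns the list of transcript files written.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    written = []
    for audio_path in sorted(inbox.iterdir()):
        if audio_path.suffix.lower() not in suffixes:
            continue  # ignore anything that is not a recording
        target = outbox / (audio_path.stem + ".txt")
        if target.exists():
            continue  # already transcribed on an earlier run
        target.write_text(transcribe(audio_path), encoding="utf-8")
        written.append(target)
    return written
```

Calling `transcribe_folder(Path("voice-memos"), Path("transcripts"))` on a schedule would then approximate the upload-and-receive flow described above, minus the summary and Notion steps. The `transcribe` parameter is there so a different backend, such as a locally run Whisper model, could be swapped in.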

bvbvbvbvbvbvbvbvbvbvbv 10mo 10

I bought a cheap watch, the T-Watch 2020, which has Wi-Fi and a microphone. The goal is to have an easily accessible LangChain agent connected to my LocalAI.

I'm a bit stuck for now because the driver is written in C and I mostly know Python, but I'm getting there.

Apr 10, 2024

30

I like this extension (https://chrome.google.com/webstore/detail/whispering/oilbfihknpdbpfkcncojikmooipnlglo), which I use for long messages. I press a shortcut and it starts recording with Whisper; I press it again and it puts the transcript in my clipboard.

Nov 20, 2023

10

I can now get real-time transcripts of my Zoom meetings (via a Python wrapper of the OpenAI API), which makes it much easier to track the important parts of a long conversation. Otherwise I tend to zone out sometimes and miss little pieces, as well as forget stuff.

Nov 20, 2023

10

ChatGPT uses Whisper for speech to text, which is open source and also available through OpenAI's API.

I personally tried to use speech to text more on my phone, but was annoyed by it and went back to typing.

I've heard Whisper is a definite step-up, especially when mixing English and German.

https://openai.com/research/whisper
https://github.com/openai/whisper
https://platform.openai.com/docs/models/whisper

Edit: This used to say text to speech.

bvbvbvbvbvbvbvbvbvbvbv 10mo 10

You meant speech to text instead of text to speech. They just added the latter recently, but we don't know the model behind it afaik.