I currently control my rhythm stage setup with a USB keyboard. This works, except that my hands are often busy playing instruments when I want to change something. I already have a microphone in front of my mouth running to a computer, which I use for my whistle-controlled bass synthesizer: could I use speech recognition?

Another way to look at this is that a lot of my exploration here has been finding some way to play bass and mandolin at the same time: whistle bass, bass pedals, bass and drum pedals. With speech recognition I could call chords to the computer!

I'm currently running on a Raspberry Pi 3B. I tried OpenAI's Whisper, then the C++ port (which should work on the 4B), and then the much older software Julius, but my Pi 3B couldn't handle real time with any of these. Is there some existing software that would be a good fit for this? People were running speech recognition back when this computer would have been top of the line.

I'm also considering giving up on the idea of using speech: all I really need is some sound I can make consistently enough that the computer can recognize. I'm already running a system that can mostly decode whistling, so maybe I should figure out how to phrase my commands as simple whistle patterns? I'm a bit nervous about doing this while playing, though, since unless I choose patterns that can sound musical in any key it's going to be pretty hard to combine with playing something else. Maybe alternating low and high notes, but I can choose which notes in response to the key?
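As a minimal sketch of the low/high idea, assume a pitch tracker already produces one frequency per whistled note. The two target notes here (the tonic and the fifth above it), the one-semitone tolerance, and the command names are all illustrative assumptions, not part of my current system:

```python
import math

A4_HZ = 440.0

def midi_to_hz(note):
    """Convert a MIDI note number to a frequency in Hz (equal temperament)."""
    return A4_HZ * 2 ** ((note - 69) / 12)

def semitone_distance(f1, f2):
    """Absolute distance between two frequencies, in semitones."""
    return abs(12 * math.log2(f1 / f2))

def classify(pitch_hz, low_hz, high_hz, tolerance=1.0):
    """Return 'L', 'H', or None if the pitch is not near either target."""
    d_low = semitone_distance(pitch_hz, low_hz)
    d_high = semitone_distance(pitch_hz, high_hz)
    if min(d_low, d_high) > tolerance:
        return None
    return "L" if d_low <= d_high else "H"

# Hypothetical command table: pattern of low/high notes -> command name.
COMMANDS = {"LHL": "next_section", "HLH": "toggle_drums", "LHLH": "stop"}

def decode(pitches_hz, tonic_midi):
    """Decode a whistled pattern relative to the current key.

    The low note is the tonic and the high note is the fifth above it,
    so the pattern stays consonant whatever key the tune is in.
    """
    low_hz = midi_to_hz(tonic_midi)
    high_hz = midi_to_hz(tonic_midi + 7)
    symbols = [classify(p, low_hz, high_hz) for p in pitches_hz]
    if None in symbols:
        return None  # something other than a command was whistled
    return COMMANDS.get("".join(symbols))
```

With the tonic set to D5 (MIDI 74), `decode([587, 880, 590], 74)` would give `"next_section"`, while a pattern containing a note that is neither the tonic nor the fifth is ignored.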


I hear the problem statement as "in a noisy environment, do a minimally acoustically intrusive thing with the mouth/breath/voice which sends a clear and precise signal to a computer".

My biases about signal processing and sensors say that audio processing is about the hardest possible way to tackle the challenge at hand, because you have so little control over the acoustic environment in the settings where I expect you'll want to use the system.

Here are some things that I expect would be easier than voice recognition, and why:

  • Eye tracking. Point a camera at your face, shine an adequate light on your face if needed, and use something like opencv to catch patterns of eye movement while looking at the camera. This is silent and, depending on camera position, could avoid interfering with you looking out at the room when not cueing the system.

  • Muscle measurement of something you're not already using when sitting and playing, if there are any such muscles available. https://www.sparkfun.com/products/21265 kind of thing can talk to an arduino or pi. Could be tricky to find an appropriate spot for it and an intuitive way to handle input from it, though.

  • Accelerometer and gyro on the head could let you nod in a specific direction to send a signal to the system. Impervious to noise and lighting, and gesturing with the head is a pretty natural cue to use when your hands are full, so probably easy to learn.

  • If you have any range of motion available that you're not already using (knees? elbows?), an IR rangefinder or array of them could cue on the distance from the sensor to your body. This might be granular enough to select from several presets, chords, etc. Not unlike a digital theremin.

  • If you don't play in windy environments, an array of small fans that you blow on could be used as sensors that function independent of the background noise. I just checked with a cheapy little brushless 5V fan and blowing on it absolutely generates a few millivolts to a multimeter, more for blowing harder to spin it faster, exactly as the laws of physics predicted. Some rough guesstimation with a tape measure indicates that an array of ~1.5" fans at 6-8" from the face could be controlled pretty precisely this way, which suggests the possibility of a silent-ish breath-only version of the chord buttons on an accordion. This has the added benefit that a moving fan offers visual feedback that the "button" was "pressed".

  • Take a page out of the Vim book and add a single switch that, when active, recontextualizes one of your existing digital input methods to give the inputs different meanings. The drawback is greater cognitive load; the benefit is minimal hardware complexity.
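To make the fan idea above a bit more concrete, here is a rough sketch of turning noisy fan voltages into discrete "button" presses with hysteresis. It assumes something else already samples each fan as a millivolt reading; the 3 mV and 1 mV thresholds are guesses based on the "few millivolts" measurement, and the class and function names are hypothetical:

```python
class FanButton:
    """Turn a noisy fan voltage into press events using hysteresis."""

    def __init__(self, on_mv=3.0, off_mv=1.0):
        # Assumed thresholds: press at or above on_mv, release at or below off_mv.
        self.on_mv = on_mv
        self.off_mv = off_mv
        self.pressed = False

    def update(self, millivolts):
        """Feed one reading; return True only when a new press starts."""
        if not self.pressed and millivolts >= self.on_mv:
            self.pressed = True
            return True
        if self.pressed and millivolts <= self.off_mv:
            self.pressed = False
        return False

def detect_presses(samples, n_fans):
    """Return (sample_index, fan_index) for each press.

    samples is a list of readings, each a list of n_fans millivolt values.
    """
    buttons = [FanButton() for _ in range(n_fans)]
    events = []
    for t, reading in enumerate(samples):
        for i, mv in enumerate(reading):
            if buttons[i].update(mv):
                events.append((t, i))
    return events
```

The gap between the two thresholds means a fan that is still spinning down does not register as a second press.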


Thanks! Your model of the problem I'm trying to solve is good.

It's true that I have limited control over the acoustic environment, and a noisy stage can be unpredictable. On the other hand, my mouth is right next to a high-quality microphone, which gives me a very good signal-to-noise ratio. So, for example, whistling to control a bass synthesizer has worked well for me.

Thoughts on your suggestions:

  • Eye tracking: I think this is plausible, though it would definitely need to still let me look around the stage when I wasn't actively giving a command. Possibly some pattern of looking at ~four different spots in order might be enough? Some stages are dark, though, which makes me nervous about anything visual.

  • Muscle sensing: I can't really think of a good place to put one. I'm already using my hands and arms to play the piano or mandolin and my legs and feet to play drums. Something on my face would be possible, but kind of intrusive?

  • Accelerometer and gyro on the head: I built one of these early on in my explorations here and it does work. I stopped using it, though, because it would give me a sore neck.

  • I think range finding runs into the same issue as muscle sensing: all the obvious candidates are in use.

  • The array of fans sounds interesting. I could put them in a ring around my microphone and blow into/on them. It's rare that I play in windy environments, and it's okay if this is a component I can't use there. Slightly nervous about reliability, since this seems kind of fragile?

  • Modes: this is something I already do a lot, and I'm pretty happy with it. For example, I can switch my foot pedals between drums, bass, or a bit of both. But while it lets me do more different things, it doesn't let me do more things at once, which is important when trying to make a full sound as a duo.


Thanks for explaining!

Eye tracking could also mean face/expression tracking. I figure there are probably some areas (stage, audience) where it's important for you to look without issuing commands, and other areas (floor? above audience?) where you won't gain useful data by looking. It's those not-helpful-to-look areas where I'm wondering if you could get enough precision to essentially visualize a matrix of buttons, look at the position of the imagined button you want to "select" it, blink or do a certain mouth movement to "click" it, etc.

Your confidence in the quality of your mic updates my hope that audio processing might actually be feasible. The lazy approach I'd take to finding music-ish noises which can be picked out of an audio stream from that mic would be to play some appropriate background noise and then kinda freestyle beatbox into the mic in a way that feels compatible with the music, while recording. I'd then throw that track into whatever signal processing software I was already using to see whether it already had any filters that could garner a level of meaning from the music-compatible mouth-noises. A similar process could be to put on background music and rap music-compatible nonsense syllables to it, and see what speech-to-text can do with the result.

(As a listener, I'm also selfish in proposing nonsense noises/sounds over English words, because my brain insists on parsing all language in music that I hear. This makes me expect that some portion of your audience would have a worse time listening to you if the music you're trying to play was mixed with commands that the listeners would be meant to ignore.)

I expect that by brute forcing the "what can this software hear clearly and easily?" problem in this way, you'll discover that the systems you're using do well at discerning certain noises and poorly at discerning others. It's almost like working with an animal that has great hearing in some ranges that we consider normal and poor hearing in others. When my family members who farm with working dogs need to name a puppy, they actually test lists of monosyllabic names in a similar way to make sure that no current dog will confuse the puppy's name for its own, before teaching the puppy what its name is.

After building your alphabet of easy-to-process sounds, you can map combinations of those sounds to commands in any way that you like, and never have to worry about stumbling across a word that the speech-to-text just can't handle in the noisy context.
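A rough sketch of that screening step, assuming you have logged (intended sound, recognized label) pairs from test recordings. The function name, the 0.9 cutoff, and the five-trial minimum are illustrative assumptions. A sound is kept only if it is usually heard correctly and other sounds are rarely mistaken for it, which is the same check as the puppy names:

```python
from collections import Counter

def reliable_sounds(trials, min_score=0.9, min_trials=5):
    """Pick sounds the recognizer hears reliably.

    trials is an iterable of (intended, recognized) label pairs.
    """
    intended_n = Counter()    # how often each sound was attempted
    recognized_n = Counter()  # how often each label was output
    correct_n = Counter()     # how often the attempt was heard correctly
    for intended, recognized in trials:
        intended_n[intended] += 1
        recognized_n[recognized] += 1
        if intended == recognized:
            correct_n[intended] += 1

    keep = []
    for sound, n in intended_n.items():
        if n < min_trials:
            continue  # not enough data to judge this sound
        recall = correct_n[sound] / n
        heard = recognized_n[sound]
        precision = correct_n[sound] / heard if heard else 0.0
        if recall >= min_score and precision >= min_score:
            keep.append(sound)
    return sorted(keep)
```

Sounds that pass can then be combined into commands; sounds that fail either get dropped or tell you which pairs are too close to use together.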

The less lazy way, of course, would be to choose your vocabulary of commands and then customize the software until it can handle them. That's valid and arguably cooler; it just strikes me as a potentially unbounded amount of work.


I'm wondering if you could get enough precision to essentially visualize a matrix of buttons, look at the position of the imagined button you want to "select" it, blink or do a certain mouth movement to "click" it, etc

Maybe! This would definitely be nice if it worked. Probably better for switching the system between modes than triggering sounds in real time, though?

This makes me expect that some portion of your audience would have a worse time listening to you if the music you're trying to play was mixed with commands that the listeners would be meant to ignore.

When using the mic in this mode I wouldn't be sending it out to the hall. It wouldn't be audible offstage.

see what speech-to-text can do with the result

I do think that's worth doing, though only if I get far enough along to have speech-to-text running at all. Right now I think I probably am just trying to use hardware that isn't up to the task.


Talon Voice, which I use for voice-commanding my computer, is very fast and has a Linux version. I don't know if it would run on a small machine, but it seems worth a shot. And it seems perfect for your command-centered use case. To get more information, I would ask in their Slack.


Looks like Talon uses (their own) W2L Conformer as their recognition engine. I'll poke at it!