Some questions:
Re: overall bandwidth
128 kbps audio sounds fine and video quality is much less important than audio. A typical video call uses 720p video at 30 fps, which Twitch says you can stream at 3 Mbps (and pro streamers probably care more about quality than most people do). I basically wouldn't worry about bandwidth unless you use a physical whiteboard or otherwise need really good video quality.
Re: latency and WiFi
Most sources talk about one-way latency, even though round-trip is what actually matters (how long it takes for you to react to something you heard and for the other person to hear your reaction). I'm guessing round-trip is technically harder to measure since it includes the human-thinking delay.
Twilio says users start to notice one-way latency above 100 ms, and VOIP providers target under 150 ms. Traditional calls are below 20 ms though (similar latency to talking to someone across a large room). As a lower bound, musicians get thrown off by ~30 ms of latency.
Note that people can adapt to latency but they do that by having less productive conversations: If you can't naturally do things like interrupt each other, you'll have a less-interactive conversation. I suspect 150 ms is too optimistic.
Bluetooth's AptX codec adds ~40 ms if you're lucky (they market this as "low latency" since the older SBC codec adds up to 200 ms of latency). If I'm understanding things right, two people on a cross-country call using bluetooth headsets are already hitting 140 ms in the best case. I don't know if there's a good way to measure this.
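Here's a rough sketch of the arithmetic behind that 140 ms figure. The 40 ms per headset is the best-case AptX number above; the 60 ms for everything in between (network transit plus software processing) is my assumption, chosen so the total matches, not a measurement.

```python
# One-way latency budget for a call where both people use Bluetooth headsets.
APTX_MS = 40                  # best-case AptX codec delay per headset (from the text)
NETWORK_AND_SOFTWARE_MS = 60  # assumed: cross-country transit plus processing


def one_way_latency_ms(headset_ms: float, other_ms: float) -> float:
    """Speaker's headset + everything in between + listener's headset."""
    return headset_ms + other_ms + headset_ms


total = one_way_latency_ms(APTX_MS, NETWORK_AND_SOFTWARE_MS)
print(f"best case with AptX on both ends: {total} ms")
```

Swapping in SBC's worst case (`one_way_latency_ms(200, 60)`) shows how quickly the headsets alone can blow past any reasonable target.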
WiFi is harder to quantify since it can add relatively small delays, but the problem is that it's inconsistent (because of interference and being too busy). If audio packets show up inconsistently, the software needs to add buffers to keep everything showing up at the same time. I don't remember the details, but when I last worked on low-latency applications, Also if you chain WiFi routers together you get multiple channels of possible interference and a new layer where you can lose packets. I would expect coffee shop WiFi networks to be bad because they're frequently overloaded and have tons of interference (if they're in a dense area). Home WiFi might be ok in a low-density area.
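To make the buffering point concrete, here's a toy sketch of why inconsistent arrival times cost latency. The delay values are made up for illustration, and real jitter buffers adapt over time rather than using a fixed worst case.

```python
# Toy jitter-buffer sizing. The per-packet network delays (ms) are hypothetical.
delays_ms = [12, 15, 11, 48, 13, 14, 30, 12]


def required_buffer_ms(delays: list[float]) -> float:
    """Extra delay needed so even the slowest packet arrives before playback.

    Playback is scheduled relative to the fastest packet, so the buffer
    has to cover the gap between the fastest and slowest arrival.
    """
    return max(delays) - min(delays)


buffer_ms = required_buffer_ms(delays_ms)
print(f"buffer needed: {buffer_ms} ms on top of the {min(delays_ms)} ms base delay")
```

In this model a perfectly consistent link needs no buffer at all, which is why consistency matters more than the average delay.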
Gotcha, thanks for the investigation and info.
Yeah it seems plausible to me that sources recommending things like 100ms or 150ms latency are being conservative in a sense, and that there are meaningful gains to be had with lower latency.
And I definitely buy that high enough latency that leads to interruptions is annoying. As an anecdote, I've been listening to The Prancing Pony Podcast recently. The co-hosts interrupt each other unintentionally all the time, I suspect because of poor latency. It's really bad.
So with wifi, it sounds like you should be good if the routers are positioned such that there isn't much interference, and if there's plenty of capacity. Like at home my girlfriend and I don't have too many devices straining the router, but if we had a party with 15 people around then it'd be problematic?
As for coffee shops, I work from coffee shops a lot (but almost always avoid taking video calls there). A lot of them are pretty calm and don't have too many people there using the wifi. And the bigger ones with lots of people on their laptops, that's the demographic they're targeting so I suspect that they pay for good internet stuff. I've definitely been to coffee shops where the connection is bad though.
For WiFi, the biggest issue is that if two devices transmit at the same time, they'll interfere ("collide") and both packets will get dropped and need to be retransmitted. Unlike phone networks, on WiFi there's basically no coordination, so this interference is random and increases birthday-problem style as the network has more devices connected or has more traffic. There's an exponential random backoff protocol to prevent infinite interference, but exponential backoff means exponentially increasing latency.
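A minimal sketch of both effects, under heavy simplification: it assumes every device independently picks one of a fixed number of transmit slots, and that the backoff window roughly doubles after each failed attempt up to a cap. The slot count and the window sizes (15 and 1023) are illustrative assumptions, not a faithful model of any particular WiFi version.

```python
import math


def collision_probability(devices: int, slots: int) -> float:
    """Chance that at least two of `devices` pick the same one of `slots` slots."""
    if devices > slots:
        return 1.0
    p_all_distinct = math.prod((slots - i) / slots for i in range(devices))
    return 1.0 - p_all_distinct


def max_backoff_slots(retry: int, cw_min: int = 15, cw_max: int = 1023) -> int:
    """Upper bound of the random wait (in slots) after `retry` failed attempts.

    The window roughly doubles per retry until it hits the cap.
    """
    return min((cw_min + 1) * 2 ** retry - 1, cw_max)


for n in (2, 5, 10):
    print(n, "devices:", round(collision_probability(n, slots=16), 2))
print([max_backoff_slots(r) for r in range(8)])
```

The first loop shows the birthday-problem growth as devices are added; the second line shows how quickly the worst-case wait grows after repeated collisions.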
You can also get interference from devices connected to other WiFi networks on the same channel (so just being in a busy part of town or an apartment building can add significant interference).
WiFi's base speed is also limited to the slowest device on the channel, which has to do with the oldest supported protocol version, hardware, and distance. On a public network, you have a fairly high probability that at least one device is old and/or really far from the router, which drops the speed for everyone and makes the interference problem worse (since slower speed means each packet takes longer to send and therefore has more time when interference can disrupt it).
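As a back-of-the-envelope sketch of the "slower speed means more time on air" point: the packet size and link rates below are assumed example values, and the calculation ignores preambles, acknowledgements, and other per-packet overhead.

```python
def airtime_us(packet_bytes: int, rate_mbps: float) -> float:
    """Time on air for the payload alone, in microseconds."""
    # 1 Mbps is 1 bit per microsecond, so bits / Mbps gives microseconds.
    return packet_bytes * 8 / rate_mbps


PACKET_BYTES = 1500  # assumed packet size
for rate in (6, 54, 300):  # assumed link rates in Mbps
    print(f"{rate:>3} Mbps: {airtime_us(PACKET_BYTES, rate):.0f} us")
```

A packet sent at the slowest rate here occupies the channel 50 times longer than at the fastest one, which is 50 times more opportunity for another device to transmit over it.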
There's a lot of stuff that interacts, so it's possible to have 15 (or even more) people on calls on the same WiFi network, but you'd need:
Spaces that really care about this will use a bunch of high speed short-range access points (wired together) coupled with software to drop slow devices. It's common-ish at conference centers, but not coffee shops, and even then they're usually targeting acceptable latency/bandwidth for web browsing, not calls.
But yeah, in some cases a voice call on WiFi will work fine even with some other people on the network, but I wouldn't trust all of the necessary stars to align consistently on a public network.
Consider a separate microphone. I started using a Blue Snowball mic when I had a job making instructional videos, and stuck with it for casual videoconference use too. The audio is much clearer than a headset mic or webcam built-in mic.
Re: your headphones
I don't know much about non-headset mics. I don't like them because they pick up background/room noise while a mic right in front of your face can filter to just your voice better. I imagine some of them sound fine in a quiet room though.
My guess is that your headphones just don't have a good mic. I'm picky about my headset since most mics are an afterthought.
I've been working remotely since before it was cool, and one thing I wish more people paid attention to is meeting equipment. It's annoyingly common to join a remote meeting with someone on flaky WiFi, with a barely-understandable microphone, and a camera where they show up as a shadowy blob.
All of this is fixable, and if you work remotely it's worth spending a little bit of money to do it. Remote meetings where you can see and (more importantly) hear each other clearly are much nicer, and lead to more natural and collaborative conversations.
Since the fundamentals of lighting and audio technology haven't changed in a long time, I'll mostly be suggesting ancient used equipment to save money. There's probably nicer current-gen equipment if you go looking for it, but it's not really necessary.
Stop taking calls on crappy coffee shop WiFi networks. Just stop!
For a natural conversation, you need extremely low latency[1] and for everything to transmit perfectly with no garbling[2]. WiFi performs poorly as the number of networks and devices in an area goes up, and public networks are rarely optimized for low latency.
The best network option is to plug your laptop into ethernet wired directly to a fiber or cable modem. I do this when I'm working at my desk.
Admittedly, this is annoying, so getting a high-quality WiFi router is also an option. There are two rules here:
There are WiFi routers with range equivalent to multiple cheap routers chained together, so if you need long-range I recommend that. You can also set up multiple access points wired together if necessary.
I have a TP-Link AX6000 from 2020 (~$55 on ebay), although it's likely overkill for most people (it's a relatively fancy WiFi 6 router). The bare minimum is a router that supports 802.11ac ("WiFi 5") and has sufficient range. WiFi 5 came out in 2013 so there are good deals to be had here if you don't care about the latest tech.
Headphones make way more of a difference than you'd expect, especially your microphone. This is hard to notice since you usually don't hear yourself, but consider how hard it is to understand your coworkers sometimes. Your audio also sounds that bad to them.
The two things you're looking for to improve this are a high-quality microphone and a connection that doesn't ruin it. I'm not going to talk about playback quality on your side, since every headset on the market is good enough for meetings, and you'll know if you want audiophile-quality headphones.
Most headset reviews don't talk about microphones, but the heuristic is that gaming headsets tend to have good microphones, since gamers care about their teammates being able to hear them well.
I recommend a headset with a microphone instead of a podcasting-style microphone. Bigger/fancier microphones are good for top-quality recording, and work much better if you need to pick up multiple people, but they're harder to set up. A microphone on your desk[3] will loudly pick up typing, and it's less convenient if you want to move around.
One advantage of a separate mic is if you want to use bluetooth headphones: Bluetooth headsets are complete garbage, but in headphone-only mode they're fine.
Bluetooth audio quality has improved a lot in the last decade, but unfortunately that's mostly unidirectional audio. The bidirectional codecs are a mess.
This means if you want to listen to music on bluetooth headphones, it will usually sound great, but once you enable your mic, playback quality will drop and your microphone will sound terrible, no matter how good it is physically. Also, even "low latency" bluetooth codecs like AptX add 40+ ms of latency[4].
Allegedly this has improved somewhat over the last few years, and if you're lucky you might be able to find a bluetooth headset + computer + operating system combination that uses a better bidirectional codec, but in my opinion it's still easier to use a wired headset. A wire also has no interference, so you'll get latency and quality that are impossible to achieve wirelessly.
I have the Sennheiser EPOS Game ONE Gaming Headset that a previous employer bought me and then let me keep when I left. I got mine in 2016, but it's a wired headset so nothing important has changed in headset technology since then. You can get these for $25 on ebay now.
My advice here is out of date, so there might be acceptable bluetooth headsets (assuming you pick based on compatibility with your OS), but I don't know what they are. Good luck if you look for that.
Finally, we get to the somewhat less important things. Looking nice on a call is less important than being understood, but if you're going to be on calls all the time, it's worth improving your lighting. You can trick people into liking you more if you look nice.
The two main problems with video call lighting are:
This costs nothing. Just turn the lights in your room on. Cameras need light to work.
Unfortunately, room lights are usually in the wrong place to light your face. Ceiling lights point down (causing your face to be shadowed) and people usually don't put their desk directly against a window, so window light either causes the entire front of your body to be shadowed (if it's behind you) or to have harsh shadows from one side.
You can partially solve this with More Dakka. I have 14,000 lumens of light around the ceiling in my office, so I actually look ok with just my ceiling lights.
But I look even better if I add a light in front of me.
And it's even more drastic if you have bad room lighting.
Any light that's not behind or above you will help. Putting a lamp in either corner of your room next to your desk will help. The best option is a key light mounted on your monitor, although you may find it annoying if you don't like light in your face.
I have the Elgato Key Light Neo ($50 on ebay), although honestly it's overkill. Just put any light on or near your desk.
This is what prompted me to write this article. Surprisingly, almost all webcams are trash. The Logitech C920 was most sites' top recommended webcam for years, and look at those pictures above. They're ok, but I wouldn't call them good. And that's a top rated webcam. Most webcams have tiny sensors and barely work better than the flip phone camera I had in high school.
That said, most webcams will look good enough if you have good lighting.
I swear Insta360 didn't pay me to write this, but their Insta360 Link webcam is just so good. Compare the "good shot" from the C920 with a shot taken by an Insta360 Link in the dark.
Sure, my skin is still a weird pink tone, but that image is still surprisingly good. I had to dig my old C920 out for the lighting section because all of the examples looked fine with my new camera.
And if I actually give it some light to work with...
This thing does cost $120 on ebay though, and it's not really that important, so if you're going to skip anything, skip this.
Some other options are:
Fix your WiFi, get a decent mic, and turn some lights on, and your coworkers will love you. Maybe get a fancy camera if you're vain like I am.
Then get back to work.
[1] One-way latency of 150 ms is "acceptable" on a call, in the sense that customers won't complain about it, but latency starts to impact the flow and interactive feel of a call before that point.
I plan to write a whole post about this, but I'd target sub-100 ms latency for natural/interactive conversation, and I suspect there are further (smaller) benefits to latency reduction past that point.
For comparison, musicians get thrown off by ~30 ms latency, and an in-person conversation has ~10 ms of latency.
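As a quick sanity check on the in-person figure, here's a sketch that converts distance to acoustic delay. The speed of sound used (343 m/s) is the usual approximation for room-temperature air, and the distances are just example values.

```python
SPEED_OF_SOUND_M_PER_S = 343  # approximate, dry air at about 20 °C


def acoustic_delay_ms(distance_m: float) -> float:
    """Time for sound to travel `distance_m` through air, in milliseconds."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000


for d in (1, 3.4, 7):
    print(f"{d} m: {acoustic_delay_ms(d):.1f} ms")
```

So ~10 ms corresponds to standing a few meters apart, and the ~30 ms that bothers musicians is roughly the delay of playing with someone ten meters away.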
[2] The latency requirements for calls are too short to retransmit dropped packets, so any packet loss means loss of clarity.
There are some tricks to fill in lost packets, but using them introduces more latency.
[3] You can use a boom arm to fix this, but they're huge and get in the way.
[4] AptX adds 40-60 ms of latency. The older SBC codec (which gets used surprisingly often) adds up to 200 ms of latency.