In the bucket brigade singing project I talked about yesterday, the server is Python. We wrote it in a very straightforward "it won't be fast but at least it's correct" style. In simple stress testing, I found that it could handle about 10 clients, which is definitely not enough.

My first thought was to switch it to C with libmicrohttpd, and for a program which is just a web interface to a circular buffer C isn't terrible. But how far can we get if we do the hard stuff in Numpy?

In the initial version there were four parts that did work in proportion to either the input or the whole buffer:

  • Clearing the buffer: this is a circular buffer, and as we wrap around we want to be following the lead singer. We should not hear things from the last time around. So we need to zero out the buffer ahead of the lead person.

  • Incoming audio: we receive a large block of samples from the network, and need to sum them with what's already there at the right part of the circular buffer.

  • Outgoing audio: we need to send a section of the buffer out to the client to play.

  • Computing metadata: We draw a debugging chart at the bottom of the page, and it needs to know what parts of the queue have audio data. We send information about which frames (128 samples) are non-zero.
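
The four operations above can be sketched in numpy along these lines. This is an illustrative sketch, not the project's actual code: the names (`BUFFER_LEN`, `FRAME_SIZE`, `buf`) and the index-array approach to wraparound are my assumptions.

```python
import numpy as np

# Hypothetical circular-buffer parameters, for illustration only.
BUFFER_LEN = 1024 * 1024   # samples in the circular buffer
FRAME_SIZE = 128           # samples per metadata frame
buf = np.zeros(BUFFER_LEN, dtype=np.float32)

def clear_ahead(pos, n):
    # Zero out n samples ahead of pos, wrapping around the end.
    idx = (pos + np.arange(n)) % BUFFER_LEN
    buf[idx] = 0

def mix_in(pos, samples):
    # Sum incoming samples into the buffer at pos, with wraparound.
    idx = (pos + np.arange(len(samples))) % BUFFER_LEN
    buf[idx] += samples

def read_out(pos, n):
    # Slice n samples out of the buffer starting at pos.
    idx = (pos + np.arange(n)) % BUFFER_LEN
    return buf[idx]

def frame_metadata():
    # Which 128-sample frames contain any nonzero audio?
    return np.any(buf.reshape(-1, FRAME_SIZE) != 0, axis=1)
```

Each helper is a single vectorized numpy operation, so none of them drops back into a per-sample Python loop.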

If this were a large project, I would start with profiling, but this was small enough that I was pretty sure we would just want to convert everything to an efficient implementation. David and I pair-programmed for a couple hours and turned the naive code into numpy-based code.

The refactor brought a single request from 218ms to 74ms. This was ok, but still not great. The problem was that computing metadata was still very slow. The other three operations do work in proportion to the size of a request, but the metadata one does work in proportion to the size of the queue. Worse, it involved a Python loop over 128-sample frames.
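
To see why that loop hurts, here is an illustrative contrast (again not the project's actual code, and `FRAME` and `queue` are made-up names): checking frames one at a time in Python versus a single vectorized reduction.

```python
import numpy as np

FRAME = 128
queue = np.zeros(FRAME * 1000, dtype=np.float32)
queue[FRAME * 5 : FRAME * 5 + 10] = 0.5  # some audio in frame 5

def metadata_loop(q):
    # Python-level loop: one numpy call per 128-sample frame.
    return [bool(np.any(q[i : i + FRAME])) for i in range(0, len(q), FRAME)]

def metadata_vectorized(q):
    # One reshape plus one reduction over the whole queue.
    return np.any(q.reshape(-1, FRAME) != 0, axis=1)

assert metadata_loop(queue) == metadata_vectorized(queue).tolist()
```

Both compute the same answer; the vectorized version just does the per-frame work inside numpy instead of in the interpreter.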

Since the metadata was only used to populate a debugging chart, I measured the effect of removing it. In the pure Python version this brought us from 218ms to 23ms, while with Numpy it brought us from 74ms to 0.06ms. I won't credit Numpy with the effect of removing the chart, but 23 ms to 0.06 ms is still an excellent 383x speedup.

I doubt this is enough to move us from 10 clients to 3830 clients, but it helps enough that I expect encoding and decoding Opus to be the major factor, once we get that added.

Comment via: facebook


If you make a tree of users based on location, then some users can act as hubs, collecting data from nearby users and passing it on.

If you store the data in terms of a sparsely encoded FFT, then you don't need to reverse the FFT to add the signals (the FFT is a linear transformation).

Suppose you have an array of length 128 representing sound data. You Fast Fourier Transform it (use numpy.fft.fft). In this format most of the values are about 0, since the signal consists mostly of a few frequencies. You bin each value into one of a finite number of bins, and then use Huffman coding.

To add multiple signals, just remove the Huffman coding, add the signals, and re-encode; there's no need to undo the FFT.
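
The linearity claim here is easy to check directly: summing two signals' transforms gives the same result as transforming their sum.

```python
import numpy as np

# Check FFT linearity: fft(a) + fft(b) == fft(a + b).
rng = np.random.default_rng(0)
a = rng.standard_normal(128)
b = rng.standard_normal(128)

summed_after = np.fft.fft(a) + np.fft.fft(b)
summed_before = np.fft.fft(a + b)
assert np.allclose(summed_after, summed_before)
```

So a server holding frequency-domain representations can mix streams without ever inverse-transforming them.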

Asking some of the clients to do this means that your server no longer needs to combine 10 data streams. (And doesn't need 10 audio streams worth of bandwidth) Do it right, and the server could just share code and IP addresses.

The goal here is for this to feel like group singing as much as possible. I think in your proposal there will be many pairs of users where a doesn't hear b and b also doesn't hear a? And not just because they're singing at the same wall-clock time?

I also don't see why you're proposing the FFT? Yes, that lets you accumulate frequency information, but you lose timing information in the process. Since our goal is to transmit full audio, I'm not sure how you see the FFT fitting in.

The idea involved everyone hearing everyone. If you have 6 singers, arranged into 2 groups of 3, then each group of 3 can have one person combine 3 streams of audio into one, and then send that off to combine into a single stream. Instead of Alice sending Alice's singing to the server, and Bob sending Bob's, Alice can send her singing to Bob, and Bob can send (Alice + Bob) to the server.

The idea with Fourier transforms was a form of compression in which it was quick to sum 2 compressed signals into a single compressed signal.

Why are you putting your six singers into two groups of three? The ideal, from the perspective of everyone hearing as many people as possible, is to order your singers a, b, c, d, e, f. Each person hears the audio from those ahead of them. If you have really very large numbers of people, such that arranging them in a full chain gives an end to end latency that is too high, then you can use some sort of chain of groups, for example a, b + c, d + e, f + g.

If you have any sort of chain that is reasonably long, then you want to be resilient to losing a link. That's much easier to do when you have a server that everyone is sending and receiving audio from. Our current design can recover smoothly from someone having a network hiccup because all that happens is you lose a bit of audio data and then resume. Key to this is that people downstream from the one having a network problem don't have their audio interrupted, beyond losing audio from the person who is no longer connected.

In theory a peer to peer approach could offer slightly lower latency, but I expect the gain there is minimal. Sending a packet from a to b, versus sending a packet from a to a high-connectivity well-placed central server to b, isn't actually that different.

With the FFT, I think you may be effectively reinventing lossy audio compression? I think we'll likely get much better results using opus or another modern codec.

Everyone hears audio from everyone.

Suppose you have loads of singers. The task of averaging all the signals together may be too much for any one computer, or might require too much bandwidth.

So you split the task of averaging into 3 parts.

np.sum(a) == np.sum(a[:x]) + np.sum(a[x:])

One computer can average the signals from alice and bob, a second can average the signals from carol and dave. These both send their signals to a third computer, which averages the two signals into a combination of all 4 singers, and sends everyone the results.
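
The two-stage mixing tree described here can be sketched as follows; the names (alice, bob, hub1, and so on) are illustrative stand-ins for the streams and intermediate computers.

```python
import numpy as np

# Hypothetical two-stage mixing tree: two hubs each mix a pair of
# streams, and a third computer combines the partial mixes.
rng = np.random.default_rng(1)
alice, bob, carol, dave = (rng.standard_normal(128) for _ in range(4))

hub1 = alice + bob     # first intermediate computer
hub2 = carol + dave    # second intermediate computer
final_mix = hub1 + hub2

# Identical to mixing all four streams on a single server.
assert np.allclose(final_mix, alice + bob + carol + dave)
```

Because summation is associative, the tree can be rebalanced or deepened without changing the final mix.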

sends everyone the results

I think that's where you're imagining this differently than I am. In the approach I am describing, everything is real time. The only time you hear something is when you're singing along to it. You never hear a version of the audio that includes your own voice, or the voices of anyone after you in the chain. The goal is not to create something that everyone listens back to; the goal is to sing together in the moment.

By "send everyone the results" I was thinking of doing this with a block of audio lasting a few milliseconds.

Everyone hears everyone's voices with a few milliseconds of delay.

If you want not to echo people's own voices, then keep track of the timestamps; every computer can subtract its own signal from the total.

The amount of latency, even in an extremely efficient implementation, will be high enough to keep that approach from working. Unless everyone has a very low latency audio setup (roughly the default on macs, somewhat difficult elsewhere, impossible on Android), a wired internet connection, and relatively low physical distance, you just can't get low enough latency to keep everything feeling simultaneous. A good target there is about 30ms.

The goal with this project is to make something feel like group singing, even though people are not actually singing at the exact same time as each other.

I'm not sure if it's easier than a full rewrite, but another option is to write a C Python module to do the slow piece (maybe write a function to do the metadata processing in C?):