This analysis seems right to me, but I think you're wrong to compare the sperm whales to Koko. It seems a priori much more likely that a given animal's sounds carry meaning than that any given animal can be taught to use human language.
Whale noises must communicate some information, or there would be no selection pressure to make them, so we're really just haggling over the information density. By the same logic, if gorillas and chimps were capable of learning complex sign language for communication, we'd expect them to evolve or culturally develop such a language on their own.
(Though whales might just be saying "I am large and over here" so the information density may be very low)
This is a great point with regard to how plausible the various scientific animal-communication claims actually are, but to my eye, the CETI people look motivated in the same way as the Koko people or your local crazy cat lady.
It seems to me that there's a strong case that the sounds are audibly distinguishable and that they can't be formed by high-speed muscle movements, but I'm not convinced that they couldn't be volitional via another mechanism such as beamforming or other control of interference patterns. Humans can do overtone/throat singing with some practice, for example, and I can imagine something similar plus a click...
I was at a party a few years ago. It was a bunch of technical nerds. Somehow the conversation drifted to human communication with animals, Alex the grey parrot, and the famous Koko the gorilla. It wasn't in SF, so there had been cocktails, and one of the nerds (it wasn’t me) sort of cautiously asked “You guys know that stuff is completely made up, right?”
He was cautious, I think, because people are extremely at ease imputing human motives and abilities to pets, cute animals, and famous gorillas, and simultaneously extremely uneasy about casting scientific shade on work that has so completely penetrated popular culture and science communication. People want to believe that even if dogs and gorillas can't actually speak, they have some intimate rapport with human language. If there's a crazy cat lady at the party, it doesn't pay to suggest she's wrong to think Rufus knows or cares what she's saying.
With the advent of AI, the non-profit Project CETI was founded in 2020 with a charter mission of understanding sperm whale communications, and perhaps even communicating with the whales ourselves. Late last year, an allied group of researchers published Begus et al.: “Vowel- and Diphthong-Like Spectral Patterns in Sperm Whale Codas”.
The paper takes a novel approach. Instead of analyzing click counts, durations, and other straightforward features, it uses the spectral properties of sperm whale codas, sequences of clicks used in social settings, to get at another potential dimension of whale communication. And they provide actual code and data!
Quick Background
Practically all cetaceans make sounds. Humpback whale song is thought to be part of ritualistic courtship. Many species, including sperm whales, use echolocation in the same manner as bats, but the sperm whale coda is something different. Each coda is composed of a sequence of distinct clicks. There are coda types of varying click counts, and interestingly the coda types seem to be tied to matrilineal whale families or "clans". You can listen to a sperm whale bout here.
In the whale language research world, a "dialogue" between whales is made up of "bouts" between whales, a "bout" is made up of a sequence of "codas", a "coda" is made up of distinct "clicks", and according to these authors' analysis the clicks can take on different flavors, which they choose to call "vowels". The vowel types are "a" and "i", which are strictly a naming convention and have nothing to do with human vowels beyond that.
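The hierarchy above can be sketched as a toy data model. This is my own illustration, not the researchers' schema; all class and field names are mine.

```python
# Hedged sketch of the terminology hierarchy as a toy data model. A
# dialogue holds bouts, a bout holds codas, a coda holds clicks, and each
# click carries an "a"/"i" vowel label. Field names are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Click:
    time_s: float   # onset time within the coda, seconds
    vowel: str      # "a" (one spectral peak) or "i" (two or more)

@dataclass
class Coda:
    clicks: List[Click] = field(default_factory=list)

@dataclass
class Bout:
    whale_id: str
    codas: List[Coda] = field(default_factory=list)

@dataclass
class Dialogue:
    bouts: List[Bout] = field(default_factory=list)
```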
The bulk of the data comes from 14 sperm whales they've managed to individually bug with listening devices. They're bugging the whales! The Great Stagnation is over!
There’s a lot of data to go through, and I made a shiny app to help myself understand what’s going on here better. I’ve deployed the app here if you want to try it out. You can see the various spectral peaks at different frequencies and the authors’ preferred vowel identification per click and per coda.
The Vowels
When I started reading this paper, I was sure the authors were speaking metaphorically. Surely they're not suggesting these spectral differences, captured by fast Fourier transforms, actually constitute different vowel-like sounds analogous to what humans can willfully articulate, right?
But they are! They state:
And they go further, naming clicks with one distinct spectral peak "a" and those with two or more spectral peaks "i".
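The classification scheme is simple enough to sketch. This is a minimal illustration of counting spectral peaks in a single click, not the authors' actual pipeline; the threshold and separation values are placeholder hyperparameters, not values from the paper.

```python
# Hedged sketch: classify one click as "a" (one spectral peak) or "i"
# (two or more) from an FFT magnitude spectrum plus simple peak picking.
# min_height_frac and min_sep_hz are illustrative hyperparameters.
import numpy as np
from scipy.signal import find_peaks

def classify_click(click, fs, min_height_frac=0.3, min_sep_hz=1000.0):
    """Return 'a' or 'i' from the number of spectral peaks in one click."""
    spectrum = np.abs(np.fft.rfft(click))
    freqs = np.fft.rfftfreq(len(click), d=1.0 / fs)
    df = freqs[1] - freqs[0]                        # frequency resolution
    peaks, _ = find_peaks(
        spectrum,
        height=min_height_frac * spectrum.max(),    # candidate-peak floor
        distance=max(1, int(min_sep_hz / df)),      # min spacing, in bins
    )
    return "a" if len(peaks) == 1 else "i"
```

On synthetic clicks, a windowed single tone comes out "a" and a two-tone mixture comes out "i", which is the shape of the decision rule the paper describes.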
I read this with an incredibly skeptical eye, but these patterns hold both across coda types and across different whales.
I noticed that the detected spectral peaks at the click level depend on several important hyperparameters, like the minimum height of a candidate peak and the minimum separation between peaks. Determined to show the analysis could not possibly be robust to changes in those hyperparameters, I ran an extensive grid search across different values and was horrified to discover that the results were surprisingly consistent across such methodological changes.
In more than half the hyperparameter choices, 90% or more of the peak counts matched the authors'. And when I scrolled through all the spectral data by hand, there really did appear to be two kinds of clicks, "a" and "i", that a regular person would label differently.
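A robustness check of this kind can be sketched in a few lines. This is my own stand-in, not the grid search I actually ran: re-run the same peak counter over a grid of hyperparameter settings and measure how often each setting agrees with a baseline. The grids and baseline values here are illustrative.

```python
# Hedged sketch of a hyperparameter robustness check: count spectral peaks
# under many (height, distance) settings and report the fraction of clicks
# whose peak count matches a baseline setting. All values are stand-ins.
import itertools
import numpy as np
from scipy.signal import find_peaks

def count_peaks(spectrum, height_frac, distance_bins):
    peaks, _ = find_peaks(spectrum,
                          height=height_frac * spectrum.max(),
                          distance=distance_bins)
    return len(peaks)

def agreement_grid(spectra, height_fracs, distances, baseline=(0.3, 5)):
    """For each (height_frac, distance) pair, the fraction of spectra whose
    peak count agrees with the baseline setting."""
    base = [count_peaks(s, *baseline) for s in spectra]
    results = {}
    for hf, d in itertools.product(height_fracs, distances):
        counts = [count_peaks(s, hf, d) for s in spectra]
        results[(hf, d)] = float(np.mean([c == b for c, b in zip(counts, base)]))
    return results
```

If the "a"/"i" split were an artifact of one particular threshold choice, the agreement fractions would collapse as you move away from the baseline; on these data they largely don't.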
What in the name of Captain Ahab’s prosthetic is going on here? Can sperm whales really control the sounds induced by their phonic lips in the way the authors mean?
Articulatory Control
The reason the paper is incorrect comes down to biological plausibility, a very careful look at the data, and a little math. I'm reasonably certain the described coda vowel patterns are in fact physical artifacts, not volitional whale utterances.
In their discussion section, Begus et al. point out
“Articulatory control” is a term of art in linguistics describing how specific motor movements produce intentional sounds. A source produces raw acoustic energy, and that energy passes through a filter that shapes it. Obviously, many animals have this ability. I was a birding dork growing up, and a bit of a math dork if you can believe it, and actually read large parts of Mindlin and Laje’s wonderful monograph The Physics of Birdsong in college. I revisited it to see how plausible the articulatory-control proposition in sperm whales actually is.
In these whales, the acoustic pulse source for the codas is the phonic lips, which sit near the blowhole. Nasal air flaps them open and closed quickly, forming the acoustic pulse. This single pulse is filtered and shaped as it travels through to the back of the whale’s head, reverberating off the distal air sac, and the sound you hear exits as a wavefront in the water in front of the whale. The sac, the spermaceti organ, and the rest of the whale’s head form the filter.
Birds also have such a source-filter system, and so do humans. The human system is astonishing, and we very clearly have the most sophisticated articulatory control of any animal, but that of songbirds is in some sense much more impressive. Mindlin and Laje cite Elemans et al. 2004, entitled Bird song: Superfast muscles control dove’s trill, which states
The problem here is that each click in the sperm whale coda data showing multiple spectral peaks lasts 5 ms at most. Below is a typical “i”-type vowel 3 + 1 + 1 coda consisting of 5 clicks.
However the whale filtered the pulse from the phonic lips, doing so would require movements on time scales much, much smaller than those of even the fastest known acoustic control systems in vertebrates. That is not biologically plausible, probably even if the articulatory control in question is at the coda level rather than the click level. The whales are not in control of these clicks in the way the authors suppose.
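The arithmetic here is simple enough to write out. Two loudly assumed numbers: I take ~250 Hz as a generous upper bound on the cycling rate of "superfast" vertebrate muscles (the dove syringeal muscles of Elemans et al. are in this regime), and four independent articulatory gestures per click as the minimum you'd need to sculpt spectral structure within one click. Neither figure is from the paper.

```python
# Back-of-envelope timescale check, with assumed numbers (see lead-in).
SUPERFAST_MUSCLE_HZ = 250.0                  # assumed upper bound, cycles/s
fastest_cycle_s = 1.0 / SUPERFAST_MUSCLE_HZ  # 4 ms for one full cycle

CLICK_DURATION_S = 0.005                     # clicks last at most 5 ms

# Shaping distinct spectral structure within one click would need several
# filter movements inside the click, so each gesture must fit in a
# fraction of the click duration.
GESTURES_PER_CLICK = 4                       # assumed minimum
required_movement_s = CLICK_DURATION_S / GESTURES_PER_CLICK   # 1.25 ms

# One full cycle of the fastest known muscle (~4 ms) barely fits inside a
# single click, let alone several independent gestures (~1.25 ms each).
print(fastest_cycle_s, required_movement_s)
```

Under these assumptions the fastest known muscular control is roughly 3x too slow for within-click articulation, before even asking whether sperm whales possess such muscles.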
What’s actually going on here?
The multiple-peak pattern is real, but the argument above shows it is clearly not under the whale’s volitional control. Some pretty decent clues surface in the 22% of codas under study whose constituent clicks have different spectral signatures. Look at this one.
The authors label this one an “i” coda, but you can see how close the secondary peak sits to the primary peak across clicks. These intermediate codas suggest there is really only a single peak at a fixed frequency, and that the secondary peak is a beaming artifact.
Look at the cartoon in Figure 4 above. As the acoustic energy bounces off the air sac and flows through the whale’s head, the “i”-type vowels you’re seeing are simply an interference pattern: the broadband impulse originating at the phonic lips exits the head along both a direct path and a delayed path reflected off the distal air sac. There’s a pretty detailed Wikipedia page documenting this phenomenon.
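The two-path story can be demonstrated in a few lines of numerics. This is a toy model, not the whale's actual geometry: a broadband pulse plus a delayed, attenuated copy of itself (standing in for the air-sac reflection) produces a comb-filtered spectrum whose extra peaks are spaced 1/delay apart. The sample rate, delay, and attenuation are all made-up illustrative values.

```python
# Toy two-path interference demo: a broadband click plus its own delayed
# echo yields a spectrum with multiple peaks spaced 1/delay apart,
# with no articulation anywhere. All parameters are illustrative.
import numpy as np

fs = 192_000                      # sample rate, Hz (illustrative)
delay_s = 0.00025                 # assumed reflected-path delay: 0.25 ms
t = np.arange(int(0.005 * fs)) / fs          # one 5 ms analysis window

# Broadband "click": a very narrow Gaussian pulse near the window start.
pulse = np.exp(-0.5 * ((t - 0.0005) / 2e-5) ** 2)

delay_n = int(delay_s * fs)
echo = np.zeros_like(pulse)
echo[delay_n:] = 0.7 * pulse[:-delay_n]      # attenuated reflection

direct_spec = np.abs(np.fft.rfft(pulse))          # smooth, single hump
combined_spec = np.abs(np.fft.rfft(pulse + echo)) # comb-filtered

# The echo carves periodic notches into the smooth spectrum, so a peak
# picker now sees several peaks spaced 1/delay apart.
print(1.0 / delay_s)  # expected comb spacing in Hz -> 4000.0
```

With a 0.25 ms delay the notches land at 2 kHz, 6 kHz, 10 kHz and so on, so the direct spectrum's single hump splits into multiple apparent "peaks" of exactly the kind a click-level peak counter would score as "i".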
There’s another piece of data-driven evidence for this interpretation in Figure 2 above. Notice how the top peaks of the “i” vowels and the peaks of the “a” vowels all cluster around 6 kHz. There’s no particular reason to expect this unless the second, lower-frequency “i” peaks are reflective artifacts derived from the higher-frequency peaks, which also make up all the “a” peaks.
Why is this happening? The authors argue briefly that whale pitch, hydrophone placement, and depth are unlikely to produce artifacts like this one, but I didn’t find the argument very persuasive.
Conclusion
People at CETI: I’m on your side! It would be amazing to talk to the whales. I want us to have to answer for the crimes of the whaling industry in a tribunal with sperm whale prosecutors. This project needs much more data and a very disciplined approach, which I hope they have the conviction to undertake. However, I fear they’ve fallen into a classic failure mode: endowing animals with human abilities.