I am told that unsupervised machine translation is a thing. This is amazing. I ask: Could we use it to understand dolphin language? (Or whales, perhaps?)

I don't currently see a convincing reason why not. Maybe dolphins aren't actually that smart or communicative and their clicks are mostly just very simple commands or requests, but that should just make it really easy to do this. Maybe the blocker is that dolphins have such a different set of concepts than English that it would be too hard?



The approach of the linked article tries to match words that mean the same thing across languages. It builds a separate vector embedding for each language's corpus, then looks for structural (neighborhood) similarity between the two embeddings, with an extra global 'rotation' step that maps one vector space onto the other.

So if both languages have a word for "cat", and many other words related to cats, and the relationship between these words is the same in both languages (e.g. 'cat' is close to 'dog' in a different way than it is close to 'food'), then these words can be successfully translated.
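To make the 'rotation' step concrete, here is a minimal sketch using orthogonal Procrustes alignment in NumPy. It rests on loud assumptions: the embeddings are random toy vectors, the second "language" is constructed as an exact rotation of the first, and a few matched seed pairs are taken as already known. That last assumption is precisely what the unsupervised setting lacks, so this illustrates only the mapping step and does not reproduce the linked article's procedure. The names `procrustes_rotation` and `translate` are hypothetical.

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Return the orthogonal matrix W minimizing ||X @ W - Y|| (Frobenius norm).

    X, Y: arrays of shape (n_pairs, dim); row i of X is ASSUMED to be the
    embedding of a word whose translation has embedding Y[i].
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def translate(word_vec, W, target_vecs, target_words):
    """Map one source vector into the target space, return the nearest target word."""
    mapped = word_vec @ W
    sims = (target_vecs @ mapped) / (
        np.linalg.norm(target_vecs, axis=1) * np.linalg.norm(mapped)
    )
    return target_words[int(np.argmax(sims))]

# Toy data (an assumption, not real embeddings): language B is an exact
# rotation of language A, i.e. the relationships between words are identical.
rng = np.random.default_rng(0)
words_a = ["cat", "dog", "food", "fish", "water"]
words_b = ["chat", "chien", "nourriture", "poisson", "eau"]
emb_a = rng.normal(size=(5, 4))
true_rot, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
emb_b = emb_a @ true_rot

# Fit the rotation on four seed pairs, then translate the held-out fifth word.
W = procrustes_rotation(emb_a[:4], emb_b[:4])
guess = translate(emb_a[4], W, emb_b, words_b)
print(guess)
```

The toy case works only because the two spaces share exactly the same relational structure. Once one vocabulary has words with no counterpart in the other, no single rotation lines everything up.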

But if one language has a tiny vocabulary compared to the other one, and the vocabulary isn't even a subset of the other language's (dolphins don't talk about cats), then you can't get far. Unless you have an English training dataset that only uses words that do have translations in Dolphin. But we don't know what dolphins talk about, so we can't build this dataset.

Also, this is machine learning on text with distinct words; do we even have a 'separate words' parser for dolphin signals?

So the blocker I mentioned. OK, thanks. Well, maybe we could make a translator between whales and dolphins then.

Or we could make a translator between a corpus of scuba diver conversations and dolphins.

We might be able to parse dolphin signals into separate words using ordinary unsupervised learning, no?

Why does the relative size of the vocabularies matter? I'd guess it would be irrelevant, the main factor would be how much overlap the two languages have. Maybe the absolute (as opposed to relative) sizes would matter.

If Language A and Language B have word embeddings that partially overlap and partially don't, that doesn't necessarily mean it's impossible to match the part that does overlap. After all, that always happens to some extent, even between English and French (i.e. not every English word corresponds to a single French word or vice-versa), but the matching is still possible. It would obviously be a much much more extreme non-overlap for English vs dolphin, and that certainly makes it less likely to work, but doesn't prove it impossible. (It might require changi

...
I think the disparity in number of words is proportionally so large that this method won't work. The (small) hypothetical set of dolphin words wouldn't match to a small subset of English words, because what's being matched is really the (embedded) structure of the relationships between the words, and any sufficiently small subset of English words loses most of its interesting structure, because its 'real' structure relates it to many words outside that subset. Suppose that dolphins (hypothetically! counterfactually! not realistically!) use only 10 words to talk about fish, but humans use 100 words to do the same. I expect you can't match the relationship structure of the 10 dolphin words to the much more complex structure of the 100 human words. And no subset of ~10 English words out of the 100 is a meaningful subset that humans could use to talk about fish.
3 Daniel Kokotajlo 3y
Thanks, I found that explanation very helpful.
