Human minds form various abstractions over our environment. These abstractions are sometimes fuzzy (too large to fit into working memory) or leaky (they can fail).
Mathematics is the study of what happens when your abstractions are completely non-fuzzy (always fit in working memory) and completely non-leaky (never fail). And also the study of which abstractions can do that.
I think this is a good metaphor, but note that it is still very possible to be a dick, hurt other people, etc., while communicating in NVC style. It's not a silver bullet, because nothing is.
It might be important for AI strategy to track approximately how many people have daily interactions with AI boyfriends / girlfriends. Or, more generally, how many people place a lot of emotional weight and trust in AIs (& which ones they trust, and on what topics).
This could be a major vector through which AIs influence politics, get followers to do things for them, and generally cross the major barrier of being unable to act on the world directly. The AIs could be misaligned and scheming, or could be acting as tools of some scheme-y humans, or somewhere in between.
(Here I'm talking about AIs which have many powerful capabilities but aren't able to act on the world themselves, e.g. via nanotechnology or robot bodies. This might happen for a variety of reasons.)
If any post ever deserved the "World modeling" tag it's this one.
LLMs are trained on a human-generated text corpus. Imagine an LLM agent deciding whether or not to be a communist. It seems likely (though not certain) that it would be strongly influenced by the existing human literature on communism, i.e. all the text humans have produced about communism, arguing its pros and cons and describing its empirical consequences.
Now replace 'communism' with 'plans to take over.' Humans have also produced a literature on this topic. Shouldn't we expect that literature to strongly influence LLM-based decisions on whether to take over?
That is the argument I'm more confident in. Now for one I'm less confident in.
'The literature' would seem to have a stronger effect on LLMs than it does on humans. Their knowledge is more crystallized and less abstract, more like "learning to play the persona that does X" than "learning to do X." So maybe even by the time LLM agents have other powerful capabilities, their thinking on whether to take over will still be very 'stuck in its ways', i.e. very dependent on 'traditional' ideas from the human text corpus. They might not be able to see beyond the abstractions and arguments used in human discourse, even if those are flawed.
Per both arguments, if you're training an LLM, you might want to care a lot about what its training data says about the topic of AIs taking over from humans.
A Berkeley professor speculates that LLMs are doing something more like "copying the human mind" than "learning from the world." This seems like it would explain some things we already see (e.g., fuzzily, that they're "not very creative"), and it seems like it would have nontrivial implications for what we should expect from LLMs in the future, though I'm finding it hard to make this concrete.
That is, if LLMs are trained with a simple algorithm and acquire functionality that resembles that of the mind, then their underlying algorithm should also resemble the algorithm by which the mind acquires its functionality. However, there is one very different alternative explanation: instead of acquiring its capabilities by observing the world in the same way as humans, LLMs might acquire their capabilities by observing the human mind and copying its function. Instead of implementing a learning process that can learn how the world works, they implement an incredibly indirect process for scanning human brains to construct a crude copy of human cognitive processes.
If AI turns out to be very useful for cheaply writing formally verified code, what does that do for AI control? We could then require the untrusted AI to produce, along with any code it writes, a spec and a certificate (a machine-checkable proof) verifying that the code matches the spec.
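To make "code plus spec plus certificate" concrete, here is a toy sketch in Lean 4. The names `double`, `DoubleSpec`, and `double_meets_spec` are hypothetical and purely illustrative; real code and specs would be far larger. In this setup the untrusted model would supply all three pieces, and the trusted side only needs to run the proof checker, which either accepts the proof or rejects it.

```lean
-- Hypothetical code written by the untrusted model.
def double (n : Nat) : Nat := n + n

-- Hypothetical spec: the property the code is required to satisfy.
def DoubleSpec (f : Nat → Nat) : Prop := ∀ n, f n = 2 * n

-- Certificate: a proof that `double` meets `DoubleSpec`.
-- Lean's kernel checks it mechanically; no trust in its author is needed.
theorem double_meets_spec : DoubleSpec double := by
  unfold DoubleSpec double
  intro n
  omega
```

The checker only confirms that the code satisfies the spec as written. Whether the spec actually captures what we want is a separate question that it does not answer.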
How bad a position does this put the untrusted model in when it's trying to write code that does bad stuff? Some sub-questions towards answering that question:
What was the purpose of using octopuses in this metaphor? Like, it seems you've piled on so many disanalogies to actual octopuses (extremely smart, many generations per year, they use Slack...) that you may as well just have said "AIs."
EDIT: Is it gradient descent vs. evolution?
A good ask for frontier AI companies, for avoiding massive concentration of power, might be:
since this seems both important and likely to be popular.
(This is a brainstorm-type post which I'm not highly confident in, putting out there so I can iterate. Thanks for replying and helping me think about it!)
I don't mean that the entire proof fits into working memory, but that the abstractions involved in the proof do. Philosophers might work with a concept like "the good," which has a few immediately apparent properties but others that become available only after further deep thought. Mathematicians work with concepts like "group" or "4" whose properties are immediately apparent, and these are what's involved in proofs. Call these fuzzy / non-fuzzy concepts.
(Philosophers often reflect on their concepts, like "the good," and uncover new important properties, because philosophy is interested in the intuitions people have from their daily experience. But math requires clear up-front definitions; if you reflect on your concept and uncover new important properties not logically entailed by the others, you're supposed to use a new definition.)