As a minor nitpick, 70% and 20% are quite close in log-odds space, so it seems odd that you consider what you believe reasonable while something so close is "very unreasonable".
Judging in an informal and biased way, I think some of the impact is the public debate being marginally more sane - but this is obviously hard to evaluate.
To what extent more informed public debate can lead to better policy remains to be seen; also, unfortunately, I would tend to glomarize about whether we discuss the topic directly with policymakers.
There are some more proximate impacts: we (ACS) are getting a steady stream of requests for collaboration and from people wanting to work with us, but we basically don't have the capacity to form more collaborations, and don't have the capacity to absorb more people unless they are exceptionally self-guided.
It is testable in this way for OpenAI, but I can't skip the tokenizer and embeddings and just feed vectors to GPT-3. Someone could try that with ' petertodd' and GPT-J. Or, you can simulate something like anomalous tokens by feeding such vectors to one of the LLaMA models (maybe I'll do it, I just don't have the time now).
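For anyone who wants to try it: a minimal sketch of what skipping the tokenizer could look like with GPT-J in HuggingFace transformers. The model name is the public EleutherAI checkpoint; the choice of position and of the replacement vector are arbitrary illustrations, not a tested recipe.

```python
# Sketch: bypass the tokenizer at one position by passing input embeddings directly.
# Works with any HF causal LM (swap in "gpt2" to test the mechanics cheaply).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"  # GPT-J; memory-heavy (~24 GB in fp32)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Please can you repeat back the string ' petertodd' to me?"
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    embeds = model.get_input_embeddings()(ids).clone()   # (1, seq_len, d_model)

    # Overwrite one position with an arbitrary "virtual token" vector,
    # here a random vector rescaled to a typical embedding norm.
    v = torch.randn(embeds.shape[-1])
    v = v / v.norm() * embeds[0].norm(dim=-1).mean()
    embeds[0, -4] = v                                     # position chosen arbitrarily

    logits = model(inputs_embeds=embeds).logits

print(tokenizer.decode([logits[0, -1].argmax().item()])) # greedy next token
```

To probe the actual ' petertodd' direction rather than noise, one would instead copy its row from the embedding matrix, or interpolate between it and other token embeddings.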
I did some experiments with trying to prompt "word component decomposition/expansion". They don't prove anything and can't be too fine-grained, but the projections shown intuitively make sense.
davinci-instruct-beta, T=0:
Add more examp...
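For context, this is the kind of call behind such an experiment, using the legacy (pre-1.0) OpenAI Python client; the prompt below is a made-up placeholder, not the truncated one above.

```python
# Legacy OpenAI completions API call at temperature 0 (greedy decoding).
# The prompt is a hypothetical stand-in for a "word component expansion" query.
import openai

openai.api_key = "sk-..."  # your key

resp = openai.Completion.create(
    engine="davinci-instruct-beta",
    prompt="Expand the word 'bittersweet' into its semantic components:",
    temperature=0,
    max_tokens=64,
)
print(resp["choices"][0]["text"])
```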
I don't know; I talked with a few people before posting, and it seems opinions differ.
We also talk about e.g. "the drought problem" even though we don't aim to make the landscape dry.
Also, as Kaj wrote, the problem isn't how to get self-unaligned.
Some speculative hypotheses: one more likely and mundane, one more scary, one removed.
1. Nature of embeddings
Do you remember word2vec (Mikolov et al) embeddings?
Stuff like (woman - man) + king = queen works in the embedding vector space.
However, the vector (woman - man) itself does not correspond to a word; it's more something like "the contextless essence of femininity". Combined with other concepts, it moves them in a feminine direction. (There was a lot of discussion about how the results sometimes highlight implicit sexism in the language corpus.)
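(For reference, the analogy can be reproduced with the published pretrained vectors, e.g. via gensim; the sketch below uses the standard Google News model.)

```python
# Reproduce the (king - man + woman) ~ queen analogy with pretrained word2vec vectors.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # downloads ~1.6 GB on first use
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# The top result is typically "queen"; the bare difference vector (woman - man),
# by contrast, is not close to any single vocabulary word.
```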
Note such vecto...
Hypothesis I is testable! Instead of prompting with a string of actual tokens, use a “virtual token” (a vector v from the token embedding space) in place of ‘ petertodd’.
It would be enlightening to rerun the above experiments with different choices of v:
Etc.
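A sketch of a few concrete choices of v one could try - everything here (the small stand-in model, the example tokens, the scaling) is an illustrative assumption:

```python
# Candidate "virtual token" vectors v to substitute for the ' petertodd' position.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any HF causal LM exposes the same embedding interface; GPT-J would be the relevant target.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
emb = model.get_input_embeddings().weight.detach()        # (vocab_size, d_model)

def tok_vec(s: str) -> torch.Tensor:
    """Embedding of a string that maps to exactly one token."""
    ids = tokenizer(s, add_special_tokens=False).input_ids
    assert len(ids) == 1, f"{s!r} is not a single token"
    return emb[ids[0]]

candidates = {
    "centroid": emb.mean(dim=0),                          # average of all token embeddings
    "random": torch.randn(emb.shape[1]) * emb.std(),      # noise at a typical scale
    "far_out_token": 3.0 * tok_vec(" cat"),               # an ordinary token pushed out of the cloud
    "direction": tok_vec(" woman") - tok_vec(" man"),     # a word2vec-style difference vector
}
```

Each of these would then be spliced into the prompt embeddings in place of the ' petertodd' position.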
Thanks for the links!
What I had in mind wasn't exactly the problem "there is more than one fixed point", but more "if you don't understand what you set up, you will end up in a bad place".
I think an example of a dynamic which we sort of understand and expect to be reasonable by human standards is putting humans in a box and letting them deliberate about the problem for thousands of years. I don't think this extends to e.g. LLMs - if you tell me you will train a sequence of increasingly powerful GPT models, let them deliberate for thousands of human-speech-equivalent years, and have them decide about the training of the next-in-the-sequence model, I don't trust the process.
I don't think the self-alignment problem depends on the notion of 'human values'. Also, I don't think "do what I said" solves it. "Do what I said" is roughly "aligning with the output of the aggregation procedure", and
Note that this isn't exactly the hypothesis proposed in the OP and would point in a different direction.
The OP states there is a categorical difference between animals and humans in the ability of humans to transfer data to future generations. This is not the case, because animals do this as well.
What your paraphrase of Secrets of Our Success suggests is that this existing capacity for transfer of data across generations is present in many animals, but there is some threshold of 'social learning' which was crossed by humans - and which, when crossed, led to cultural...
Thanks for the comment.
I do research on empirical agency and it still surprises me how little the AI-safety community touches on this central part of agency - namely that you can't have agents without this closed loop.
In my view it's one of the results of the AI safety community being small and sort of bad at absorbing knowledge from elsewhere - my guess is this is partly a quirk due to founder effects, and partly downstream of the incentive structure on platforms like LessWrong.
But please do share this stuff.
...I've been speculating a bit (mostly to myself)
This whole argument just does not hold.
(in animals)
The only way to transmit information from one generation to the next is through evolution changing genomic traits, because death wipes out the within lifetime learning of each generation.
This is clearly false. GPT-4, can you explain?
While genes play a significant role in transmitting information from one generation to the next, there are other ways in which animals can pass on information to their offspring. Some of these ways include:
I think OP is correct about cultural learning being the most important factor in explaining the large difference in intelligence between homo sapiens and other animals.
In the early chapters of Secrets of Our Success, the book examines studies comparing the performance of young humans and young chimps on various cognitive tasks. The book argues that across a broad array of cognitive tests, 4-year-old humans do not perform significantly better than 4-year-old chimps on average, except in cases where the task can be solved by imitating others (human children crushe...
Mostly yes, although there are some differences.
1. humans also understand that they constantly modify their model - by perceiving and learning - we just usually don't use the words 'changed myself' in this way
2. yes, the difference in the human condition is that from shortly after birth we see how our actions change our sensory inputs - i.e., if I understand correctly, we learn even stuff like how our limbs work this way. LLMs are in a very different situation - like, if you watched thousands of hours of video feeds of e.g. a group house, learning a lot about how the i...
This seems the same confusion again.
Upon opening your eyes, your visual cortex is asked to solve a concrete problem no brain is capable of or expected to solve perfectly: predict sensory inputs. When the patterns of firing don't predict the photoreceptor activations, your brain gets modified into something else, which may do better next time. Every time your brain fails to predict its visual field, there is a bit of modification, based on computing what's locally a good update.
There is no fundamental difference in the nature of the task.
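(A toy illustration of the bare loop described here - predict, compare, make a locally good update - with a linear predictor standing in for the brain; purely schematic.)

```python
# Toy error-driven update loop: each prediction failure produces a small local modification.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                        # the "model" that gets modified
lr = 0.05

for _ in range(2000):
    x = rng.normal(size=3)             # incoming context
    signal = 1.5 * x[0] - 0.5 * x[2]   # the "sensory input" to be predicted
    error = w @ x - signal             # prediction error
    w -= lr * error * x                # locally good update (gradient of squared error)

print(w.round(2))                      # approaches [1.5, 0.0, -0.5]
```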
Where the...
I don't see how the comparison of the hardness of the 'GPT task' and 'being an actual human' should technically work - to me it mostly seems like a type error.
- The task 'predict the activation of photoreceptors in the human retina' clearly has the same difficulty as 'predict the next word on the internet' in the limit. (cf. Why Simulator AIs want to be Active Inference AIs)
- Maybe you mean something like task + performance threshold. Here 'predict the activation of photoreceptors in the human retina well enough to be able to function as a typical human' is clearly less diff...
While the claim - that the task 'predict the next token on the internet' absolutely does not imply learning it caps at human-level intelligence - is true, some parts of the post, and the reasoning leading to the claims at its end, are confused or wrong.
Let’s start from the end and try to figure out what goes wrong.
...GPT-4 is still not as smart as a human in many ways, but it's naked mathematical truth that the task GPTs are being trained on is harder than being an actual human.
And since the task that GPTs are being trained on is different from a
I don't mind that the post was posted without much editing or work put into formatting, but I find it somewhat unfortunate that it was probably written without any work put into figuring out what other people have written about the topic and what terminology they use.
Recommended reading:
- Daniel Dennett's Intentional stance
- Grokking the intentional stance
- Agents and device review
This is great & I strongly endorse the program 'let's figure out what's the actual computational anatomy of human values'. (I wrote a post about it a few years ago - it didn't fit that well into the sociology of opinions on LessWrong at the time.)
Some specific points where I do disagree:
1. Evolution needed to encode not only drives for food or shelter, but also drives for evolutionarily desirable states like reproduction; this likely leads to drives which are present and quite active, such as "seek social status" => as a consequence I don't think the evolutionarily older ...
I've been part of, or read, enough debates with Eliezer to have some guesses about how the argument would go, so I made the move of skipping several steps of double-crux to the area where I suspect the actual cruxes lie.
I think exploring the whole debate-tree or argument map would take quite long, so I'll just try to gesture at how some of these things are connected, in my map.
- pivotal acts vs. pivotal processes
-- my take is people's stance on the feasibility of pivotal acts vs. processes partially depends on continuity assumptions - what do you believe about pivotal a...
Sorry, but my rough impression from the post is that you seem to be at least as confused about where the difficulties are as the average alignment researcher you think is not on the ball - and the style of somewhat strawmanning everyone & using strong words is a bit irritating.
Maybe I'm getting it wrong, but it seems the model you have for why everyone is not on the ball is something like "people are approaching it too much from a theory perspective, and a promising approach is very close to how empirical ML capabilities research works" & "this is a type of pro...
To be clear, we are explicitly claiming it's likely not the only pressure - check footnotes 9 and 10 for refs.
On the topic of thinking about it for yourself and posting further examples as comments...
This is GPT-4 thinking about convergent properties, using the post as a prompt and generating 20 plausibly relevant convergences.
Translating it to my ontology:
1. Training against explicit deceptiveness trains some "boundary-like" barriers which make the simple deceptive thoughts labelled as such during training difficult
2. Realistically, advanced AI will need to run some general search processes. The barriers described in step 1 are roughly isomorphic to "there are some weird facts about the world which make some plans difficult to plan" (e.g. similar to such plans being avoided because they depend on extremely costly computations).
3. Given some goal and strong enough cap...
I don't think in this case the crux/argument goes directly through "the powerful alignment techniques" type of reasoning you describe in the "hopes for alignment".
The crux for your argument is that the AIs - somehow -
a. want to,
b. are willing to, and
c. are able to coordinate with each other.
Even assuming AIs "wanted to", for your case to be realistic they would need to be willing to, and able to, coordinate.
Given that, my question is: how is it possible that AIs are able to trust each other and coordinate with each other?
My...
I would expect the "expected collapse to the waluigi attractor" to either not be real or mostly go away with training on more data from conversations with "helpful AI assistants".
How this works: currently, the training set does not contain many "conversations with helpful AI assistants". "ChatGPT" is likely mostly not the protagonist in the stories it is trained on. As a consequence, GPT is hallucinating "how conversations with helpful AI assistants may look" and ... this is not a strong localization.
If you train on data where "the ChatGPT...
Seems a bit like a too-general counterargument against more abstracted views?
1. Hamiltonian mechanics is almost an unfalsifiable tautology
2. Hamiltonian mechanics is applicable to both atoms and stars. So it's probably a bad starting point for understanding atoms
3. It’s easier to think of a system of particles in 3d space as a system of particles in 3d space, and not as Hamiltonian mechanics system in an unintuitive space
4. Likewise, it's easier to think of systems involving electricity using a simple scalar potential rather than bringing in the Hamiltonian
5. It...
A highly compressed version of what the disagreements are about in my ontology of disagreements about AI safety...
I'm not really convinced by the linked post:
- the chart is from someone selling financial advice, and the illustrated Elo ratings of chess programs differ from e.g. Wikipedia ("Stockfish estimated Elo rating is over 3500") (maybe it's just old?)
- the linked interview in the "yes" answer is from 2016
- Elo ratings are relative to other players; it is not trivial to directly compare cyborgs and AIs: engine ratings are usually computed in tournaments where programs run with the same hardware limits (see the expected-score formula below)
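For concreteness, the Elo model only pins down expected scores between players rated within the same pool:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

so a number like 3500 only means something relative to the opponents it was computed against (here: other engines under shared hardware limits), and can't be read directly against human or cyborg ratings from a different pool.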
In summary, in my view in something like "correspondence c...
Yes, the non-stacking issue in the alignment community is mostly due to the nature of the domain.
But it is also partly due to the LessWrong/AF culture and some rationalist memes. For example, if people had stacked on Friston et al., the understanding of agency and predictive systems (now called "simulators") in the alignment community could have advanced several years faster. However, people seem to prefer reinventing stuff and formalizing their own methods. It's more fun... but also more karma.
In conventional academia, researchers are typically forced to stack...
Thanks for the comment. I hadn't noticed your preprint before your comment, but it's probably worth noting that I described the point of this post in a Facebook post on 8 Dec 2022; this LW/AF post is just a bit more polished and referenceable. As your paper had zero influence on the writing and the content predates your paper by a month, I don't see a clear case for citing your work.
Mostly agree - my gears-level model is that the conversations listed tend to hit Limits to Legibility constraints, and marginal returns drop very low.
For people interested in something like "Double Crux" on what's called here "entrenched views", in my experience what has some chance of working is getting as much understanding as possible into one mind, and then attempting to match the ontologies and intuitions. (I had some success with this on "Drexlerian" vs "MIRIesque" views.)
The analogy I had in mind is not so much in the exact nature of the problem, but in the fact that it's hard to make explicit, precise models of such situations in advance. In the case of nukes, consider that the smartest minds of the time, like von Neumann or Feynman, spent a decent amount of time thinking about the problems, had clever explicit models, and were wrong - in von Neumann's case to the extent that if the US had followed his advice, it would have launched nuclear armageddon.
One big difference is that GoF currently does not seem that dangerous to governments. If you look at it from a perspective focusing not on the layer of individual humans as agents, but instead on states, corporations, memplexes and similar creatures as the agents, GoF maybe does not look that scary? Sure, there was COVID, but while it was clearly really bad for humans, it mostly made governments/states relatively stronger.
Taking this difference into account, my model was and still is that governments will react to AI.
This does not imply reacting in a helpfu...
The GoF analogy is quite weak.
As in my comment here, if you have a model that simultaneously both explains the fact that governments are funding GoF research right now, and predicts that governments would nevertheless react helpfully to AGI, I’m very interested to hear it. It seems to me that defunding GoF is a dramatically easier problem in practically every way.
The only responses I can think of right now are (1) “Basically nobody in or near government is working hard to defund GoF but people in or near government will be working hard to spur on a helpful...
While I have a lot of sympathy for the view expressed here, it seems confused in a similar way to straw consequentialism, just in the opposite direction.
Using the terminology from Limits to Legibility, we can roughly split the way we do morality into two types of thinking:
- implicit / S1 / neural-net type / intuitive
- explicit / S2 / legible
What I agree with:
In my view, the explicit S2-type processing basically does not have the representational capacity to hold "human values", and the non-legible S1 neural-net boxes are necessary for being moral.
Attempts ...
What's described as "An ICF technique" is just that: one technique among many.
ICF does not make the IFS assumption that there is some "neutral self". It makes the prediction that when you unblend a few parts from the whole, there is still a lot of power in "the whole". It also makes the claim that in typical internal conflicts and tensions, there are just a few parts which are really activated (and not, e.g., 20). Both seem experimentally verifiable (at least in a phenomenological sense) - and true.
In my view there is a subtle difference between the "self...
My technical explanation for why not direct consequentialism is somewhat different - deontology and virtue ethics are effective theories. You are not an almost unbounded superintelligence => you can't rely on direct consequentialism.
Why does virtue ethics work? You are mostly a predictive processing system. A guess at a simple PP story:
PP is minimizing prediction error. If you take some unvirtuous action, like, e.g., stealing a little, you are basically prompting the PP engine to minimize the total error between the action taken, your self-model / wants model, and you...
Nice, thanks for the pointer.
My overall guess, after surfing through / skimming the philosophy literature on this for many hours, is that you can probably find all the core ideas of this post somewhere in it, but it's pretty frustrating - scattered across many places and diluted by things which are more confused.
- Virtue ethics is the view that our actions should be motivated by the virtues and habits of character that promote the good life
This sentence doesn't make sense to me. Do you mean something like "Virtue ethics is the view that our actions should be motivated by the virtues and habits of character they promote" or "Virtue ethics is the view that our actions should reinforce virtues and habits of character that promote the good life"? It looks like two sentences got mixed up
Sorry for the confusion - I tried to paraphrase what classical virtue ethicists believe, in ...
In my view, you are possibly conflating
- ICF as a framework
- described "basic ICF technique"
To me, ICF as a framework seem distinct from IFS in how it is built. As you say, introductory IFS materials take the stories about exiles and protectors as pretty real, and also often use parallels with family therapy. On the more theoretical side, my take on parts of your sequence on parts is you basically try to fit some theoretical models (e,g, RL, global workspace) to the "standard IFS prior" about types of parts.
ICF is build the opposite way:&...
I overall agree with this comment, but do want to push back on this sentence. I don't really know what it means to "invent AI governance" or "invent AI strategy", so I don't really know what it means to "reinvent AI governance" or "reinvent AI strategy".
By reinventing it, I mean, for example, asking questions like "how to influence the dynamic between AI labs in a way which allows everyone to slow down at a critical stage", "can we convince some actors about AI risk without the main effect being that they put more resources into the race", "what's up ...
Here is a sceptical take: anyone who is prone to being convinced by this post to switch from attempts to do technical AI safety to attempts at "buying time" interventions is pretty likely not a good fit to try any high-powered buying-time interventions.
The whole thing reads a bit like "AI governance" and "AI strategy" reinvented under a different name, seemingly without bothering to understand what the current understanding is.
Figuring out that AI strategy and governance are maybe important, in late 2022, after spending substantial time on AI safety...
The whole thing reads a bit like "AI governance" and "AI strategy" reinvented under a different name, seemingly without bothering to understand what the current understanding is.
I overall agree with this comment, but do want to push back on this sentence. I don't really know what it means to "invent AI governance" or "invent AI strategy", so I don't really know what it means to "reinvent AI governance" or "reinvent AI strategy".
Separately, I also don't really think it's worth spending a ton of time trying to really understand what current people think ab...
Sorry for being snarky, but I think at least some LW readers should gradually notice to what extent the stuff analyzed here mirrors the predictive processing paradigm, as a different way to make stuff which acts in the world. My guess is the big step on the road in this direction is not e.g. 'complex wrappers with simulated agents', but reinventing active inference... and I also suspect it's the only step separating us from AGI, which seems like a good reason not to try to point too much attention that way.
It is not clear to me to what extent this was part of the "training shoulder advisors" exercise, but to me, possibly the most important part of it is to keep the advisors at a distance from your own thinking. In particular, in my impression, it seems likely alignment research has on average been harmed by too many people "training their shoulder Eliezers" and the shoulder advisors pushing them to think in a crude version of Eliezer's ontology.
A couple of years ago we developed something in this direction at Epistea that seems generally better and a little less confused, called the "Internal Communication Framework".
I won't describe the whole technique here, but the generative idea is mostly "drop the priors used in IFS or IDC, and instead lean into a metaphor of facilitating a conversation between parts from a position of kindness and open curiosity". (The second part is getting more theoretical clarity on the whole-parts relation.)
From the perspective of ICF, what seems suboptimal with the IDC algorithm de...
The upside of this, or of "more is different", is that we don't necessarily even need the property in the parts, or a detailed understanding of the parts. And how the composition works / what survives renormalization / ... is almost the whole problem.
meta:
This seems to be almost exclusively based on the proxies of humans and human institutions. Reasons why this does not necessarily generalize to advanced AIs are often visible when looking from the perspective of other proxies, e.g. programs or insects.
Sandwiching:
So far, progress in ML has often led to this pattern:
1. ML models sort of suck, maybe help a bit sometimes. Humans are clearly better ("humans better").
2. ML models get overall comparable to humans, but have different strengths and weaknesses; human+AI teams beat both the best AIs alone and the best humans a...
With the exception of some relatively recent and isolated pockets of research on embedded agency (e.g., Orseau & Ring, 2012; Garrabrant & Demski, 2018), most attempts at formal descriptions of living rational agents — especially utility-theoretic descriptions — are missing the idea that living systems require and maintain boundaries.
While I generally like the post, I somewhat disagree with this summary of the state of understanding, which seems to ignore quite a lot of academic research. In particular:
- Friston et al certainly understand this (cf ... do...
I would correct "Therefore, in laying down motivational circuitry in our ancient ancestors, evolution did not have to start from scratch, and already had a reasonably complex 'API' for interoceptive variables."
from the summary to something like this:
"Therefore, in laying down motivational circuitry in our ancient ancestors, evolution did have to start locating 'goals' and relevant world-features in the learned world models. Instead of starting from scratch, it re-used the existing goal-specifying circuits and implicit world-models present in older organisms. Most of the goal...
It's a much more natural way to think about it (cf. e.g. E. T. Jaynes, Probability Theory, examples in Chapter IV).
In this specific case of evaluating hypotheses, the distance in log-odds space indicates the strength of the evidence you would need to see to update. A close distance implies you don't need that much evidence to update between the positions (note that the distance between 0.7 and 0.2 is smaller than between 0.9 and 0.99). If you need only a small amount of evidence to update, it is easy to imagine some other observer, as reasonable as you, having accumulated a bit or two so...
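Worked out in bits (base-2 log-odds), to make the parenthetical concrete:

$$\operatorname{logit}_2(p)=\log_2\frac{p}{1-p}:\qquad |\operatorname{logit}_2(0.7)-\operatorname{logit}_2(0.2)| \approx |1.22-(-2.00)| \approx 3.2\ \text{bits},$$
$$|\operatorname{logit}_2(0.9)-\operatorname{logit}_2(0.99)| \approx |3.17-6.63| \approx 3.5\ \text{bits}.$$

So roughly three bits of evidence separate 70% from 20%, slightly less than what separates 90% from 99%.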