I found it a bit challenging to unpack what you're saying here. I think, after a reread, that you're using ‘slow’ and ‘fast’ the way I would use ‘soon’ and ‘far away’ (i.e., referring to how far from the present it will occur). Is this read about correct?
If ‘Opt into Petrov Day’ sat beside something other than a big red ominous button, I would think the obvious answer is that it's a free choice, and I'd be positively inclined towards it. Petrov Day is a good thing with good side effects, quite unlike launching nuclear weapons.
It is confusing to me that it is beside a big red ominous button. On the one hand, Petrov's story is about the value of caution. To quote a top comment from an older Petrov Day,
Petrov thought the message looked legit, but noticed there were clues that it wasn't.
On the other hand, risk-taking is good, opting in to good things is good, and if one is taking Petrov Day to mean ‘don't take risks if they look scary’ I think one is taking an almost diametrically wrong message from the story.
All that said, for now I am going to fall prey to my own criticism and not press the big red ominous button around Petrov Day.
I'm interested in having Musk-company articles on LessWrong if it can be done while preserving LessWrong norms. I'm a lot less interested in it if it means bringing in sarcasm, name calling, and ungrounded motive-speculation.
if, judging by some economic numbers, poverty hasn't existed for centuries, why do we feel so poor?
Let's not forget that people who read LW are often highly intelligent and have well-paying jobs, such as in software development.
This underlines what I find so incongruous about EY's argument. I think I genuinely felt richer as a child in the UK, eating free school meals but going to a nice school, with parents who owned a house, than I do as an obscenely-by-my-standards wealthy person in San Francisco. I'm hearing this elaborate theory to explain why social security doesn't work when I have lived through, and seen in others, clear evidence that it can and does. If the question “why hasn't a factor-100 increase in productivity felt like a factor-100 increase in productivity?” were leveled at my childhood specifically, my response is that actually it felt like exactly that.
By the standards of low-earning households my childhood was probably pretty atypical, and I don't mean to say there aren't major systemic issues, especially given the number of people locked into bad employment, people with lives destroyed by addiction, people who struggle to navigate economic systems, people trapped in abusive or ineffectual families, etc. etc. etc. I really don't want to present a case based just on my lived experience, even including those I know living various lives under government assistance. But equally, I think someone's lived experience of being wealthy in San Francisco and seeing drug addicts on the street is also not an unbiased view of what social security does for poverty.
Eg. a moderately smart person asking it to do something else by trying a few prompts. We're getting better at this for very simple properties but I still consider it unsolved there.
Reply to https://twitter.com/krishnanrohit/status/1794804152444580213, too long for twitter without a subscription so I threw it here, but do please treat it like a twitter comment.
rohit: Which part of [the traditional AI risk view] doesn't seem accounted for here? I admit AI safety is a 'big tent' but there's a reason they're congregated together.
You wrote in your list,
the LLMs might start even setting bad objectives, by errors of omission or commission. this is a consequence of their innards not being the same as people (either hallucinations or just not having world model or misunderstanding the world)
In the context of traditional AI risk views, this misses the argument. Roughly the concern is instead like so:
ASI is by definition very capable of doing things (aka. selecting for outcomes), in at least all the ways collections of humans can. It is both theoretically true and observably the case in reality that when things are selected for, a bunch of other things that aren't them get traded off, and that the stronger something is selected for, the more stuff ends up traded away, incidentally or not.
We should expect any ASI to have world-changing effects, and for those effects to trade off strongly against other things. There is a bunch of stuff we want that we don't want traded off (eg. being alive).
The first problem is that we don't know how to put any preferences into an AI such that it's robust to even trivial selection pressure, not in theory, not in practice on existing models, and certainly not in ways that would apply to arbitrary systems that indirectly contain ML models but aren't constrained by those models’ expressed preferences.
The second problem is that there are a bunch of instrumental goals (not eg. lying, but eg. continuing to have causal effect on the world) that are useful to almost all goals, and that are concrete examples of why an ASI would want to disempower humans. Aka. almost every thing that could plausibly be called an ASI will be effective at doing a thing, and the natural strategies for doing things involve not failing at them in easily-foreseeable ways.
Stuff like lying is not the key issue here. It often comes up because people say ‘why don’t we just ask the AI if it’s going to be bad’ and the answer is basically code for ‘you don’t seem to understand that we are talking about something that is trying to do a thing and is also good at it.’
Similarly for ‘we wouldn't even know why it chooses outcomes, or how it accomplishes them’ — these are problematic because they are yet another reason to rule out simple fixes, not because they are fundamental to the issue. Like, if you understand why a bridge falls down, you can make a targeted fix and solve that problem, and if you don’t know then probably it’s a lot harder. But you can know every line of code of Stockfish (pre-NNUE) and still not have a chance against it, because Stockfish is actively selecting for outcomes and it is better at selecting them than you.
“LLMs have already lied to us” from the traditional AI risk crowd is similarly not about LLM lying being intrinsically scary, it is a yell of “even here you have no idea what you are doing, even here you have these creations you cannot control, so how in the world do you expect any of this to work when the child is smarter than you and it’s actually trying to achieve something?”
It took me a good while reading this to figure out whether it was a deconstruction of tabooing words. I would have been less unsure if the post didn't keep replacing terms with ones that are both no less charged and no more descriptive of the underlying system, and then start drawing conclusions from the resulting terms' aesthetics.
With regards to Yudkowsky's takes, the key thing to keep in mind is that Yudkowsky started down his path by reasoning backwards from properties ASI would have, not from reasoning forward from a particular implementation strategy. The key reason to be concerned that outer optimization doesn't define inner optimization isn't a specific hypothesis about whether some specific strategy with neural networks will have inner optimizers, it's because ASI will by necessity involve active optimization on things, and we want our alignment techniques to have at least any reason to work in that regime at all.
There is no ‘the final token’ for weights not at the final layer.
Because that is where all the gradients flow from, and why the dog wags the tail.
Aggregations of things need not be of the same kind as their constituent things? This is a lot like calling an LLM an activation optimizer. While strictly in some sense true of the pieces that make up the training regime, it's also kind of a wild way to talk about things in the context of ascribing motivation to the resulting network.
I think maybe you're intending ‘next token prediction’ to mean something more like ‘represents the data distribution, as opposed to some metric on the output’, but if you are this seems like a rather unclear way of stating it.
You're at token i in a non-final layer. Which token's output are you optimizing for? i+1?
By construction, a decoder-only transformer is agnostic about which future token it should be informative for, within the context limit, except in the sense that it doesn't need to represent detail that will be more cheaply available from future tokens.
As a transformer is also unrolled in the context dimension, the architecture itself is effectively required to be generic both in what information it gathers and in where that information is used. Bias towards next-token prediction is not so much a consequence of reward in isolation as of competitive advantage: at position i, the network has an advantage in predicting token i+1 over the network at previous positions, by having more recent tokens, and an advantage over the network at future positions, by virtue of still needing to predict token i+1. However, if a token is more predictive of some abstract future token than of the next token specifically (say it's a name that might be referenced later), one would expect the dominant learnt effect to be non-myopically optimizing for later use in some timestamp-invariant way.
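A toy numpy sketch of the training setup being described (shapes and names like `per_position_loss` are my own, and random logits stand in for a real model): the loss has one cross-entropy term per position, and the causal mask makes whatever a position computes available to every later position's prediction, which is why there is no single ‘final token’ that non-final-layer weights serve.

```python
import numpy as np

# Toy decoder-style setup: T positions, each predicting the next token.
T, V = 5, 10                          # sequence length, vocab size (arbitrary)
rng = np.random.default_rng(0)

logits = rng.normal(size=(T, V))      # stand-in for per-position model outputs
targets = rng.integers(0, V, size=T)  # token at i+1, shifted into position i

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

# One cross-entropy term per position: gradient reaches the representation at
# position i both from its own next-token prediction and, via attention, from
# every later position that reads it.
per_position_loss = -log_softmax(logits)[np.arange(T), targets]
loss = per_position_loss.mean()

# Causal mask: position i attends to positions <= i, so information written at
# position i is available to every future position's prediction.
causal_mask = np.tril(np.ones((T, T), dtype=bool))

print(per_position_loss.shape)  # (5,): a loss term at every position
print(causal_mask[2])           # row 2 can see positions 0..2 only
```

The point of the sketch is just the shape of the objective: nothing in it singles out the last position, which is the sense in which the gradients ‘flow from everywhere’.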
Fundamentally, the story was about the failure cases of trying to make capable systems that don't share your values safe by preventing specific means by which their problem-solving capabilities express themselves in scary ways. This is different to what you are getting at here, which is having those systems actually, operationally share your values. A well-aligned system, in the traditional ‘Friendly AI’ sense of alignment, simply won't make the choices that the one in the story did.