I think it’s often valuable to provide a short post describing phenomena clearly so that you can then reference them in future posts without going on a massive detour.
Unfortunately, getting onto more interesting matters sometimes requires a bunch of setup first. I could skip the setup, but then everyone would end up confused.
I’m probably missing something obvious, but how does the tickle defence handle X-Or Blackmail?
The Marxist arguments for the collapse of capitalism always sounded handwavey to me, but perhaps you could link me to something that would have sounded persuasive in the past?
Seems valuable as a lot of people want social affirmation before considering the hypothesis.
Inferring backwards would significantly reduce my concern since you’re starting from a point we have information about.
I suppose that maybe we could calculate the Kolmogorov score of worlds close to us by backchaining, although that doesn’t really seem to be compatible with the calculation at each step being a formal mathematical expression.
Yeah, this is the part of the proposal that’s hardest for me to buy. Chaos theory means that small variations in initial conditions lead to massive differences pretty rapidly; and we can’t even measure an approximation of initial conditions. The whole “let’s calculate the universe from the start” approach seems to leave way too much scope to end up with something completely unexpected.
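To make the sensitivity point concrete, here is a minimal sketch using the logistic map as a stand-in for a chaotic system. The map, the parameter r = 4, the starting value and the size of the perturbation are all my own illustrative assumptions, not anything taken from the proposal being discussed.

```python
def logistic_orbit(x0, r=4.0, steps=80):
    """Iterate x -> r * x * (1 - x) and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two starting points that differ by one part in ten billion (assumed values).
baseline = logistic_orbit(0.2)
perturbed = logistic_orbit(0.2 + 1e-10)

# Distance between the two trajectories at every step.
gaps = [abs(a - b) for a, b in zip(baseline, perturbed)]
print(f"initial gap: {gaps[0]:.1e}, largest gap: {max(gaps):.3f}")
```

The two runs start out practically indistinguishable and end up unrelated within a few dozen steps, which is the worry with simulating from initial conditions we can’t measure precisely.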
I would love to see more experimentation here to determine whether GPT-4 can produce more complicated quines that are less likely to have simply been copied. For example, we could insist that the quine includes a certain string or avoids certain functions.
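As a rough sketch of what such an experiment could check, the snippet below builds a small Python quine that is forced to contain a marker string, then verifies both conditions. The marker value and the helper names (`REQUIRED`, `run_source`, `is_constrained_quine`) are hypothetical, and a real experiment would feed in model-generated source rather than a hand-built one.

```python
import contextlib
import io

# Hypothetical string the quine is required to contain.
REQUIRED = "required-marker-123"

# Classic %r-style quine template with the marker embedded in it.
template = 'm = "' + REQUIRED + '"\ns = %r\nprint(s %% s)'
quine_source = template % template


def run_source(src):
    """Execute src in a fresh namespace and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(src, {})
    return buffer.getvalue()


def is_constrained_quine(src, required):
    """True if src contains the required string and prints exactly itself."""
    printed = run_source(src).rstrip("\n")
    return required in src and printed == src


print(is_constrained_quine(quine_source, REQUIRED))
```

The same checker could be extended to reject sources that use banned functions, which would cover the second kind of constraint.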
Thanks for the detailed response.
To be honest, I’ve been persuaded that we disagree enough in our fundamental philosophical approaches, that I’m not planning to deeply dive into infrabayesianism, so I can’t respond to many of your technical points (though I am planning to read the remaining parts of Thomas Larson’s summary and see if any of your talks have been recorded).
“However, CDT and EDT are both too toyish for this purpose, since they ignore learning and instead assume the agent already knows how the world works, and moreover this knowledge is repres...
I’m curious why you say it handles Newcomb’s problem well. The Nirvana trick seems like an artificial intervention where we manually assign certain situations a utility of infinity to enforce a consistency condition, which then ensures they are ignored when calculating the maximin. If we are manually intervening, why not just manually cross out the cases we wish to ignore, instead of adding them with infinite value and then immediately ignoring them?
Just because we modelled this using infrabayesianism, it doesn’t follow that it contributed anything to the soluti...
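Here is a toy sketch of the question about the Nirvana trick. It is not the actual infra-Bayesian formalism: the payoffs are the usual Newcomb numbers, which I am assuming, and the function names are made up. It only compares "assign infinity to the inconsistent cases" with "cross them out" under a plain maximin.

```python
import math

CHOICES = ["one-box", "two-box"]


def payoff(policy, prediction):
    """Standard Newcomb payoffs (assumed): box B is filled iff one-boxing was predicted."""
    box_b = 1_000_000 if prediction == "one-box" else 0
    box_a = 1_000 if policy == "two-box" else 0
    return box_a + box_b


def worst_case_with_nirvana(policy):
    """Mispredictions get infinite utility, so the min never selects them."""
    return min(
        payoff(policy, prediction) if prediction == policy else math.inf
        for prediction in CHOICES
    )


def worst_case_crossed_out(policy):
    """Mispredictions are simply removed before taking the min."""
    return min(
        payoff(policy, prediction)
        for prediction in CHOICES
        if prediction == policy
    )


for policy in CHOICES:
    print(policy, worst_case_with_nirvana(policy), worst_case_crossed_out(policy))
```

In this toy setup the two approaches give identical worst-case values and both favour one-boxing, so any difference between them would have to come from parts of the formalism that the sketch leaves out.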
I’d encourage you to do that.
I wish Eliezer had been clearer on why we can’t produce an AI that internalises human morality with gradient descent. I agree gradient descent is not the same as a combination of evolutionary learning + within lifetime learning, but it wasn’t clear to me why this meant that no combination of training schedule and/or bias could produce something similar.
Yeah agreed, this doesn't make sense to me.
There are probably just a few MB (wouldn't be surprised if it could be compressed into much less) of information which sets up the brain wiring. Somewhere within that information are the structures/biases that, when exposed to the training data of being a human in our world, give us our altruism (and much else). It's a hard problem to understand these altruism-forming structures (which are not likely to be distinct things), replicate them in silico and make them robust even to large power differentials.
On the oth...
Why do you believe that “But if you push the complexity up too fast, the RL process will fail, or the AI will be more likely to learn heuristics that are better than nothing but aren't what we intended”?
I understand why this could cause the AI to fail, but why might it learn incorrect heuristics?
I’m discussing an agent that does in fact take 5 and imagines taking 10 instead. There have been some discussions of decision theory using proof-based agents and how they can run into spurious counterfactuals. If you’re confused, you can try searching the archive of this website. I tried earlier today, but couldn’t find particularly good resources to recommend. I couldn’t find a good resource for playing chicken with the universe either.
(I may write a proper article at some point in the future to explain these concepts if I can’t find an article that explains them well)
I’ve been reading a lot of shard theory and finding it fascinating, even though I’m not convinced that a more agentic subcomponent wouldn’t win out and make the shards obsolete. Especially if you’re claiming the more agentic shards would seize power, then surely an actual agent would as well?
I would love to hear why Team Shard believes that the shards will survive in any significant form. Off the top of my head, I’d guess their explanation might relate to how shards haven’t disappeared in humans?
On the other hand, I much more strongly endorse the notion of thinking of a reward as a chisel rather than something we’ll likely find a way of optimising.
Are there any agendas you would particularly like to see distilled?
What does it mean for a constraint to be low-dimensional?
Filtering out all the implicit information would be really, really hard.
I think the key problem is that this sandboxing won’t work for anything with a large language model as a component.
Very happy to see someone writing about this as I’ve been thinking that there should be more research into this for a while. I guess my main doubt with this strategy is that if a system is running for long enough in a wide enough variety of circumstances maybe certain rare outcomes are virtually guaranteed?
I’d encourage you to delve more into this paragraph as I think this is the part of your article where it becomes the most hand-wavey:
“In order to "really solve" outer alignment, you want the AI-optimization process to care about the generalization properties of the created AI beyond the training data. In order to "really solve" inner alignment, the created AI shouldn't just care about the raw outputs of the process that created it, it should care about the things communicated by the AI-optimization process in its real-world context.”
I don’t know, but sounds like an obvious use case for a sub forum? The solutions listed above seem hackish.
I have to be honest, I’m skeptical. If we study how human prosociality works, my expectation is that we learn enough to produce some toy models with some very simplistic prosociality, but this seems insufficient for generating an AI capable of navigating tough moral dilemmas, or even just situations that are sufficiently off-distribution. The reason why we want humans in the loop is not because they are vaguely prosocial but because of the ability of humans to handle novel situations.
Actually, I shouldn’t completely rule out the value of this research. I ...
I’m sure if he spent five minutes brainstorming he could come up with more things, or maybe I’m just wrongly calibrated on how much agency people have?
I’ve had similar thoughts too. I guess the way I’d implement it is by giving the AI a command that it can activate that directly overwrites the reward buffer but then turns the AI off. The idea here is to make it as easy as possible for an AI inclined to wire-head to actually wire-head, so it is less incentivised to act in the physical world.
During training, I would ensure that the SGD used the true reward rather than the wire-headed reward. Maybe that would be sufficient to stop wire-heading, but there are issues with it pursuing the highest probability plan rather than just a high probability plan. Maybe quantilising probability can help here.
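For anyone unfamiliar with quantilising, here is a minimal sketch of the idea as I understand it: rather than always taking the single best-scoring plan, pick at random from the top fraction. The function name, the uniform base distribution over a finite list of plans, and the toy scores are all assumptions made for illustration.

```python
import math
import random


def quantilise(plans, score, q=0.1, rng=random):
    """Pick uniformly at random from the top q fraction of plans by score.

    Assumes a uniform base distribution over a finite list of plans.
    """
    ranked = sorted(plans, key=score, reverse=True)
    top_k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:top_k])


# Toy example: 100 plans whose score is just their index.
plans = list(range(100))
rng = random.Random(0)
picks = [quantilise(plans, score=lambda plan: plan, q=0.1, rng=rng) for _ in range(50)]
print(sorted(set(picks)))
```

Every pick comes from the best ten plans, but the agent is not pushed towards the single most extreme one.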
There’s a difference between debating the merits of different political positions and merely announcing an apparent trend. I’m doing the latter and I don’t think the risks associated with this are too severe. So it’s not exactly open season.
There's another possibility, which is that they have some low-level insights that have been dressed up to appear as far more.
When did you start to doubt?
I was never a strong believer. There was never a moment where my "faith shattered", because I never had "faith" in the first place. It's just, given the filtered information, how the regime described the situation, that seemed to me like a plausible description of reality. I haven't heard any alternative description, and I didn't have a reason to invent one.
Also, I was a small kid, so my ability to think about politics was quite limited. For example, I heard the broadcast of Voice of America / Radio Free Europe (I am not sure which one, maybe both) a few t...
This is an excellent question. Here are some of the things I consider personally important.
Regarding probability, I recently asked the question: Why is Bayesianism Important? I found this Slatestarcodex post to provide an excellent overview of thinking probabilistically, which seems way more important than almost any of the specific theorems.
I would include basic game theory - prisoner's dilemma, tragedy of the commons, multi-polar traps (see Meditations on Moloch for this latter idea).
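For a quick illustration of the prisoner's dilemma mentioned above, here is a small sketch. The payoff numbers are the common textbook values, which I am assuming; they are not from any particular source.

```python
# PAYOFF[(my_move, their_move)] is my payoff; "C" = cooperate, "D" = defect.
# The numbers are illustrative assumptions.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}


def best_response(their_move):
    """Return the move that maximises my payoff against a fixed opponent move."""
    return max("CD", key=lambda mine: PAYOFF[(mine, their_move)])


# Defecting is the best response whatever the other player does...
defect_dominates = all(best_response(theirs) == "D" for theirs in "CD")
# ...yet both players do better under mutual cooperation than mutual defection.
mutual_cooperation_better = PAYOFF[("C", "C")] > PAYOFF[("D", "D")]
print(defect_dominates, mutual_cooperation_better)
```

That gap between individually rational play and the jointly better outcome is also the core of the tragedy of the commons and multi-polar traps.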
In terms of decision theory, there's the basic concept of expected utility...
I appreciate how Ben handled this: it was nice for him to let me comment before he posted and for him to also add some words of appreciation at the end.
Regarding point 2, since I was viewing this in game mode I had no real reason to worry about being tricked. Avoiding being tricked by not posting about it would have been like avoiding losing in chess by never making the next move.
I guess other than that, I'd suggest that even a counterfactual donation of $100 to charity not occurring would feel more significant than the frontpage going down for a day. Like...
I’d suggest that even a counterfactual donation of $100 to charity not occurring would feel more significant than the frontpage going down for a day.
This suggests an interesting idea: A charity drive for the week leading up to Petrov Day, on condition that the funds will be publicly wasted if anyone pushes the button (e.g. by sending bitcoin to a dead-end address, or donating to two opposing politicians' campaigns).
Why would there be? I'm sure they saw it as just a game too and it would be extremely hypocritical for me to be annoyed at anyone for that.
Thanks, I'm glad to hear that. :) Also, very thankful that the LW community took this really well.
Beyond that, as for my motivations, aside from curiosity as to whether it would work, etc. I considered that it would be an interesting learning opportunity for the community as well. With actual nukes, random untrusted people also have a part to play. Selecting a small group of people tasked with trying to bring down the site might even be a good addition to future instances of Petrov Day.
For what it's worth, I took care to ensure that the damage from taking ...
Hey, I've become interested in this field too recently. I've been listening to the Jim Rutt show which is pretty interesting, but I haven't dived into it in any real depth. I agree that it is something that we should be looking more into.
I won't pretend to be an expert on this topic, but my understanding of the differences is as follows:
I hadn't decided whether or not to nuke it, but if I did nuke it, I would have done it several hours later, after people had a chance to wake up.
For the evidential game, it doesn't just matter whether you co-operate or not, but why. Different whys will be more or less likely to be adopted by the other agents.
It's something people say, but don't necessarily fully believe
I appreciated this post for explaining Berkeley's beliefs really clearly to me. I never knew what he was going on about before.
Would be happy to try this
In a booming market, buying can be valuable as a hedge against rising house prices
Yeah, I meant part 7. What did he say about feminism and neoreaction?
I'd like to know more about the dark sides part of the book
I'd still like the ability to make the explicit abstract just read off the text after a certain point, but I suppose it would require a lot of work to support that functionality.
I agree fairly strongly, but this seems far from the final word on the subject, to me.
Hmm, actually I think you're right and that it may be more complex than this.
Ah. I take you to be saying that the quality of the clever arguer's argument can be high variance, since there is a good deal of chance in the quality of evidence cherry-picking is able to find. A good point.
Exactly. There may only be a weak correlation between evidence and truth. And maybe you can do something with it or maybe it's better to focus on stronger signals instead.
I view the issue of intellectual modesty much like the issue of anthropics. The only people who matter are those whose decisions are subjunctively linked to yours (it only starts getting complicated when you start asking whether you should be intellectually modest about your reasoning about intellectual modesty)
One issue with the clever arguer is that the persuasiveness of their arguments might have very little to do with how persuasive they should be, so attempting to work off expectations might fail.
Where would you start with his work?
That’s a good point that I hadn’t thought of.