## LessWrong

Tetraspace

Drew the shoggoth and named notkilleveryoneism.


I learned this lesson looking at the conditional probabilities of candidates winning given they were nominated in 2016, where the candidates with less than about 10% chance of being the nominee had conditional probabilities with noise between 0 and 100%. And this was on the thickly traded real-money markets of Betfair! I personally engage in, and also recommend, just kinda throwing out any conditional probabilities that look like this, unless you have some reason to believe it's not just noise.
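A minimal sketch of why this happens, assuming the conditional probability is backed out as P(win) / P(nominated) from two separately quoted prices, and assuming each price is only good to within about a percentage point. The function name, the spread, and the example prices are all hypothetical illustrations, not Betfair data.

```python
def conditional_range(p_win, p_nom, spread=0.01):
    """Range of implied P(win | nominated) when each market price is only
    known to within +/- spread (a hypothetical bid-ask width)."""
    # Lowest implied value: smallest numerator over largest denominator.
    low = max(p_win - spread, 0.0) / (p_nom + spread)
    # Highest implied value: largest numerator over smallest denominator,
    # capped at 1 because it is a probability.
    denom = max(p_nom - spread, 1e-9)
    high = min((p_win + spread) / denom, 1.0)
    return low, high

# A front-runner: 40% to be nominated, 20% to win overall.
print(conditional_range(0.20, 0.40))
# A long shot: 2% to be nominated, 1% to win overall.
print(conditional_range(0.01, 0.02))
```

With these made-up numbers the front-runner's conditional stays within a few points of 50%, while the long shot's implied range covers everything from 0 to 100%: the same one-point wobble that is negligible in a 40% price is the whole signal in a 2% price.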

Another place this causes problems is in the infinitely-useful-if-they-could-possibly-work decision markets, where you want to be able to evaluate counterfactual decisions. But because these are counterfactuals, you don't actually make the decision, so there's no liquidity and the price can take any value.

Obeying it would only be natural if the AI thinks that the humans are more correct than the AI would ever be, after gathering all available evidence, where "correct" is given by the standards of the definition of the goal that the AI actually has, which arguendo is not what the humans are eventually going to pursue (otherwise you have reduced the shutdown problem to solving outer alignment, and the shutdown problem is only being considered under the theory that we won't solve outer alignment).

An agent holding the belief that, even given all available information, it will still want to do something other than the action it then thinks is best is anti-natural; utility maximisers would want to take that action.

This is discussed on Arbital as the problem of fully updated deference.

This ends up being pretty important in practice for decision markets ("if I choose to do X, will Y?"), where by default you might e.g. only make a decision if it's a good idea (as evaluated by the market), and therefore all traders will condition on the market having a high probability, which is obviously quite distortionary.
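A toy simulation of that distortion, under assumptions that are mine rather than anything specified above: the outcome Y depends on a hidden state of the world, the market sees a noisy signal of that state, and X is only chosen when the signal clears a threshold. All names and parameters are hypothetical.

```python
import random

def simulate(n=100_000, threshold=0.5, seed=0):
    rng = random.Random(seed)
    all_outcomes = []      # Y in every world, as if X were always done
    chosen_outcomes = []   # Y only in worlds where X actually gets chosen
    for _ in range(n):
        quality = rng.gauss(0.0, 1.0)            # hidden state of the world
        signal = quality + rng.gauss(0.0, 1.0)   # what the market can see
        outcome = quality + rng.gauss(0.0, 1.0)  # Y if X is done
        all_outcomes.append(outcome)
        if signal > threshold:                   # decide only if it looks good
            chosen_outcomes.append(outcome)
    causal = sum(all_outcomes) / len(all_outcomes)
    conditional = sum(chosen_outcomes) / len(chosen_outcomes)
    return causal, conditional

causal, conditional = simulate()
print(f"E[Y | do X]     ~ {causal:.2f}")
print(f"E[Y | X chosen] ~ {conditional:.2f}")
```

Traders in "Y given X" are only paid out in the worlds where X was chosen, so they are rationally pricing the second number, which in this toy setup sits well above the first even though choosing X has no special causal effect.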

I replied on discord that I feel there's maybe something more formalisable that's like:

• reality runs on math because, and is the same thing as, there's a generalised-state-transition function
• because reality has a notion of what happens next, realityfluid has to give you a notion of what happens next, i.e. it normalises
• the idea of a realityfluid that doesn't normalise only comes to mind at all because you learned about R^n first in elementary school instead of S^n

which I do not claim confidently because I haven't actually generated that formalisation, and am posting here because maybe there will be another Lesswronger's eyes on it that's like "ah, but...".

Not unexpected! I think we should want AGI to, at least until it has some nice coherent CEV target, explain at each self-improvement step exactly what it's doing, to ask for permission for each part of it, to avoid doing anything in the process that's weird, to stop when asked, and to preserve these properties.

Even more recently I bought a new laptop. This time, I made the same sheet, multiplied the score from the hard drive by  because 512 GB is enough for anyone and that seemed intuitively the amount I prioritised extra hard drive space compared to RAM and processor speed, and then looked at the best laptop before sharply diminishing returns set in; this happened to be the HP ENVY 15-ep1503na 15.6" Laptop - Intel® Core™ i7, 512 GB SSD, Silver. This is because I have more money now, so I was aiming to maximise consumer surplus rather than minimise the amount I was spending.[1]

Surprisingly, it came with a touch screen! That's just the kind of nice thing that laptops do nowadays, because as I concluded in my post, everything nice about laptops correlates with everything else so high/low end is an axis it makes sense to sort things on. Less surprisingly, it came with a graphics card, because ditto.

Unfortunately this high-end laptop is somewhat loud; probably my next one will be less loud, up to including an explicit penalty for noise.

1. ^

It would have been predictable, however, at the time that I bought that new laptop, that I would have had that much money at a later date. Which means that I should have just skipped straight to consumer surplus maxxing.

It would be evidence at all. Simple explanation: if we did observe a glitch, that would pretty clearly be evidence we were in a simulation. So by conservation of expected evidence, non-glitches are evidence against.
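A small numeric check of that argument, with made-up numbers: the prior and both likelihoods below are hypothetical, chosen only so that a glitch is more likely inside a simulation than outside one.

```python
def posteriors(prior, p_glitch_if_sim, p_glitch_if_not):
    """Posterior P(simulation) after seeing a glitch, after seeing no glitch,
    and the probability-weighted average of the two."""
    # Total probability of observing a glitch.
    p_glitch = prior * p_glitch_if_sim + (1 - prior) * p_glitch_if_not
    # Bayes' rule for each possible observation.
    post_glitch = prior * p_glitch_if_sim / p_glitch
    post_no_glitch = prior * (1 - p_glitch_if_sim) / (1 - p_glitch)
    # Conservation of expected evidence: this average equals the prior.
    expected = p_glitch * post_glitch + (1 - p_glitch) * post_no_glitch
    return post_glitch, post_no_glitch, expected

up, down, expected = posteriors(prior=0.3, p_glitch_if_sim=0.01,
                                p_glitch_if_not=0.0001)
print(up, down, expected)
```

Because the two posteriors have to average back to the prior, a large upward update on seeing a glitch forces a downward update on not seeing one. The downward update is tiny here because not seeing a glitch is what you almost always expect anyway.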

I don't think it's quite that; a more central example I think would be something like a post about extrapolating demographic trends to 2070 under the UN's assumptions, where then justifying whether or not 2070 is a real year is kind of a different field.

$\operatorname{argmax}$, as a mathematical structure, is smarter than god and perfectly aligned to $f$; the value of $\operatorname{argmax}_x f(x)$ will never actually be $x'$ because $x'$ is more objectively rational, or because you made a typo and it knows you meant to say $g$; and no matter how complicated the mapping is from $x$ to $f(x)$ it will never fall short of giving the $x$ that gives the highest value of $f(x)$.

Which is why in principle you can align a superior being, like $\operatorname{argmax}$, or maybe like a superintelligence.

"The AI does our alignment homework" doesn't seem so bad - I don't have much hope for it, but because it's a prosaic alignment scheme so someone trying to implement it can't constrain where Murphy shows up, rather than because it's an "incoherent path description".

A concrete way this might be implemented is

• A language model is trained on a giant text corpus to learn a bunch of adaptations that make it good at math, and then fine-tuned for honesty. It's still being trained at a safe and low level of intelligence where honesty can be checked, so this gets a policy that produces things that are mostly honest on easy questions and sometimes wrong and sometimes gibberish and never superhumanly deceptive.[1]
• It's set to work producing conceptually crisp pieces of alignment math, things like expected utility theory or logical inductors, slowly on inspectable scratchpads and so on, with the dumbest model that can actually factor scientific research[1], with human research assistants to hold their hand if that lets you make the model dumber. It does this, rather than engineering, because this kind of crisp alignment math is fairly uniquely pinned down so it can be verified, and it's easier to generate compared to any strong pivotal engineering task where you're competing against humans on their own ground so you need to be smarter than humans, so while it's operating in a more dangerous domain it's using a safer level of intelligence.[1]
• The human programmers then use this alignment math to make a corrigible thingy that has dangerous levels of intelligence, that does difficult engineering and doesn't know about humans, while this time knowing what they're doing. Getting the crisp alignment math from parallelisable language models helps a lot and gives them a large lead time, because a lot of it is the alignment version of backprop, which would have taken a surprising amount of time to discover otherwise.

This all happens at safe-ish low-ish levels of intelligence (such a model would probably be able to autonomously self-replicate on the internet, but probably not reverse protein folding, which means that all the ways it could be dangerous are "well don't do that"s as long as you keep the code secret[1]), with the actual dangerous levels of optimisation being done by something made by the humans using pieces of alignment math which are constrained down to a tiny number of possibilities.

EDIT 2023-07-25: A longer debate between Holden Karnofsky (pro) and Nate Soares (against) about the model that leads to it being an incoherent path description, which I think is worth reading, is here; I hadn't read it as of writing this.

1. ^

Unless it isn't; it's a giant pile of tensors, how would you know? But this isn't special to this use case.