Comments

Things that I seem to notice about the plan:

  1. Adjusting weights is a plan for basic AIs, which can't seek to, e.g., be internally consistent, eventually landing wherever the attractors take them.
  2. Say you manage to give your AI enough quirks for it to go cry in a corner. Now you need to lower your AI nerfing to get more intelligence, leading to brinkmanship dynamics.
  3. In the middle, you have a bunch of AIs, trained to maximize various aspects of incorrigibility, and you are hoping that they are incapable of cooperating, or that any single AI will not act destructively (while trained for incorrigibility).

Maybe in-vivo genetic editing of the brain is possible. Adenoviruses, which are a normal delivery mechanism for gene therapy, can pass the blood-brain barrier, so it seems plausible to an amateur.

(It is not obvious that this works in adult organisms; maybe the genes activate while the fetus grows or during childhood.)

Odds games against an engine are played with contempt equal to the material difference.

Sorry you didn't know that beforehand.

Obviously fine. I posted here to get something better than my single point estimate of what's up with this thing.

The post expands on the ML field's intuition that reinforcement learning doesn't always work and that getting it to work is a fiddly process.

In the final chapter, a DeepMind paper arguing that 'one weird trick' will work is demolished.

The problem under consideration is very important for some possible futures of humanity.

However, the author's eudaimonic wishlist is self-admittedly geared for fiction production, and doesn't seem to be very enforceable.

It's a fine overview of modern language models. The idea of scaling all the skills at the same time is highlighted, which is different from human developmental psychology. Since the post was published, the 500B PaLM model has seemed to show jumps on around 25% of the BIG-bench tasks.

The inadequacy of measuring average performance of an LLM is discussed: a proportion of the outputs is good, and the rest is outright failure from a human PoV. Scale seems to help with the rate of success.

In the 7th footnote, it should be 5e9, not 5e6 (this doesn't seem to impact the reasoning qualitatively).

The argument against CEV seems cool, thanks for formulating it. I guess we are leaving some utility on the table with any particular approach.

The part on referring to a model to adjudicate itself seems really off. I have a hard time imagining a thing that has better performance at the meta-level than at the object-level. Do you have a concrete example?

Thanks for giving it a think.

Turning off is not a solved problem, e.g. https://www.lesswrong.com/posts/wxbMsGgdHEgZ65Zyi/stop-button-towards-a-causal-solution

Finite utility doesn't help, as long as you need to use probability. You get that a 95% chance of 1 unit of utility is worse than 99%, which is worse than 99.9%, etc. And if you then apply the same trick to probabilities, you get a quantilizer. And that doesn't work either: https://www.lesswrong.com/posts/ZjDh3BmbDrWJRckEb/quantilizer-optimizer-with-a-bounded-amount-of-output-1
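To make the point about probabilities concrete, here is a minimal Python sketch. It assumes a utility capped at 1 unit and a uniform base distribution over a finite list of actions; the names `expected_utility` and `quantilize` are made up for illustration and are not taken from the linked posts.

```python
import random

def expected_utility(p_success, utility_cap=1.0):
    # Utility is bounded by utility_cap, but expected utility still
    # grows with the probability of success.
    return p_success * utility_cap

def quantilize(actions, utility, q, rng):
    # Toy quantilizer (assumes a uniform base distribution over actions):
    # rank actions by utility and sample uniformly from the top q fraction
    # instead of always taking the single best one.
    ranked = sorted(actions, key=utility, reverse=True)
    top_n = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:top_n])

# A maximizer with capped utility still strictly prefers 99.9% over 99% over 95%.
probs = [0.95, 0.99, 0.999]
eus = [expected_utility(p) for p in probs]

# With q = 0.1 the pick is random, but always from the best tenth of the actions.
pick = quantilize(range(100), lambda a: a, 0.1, random.Random(0))
```

The cap bounds the payoff, not the pressure to push the success probability toward 1; the quantilizer replaces the argmax with a random draw from the top fraction, which is the "same trick" applied to probabilities.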
