All of Vivek Hebbar's Comments + Replies

When you describe the "emailing protein sequences -> nanotech" route, are you imagining an AGI with computers on which it can run code (like simulations)?  Or do you claim that the AGI could design the protein sequences without writing simulations, by simply thinking about it "in its head"?

7Eliezer Yudkowsky10d
At the superintelligent level there's not a binary difference between those two clusters.  You just compute each thing you need to know efficiently.

Cool! It wrote and executed code to solve the problem, and it got it right.

Are you using chat-GPT-4?  I thought it can't run code?

1Jonathan Marcus2mo
Interesting! Yes, I am using ChatGPT with GPT-4. It printed out the code, then *told me that it ran it*, then printed out a correct answer. I didn't think to fact-check it; instead I assumed that OpenAI had been adding some impressive/scary new features.

Interesting, I find what you are saying here broadly plausible, and it is updating me (at least toward greater uncertainty/confusion).  I notice that I don't expect the 10x effect, or the Von Neumann effect, to be anywhere close to purely genetic.  Maybe some path-dependency in learning?  But my intuition (of unknown quality) is that there should be some software tweaks which make the high end of this more reliably achievable.

Anyway, to check that I understand your position, would this be a fair dialogue?:

Person: "The jump from chimps to hu

... (read more)
5jacob_cannell2mo
Your model of my model sounds about right, but I also include neotany extension of perhaps 2x which is part of the scale up (spending longer on training the cortex, especially in higher brain regions). For Von Neumann in particular my understanding is he was some combination of 'regular' genius and a mentant (a person who can perform certain computer like calculations quickly), which was very useful for many science tasks in an era lacking fast computers and software like mathematica, but would provide less of an effective edge today. It also inflated people's perception of his actual abilities.

In your view, who would contribute more to science -- 1000 Einsteins, or 10,000 average scientists?[1]

"IQ variation is due to continuous introduction of bad mutations" is an interesting hypothesis, and definitely helps save your theory.  But there are many other candidates, like "slow fixation of positive mutations" and "fitness tradeoffs[2]".

Do you have specific evidence for either:

  1. Deleterious mutations being the primary source of IQ variation
  2. Human intelligence "plateauing" around the level of top humans[3]

Or do you believe these things just because ... (read more)

2Alexander Gietelink Oldenziel16d
IIRC according to gwern the theory that IQ variation is mostly due to mutational load has been debunked by modern genomic studies [though mutational load definitely has a sizable effect on IQ]. IQ variation seems to be mostly similar to height in being the result of the additive effect of many individual common allele variations.
8jacob_cannell2mo
I vaguely agree with your 90%/60% split for physics vs chemistry. In my field of programming we have the 10x myth/meme, which I think is reasonably correct but it really depends on the task. For the 10x programmers it's some combination of greater IQ/etc but also starting programming earlier with more focused attention for longer periods of time, which eventually compounds into the 10x difference. But it really depends on the task distribution - there are some easy tasks where the limit is more typing speed and compilation, and at the extreme there are more theoretical tasks that require some specific combination of talent, knowledge, extended grind focus for great lengths of time, and luck. Across all fields combined there seem to be perhaps 1000 to 10000 top contributors? But it seems to plateau in the sense that I do not agree that John Von Neumman (or whoever your 100x candidate is) was 10x einstein or even Terrence Tao or Kasparov (or that either would be 10x carmack in programming, if that was their field), and given that there have been 100 billion humans who have ever lived and most lived a while ago, there should have been at least a few historical examples 10x or 100x John Von Neumman. I dont see evidence for that at all. I do think people here hero worship a bit and overestimate the flatness of the upper tail of the genetic component of intelligence in particular (ie IQ) and its importance. But that being said your vibe numbers don't seem so out of whack.

It would still be interesting to know whether you were surprised by GPT-4's capabilities (if you have played with it enough to have a good take)

5Steven Byrnes2mo
When I started blogging about AI alignment in my free time, it happened that GPT-2 had just come out, and everyone on LW was talking about it. So I wrote a couple blog posts (e.g. 1 [https://www.alignmentforum.org/posts/EMZeJ7vpfeF4GrWwm/self-supervised-learning-and-agi-safety],2 [https://www.lesswrong.com/posts/AKtn6reGFm5NBCgnd/in-defense-of-oracle-tool-ai-research]) trying (not very successfully, in hindsight, but I was really just starting out, don’t judge) to think through what would happen if GPT-N could reach TAI / x-risk levels. I don’t recall feeling strongly that it would or wouldn’t reach those levels, it just seemed like worth thinking about from a safety perspective and not many other people were doing so at the time. But in the meantime I was also gradually getting into thinking about brain algorithms, which involve RL much more centrally, and I came to believe that that RL was necessary to reach dangerous capability levels (recent discussion here [https://www.lesswrong.com/posts/PDx4ueLpvz5gxPEus/why-i-m-not-working-on-debate-rrm-elk-natural-abstractions#1_1__Trying__to_figure_something_out_seems_both_necessary___dangerous]; I think the first time I wrote it down was here [https://www.lesswrong.com/posts/Gfw7JMdKirxeSPiAk/solving-the-whole-agi-control-problem-version-0-0001#7_2__Tool_AI__from_self_supervised_learning_without_RL]). And I still believe that, and I think the jury’s out as to whether it’s true. (RLHF doesn’t count, it’s just a fine-tuning step, whereas in the brain it’s much more central.) My updates since then have felt less like “Wow look at what GPT can do” and more like “Wow some of my LW friends think that GPT is rapidly approaching the singularity, and these are pretty reasonable people who have spent a lot more time with LLMs than I have”. I haven’t personally gotten much useful work out of GPT-4. Especially not for my neuroscience work. I am currently using GPT-4 only for copyediting. (“[The following is a blog post draft. Pleas
2Alexander Gietelink Oldenziel2mo
fwiw, I think I'm fairly close to Steven Byrnes' model. I was not surprised by gpt-4 (but like most people who weren't following LLMs closely was surprised by gpt-2 capabilities)

Human intelligence in terms of brain arch priors also plateaus

Why do you think this?

POV: I'm in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don't care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).

Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?

6TurnTrout2mo
I think this highlights a good counterpoint. I think this alternate theory predicts "probably not", although I can contrive hypotheses for why people would sacrifice themselves (because they have learned that high-status -> reward; and it's high-status to sacrifice yourself for your kid). Or because keeping your kid safe -> high reward as another learned drive. Overall this feels like contortion but I think it's possible. Maybe overall this is a... 1-bit update against the "not selection for caring about reality" point?

Isn't going from an average human to Einstein a huge increase in science-productivity, without any flop increase? Then why can't there be software-driven foom, by going farther in whatever direction Einstein's brain is from the average human?

6jacob_cannell2mo
Science/engineering is often a winner-take all race. To him who has is given more - so for every Einstein there are many others less well known (Lorentz, Minkowski), and so on. Actual ability is filtered through something like a softmax to produce fame, so fame severely underestimates ability. Evolution proceeds by random exploration of parameter space, the more intelligent humans only reproduce a little more than average in aggregation, and there is drag due to mutations. So the subset of the most intelligent humans represents the upper potential of the brain, but it clearly asymptotes. Finally, intelligence results from the interaction of genetics and memetics, just like in ANNs. Digital minds can be copied easily (well at least current ones - future analog neuromorphic minds may be more difficult to copy), so it seems likely that they will not have the equivalent of the mutation load issue as much. On the other hand the great expense of training digital minds and the great cost of GPU RAM means they have much less diversity - many instances of a few minds. None of this by itself leaves much hope for foom.

Of course, my argument doesn't pin down the nature or rate of software-driven takeoff, or whether there is some ceiling.  Just that the "efficiency" arguments don't seem to rule it out, and that there's no reason to believe that science-per-flop has a ceiling near the level of top humans.

You could use all of world energy output to have a few billion human-speed AGIs, or millions that think 1000x faster, etc.

Isn't it insanely transformative to have millions of human-level AIs which think 1000x faster??  The difference between top scientists and average humans seems to be something like "software" (Einstein isn't using 2x the watts or neurons).  So then it should be totally possible for each of the "millions of human-level AIs" to be equivalent to Einstein.  Couldn't a million Einstein-level scientists running at 1000x speed ... (read more)

4jacob_cannell2mo
Yes it will be transformative. GPT models already think 1000x to 10000x faster - but only for the learning stage (absorbing knowledge), not for inference (thinking new thoughts).
3Vivek Hebbar2mo
Of course, my argument doesn't pin down the nature or rate of software-driven takeoff, or whether there is some ceiling.  Just that the "efficiency" arguments don't seem to rule it out, and that there's no reason to believe that science-per-flop has a ceiling near the level of top humans.

In your view, is it possible to make something which is superhuman (i.e. scaled beyond human level), if you are willing to spend a lot on energy, compute, engineering cost, etc?

1NicholasKross2mo
Oops. Fixed!
2Teerth Aloke2mo
QA sessions.

Any idea why "cheese Euclidean distance to top-right corner" is so important?  It's surprising to me because the convolutional layers should apply the same filter everywhere.

2TurnTrout3mo
I'm also lightly surprised by the strength of the relationship, but not because of the convolutional layers. It seems like if "convolutional layers apply the same filter everywhere" makes me surprised by the cheese-distance influence, it should also make me be surprised by "the mouse behaves differently in a dead-end versus a long corridor" or "the mouse tends to go to the top-right."  (I have some sense of "maybe I'm not grappling with Vivek's reasons for being surprised", so feel free to tell me if so!)
3Vaniver3mo
My naive guess is that the other relationships are nonlinear, and this is the best way to approximate those relationships out of just linear relationships of the variables the regressor had access to.

See Godel's incompleteness theorems.  For example, consider the statement "For all A, (ZFC proves A) implies A", encoded into a form judgeable by ZFC itself.  If you believe ZFC to be sound, then you believe that this statement is true, but due to Godel stuff you must also believe that ZFC cannot prove it.  The reasons for believing ZFC to be sound are reasons from "outside the system" like "it looks logically sound based on common sense", "it's never failed in practice", and "no-one's found a valid issue".  Godel's theorems let us conv... (read more)
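In standard provability-logic notation (my own rendering, not from the comment), the statement being discussed is the soundness/reflection schema for ZFC:

```latex
% Soundness of ZFC, stated as a schema judgeable by ZFC itself
% (one instance per sentence A; it is not a single first-order statement):
\forall A:\quad \mathrm{Prov}_{\mathrm{ZFC}}(\ulcorner A \urcorner) \;\rightarrow\; A
% Taking A = \bot gives \neg\,\mathrm{Prov}_{\mathrm{ZFC}}(\ulcorner \bot \urcorner),
% i.e. \mathrm{Con}(\mathrm{ZFC}), which ZFC cannot prove by Godel's
% second incompleteness theorem (assuming ZFC is consistent).
```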

1ZT53mo
I think we understand each other! Thank you for clarifying. The way I translate this: some logical statements are true (to you) but not provable (to you), because you are not living in a world of mathematical logic, you are living in a messy, probabilistic world. It is nevertheless true, by the rule of necessitation in provability logic [https://en.wikipedia.org/wiki/Provability_logic], that if a logical statement is true within the system, then it is also provable within the system. P -> □P. Because the fact that the system is making the statement P is the proof. Within a logical system, there is an underlying assumption that the system only makes true statements. (ok, this is potentially misleading and not strictly correct) This is fascinating! So my takeaway is something like: our reasoning about logical statements and systems is not necessarily "logical" itself, but is often probabilistic and messy. Which is how it has to be, given... our bounded computational power, perhaps? This very much seems to be a logical uncertainty [https://www.lesswrong.com/tag/logical-uncertainty] thing.

??? For math this is exactly backward, there can be true-but-unprovable statements

1ZT53mo
Then how do you know they are true? If you do know then they are true, it is because you have proven it, no? But I think what you are saying is correct, and I'm curious to zoom in on this disagreement.

Agreed.  To give a concrete toy example:  Suppose that Luigi always outputs "A", and Waluigi is {50% A, 50% B}.  If the prior is {50% luigi, 50% waluigi}, each "A" outputted is a 2:1 update towards Luigi.  The probability of "B" keeps dropping, and the probability of ever seeing a "B" asymptotes to 50% (as it must).

This is the case for perfect predictors, but there could be some argument about particular kinds of imperfect predictors which supports the claim in the post.
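The arithmetic of this toy example can be sketched as follows (a hypothetical illustration of the 2:1 update, not code from the thread):

```python
from fractions import Fraction

# Toy model: Luigi always outputs "A"; Waluigi outputs "A" or "B" with equal probability.
P_A_GIVEN_LUIGI = Fraction(1)
P_A_GIVEN_WALUIGI = Fraction(1, 2)

def posterior_luigi_after_n_As(n, prior_luigi=Fraction(1, 2)):
    """Posterior P(Luigi) after observing n consecutive 'A's (Bayes' rule)."""
    prior_waluigi = 1 - prior_luigi
    num = prior_luigi * P_A_GIVEN_LUIGI ** n
    den = num + prior_waluigi * P_A_GIVEN_WALUIGI ** n
    return num / den

def prob_ever_see_B(max_steps, prior_waluigi=Fraction(1, 2)):
    """P(at least one 'B' within max_steps) = P(Waluigi) * (1 - (1/2)^max_steps)."""
    return prior_waluigi * (1 - Fraction(1, 2) ** max_steps)

# Each 'A' is a 2:1 update toward Luigi:
assert posterior_luigi_after_n_As(1) == Fraction(2, 3)
# P(ever seeing a 'B') asymptotes to 50% from below:
print(float(prob_ever_see_B(20)))  # ~0.4999995
```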

5abramdemski3mo
LLMs are high order Markov models, meaning they can't really balance two different hypotheses in the way you describe; because evidence drops out of memory eventually, the probability of Waluigi drops very small instead of dropping to zero. This makes an eventual waluigi transition inevitable as claimed in the post.

Context windows could make the claim from the post correct. Since the simulator can only consider a bounded amount of evidence at once, its P[Waluigi] has a lower bound. Meanwhile, it takes much less evidence than fits in the context window to bring its P[Luigi] down to effectively 0.

Imagine that, in your example, once Waluigi outputs B it will always continue outputting B (if he's already revealed to be Waluigi, there's no point in acting like Luigi). If there's a context window of 10, then the simulator's probability of Waluigi never goes below 1/1025, w... (read more)
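The 1/1025 figure can be checked directly (a sketch under the stated assumptions: prior 1:1, each 'A' in the window is a 2:1 update toward Luigi):

```python
from fractions import Fraction

def windowed_waluigi_posterior(window):
    """Lower bound on P(Waluigi) when the simulator only sees `window`
    consecutive 'A's: odds of Luigi:Waluigi reach at most 2**window : 1."""
    odds_luigi = 2 ** window
    return Fraction(1, odds_luigi + 1)

# With a context window of 10, P(Waluigi) never drops below 1/1025:
assert windowed_waluigi_posterior(10) == Fraction(1, 1025)
```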

1Eschaton3mo
The transform isn't symmetric though right? A character portraying "good" behaviour is, narratively speaking, more likely to have been deceitful the whole time or transform into a villain than for the antagonist to turn "good".
7Cleo Nardo3mo
Yep I think you might be right about the maths actually. I'm thinking that waluigis with 50% A and 50% B have been eliminated by llm pretraining and definitely by rlhf. The only waluigis that remain are deceptive-at-initialisation. So what we have left is a superposition of a bunch of luigis and a bunch of waluigis, where the waluigis are deceptive, and for each waluigi there is a different phrase that would trigger them. I'm not claiming basin of attraction is the entire space of interpolation between waluigis and luigis. Actually, maybe "attractor" is the wrong technical word to use here. What I want to convey is that the amplitude of the luigis can only grow very slowly and can be reversed, but the amplitude of the waluigi can suddenly jump to 100% in a single token and would remain there permanently. What's the right dynamical-systemy term for that?

In section 3.7 of the paper, it seems like the descriptions ("6 in 5", etc) are inconsistent across the image, the caption, and the paragraph before them.  What are the correct labels?  (And maybe fix the paper if these are typos?)

Does the easiest way to make you more intelligent also keep your values intact?

What exactly do you mean by "multi objective optimization"?

1DragonGod6mo
Optimising multiple objective functions in a way that cannot be collapsed into a single utility function to e.g. the reals. I guess multi objective optimisation can be represented by a single utility function that maps to a vector space, but as far as I'm aware, utility functions usually have a field as their codomain.

It would help if you specified which subset of "the community" you're arguing against.  I had a similar reaction to your comment as Daniel did, since in my circles (AI safety researchers in Berkeley), governance tends to be well-respected, and I'd be shocked to encounter the sentiment that working for OpenAI is a "betrayal of allegiance to 'the community'".

To be clear, I do think most people who have historically worked on "alignment" at OpenAI have probably caused great harm! And I do think I am broadly in favor of stronger community norms against working at AI capability companies, even in so called "safety positions". So I do think there is something to the sentiment that Critch is describing.

In ML terms, nearly-all the informational work of learning what “apple” means must be performed by unsupervised learning, not supervised learning. Otherwise the number of examples required would be far too large to match toddlers’ actual performance.

I'd guess the vast majority of the work (relative to the max-entropy baseline) is done by the inductive bias.

8Rohin Shah6mo
You don't need to guess; it's clearly true. Even a 1 trillion parameter network where each parameter is represented with 64 bits can still only represent at most 264,000,000,000,000 different functions, which is a tiny tiny fraction of the full space of 228,000,000 possible functions. You're already getting at least 28,000,000−64,000,000,000,000 of the bits just by choosing the network architecture. (This does assume things like "the neural network can learn the correct function rather than a nearly-correct function" but similarly the argument in the OP assumes "the toddler does learn the correct function rather than a nearly-correct function".)

Beware, though; string theory may be what underlies QFT and GR, and it describes a world of stringy objects that actually do move through space

I think this contrast is wrong.[1]  IIRC, strings have the same status in string theory that particles do in QFT.  In QM, a wavefunction assigns a complex number to each point in configuration space, where state space has an axis for each property of each particle.[2]  So, for instance, a system with 4 particles with only position and momentum will have a 12-dimensional configuration space.[3]  I... (read more)

2Adam Scherlis6mo
QFT doesn't actually work like that -- the "classical degrees of freedom" underlying its configuration space are classical fields over space, not properties of particles. Note that Quantum Field Theory is not the same as the theory taught in "Quantum Mechanics" courses, which is as you describe. "Quantum Mechanics" (in common parlance): quantum theory of (a fixed number of) particles, as you describe. "Quantum Field Theory": quantum theory of fields, which are ontologically similar to cellular automata. "String Theory": quantum theory of strings, and maybe branes, as you describe.* "Quantum Mechanics" (strictly speaking): any of the above; quantum theory of anything. You can do a change of basis in QFT and get something that looks like properties of particles (Fock space), and people do this very often, but the actual laws of physics in a QFT (the Lagrangian) can't be expressed nicely in the particle ontology because of nonperturbative effects. This doesn't come up often in practice -- I spent most of grad school thinking QFT was agnostic about whether fields or particles are fundamental -- but it's an important thing to recognize in a discussion about whether modern physics privileges one ontology over the other. (Note that even in the imperfect particle ontology / Fock space picture, you don't have a finite-dimensional classical configuration space. 12 dimensions for 4 particles works great until you end up with a superposition of states with different particle numbers!) String theory is as you describe, AFAIK, which is why I contrasted it to QFT. But maybe a real string theorist would tell me that nobody believes those strings are the fundamental degrees of freedom, just like particles aren't the fundamental degrees of freedom in QFT. *Note: People sometimes use "string theory" to refer to weirder things like M-theory, where nobody knows which degrees of freedom to use...

As I understand Vivek's framework, human value shards explain away the need to posit alignment to an idealized utility function. A person is not a bunch of crude-sounding subshards (e.g. "If food nearby and hunger>15, then be more likely to go to food") and then also a sophisticated utility function (e.g. something like CEV). It's shards all the way down, and all the way up.[10] 

This read to me like you were saying "In Vivek's framework, value shards explain away .." and I was confused.  I now think you mean "My take on Vivek's is that value s... (read more)

2TurnTrout7mo
Reworded, thanks.

Makes perfect sense, thanks!

"Well, what if I take the variables that I'm given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I'm given, and consider the space of all partitions on that product of variables that I'm given; and each one of those partitions will be its own variable.

How can a partition be a variable?  Should it be "part" instead?

3Ramana Kumar7mo
Partitions (of some underlying set) can be thought of as variables like this: * The number of values the variable can take on is the number of parts in the partition. * Every element of the underlying set has some value for the variable, namely, the part that that element is in. Another way of looking at it: say we're thinking of a variable v:S→D as a function from the underlying set S to v's domain D. Then we can equivalently think of v as the partition {{s∈S∣v(s)=d}∣d∈D}∖∅ of S with (up to) |D| parts. In what you quoted, we construct the underlying set by taking all possible combinations of values for the "original" variables. Then we take all partitions of that to produce all "possible" variables on that set, which will include the original ones and many more.

ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation

Update: I started reading your paper "Corrigibility with Utility Preservation".[1]  My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6.  AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".

Quick thoughts after reading less than half:

AFAICT,[2] this is a mathematica... (read more)

3Koen.Holtman7mo
Corrigibility with Utility Preservation is not the paper I would recommend you read first, see my comments included in the list I just posted. To comment on your quick thoughts: * My later papers spell out the ML analog of the solution in `Corrigibility with' more clearly. * On your question of Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?: Given how re-tellings in the blogosphere work to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results of MIRI exist, but MIRI does not have any actual mathematically proven impossibility results about corrigibility. The corrigibility paper proves that one approach did not work, but does not prove anything for other approaches. What they have is that 2022 Yudkowsky is on record expressing strongly held beliefs that corrigibility is very very hard, and (if I recall correctly) even saying that nobody has made any progress on it in the last ten years. Not everybody on this site shares these beliefs. If you formalise corrigibility in a certain way, by formalising it as producing a full 100% safety, no 99.999% allowed, it is trivial to prove that a corrigible AI formalised that way can never provably exist, because the humans who will have to build, train, and prove it are fallible. Roman Yampolskiy has done some writing about this, but I do not believe that this kind or reasoning is at the core of Yudkowsky's arguments for pessimism. * On being misleadingly optimistic in my statement that the technical problems are mostly solved: as long as we do not have an actual AGI in real life, we can only ever speculate about how difficult it will be to make it corrigible in real life. This speculation can then lead to optimistic or pessimistic conclusions. Late-stage Yudkowsky is of course well-known for speculating that everybody who shows some optimism about al

To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs

Can you link these papers here?  No need to write anything, just links.

8Koen.Holtman7mo
OK, Below I will provide links to few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones. This list or links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019 trying to find all mathematical papers of interest, but have not done so since. I wrote some of the papers below, and have read all the rest of them. I am not linking to any papers I heard about but did not read (yet). Math-based work on corrigibility solutions typically starts with formalizing corrigibility, or a sub-component of corrigibility, as a mathematical property we want an agent to have. It then constructs such an agent with enough detail to show that this property is indeed correctly there, or at least there during some part of the agent lifetime, or there under some boundary assumptions. Not all of the papers below have actual mathematical proofs in them, some of them show correctness by construction. Correctness by construction is superior to having to have proofs: if you have correctness by construction, your notation will usually be much more revealing about what is really going on than if you need proofs. Here is the list, with the bold headings describing different approaches to corrigibility. Indifference to being switched off, or to reward function updates Motivated Value Selection for Artificial Agents [https://www.fhi.ox.ac.uk/wp-content/uploads/2015/03/Armstrong_AAAI_2015_Motivated_Value_Selection.pdf] introduces Armstrong's indifference methods for creating corrigibility. It has some proofs, but does not completely work out the math of the solution to a this-is-how-to-implement-it level. Corrigibility [https://intelligence.org/files/Corrigibility.pdf] tried to work out the how-to-implement-it details of the paper above but famously failed to do so, and has proofs showing that it failed to do so. This paper som
7Vivek Hebbar7mo
ETA: Koen recommends reading Counterfactual Planning in AGI Systems [https://arxiv.org/abs/2102.00834] before (or instead of) Corrigibility with Utility Preservation [https://www.alignmentforum.org/posts/3uHgw2uW6BtR74yhQ/new-paper-corrigibility-with-utility-preservation] Update: I started reading your paper "Corrigibility with Utility Preservation [https://www.alignmentforum.org/posts/3uHgw2uW6BtR74yhQ/new-paper-corrigibility-with-utility-preservation]".[1]  My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6.  AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer". Quick thoughts after reading less than half: AFAICT,[2] this is a mathematical solution to corrigibility in a toy problem, and not a solution to corrigibility in real systems.  Nonetheless, it's a big deal if you have in fact solved the utility-function-land version which MIRI failed to solve.[3]  Looking to applicability, it may be helpful for you to spell out the ML analog to your solution (or point us to the relevant section in the paper if it exists).  In my view, the hard part of the alignment problem is deeply tied up with the complexities of the {training procedure --> model} map, and a nice theoretical utility function is neither sufficient nor strictly necessary for alignment (though it could still be useful). So looking at your claim that "the technical problem [is] mostly solved", this may or may not be true for the narrow sense (like "corrigibility as a theoretical outer-objective problem in formally-specified environments"), but seems false and misleading for the broader practical sense ("knowing how to make an AGI corrigible in real life").[4] Less important, but I wonder if the authors of Soares et al agree with your remark in this excerpt[5]: "In particular, [Soares et al [https://intelligence.org/files/Corrigibility.pdf]] uses a Platon
  1. Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.

Improve it with respect to what?  

My attempt at a framework where "improving one's own evaluator" and "believing in adversarial examples to one's own evaluator" make sense:

  • The agent's allegiance is to some idealized utility function Uideal (like CEV).  The agent's internal evaluator Eval is "trying" to approximate Uideal by reasoning heuristically.  So now we ask Eval to evaluate the plan "do argmax w.r.t
... (read more)
3TurnTrout7mo
Vivek -- I replied to your comment in appendix C of today's follow-up post, Alignment allows imperfect decision-influences and doesn't require robust grading [https://www.lesswrong.com/posts/rauMEna2ddf26BqiE/alignment-allows-imperfect-decision-influences-and-doesn-t]. 
4adamShimi7mo
The way you write this (especially the last sentence) makes me think that you see this attempt as being close to the only one that makes sense to you atm. Which makes me curious: * Do you think that you are internally trying to approximate your own Uideal? * Do you think that you have ever made the decision (either implicitly or explicitly) to not eval all or most plans because you don't trust your ability to do so for adversarial examples (as opposed to tractability issues for example)? * Can you think of concrete instances where you improved your own Eval? * Can you think of concrete instances where you thought you improved you own Eval but then regretted it later? * Do you think that your own changes to your eval have been moving in the direction of your Uideal?
5cfoster07mo
Yeah I think you're on the right track. A simple framework (that probably isn't strictly distinct from the one you mentioned) would be that the agent has a foresight evaluation method that estimates "How good do I think this plan is?" and a hindsight evaluation method that calculates "How good was it, really?". There can be plans that trick the foresight evaluation method relative to the hindsight one. For example, I can get tricked into thinking some outcome is more likely than it actually is ("The chances of losing my client's money with this investment strategy were way higher than I thought they were.") or thinking that some new state will be hindsight-evaluated better than it actually will be ("He convinced me that if I tried coffee, I would like it, but I just drank it and it tastes disgusting."), etc.
8Wei Dai7mo
This is tempting, but the problem is that I don't know what my idealized utility function is (e.g., I don't have a specification for CEV that I think would be safe or ideal to optimize for), so what does it mean to try to approximate it? Or consider that I only read about CEV one day in a blog, so what was I doing prior to that? Or if I was supposedly trying to approximate CEV, I can change my mind about it if I realize that it's a bad idea, but how does that fit into the framework?

My own framework is something like this:
* The evaluation process is some combination of gut, intuition, explicit reasoning (e.g. cost-benefit analysis), doing philosophy, and cached answers.
* I think there are "adversarial inputs" because I've previously done things that I later regretted, due to evaluating them highly in ways that I no longer endorse. I can also see other people sometimes doing obviously crazy things (which they may or may not later regret). I can see people (including myself) being persuaded by propaganda / crazy memes, so there must be a risk of persuading myself with my own bad ideas.
* I can try to improve my evaluation process by doing things like:
  1. look for patterns in my and other people's mistakes
  2. think about ethical dilemmas / try to resolve conflicts between my evaluative subprocesses
  3. do more philosophy (think/learn about ethical theories, metaethics, decision theory, philosophy of mind, etc.)
  4. talk (selectively) to other people
  5. try to improve how I do explicit reasoning or philosophy

Yeah, the right column should obviously be all 20s.  There must be a bug in my code[1] :/

I like to think of the argmax function as something that takes in a distribution on probability distributions on W with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.

Take the following hypothesis h3:

If I add this into the mixture with some weight, then the middle column is still near... (read more)

2Slider7mo
This maps the credence, but I would imagine that the confidence would not be evenly spread around the boxes. With confidence literally 0 it does not make sense to express any credence to stand any taller than another, as 1 and 0 would make equal sense. With a minuscule confidence the foggy hunch does point in some direction.

Without h3 it is consistent to have middle square confidence 0. With positive plausibility of h3, the middle square is not "glossed over": we have some confidence it might matter. But because h3 is totally useless for credences, those come from the structures of h1 and h2. Thus effectively h1 and h2 are voting for zero despite not caring about it.

Contrast what would happen with an even more trivial hypothesis of one square covering all with 100%, or a 9x9 equiprobable hypothesis. You could also have a "micro detail hypothesis" (actually a 3x3): a 9x9 grid where each 3x3 is zeroes everywhere else than the bottom right corner, and all the "small square locations" are in the same case among the other "big square" correspondents. The "big scale" hypotheses do not really mind the "small scale" dragging of the credence around. Thus the small bottom-right square is quite sensitive to the corresponding big square value, and the other small squares are relatively insensitive.

Mixing two 3x3 resolutions that are orthogonal results in a 9x9 resolution which is sparse (because it is separable). The John Vervaeke meme of "stereoscopic vision" seems to apply. The two 2x2 perspectives are not entirely orthogonal, so the "sparsity" is not easy to catch.
2Scott Garrabrant7mo
The point I was trying to make with the partial functions was something like "Yeah, there are 0s, yeah it is bad, but at least we can never assign low probability to any event that any of the hypotheses actually cares about." I guess I could have made that argument more clearly if instead I just pointed out that any event in the sigma algebra of any of the hypotheses will have probability at least equal to the probability of that hypothesis times the probability of that event in that hypothesis. Thus the 0s (and the 10^-9s) are really coming from the fact that (almost) nobody cares about those events.
2Scott Garrabrant7mo
I agree with all your intuition here. The thing about the partial functions is unsatisfactory, because it is discontinuous. It is trying to be #1, but a little more ambitious. I want the distribution on distributions to be a new type of epistemic state, and the geometric maximization to be the mechanism for converting the new epistemic state to a traditional probability distribution. I think that any decent notion of an embedded epistemic state needs to be closed under both mixing and coarsening, and this is trying to satisfy that as naturally as possible.

I think that the 0s are pretty bad, but I think they are the edge case of the only reasonable thing to do here. I think the reason it feels like the only reasonable thing to do for me is something like credit assignment/hypothesis autonomy. If a world gets probability mass, that should be because some hypothesis or collection of hypotheses insisted on putting probability mass there. You gave an edge case example where this didn't happen. Maybe everything is edge cases. I am not sure.

It might be that the 0s are not as bad as they seem. 0s seem bad because we have cached that "0 means you can't update," but maybe you aren't supposed to be updating in the output distribution anyway; you are supposed to do your updating in the more general epistemic state input object.

I actually prefer a different proposal for the type of "epistemic state that is closed under coarsening and mixture" that is more general than the thing I gesture at in the post: A generalized epistemic state is a (quasi-?)convex function ΔW→R. A standard probability distribution is converted to an epistemic state through P↦(Q↦D_KL(P||Q)). A generalized epistemic state is converted to a (convex set of) probability distribution(s) by taking an argmin. Mixture is mixture as functions, and coarsening is the obvious thing (given a function W→V, we can convert a generalized epistemic state over V to a generalized epistemic state over W by precomposing with the pushforward map ΔW→ΔV).

Now, let's consider the following modification: Each hypothesis is no longer a distribution on W, but instead a distribution on some coarser partition of W. Now the argmax is still well defined

Playing around with this a bit, I notice a curious effect (ETA: the numbers here were previously wrong, fixed now):

The reason the middle column goes to zero is that hypothesis A puts 60% on the rightmost column, and hypothesis B puts 40% on the leftmost, and neither cares about the middle column specifically.

But philosophically, what d... (read more)

2Scott Garrabrant7mo
I think your numbers are wrong, and the right column on the output should say 20% 20% 20%. The output actually agrees with each of the components on every event in that component's sigma algebra. The input distributions don't actually have any conflicting beliefs, and so of course the output chooses a distribution that doesn't disagree with either.

I agree that the 0s are a bit unfortunate. I think the best way to think of the type of the object you get out is not a probability distribution on W, but what I am calling a partial probability distribution on W. A partial probability distribution is a partial function from 2^W→[0,1] that can be completed to a full probability distribution on W (with some sigma algebra that is a superset of the domain of the partial probability distribution).

I like to think of the argmax function as something that takes in a distribution on probability distributions on W with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components. One nice thing about this definition is that it makes it so the argmax always takes on a unique value. (Proof omitted.)

This doesn't really make it that much better, but the point here is that this framework admits that it doesn't really make much sense to ask about the probability of the middle column. You can ask about any of the events in the original pair of sigma algebras, and indeed, the two inputs don't disagree with the output at all on any of these sets.

most egregores/epistemic networks, which I'm completely reliant upon, are much smarter than me, so that can't be right

*Egregore smiles*

Another way of looking at this question:  Arithmetic rationality is shift invariant, so you don't have to know your total balance to calculate expected values of bets.  Whereas for geometric rationality, you need to know where the zero point is, since it's not shift invariant.
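A quick numerical illustration of the asymmetry (toy numbers, just to make the shift-invariance point concrete):

```python
import math

# A fair ±$10 coin flip, evaluated two ways.
# The arithmetic expected change in wealth is the same at any starting
# balance (shift invariant); the expected change in log-wealth is not,
# since log cares about where zero is.
def arith_delta(wealth, bet=10):
    return 0.5 * (wealth + bet) + 0.5 * (wealth - bet) - wealth

def geom_delta(wealth, bet=10):
    return (0.5 * math.log(wealth + bet)
            + 0.5 * math.log(wealth - bet)
            - math.log(wealth))

print(arith_delta(100), arith_delta(1000))  # 0.0 at every balance
print(geom_delta(100), geom_delta(1000))    # negative, and depends on the balance
```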

Which is equivalent to 

Some results related to logarithmic utility and stock market leverage (I derived these after reading your previous post, but I think it fits better here):

Tl;dr: We can derive the optimal stock market leverage for an agent with utility logarithmic in money.  We can also back-derive a utility function from any constant leverage[1], giving us a nice class of utility functions with different levels of risk-aversion.  Logarithmic utility is recovered as a special case, and has additional nice properties which the others may or may not have.

For an agent i... (read more)
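Since the derivation above is truncated, here is a minimal self-contained check of the headline claim, using discrete Kelly betting as the simplest stand-in for leverage (p = 0.6 is a made-up win probability; the continuous-time optimum (μ − r)/σ², the Merton fraction, is stated rather than derived):

```python
import numpy as np

# Even-money bet won with probability p; betting a fraction f of wealth gives
# expected log-growth g(f) = p*log(1+f) + (1-p)*log(1-f), maximized at the
# Kelly fraction f* = 2p - 1. Constant stock leverage is the continuous
# analogue: with drift mu, volatility sigma, and risk-free rate r, the
# log-optimal leverage is L* = (mu - r) / sigma**2.
p = 0.6
fs = np.linspace(0.0, 0.99, 991)                    # candidate bet fractions
g = p * np.log(1 + fs) + (1 - p) * np.log(1 - fs)   # expected log-growth
f_star = fs[np.argmax(g)]
print(f_star, 2 * p - 1)   # grid argmax matches the closed form, 0.2
```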

A framing I wrote up for a debate about "alignment tax":

  1. "Alignment isn't solved" regimes:
    1. Nobody knows how to make an AI which is {safe, general, and broadly superhuman}, with any non-astronomical amount of compute
    2. We know how to make an aligned AGI with 2 to 25 OOMs more compute than making an unaligned one
  2. "Alignment tax" regimes:
    1. We can make an aligned AGI, but it requires a compute overhead in the range 1% - 100x.  Furthermore, the situation remains multipolar and competitive for a while.
    2. The alignment tax is <0.001%, so it's not a concern.
    3. The leadi
... (read more)

If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term.

What kind of regularization could this be?  And are you imagining an AlphaZero-style system with a hardcoded value head, or an organically learned modularity?
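One toy instantiation of the quoted idea, purely illustrative and assuming the modular value-head picture: make the value-learning update myopic by fixing a small discount factor, so long-horizon payoffs are structurally down-weighted. All details (chain environment, γ = 0.5) are hypothetical:

```python
import numpy as np

# TD(0) value learning on a 5-state deterministic chain with reward only at
# the far end. An aggressively small discount factor acts as the
# "regularizer": value decays geometrically with distance from the reward,
# so distant (long-term) payoffs barely register.
n_states, gamma_myopic, alpha = 5, 0.5, 0.1
rewards = np.zeros(n_states)
rewards[-1] = 1.0
V = np.zeros(n_states)
for _ in range(2000):
    for s in range(n_states - 1):          # deterministic transition s -> s+1
        td_target = rewards[s + 1] + gamma_myopic * V[s + 1]
        V[s] += alpha * (td_target - V[s])
print(V)   # values halve with each extra step to the reward
```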

  • Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)] is the probability that, for a random policy π∈ξ, that policy has at least as high utility as the policy G* its program dictates; in essence, how good G's policies are compared to random policy selection

What prior over policies?

given g(G|U), we can infer the probability that an agent G has a given utility function U, as Pr[U] ∝ 2^-K(U) / Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)], where ∝ means "is proportional to" and K(U) is the Kolmogorov complexity of utility function U.

Suppose the prior over policies is max-entropy (uniform over all action seq... (read more)

2Martín Soto7mo
Some kind of simplicity prior, as mentioned here [https://www.lesswrong.com/posts/gHgs2e2J5azvGFatb/infra-bayesian-physicalism-a-formal-theory-of-naturalized#Evaluating_agents].

Yes. In fact I'm not even sure we need your assumption about bits. Say policies are sequences of actions, and suppose at each time step we have N actions available. Then, in our process of approximating your perfect/overfitted utility "1 if {acts exactly like [insert exact copy of my brain] would}, else 0", adding one more specified action to our U can be understood as adding one more symbol to its generating program, and so incrementing K(U) by 1. But also, adding one more (perfect) specified action multiplies the denominator probability by 1/N (since the prior is uniform). So as long as N>2, Pr[U] will be unbounded when approximating your utility.

And of course, this is solved by the simplicity prior, because this makes it easier for simple Us to achieve low denominator probability. So a way simpler U (less overfitted to G*) will achieve almost the same low denominator probability as your function, because the only policies that maximize U better than G* are too complex.
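The blow-up described above is easy to see numerically (N = 4 and the "one symbol per specified action" accounting are illustrative assumptions, not exact complexity claims):

```python
# Unnormalized posterior weight of the "overfit" utility that hardcodes the
# first k actions: Pr[U] ∝ 2^-K(U) / Pr_random[match], with K(U) ≈ k and
# Pr_random[match] = N^-k under a uniform prior over N actions per step.
N = 4  # hypothetical action-space size; any N > 2 shows the divergence

def weight(k):
    return 2**-k / N**-k   # = (N/2)**k, which grows without bound in k

print([weight(k) for k in range(5)])  # [1.0, 2.0, 4.0, 8.0, 16.0]
```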

Consider this quote by Alan Watts:

There are basically two kinds of philosophy. One’s called prickles, the other’s called goo. And prickly people are precise, rigorous, logical. They like everything chopped up and clear. Goo people like it vague. For example, in physics, prickly people believe that the ultimate constituents of matter are particles. Goo people believe it’s waves.

*facepalm*

If one has a technical understanding of QFT[1] (or even half of technical understanding, like me), this sounds totally silly.  There's no real question as to whet... (read more)

5Valentine7mo
I mean… what you're actually criticizing is that Alan Watts is a goo philosopher. He's not trying to be precise or carefully define what he's talking about. He's instead using loose metaphors to convey a feeling. And your objection is that his loose metaphors are pointing at things that have precise definitions and therefore he's technically mistaken in how he's trying to illustrate his point. To which a goo person would shrug. Because they understood the message, and that's the real point. Not technical accuracy of the words & metaphors. So in a funny way, you're actually illustrating Watts' point.

In theory, there can be multiple disconnected manifolds like this.

1Martín Soto7mo
Idk either, but in any event I basically wrote this just to share with Caspar and Sylvester

The directional derivative is zero, so the change is zero to first order.  The second order term can exist. (Consider what happens if you move along the tangent line to a circle.  The distance from the circle goes ~quadratically, since the circle looks locally parabolic.)  Hence the change is O(ε²).
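The circle example can be checked numerically: moving a distance t along the tangent at a point of the unit circle leaves you ≈ t²/2 away from the circle.

```python
import math

def dist_from_circle(t):
    # Tangent line to the unit circle at (1, 0) is the vertical line x = 1;
    # the point (1, t) sits sqrt(1 + t^2) - 1 off the circle.
    return math.hypot(1.0, t) - 1.0

# Zero change at first order, ~t^2/2 at second order:
for t in [0.1, 0.01, 0.001]:
    print(t, dist_from_circle(t) / t**2)   # ratio tends to 1/2
```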

1Ulisse Mini7mo
I see, thanks!

Is the "Analogies" thing a typo?  It says the same thing in both columns.

Interesting post btw!

I have seen one person be surprised (I think twice in the same convo) about what progress had been made.

ETA: Our observations are compatible.  It could be that people used to a poor and slow-moving state of interpretability are surprised by the recent uptick, but that the absolute progress over 6 years is still disappointing.

All in all, I don't think my original post held up well.  I guess I was excited to pump out the concept quickly, before the dust settled.  Maybe this was a mistake?  Usually I make the ~opposite error of never getting around to posting things.

3Johannes Treutlein5mo
I think there should be a space both for in-progress research dumps and for more worked out final research reports on the forum. Maybe it would make sense to have separate categories for them or so.

The perspective and the computations that are presented here (which in my opinion are representative of the mathematical parts of the linked posts and of various other unnamed posts) do not use any significant facts about neural networks or their architecture.

You're correct that the written portion of the Information Loss --> Basin flatness post doesn't use any non-trivial facts about NNs.  The purpose of the written portion was to explain some mathematical groundwork, which is then used for the non-trivial claim.  (I did not know at the time ... (read more)

5Vivek Hebbar7mo
All in all, I don't think my original post held up well.  I guess I was excited to pump out the concept quickly, before the dust settled.  Maybe this was a mistake?  Usually I make the ~opposite error of never getting around to posting things.

Note that, for rational *altruists* (with nothing vastly better to do like alignment), voting can be huge on CDT grounds -- if you actually do the math for a swing state, the leverage per voter is really high.  In fact, I think the logically counterfactual impact-per-voter tends to be lower than the impact calculated by CDT, if the election is very close. 
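A rough version of the swing-state arithmetic (toy model: n other voters, each independently 50/50; real elections are better modeled with uncertainty about the mean vote share, but the shape of the calculation is the point):

```python
import math

# Your vote is decisive iff the n other voters split exactly evenly.
# Exact tie probability: C(n, n/2) / 2^n.
# Stirling approximation: sqrt(2 / (pi * n)).
n = 10_000  # toy electorate size (must be even for an exact tie)
p_exact = math.comb(n, n // 2) / 2**n
p_approx = math.sqrt(2 / (math.pi * n))
print(p_exact, p_approx)   # both ≈ 0.008
```

Even at this small n the decisiveness probability falls off only like 1/√n, which is why the per-voter leverage in a genuinely close large election can still be substantial.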

I'm often in favor, whereas Yudkowsky seems generally against, especially when he's the person being asked to defer (see for example his takedown of "Humbali" here).

This is well explained by the hypothesis that he is epistemically superior to all of us (or at least thinks he is).

[Replying to this whole thread, not just your particular comment]

"Epistemic humility" over distributions of times is pretty weird to think about, and imo generally confusing or unhelpful.  There's an infinite amount of time, so there is no uniform measure.  Nor, afaik, is there any convergent scale-free prior.  You must use your knowledge of the world to get any distribution at all.

You can still claim that higher-entropy distributions are more "humble" w.r.t. to some improper prior.  Which begs the question "Higher entropy w.r.t. what m... (read more)
