Prediction market for whether someone will strengthen our results or prove something about the nonindependent case:
https://manifold.markets/ThomasKwa/will-someone-strengthen-our-goodhar?r=VGhvbWFzS3dh
Downvoted, this is very far from a well-structured argument, and doesn't give me intuitions I can trust either
I'm fairly sure you can get a result something like "it's not necessary to put positive probability mass on two different functions that can't be distinguished by observing only s bits", so some functions can get zero probability, e.g. the XOR of all combinations of at least s+1 bits.
edit: The proof is easy. Let $f_1$, $f_2$ be two such indistinguishable functions that you place positive probability on, F be a random variable for the function, and F' be F but with all probability mass for $f_1$ replaced by $f_2$. Then ....
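My guess at how the argument finishes (my own reconstruction, assuming the only thing that matters about F is the distribution it induces over s-bit observations):

$$\Pr[F' \text{ yields observation } o] \;=\; \Pr[F \text{ yields observation } o] \quad \text{for every } s\text{-bit observation } o,$$

since $f_1$ and $f_2$ agree on every such observation, so moving mass from $f_1$ to $f_2$ leaves the observation distribution, and hence any score computed from it, unchanged. Dropping $f_1$ therefore costs nothing.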
I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.
I think to solve alignment, we need to develop our toolbox of "getting AI systems to behave in ways we choose". Not in the sense of being friendly or producing economic value, but things that push towards whatever cognitive properties we need for a future alignment solution. We can make AI systems do some things we want, e.g. GPT-4 can answer questions using only words starting with "Q", but we don't know how it does this in terms of internal representations of concepts. Current systems are not well-characterized enough that we can predict what they do far O...
This is the most impressive concrete achievement in alignment I've seen. I think this post reduces my p(doom) by around 1%, and I'm excited to see where all of the new directions uncovered lead.
Edit: I explain this view in a reply.
Edit 25 May: I now think RLHF is more impressive in terms of what we can get systems to do, but I still think activation editing has opened up more promising directions.
What other concrete achievements are you considering and ranking less impressive than this? E.g. I think there's a case for more alignment progress having come from RLHF, debate, some mechanistic interpretability, or adversarial training.
Using chatbots and feeling ok about it seems like a no-brainer. It's technology that provides me a multiple percentage point productivity boost, it's used by over a billion people, and a boycott of chatbots is well outside the optimal or feasible space of actions to help the world.
I think the restaurant analogy fails because ChatGPT was not developed out of malice, just recklessness. For the open-source models, there's not even an element of greed.
It doesn't look circular to me? I'm not assuming that we get Goodhart, just that the properties that result in very high X seem like they would be things like "very rhetorically persuasive" or "tricks the human into typing a very large number into the rating box" that won't affect V much, rather than properties that strongly increase both X and V. I believe this less for V, so we'll probably have to replace independence with something like this.
I think you're splitting hairs. We prove Goodhart follows from certain assumptions, and I've given some justification for ...
In my frame, X is not just some variable correlated with V, it's some estimator's best estimate of V, and so it makes sense that residuals would have various properties, for the same reason we consider residuals in statistics, returns in finance, etc.
The basic idea why we might get independence is that there are some properties that increase the overseer's rating and actually make the plan good (say, the plan includes a solution to the shutdown problem, interpretability, or whatever) and different properties that increase the o...
I think this is more like Extremal Goodhart in Garrabrant's taxonomy: there's a distributional shift inherent to high X.
SGD has inductive biases, but we'd have to actually engineer them to get high V rather than high X when only trained on X. In the Gao et al. paper, optimization and overoptimization happened at the same relative rate in RL as in conditioning, so I think the null hypothesis is that training does about as well as conditioning. I'm pretty excited about work that improves on that paper to get higher gold reward while only having access to the proxy reward model.
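For concreteness, the conditioning baseline I have in mind is best-of-n sampling against the proxy reward model, with gold reward measured only on whatever it picks. A minimal sketch, where `sample` and `proxy_reward` are hypothetical stand-ins for a policy sampler and a learned reward model:

```python
import numpy as np

def best_of_n(prompt, sample, proxy_reward, n=16):
    """Conditioning baseline: draw n candidates from the base policy and keep
    the one the proxy reward model scores highest. Gold reward is then measured
    on the selected output only, for evaluation."""
    candidates = [sample(prompt) for _ in range(n)]
    scores = [proxy_reward(prompt, c) for c in candidates]
    return candidates[int(np.argmax(scores))]
```

The interesting question is whether any method can beat this curve while still only querying the proxy reward model.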
I think the point still holds in mainline shard theory world, which in m...
That section is even more outdated now. There's nothing on interpretability, Paul's work now extends far beyond IDA, etc. In my opinion it should link to some other guide.
This seems good if it could be done. But the original proposal was just a call for labs to individually pause their research, which seems really unlikely to work.
Also, the level of civilizational competence required to compensate labs seems to be higher than for other solutions. I don't think it's a common regulatory practice to compensate existing labs like this, and it seems difficult to work out all the details so that labs will feel adequately compensated. Plus there might be labs that irrationally believe they're undervalued. Regulations similar to the nuclear or aviation industry feel like a more plausible way to get slowdown, and have the benefit that they actually incentivize safety work.
I'd be much happier with increasing the number of alignment researchers to 10-20% of the field of ML than with a 6 month unconditional pause, and my guess is it's less costly. It seems like leading labs allowing other labs to catch up by 6 months would reduce their valuations by more than 20%, whereas diverting 10-20% of their resources would reduce valuations by only 10% or so.
There are currently 300 alignment researchers. If we take additional researchers from the pool of 30k people who attended ICML, you get 3000 researchers, and if they're equal quality this is 10x particip...
If I've already done WMLB, what day should I start on? The WMLB curriculum on mechinterp wasn't very polished, and IOI and superposition were not covered. But doing part of the transformers week would mean getting material I've already learned on RL.
Fair point. Another difference is that the pause is popular! 66-69% are in favor of the pause, and 41% think AI would do more harm than good, vs. 9% who think it would do more good than harm.
I'm worried that "pause all AI development" is like the "defund the police" of the alignment community. I'm not convinced it's net bad because I haven't been following governance-- my current guess is neutral-- but I do see these similarities:
The obvious dis-analogy is that if the police had no funding and largely ceased to exist, a string of horrendous things would quickly occur. Murders and thefts and kidnappings and rapes and more would occur throughout every country in which it was occurring, people would revert to tight-knit groups who had weapons to defend themselves, a lot of basic infrastructure would probably break down (e.g. would Amazon be able to pivot to get their drivers armed guards?) and much more chaos would ensue.
And if AI research paused, society would continue to basically function as it has been doing so far.
One of them seems to me like a goal that directly causes catastrophes and a breakdown of society and the other doesn't.
What information? What spectrum? The color information received by the webcam is the total intensity of light when passed through a red filter, the total intensity when passed through a blue filter, and the total intensity when passed through a green filter, at each point. You do not know the frequency response of these filters (or that frequency of light is even a thing). I'm sure you could deduce something by playing around with relative intensities and chromatic aberration, but ultimately you cannot reconstruct a spectrum from three points.
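A toy numerical illustration of the point (made-up Gaussian filter curves, not real camera responses): two physically different spectra can produce exactly the same three filter readings, so three numbers can't pin down a spectrum.

```python
import numpy as np

# Toy "spectra": intensity at 31 wavelength samples (400-700 nm in 10 nm steps).
wavelengths = np.linspace(400, 700, 31)

def filter_curve(center_nm, width_nm=40):
    """Made-up Gaussian sensitivity curve standing in for an R/G/B filter."""
    return np.exp(-0.5 * ((wavelengths - center_nm) / width_nm) ** 2)

filters = np.stack([filter_curve(600), filter_curve(540), filter_curve(460)])  # 3 x 31

# The camera only reports filters @ spectrum: a linear map from 31 numbers to 3.
# Its null space is 28-dimensional, so we can perturb a spectrum along any
# null-space direction without changing what the camera sees.
spectrum_a = np.ones(31)
null_space = np.linalg.svd(filters)[2][3:]       # rows orthogonal to the filter rows
spectrum_b = spectrum_a + 0.3 * null_space[0]    # a genuinely different spectrum

print(np.allclose(filters @ spectrum_a, filters @ spectrum_b))  # True: metamers
```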
I don't think we disag...
Some thoughts:
I don't know how to engage with the first two comments. As for diffusion being slow, you need to argue that it's so slow as to be uncompetitive with replication times of biological life, and that no other mechanism for placing individual atoms / small molecules could achieve better speed and energy efficiency, e.g. this one.
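To put a rough number on the diffusion point (back-of-envelope, using a textbook diffusion coefficient of roughly $D \sim 10^{-9}\,\mathrm{m^2/s}$ for a small molecule in water): the characteristic time to diffuse across a cell-scale distance $L \sim 1\,\mu\mathrm{m}$ is about

$$ t \;\sim\; \frac{L^2}{6D} \;\approx\; \frac{(10^{-6}\,\mathrm{m})^2}{6 \times 10^{-9}\,\mathrm{m^2/s}} \;\approx\; 2 \times 10^{-4}\,\mathrm{s}, $$

which is many orders of magnitude shorter than biological replication times, so "diffusion is slow" at least needs a more quantitative argument.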
I don't have the expertise to evaluate the comment by Muireall, so I made a Manifold market.
I'm not sure how to evaluate this, so I made a Manifold market for it. I'd be excited for you to help me edit the market if you endorse slightly different wording.
https://manifold.markets/ThomasKwa/does-thermal-noise-make-drexlerian
Not an expert in chemistry or biochemistry, but this post seems to basically not engage with the feasibility studies Drexler has made in Nanosystems, and makes a bunch of assertions without justification, including where Nanosystems has counterarguments. I wish more commenters would engage on the object level because I really don't have the background to, and even I see a bunch of objections. Nevertheless I'll make an attempt. I encourage OP and others to correct me where I am ignorant of some established science.
Points 1, 2, 3, 4 are not relevant to Drexl...
I'm planning to write a post called "Heavy-tailed error implies hackable proxy". The idea is that when you care about V and are optimizing for a proxy X + V, Goodhart's Law sometimes implies that optimizing hard enough for X + V causes V to stop increasing.
A large part of the post would be proofs about what the distributions of X and V must be for optimizing X + V to stop increasing V, i.e. for $\mathbb{E}[V \mid X + V \geq t]$ to stay bounded (or even go to zero) as $t \to \infty$, where X and V are independent random variables with mean zero. It's clear that
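A quick Monte Carlo sketch of the intended phenomenon (my own illustrative choice of distributions, not the ones from the planned post): with a light-tailed error X, conditioning on a high proxy X + V keeps selecting for high V, but with a heavy-tailed X the extreme proxy values are mostly error and the conditional mean of V stalls.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000_000

v = rng.standard_normal(n)                 # true value V: mean zero, light-tailed
x_light = rng.standard_normal(n)           # light-tailed error
x_heavy = rng.standard_t(df=2, size=n)     # heavy-tailed error (Student t, 2 dof)

for name, x in [("light-tailed X", x_light), ("heavy-tailed X", x_heavy)]:
    proxy = x + v
    for t in [2, 4, 6]:
        sel = v[proxy >= t]                # condition on a high proxy value
        if len(sel):
            print(f"{name}: E[V | X+V >= {t}] ~ {sel.mean():.2f} ({len(sel)} samples)")
```

With both variables normal, the conditional mean of V grows roughly like t/2; with the heavy-tailed error it stops growing and heads back toward zero as t increases.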
Belrose et al. found that the tuned lens is generally superior to the logit lens. Would the results change if the tuned lens were used here? My guess is probably not, since in the paper there is little difference between the two techniques when applied to later layers, but maybe it's worth a try.
In future posts, we will describe a more complete categorisation of these situations and how they relate to the AI alignment problem.
Did this ever happen?
After talking to Eliezer, I now have a better sense of the generator of this list. It now seems pretty good and non-arbitrary, although there is still a large element of taste.
Suppose an agent has this altruistic empowerment objective, and the problem of getting an objective into the agent has been solved.
Wouldn't it be maximized by forcing the human in front of a box that encrypts its actions and uses the resulting stream to determine the fate of the universe? Then the human would be maximally "in control" of the universe but unlikely to create a universe that's good by human preferences.
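To spell out the objection, using the standard formalization of empowerment as the channel capacity from the human's actions to future states (my framing, not necessarily the OP's):

$$ \mathfrak{E} \;=\; \max_{p(a)} \; I\!\left(A_t ;\, S_{t+k}\right). $$

An encryption box that maps the human's action stream bijectively onto the fate of the universe makes this mutual information as large as the action space allows, since no action information is lost, even though the resulting universe has nothing to do with what the human actually wants.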
I think this reflects two problems:
I'm offering a $300 bounty to anyone that gets 100 karma doing this this year (without any vote manipulation).
Manifold market for this:
They also separately believe that by the time an AI reaches superintelligence, it will in fact have oriented itself around a particular goal and have something like a goal slot in its cognition - but at that point, it won't let us touch it, so the problem becomes that we can't put our own objective into it.
My guess is this is a bit stronger than what Nate believes. The corresponding quote (emphasis mine) is
...Separately and independently, I believe that by the time an AI has fully completed the transition to hard superintelligence, it will have ironed out a b
Even if it has some merits, I find the "death with dignity" thing an unhelpful, mathematically flawed, and potentially emotionally damaging way to relate to the problem. Even if MIRI has not given up, I wouldn't be surprised if the general attitude of despair has substantially harmed the quality of MIRI research. Since I started as a contractor for MIRI in September, I've deliberately tried to avoid absorbing this emotional frame, and rather tried to focus on doing my job, which should be about computer science research. We'll see if this causes me problems.
Here's how I think about it: Capable agents will be able to do consequentialist reasoning, but the shard-theory-inspired hypothesis is that running the consequences through your world-model is harder / less accessible / less likely than just letting your shards vote on it. If you've been specifically taught that chocolate is bad for dogs, maybe this is a bad example.
I also wasn't trying to think about whether shards are subagents; this came out of a discussion on finding the simplest possible shard theory hypotheses and applying them to gridworlds.
FWIW this was basically cached for me, and if I were better at writing and had explained this ~10 times before like I expect Eliezer has, I'd be able to do about as well. So would Nate Soares or Buck or Quintin Pope (just to pick people in 3 different areas of alignment), and Quintin would also have substantive disagreements.
What was the equation for research progress referenced in Ars Longa, Vita Brevis?
...“Then we will talk this over, though rightfully it should be an equation. The first term is the speed at which a student can absorb already-discovered architectural knowledge. The second term is the speed at which a master can discover new knowledge. The third term represents the degree to which one must already be on the frontier of knowledge to make new discoveries; at zero, everyone discovers equally regardless of what they already know; at one, one must have mastered every
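One guess at how to write that down (entirely my own formalization of the passage, with invented symbols): let $r_a$ be the absorption speed, $r_d$ the discovery speed, $\theta \in [0,1]$ the frontier requirement, $K$ the existing body of knowledge, and $T$ a working lifetime. Then lifetime discoveries are roughly

$$ \text{output} \;\approx\; r_d \left( T - \frac{\theta K}{r_a} \right), $$

which captures the passage's point: as $K$ grows, output falls toward zero unless absorption speeds up or the frontier requirement $\theta$ drops.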
We're definitely unlucky that, of the two challenges, this has been solved and AI strategy is unsolved.
There's a trivial sense in which the agent is optimizing the world and you can rationalize a utility function from that. But I think an agent that, from our perspective, basically just maximizes granite spheres can look quite different from the simple picture of an agent that always picks the top action according to some (not necessarily explicit) granite-sphere valuation of the actions, in ways such that the argument still goes through.
Here's one factor that might push against the value of Steinhardt's post as something to send to ML researchers: perhaps it is not arguing for anything controversial, and so is easier to defend convincingly. Steinhardt doesn't explicitly make any claim about the possibility of existential risk, and barely mentions alignment. Gates spends the entire talk on alignment and existential risk, and might avoid being too speculative because their talk is about a survey of basically the same ML researcher population as the audience, and so can engage with the most ...
+1 to this, I feel like an important question to ask is "how much did this change your mind?". I would probably swap the agree/disagree question for this?
I think the qualitative comments bear this out as well:
dislike of a focus on existential risks or an emphasis on fears, a desire to be “realistic” and not “speculative”
This seems like people like AGI Safety arguments that don't really cover AGI Safety concerns! I.e. the problem researchers have isn't so much with the presentation as with the content itself.
I agree with the following caveats:
I feel like FTX is a point against utilitarianism for the same reason Bentham is a point for utilitarianism. If taking an ethical system to its logical conclusions leads you to anticipate feminism, animal rights, etc., that's evidence that the algorithm creates good in practice. If it leads you to commit massive fraud, that's evidence against.
This also doesn't shift my meta-ethics much, so maybe I'm not one of the people you're talking about?
Hypothesis: much of this is explained by the simpler phenomenon of loss aversion. $1 to your ingroup is a gain, $1 to your outgroup is a loss and therefore mentally multiplied by ~2. The paper finds a factor of 3, so maybe there's something else going on too.
Not Nate or a military historian, but it seems pretty likely to me that an actor ~100 human-years more technologically advanced could get a decisive strategic advantage over the world.
I don't think we can take this as evidence that Yudkowsky or the average rationalist "underestimates more average people". In the Bankless podcast, Eliezer was not trying to explore the beliefs of the podcast hosts, just explaining his views. And there have been attempts at outreach before. If Bankless was evidence towards "t...