Shortform Content

An idea about instrumental convergence for non-equilibrium RL algorithms.

There definitely exist many instrumentally convergent subgoals in our universe, like controlling large amounts of wealth, social capital, energy, or matter. I claim the distribution of such states of the universe is heavy-tailed. If we simplify our universe to an MDP in which such subgoal-satisfying states are the states with high out-degree, then a reasonable model for such an MDP is to assume out-degrees are power-law distributed, and thus heavy-tailed.
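A minimal sketch of the kind of toy MDP described above, assuming a truncated discrete power law over out-degrees and uniformly random successor states. The function names, the exponent `alpha=2.5`, and the uniform choice of successors are illustrative assumptions, not anything specified in the post.

```python
import random


def sample_out_degrees(n_states, alpha=2.5, seed=0):
    """Sample one out-degree per state with P(k) proportional to k**(-alpha).

    The distribution is truncated at n_states - 1 so that every state
    can have that many distinct successors. alpha is an assumed exponent.
    """
    rng = random.Random(seed)
    ks = list(range(1, n_states))
    weights = [k ** -alpha for k in ks]
    return rng.choices(ks, weights=weights, k=n_states)


def build_transitions(out_degrees, seed=0):
    """Give each state a set of distinct successor states, chosen uniformly."""
    rng = random.Random(seed)
    n = len(out_degrees)
    return {s: rng.sample(range(n), k) for s, k in enumerate(out_degrees)}


degrees = sample_out_degrees(1000)
transitions = build_transitions(degrees)
median_degree = sorted(degrees)[len(degrees) // 2]
print("median out-degree:", median_degree, "max out-degree:", max(degrees))
```

Most states end up with very few exits while a handful have many, which is the heavy-tailed structure the argument relies on.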

If we have an asynchronous dynam... (read more)

Another way this could turn out: If incoming degree is anti-correlated with outgoing degree, the effect of power-seeking may be washed out by it being hard, so we should expect worse-than-optimal policies with maybe more, maybe less power-seeking than the optimal policy, depending on the particulars of the environment. The next question is: which particulars? Perhaps the extent of decorrelation; maybe varying the ratio of the two exponents is a better idea. Perhaps size becomes a factor. In sufficiently large environments, maybe figuring out how to access one... (read more)

One of the greatest tragedies of truth-seeking as a human is that the things we instinctively do when someone else is wrong are often the exact opposite of the thing that would actually convince the other person.

Hang on—is the 2022 review happening this year?

Something I wrote recently as part of a private conversation, which feels relevant enough to ongoing discussions to be worth posting publicly:

The way I think about it is something like: a "goal representation" is basically what you get when it's easier to state some compact specification on the outcome state, than it is to state an equivalent set of constraints on the intervening trajectories to that state.

In principle, this doesn't have to equate to "goals" in the intuitive, pretheoretic sense, but in practice my sense is that this happens largely when (a

... (read more)

NDAs sure do seem extremely costly.  My current sense is that it's almost never worth signing one, or binding oneself to confidentiality in any similar way, for anything except narrowly-scoped technical domains (such as capabilities research).

I have more examples, but unfortunately some of them I can't talk about. A few random things that come to mind:
* OpenPhil routinely requests that grantees not disclose that they've received an OpenPhil grant until OpenPhil publishes it themselves, which usually happens many months after the grant is disbursed.
* Nearly every instance that I know of where EA leadership refused to comment on anything publicly post-FTX due to advice from legal counsel.
* So many things about the Nonlinear situation.
* Coordination Forum requiring attendees agree to confidentiality re: attendance and content of any conversations with people who wanted to attend but not have their attendance known to the wider world, like SBF, and also people in the AI policy space.
That explains why the NDAs are costly. But if you don't sign one, you can't e.g. get the OpenPhil grant. So the examples don't explain how "it's almost never worth signing one".

Not all of these are NDAs; my understanding is that the OpenPhil request comes along with the news of the grant (and isn't a contract).  Really my original shortform should've been a broader point about confidentiality/secrecy norms, but...

More information about alleged manipulative behaviour of Sam Altman


Text from article (along with follow-up paragraphs):

Some members of the OpenAI board had found Altman an unnervingly slippery operator. For example, earlier this fall he’d confronted one member, Helen Toner, a director at the Center for Security and Emerging Technology, at Georgetown University, for co-writing a paper that seemingly criticized OpenAI for “stoking the flames of AI hype.” Toner had defended herself (though she later apologized to the board for not anticipating how the p

... (read more)
Eh, random people complain. Screenshots of text seem fine, especially in shortform. It honestly seems fine anywhere. I also really don't think that accessibility should matter much here: the number of people reading on a screenreader or using assistive technologies is quite small, if they browse LessWrong they will already be running into a bunch of problems, and there are pretty good OCR technologies around these days that can be integrated into those.
I have some idea about how much work it takes to maintain something like this, so this random person would like to take this opportunity to thank you for running LW for the last many years.

Thank you! :)

In software development / IT contexts, "security by obscurity" (that is, having the security of your platform rely on the architecture of that platform remaining secret) is considered a terrible idea. This is a result of a lot of people trying that approach, and it ending badly when they do.

But the thing that is a bad idea is quite specific - it is "having a system which relies on its implementation details remaining secret". It is not an injunction against defense in depth, and having the exact heuristics you use for fraud or data exfiltration detection ... (read more)

There are competing theories here.  Including secrecy of architecture and details in the security stack is pretty common, but so is publishing (or semi-publishing: making it company confidential, but talked about widely enough that it's not hard to find if someone wants to) mechanisms to get feedback and improvements.  The latter also makes the entire value chain safer, as other organizations can learn from your methods.

Nora talks sometimes about the alignment field using the term "black box" wrong. This seems unsupported: in my experience, most people in alignment use the term "black box" to describe how their methods treat the AI model, which seems reasonable, not to describe a fundamental state of the AI model itself.

When autism was low-status, all you could read was how autism is having a "male brain" and how most autists were males. The dominant paradigm was how autists lack the theory of mind... which nicely matched the stereotype of insensitive and inattentive men.

Now that Twitter culture made autism cool, suddenly there are lots of articles and videos about "overlooked autistic traits in women" (which to me often seem quite the same as the usual autistic traits in men). And the dominant paradigm is how autistic people are actually too sensitive and easily overwhel... (read more)

Operations research (such as queuing theory) can be viewed as a branch of complex systems theory that has a track record of paying rent by yielding practical results in supply chain and military logistics management.

Say more please.


It feels to me like lots of alignment folk ~only make negative updates. For example, "Bing Chat is evidence of misalignment", but also "ChatGPT is not evidence of alignment." (I don't know that there is in fact a single person who believes both, but my straw-models of a few people believe both.)


A lot of the people around me (e.g. who I speak to ~weekly) seem to be sensitive to both new news and new insights, adapting both their priorities and their level of optimism[1]. I think you're right about some people. I don't know what 'lots of alignment folk' means, and I've not considered the topic of other-people's-update-rates-and-biases much.

For me, most changes route via governance.

I have made mainly very positive updates on governance in the last ~year, in part from public things and in part from private interactions.

I've also made negative (evid... (read more)

(Updating a bit because of these responses -- thanks, everyone, for responding! I still believe the first sentence, albeit a tad less strongly.)
I did not update towards misalignment at all on bing chat. I also do not think chatgpt is (strong) evidence of alignment. I generally think anyone who already takes alignment as a serious concern at all should not update on bing chat, except perhaps in the department of "do things like bing chat, which do not actually provide evidence for misalignment, cause shifts in public opinion?"

I might start a newsletter on the economics of individual small businesses. Does anyone know anyone who owns or manages e.g. a restaurant, or a cafe, or a law firm, or a bookstore, or literally any kind of small business? Would love intros to such people so that I can ask them a bunch of questions about e.g. their main sources of revenue and costs or how they make pricing decisions.

So I just posted what I think is a pretty cool post which so far hasn't gotten a lot of attention and which I'd like to get more attention[1]: Using Prediction Platforms to Select Quantified Self Experiments.

TL;DR: I've set up 14 markets on Manifold for the outcomes of experiments I could run on myself. I will select the best market and a random market and actually run the experiments. Please predict the markets.

Why I think this is cool: This is one step in the direction of more futarchy, and there is a pretty continuous path from doing this to more and more involve... (read more)


Deceptive alignment seems to only be supported by flimsy arguments. I recently realized that I don't have good reason to believe that continuing to scale up LLMs will lead to inner consequentialist cognition to pursue a goal which is roughly consistent across situations. That is: a model which not only does what you ask it to do (including coming up with agentic plans), but also thinks about how to make more paperclips even while you're just asking about math homework.

Aside: This was kinda a "holy shit" moment, and I'll try to do it justice here. I e... (read more)


LLMs will soon scale beyond the available natural text data, and generation of synthetic data is some sort of change of architecture, potentially a completely different source of capabilities. So scaling LLMs without change of architecture much further is an expectation about something counterfactual. It makes sense as a matter of theory, but it's not relevant for forecasting.

To what extent do you worry about the training methods used for ChatGPT, and why?
I think a lot of this probably comes back to way overestimating the complexity of human values. A very deeply held belief of a lot of LWers is that human values are intractably complicated and gene/societal-specific, and if this were the case, the argument would actually be a little concerning, as we'd have to rely on massive speed biases to punish deception. These posts gave me good intuition for why human value is likely to be quite simple. One of them talks about how most of the complexity of the values is inaccessible to the genome, so it needs to start from far less complexity than people realize, because nearly all of it needs to be learned. Some other posts from Steven Byrnes are relevant; they talk about how simple the brain is. A potential difference between me and Steven Byrnes is that, in my view, the same process of learning-from-scratch algorithms that generates capabilities also applies to values, and thus the complexity of value is upper-bounded by the complexity of learning-from-scratch algorithms plus genetic priors, both of which are likely very low: at the very least not billions of lines complex, and closer to thousands of lines or hundreds of bits. The reason this matters is that we no longer have good reason to assume that the deceptive model is so favored on priors like Evan Hubinger says here, as the complexity is likely massively lower than LWers assume. Putting it another way, the deceptive and aligned models have very similar complexities, and the relative difficulty is very low, so much so that the aligned model might be outright lower complexity. Even if that fails, the desired goal has a complexity very similar to that of the undesired goal, so the relative difficulty of actual alignment compared to deceptive alignment is quite low.

That's planning for failure, Morty. Even dumber than regular planning.

- Rick Sanchez on Mortyjitsu (S02E05 of Rick and Morty)


Here's a two-sentence argument for misalignment that I think is both highly compelling to laypeople and technically accurate at capturing the key issue:

When we train AI systems to be nice, we're giving a bunch of random programs a niceness exam and selecting the programs that score well. If you gave a bunch of humans a niceness exam where the humans would be rewarded for scoring well, do you think you would get actually nice humans?

I think the answer is an obvious yes, all other things held equal. Of course, what happens in reality is more complex than this, but I'd still say yes in most cases, primarily because I think that aligned behavior is very simple, so simple that it either only barely loses out to the deceptive model or outright has the advantage, depending on the programming language and low level details, and thus we only need to transfer 300-1000 bits maximum, which is likely very easy.

Much more generally, my fundamental claim is that the complexity of pointing to human values is very similar to the complexity of pointing to the set of all long-term objectives. It can be easier or harder, but I don't buy the assumption that pointing to human values is way harder than pointing to the set of long-term goals.

To me it seems a solid attempt at conveying [misalignment is possible, even with a good test], but not necessarily [misalignment is likely, even with a good test]. (not that I have a great alternative suggestion)
Important disanalogies seem:
1) Most humans aren't good at convincingly faking niceness (I think!). The listener may assume a test good enough to successfully exploit this most of the time.
2) The listener will assume that [score highly on niceness] isn't the human's only reward. (both things like [desire to feel honest] and [worry of the consequences of being caught cheating])
3) A fairly large proportion of humans are nice (I think!).
The second could be addressed somewhat by raising the stakes. The first seems hard to remedy within this analogy. I'd be a little concerned that people initially buy it, then think for themselves and conclude "But if we design a really clever niceness test, then it'd almost always work - all we need is clever people to work for a while on some good tests". Combined with (3), this might seem like a decent solution.
Overall, I think what's missing is that we'd expect [our clever test looks to us as if it works] well before [our clever test actually works]. My guess is that the layperson isn't going to have this intuition in the human-niceness-test case.

Theory for a capabilities advance that is going to occur soon:

OpenAI is currently getting lots of novel triplets (S, U, A), where S is a system prompt, U is a user prompt, and A is an assistant answer.

Given a bunch of such triplets (S, U_1, A_1), ... (S, U_n, A_n), it seems like they could probably create a model P(S|U_1, A_1, ..., U_n, A_n), which could essentially "generate/distill prompts from examples".
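One way to picture the data such a model P(S|U_1, A_1, ..., U_n, A_n) would train on is sketched below: group the triplets by system prompt and turn each group into a (conversation examples, system prompt) pair. The function name, the `min_examples` threshold, and the `User:`/`Assistant:` text layout are hypothetical choices for illustration; nothing here is known to reflect how OpenAI handles its data.

```python
from collections import defaultdict


def build_distillation_examples(triplets, min_examples=2):
    """Turn (S, U, A) triplets into training pairs for predicting S from (U, A) examples.

    triplets: iterable of (system_prompt, user_prompt, assistant_answer) strings.
    Returns a list of dicts with an "input" (the concatenated examples)
    and a "target" (the shared system prompt).
    """
    by_system = defaultdict(list)
    for system, user, answer in triplets:
        by_system[system].append((user, answer))

    examples = []
    for system, pairs in by_system.items():
        if len(pairs) < min_examples:
            continue  # too few examples to infer a prompt from
        context = "\n\n".join(f"User: {u}\nAssistant: {a}" for u, a in pairs)
        examples.append({"input": context, "target": system})
    return examples


data = [
    ("Answer in French.", "Hello", "Bonjour"),
    ("Answer in French.", "Thanks", "Merci"),
    ("Be terse.", "How are you?", "Fine."),
]
examples = build_distillation_examples(data)
```

A sequence model trained on such pairs would be conditioned on `input` and asked to produce `target`, which is the "generate/distill prompts from examples" direction described above.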

This seems like the first step towards efficiently integrating information from lots of places. (Well, they could ofc also do standard SGD-based gradien... (read more)

Actually I suppose they don't even need to add perturbations to A directly, they can just add perturbations to S and generate A's from S'. Or probably even look at user's histories to find direct perturbations to either S or A.

Just had a conversation with a guy where he claimed that the main thing that separates him from EAs was that his failure mode is us not conquering the universe. He said that, while doomers were fundamentally OK with us staying chained to Earth and never expanding to make a nice intergalactic civilization, he, an AI developer, was concerned about the astronomical loss (not his term) of not seeding the galaxy with our descendants. This P(utopia) for him trumped all other relevant expected value considerations.

What went wrong?

Among other things I suppose they're not super up on that to efficiently colonise the universe [...] watch dry paint stay dry.

Ethicophysics for Skeptics

Or, what the fuck am I talking about?

In this post, I will try to lay out my theories of computational ethics in as simple, skeptic-friendly, non-pompous language as I am able to do. Hopefully this will be sufficient to help skeptical readers engage with my work.

The ethicophysics is a set of computable algorithms that suggest (but do not require) specific decisions in response to ethical decisions in a multi-player reinforcement learning problem.

The design goal that the various equations need to satisfy is that they should select a... (read more)
