One of the greatest tragedies of truth-seeking as a human is that the things we instinctively do when someone else is wrong are often the exact opposite of the thing that would actually convince the other person.
Something I wrote recently as part of a private conversation, which feels relevant enough to ongoing discussions to be worth posting publicly:
...The way I think about it is something like: a "goal representation" is basically what you get when it's easier to state some compact specification on the outcome state, than it is to state an equivalent set of constraints on the intervening trajectories to that state.
In principle, this doesn't have to equate to "goals" in the intuitive, pretheoretic sense, but in practice my sense is that this happens largely when (a
NDAs sure do seem extremely costly. My current sense is that it's almost never worth signing one, or binding oneself to confidentiality in any similar way, for anything except narrowly-scoped technical domains (such as capabilities research).
Not all of these are NDAs; my understanding is that the OpenPhil request comes along with the news of the grant (and isn't a contract). Really my original shortform should've been a broader point about confidentiality/secrecy norms, but...
More information about alleged manipulative behaviour of Sam Altman
Text from article (along with follow-up paragraphs):
...Some members of the OpenAI board had found Altman an unnervingly slippery operator. For example, earlier this fall he’d confronted one member, Helen Toner, a director at the Center for Security and Emerging Technology, at Georgetown University, for co-writing a paper that seemingly criticized OpenAI for “stoking the flames of AI hype.” Toner had defended herself (though she later apologized to the board for not anticipating how the p
In software development / IT contexts, "security by obscurity" (that is, having the security of your platform rely on the architecture of that platform remaining secret) is considered a terrible idea. This is a result of a lot of people trying that approach, and it ending badly when they do.
But the thing that is a bad idea is quite specific - it is "having a system which relies on its implementation details remaining secret". It I'd not an injunction against defense in depth, and having the exact heuristics you use for fraud or data exfiltration detection ...
There are competing theories here. Including secrecy of architecture and details in the security stack is pretty common, but so is publishing (or semi-publishing: making it company confidential, but talked about widely enough that it's not hard to find if someone wants to) mechanisms to get feedback and improvements. The latter also makes the entire value chain safer, as other organizations can learn from your methods.
Nora talks sometimes about the alignment field using the term black box wrong. This seems unsupported, from my experience, most in alignment use the term “black box” to describe how their methods treat the AI model, which seems reasonable. Not a fundamental state of the AI model itself.
When autism was low-status, all you could read was how autism is having a "male brain" and how most autists were males. The dominant paradigm was how autists lack the theory of mind... which nicely matched the stereotype of insensitive and inattentive men.
Now that Twitter culture made autism cool, suddenly there are lots of articles and videos about "overlooked autistic traits in women" (which to me often seem quite the same as the usual autistic traits in men). And the dominant paradigm is how autistic people are actually too sensitive and easily overwhel...
Operations research (such as queuing theory) can be viewed as a branch of complex systems theory that has a track record of paying rent by yielding practical results in supply chain and military logistics management
It feels to me like lots of alignment folk ~only make negative updates. For example, "Bing Chat is evidence of misalignment", but also "ChatGPT is not evidence of alignment." (I don't know that there is in fact a single person who believes both, but my straw-models of a few people believe both.)
A lot of the people around me (e.g. who I speak to ~weekly) seem to be sensitive to both new news and new insights, adapting both their priorities and their level of optimism[1]. I think you're right about some people. I don't know what 'lots of alignment folk' means, and I've not considered the topic of other-people's-update-rates-and-biases much.
For me, most changes route via governance.
I have made mainly very positive updates on governance in the last ~year, in part from public things and in part from private interactions.
I've also made negative (evid...
I might start a newsletter on the economics of individual small businesses. Does anyone know anyone who owns or manages e.g. a restaurant, or a cafe, or a law firm, or a bookstore, or literally any kind of small business? Would love intros to such people so that I can ask them a bunch of questions about e.g. their main sources of revenue and costs or how they make pricing decisions.
So I just posted what I think is a pretty cool post which so far hasn't gotten a lot of attention and which I'd like to get more attention[1]: Using Prediction Platforms to Select Quantified Self Experiments.
TL;DR: I've set up 14 markets on Manifold for the outcomes experiments I could run on myself. I will select the best and a random market and actually run the experiments. Please predict the markets.
Why I think this is cool: This is one step in the direction of more futarchy, and there is a pretty continuous path from doing this to more and more involve...
Deceptive alignment seems to only be supported by flimsy arguments. I recently realized that I don't have good reason to believe that continuing to scale up LLMs will lead to inner consequentialist cognition to pursue a goal which is roughly consistent across situations. That is: a model which not only does what you ask it to do (including coming up with agentic plans), but also thinks about how to make more paperclips even while you're just asking about math homework.
Aside: This was kinda a "holy shit" moment, and I'll try to do it justice here. I e...
LLMs will soon scale beyond the available natural text data, and generation of synthetic data is some sort of change of architecture, potentially a completely different source of capabilities. So scaling LLMs without change of architecture much further is an expectation about something counterfactual. It makes sense as a matter of theory, but it's not relevant for forecasting.
That's planning for failure, Morty. Even dumber than regular planning.
- Rick Sanchez on Mortyjitsu (S02E05 of Rick and Morty)
Here's a two-sentence argument for misalignment that I think is both highly compelling to laypeople and technically accurate at capturing the key issue:
When we train AI systems to be nice, we're giving a bunch of random programs a niceness exam and selecting the programs that score well. If you gave a bunch of humans a niceness exam where the humans would be rewarded for scoring well, do you think you would get actually nice humans?
Much more generally, my fundamental claim is that the complexity of pointing to human values is very similar to the set of all long-term objectives, and can be easier or harder, but I don't buy the assumption that pointing to human values is way harder than pointing to the set of long-term goals.
Theory for a capabilities advance that is going to occur soon:
OpenAI is currently getting lots of novel triplets (S, U, A), where S is a system prompt, U is a user prompt, and A is an assistant answer.
Given a bunch of such triplets (S, U_1, A_1), ... (S, U_n, A_n), it seems like they could probably create a model P(S|U_1, A_1, ..., U_n, A_n), which could essentially "generate/distill prompts from examples".
This seems like the first step towards efficiently integrating information from lots of places. (Well, they could ofc also do standard SGD-based gradien...
Actually I suppose they don't even need to add perturbations to A directly, they can just add perturbations to S and generate A's from S'. Or probably even look at user's histories to find direct perturbations to either S or A.
Just had a conversation with a guy where he claimed that the main thing that separates him from EAs was that his failure mode is us not conquering the universe. He said that, while doomers were fundamentally OK with us staying chained to Earth and never expanding to make a nice intergalactic civilization, he, an AI developer, was concerned about the astronomical loss (not his term) of not seeding the galaxy with our descendants. This P(utopia) for him trumped all other relevant expected value considerations.
What went wrong?
Among other things I suppose they're not super up on that to efficiently colonise the universe [...] watch dry paint stay dry.
In this post, I will try to lay out my theories of computational ethics in as simple, skeptic-friendly, non-pompous language as I am able to do. Hopefully this will be sufficient to help skeptical readers engage with my work.
The ethicophysics is a set of computable algorithms that suggest (but do not require) specific decisions in response to ethical decisions in a multi-player reinforcement learning problem.
The design goal that the various equations need to satisfy is that they should select a...
An idea about instrumental convergence for non-equilibrium RL algorithms.
There definitely exist many instrumentally convergent subgoals in our universe, like controlling large amounts of wealth, social capital, energy, or matter. I claim such states of the universe are heavy-tailed. If we simplify our universe as a simple MDP for which such subgoal-satisfying states are states which have high exiting degree, then a reasonable model for such an MDP is to assume exiting degrees are power-law distributed, and thus heavy tailed.
If we have an asynchronous dynam... (read more)
Another way this could turn out: If incoming degree is anti-correlated with outgoing degree, the effect of power-seeking may be washed out by it being hard, so we should expect worse than optimal policies with maybe more, maybe less powerseekyness as the optimal policy. Depending on the particulars of the environment. The next question is what particulars? Perhaps the extent of decorrelation, maybe varying the ratio of the two exponents is a better idea. Perhaps size becomes a factor. In sufficiently large environments, maybe figuring out how to access one... (read more)