I don't think the lack of machine translations of AI alignment materials is holding the field back in Japan. Japanese people have an unlimited amount of that already available. "Doing it for them", when their web browser already has the feature built in, seems honestly counterproductive, as it signals how little you're willing to invest in the space.

I think it's possible that increasing the amount of human-translated material could make a difference. (Whether machines are good enough to aid such humans is a question I leave to the professional translators.)

Yes, I would really appreciate that. I find this approach compelling in the abstract, but what does it actually cash out in?

My best guess is that it means lots of mechanistic interpretability research, identifying subsystems of LLMs (or similar) and trying to explain them, until eventually they're made of less and less Magic. That sounds good to me! But what directions sound promising there? E.g. the only result in this area I've done a deep dive on, "Transformers learn in-context by gradient descent", is pretty limited, as it only gets a clear match for linear (!) single-layer (!!) regression models, not anything like an LLM. How much progress does Conjecture expect to really make? What are other papers our study group should read?

Relatedly, CoEms could be run at potentially high speed-ups, and many copies or variations could be run together. So we could end up in the classic scenario of a smarter-than-average "civilization", with "thousands of years" to plan, that wants to break out of the box.

This still seems less existentially risky, though, if we end up in a world where the CoEms retain something approximating human values. They might want to break out of the box, but probably wouldn't want to commit species-cide on humans.

These are the most compelling-to-me quotes from "Simulators", saved for posterity.


Perhaps it shouldn’t be surprising if the form of the first visitation from mindspace mostly escaped a few years of theory conducted in absence of its object.


…when AI is all of a sudden writing viral blog posts, coding competitively, proving theorems, and passing the Turing test so hard that the interrogator sacrifices their career at Google to advocate for its personhood, a process is clearly underway whose limit we’d be foolish not to contemplate.


GPT-3 does not look much like an agent. It does not seem to have goals or preferences beyond completing text, for example. It is more like a chameleon that can take the shape of many different agents. Or perhaps it is an engine that can be used under the hood to drive many agents. But it is then perhaps these systems that we should assess for agency, consciousness, and so on.


It would not be very dignified of us to gloss over the sudden arrival of artificial agents often indistinguishable from human intelligence just because the policy that generates them “only cares about predicting the next word”.


There is a clear sense in which the network doesn’t “want” what the things that it simulates want, seeing as it would be just as willing to simulate an agent with opposite goals, or throw up obstacles which foil a character’s intentions for the sake of the story. The more you think about it, the more fluid and intractable it all becomes. Fictional characters act agentically, but they’re at least implicitly puppeteered by a virtual author who has orthogonal intentions of their own.


Everything can be trivially modeled as a utility maximizer, but for these reasons, a utility function is not a good explanation or compression of GPT’s training data, and its optimal predictor is not well-described as a utility maximizer. However, just because information isn’t compressed well by a utility function doesn’t mean it can’t be compressed another way.


Gwern has said that anyone who uses GPT for long enough begins to think of it as an agent who only cares about roleplaying a lot of roles.

That framing seems unnatural to me, comparable to thinking of physics as an agent who only cares about evolving the universe accurately according to the laws of physics.


That said, GPT does store a vast amount of knowledge, and its corrigibility allows it to be cajoled into acting as an oracle, like it can be cajoled into acting like an agent.


Treating GPT as an unsupervised implementation of a supervised learner leads to systematic underestimation of capabilities, which becomes a more dangerous mistake as unprobed capabilities scale.


Benchmarks derived from supervised learning test GPT’s ability to produce correct answers, not to produce questions which cause it to produce a correct answer down the line. But GPT is capable of the latter, and that is how it is the most powerful.


Natural language has the property of systematicity: “blocks”, such as words, can be combined to form composite meanings. The number of meanings expressible is a combinatorial function of available blocks. A system which learns natural language is incentivized to learn systematicity; if it succeeds, it gains access to the combinatorial proliferation of meanings that can be expressed in natural language. What GPT lets us do is use natural language to specify any of a functional infinity of configurations, e.g. the mental contents of a person and the physical contents of the room around them, and animate that. That is the terrifying vision of the limit of prediction that struck me when I first saw GPT-3’s outputs.


Autonomous text-processes propagated by GPT, like automata which evolve according to physics in the real world, have diverse values, simultaneously evolve alongside other agents and non-agentic environments, and are sometimes terminated by the disinterested “physics” which governs them.


Calling GPT a simulator gets across that in order to do anything, it has to simulate something, necessarily contingent, and that the thing to do with GPT is to simulate! Most published research about large language models has focused on single-step or few-step inference on closed-ended tasks, rather than processes which evolve through time, which is understandable as it’s harder to get quantitative results in the latter mode. But I think GPT’s ability to simulate text automata is the source of its most surprising and pivotal implications for paths to superintelligence: for how AI capabilities are likely to unfold and for the design-space we can conceive.

"Do stuff that seems legibly valuable" becomes the main currency, rather than "do stuff that is actually valuable."


In my experience, these are aligned quite often, and a good organization/team/manager's job is keeping them aligned. This involves lots of culture-building, vigilance around goodharting, and recurring check-ins and reevaluations to make sure the layer under you is properly aligned. Some of the most effective things I've noticed are rituals like OKR planning and wide-review councils, and having a performance evaluation culture that tries to uphold objective standards.

And of course, the level after this you describe, where people are basically pretending to work, is something effective organizations try to weed out ruthlessly. I'm sure it exists, but competitive pressures are against it persisting for long.

My direct experience is as an increasingly senior software engineer at Google. It leads me to be much more optimistic than this article about corporations' abilities to be effective. I truly believe that things have been set up such that what gets me and my teammates good ratings or promotions is actually valuable.

Oh, I didn't realize it was such a huge difference! Almost half of the sequences omitted. Wow. I guess I can write a quick program to diff and thus answer my original question.

Was there any discussion about how and why R:A-Z made the selections it did?

Do you have any recommendations for particularly good sequences or posts that R:A-Z omitted, from the ~430 that I've apparently missed?
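For what it's worth, the quick diff mentioned above could look something like this. It's only a minimal sketch, assuming both tables of contents are already available as plain lists of post titles; the function name `missing_posts` and the placeholder titles are hypothetical, not taken from either collection.

```python
def missing_posts(full_titles, selected_titles):
    """Return the titles in full_titles that do not appear in selected_titles.

    Titles are compared case-insensitively with whitespace collapsed,
    and the original order of full_titles is preserved.
    """
    def normalize(title):
        # Lowercase and collapse runs of whitespace so trivial
        # formatting differences don't show up as omissions.
        return " ".join(title.lower().split())

    selected = {normalize(t) for t in selected_titles}
    return [t for t in full_titles if normalize(t) not in selected]


if __name__ == "__main__":
    # Placeholder titles for illustration only (assumption, not real data).
    original = ["Post A", "Post B", "Post C"]
    book = ["post a", "Post  C"]
    print(missing_posts(original, book))
```

Exact title matching will miss posts that were renamed or merged, so the result is best treated as a first pass to check by hand.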

This seems like a simulator in the same way the human imagination is a simulator. I could mentally simulate a few chess moves after the ones you prompted. After a while (probably a short while) I'd start losing track of things and start making bad moves. Eventually I'd probably make illegal moves, or maybe just write random move-like character strings if I was given some motivation for doing so and thought I could get away with it.