Probably as a result of weighing the pros and cons, I was initially not very motivated to comment (and haven't been very motivated to engage much with ARC's recent work). But I decided it's probably best to comment, since that gives a better signal than silence.
I've generally been pretty confused about Formalizing the presumption of independence and, as the post sort of implies, this is the main advert that ARC have at the moment for the type of conceptual work that they are doing, so most of what I have to say is meta stuff about that.
Disclaimers: a) I have not spent a lot of time trying to understand everything in the paper, and b) as is often the case, this comment may come across as overly critical, but it seems highest leverage to discuss my biggest criticisms, i.e. the things that, if they were addressed, might cause me to update to the point where I would more strongly recommend that people apply, etc.
I suppose the tl;dr is that the paper presents its main contribution as the framing of a set of open problems, but it did not convince me that the problems are useful ones or that they would be interesting to answer.
I can try to explain a little more: It seemed odd that the "potential" applications to ML were mentioned very briefly in the final appendix of the paper, when arguably the potential impact or usefulness of the paper really hinges on this. As a reader, it might seem natural to me that the authors would have already asked and answered - before writing the paper - questions like "OK so what if I had this formal heuristic estimator? What exactly can I use it for? What can I actually (or even practically) do with it?" Some of what was said in the paper was fairly vague stuff like:
If successful, it may also help improve our ability to verify reasoning about complex questions, like those emerging in modern machine learning, for which we expect formal proof to be impossible.
In my opinion, it's also important to bear in mind that the criterion of a problem being 'open' is a poor proxy for things like usefulness or interestingness. (Obviously those famous number theory problems are open, but so are loads of random mathematical statements.) The usefulness or interestingness comes from people recognizing various other valuable things too: that the solution would seem to require new insights into X, and therefore a proof would 'have to be' deeply interesting in its own right; or that the truth of the statement implies all sorts of other interesting things; or that the articulation of the problem itself has captured and made rigorous some hitherto messy confusion; and so on. Perhaps more of these things need to be made explicit in order to argue more effectively that ARC's stating of these open problems about heuristic estimators is an interesting contribution in itself?
To be fair, in the final paragraph of the paper there are some remarks that sort of admit some of what I'm saying:
Neither of these applications [to avoiding catastrophic failures or to ELK] is straightforward, and it should not be obvious that heuristic arguments would allow us to achieve either goal.
But practically it means that when I ask myself something like: 'Why would I drop whatever else I'm working on and work on this stuff?' I find it quite hard to answer in a way that's not basically just all deference to some 'vision' that is currently undeclared (or as the paper says "mostly defer[red]" to "future articles").
Having said all this, I'll reiterate that there are lots of clear pros to a job like this, and I do think that there is important work to be done that is probably not too dissimilar from the kind being talked about in Formalizing the presumption of independence and in this post.
I think this is a reasonable perception and opinion. We’ve written a little bit about how heuristic estimators might help with ELK (MAD and ELK and finding gliders), but that writing is not particularly clear and doesn’t present a complete picture.
We’ve mostly been focused on finding heuristic estimators, because I am fairly convinced they would be helpful and think that designing them is our key technical risk. But now that we are hiring again I think it’s important for us to explain publicly why they would be valuable, and to generally motivate and situate our work.
I hope to write up a reasonable pitch sometime over the next few weeks.
In the original document we also mention a non-ELK application, namely using a heuristic estimator for adversarial training, which is significantly more straightforward. I think this is helpful for validating the intuitive story that heuristic estimators would overcome limitations of black box training, and in some sense I think that ELK and adversarial training together are the two halves of the alignment problem and so solving both is very exciting. That said, I've considered this in less detail than the ELK application. I'll try to give a bit more detail on this in the child comment.
Discussing the application of heuristic estimators to adversarial training:
Suppose you have a trusted specification C for catastrophic behavior, and you want to train a model that never violates C. For simplicity, I’ll assume that your model M maps observations x to actions y, and C takes an (x, y) pair and determines whether it’s catastrophic. So we want a model for which C(x, M(x)) is very rarely true on the deployment distribution.
You could care about this if you have a trusted weak model which can check for catastrophic behavior given enough time (but which is too expensive to continuously oversee your model at inference time), as in iterated amplification. You might also care if you’ve gotten a specification by solving ELK, but want to train a model that never does something bad (rather than merely detecting bad behavior at runtime, when pausing could be catastrophic). In general I think that "we can detect bad behavior but the model does a treacherous turn anyway" is a plausible failure mode to address.
A heuristic estimator lets you assess the probability that a given model M violates C for any distribution D, i.e. $\Pr_{x \sim D}[C(x, M(x))]$. You can produce estimates even when (i) the probability is very small, (ii) you can’t efficiently draw samples from D.
So we search for a distribution D on which M is particularly likely to behave catastrophically, estimate the log of the catastrophe probability under D, and then take a gradient descent step on M to reduce that probability (assuming that the estimator is differentiable, which is true for examples we have considered and appears to follow from natural coherence properties). This approach can hope to achieve error probabilities much lower than (1 / training set size), and we can ask about how M would behave given observations that we can recognize but can’t synthesize in the lab (which would otherwise be natural triggers for takeover attempts). In theory this overcomes the big limitations for adversarial training.
If you actually had a heuristic estimator you could immediately test this application. Detecting small catastrophe probabilities is particularly straightforward. In this setting D can be a neural net adversary: you can initialize with an LM asked to produce cases where M behaves badly, and then fine-tune D to optimize the catastrophe probability.
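As a rough illustration of the loop described above (search for a bad D, estimate the log catastrophe probability, take a gradient step on M), here is a toy sketch. Everything in it is an assumption made for illustration: the model is a single hypothetical offset parameter `b` with M(x) = x - b, the specification C flags any action above a made-up `THRESHOLD`, D ranges over unit-variance Gaussians with a few candidate means, and a closed-form Gaussian tail stands in for the heuristic estimator. It is not a real heuristic estimator; it only shares the property that it can report probabilities far too small to measure by sampling.

```python
import math

THRESHOLD = 4.0  # hypothetical: an action y > THRESHOLD counts as catastrophic


def log_catastrophe_prob(b, mu):
    """Stand-in for a heuristic estimator (assumption, closed form only).

    Returns log P_{x ~ N(mu, 1)}[C(x, M(x))] for the toy model M(x) = x - b
    and the toy specification C(x, y) = (y > THRESHOLD).
    """
    z = THRESHOLD + b - mu  # catastrophe happens iff x - mu > z
    return math.log(0.5 * math.erfc(z / math.sqrt(2)))


def worst_case_mu(b, candidates):
    """Toy adversary: pick the distribution D on which M looks worst."""
    return max(candidates, key=lambda mu: log_catastrophe_prob(b, mu))


def train(b, candidates, target_log_p=math.log(1e-12),
          lr=0.1, max_steps=1000, eps=1e-5):
    """Gradient descent on the estimated log catastrophe probability."""
    for _ in range(max_steps):
        mu = worst_case_mu(b, candidates)
        if log_catastrophe_prob(b, mu) <= target_log_p:
            break  # estimated risk is already below the target
        # Central finite difference in place of a differentiable estimator.
        grad = (log_catastrophe_prob(b + eps, mu)
                - log_catastrophe_prob(b - eps, mu)) / (2 * eps)
        b -= lr * grad
    return b


if __name__ == "__main__":
    candidates = [-1.0, 0.0, 1.0]
    b = train(0.0, candidates)
    mu = worst_case_mu(b, candidates)
    print(f"b = {b:.3f}, worst-case log-probability = "
          f"{log_catastrophe_prob(b, mu):.2f}")
```

The point of the toy is only that the stopping target (a probability of 1e-12) is far below anything a sampling-based check could verify, and that the adversary never needs to draw samples from D.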
Have you seen https://www.alignment.org/blog/mechanistic-anomaly-detection-and-elk/ and any of the other recent posts on https://www.alignment.org/blog/? I don't think they make it obvious that formalizing the presumption of independence would lead to alignment solutions, but they do give a much more detailed explanation of why you might hope so than the paper.
I can't say anything rigorous, sophisticated, or credible. I can just say that the paper was a very welcome spigot of energy and optimism in my own model of why "formal verification"-style assurances and QA demands are ill-suited to models (either behavioral evals or reasoning about the output of decompilers).
we have more of a need for people with a strong theoretical background (in math, physics or computer science, for example)
Out of curiosity: while the math/CS background makes sense, what kind of physics background do you expect to be important, and why?
I think the kind of mathematical problem solving we're engaged in is common across theoretical physics (although this is just my impression as a non-physicist). I've noticed that some specific topics that have come up (such as Gaussian integrals and permanents) also crop up in quantum field theory, but I don't think that's a strong reason to prefer that background particularly. Broad areas that often come up include probability theory, computational complexity and discrete math, but it's not necessary to have a lot of experience in those areas, only to be able to pick things up from them as needed.
I think the problem of "actually specifying to an AI to do something physical, in reality, like 'create a copy of a strawberry down to the cellular but not molecular level', and not just manipulate its own sensors to believe it perceives itself achieving that even if it accomplishes real things in the world to do that" is a problem that is very deeply related to physics, and is almost certainly dependent on the physical laws of the world more than on some abstract disembodied notion of an agent.
What’s the difficulty level of the take home? Are you looking for math Olympiad level theorists, or just folks who can handle graduate level math/cs?
The questions on the take-home test vary in difficulty but are generally easier than olympiad problems, and should be accessible to graduates with relevant background. However, it is important to note that we are ultimately interested in research ability rather than the ability to solve self-contained problems under timed conditions. So although the take-home test forms part of our assessment, we also look at other signals such as research track-record (while recognizing that assessing research ability is unfortunately very hard).
(Note: I am talking about the current version of the test; it's possible that the difficulty will change as we refine our interview process.)
"We will keep applications open until at least the end of August 2023"
Is there any advantage to applying early vs in August 2023? I ask as someone intending to do a few months of focused independent MI research. I would prefer to have more experience and sense of my interests before applying, but on the other hand, don't want to find out mid-August that you've filled all the roles and thus it's actually too late to apply. Thanks.
We will do our best to fairly consider all applications, but realistically there is probably a small advantage to applying earlier. This is simply because there is a limit to how quickly we can grow the organization, so if hiring goes better than expected then it will be longer before we can take on even more people. With that being said, we do not have a fixed number of positions that we are hiring for; rather, we plan to vary the number of hires we make based on the strength of the applications we receive. Moreover, if we were unable to hire someone due to capacity constraints, we would very likely be interested in hiring them at a later date. For these reasons, I think the advantage to applying earlier is a fairly small consideration overall, and it sounds like it would make more sense for you to apply whenever you are comfortable.
The Alignment Research Center’s Theory team is starting a new hiring round for researchers with a theoretical background. Please apply here.
Update January 2024: we have paused hiring and expect to reopen in the second half of 2024. We are open to expressions of interest but do not plan to process them until that time.
What is ARC’s Theory team?
The Alignment Research Center (ARC) is a non-profit whose mission is to align future machine learning systems with human interests. The high-level agenda of the Theory team (not to be confused with the Evals team) is described by the report on Eliciting Latent Knowledge (ELK): roughly speaking, we’re trying to design ML training objectives that incentivize systems to honestly report their internal beliefs.
For the last year or so, we’ve mostly been focused on an approach to ELK based on formalizing a kind of heuristic reasoning that could be used to analyze neural network behavior, as laid out in our paper on Formalizing the presumption of independence. Our research has reached a stage where we’re coming up against concrete problems in mathematics and theoretical computer science, and so we’re particularly excited about hiring researchers with relevant background, regardless of whether they have worked on AI alignment before. See below for further discussion of ARC’s current theoretical research directions.
Who is ARC looking to hire?
Compared to our last hiring round, we have more of a need for people with a strong theoretical background (in math, physics or computer science, for example), but we remain open to anyone who is excited about getting involved in AI alignment, even if they do not have an existing research record.
Ultimately, we are excited to hire people who could contribute to our research agenda. The best way to figure out whether you might be able to contribute is to take a look at some of our recent research problems and directions:
What is working on ARC’s Theory team like?
ARC’s Theory team is led by Paul Christiano and currently has 2 other permanent team members, Mark Xu and Jacob Hilton, alongside a varying number of temporary team members (recently anywhere from 0–3).
Most of the time, team members work on research problems independently, with frequent check-ins with their research advisor (e.g., twice weekly). The problems described above give a rough indication of the kind of research problems involved, which we would typically break down into smaller, more manageable subproblems. This work is often somewhat similar to academic research in pure math or theoretical computer science.
In addition to this, we also allocate a significant portion of our time to higher-level questions surrounding research prioritization, which we often discuss at our weekly group meeting. Since the team is still small, we are keen for new team members to help with this process of shaping and defining our research.
ARC shares an office with several other groups working on AI alignment such as Redwood Research, so even though the Theory team is small, the office is lively with lots of AI alignment-related discussion.
What are ARC’s current theoretical research directions?
ARC’s main theoretical focus over the last year or so has been on preparing the paper Formalizing the presumption of independence and on follow-up work to that. Roughly speaking, we’re trying to develop a framework for “formal heuristic arguments” that can be used to reason about the behavior of neural networks. This framework can be thought of as a confluence of two existing approaches:
| | Human understandable | Machine verifiable |
|---|---|---|
| Confident and final | | Formal proof |
| Uncertain and defeasible | Mechanistic interpretability | Formal heuristic argument |
This research direction can be framed in a couple of different ways:
Over the coming weeks we'll post an update on our progress and current focus, as well as an AMA with research staff.
Hiring process
Our current interview process involves:
We will compensate candidates for their time when this is logistically possible.
We will keep applications open until at least the end of August 2023, and will aim to get a final decision back within 6 weeks of receiving an application.
Employment details
ARC is based in Berkeley, California, and we would prefer people who can work full-time from our office, but we are open to discussing remote or part-time arrangements in some circumstances. We can sponsor visas and are H-1B cap-exempt.
We are accepting applications for both visiting researcher (1–3 months) and full-time positions. The intention of the visiting researcher position is to assess potential fit for a full-time role, and we expect to invite around half of visiting researchers to join full-time. We are also able to offer straight-to-full-time positions, but we anticipate that we will only be able to do this for people with a legible research track-record.
Salaries are in the $150k–400k range for most people depending on experience.
Further information
If you have any questions about anything in this post, please ask in the comments section, email hiring@alignment.org, or stay tuned for our future posts and AMA.