Was on Vivek Hebbar's team at MIRI, now working with Adrià Garriga-Alonso on various empirical alignment projects.
I'm looking for projects in interpretability, activation engineering, and control/oversight; DM me if you're interested in working with me.
Is there an AI transcript/summary?
I started a dialogue with @Alex_Altair a few months ago about the tractability of certain agent foundations problems, especially the agent-like structure problem. I saw it as insufficiently well-defined to make progress on anytime soon. I thought the lack of similar results in easy settings, the fuzziness of the "agent"/"robustly optimizes" concept, and the difficulty of proving things about a program's internals given its behavior all pointed against working on this. But it turned out that we maybe didn't disagree on tractability much, it's just that Alex had somewhat different research taste, plus thought fundamental problems in agent foundations must be figured out to make it to a good future, and therefore working on fairly intractable problems can still be necessary. This seemed pretty out of scope and so I likely won't publish.
Now that this post is out, I feel like I should at least make this known. I don't regret attempting the dialogue, I just wish we had something more interesting to disagree about.
The model ultimately predicts the token two positions after B_def. Do we know why it doesn't also predict the token two after B_doc? This isn't obvious from the diagram; maybe there is some way for the induction head or arg copying head to either behave differently at different positions, or suppress the information from B_doc.
The Brownian motion assumption is rather strong but not required for the conclusion. Consider the stock market, which famously has heavy-tailed, bursty returns. It happens all the time for the S&P 500 to move 1% in a week, but a 10% move in a week only happens a couple of times per decade. I would guess (and we can check) that most weeks have >0.6x of the average per-week variance of the market, which causes the median weekly absolute return to be well over half of what it would be if the market were Brownian motion with the same long-term variance.
Also, Lawrence tells me that in Tetlock's studies, superforecasters tend to make updates of 1-2% every week, which actually improves their accuracy.
I talked about this with Lawrence, and we both agree on the following:
To some degree yes, but I expect lots of information to be spread out across time. For example: OpenAI releases GPT5 benchmark results. Then a couple weeks later they deploy it on ChatGPT and we can see how subjectively impressive it is out of the box, and whether it is obviously pursuing misaligned goals. Over the next few weeks people develop post-training enhancements like scaffolding, and we get a better sense of its true capabilities. Over the next few months, debate researchers study whether GPT4-judged GPT5 debates reliably produce truth, and control researchers study whether GPT4 can detect whether GPT5 is scheming. A year later an open-weights model of similar capability is released and the interp researchers check how understandable it is and whether SAEs still train.
You should update by +-1% on AI doom surprisingly frequently
This is just a fact about how stochastic processes work. If your p(doom) is Brownian motion in 1% steps starting at 50% and stopping once it reaches 0 or 1, then there will be about 50^2=2500 steps of size 1%. This is a lot! If we get all the evidence for whether humanity survives or not uniformly over the next 10 years, then you should make a 1% update 4-5 times per week. In practice there won't be as many due to heavy-tailedness in the distribution concentrating the updates in fewer events, and the fact you don't start at 50%. But I do believe that evidence is coming in every week such that ideal market prices should move by 1% on maybe half of weeks, and it is not crazy for your probabilities to shift by 1% during many weeks if you think about it often enough. [Edit: I'm not claiming that you should try to make more 1% updates, just that if you're calibrated and think about AI enough, your forecast graph will tend to have lots of >=1% week-to-week changes.]
I'm not so sure that shards should be thought of as a matter of implementation. Contextually activated circuits are a different kind of thing from utility function components. The former activate in certain states and bias you towards certain actions, whereas utility function components score outcomes. I think there are at least 3 important parts of this:
The first two are behavioral. We can say an agent is likely to be shardful if it displays these types of incoherence but not others. Suppose an agent is dynamically inconsistent and we can identify features in the environment like cheese presence that cause its preferences to change, but mostly does not suffer from the Allais paradox, tends to spend resources on actions proportional to their importance for reaching a goal, and otherwise generally behaves rationally. Then we can hypothesize that the agent has some internal motivational structure which can be decomposed into shards. But exactly what motivational structure is very uncertain for humans and future agents. My guess is researchers need to observe models and form good definitions as they go along, and defining a shard agent as having compositionally represented motivators is premature. For now the most important thing is how steerable agents will be, and it is very plausible that we can manipulate motivational features without the features being anything like compositional.
Hangnails are annoying and painful, and most people deal with them poorly. [1] Instead, use a drop of superglue to glue it to your nail plate. It's $10 for 12 small tubes on Amazon. Superglue is also useful for cuts and minor repairs, so I already carry it around everywhere.
Hangnails manifest as either separated nail fragments or dry peeling skin on the paronychium (area around the nail). In my experience superglue works for nail separation, and a paper (available free on Scihub) claims it also works for peeling skin on the paronychium.
Cyanoacrylate glue is regularly used in medicine to close wounds, and now frequently replaces stitches. Medical superglue has slightly different types of cyanoacrylate, but doctors I know say it's basically the same thing.
I think medical superglue exists to prevent rare reactions and for large wounds where the exothermic reaction from a large quantity might burn you, and the safety difference for hangnails is minimal [2]. But to be extra safe you could just use 3M medical grade superglue or Dermabond.
[1]: Typical responses to hangnails include:
[2]: There have been studies showing cytotoxicity in rabbits when injecting it in their eyes, or performing internal (bone or cartilage) grafts. A 2013 review says that although some studies have found internal toxicity, "[f]or wound closure and various other procedures, there have been a considerable number of studies finding histologic equivalence between ECA [commercial superglue] and more widely accepted modalities of repair."
Much dumber ideas have turned into excellent papers