All of AdamGleave's Comments + Replies

Immobile AI makes a move: anti-wireheading, ontology change, and model splintering

My sense is that Stuart assuming there's an initially specified reward function is a simplification, not a key part of the plan, and that he'd also be interested in e.g. generalizing a reward function learned from other sources of human feedback like preference comparison.

IRD would do well on this problem because it has an explicit distribution over possible reward functions, but this isn't really that unique to IRD -- Bayesian IRL or preference comparison would have the same property.

2 rohinmshah 2mo: Yeah, I agree with that. (I don't think we have experience with deep Bayesian versions of IRL / preference comparison at CHAI, and I was thinking about advice on who to talk to.)
What fraction of breakthrough COVID cases are attributable to low antibody count?

It could be net-negative if receiving a booster shot caused stronger imprinting, making future immune response less adaptive. I don't have a good sense of whether this original antigenic sin effect has already saturated after receiving two doses (or even a single dose), or whether it continues to become stronger.

My sense is this is an open question. From Petras et al (2021):

As suggested by a recent observation in naturally immunized individuals receiving two doses of the Pfizer COVID-19 (Comirnaty) vaccine, original antigenic sin may pose a problem in fu

...
2 AllAmericanBreakfast 3mo: One possible strategy in a world of sane and effective governance might be to reserve one or more protein targets for a truly global mass-vaccination campaign. Really drill in the idea that we have to wipe out Covid or else live in a world that's long-term deadlier than it was before. Produce enough vaccine and infrastructure to get the planet vaccinated in a short period of time. Then deliver it all at once. This could be going on in the background while we maintain our present efforts, building consensus and establishing infrastructure.
What fraction of breakthrough COVID cases are attributable to low antibody count?

I largely agree with this analysis. One major possible "side-effect" of a third booster is original antigenic sin. Effectively, the immune system may become imprinted on the ancestral variant of the spike protein, preventing adaptation to new variants (whether via direct exposure or via future boosters targeting new variants). This would be the main way I could see a third booster being seriously net-negative, although I don't have a good sense of the probability. Still, if antibody levels are low, the benefit of a booster is greater and I'd guess (caveat: not a...

3 AllAmericanBreakfast 3mo: From the article: This was the first time I've encountered this concept. It actually made it seem like a booster shot would just be ineffective, rather than "seriously net-negative." Immunological memory would be optimized for the early variant the vaccine was designed for, and would be unable (or less able?) to update for the new variant. Pfizer and Moderna vaccines target the spike protein. However, there are several other potential protein targets. Maybe the booster could be designed to target these instead. That said, my read on the original antigenic sin article makes it seem like a plausible cause of breakthrough cases of Covid. If so, then I'd predict that antibody levels wouldn't be a good predictor of susceptibility to infection. But it would be much better to base this on empirical data, and I don't know if that exists.
A Better Time until Sunburn Calculator

Thanks for sharing this! I did notice a weird non-monotonicity: if I go from 90 minutes exposure to 120 minutes, the "Percent of Population w/ Sunburn Degree 1 at Time Exposed" drops from 96.8% to 72.7%. There is a warning in both cases that it's outside normal range, but it still seems odd that more exposure gives lower risk.

1 Josh Jacobson 3mo: Indeed, the results for which warnings are thrown should be disregarded; the non-monotonicity of out-of-bounds results is a situation I noticed as well. The authors were quite clear about the equation only being useful in certain conditions, and it does seem to act reliably in those conditions, so I think this is just an out-of-bounds quirk that can be disregarded.
Delta Strain: Fact Dump and Some Policy Takeaways

Just to flag that I messed up the original calculation and underestimated everything by a factor of 2x; I've added an erratum.

I'd also recommend Matt Bell's recent analysis, which estimates 200 days of life lost. This is much higher than the analysis in my comment and the OP. I found the assumptions and sources somewhat pessimistic but ultimately plausible.

The main things driving the difference from my comment were:

  • Uses data from the UK's Office for National Statistics that I'd missed, which has a very high figure of 55% of people reporting symptoms after 5 weeks
...

I should probably argue with Matt directly, but my brief take is that this is just entirely incompatible with what we see on the ground. The friends of mine who got COVID aren't reporting a 45% chance of their life being 20% worse. That's... an incredibly massive effect that we would definitely see. Would anyone realistically bet on that?

7 Owain_Evans 4mo: Bell mentions this paper in Nature Medicine that finds only 2.3% of people having symptoms after 12 weeks. (The UK ONS study that is Bell's main source estimates 13%.) It seems better to take a mean of these estimates than to just drop one of them, as the studies are fairly similar in approach. (Both rely on self-report. The sample size for the Nature paper is >4000.) Note that the 13% figure in the ONS study drops to 1% if you restrict to subjects who had symptoms every week. (The study allows for people to go a week without any symptoms while still counting as a Long Covid case.) I realize people report Long Covid as varying over time, but it's clearly worse to have a condition that causes some fatigue or tiredness at least once a week rather than at least once every two weeks.
Delta Strain: Fact Dump and Some Policy Takeaways

This is a good point, the demographics here are very skewed. I'm not too worried about it overstating risk, simply because the risk ended up looking not that high (at least after adjusting for hospitalization). I think at this point most of us have incurred more than 5 days of costs from COVID restrictions, so if that was really all the cost from COVID, I'd be pretty relaxed.

The gender skew could be an issue, e.g. chronic fatigue syndrome seems to occur at twice the rate in women as in men.

Delta Strain: Fact Dump and Some Policy Takeaways

This is an accurate summary, thanks! I'll add that my calculation was only for long-term sequelae. Including ~10 days cost from acute effects, my all-things-considered view would be a mean of ~40 days, corresponding to 1041 uCOVIDs per hour.

This is per actual hour of (quality-adjusted) life expectancy. But given we spend ~1/3rd of our time sleeping, you probably want to value a waking-hour at 1.5x a life-hour (assuming being asleep has neutral valence). If you work a 40 hour work week and only value your productive time (I do not endorse this, by the way), then y...
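
As a rough check on the conversion: one full COVID case is a million uCOVIDs, so if a case costs ~40 quality-adjusted days (960 hours), one hour of life is worth about 1,000,000 / 960 ≈ 1041 uCOVIDs. The Python sketch below reproduces that figure and the 1.5x waking-hour adjustment. The function names are hypothetical, and the 2/3 waking fraction is an assumption implied by the "~1/3rd of our time sleeping" figure.

```python
# 1 uCOVID is a one-in-a-million chance of catching COVID,
# so one full case corresponds to 1,000,000 uCOVIDs.
MICROCOVIDS_PER_CASE = 1_000_000
HOURS_PER_DAY = 24


def microcovids_per_life_hour(days_lost_per_case: float) -> float:
    """uCOVIDs whose expected cost equals one hour of life expectancy."""
    hours_lost_per_case = days_lost_per_case * HOURS_PER_DAY
    return MICROCOVIDS_PER_CASE / hours_lost_per_case


def microcovids_per_waking_hour(
    days_lost_per_case: float, waking_fraction: float = 2 / 3
) -> float:
    """Same conversion, valuing a waking hour at 1 / waking_fraction life hours.

    waking_fraction = 2/3 is an assumption (about 8 hours of sleep per day,
    with sleep treated as neutral valence).
    """
    return microcovids_per_life_hour(days_lost_per_case) / waking_fraction


if __name__ == "__main__":
    print(f"{microcovids_per_life_hour(40):.1f}")    # per life hour
    print(f"{microcovids_per_waking_hour(40):.1f}")  # per waking hour
```

With 40 days lost per case this gives roughly 1041.7 uCOVIDs per life hour and roughly 1562.5 per waking hour.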

Delta Strain: Fact Dump and Some Policy Takeaways

Errata: My original calculation underestimated the risk by a factor of about 2x. I neglected two key considerations, which fortunately somewhat canceled each other out. My new estimate from the calculation is 3.0 to 11.7 quality-adjusted days lost to long-term sequelae, with my all-things-considered mean at 45. 

The two key things I missed:

  - I estimated the risk of a non-hospitalized case is about 10x less than a hospitalized case, and so divided the estimates of disease burden by 10x. The first part is correct, but the second part would only ma...

1 MichaelStJules 3mo: Here's another BOTEC, by Matt Bell: I think the main differences are using studies with higher excess burdens and using a lower reduction factor to translate to lifelong risk. On the latter:
5 AdamGleave 4mo: Just to flag that I messed up the original calculation and underestimated everything by a factor of 2x; I've added an erratum. I'd also recommend Matt Bell's recent analysis, which estimates 200 days of life lost. This is much higher than the analysis in my comment and the OP. I found the assumptions and sources somewhat pessimistic but ultimately plausible. The main things driving the difference from my comment were:
  • Uses data from the UK's Office for National Statistics that I'd missed, which has a very high figure of 55% of people reporting symptoms after 5 weeks, with fairly slow rates of recovery all the way out to 120 days post-infection. Given this is significantly higher than most other studies I've seen, I think Matt is being pessimistic by only down-adjusting to 45%, but I should emphasize these numbers are credible and the ONS study is honestly better than most out there.
  • Long COVID making your life 20% worse is on the pessimistic end. I put most mild symptoms at 5% worse. Ultimately subjective and highly dependent on what symptoms you get.
  • I think the difference in hospitalized vs non-hospitalized risk is closer to 10x (based on the Al-Aly figure), not Matt's estimate of 2x, which means we should multiply by a factor of ~60% not ~97%.
3 Owain_Evans 4mo: I quickly skimmed the Al-Aly et al. paper. It does look much better than some of the other studies. One concern is the demographics of the patients. Only 25% of people with Covid are younger than 48. Only 12% are female. I'd guess the veterans under 35 are significantly less affluent than LW readers. (Would more affluent veterans use private health care?) At a glance, I can't see results of any regressions on age but it might be worth contacting the authors about this. How to adjust for this? One thing is just to look at hospitalization risk (see AdamGleave's adjustment point (1)). However, it seems plausible that younger and healthier people would also recover better from less acute cases (and be less likely to have lingering symptoms). OTOH, there's anecdata and data (of less high quality IMO) suggesting that Long Covid doesn't fit the general pattern of exponential increases in badness of Covid (and other similar diseases) with age. Overall, I'd still be inclined to make an adjustment of risk down if you are under 35 and healthy. Demographic info about patients in Al-Aly et al.
6 Ben Pace 4mo: (I'd personally appreciate you saying how many microcovids you think is equivalent to an hour's time; that's the main number I've been using to figure out whether various costs are worth it.)
Inner Alignment in Salt-Starved Rats

I googled "model-based RL Atari" and the first hit was this which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly)

Ah, the "model-based using a model-free RL algorithm" approach :) They learn a world model using supervised learning, and then use PPO (a model-free RL algorithm) to train a policy in it. It sounds odd but it makes sense: you hopefully get much of the sample efficiency of model-based training, while still retaining the state-of-the-art results of model-free RL. You'...

Inner Alignment in Salt-Starved Rats

Thanks for the clarification! I agree if the planner does not have access to the reward function then it will not be able to solve it. Though, as you say, it could explore more given the uncertainty.

Most model-based RL algorithms I've seen assume they can evaluate the reward functions in arbitrary states. Moreover, it seems to me like this is the key thing that lets rats solve the problem. I don't see how you solve this problem in general in a sample-efficient manner otherwise.

One class of model-based RL approaches is based on [model-predictive control](ht...

2 Steven Byrnes 1y: Hmm. AlphaZero can evaluate the true reward function in arbitrary states. MuZero can't; it tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I googled "model-based RL Atari" and the first hit was this which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I'm not intimately familiar with the deep RL literature, I wouldn't know what's typical and I'll take your word for it, but it does seem that both possibilities are out there. Anyway, I don't think the neocortex can evaluate the true reward function in arbitrary states, because it's not a neat mathematical function, it involves messy things like the outputs of millions of pain receptors, hormones sloshing around, the input-output relationships of entire brain subsystems containing tens of millions of neurons, etc. So I presume that the neocortex tries to learn the reward function by supervised learning from observations of past rewards, and that's the whole thing with TD learning and dopamine. I added a new sub-bullet at the top to clarify that it's hard to explain by RL unless you assume the planner can query the ground-truth reward function in arbitrary hypothetical states. And then I also added a new paragraph to the "other possible explanations" section at the bottom saying what I said in the paragraph just above. Thank you. Well, the rats are trying to do the rewarding thing after zero samples, so I don't think "sample-efficiency" is quite the right framing. In ML today, the reward function is typically a function of states and actions, not "thoughts". In a brain, the reward can depend directly on what you're imagining doing or planning to do, or even just what you're thinking about. That's my proposal here. Well, I guess you could say that this is still a "normal MDP", but where "having thoughts" and "having ideas" etc. a
Inner Alignment in Salt-Starved Rats

I'm a bit confused by the intro saying that RL can't do this, especially since you later on say the neocortex is doing model-based RL. I think current model-based RL algorithms would likely do fine on a toy version of this task, with e.g. a 2D binary state space (salt deprived or not; salt water or not) and two actions (press lever or no-op). The idea would be:

  - Agent explores by pressing lever, learns transition dynamics that pressing lever => spray of salt water.

  - Planner concludes that any sequence of actions involving pressing lever wi...
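
A minimal Python sketch of this toy task, under loudly stated assumptions: the reward values (+1 for salt water when salt deprived, -1 when not), the factored transition model (an action's effect on the salt-water bit is assumed not to depend on the deprivation bit), and every name below (`reward`, `learn_transitions`, `plan`) are hypothetical illustrations, not taken from the experiment or from any library. The key assumption is that the planner can query the reward function in states it has never visited.

```python
PRESS, NOOP = "press_lever", "no_op"
ACTIONS = (PRESS, NOOP)

# A state is a pair of booleans: (salt_deprived, salt_water).


def true_step(state, action):
    """Environment dynamics: pressing the lever produces a spray of salt water."""
    salt_deprived, _ = state
    return (salt_deprived, action == PRESS)


def reward(state):
    """Reward function the planner may evaluate in ANY state (assumed values)."""
    salt_deprived, salt_water = state
    if not salt_water:
        return 0.0
    return 1.0 if salt_deprived else -1.0


def learn_transitions(experience):
    """Tabular model of each action's effect on the salt-water bit.

    Assumption: the effect is independent of salt deprivation, so what is
    learned while not deprived transfers to the deprived state.
    """
    effect = {}
    for state, action in experience:
        _, salt_water_next = true_step(state, action)
        effect[action] = salt_water_next
    return effect


def plan(state, effect):
    """One-step lookahead: predict the next state, then query the reward."""
    salt_deprived, _ = state

    def value(action):
        predicted_state = (salt_deprived, effect[action])
        return reward(predicted_state)

    return max(ACTIONS, key=value)


# Exploration happens only while the agent is NOT salt deprived.
experience = [((False, False), PRESS), ((False, False), NOOP)]
model = learn_transitions(experience)

print(plan((False, False), model))  # not deprived: avoids the lever
print(plan((True, False), model))   # deprived, never seen before: presses
```

The planner never observes a reward in the salt-deprived state; it picks the lever there only because it is allowed to evaluate `reward` on the predicted state. If the reward had to be learned from past observations instead, this sketch would have no basis for preferring the lever.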

7 Steven Byrnes 1y: Good question! Sorry I didn't really explain. The missing piece is "the planner will conclude this has positive reward". The planner has no basis for coming up with this conclusion, that I can see. In typical RL as I understand it, regardless of whether it's model-based or model-free, you learn about what is rewarding by seeing the outputs of the reward function. Like, if an RL agent is playing an Atari game, it does not see the source code that calculates the reward function. It can try to figure out how the reward function works, for sure, but when it does that, all it has to go on is the observations of what the reward function has output in the past. (Related discussion.) So yeah, in the salt-deprived state, the reward function has changed. But how does the planner know that? It hasn't seen the salt-deprived state before. Presumably if you built such a planner, it would go in with a default assumption of "the salt-deprivation state is different now than I've ever seen before; I'll just assume that that doesn't affect the reward function!" Or at best, its default assumption would be "the salt deprivation state is different now than I've ever seen before; I don't know how and whether that impacts the reward function. I should increase my uncertainty. Maybe explore more.". In this experiment the rats were neither of those, instead they were acting like "the salt deprivation state is different than I've ever seen, and I specifically know that, in this new state, very salty things are now very rewarding". They were not behaving as if they were newly uncertain about the reward consequences of the lever, they were absolutely gung-ho about pressing it. Sorry if I'm misunderstanding :-)
The ground of optimization

Thanks for the post, this is my favourite formalisation of optimisation so far!

One concern I haven't seen raised so far is that the definition seems very sensitive to the choice of configuration space. As an extreme example, for any given system, I can always augment the configuration space with an arbitrary number of dummy dimensions, and choose the dynamics such that these dummy dimensions always get set to all zero after each time step. Now, I can make the basin of attraction arbitrarily large, while the target configuration set remains a fixed si...
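
To make the dummy-dimension construction concrete, here is a small Python sketch. The four-state base system, its "step toward 0" dynamics, and all function names are hypothetical choices made only for illustration. It counts the basin of attraction and the target set as binary dummy dimensions are added.

```python
from itertools import product


def base_step(x: int) -> int:
    """Base dynamics on states 0..3: move one step toward 0."""
    return max(x - 1, 0)


def augmented_step(state):
    """Same base update; every dummy dimension is reset to zero."""
    x, dummies = state
    return (base_step(x), (0,) * len(dummies))


def basin_and_target_sizes(num_dummies: int, horizon: int = 10):
    """Count states that end up in the target set, and the target set's size."""
    target = {(0, (0,) * num_dummies)}
    states = [
        (x, dummies)
        for x in range(4)
        for dummies in product((0, 1), repeat=num_dummies)
    ]
    basin_size = 0
    for state in states:
        for _ in range(horizon):
            state = augmented_step(state)
        if state in target:
            basin_size += 1
    return basin_size, len(target)


for k in (0, 2, 4):
    print(k, basin_and_target_sizes(k))
```

Each extra dummy bit doubles the basin (4, 16, 64 states for k = 0, 2, 4) while the target set stays at a single configuration, so the basin-to-target ratio can be inflated without changing anything about the underlying system.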

Following human norms

I feel like there are three facets to "norms" vs. values, which are bundled together in this post but which could in principle be decoupled. The first is representing what not to do versus what to do. This is reminiscent of the distinction between positive and negative rights, and indeed most societal norms (e.g. human rights) are negative, but not all (e.g. helping an injured person in the street is a positive right). If the goal is to prevent catastrophe, learning the 'negative' rights is probably more important, but it seems to me t...

2 rohinmshah 3y: Yeah, agreed with all of that, thanks for the comment. You could definitely try to figure out each of these things individually, e.g. learning constraints that can be used with Constrained Policy Optimization is along the "what not to do" axis, and a lot of the multiagent RL work is looking at how we can get some norms to show up with decentralized training. But I feel a lot more optimistic about research that is trying to do all three things at once, because I think the three aspects do interact with each other. At least, the first two feel very tightly linked, though they probably can be separated from the multiagent setting.
2018 AI Alignment Literature Review and Charity Comparison

Thanks for the informative post as usual.

Full-disclosure: I'm a researcher at UC Berkeley financially supported by CHAI, one of the organisations reviewed in this post. However, this comment is just my personal opinion.

Re: location, I certainly agree that an organization does not need to be in the Bay Area to do great work, but I do think location is important. In particular, there's a significant advantage to working in or near a major AI hub. The Bay Area is one such place (Berkeley, Stanford, Google Brain, OpenAI, FAIR) but not the only one; e...

2 Larks 3y: I definitely agree being near AI hubs is helpful, and I'd be interested in supporting any credible new groups that started in other hubs. Thanks for that extra info on CHAI staff. In general my objections to the Bay Area are partly about the EA/LW culture there, and partly about the broader culture. I did end up donating to CHAI despite this!
Current AI Safety Roles for Software Engineers

Description of CHAI is pretty accurate. I think it's a particularly good opportunity for people who are considering grad school as a long-term option: we're in an excellent position to help people get into top programs, and you'll also get a sense of what academic research culture is like.

We'd like to hire more than one engineer, and are currently trialling several hires. We have a mixture of work, some of which is more ML oriented and some of which is more infrastructure oriented. So we'd be willing to consider applicants with lim...

+1 on the last para; it has repeatedly been my experience that the best-qualified candidates for a job were not sure that they were, and thought this meant they shouldn't apply, which is quite an unfortunate default decision.