Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.


The new 30-person research group in DC investigating how emerging technologies could affect national security (Rob Wiblin and Helen Toner): This 80,000 Hours podcast with Helen Toner dives into details of AI policy, China and the new Center for Security and Emerging Technology (CSET). I'm only summarizing the parts I found most relevant.

Many of the analogies for AI are quite broken. AI is a very broad set of software technologies, unlike nuclear weapons which are very discrete. It's not feasible to use export controls to keep "AI" within the US. In addition, AI will affect war far more fundamentally than just creating lethal autonomous weapons -- Helen thinks that the biggest military impact might be on logistics. It's also weird to compare data to oil, because oil is a rival good (two people can't use the same oil), whereas data can easily be copied. In addition, one barrel of oil can replace any other barrel, but data is very specific to the particular application. Helen's preferred analogy is thinking of AI as electricity -- a very general purpose tool that will transform lots of aspects of society. However, this analogy can also break down -- for example, the AI research community seems pretty important, but there was no analog for electricity.

And now for a few random points, in no particular order. China "exports" around 50,000 inventors (patent holders) every year, while the US imports 190,000, far more than any other country, suggesting that the US is a global hub for talent. AI is hard to define, because many of its properties lie on a continuum -- for example, is a landmine a lethal autonomous weapon? The way to affect policy is to make small, targeted changes in proposed policies so that the government makes slightly better decisions -- it's far too difficult to execute on a grand plan to get the government to do some big thing. The main skills for engaging with government on technology issues: be able to speak both to scientists as well as bureaucrats, and be able to navigate the DC setting -- knowing what people are doing, what their incentives are, and how to get your thing done given their different incentives.

Rohin's opinion: I enjoyed the section on how analogies for AI are broken -- I don't usually think much about them, but they always felt a bit off, and Helen makes it very clear what the issues are. It was also interesting seeing how the perspectives on AI are quite different from those of us thinking about AGI accident risk -- we often think about single, generally intelligent AGI systems, whereas Helen emphasized how current technologies can be easily deployed in many application-specific contexts. While data for current systems is very application-specific as Helen mentioned, if you believe the unsupervised learning story data may be more interchangeable for AGI systems.

AI Alignment Podcast: On the Governance of AI (Lucas Perry and Jade Leung): Jade makes a lot of points in this podcast, some of which I've summarized here in no particular order.

GovAI works on lots of research topics, including analysis of the inputs to AI, understanding historical cases of competition, looking at the relationship between firms and governments, and understanding public opinion.

Governance is particularly difficult because in the current competitive environment it's hard to implement any form of "ideal" governance; we can only make changes on the margin. As a result, it is probably better if we could get to a state where we could take a long time to deliberate about what ideal governance would look like, without having to worry about competitive pressures.

The biggest risk for governments is that they will make hasty, ill-informed regulation. However, given how uncertain we are, it's hard to recommend any concrete actions right now -- but governance will happen anyway; it won't wait for more research. One useful action we can take is to correct or add nuance to inaccurate memes and information, such as the "race" between the US and China, or the performance-safety tradeoff. Plausibly we should engage with government more -- we may have been biased towards working with private organizations because they are more nimble and familiar to us.

Instead of thinking about short term vs. long term, we should be thinking about the stakes. Some issues, such as privacy or job loss, can be thought of as "short term" but their stakes could scale to be huge in the long term. Those would be good areas to think about.

Rohin's opinion: I don't have any particular thoughts on these topics, but I am glad for both this and the previous podcast, which give more of a birds-eye view of the AI governance landscape, which is hard to get from any single paper.

Technical AI alignment

Technical agendas and prioritization

On the purposes of decision theory research (Wei Dai): In this post, Wei Dai clarifies that he thinks decision theory research is important because it can help us learn about the nature of rationality, philosophy, and metaphilosophy; it allows us to understand potential AI failure modes; we can better understand puzzles about intelligence such as free will, logical uncertainty, counterfactuals and more; and it could improve human rationality. It is not meant to find the "correct" decision theory to program into an AI, nor to create safety arguments that show that an AI system is free of "decision-theoretic" flaws.

Preventing bad behavior

Bridging Hamilton-Jacobi Safety Analysis and Reinforcement Learning (Jaime F. Fisac, Neil F. Lugovoy et al): Reinforcement learning is not great at enforcing constraints that hold at all times, because the agent would violate a constraint now if it would lead to higher reward later. In robust optimal control theory, we maximize the minimum of the constraint reward over time to avoid this. We can do this in the Bellman equation by taking a minimum between the current reward and estimated future value (instead of summing), but this does not uniquely define a fixed point. Just as in regular RL, we can use discounting to avoid the problem: in particular, if we interpret the discount as the probability that the episode continues, we can derive a Safety Bellman equation for which Q-learning is guaranteed to converge. They demonstrate their method in classic control environments as well as half-cheetah, with a range of RL algorithms including soft actor-critic (SAC).

Rohin's opinion: I really like how simple the change is here -- it should be a one-line change for many deep RL algorithms. Previously, we had to choose between unconstrained agents for high dimensional problems, or constrained agents for low dimensional problems -- I like that this work is making progress on constrained agents for high dimensional problems, similarly to Constrained Policy Optimization. While this work doesn't involve a performance reward, you could use the resulting safe policy in order to guide a process of safe exploration to learn a policy that safely optimizes a performance metric. Of course, this is all assuming a specification for the constraint to satisfy.

Miscellaneous (Alignment)

Modeling AGI Safety Frameworks with Causal Influence Diagrams (Tom Everitt, Ramana Kumar, Victoria Krakovna et al): This paper describes several AI safety frameworks using the language of causal influence diagrams (AN #49), in order to make it easy to compare and contrast them. For example, the diagrams make it clear that while Cooperative IRL and reward modeling (AN #34) are very similar, there are significant differences: in cooperative IRL, the rewards come directly from the underlying human preferences, whereas in reward modeling, the rewards come from a reward model that depends on human feedback, which itself depends on the underlying human preferences.

Rohin's opinion: I like these diagrams as a way to demonstrate the basics of what's going on in various AI safety frameworks. Sometimes the diagrams can also show the differences in safety features of frameworks. For example, in reward modeling, the agent has an incentive to affect the human feedback in order to affect the reward model directly. (Imagine getting the human hooked on heroin, so that future feedback causes the reward model to reward heroin, which could be easy to produce.) On the other hand, in cooperative IRL, the agent only wants to affect the human actions inasmuch as the actions affect the state, which is a normal or allowed incentive. (Imagine the agent causing the human to leave their house earlier so that they get to their meeting on time.)

AI strategy and policy

Information security careers for GCR reduction (Claire Zabel and Luke Muehlhauser): This post suggests that information security could be a good career path for people looking to reduce global catastrophic risks (GCRs). For AI in particular, such experts could help mitigate attacks by malicious or incautious actors to steal AI-related intellectual property. It also reduces the risk of destabilizing AI technology races. Separately, such experts could think about the potentially transformative impact of AI on cyber offense and defense, develop or advise on credible commitment techniques (see eg. model governance (AN #38)), or apply the security mindset more broadly.

An Interview with Ben Garfinkel (Joshua Monrad, Mojmír Stehlík and Ben Garfinkel): AI seems poised to be a very big deal, possibly through the development of AGI, and it's very hard to forecast what would happen next. However, looking at history, we can see a few very large trajectory shifts, such as the Agricultural Revolution and Industrial Revolution, where everything changed radically. We shouldn't assume that such change must be for the better. Even though it's hard to predict what will happen, we can still do work that seems robustly good regardless of the specific long-term risk. For example, Ben is optimistic about research into avoiding adversarial dynamics between different groups invested in AI, research into how groups can make credible commitments, and better forecasting. However, credible commitments are probably less tractable for AI than with nukes or biological weapons because AI systems don't leave a large physical footprint, can easily proliferate, and are not a clear category that can be easily defined.

Other progress in AI


Self-Supervised Exploration via Disagreement (Deepak Pathak, Dhiraj Gandhi et al) (summarized by Cody): For researchers who want to build a reinforcement learning system that can learn to explore its environment without explicit rewards, a common approach is to have the agent learn a model of the world, and incentivize it to explore places where its model has the highest error, under the theory that these represent places where it needs to interact more to collect more data and improve its world model. However, this approach suffers in cases when the environment is inherently stochastic, since in a stochastic environment (think: sitting in front of a static TV and trying to predict the next frame), prediction error can never be brought to zero, and the agent will keep interacting even when its world model has collected enough data to converge as much as it can. This paper proposes an alternative technique: instead of exploring in response to prediction error, learn an ensemble of bootstrapped next-state prediction models and explore in response to variance or disagreement between the models. This has a few nice properties. One is that, in cases of inherent stochasticity, all models will eventually converge to predicting the mean of the stochastic distribution, and so even though they've not brought error down to zero, the variance among models will be low, and will correctly incentivize our agent to not spend more time trying to learn. Another benefit is that since the reward is purely a function of the agent's models, it can be expressed analytically as a function of the agent's choices and trained via direct backpropogation rather than "black box reward" RL, making it more efficient.

Cody's opinion: I found this approach really elegant and clever as a way of addressing the "static TV" problem in curiosity literature. I'd be curious to see more work that introduces even stronger incentives towards diversity among the ensemble models (different architectures, even more different datasets they're trained on), to see if that amplifies the cases of model disagreement.

Deep learning

Weight Agnostic Neural Networks (Adam Gaier et al) (summarized by Cody): Inspired by the ability of animals to perform some tasks at birth, before learning about the world, this paper tries to find network architectures that perform well over a wide range of possible model parameters. The idea here is that if an architecture performs well with different sampled weights and without training to update those weights, then the architecture itself is what's responsible for encoding the solution, rather than any particular weight configuration. The authors look for such architectures on both classification and reinforcement learning problems by employing NEAT, a evolutionary method from Neural Architecture Search that searches for the best-performing topologies within the space of possible node connections and activations. The authors find that they're able to construct architectures that do better than random on their test problems without training weights explicitly.

Cody's opinion: I appreciate the premise of this paper, and in general feel positively towards papers that delve into a better understanding of how much of modern neural network performance is attributable to (discrete) structural architectures vs particular settings of continuous weight parameters, and I think this paper does that in a clever way by essentially marginalizing over different weight values. The framing of this paper, implicitly comparing networks used without weight training to animals with innate abilities, did make me wonder whether the architecture vs weights analogy to evolution vs learning is a sound one. Because, while it's true that the weights weren't explicitly gradient-descent trained in this paper, the network did still perform optimization based on task performance, just over a set of discrete parameters rather than continuous ones. In that context, it doesn't really seem correct to consider the resulting architectures "untrained" in a way that I think that analogy would suggest. I'd be curious to see more work in this direction that blends in ideas from meta-learning, and tries to find architectures that perform well on multiple tasks, rather than just one.

Hierarchical RL

Unsupervised Discovery of Decision States for Transfer in Reinforcement Learning (Nirbhay Modhe et al)

Miscellaneous (AI)

Explainable AI, Sparse Representations, and Signals: So far, we have built AI systems that store knowledge symbolically or in a distributed fashion (with neural nets being the latter). While the distributed form allows us to learn knowledge and rules automatically, it is much harder to understand and interpret than symbolically represented knowledge. This post argues that the main difference is in the sparsity of the learned knowledge. Of course, with more "sparse" knowledge, it should be easier for us to understand the internal workings of the AI system, since we can ignore the pruned connections. However, the author also argues that sparse knowledge will help 'guide the search for models and agents that can be said to "learn" but also "reason"'. Given that AGI will likely involve finding good representations for the world (in the sense of unsupervised learning), then sparse learning can be thought of as a bias towards finding better bases for world models, that are more likely to be conceptually clean and more in line with Occam's razor.

In a postscript, the author considers arguments for AI risk. Notably, there isn't any consideration of goal-directedness or alignment failures; the worry is that we will start applying superhuman AI systems to superhuman tasks, and we won't know how to deal with these situations.

Rohin's opinion: Sparsity seems like a good objective to shoot for in order to ensure explainability. I'm less convinced that it's worthwhile for representation learning: I doubt humans have any sort of "sparse learning" bias; I think sparsity of knowledge is a natural consequence of having to understand a very complex world with a very small brain. (Whereas current ML systems only have to understand much simpler environments.)


Microsoft invests in and partners with OpenAI to support us building beneficial AGI (Greg Brockman): After moving to a capped-profit investment model (AN #52), Microsoft has invested $1 billion in OpenAI. This allows OpenAI to keep their focus on developing and sharing beneficial AGI: instead of having to create a product to cover costs, they can license their pre-AGI technologies, likely through Microsoft.

Research Associate in Paradigms of Artificial General Intelligence and Their Associated Risk (José Hernández-Orallo): CSER is hiring a post-doctoral research assistant to inform the AGI safety agenda by looking at existing and possible kinds of agents; the deadline is August 26.

New to LessWrong?

New Comment
2 comments, sorted by Click to highlight new comments since: Today at 7:11 PM

This work on learning with constraints seems interesting.

Looks like the paper "Bridging Hamilton-Jacobi Safety Analysis and Reinforcement Learning" has moved so that link is currently broken. Here's a working URL: Also one more where the full paper is more easily accessible:[RL][Constraints].pdf

Thanks, updated