

successful interpretability tools want to be debugging/analysis tools of the type known to be very useful for capability progress

Give one example of a substantial state-of-the-art advance that was decisively influenced by transparency; I ask since you said "known to be." Saying that it's conceivable isn't evidence that they're actually highly entangled in practice. The track record is that transparency research gives us differential technological progress and pretty much zero capabilities externalities.

In the DL paradigm you can't easily separate capabilities and alignment

This is true for conceptual analysis. Empirically they can be separated by measurement. Record general capabilities metrics (generally, downstream accuracy) and record safety metrics (e.g., trojan detection performance); then see whether an intervention improves a safety goal and whether it also improves general capabilities. For various safety research areas there aren't externalities. (More discussion on this topic here.)
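A minimal sketch of that bookkeeping follows. The metric values, the `Metrics` and `externality_report` names, and the `tol` noise threshold are all hypothetical and chosen only for illustration; they are not results from any paper.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Scores for one model; higher is better for both fields."""
    capability: float  # e.g., downstream accuracy (made-up numbers below)
    safety: float      # e.g., trojan detection performance (made-up numbers below)


def externality_report(baseline: Metrics, treated: Metrics, tol: float = 0.5) -> dict:
    """Compare a model before and after an intervention.

    `tol` is an assumed noise threshold in percentage points; changes no
    larger than this are treated as no improvement.
    """
    d_cap = treated.capability - baseline.capability
    d_safe = treated.safety - baseline.safety
    return {
        "capability_delta": d_cap,
        "safety_delta": d_safe,
        "improves_safety": d_safe > tol,
        "improves_capabilities": d_cap > tol,
    }


# Hypothetical before/after scores for a single intervention.
baseline = Metrics(capability=76.0, safety=60.0)
treated = Metrics(capability=76.25, safety=71.0)
report = externality_report(baseline, treated)
print(report)
```

An intervention that moves the safety metric while leaving the capabilities metric within noise is the "no externalities" case described above; in practice the threshold would come from run-to-run variance rather than a fixed constant.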

forcing that separation seems to constrain us

I think the poor epistemics on this topic have encouraged risk taking, reduced the pressure to find clear safety goals, and allowed researchers to get away with "trust me, I'm making the right utility calculations and have the right empirical intuitions," which is a very unreliable standard of evidence in deep learning.

Answer by Dan_H, Aug 30, 2021

Others can post their own papers, but I'll post some papers I was on and group them into one of four safety topics: enduring hazards ("Robustness"), identifying hazards ("Monitoring"), steering ML systems ("Alignment"), and forecasting the future of ML ("Foresight").

The main ML conferences are ICLR, ICML, and NeurIPS. The main CV conferences are CVPR, ICCV, and ECCV. The main NLP conferences are ACL and EMNLP.


Alignment (Value Learning):

Aligning AI With Shared Human Values (ICLR)

Robustness (Adversaries):

Using Pre-Training Can Improve Model Robustness and Uncertainty (ICML)

Robustness (Tail Events):

Benchmarking Neural Network Robustness to Common Corruptions and Perturbations (ICLR)
AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty (ICLR)
Natural Adversarial Examples (CVPR)
Pretrained Transformers Improve Out-of-Distribution Robustness (ACL)
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization (ICCV)


Measuring Massive Multitask Language Understanding (ICLR)
CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review (NeurIPS)
Measuring Coding Challenge Competence With APPS (in submission)
Measuring Mathematical Problem Solving With the MATH Dataset (in submission)

Monitoring (Anomaly Detection):

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks (ICLR)

Deep Anomaly Detection with Outlier Exposure (ICLR)

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty (NeurIPS)


Note that these are DL (representation learning/vision/text) papers, not RL (gridworld/MuJoCo/Bellman equation) papers.

There are at least four reasons for this choice. First, researchers need to be part of a larger RL group to do RL research well--for most of my time as a researcher I was not around RL researchers. Second, since RL is a relatively small area in ML (some DL workshops at NeurIPS are bigger than RL conferences), I prioritized DL for safety community building since that's where more researchers are. Third, I think MuJoCo/gridworld work stands less of a chance of surviving the filter of time than upstream DL work (upstream DL is mainly studied through vision and text; vision is a stand-in for continuous signals and text is a stand-in for discrete signals). Fourth, the safety community bet heavily on RL (and its implied testbeds and methods) as the main means for making progress on safety, but the safety community would have a more diversified portfolio by having someone work on DL too.


This seems like a fun exercise, so I spent half an hour jotting down possibilities. I'm more interested in putting potential considerations on people's radars and helping with brainstorming than I am in precision. None of these points are to be taken too seriously since this is fairly extemporaneous and mostly for fun.



Multiple Codex alternatives are available. The financial viability of training large models is obvious.

Research models start interfacing with auxiliary tools such as browsers, Mathematica, and terminals.



Large pretrained models are distinctly useful for sequential decision making (SDM) in interactive environments, displacing previous reinforcement learning research in much the same way BERT rendered most previous work in natural language processing wholly irrelevant. Now SDM methods don't require as much tuning, can generalize with fewer samples, and can generalize better.

For all of ImageNet's 1000 classes, models can reliably synthesize images that are realistic enough to fool humans.

Models have high enough accuracy to pass the multistate bar exam.

Models for contract review and legal NLP see economic penetration; it becomes a further source of economic value and consternation among attorneys and nontechnical elites. This indirectly catalyzes regulation efforts.

Programmers become markedly less positive about AI due to the prospect of reduced demand for some of their labor.

~10 trillion parameter (nonsparse) models attain human-level accuracy on LAMBADA (a proxy for human-level perplexity) and expert-level accuracy on LogiQA (a proxy for nonsymbolic reasoning skills). With models of this size, multiple other capabilities (these benchmarks serve as proxies for many capabilities) start to be useful, whereas with smaller models these capabilities were too unreliable to lean on. (Speech recognition started "working" only after it crossed a certain reliability threshold.)

Generated data (math, code, models posing questions for themselves to answer) help ease data bottleneck issues since Common Crawl is not enough. From this, many capabilities are bootstrapped.

Elon re-enters the fight to build safe advanced AI.



A major chatbot platform offers chatbots personified through video and audio.

Although forms of search/optimization are combined with large models for reasoning tasks, state-of-the-art models nonetheless only obtain approximately 40% accuracy on MATH.

Chatbots are able to provide better medical diagnoses than nearly all doctors.

Adversarial robustness for CIFAR-10 (assuming an attacker with eps=8/255) is finally over 85%.

Video understanding finally reaches human-level accuracy on video classification datasets like Something Something V2. This comports with the heuristic that video understanding is around 10 years behind image understanding.



Upstream vision advancements help autonomous driving but do not solve it for all US locations, as the long tail is really long.

ML models are competitive forecasters on platforms like Metaculus.

Nearly all AP high school homework and exam questions (including long-form questions) can be solved by answers generated from publicly available models. Similar models cut into typical Google searches since these models give direct and reliable answers.

Contract generation is now mostly automatable, further displacing attorneys.



Machine learning systems become great at using Metasploit and other hacking tools, increasing the accessibility, potency, success rate, scale, stealth, and speed of cyberattacks. This gets severe enough to create global instability and turmoil. EAs did little to use ML to improve cybersecurity and reduce this risk.


no AI safety relevant publications in 2019 or 2020, and only one is a coauthor on what I would consider a highly relevant paper.

Context: I'm an OpenPhil fellow who is doing work on robustness, machine ethics, and forecasting.

I published several papers on the research called for in Concrete Problems in AI Safety and OpenPhil's/Steinhardt's AI Alignment Research Overview. The work helped build a trustworthy ML community and aimed at reducing accident risks given very short AI timelines. Save for the first paper I helped with (when I was trying to learn the ropes), the motivation for the other dozen or so papers was always safety.

These papers have nothing to do with RL and are about DL, and they do not fit in with the type of technical research shared on this forum, which might be why these are not considered "highly relevant." Some (not all) of the OpenPhil fellows are working on safety, though with OpenPhil's broader research agenda.