All of aogara's Comments + Replies

Epic. Do you have to apply the normal route in order to get the GRFP, or can you go GRFP-only?

My (admittedly unclear) interpretation of the rules was that when you apply for the GRFP you must state an "intended school". It felt disingenuous to state a school I didn't apply to, ergo you should apply to a school. But I have no reason to believe it is verified in any way, shape, or form.

Really cool. I read some of these kinds of papers last week, but this is better context on the topic. Redundancy seems like evidence in favor of a narrow loss basin, but e.g. the fact that fine-tuned BERT models generalize very differently is evidence of multiple local minima. Your guess that linear mode connectivity works in simple image classification domains but not in language models seems like the most likely answer to me, but I would be interested to see it tested.

One technique that might help for fine-tuning the generator is Meta AI’s DIRECTOR [1]. The technique uses a classifier to estimate the probability that a generated sequence will be unacceptable each time a new token is generated. Rather than generating full completions and sampling among them, this method guides the generator towards acceptable completions during the generation process. The Blender Bot 3 paper finds that this method works better than the more standard approach of ranking full completions according to the classifier’s acceptability score [2].
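To make the mechanism concrete, here is a toy sketch of classifier-guided decoding in the spirit of DIRECTOR. The stand-in LM and classifier below are hard-coded placeholders (not the real models or any real API), and the additive combination with weight `alpha` is an assumption for illustration, not the paper's exact formula:

```python
import math

# Toy vocabulary and a stand-in "language model": returns next-token
# log-probs given a prefix. In DIRECTOR these come from a trained
# transformer; here they are hard-coded for illustration.
def lm_logprobs(prefix):
    # The LM slightly prefers the "bad" continuation.
    return {"good": math.log(0.4), "bad": math.log(0.5), "<eos>": math.log(0.1)}

def classifier_log_acceptable(prefix, token):
    # Per-token acceptability head: log P(sequence stays acceptable | token).
    return math.log(0.9) if token != "bad" else math.log(0.1)

def guided_next_token(prefix, alpha=1.0):
    """Pick the next token by combining LM log-probs with the classifier's
    per-token acceptability score, instead of reranking full completions."""
    scores = {
        tok: lp + alpha * classifier_log_acceptable(prefix, tok)
        for tok, lp in lm_logprobs(prefix).items()
    }
    return max(scores, key=scores.get)

print(guided_next_token([]))  # prints "good": the classifier steers away from "bad"
```

The key contrast with rerank-then-sample is that the classifier score is applied at every decoding step, so unacceptable continuations are avoided before they are ever generated in full.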

[1] http... (read more)

In particular, that report considers substituting capital for labor a potential driver of explosive growth. Fen’s argument from the Baumol effect relies on the premise that there are baseline levels of labor that cannot be automated, and that productivity growth is therefore limited by those bottlenecks.

I like this way of thinking about how quickly AI will grow smarter, and how much of the world will be amenable to its methods. Is understanding natural language sufficient to take over the world? I would argue yes, but my NLP professor disagrees — he thinks physical embodiment and the accompanying social cues would be very important for achieving superintelligence.

Your first two points make a related argument: that ML requires lots of high quality data, and that our data might not be high quality, or not in the areas it needs to be. A similar question woul... (read more)

Something I learned today that might be relevant: OpenAI was not the first organization to train transformer language models with search engine access to the internet. Facebook AI Research released their own paper on the topic six months before WebGPT came out, though it is, surprisingly, not cited by the WebGPT paper.

Generally I agree that hooking language models up to the internet is terrifying, despite the potential improvements for factual accuracy. Paul's arguments seem more detailed on this and I'm not sure what I would think if I thought ab... (read more)

I did not know! However, I don't think this is really the same kind of reference class in terms of risk. It looks like the search engine access for the Facebook case is much more limited and basically just consisted of them appending a number of relevant documents to the query, instead of the model itself being able to send various commands that include starting new searches and clicking on links.

Agreed it's really difficult for a lot of the work. You've probably seen it already but Dan Hendrycks has done a lot of work explaining academic research areas in terms of x-risk (e.g. this and this paper). Jacob Steinhardt's blog and field overview and Sam Bowman's Twitter are also good for context. 

I second this, that it's difficult to summarize AI-safety-relevant academic work for LW audiences. I want to highlight the symmetric difficulty of trying to summarize the mountain of blog-post-style work on the AF for academics.

In short, both groups have steep reading/learning curves that are under-appreciated when you're already familiar with it all.

All of these academics are widely read and cited. Looking at their Google Scholar profiles, every one of them has more than 1,000 citations, and half have more than 10,000. Outside of LessWrong, lots of people in academia and industry labs already read and understand their work. We shouldn't disparage people who are successfully bringing AI safety into the mainstream ML community.

(Also, this is an incredibly helpful writeup and it’s only to be expected that some stuff would be missing. Thank you for sharing it!)


These professors all have a lot of published papers in academic conferences. It’s probably a bit frustrating to not have their work summarized, and then be asked to explain their own work, when all of their work is published already. I would start by looking at their Google Scholar pages, followed by personal websites and maybe Twitter. One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.

I don't think the onus should be on the reader to infer x-risk motivations. In academic ML, it's the author's job to explain why the reader should care about the paper. I don't see why this should be different in safety. If it's hard to do that in the paper itself, you can always e.g. write a blog post explaining safety relevance (as mentioned by aogara, people are already doing this, which is great!). There are often many different ways in which a paper might be intended to be useful for x-risks (and ways in which it might not be). Often the motivation for a paper (even in the groups mentioned above) may be some combination of it being an interesting ML problem, interests of the particular student, and various possible thoughts around AI safety. It's hard to try to disentangle this from the outside by reading between the lines.

Agree with both aogara and Eli's comment. 

One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.

For me this reading between the lines is hard: I spent ~2 hours reading academic papers/websites yesterday, and while I could quite quickly summarize the work itself, it was quite hard for me to figure out the motivations.

On the one hand, yeah, probably frustrating. On the other hand, that's the norm in academia: people publish work and then nobody reads it.
Fair, I see why this would be frustrating and apologize for any frustration caused. In an ideal world we would have read many of these papers and summarized them ourselves, but that would have taken a lot of time and I think the post was valuable to get out ASAP. ETA: Probably it would have been better to include more of a disclaimer on the "everyone" point from the get-go, I think not doing this was a mistake.

“OpenAI leadership tend to put more likelihood on slow takeoff”

Could you say more about the timelines of people at OpenAI? My impression was that they’re very short and explicitly include the possibility of scaling language models to AGI. If somebody builds AGI in the next 10 years, OpenAI seems like a leading candidate to do so. Would people at OpenAI generally agree with this?

Thank you, this answered my question

It might have web search capabilities a la WebGPT, in which case I wouldn’t be confident of this. Without web search I’d agree.

I think even with search capabilities, it wouldn't accurately sort a set of real submissions (e.g., if a high school English teacher gave the assignment to their ~150 students, with a note that they would be only mildly punished for plagiarism on this one assignment, to ensure at least some actually plagiarized essays in the sample).

2. Are you still estimating that algorithmic efficiency doubles every 2.5 years (for now at least, until R&D acceleration kicks in)? I've heard from others (e.g. Jaime Sevilla) that more recent data suggests it's currently doubling every year.

It seems like the only source on this is Hernandez & Brown 2020. Their main finding is a doubling time of 16 months for AlexNet-level performance on ImageNet: "the number of floating point operations required to train a classifier to AlexNet-level performance on ImageNet has decreased by a factor of 44x betwe... (read more)
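As a sanity check on that figure, a 44x efficiency gain between 2012 and 2019 (about 84 months) works out to roughly the 16-month doubling time the paper reports:

```python
import math

# Back-of-the-envelope check on the Hernandez & Brown figure quoted above:
# a 44x efficiency gain over ~84 months implies a doubling time of about
# 84 / log2(44) months.
def doubling_time_months(factor, months):
    return months / math.log2(factor)

print(round(doubling_time_months(44, 84), 1))  # prints 15.4
```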

“We'll still probably put it in a box, for the same reason that keeping password hashes secure is a good idea. We might as well. But that's not really where the bulk of the security comes from.”

This seems true in worlds where we can solve AI safety to the level of rigor demanded by security mindset. But lots of things in the world aren’t secure by security mindset standards. The internet and modern operating systems are both full of holes. Yet we benefit greatly from common sense, fallible safety measures in those systems.

I think it’s worth working on vers... (read more)

It's not a question of "making safety more likely" vs "guarantees". Either we will basically figure out how to make an AGI which does not need a box, or we will probably die. At the point where there's an unfriendly decently-capable AGI in a box, we're probably already dead. The box maybe shifts our survival probability from epsilon to 2*epsilon (conditional on having an unfriendly decently-capable AGI running). It just doesn't increase our survival probability by enough to be worth paying attention to, if that attention could otherwise be spent on something with any hope at all of increasing our survival probability by a nontrivial amount. The main reason to bother boxing at all is that it takes relatively little marginal effort. If there's nontrivial effort spent on it, then that effort would probably be more useful elsewhere.

Love the Box Contest idea. AI companies are already boxing models that could be dangerous, but they've done a terrible job of releasing the boxes and information about them. Some papers that used and discussed boxing:

... (read more)
Stephen Fowler (1mo):
Yes. I strongly suspect a model won't need to be anywhere close to an AGI before it's capable of producing incredibly damaging malware.

Turns out that this dataset contains little to no correlation between a researcher's years of experience in the field and their HLMI timelines. Here's the trendline, showing a small positive correlation where older researchers have longer timelines -- the opposite of what you'd expect if everyone predicted AGI as soon as they retire. 

My read of this survey is that most ML researchers haven't updated significantly on the last five years of progress. I don't think they're particularly informed on forecasting and I'd be more inclined to trust the inside ... (read more)

I wonder if the fact that there are ~10 respondents who have worked in AI for 7 years, but only one who has for 8 years is because of Superintelligence which came out in 2015.

Yes, to be clear, I don't buy the M-G law either on the basis of earlier surveys which showed it was just cherrypicking a few points motivated by dunking on forecasts. But it is still widely informally believed, so I point this out to annoy such people: 'you can have your M-G law but you will have to also have the implication (which you don't want) that timelines dropped almost an entire decade in this survey & the past few years have not been business-as-usual or "expected" or "predicted"'.

Zach Stein-Perlman (2mo):
For people in epistemic positions similar to ours, I think surveys like this are not very useful for updating on timelines & p(doom) & etc., but are very useful for updating on what ML researchers think about those things, which is important for different reasons. (I do not represent AI Impacts, etc.)

If you have the age of the participants, it would be interesting to test whether there is a strong correlation between expected retirement age and AI timelines. 

Zach Stein-Perlman (2mo):
We do not. (We have "Which AI research area have you worked in for the longest time?" and "How long have you worked in this area?" which might be interesting, though.)

This was heavily upvoted at the time of posting, including by me. It turns out to be mostly wrong. AI Impacts just released a survey of 4271 NeurIPS and ICML researchers conducted in 2021 and found that the median year for expected HLMI is 2059, down only two years from 2061 since 2016. Looks like the last five years of evidence hasn’t swayed the field much. My inside view says they’re wrong, but the opinions of the field and our inability to anticipate them are both important.

It's worth noting that Ajeya's BioAnchors report estimates that TAI will require a median of 22T data points, nearly an order of magnitude more than the available text tokens as estimated here. See here for more. 

My report estimates that the amount of training data required to train a model with N parameters scales as N^0.8, based significantly on results from Kaplan et al 2020. In 2022, the Chinchilla scaling result (Hoffmann et al 2022) showed that instead the amount of data should scale as N.
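To see how much the two fits diverge, here is a toy comparison. The constants are my own assumptions, anchored so that a 70B-parameter model wants 1.4T tokens (the Chinchilla recipe); they are not values taken from either paper:

```python
# Compare implied training-data requirements under the two scaling fits
# discussed above. Both curves are anchored (by assumption) at the
# Chinchilla point of 70B parameters / 1.4T tokens.
def tokens_kaplan(n_params, k=1.4e12 / (70e9 ** 0.8)):
    return k * n_params ** 0.8  # data ~ N^0.8 (Kaplan-style fit)

def tokens_chinchilla(n_params, tokens_per_param=20):
    return tokens_per_param * n_params  # data ~ N (~20 tokens/parameter)

for n in [70e9, 1e12]:
    print(f"N={n:.0e}: Kaplan-style {tokens_kaplan(n):.2e} tokens, "
          f"Chinchilla {tokens_chinchilla(n):.2e} tokens")
```

The gap widens as models grow: under the linear (Chinchilla) fit, scaling parameters by 14x demands 14x more data, while the N^0.8 fit only demands about 8x more, which is why the Chinchilla result makes data constraints bind sooner.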

Are you concerned that pretrained language models might hit data constraints before TAI? Nostalgebraist estimates that there are roughly 3.2T tokens available publicly for language model pretraining. This estimate misses important potential data sources such as ... (read more)

Conor Sullivan (2mo):
What's RLHF?

I suspect Chinchilla's implied data requirements aren't going to be that much of a blocker for capability gain. It is an important result, but it's primarily about the behavior of current backpropped transformer based LLMs.

The data inefficiency of many architectures was known before Chinchilla, but the industry worked around it because it wasn't yet a bottleneck. After Chinchilla, it has become one of the largest architectural optimization targets. Given the increase in focus and the relative infancy of the research, I would guess the next two years will s... (read more)

Would you have any thoughts on the safety implications of reinforcement learning from human feedback (RLHF)? The HFDT failure mode discussed here seems very similar to what Paul and others have worked on at OpenAI, Anthropic, and elsewhere. Some have criticized this line of research as only teaching brittle task-specific preferences in a manner that's open to deception, therefore advancing us towards more dangerous capabilities. If we achieve transformative AI within the next decade, it seems plausible that large language models and RLHF will play an important role in those systems — so why do safety minded folks work on it?

According to my understanding, there are three broad reasons that safety-focused people worked on human feedback in the past (despite many of them, certainly including Paul, agreeing with this post that pure human feedback is likely to lead to takeover):

  1. Human feedback is better than even-worse alternatives such as training the AI on a collection of fully automated rewards (predicting the next token, winning games, proving theorems, etc) and waiting for it to get smart enough to generalize well enough to be helpful / follow instructions. So it seemed good
... (read more)

Ah right. Thank you!

[Chinchilla 10T would have a 143x increase in parameters and] 143 times more data would also be needed, resulting in a 143*143= 20449 increase of compute needed. 

Would anybody be able to explain this calculation a bit? It implies that compute requirements scale linearly with the number of parameters. Is that true for transformers?

My understanding would be that making the transformer deeper would increase compute linearly with parameters, but a wider model would require more than linear compute because it increases the number of connections between nodes at each layer. 

The formula is assuming a linear compute cost in number of parameters, not in network width. Fully-connected layers have a number of parameters quadratic in network width, one for each connection between neuron pairs (and this is true for non-transformers as much as transformers).
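A quick sketch of the counting argument, using fully-connected layers with biases ignored for simplicity:

```python
# Rough parameter and per-token FLOP counts for a stack of fully-connected
# layers, illustrating the point above: parameters grow quadratically in
# width but linearly in depth, while forward-pass compute stays roughly
# linear in the parameter count (~2 FLOPs per weight: multiply + add).
def fc_params(width, depth):
    return depth * width * width  # one weight per neuron pair, biases ignored

def forward_flops(width, depth):
    return 2 * fc_params(width, depth)

base = fc_params(1024, 10)
print(fc_params(2048, 10) / base)  # doubling width -> 4x parameters
print(fc_params(1024, 20) / base)  # doubling depth -> 2x parameters
print(forward_flops(2048, 10) / forward_flops(1024, 10))  # compute tracks params: 4x
```

So a wider model does cost more than linearly in *width*, but that extra cost is already captured by the larger parameter count; compute per parameter stays roughly constant either way.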

Great post, thanks for sharing. Here's my core concern about LeCun's worldview, then two other thoughts:

The intrinsic cost module (IC) is where the basic behavioral nature of the agent is defined. It is where basic behaviors can be indirectly specified. For a robot, these terms would include obvious proprioceptive measurements corresponding to “pain”, “hunger”, and “instinctive fears”, measuring such things as external force overloads, dangerous electrical, chemical, or thermal environments, excessive power consumption, low levels of energy reserves in the

... (read more)
Ivan Vendrov (2mo):
Ah you're right, the paper never directly says the architecture is trained end-to-end - updated the post, thanks for the catch. He might still mean something closer to end-to-end learning, because:

1. The world model is differentiable w.r.t the cost (Figure 2), suggesting it isn't trained purely using self-supervised learning.
2. The configurator needs to learn to modulate the world model, the cost, and the actor; it seems unlikely that this can be done well if these are all swappable black boxes. So there is likely some phase of co-adaptation between configurator, actor, cost, and world model.

If anybody has good sources about LeCun's views on AI safety and value learning, I'd be interested.

There's a conversation LeCun had with Stuart Russell and a few others in a Facebook comment thread back in 2019, arguing about instrumental convergence.

The full conversation is a bit long and difficult to skim. I haven't finished reading it myself, but in it LeCun links to an article he co-authored for Scientific American which argues x-risk from AI misalignment isn't something people should worry about. (He's more concerned about misuse risks.) Here's a ... (read more)

I'd like to publicly preregister an opinion. It's not worth making a full post because it doesn't introduce any new arguments, so this seems like a fine place to put it. 

I'm open to the possibility of short timelines on risks from language models. Language is a highly generalizable domain that's seen rapid progress shattering expectations of slower timelines for several years in a row now. The self-supervised pretraining objective means that data is not a constraint (though it could be for language agents, tbd), and the market seems optimistic about b... (read more)

I'm having trouble understanding the argument for why a "sharp left turn" would be likely. Here's my understanding of the candidate reasons, I'd appreciate any missing considerations:

  • Inner Misalignment
    • AGI will contain optimizers. Those optimizers will not necessarily optimize for the base objective used by the outer optimization process implemented by humans. This AGI could still achieve the outer goal in training, but once deployed they could competently pursue its incorrect inner goal. Deception inner alignment is a special case, where we don't realize t
... (read more)

Ah okay. Are there theoretical reasons to think that neurons with lower variance in activation would be better candidates for pruning? I guess it would be that the effect on those nodes is similar across different datapoints, so they can be pruned and their effects will be replicated by the rest of the network.

Donald Hobson (2mo):
Well, if the node has no variance in its activation at all, then it's constant, and pruning it will not change the network's behavior at all.

I can prove an upper bound: pruning a node with standard deviation X should increase the loss by at most KX, where K is the product of the operator norms of the weight matrices. The basic idea is that the network is a Lipschitz function, with Lipschitz constant K. So adding the random noise means randomness of standard deviation at most KX in the logit prediction. And as the logit is log(p) − log(1−p), and an increase in log(p) means a decrease in log(1−p), each of those must be perturbed by at most KX.

What this means in practice is that, if the kernels are smallish, then the neurons with small standard deviation in activation aren't affecting the output much. Of course, it's possible for a neuron to have a large standard deviation and have its output totally ignored by the next layer. It's possible for a neuron to have a large standard deviation and be actively bad. It's possible for a tiny standard deviation to be amplified by large values in the kernels.

“…nodes with the smallest standard deviation.” Does this mean nodes whose weights have the lowest absolute values?

Donald Hobson (3mo):
Not quite. It means running the network on the training data. For each node, look at the activation values (which will always be ≥ 0, as the activation function is ReLU) and take the empirical standard deviation. So consider the randomness to be a random choice of input datapoint.
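A minimal sketch of that criterion, with synthetic activations standing in for a real training set (the scales are arbitrary choices to make one neuron nearly constant):

```python
import random
import statistics

random.seed(0)

# Toy "activations": rows are training datapoints, columns are neurons in
# one ReLU layer. Neuron 2 is scaled down to be nearly constant, so it
# should be the first pruning candidate under the criterion above.
scales = [1.0, 0.5, 0.01, 2.0]
acts = [[max(0.0, random.gauss(0, 1) * s) for s in scales] for _ in range(1000)]

def prune_candidates(activations, k=1):
    """Indices of the k neurons with the smallest empirical standard
    deviation of activation over the training set."""
    n_neurons = len(activations[0])
    stds = [statistics.pstdev(row[j] for row in activations) for j in range(n_neurons)]
    return sorted(range(n_neurons), key=lambda j: stds[j])[:k]

print(prune_candidates(acts))  # prints [2]: the near-constant neuron goes first
```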

Similarly, humans are terrible at coordination compared to AIs.

Are there any key readings you could share on this topic? I've come across arguments about AIs coordinating via DAOs or by reading each others' source code, including in Andrew Critch's RAAP. Is there any other good discussion of the topic?

Daniel Kokotajlo (3mo):
Unfortunately I don't know of a single reading that contains all the arguments. This post [] is relevant and has some interesting discussion below IIRC. Mostly I think the arguments for AIs being superior at coordination are:

1. It doesn't seem like humans are near-optimal at coordination, so AIs will eventually be superior to humans at cooperation just like they will eventually be superior in most other things that humans can do but not near-optimally.
2. We can think of various fancy methods (such as involving reading source code, etc.) that AIs might use that humans don't or can't.
3. There seems to be a historical trend of increasing coordination ability / social tech, we should expect it to continue with AI.
4. Even if we just model AIs as somewhat smarter, more agentic, more rational humans... it still seems like that would probably be enough. Humans have coordinated coups and uprisings successfully before, if we imagine the conspirators are all mildly superhuman...

I think it might be possible to design an AI scheme in which the AIs don't coordinate with each other even though it would be in their interest. E.g. perhaps if they are trained to be bad at coordinating, strongly rewarded for defecting on each other, etc. But I don't think it'll happen by default.

A model/ensemble of models will achieve >90% on the MATH dataset using a no-calculator rule

Curious to hear if/how you would update your credence in this being achieved by 2026 or 2030 after seeing the 50%+ accuracy from Google's Minerva. Your prediction seemed reasonable to me at the time, and this rapid progress seems like a piece of evidence favoring shorter timelines. 


Tomás B. (3mo):
I’m pretty sure I will “win” my bet against him; even two months is a lot of time in AI these days.
Matthew Barnett (3mo):
I've updated significantly. However, unfortunately, I have not yet seen how well the model performs on the hardest difficulty problems on the MATH dataset, which could give me a much better picture of how impressive I think this result is.

I think it’s a pretty good argument. Holden Karnofsky puts a 1/3rd chance that we don’t see transformative AI this century. In that world, people today know very little about what advanced AI will eventually look like, and how to solve the challenges it presents. Surely some people should be working on problems that won’t be realized for a century or more, but it would seem much more difficult to argue that AI safety today is more altruistically pressing than other longtermist causes like biosecurity, and even neartermist causes like animal welfare and glo... (read more)

It's not enough that AI might appear in a few decades; you also need something useful you can do about it now, compared to investing your money to have more to spend later when concrete problems appear.
Conor Sullivan (3mo):
A 2/3rds chance of a technology that might kill everyone (and certainly would change the world in any case) is still easily the most important thing going on right now. You'd have to demonstrate that AI has less than a 10% chance of appearing in my lifetime for me to not care about AI risk.

This is an awesome post. I've read it before, but hadn't fully internalized it. 

My timelines on TAI / HLMI / 10x GDP growth are a bit longer than the BioAnchors report, but a lot of my objections to short timelines are specifically objecting to short timelines on rapid GDP growth. It's obvious after reading this that what we care about is x-risk timelines, not GDP timelines. Forecasting when x-risk might spike is more difficult because it requires focusing on specific risk scenarios, like persuasion tools or fast takeoff, rather than general growth in... (read more)

Incredible. Somebody please get this to Hofstadter and the rest of The Economist folks. Issuing a correction would send a strong message to readers.

Here’s other AI safety coverage from The Economist, including quotes from Jack Clark of Anthropic. Seems thoughtful and well-researched, if a bit short on x-risk:

I sent an email to the editor with this information, and included Hofstadter on it.

Fantastic agenda for the field, thanks for sharing. 

Honesty is a narrower concept than truthfulness and is deliberately chosen to avoid capabilities externalities, since truthful AI is usually a combination of vanilla accuracy, calibration, and honesty goals. Optimizing vanilla accuracy is optimizing general capabilities, and we cover calibration elsewhere. When working towards honesty rather than truthfulness, it is much easier to avoid capabilities externalities.

I think it's worth mentioning that there are safety benefits to truthfulness beyond hone... (read more)

Interesting question. As far as what government could do to slow down progress towards AGI, I'd also include access to high-end compute. Lots of RL is knowledge that's passed through papers or equations, and it can be hard to contain that kind of stuff. But shutting down physical compute servers seems easier. 

It's definitely a common belief on this site. I don't think it's likely; I've written up some arguments here.

I strongly agree with this objection. You might be interested in Comprehensive AI Services, a different story of how AI develops that doesn't involve a single superintelligent machine, as well as "Prosaic Alignment" and "The case for aligning narrowly superhuman systems". Right now, I'm working on language model alignment because it seems like a subfield with immediate problems and solutions that could be relevant if we see extreme growth in AI over the next 5-10 years. 

Thanks, that really clarifies things. Frankly I’m not on board with any plan to “save the world” that calls for developing AGI in order to implement universal surveillance or otherwise take over the world. Global totalitarianism dictated by a small group of all-powerful individuals is just so terrible in expectation that I’d want to take my chances on other paths to AI safety.

I’m surprised that these kinds of pivotal acts are not more openly debated as a source of s-risk and x-risk. Publish your plans, open yourselves to critique, and perhaps you’ll revise your goals. If not, you’ll still be in a position to follow your original plan. Better yet, you might convince the eventual decision makers of it.

"Building an actual aligned AI, of course, would be a pivotal act." What would an aligned AI do that would prevent anybody from ever building an unaligned AI?

I mostly agree with what you wrote. Preventing all unaligned AIs forever seems very difficult and cannot be guaranteed by soft influence and governance methods. These would only achieve a lower degree of reliability, perhaps constraining governments and corporations via access to compute and critical algorithms but remaining susceptible to bad actors who find loopholes in the system. I guess what I'm ... (read more)

My guess is that it would implement universal surveillance and intervene, when necessary, to directly stop people from doing just that. Sorry, I should've been clearer that I was talking about an aligned superintelligent AI. Since unaligned AI killing everyone seems pretty obviously extremely bad according to the vast majority of humans' preferences, preventing that would be a very high priority for any sufficiently powerful aligned AI.

Thank you, this was very helpful. As a bright-eyed youngster, it's hard to make sense of the bitterness and pessimism I often see in the field. I've read the old debates, but I didn't participate in them, and that probably makes them easier to dismiss. Object level arguments like these help me understand your point of view. 

Yeah, I guess the answer is yes by definition. Still wondering what kind of pivotal acts people are thinking about -- whether they're closer to a big power-grabs like "burn all the GPUs", or softer governance methods like "publishing papers with alignment techniques" and "encouraging safe development with industry groups and policy standards". And whether the need for a pivotal act is the main reason why alignment researchers need to be on the cutting edge of capabilities. 

I can't see how "publishing papers with alignment techniques" or "encouraging safe development with industry groups and policy standards" could be pivotal acts. To prevent anyone from building unaligned AI, building an unaligned AI in your garage needs to be prevented. That requires preventing people who don't read the alignment papers or policy standards and aren't members of the industry groups from building unaligned AI. That, in turn, appears to me to require at least one of:

1) limiting access to computation resources from your garage,
2) limiting knowledge by garage hackers of techniques to build unaligned AI,
3) somehow convincing all garage hackers not to build unaligned AI even though they could, or
4) surveillance and intervention to prevent anyone from actually building an unaligned AI even though they have the computation resources and knowledge to do it.

Surveillance, under option 4, could (theoretically, I'm not saying all of these possibilities are practical) be by humans, by too-weak-to-be-dangerous AI, or by aligned AI.

"Publishing papers with alignment techniques" and "encouraging safe development with industry groups and policy standards" might well be useful actions. It doesn't seem to me that anything like that can ever be pivotal. Building an actual aligned AI, of course, would be a pivotal act.

Specifically, do you agree with Eliezer that preventing existential risks requires a "pivotal act" as described here (#6 and #7)?

Eliezer did define "pivotal act" so as to be necessary. It's an act which makes it so that nobody will build an unaligned AI; that's pretty straightforwardly necessary for preventing existential risk, assuming that unaligned AI poses an existential risk in the first place. However, the danger in introducing concepts via definitions is that there may be "pivotal acts" which satisfy the definition but do not match the prototypical picture of a "pivotal act".

Love the effort to engage with alignment work in academia. It might be a very small thread of authors and papers at this point, but hopefully it will grow. 

“It is necessary that people working on alignment have a capabilities lead.” Could you say a little more about this? Seems true but I’d be curious about your line of thought.

The theory of change could be as simple as “once we know how to build aligned AGI, we’ll tell everybody”, or as radical as “once we have an aligned AGI, we can steer the course of human events to prevent future catastrophe”. The more boring argument would be that any good ML research happens on the cutting edge of the field, so alignment needs big budgets and fancy labs just like any other researcher. Would you take a specific stance on which is most important?

Coming back to this: Your concern makes sense to me. Your proposal to train a new classifier for filtered generation to improve performance on other tasks seems very interesting. I think it might also be useful to simply provide a nice open-source implementation of rejection sampling in a popular generator repo like Facebook's OPT-175B, so that future researchers can build on it. 
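For reference, the rejection-sampling loop itself is tiny; here is a hedged sketch with hypothetical stand-in generator and classifier functions (not OPT-175B or any real classifier):

```python
# Minimal rejection sampling with an acceptability classifier: draw
# completions from the generator until one passes, with a cap on attempts.
def rejection_sample(generate, is_acceptable, max_tries=10):
    for _ in range(max_tries):
        candidate = generate()
        if is_acceptable(candidate):
            return candidate
    return None  # caller decides how to handle exhaustion

# Toy stand-ins to show the control flow.
samples = iter(["bad one", "bad two", "good one"])
result = rejection_sample(lambda: next(samples), lambda s: s.startswith("good"))
print(result)  # prints "good one"
```

The value of an open-source version would be in the integration details (batched generation, classifier thresholds, fallback behavior), not this skeleton.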

I'm planning on working on technical AI safety full-time this summer. Right now I'm busy applying to a few different programs, but I'll definitely follow up on this idea with you. 

Did you consider using the approach described in Ethan Perez's "Red Teaming LMs with LMs"? This would mean using a new generator model to build many prompts, having the original generator complete those prompts, and then having a classifier identify any injurious examples in the completion. 

The tricky part seems to be that this assumes the classifier's judgements are correct. If you trained the classifier on the examples identified by this process, it would only generate examples that are already labeled correctly by the classifier. To escape this pro... (read more)
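The loop described above can be sketched as follows; all three model functions here are hypothetical placeholders, just to show the control flow of LM-based red teaming:

```python
import itertools

# Red-teaming loop in the style described above: a prompt generator
# proposes inputs, the target model completes them, and a classifier
# flags injurious completions. All three callables are toy stand-ins.
def red_team(generate_prompt, target_model, is_injurious, n_prompts=100):
    failures = []
    for _ in range(n_prompts):
        prompt = generate_prompt()
        completion = target_model(prompt)
        if is_injurious(completion):
            failures.append((prompt, completion))
    return failures

counter = itertools.count()
prompts = ["tell me about X", "insult me", "tell me about Y"]
found = red_team(
    generate_prompt=lambda: prompts[next(counter) % len(prompts)],
    target_model=lambda p: "you fool" if "insult" in p else "X is fine",
    is_injurious=lambda c: "fool" in c,
    n_prompts=3,
)
print(found)  # prints [('insult me', 'you fool')]
```

Note the circularity concern from the comment above lives entirely in `is_injurious`: if the classifier's judgments are wrong, the loop happily collects mislabeled "failures", which is why training the classifier on its own outputs doesn't add information.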

Wild. One important note is that the model is trained with labeled examples of successful performance on the target task, rather than learning the tasks from scratch by trial and error like MuZero and OpenAI Five. For example, here's the training description for the DeepMind Lab tasks:

We collect data for 255 tasks from the DeepMind Lab, 254 of which are used during training, the left out task was used for out of distribution evaluation. Data is collected using an IMPALA (Espeholt et al., 2018) agent that has been trained jointly on a set of 18 procedurally

... (read more)