Stephen McAleese

Computer science master's student interested in AI and AI safety.

Thank you for the insightful comment.

On the graph of alignment difficulty and cost, I think the shape depends on two things: the inherent increase in alignment cost and the degree of automation we can expect. This is similar to the idea of the offence-defence balance.

In the worst case, the cost of implementing alignment solutions increases exponentially with alignment difficulty and then maybe automation would lower it to a linear increase.

In the best case, automation covers all of the costs associated with increasing alignment difficulty, so the graph is flat in terms of human effort: more advanced alignment solutions aren't any harder to implement than earlier, simpler ones.

The rate of progress on the MATH dataset is incredible and faster than I expected.

The MATH dataset consists of competition math problems for high school students and was introduced in 2021. According to a blog post by Jacob Steinhardt (one of the dataset's authors), 2021 models such as GPT-3 solved ~7% of questions, a Berkeley PhD student solved ~75%, and an IMO gold medalist solved ~90%.

The blog post predicted that ML models would achieve ~50% accuracy on the MATH dataset by June 30, 2025 and ~80% accuracy by 2028.

But recently (September 2024), OpenAI released its new o1 model, which achieved ~95% on the MATH dataset.

So it seems like we're getting 2028 performance on the MATH dataset already in 2024.

Quote from the blog post:

"If I imagine an ML system getting more than half of these questions right, I would be pretty impressed. If they got 80% right, I would be super-impressed. The forecasts themselves predict accelerating progress through 2025 (21% in 2023, then 31% in 2024 and 52% in 2025), so 80% by 2028 or so is consistent with the predicted trend. This still just seems wild to me and I'm really curious how the forecasters are reasoning about this."

Thank you for writing this insightful and thorough post on different AI alignment difficulties and possible probability distributions over alignment difficulty levels.

The cost of advancing alignment research rises faster at higher difficulty levels: much more effort and investment is required to produce the same amount of progress towards adequacy at level 7 than at level 3. This cost increases for several reasons. Most obviously, more resources, time, and effort are required to develop and implement these more sophisticated alignment techniques. But there are other reasons, such as that higher level failures cannot yet be experimentally demonstrated, so developing mitigations for them has to rely on (possibly unrepresentative) toy models instead of reacting to the failures of current systems.

Note that although implementing better alignment solutions would probably be more costly, advancements in AI capabilities could flatten the cost curve by automating some of the work. For example, constitutional AI seems significantly more complex than regular RLHF, but it might not be much harder for organizations to implement due to partial automation (e.g. RLAIF). So even if future alignment techniques are much more complex than today, they might not be significantly harder to implement (in terms of human effort) due to increased automation and AI involvement.

Nice paper! I found reading it quite insightful. Here are some key extracts from the paper:

Improving adversarial robustness by classifying several down-sampled noisy images at once:

"Drawing inspiration from biology [eye saccades], we use multiple versions of the same image at once, downsampled to lower resolutions and augmented with stochastic jitter and noise. We train a model to classify this channel-wise stack of images simultaneously. We show that this by default yields gains in adversarial robustness without any explicit adversarial training."
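
To make the idea in this extract concrete, here is a minimal sketch of how such a stack of images might be built. It is not the paper's code: the function names, the average-pooling downsampler, the nearest-neighbour upsampling, and the default factors and noise level are all my own assumptions, and I have left out the spatial jitter the quote mentions. It uses plain Python lists to stay dependency-free.

```python
import random


def downsample(img, factor):
    """Average-pool a square grayscale image (list of rows) by an integer factor."""
    n = len(img) // factor
    return [
        [
            sum(img[i * factor + di][j * factor + dj]
                for di in range(factor) for dj in range(factor)) / factor ** 2
            for j in range(n)
        ]
        for i in range(n)
    ]


def upsample(img, factor):
    """Nearest-neighbour upsample so every copy has the original size again."""
    return [
        [img[i // factor][j // factor] for j in range(len(img[0]) * factor)]
        for i in range(len(img) * factor)
    ]


def add_noise(img, sigma, rng):
    """Add independent Gaussian noise to every pixel."""
    return [[px + rng.gauss(0.0, sigma) for px in row] for row in img]


def multi_resolution_stack(img, factors=(1, 2, 4), sigma=0.05, seed=0):
    """Return one noisy, reduced-resolution copy of img per factor.

    Assumes the image side length is divisible by every factor.
    The result is indexed as stack[channel][row][col].
    """
    rng = random.Random(seed)
    stack = []
    for f in factors:
        restored = upsample(downsample(img, f), f)
        stack.append(add_noise(restored, sigma, rng))
    return stack
```

Calling multi_resolution_stack on an 8x8 image with the default factors gives three 8x8 channels, which a classifier would then take as a single multi-channel input.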

Improving adversarial robustness by using an ensemble of intermediate layer predictions:

"Using intermediate layer predictions. We show experimentally that a successful adversarial attack on a classifier does not fully confuse its intermediate layer features (see Figure 5). An image of a dog attacked to look like e.g. a car to the classifier still has predominantly dog-like intermediate layer features. We harness this de-correlation as an active defense by CrossMax ensembling the predictions of intermediate layers. This allows the network to dynamically respond to the attack, forcing it to produce consistent attacks over all layers, leading to robustness and interpretability."
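
The extract does not spell out how CrossMax combines the layers, so the following is only a rough sketch of a robust aggregation in that spirit, based on my understanding rather than on the quoted text. The two normalization steps, the use of the median, and the name crossmax_style are assumptions; the paper should be consulted for the actual definition.

```python
from statistics import median


def crossmax_style(logits):
    """Aggregate logits[p][c] (predictor p, class c) into one score per class."""
    n_classes = len(logits[0])
    # Step 1 (assumed): per predictor, subtract its largest logit so that no
    # single predictor can dominate through sheer scale.
    step1 = [[z - max(row) for z in row] for row in logits]
    # Step 2 (assumed): per class, subtract the largest value across predictors.
    col_max = [max(row[c] for row in step1) for c in range(n_classes)]
    step2 = [[row[c] - col_max[c] for c in range(n_classes)] for row in step1]
    # Step 3 (assumed): take the median over predictors, so one fooled
    # predictor cannot decide the outcome on its own.
    return [median(row[c] for row in step2) for c in range(n_classes)]


# Hypothetical logits: two layers prefer class 0, while one attacked layer
# puts a very large logit on class 1.
example_logits = [[5, 1, 0], [4, 0, 1], [0, 50, 0]]
scores = crossmax_style(example_logits)
predicted_class = scores.index(max(scores))
```

With these made-up numbers, a plain average of the raw logits would favour class 1 because of the single extreme value, whereas the median-based aggregation still picks class 0.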

I suspect the desire for kids/lineage is really basic for a lot of people (almost everyone?)

This seems like an important point. One of the arguments for the inner alignment problem is that evolution intended to select humans for inclusive genetic fitness (IGF) but humans were instead motivated by other goals (e.g. seeking sex) that were strongly correlated with IGF in the ancestral environment.

Then, when humans' environment changed (e.g. the invention of birth control), the correlation between these proxy goals and IGF broke down, resulting in low fitness and inner misalignment.

However, this statement seems to suggest that modern humans really have internalized IGF as one of their primary objectives and that they're inner aligned with evolution's outer objective.

I think the Zotero PDF reader has a lot of similar features that make the experience of reading papers much better:

  • It has a back button so that when you click on a reference link that takes you to the references section, you can easily click the button to go back to the text.
  • There is a highlight feature for marking parts of the text, which is convenient when you want to come back and skim the paper later.
  • There is a "sticky note" feature allowing you to leave a note in part of the paper to explain something.

I was thinking of doing this, but the ChatGPT web app has many valuable features that are only available there, such as Code Interpreter, PDF uploads, DALL-E, and custom GPTs, so I still use ChatGPT Plus.

Thank you for the blog post. I thought it was very informative regarding the risk of autonomous replication in AIs.

It seems like the Centre for AI Security is a new organization.

I've seen the announcement post on its website. Maybe it would be a good idea to cross-post it to LessWrong as well.

Is MIRI still doing technical alignment research as well?

This is a brilliant post, thanks. I appreciate the breakdown of different types of contributors and how orgs have expressed the need for some types of contributors over others.
