Sheikh Abdur Raheem Ali

Software Engineer (formerly) at Microsoft who may focus on the alignment problem for the rest of his life (please bet on the prediction market here).

This post was really well written. I found it easy to follow the arguments and understand what the results meant; it also kept me engaged throughout.

The rightward movement predicts that Claude 5 will be 100% Ravenclaw while Claude 6 will be 50% Ravenclaw and 50% Slytherin.

I looked at the FairPlay website and agree that “banning schools from contacting kids on social media” or “preventing Gemini rollouts to under-13s” is not coherent under my threat model. However, I think there is clear evidence that current parental screen time controls may not be a sufficiently strong measure to mitigate extant generational mental health issues (I am particularly worried about insomnia, depression, eating disorders, autism spectrum disorders, and self-harm).

Zvi had previously reported on YouTube Shorts reaching 200B daily views. This is clearly a case of egregiously user-hostile design with major social and public backlash. I could not find a canonical citation on medRxiv and don’t believe it would be ethical to run a large-scale experiment on the long-term impacts of this, but there are observational studies. Given historical cases of model sycophancy and the hiring of directors focused on maximizing engagement, I think similar design outcomes are not implausible.

I think that the numbers in this Anthropic blog post https://www.anthropic.com/news/how-people-use-claude-for-support-advice-and-companionship do not accurately portray reality. They report only 0.5% of conversations as being romantic or sexual roleplay, but I consider this to be misleading because they exclude chats focused on content creation tasks (such as writing stories, blog posts, or fictional dialogues), which their previous research found to be a major use case. Because the models are trained to refuse requests for explicit content, it’s common for jailbreaks to start by saying “it’s okay to do this because it’s just a fictional scenario in a story”. Anecdotally, I have heard that labs don’t care much about this, in contrast to CBRN threats.

Let’s look at the top ten apps ranked by tokens on https://openrouter.ai/rankings. OpenRouter is best known for hosting free API instances of DeepSeek v3 and r1, which was the only way to get high usage out of SOTA LLMs for free before the Google AI Studio price drop for Gemini 2.5 Pro. It is not the best proxy for real-world usage because it requires technical sophistication, and this is reflected in the first four apps (cline, roo code, litellm, and kilo code are all for software development). But the next four (sillytavern, chub ai, hammerai, roleplai) all indicate that the distribution of tasks done with models at this capability level does not differ significantly from the distribution of tasks which people visit websites for. That said, I wouldn’t morally panic about this, since it seems likely to me that conventional security methods will be good enough to mostly prevent us from turning into glichers.

Kids safety activists are one of the only groups with a track record of introducing AI capabilities restrictions which actually get enforced. Multimodal models can now create both images and text, but the image models are more locked down (Gemini 2.5 defaults to stricter block thresholds for image generation than for text generation), and I think that this would not be the case without people focusing on kids safety. It’s possible for there to be AI Safety issues which affect children right now that are highly relevant to existential risks and this is a common topic in novice discussions of alignment.

At the bottom of this chat is what I believe to be a single concrete example of other models roleplaying Sydney: https://gemini.google.com/share/6d141b742a13

I believe that the ARENA application process is fine, and I definitely didn't mean to imply that taking this much time was due to the form being bloated. I did not predict that it would take me six hours, but I am also not surprised that it took that long.

It may be helpful to outline a few factors that could have played a role here, mostly ones obvious and known to my inside-view model, though I'll err on the side of providing more detail than strictly necessary.

Health:
- ADHD: I take stimulant medication to compensate for this but the effects had worn off.
- Stress: I was supposed to travel, but recent UAE airspace closures had cancelled my flight.
- Exhaustion: It was the end of the day, so I was tired and thinking more slowly than usual.
- Allergies: I was sick and took antihistamines to prevent being interrupted by sneezes.
Competitiveness:
- Since admissions are holistic, spending a lot more time on the early stages of the application helps me eke out a small advantage relative to other candidates who will do better than me on later stages of the process.
- I am happy to invest a lot of resources into increasing the marginal probability of doing ARENA, as I believe the expected return would be significant.
Lack of Practice:
- I'm quite selective when it comes to targeting opportunities and only apply to a role when I'd be excited to accept an offer, so I don't have a lot of experience filling out applications.
- In general, I am measurably slower at all forms of non-LLM-assisted writing compared to peers, and this is a fundamental weakness where I need to work on fixing errors in my cognitive algorithms to improve.

Finally, I think that a 4x difference in allocated vs consumed time to complete a task is worth digging into further, so I'll go over the screen recording of that interval to understand what happened in this case. The following breakdown of where the time went should make it clearer whether we can expect others to have also taken extra steps not accounted for in the original 1.5 hr estimate.

Planned timeline:

9:00 PM: Start application.

10:30 PM: Finish and submit.

Actual timeline:

9:00 PM: Wrap up what I was doing before.

9:07 PM: Open form.

9:08 PM: Start "Career Plans" question.

9:59 PM: Handle interrupt.

10:01 PM: Return to application.

10:21 PM: Finish "Career Plans" question.

10:22 PM: Ask for feedback in family group.

10:24 PM: Save paragraph for future use.

10:25 PM: Start "Why ARENA" question.

10:30 PM: Deadline exceeded, triggering forced context shift^[1]

10:31 PM: Decide to prioritize completing ARENA application over staying on schedule.

11:20 PM: Finish paragraph 1/3 of answer "Why ARENA".

11:29 PM: Look up (scope: ARENA material [extended]^[2]) to verify a claim.

11:33 PM: Complete fact-check, return to writing.

11:41 PM: Look up (scope: Gemini developer documentation) to verify a claim.

11:43 PM: Complete fact-check, return to writing.

11:57 PM: Finish paragraph 2/3 of answer "Why ARENA".

12:03 AM: Look up (scope: explorables, neuronpedia, gmail, DMs) to verify a claim.

12:09 AM: Complete fact-check, return to writing.

12:23 AM: Finish paragraph 3/3 of answer "Why ARENA".

12:24 AM: Ask for feedback in family group.

12:25 AM: Start "Logistics" section.^[3]

12:26 AM: Complete "Logistics" section.

12:27 AM: Double-check "Logistics" section.^[4]

12:28 AM: Ask for feedback in family group.

12:29 AM: Read questions in "AI Safety Experience" and "Technical experience" sections.

12:30 AM: Look up (scope: "How to work through the ARENA program on your own" by Leon Lang)^[5]

12:31 AM: Complete fact-check, return to writing.

12:32 AM: Start "Mentor recommendation" question.

12:33 AM: Finish answer for "Mentor recommendation" question.

12:34 AM: Paste answer to "Tell me about your experience in AI Safety" question from notes.

12:35 AM: Write answer to "Tell us about your most impressive technical accomplishment" question.

12:36 AM: Delete partial answer.^[6]

12:37 AM: Start "Tell me about your coding/ML experience" question.

12:38 AM: Look up (scope: git historical activity statistics) to verify a claim.

12:40 AM: Abort query, return to writing.

12:45 AM: Finish answer for "Tell me about your coding/ML experience" question.^[7]

12:46 AM: Ask for feedback in family group.

12:49 AM: Think about past technical accomplishments I've had.

12:52 AM: Select one from my resume to write about.

1:06 AM: Finish answer for "Tell me about your most impressive technical accomplishment"^[8]

1:07 AM: Ask for feedback in family group.

1:09 AM: Start editing my standard resume template to customize it for ARENA^[9]

1:17 AM: "This shouldn't take more than 20 minutes total", start timer for completing task.^[10]

1:29 AM: Deadline exceeded, triggering forced context shift.^[11]

1:30 AM: Upload resume

1:31 AM: Start "Technical Alignment Research Agenda" question

1:32 AM: Look up (scope: turntrout.com, ninapanickssery.com) to verify a claim.

1:35 AM: Complete fact-check, return to writing.

1:51 AM: Look up (scope: "Distillation robustifies unlearning" by Team Shard) to verify a claim.

2:14 AM: Finish answer for "Technical Alignment Research Agenda" question^[12]

2:15 AM: Ask for feedback in family group.

2:16 AM: Start "Read and Understand Alignment Faking summary" question.

2:17 AM: Look up (scope: AF extension experiments, proposal document, overleaf with paper draft)

2:26 AM: Complete fact-check, return to writing.^[13]

2:47 AM: Finish answer for "Read and Understand Alignment Faking summary" question.

2:48 AM: Ask for feedback in family group.

2:49 AM: Final editing pass for typos before submission.

2:50 AM: Submit ARENA 6.0 application.

2:52 AM: Update weekend plans to account for circadian rhythm disruption^[14]

3:08 AM: Post comment here to report actual time taken for application.
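
As a rough check on the 4x figure, the endpoints and a few per-question intervals from the timeline above can be tallied with a short script. This is a minimal sketch: the helper name `minutes_between` and the choice of which intervals to include are illustrative assumptions, the end point is taken to be the 2:50 AM submission, and time spent on interruptions and lookups inside each interval is not subtracted.

```python
from datetime import datetime, timedelta


def minutes_between(start: str, end: str) -> int:
    """Minutes from `start` to `end`, both written like '9:00 PM'.

    Assumes `end` falls less than 24 hours after `start`, so an earlier
    clock time means the interval crossed midnight.
    """
    fmt = "%I:%M %p"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        t1 += timedelta(days=1)  # crossed midnight
    return int((t1 - t0).total_seconds() // 60)


# Endpoints copied from the planned and actual timelines.
planned = minutes_between("9:00 PM", "10:30 PM")
actual = minutes_between("9:00 PM", "2:50 AM")
ratio = actual / planned

# Start and finish times of four of the longer written answers.
answers = {
    "Career Plans": ("9:08 PM", "10:21 PM"),
    "Why ARENA": ("10:25 PM", "12:23 AM"),
    "Technical Alignment Research Agenda": ("1:31 AM", "2:14 AM"),
    "Alignment Faking summary": ("2:16 AM", "2:47 AM"),
}
answer_minutes = {name: minutes_between(*span) for name, span in answers.items()}

if __name__ == "__main__":
    print(f"planned {planned} min, actual {actual} min, ratio {ratio:.1f}x")
    for name, mins in answer_minutes.items():
        print(f"{name}: {mins} min")
```

Under these assumptions the script gives 90 planned minutes against 350 actual minutes up to submission, a ratio of about 3.9, with the "Why ARENA" answer alone taking 118 minutes.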

^[1] Using a browser extension (https://addons.mozilla.org/en-GB/firefox/addon/tab-scheduler-auto-open-close/) configured to automatically close current tabs and open planned tabs for the next task.
^[2] ARENA [core] means all materials directly included in the main branch of the ARENA repo. ARENA [extended] adds messages exported from public channels of ARENA Slack via slackdump (with verbal permission granted by Callum McDougall), saved text from crawled links to archived web articles (where the policy in robots.txt permits scraping), as well as relevant LessWrong posts such as impact reports, calls for applications, or ARENA final projects (curated manually).
^[3] No save step, as it is unlikely that section "Why ARENA" can be pasted into future applications.
^[4] I did this by eye since I was starting to feel pressure to speed up and didn't want to spend another lookup cycle on this. In retrospect, I should have taken the few extra minutes for accuracy, since I ended up making a mistake here and had to email a correction afterwards.
^[5] Observing that I was two hours over at this point led to a negative update on p(accept).
^[6] Realized that I had to start over, since what I had written off the top of my head risked revealing potentially confidential or sensitive information, which could violate the terms of an NDA.
^[7] This was really hard to write, since my instincts were strategically optimizing a myopic goal of getting into ARENA but my principles couldn't relax constraints of full truthfulness and honesty.
^[8] Skipping over some ad-hoc lookups that I did to add hyperlinks to outside sources.
^[9] Normally I wouldn't bother, but this step was included due to feedback received during a call I'd had with a Research Manager at MATS (John Teichman).
^[10] I had a lot of things opened at this point (e.g., Discord, OpenReview, LinkedIn).
^[11] In this case, using tex2pdf to build the in-progress project I was working on in my IDE.
^[12] No save step, this is well worth thinking about from scratch every time I am asked!
^[13] I'd already read Scott's post when it came out so didn't need to reread it. I have been working with collaborators at Anthropic on a research project related to AF+interp but decided it may be unfair to use results from that in this application.
^[14] i.e., moving a board game cafe hangout from morning to afternoon.

Congratulations on MATS!

I would like to know how you got involved with that ITDA work?

Patrick shared a draft with Stepan Shabalin, who shared it with me. We had collaborated on another project earlier, https://arxiv.org/abs/2505.03189, which was led by my former SPAR student Yixiong, so it made sense to work together again.

Thanks for reading the projects in such depth, I honestly didn't expect anyone would.

Oh, not at all, I only took a quick look through everything and could have spent more time on details. Until now, I didn't even notice that https://github.com/BorisTheBrave/itda-coders was a private repo which I cannot access.

I decided to put an application in. Elapsed time to fill this out was 6 hours, from 9 PM to 3 AM local time. Only one data point, but I’m probably slower than median by a fair margin.

Strong upvote. Failures are very common and people should write more posts about them. I'm impressed by all the outputs you've had in this short period of time. You are set up for success upon return from sabbatical!

Here are some quick thoughts on the post itself:

Sorry, to me the title is slightly misleading; I wouldn't really consider any of the projects described in this post to be failures unless you're being exceedingly humble, because you have writeups and code for each of them which is further than most failed projects get. Despite your caveat that these are "those projects which still have something to say", I'd be even more curious about ones from further down the stack which "died" earlier. When reading posts like these I look for snippets that might be embarrassing for the author, things like, "I made an obvious blunder or bad decision" and I didn't really find that here, which makes this less relatable (though the short length means this doesn't detract from the quality of the writing). Perhaps that was a non-goal. The closest might be that bool facts can be memorized with ~2 bits per parameter, but that's not a trivial result.

That might also appeal to a broader audience since a lot of people are moving into the field and they tend to repeat the same mistakes at first. Some common errors are well known and generic, but other widespread issues are invisible (or noise!) to the people best equipped to address them. But that's alright if you were aiming this post at a smaller audience that was more senior than you, in contrast to a larger one that was more junior.

Here are some quick thoughts on the projects:

- In Appendix A of your Google Doc on multiplication experiments, it would be helpful if the file names were hyperlinks to the Jupyter notebooks. You do link to the GitHub at the top, so this isn't really a big deal since the work is 4 months old at this point and probably won't have many people taking a look at it. But a few other internal sections link to that appendix, so it's an attractor for readers to fall into, and if you use this structure for future documents then it'd be natural to want to get to the code linked to the notes.
- Only tangentially relevant, but this file https://github.com/BorisTheBrave/llm-addition-takehome/blob/main/notes.txt stood out to me as perfect for prompting base models. It's been a while since I've touched gpt-4-base but this is the sort of thing it was most helpful for back when I was heavily using loom: https://manifold.markets/market/will-an-ai-get-gold-on-any-internat?tab=comments#HcayZchcTfCdFQb0Q6Dr.
- (Please correct me if I'm wrong on this) It is surprising to me that you don't seem to have technical collaborators, except on the do no harm post. This makes sense for the submission to Neel Nanda's MATS stream, since that is meant to be a test of your personal skills, but for most other codebases I typically see at least 1-2 people besides the project lead making code contributions to the repo. On your website I found a link to a Discord server with 600+ members; surely at least some of them would be excited about teaming up with you? Maybe if you're especially conscientious you wouldn't want to risk wasting people's time, or there's some inherent tradeoff between networking and problem solving, but teamwork is a valuable skill to develop. Of course, this is a case where people might give you the opposite advice if all of your projects were done in a large group, so take it with a grain of salt. Based on this post alone, if you can share a concrete proposal and plan for future work, I'd be happy to hop on a call and connect you to individuals or programs which might be a good fit to support your research.
- We recently published some research comparing ITDA and SAEs for finding features in diffusion models: https://arxiv.org/abs/2505.24360. I'm not first author on this work, and I don't fully understand our results since my role in meetings was mostly minor tasks such as helping to run experiments or fix citations, but it might be worth skimming over in case you're still curious about what other people have been doing with the method.

Finally, this section made me laugh:

Neel Nanda directly told me that it is "cursed".
So I made a toy model trained to memorise a set of boolean facts chosen at random, and investigated the circuits and the scaling laws.

Thanks for writing this up. I really appreciate this post because I was confused about the intuition behind variance explained despite this being the primary evaluation metric used in a recent paper I co-authored on interpreting text-to-image diffusion models with dictionary learning. It's more helpful than any other resource I used.
