Replying to: My journey to the microwave alternate timeline

I currently live somewhere with just a microwave and hotplate, so this is intriguing to me. One question - when Marie says "100% power" in a cookbook from 1985, do you need to tone it down somewhat for modern microwaves?

Replying to: Toss a bitcoin to your Lightcone – LW + Lighthaven's 2026 fundraiser

Jay Bailey, 1mo

And we have made it happen! Thanks to both Aloekine and Lightcone :)

Replying to: Toss a bitcoin to your Lightcone – LW + Lighthaven's 2026 fundraiser

Jay Bailey, 1mo

I am indeed - shall PM you and we can make it happen

Replying to: Moving Goalposts: Modern Transformer Based Agents Have Been Weak ASI For A Bit Now

Jay Bailey, 2mo

The problem with MPI is that "Anyone can trivially spend a small amount of money and no effort to make a larger amount of money" feels like the kind of thing that quickly gets saturated. If you don't have an MPI only because all the other MPIs out there have eaten up the free profits, it's not a great measure of capabilities.

I like the other three, but I also wonder how close to TUI we already are. It wouldn't shock me that much if we were already most of the way to TUI and the only reason this didn't lead to robotics being solved is the AI itself has limitations...

Replying to: Moving Goalposts: Modern Transformer Based Agents Have Been Weak ASI For A Bit Now

Jay Bailey, 2mo

Interesting. You have convinced me that I need a better definition for this approximate level of capabilities. I do expect AI to advance faster than legacy organisations will adapt, such that it would be possible to have a world of "10% of jobs can be done by AI" but the AI capabilities need to be higher than "Can replace 10% of jobs in 2022".

Replying to: Moving Goalposts: Modern Transformer Based Agents Have Been Weak ASI For A Bit Now

Jay Bailey, 2mo

So, my understanding of ASI is that it's supposed to mean "A system that is vastly more capable than the best humans at essentially all important cognitive tasks." Currently, AIs are indeed more capable, possibly even vastly more capable, than humans at a bunch of tasks, but they are not more capable at all important cognitive tasks. If they were, they could easily do my job, which they currently cannot.

Two terms I use in my own head that largely correlate with my understanding of what people meant by the old AGI/ASI:

"Drop in remote worker" - A system with the capabilities to automate a large chunk of remote workers (I've used 50% before,...

Replying to: Toss a bitcoin to your Lightcone – LW + Lighthaven's 2026 fundraiser

Jay Bailey, 2mo

The Australian financial year starts and ends in the middle of the year, so it makes no difference to me if we do it in 2026. Let's make it happen :)

Replying to: Toss a bitcoin to your Lightcone – LW + Lighthaven's 2026 fundraiser

Jay Bailey, 2mo

I live in Australia, so I lack tax advantage for this. I am likely to still donate 1k or so if I can't get tax advantages, but before doing so I wanted to check if anyone wanted to do a donation swap where I donate to any of these Australian tax-advantaged charities largely in global health in exchange for you donating the same amount to charities that are tax-advantaged in the US.

I am willing to donate up to 3k USD to MIRI, and 1.5k USD to Lightcone if I'm doing so tax-advantaged. If nobody takes me up on this I'll still probably donate 2k USD to MIRI and 1k USD to Lightcone. I will accept offers to match only one of these two donations.

Also open to any alternate ways to gain tax advantage from Australia that are currently unknown to me in order to achieve this same outcome.

Jay Bailey, 2mo

Fair point, if you add that you can't assess it at less than you paid for it, this problem goes away.

Jay Bailey, 2mo

Wouldn't the equilibrium here trend towards a bunch of wasted labor where I deliberately lowball the value of the land, and then if someone offers a larger amount, I just say no and then start paying the larger amount, thus having a potential to pay less tax but losing nothing if I'm called out for it? No downside to me personally, and if this became common, it'd be harder to legitimately buy stuff. Seems like you'd need to pay some sort of fee to the entity credibly offering this larger amount to make it worth it.
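To make the incentive concrete, here is a toy payoff calculation. It is only a sketch of the scheme as I understand it from this thread: the function name, the 1% rate, the dollar figures, and the `floor` parameter (standing in for a "can't assess below what you paid" rule) are all made up for illustration.

```python
def annual_tax(declared, best_offer=None, rate=0.01, floor=0):
    """Tax owed under a hypothetical self-assessment scheme.

    declared:   value the owner reports for the land
    best_offer: highest outside offer, if any; the owner may refuse it
                but is then taxed on that amount instead
    rate:       assumed flat annual tax rate (illustrative)
    floor:      minimum allowed assessment, e.g. the purchase price
    """
    assessed = max(declared, floor)
    if best_offer is not None and best_offer > assessed:
        # Owner says no to the sale and starts paying on the offer.
        assessed = best_offer
    return rate * assessed

true_value = 500_000  # what the owner actually thinks the land is worth
lowball = 300_000     # deliberately understated declaration

honest_quiet = annual_tax(true_value)
lowball_quiet = annual_tax(lowball)
honest_challenged = annual_tax(true_value, best_offer=true_value)
lowball_challenged = annual_tax(lowball, best_offer=true_value)
lowball_floored = annual_tax(lowball, floor=true_value)
```

With no offer, the lowballer pays less. With an offer at the true value, they pay exactly what the honest owner pays. So lowballing is never worse for the owner in this toy model, unless something like the floor (or a fee to whoever makes the offer) changes the payoffs.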

I recently read Anthropic's new paper on introspection in LLMs. (https://www.anthropic.com/research/introspection)

In short, they were able to:

1. Extract what they called an ALL CAPS vector, the difference between a prompt with ALL CAPS and the same prompt without all caps.
2. Inject that vector into the activations of an unrelated prompt.
3. Ask the model if it could detect an injected thought. About 20% of the time it said yes, and identified it with something like "loud" or "shouting", which is quite similar to the all caps they were going for.
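For intuition, the extract-and-inject steps can be sketched with toy numbers. This is not Anthropic's code or setup: the activations below are hand-written lists rather than anything read out of a real model, averaging over token positions is my own assumption, and the function names and the injection strength are hypothetical.

```python
def mean_activation(acts):
    """Average a list of per-token activation vectors into one vector."""
    n = len(acts)
    return [sum(column) / n for column in zip(*acts)]

def concept_vector(acts_with, acts_without):
    """Difference between mean activations with and without the concept."""
    with_mean = mean_activation(acts_with)
    without_mean = mean_activation(acts_without)
    return [a - b for a, b in zip(with_mean, without_mean)]

def inject(acts, vector, strength):
    """Add the scaled concept vector at every token position."""
    return [[x + strength * v for x, v in zip(row, vector)] for row in acts]

# Toy stand-ins: pretend dimension 0 is where "all caps" shows up.
plain = [[0.5, -1.0, 2.0], [1.5, 0.0, -2.0]]    # prompt without caps
shouted = [[1.5, -1.0, 2.0], [2.5, 0.0, -2.0]]  # same prompt in ALL CAPS

vec = concept_vector(shouted, plain)

unrelated = [[0.0, 1.0, 1.0]]                   # activations of another prompt
steered = inject(unrelated, vec, strength=4.0)
```

In this toy case `vec` only has weight on dimension 0, and `steered` is the unrelated prompt's activations pushed along that direction. The interesting part of the paper is the third step, asking the model whether it notices, which a sketch like this can't show.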

They did this with a bunch of concepts. And Opus 4 was, of course, better than other models at this.

To give some fairly obvious thoughts on what this...

There's a counterargument to the AGI hype that basically says: of course the labs would want to hype this technology, since they make money that way, and just because they say they believe in short timelines doesn't mean it's true. Specifically, the claim here is not that the AI lab CEOs are mistaken, but rather that they are actively lying, and they know AGI isn't around the corner.

What actions have frontier AI labs taken in the last year or two that wouldn't make sense, given the above explanation? Stuff like GDM's merger or OpenAI (reportedly) operating at a massive loss. Ideally these actions would be reported on by entities other than those companies...

I bought a month of Deep Research and am open to running queries if people have a few but don't want to spend 200 bucks for them. Will spend up to 25 queries in total.

A paragraph or two of detail is good - you can send me supporting documents via wnlonvyrlpf@tznvy.pbz (ROT13) if you want. Offer is open publicly or via PM.

Reflections on my first year of AI safety research

Jay Bailey

Last year, I wrote a post about my upskilling in AI alignment. To this day, I still get people occasionally reaching out to me because of this article, to ask questions about getting into the field themselves. I’ve also had several occasions to link people to the article who asked me about getting into the field from other means, like my local AI Safety group.

Essentially, what this means is that people clearly found this useful (credit to the EA Forum for managing to let the article be findable to those who need it, a year after its publication!) and therefore people would likely find a sequel useful too! This post is that sequel,...

Features and Adversaries in MemoryDT

Joseph Bloom, Jay Bailey

Keywords: Mechanistic Interpretability, Adversarial Examples, GridWorlds, Activation Engineering

This is part 2 of A Mechanistic Interpretability Analysis of a GridWorld Agent Simulator

Links: Repository, Model/Training, Task.

Epistemic status: I think the basic results are pretty solid, but I’m less sure about how these results relate to broader phenomena such as superposition or other modalities such as language models. I've erred on the side of discussing connections with other investigations to make it more obvious how gridworld decision transformers may be useful.

TLDR

We analyse the embedding space of a gridworld decision transformer, showing that it has developed an extensive structure that reflects the properties of the model, the gridworld environment and the task. We can identify linear feature...

One of the core problems of AI alignment is that we don't know how to reliably get goals into the AI - there are many possible goals that are sufficiently correlated with doing well on training data that the AI could wind up optimising for a whole bunch of different things.

Instrumental convergence claims that a wide variety of goals will lead to convergent subgoals such that the agent will end up wanting to seek power, acquire resources, avoid death, etc.

These claims do seem a bit...contradictory. If goals are really that inscrutable, why do we strongly expect instrumental convergence? Why won't we get some weird thing that happens to correlate with "don't die, keep your options open" on the training data, but falls apart out of distribution?

Spreadsheet for 200 Concrete Problems In Interpretability

Jay Bailey

Thanks to the Open Source Mechanistic Interpretability Slack for their feedback.

When going through Neel's excellent 200 Concrete Problems In Interpretability sequence, I found myself thinking "It sure would be great to have a single document I could go through to see all the problems, and ideally sort them by difficulty as well."

So I made a spreadsheet! Problems are numbered by category, a feature which should be added to the main sequence fairly soon as well.

The primary purpose of the spreadsheet is to let people quickly scan through to find problems that they're interested in before diving more deeply. The secondary purpose of it is to let people avoid duplication of work / find collaborators, so if you know of work that's already been done on some of these problems, or you're actively working on one and want help, please leave a comment on the sheet so I can add it in!

A frame that I use that a lot of people I speak to seem to find A) Interesting and B) Novel is that of "idiot units".

An Idiot Unit is the length of time it takes before you think your past self was an idiot. This is pretty subjective, of course, and you'll need to decide what that means for yourself. Roughly, I consider my past self to be an idiot if they have substantially different aims or are significantly less effective at achieving them. Personally my idiot unit is about two years - I can pretty reliably look back in time and think that compared to year T, Jay at year T-2...

Reflections on my 5-month alignment upskilling grant

Jay Bailey

Five months ago, I received a grant from the Long Term Future Fund to upskill in AI alignment. As of a few days ago, I was invited to Berkeley for two months of full-time alignment research under Owain Evans’s stream in the SERIMATS program. This post is about how I got there.

The post is partially a retrospective for myself, and partially a sketch of the path I took so that others can decide if it’s right for them. This post was written relatively quickly - I’m happy to answer more questions via PM or in the comments.

Summary

I was a software engineer for 3-4 years with little to no ML experience before I was accepted for ...

Deep Q-Networks Explained

Jay Bailey

Note: This is a long post. The post is structured in such a way that not everyone needs to read everything - Sections 1 and 2 are skippable background information, and Sections 4 and 5 go into technical detail that not everybody wants or needs to know. Section 3 on its own is sufficient to gain a high-level understanding of DQN if you already know how reinforcement learning works.

This post was made as my final project for the AGI Safety Fundamentals course.

This post aims to provide a distillation of Deep Q-Networks, neural networks trained via Deep Q-Learning. The algorithm, usually just referred to as DQN, is the algorithm that first put deep reinforcement...

Speedrunners have a tendency to totally break video games in half, sometimes in the strangest and most bizarre ways possible. I feel like some of the more convoluted video game speedrun / challenge run glitches out there are actually a good way to build intuition on what high optimisation pressure (like that imposed by a relatively weak AGI) might look like, even at regular human or slightly superhuman levels. (Slightly superhuman being a group of smart people achieving what no single human could.)

Two that I recommend:

https://www.youtube.com/watch?v=kpk2tdsPh0A - Tool-assisted run where the inputs are programmed frame by frame by a human, and executed by a computer. Exploits idiosyncrasies in Super Mario 64 code...

Jay Bailey's Shortform

Jay Bailey

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.
