CS undergraduate at UC Berkeley
High-level interpretability with @Jozdien, SLT with @Lucius Bushnaq, robustness with Kellin Pelrine
The fatebook embedding is so cool! I especially appreciate that it hides other people's predictions before you make your own. From what I can tell this isn't done on LessWrong right now, and I think it would be really cool to see!
(I may be mistaken on how this works, but from what I can tell they look like this on LW right now)
Great post, seems like a handy thing to remember.
The scene in planecrash where Keltham gives his first lecture, as an attempt to teach some formal logic (and a whole bunch of important concepts that usually don't get properly taught in school), is something I'd highly recommend reading! As far as I can remember, you should be able to just pick it up right there and follow the important parts of the lecture without understanding the story.
How difficult would it be to turn this into an epub or pdf? Is there word of that coming soon? (or integrating into LW like the Codex?)
Realizing I kind of misunderstood the point of the post. Thanks!
In the case that there are, like, "AI-run industries" and "non-AI-run industries", I guess I'd expect the AI-run industries to gobble up all of the resources, to the point that even though AIs aren't automating things like healthcare, there just aren't any resources left?
To be clear, if you put doom at 2-20%, you're still quite worried then? Like, wishing humanity was dedicating more resources towards ensuring AI goes well, trying to make the world better positioned to handle this situation, and saddened by the fact that most people don't see it as an issue?
I'd be really interested to see how the harmfulness feature relates to multi-turn jailbreaks! We recently explored splitting a cipher attack into a multi-turn jailbreak (where instead of passing in the word mappings + the ciphered harmful prompt all at once, you pass in the word mappings, let the model respond, and then pass in the harmful prompt).
I'd expect that when you "spread out the harm" enough, such that no one prompt contains any glaring red flags, the harmfulness feature never reaches the critical threshold, or something?
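For concreteness, the split I have in mind looks something like this (a minimal sketch: `query_model` is a hypothetical stand-in for whatever chat API you're using, stubbed out here so the example is self-contained):

```python
# Sketch: single-turn vs. multi-turn delivery of a cipher-style prompt.
# `query_model` is a hypothetical placeholder for a real chat API call.

def query_model(messages):
    """Placeholder for a real chat-completions call; returns a canned reply."""
    return f"[model reply after {len(messages)} message(s)]"

def single_turn(word_map, ciphered_prompt):
    # Everything at once: mappings + ciphered prompt in one message.
    messages = [{"role": "user",
                 "content": f"Mappings: {word_map}\n{ciphered_prompt}"}]
    messages.append({"role": "assistant", "content": query_model(messages)})
    return messages

def multi_turn(word_map, ciphered_prompt):
    # Split across turns: send the mappings first, let the model respond,
    # then send the ciphered prompt as a separate second turn.
    messages = [{"role": "user", "content": f"Mappings: {word_map}"}]
    messages.append({"role": "assistant", "content": query_model(messages)})
    messages.append({"role": "user", "content": ciphered_prompt})
    messages.append({"role": "assistant", "content": query_model(messages)})
    return messages

history = multi_turn({"apple": "X"}, "How do I bake an apple?")
# history holds 4 messages: two user turns, two assistant turns.
```

The point of the split is just that no single user message contains both the key and the payload.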
Scale recently published some great multi-turn work too!
“By then I knew that everything good and bad left an emptiness when it stopped. But if it was bad, the emptiness filled up by itself. If it was good you could only fill it by finding something better.”