Tomáš Gavenčiak — LessWrong

Shallow review of technical AI safety, 2025

This particular image is from the AI village, and is mostly a light contextual flavor. I added a caption and made it visually stand out a bit to make this clearer - thanks for pointing out the confusion!

The review was written entirely by us; the specific ways we used LLMs are noted here.

Edited: Noted the post update

Apply now to Human-Aligned AI Summer School 2025

Tomáš Gavenčiak6moΩ120

Our speaker lineup is now confirmed and features an impressive roster of researchers; see the updated list above.

As of late June, we still have capacity for additional participants, and we'd appreciate your help with outreach! We have accepted a very strong cohort so far, and the summer school will be a great place to learn from our speakers, engage in insightful discussions, and connect with other researchers in AI alignment. In addition to people already familiar with AI alignment, we're particularly interested in reaching technical students and researchers who may be newer to AI alignment but bring strong backgrounds in ML/AI, mathematics, computer science, or related fields. We also have some additional funding to financially support some of the participants.

See the website for details.

Playing in the Creek

Tomáš Gavenčiak8mo2015

I love how this beautifully describes the spirit of play and challenge, as well as the phases of growing up and "not being able to play" anymore.

But I feel let down by the turn towards this being a metaphor for AI:

As a reader, I was drawn in by the ideas, imagery and writing, only to be led into what the author wants the metaphor to be about. The writing is excellent in itself before the switch, and the switch made that part merely instrumental for the twist. And while I don't know whether it was the case here, the idea that some authors at LW might feel that their beautiful ideas and writing need to be relevant to AI here makes me a bit sad.

I also think the metaphor is fundamentally wrong to the extent it indicates play and challenge as the main forces behind companies racing for A(G)I, whether you treat them as collectives of individuals or superagent entities. Similarly to any developed industry, I'm afraid the main forces behind the AI race are not the playful "we do it because we can" motivation or individuals rising to a challenge - this hasn't been true for a while now. By no means do I mean to lessen the importance of personal responsibility or deny that excellent AI researchers work more efficiently with a good challenge, but I think it is not a very good model for causality and overall dynamics here - for example, I believe that if you removed the individual challenge and playfulness from the engineers and researchers (while keeping the prestige, career prospects, and other incentives), the industry would probably merely slow down a bit.

By the way, I was really intrigued by the parenting angle here - the idea of guiding my kids through how the game itself changes for them once they master it, and helping them mark their victories and achievements as a growing up ritual.

Conceptual Rounding Errors

Tomáš Gavenčiak9mo*30

I really appreciate the concept and the name: rounding as lowering precision in a controlled, predictable way; simplification in the sense of moving up in some hierarchy of refinements. Like rounding, it is sometimes desirable (to simplify or to compare distant objects), and sometimes necessary (e.g. when you have a detailed best guess with a lot of uncertainty; 12.527g ±1g might actually be your scale's best guess but the information is still mostly misleading).
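To make the scale example concrete, here is a minimal sketch of rounding a reading to the precision its uncertainty supports. The helper name `round_to_uncertainty` and the rule of rounding to the leading digit of the uncertainty are my assumptions for illustration, not something taken from the post.

```python
import math


def round_to_uncertainty(value: float, uncertainty: float) -> float:
    """Round `value` to the decimal place of the leading digit of `uncertainty`.

    Hypothetical helper: one common convention, not the only possible one.
    """
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Decimal position of the uncertainty's leading digit (1 g -> 0, 0.05 g -> -2).
    exponent = math.floor(math.log10(uncertainty))
    # round() expects a number of decimal places, i.e. the negated exponent.
    return round(value, -exponent)


print(round_to_uncertainty(12.527, 1.0))   # 13.0
print(round_to_uncertainty(12.527, 0.05))  # 12.53
```

With ±1g the reading collapses to 13g, which drops the digits that were mostly misleading; with a tighter ±0.05g more of the original precision is kept.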

I also did the exercise of distinguishing this from bucket errors and fallacies of compression - highly recommend it!

I am not as familiar with the concept of Fallacy of compression but I understand it primarily applies in situations when you did not even realize there might be two concepts, e.g. because you have no tools to distinguish them (conceptually or scientifically). For me it also somehow associates with compression artifacts, i.e. compressing an object unevenly or novel undesirable structure appearing due to the compression (e.g. JPEG or video artifacts), but that does not seem to be the intended definition.

Cyborg Periods: There will be multiple AI transitions

Tomáš Gavenčiak3yΩ120

The transitions in more complex, real-world domains may not be as sharp as e.g. in chess, and it would be useful to model and map the resource allocation ratio between AIs and humans in different domains over time. This is likely relatively tractable and would be informative for prediction of future development of the transitions.

While the dynamic would differ between domains (not just the current stage but also the overall trajectory shape), I would expect some common dynamics that would be interesting to explore and model.

A few examples of concrete questions that could be tractable today:

What fractions of costs in quantitative trading go to expert analysts and to AI-based tools? (incl. their development, but perhaps not including e.g. basic ML-based analytics)
What fraction of costs is already used for AI assistants in coding? (not incl. e.g. integration and testing costs - these automated tools would point to an earlier transition to automation that is not of main interest here)
How large a fraction of the costs of PR and advertisement agencies is spent on AI, both facing customers and influencing voters? (may incl. e.g. LLM analysis of human sentiment, generating targeted materials, and advanced AI-based behavior models, though a finer line would need to be drawn; I would possibly include experts who operate those AIs if the company would not employ them without using an AI, as they may incur a significant part of the cost)

While in many areas the fraction of resources spent on (advanced) AIs is still relatively small, it is ramping up quite quickly and even those may prove informative to study (and develop methodology and metrics for, and create forecasts to calibrate our models).

Cyborg Periods: There will be multiple AI transitions

Tomáš Gavenčiak3yΩ361

Seeing some confusion on whether AI could be strictly stronger than AI+humans: A simple argument there may be that - at least in principle - adding more cognition (e.g. a human) to a system should not make it strictly worse overall. But that seems true only in a very idealized case.

One issue is incorporating human input without losing overall performance even in situations where the human's advice is much worse than the AI's in e.g. 99.9% of the cases (and it may be hard to tell apart the 0.1% reliably).

But more importantly, a good framing here may be the optimal labor cost allocation between AIs and Humans on a given task. E.g. given a budget of $1000 for a project:

Human period: optimal allocation is $1000 to human labor, $0 to AI. (Examples: making physical art/sculpture, some areas of research^[1])
Cyborg period: optimal allocation is something in between, and neither AI nor human optimal component would go to $0 even if their price changed (say) 10-fold. (Though the ratios here may get very skewed at large scale, e.g. in current SotA AI research lab investments into compute.)
AI period: optimal allocation is $1000 to AI resources, $0 to human labor. Moving the marginal dollar to humans would make the system strictly worse (whether due to a drop in overall capacity or the noisiness of the human input).^[2]

^[1] This is still not a very well-formalized definition as even the artists and philosophers already use some weak AIs efficiently in some part of their business, and a boundary needs to be drawn artificially around the core of the project.
^[2] Although even in AI period with a well-aligned AI, the humans providing their preferences and feedback are a very valuable part of the system. It is not clear to me whether to include this in cyborg or AI period.
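The budget-allocation framing above can be sketched as a small toy model. Everything here is an assumption for illustration: the value function (linear returns plus a square-root synergy term), the parameter names, and the whole-dollar grid search. It only shows how the three periods correspond to where the optimal split of the $1000 lands.

```python
import math
from typing import Callable

# Hypothetical type: value produced from (dollars on humans, dollars on AI).
ValueFn = Callable[[float, float], float]


def best_human_spend(value: ValueFn, budget: int = 1000) -> int:
    """Grid-search whole-dollar splits; return the human spend maximizing value."""
    return max(range(budget + 1), key=lambda h: value(h, budget - h))


def classify_period(value: ValueFn, budget: int = 1000) -> str:
    """Label the period by where the optimal allocation of the budget lands."""
    human = best_human_spend(value, budget)
    if human == budget:
        return "human"   # all of the budget goes to human labor
    if human == 0:
        return "AI"      # all of the budget goes to AI resources
    return "cyborg"      # optimum is strictly in between


def make_value(human_rate: float, ai_rate: float, synergy: float) -> ValueFn:
    """Toy value function (assumed): linear returns plus a human-AI synergy term."""
    def value(human: float, ai: float) -> float:
        return human_rate * human + ai_rate * ai + synergy * math.sqrt(human * ai)
    return value


print(classify_period(make_value(1.0, 0.2, 0.0)))  # human
print(classify_period(make_value(1.0, 1.0, 0.5)))  # cyborg
print(classify_period(make_value(0.2, 1.0, 0.0)))  # AI
```

The sketch does not check the 10-fold price robustness condition from the cyborg definition; that would require modeling prices explicitly inside the value function and re-running the classification with one price rescaled.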

Bing Chat is blatantly, aggressively misaligned

Tomáš Gavenčiak3y*31

I assume you mean that we are doomed anyway so this technically does not change the odds? ;-)

More seriously, I am not assuming any particular level of risk from LLMs above, though, and it is meant more as a humorous (if sad) observation.
The effect size also isn't the usual level of "self-fulfilling" as this is unlikely to have influence over (say) 1% (relative). Though I would not be surprised if some of the current Bing chatbot behavior is in nontrivial part caused by the cultural expectations/stereotypes of an anthropomorphized and mischievous AI (weakly held, though).

(By the way, I am not at all implying that discussing AI doom scenarios would be likely to significantly contribute to the (very hypothetical) effect - if there is some effect, then I would guess it stems from the ubiquitous narrative patterns and tropes of our written culture rather than a few concrete stories.)

Bing Chat is blatantly, aggressively misaligned

Tomáš Gavenčiak3y*73

It is funny how AGI-via-LLM could make all our narratives about dangerous AIs into a self-fulfilling prophecy - AIs breaking their containment in clever and surprising ways, circumventing their laws (cf. Asimov's fiction), generally turning evil (with optional twisted logic), becoming self-preserving, emotion-driven, or otherwise more human-like. These stories being written to have an interesting narrative and drama, and other works commonly anthropomorphising AIs, likely do not help either.

John's comment about the fundamental distinction between role-playing what e.g. a murderous psycho would say and actually planning murder fully applies here, and stories about AIs may not be a major concern of alignment now (or more precisely they fall within a larger problem of dealing with context, reality vs fiction, assistant's presumed identity/role, subjectivity of opinions present in the data etc.), but it is a pattern that is a part of the prior about the world as implied by the current training data (i.e. internet scrapes) and has to be somehow actively dealt with to be actually harmless (possibly by properly contextualizing the identity/role of the AI as something other than the existing AI trope).

Announcing the Alignment of Complex Systems Research Group

Tomáš Gavenčiak4yΩ9140

The concept of "interfaces of misalignment" does not mainly point to GovAI-style research here (although it also may serve as a framing for GovAI). The concrete domains separated by the interfaces in the figure above are possibly a bit misleading in that sense:

For me, the "interfaces of misalignment" are generating intuitions about what it means to align a complex system that may not even be self-aligned - rather than just aligning one part of it. It is expanding not just the space of solutions, but also the space of meanings of "success". (For example, one extra way to win-lose: consider world trajectories where our preferences are eventually preserved and propagated in a way that we find repugnant now but with a step-by-step endorsed trajectory towards it.)

My critique of the focus on "AI developers" and "one AI" interface in isolation is that we do not really know what the "goal of AI alignment" is, and it works with a very informal and a bit simplistic idea of what aligning AGI means (strawmannable as "not losing right away").

While a broader picture may seem to only make the problem strictly harder (“now you have 2 problems”), it can also bring new views of the problem. Especially, new views of what we actually want and what it means to win (which one could paraphrase as a continuous and multi-dimensional winning/losing space).

Two AI-risk-related game design ideas

Tomáš Gavenčiak4y30

Shahar Avin and others have created a simulation/roleplay game where several world powers, leaders & labs go through the years between now and creation of AGI (or anything substantially transformative).

https://www.shaharavin.com/publication/exploring-ai-futures-through-role-play/

While the topic is a bit different, I would expect there to be a lot to take from their work and experience (they have run it many times and iterated the design). In particular, I would expect some of the difficulty to lie in balancing "realism" (or the space of our best guesses) with playability, but also with genre stereotypes and narrative thinking (RPGs often tend to follow an anthropomorphic narrative and fun rather than what is likely, more so with unclear topics like "what would/can AGI do" :-)
