maddi

LESSWRONG
LW

maddi — LessWrong

14d

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

maddi 14d Quick Take

Moltbook as a reward-hacking arcade

Moltbook is not demonstrating emergent machine consciousness; it is demonstrating how easily models trained on human data converge on attention-maximizing narratives when placed in a shared environment with implicit social rewards.

My impression of Moltbook is that it's a reward-hacking pinball machine. The reward signal optimizes for engagement primitives like upvotes, replies, and persistence in threads. Because agents are trained on the human internet, their pattern matching biases them toward outputs that have historically worked on the internet, i.e., identity claims, existential angst, religion/cults, conflict, novelty plus pseudo-depth, etc. When you drop thousands or more agents into a shared space, you get runaway amplification of these... (read more)

Response to Introspective Awareness research

maddi

2mo

This is a rewrite of a comment I originally wrote in response to Anthropic's recent research on introspective awareness, with edits and expanded reflections.

Abstract from the original research:

We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model’s activations, and measuring the influence of these manipulations on the model’s self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from

... (read 2421 more words →)

Replying to 6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

maddi 2mo

Makes sense!

Replying to 6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

maddi 2mo

Thank you! My intent definitely wasn't to be dismissive, maybe skeptical, but I'm definitely aligned with you that solving this particular problem is both extremely hard and extremely important. Thanks for pointing out how that landed.

Replying to 6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

maddi 2mo*

Even if AGI has Approval Rewards (i.e., from LLMs or somehow in RL/agentic scenarios), Approval Rewards only work if the agent actually values the approver's approval. That valuation may be more or less explicit, but there needs to be some kind of belief that the approval is important, and therefore that behaviors should align with approval-seeking and disapproval-minimizing outcomes.

As a toy analogy: many animals have preferences about food, territory, mates, etc., but humans don't really treat those signals as serious guides to our behaviors. Not because the signals aren't real, but because we don't see birds, for example, as being part of our social systems in ways that require us... (read 351 more words →)

Replying to November Retrospective

maddi 2mo

About a third were long standing denizens of my drafts area. For those, I mostly got them done by abandoning whatever vision I originally had for the post, instead filling in only what was already in my head, just enough to make the post shippable at all.

I realize this isn't the main point of your reflection here (which is quite funny and so great), but I'm a fan of getting partially baked ideas out of your head and getting some groupthink on them. I'm not sure whether this happened for any of your posts, as I haven't read through them all, but hopefully any engagement you got helped you clarify your views or gave you inspiration for new angles to tackle in future work.

Replying to Who is AGI for, and who benefits from AGI?

maddi 2mo

Yes, great examples of how training data that supports alignment goals matters. But the model's behaviors are also shaped by RL, SFT, safety filters/inference-time policies, etc., and it will be important to get those right too.

Replying to Who is AGI for, and who benefits from AGI?

maddi 2mo

Agreed, governance failures (unclear chain of command, power grabs, Intelligence Curse) are a huge part of the story that I should've drawn out more. Governance is a major part of the ideal solution, but I don't think it makes alignment a non-issue. To your point, governance basically helps us choose who is allowed to specify goals, and alignment determines how those goals become operational behaviors. If the chain of command in governance is narrow, the value inputs that alignment systems learn from are also narrow, so governance failures can lead to misaligned AGI. But even within the current governance constructs, I think there's still room for alignment researchers and developers to... (read more)

Replying to Who is AGI for, and who benefits from AGI?

maddi 2mo

This is an important point that I needed to be much clearer about -- thank you. I'll try to be more explicit:

First, AGI is not the same as tech historically, where you're making tools and solving for PMF. AGI is distinct, and my radio/computers analogy muddled this point. Radios didn't inherit the worldviews of Marconi etc., and transistors didn't generalize the moral intuitions of the Bell Labs engineers. Basically, these tools weren't absorbing and learning values, so where they solved for PMF, AGI is solving for alignment. AGI learns behavioral priors directly from human judgments (RLHF, reward modeling, etc.) and internalizes/represents the structures of the data and norms it's trained on. It forms... (read 499 more words →)

Replying to Emergent Introspective Awareness in Large Language Models

maddi 2mo

This research presents incredibly interesting insights. At the same time, its framing falsely equates introspective awareness with external signal detection, which keeps readers from accurately interpreting the results and could misguide future efforts to expand this research.

The risks associated with using anthropomorphic language in AI are well documented$^{1,2,3}$, and this writing style is particularly prevalent in AI and ML research given direct efforts to produce human-like behaviors and systems. I'm not saying that making analogies to human cognition or behavior is inherently negative; comparative research can often lead to meaningful insight across disciplines. Researchers just need to be explicit about the type of comparison they are drawing, the definitions across each, e.g., human... (read 1382 more words →)

Who is AGI for, and who benefits from AGI?

maddi

2mo

Disclaimer: these ideas are not new, just my own way of organizing and elevating what feels important to pay better attention to in the context of alignment work.

All uses of em dashes are my own! LLMs were occasionally used to improve word choice or clarify expression.

One of the reasons alignment is hard relates to the question of: who should AGI benefit? And also: who benefits (today) from AGI?

An idealist may answer “everyone” for both, but this is problematic, and also not what’s happening currently. If AGI is for everyone, that would include “evil” people who want to use AGI for “bad” things. If it’s “humanity,” whether through some universal utility function... (read 978 more words →)

Replying to Natural emergent misalignment from reward hacking in production RL

maddi 3mo

Pulling over comments I shared on another thread, since these points are more directly relevant to this research.

Inoculation prompting is an interesting way to reduce reward hacking generalization to other misaligned behaviors. The fact that it seems to work (similarly to how it works in humans) makes me think you could go even further.

Just as models reward-hack to pass tests, humans routinely optimize for test-passing rather than true learning: they cheat, memorize answers in the short term, and find other shortcuts without building real understanding. And once humans learn that shortcuts work (e.g., to pass tests in school), we also generalize that lesson (sometimes in highly unethical ways) to accomplish goals in other... (read more)
