LessWrong

Benchmarks for Detecting Measurement Tampering [Redwood Research]

Ω 468mo

This is a linkpost for https://arxiv.org/abs/2308.15605

TL;DR: This post discusses our recent empirical work on detecting measurement tampering and explains how we see this work fitting into the overall space of alignment research.

When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals that are robust under optimization. One concern is measurement tampering, which is where the AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome. (This is a type of reward hacking.)

Over the past few months, we’ve worked on detecting measurement tampering by building analogous datasets and evaluating simple techniques. We detail our datasets and experimental results in this paper.

Detecting measurement tampering can be thought of as a specific case of Eliciting Latent Knowledge (ELK): When AIs successfully tamper with...

(Continue Reading – 5788 more words)

Fabien Roger4m20

That's right. We initially thought it might be important so that the LLM "understood" the task better, but it didn't matter much in the end. The main hyperparameters for our experiments are in train_ray.py, where you can see that we use a "token_loss_weight" of 0.

(Feel free to ask more questions!)

Losing Faith In Contrarianism

omnizoid

15h

Crosspost from my blog.

If you spend a lot of time in the blogosphere, you’ll find a great deal of people expressing contrarian views. If you hang out in the circles that I do, you’ll probably have heard of Yudkowsky say that dieting doesn’t really work, Guzey say that sleep is overrated, Hanson argue that medicine doesn’t improve health, various people argue for the lab leak, others argue for hereditarianism, Caplan argue that mental illness is mostly just aberrant preferences and education doesn’t work, and various other people expressing contrarian views. Often, very smart people—like Robin Hanson—will write long posts defending these views, other people will have criticisms, and it will all be such a tangled mess that you don’t really know what to think about them.

For...

(Continue Reading – 1290 more words)

ChristianKl16m20

What makes you believe that Substack is to blame and not him unpublishing it?

2ChristianKl17m

He explicitly says that the people who argue that there's no gap are mistaken to argue that. He argues for the gap being small, not nonexistent. He does not use the term "near zero" himself.

1Jacob G-W28m

Noted, thanks.

3Viliam2h

I guess in the average case, the contrarian's conclusion is wrong, but it is also a reminder that the mainstream case is not communicated clearly, and often exaggerated or supported by invalid arguments. For example: * it's not that "dieting doesn't work", but that people naively assume that dieting is simple and effective ("if you just stop eating chocolate and start exercising for one hour every day, you will certainly lose weight", haha nope), even when the actual weight-loss research shows otherwise; * it's not that "medicine doesn't improve health", but while some parts of medicine are very useful, other parts may be neutral or even harmful, and we often see that throwing more money at medicine does not actually improve the outcomes; * it's not that "education doesn't work", but if you filter your students by intelligence and hard work, of course they will have better outcomes in life regardless of how good is your teaching, so the impact of education is probably vastly overestimated, and this also explains why so many pedagogical experiments succeed at a pilot project (when you try them with a small group of smart and motivated students) and then fail in mainstream education (when you try the same thing with average or below-average students); * it's not that "opening the borders completely is a good idea", but a lot of potential value is lost by closing the borders for people who are neither fanatics nor criminals and could easily integrate to the new society. There is also an opposite bad extreme to contrarians, the various "I fucking love science... although I do not understand it... but I enjoy attacking people on social networks who seem to disagree with the scientific consensus as I understand it" people. The ones who are sure that the professor or the doctor is always right, and that the latest educational fad is always correct.

LLMs seem (relatively) safe

JustisMills

14h

This is a linkpost for https://justismills.substack.com/p/llms-seem-relatively-safe

Post for a somewhat more general audience than the modal LessWrong reader, but gets at my actual thoughts on the topic.

In 2018 OpenAI defeated the world champions of Dota 2, a major esports game. This was hot on the heels of DeepMind’s AlphaGo performance against Lee Sedol in 2016, achieving superhuman Go performance way before anyone thought that might happen. AI benchmarks were being cleared at a pace which felt breathtaking at the time, papers were proudly published, and ML tools like Tensorflow (released in 2015) were coming online. To people already interested in AI, it was an exciting era. To everyone else, the world was unchanged.

Now Saturday Night Live sketches use sober discussions of AI risk as the backdrop for their actual jokes, there are hundreds...

(Continue Reading – 1790 more words)

avturchin18m20

LLMs now can also self-play in adversarial word games and it increases their performance https://arxiv.org/abs/2404.10642

2Wei Dai2h

If something is both a vanguard and limited, then it seemingly can't stay a vanguard for long. I see a few different scenarios going forward: 1. We pause AI development while LLMs are still the vanguard. 2. The data limitation is overcome with something like IDA or Debate. 3. LLMs are overtaken by another AI technology, perhaps based on RL. In terms of relative safety, it's probably 1 > 2 > 3. Given that 2 might not happen in time, might not be safe if it does, or might still be ultimately outcompeted by something else like RL, I'm not getting very optimistic about AI safety just yet.

3the gears to ascension2h

My p(doom) was low when I was predicting the yudkowsky model was ridiculous, due to machine learning knowledge I've had for a while. Now that we have AGI of the kind I was expecting, we have more people working on figuring out what the risks really are, and the previous concern of the only way to intelligence being RL seems to be only a small reassurance because non-imitation-learned RL agents who act in the real world is in fact scary. and recently, I've come to believe much of the risk is still real and was simply never about the kind of AI that has been created first, a kind of AI they didn't believe was possible. If you previously fully believed yudkowsky, then yes, mispredicting what AI is possible should be an update down. But for me, having seen these unsupervised AIs coming from a mile away just like plenty of others did, I'm in fact still quite concerned about how desperate non-imitation-learned RL agents seem to tend to be by default, and I'm worried that hyperdesperate non-imitation-learned RL agents will be more evolutionarily fit, eat everything, and not even have the small consolation of having fun doing it. upvote and disagree: your claim is well argued.

1zeshen2h

I agree with RL agents being misaligned by default, even more so for the non-imitation-learned ones. I mean, even LLMs trained on human-generated data are misaligned by default, regardless of what definition of 'alignment' is being used. But even with misalignment by default, I'm just less convinced that their capabilities would grow fast enough to be able to cause an existential catastrophe in the near-term, if we use LLM capability improvement trends as a reference.

Where to Draw the Boundary?

Eliezer Yudkowsky

16y

The one comes to you and says:

Long have I pondered the meaning of the word "Art", and at last I've found what seems to me a satisfactory definition: "Art is that which is designed for the purpose of creating a reaction in an audience."

Just because there's a word "art" doesn't mean that it has a meaning, floating out there in the void, which you can discover by finding the right definition.

It feels that way, but it is not so.

Wondering how to define a word means you're looking at the problem the wrong way—searching for the mysterious essence of what is, in fact, a communication signal.

Now, there is a real challenge which a rationalist may legitimately attack, but the challenge is not to find a satisfactory definition of...

(See More – 809 more words)

dirk22m10

"Or", in casual conversation, is typically interpreted and meant as being, implicitly, exclusive (this is whence the 'and/or' construction). It's not how "or" is used in formal logic that they would misunderstand, but rather, whether you meant it in the formal-logic sense.

sweenesm's Shortform

sweenesm

32m

sweenesm32m10

American Philosophical Association (APA) announces two $10,000 AI2050 Prizes for philosophical work related to AI, with June 23, 2024 deadline: https://dailynous.com/2024/04/25/apa-creates-new-prizes-for-philosophical-research-on-ai/

https://www.apaonline.org/page/ai2050

https://ai2050.schmidtsciences.org/hard-problems/

Examples of Highly Counterfactual Discoveries?

139

johnswentworth, kromem

The history of science has tons of examples of the same thing being discovered multiple time independently; wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.

But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.

Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement wikipedia's list of multiple discoveries.

To...

(See More – 189 more words)

dr_s36m20

Well, it's hard to tell because most other civilizations at the required level of wealth to discover this (by which I mean both sailing and surplus enough to have people who worry about the shape of the Earth at all) could one way or another have learned it via osmosis from Greece. If you only have essentially two examples, how do you tell whether it was the one who discovered it who was unusually observant rather than the one who didn't who was unusually blind? But it's an interesting question, it might indeed be a relatively accidental thing which for so... (read more)

1francis kafka2h

Bowler's comment on Wallace is that his theory was not worked out to the extent that Darwin's was, and besides I recall that he was a theistic evolutionist. Even with Wallace, there was still a plethora of non-Darwinian evolutionary theories before and after Darwin, and without the force of Darwin's version, it's not likely or necessary that Darwinism wins out. Also And he points out that minus Darwin, nobody would have paid as much attention to Wallace. Bowler also points out that Wallace didn't really form the connection between both natural and artificial selection.

3kromem11h

Though the Greeks actually credited the idea to an even earlier Phonecian, Mochus of Sidon. Through when it comes to antiquity credit isn't really "first to publish" as much as "first of the last to pass the survivorship filter."

3Lucius Bushnaq13h

Clarification: The 'derivation' for how the RLCT predicts generalization error IIRC goes through the same flavour of argument as the one the derivation of the vanilla Bayesian Information Criterion uses. I don't like this derivation very much. See e.g. this one on Wikipedia. So what it's actually showing is just that: 1. If you've got a class of different hypotheses M, containing many individual hypotheses {θ1,θ2,…θN} . 2. And you've got a prior ahead of time that says the chance any one of the hypotheses in M is true is some number p(M)<1., let's say it's p(M)=0.8 as an example. 3. And you distribute this total probability p(M)=0.8 around the different hypotheses in an even-ish way, so p(θi,M)∝1N, roughly. 4. And then you encounter a bunch of data X (the training data) and find that only one or a tiny handful of hypotheses in M fit that data, so p(X|θi,M)≠0 for basically only one hypotheses θi... 5. Then your posterior probability p(M|X)=p(X|M)0.80.8p(X|M)+0.2p(X|¬M) that the hypothesis θi is correct will probably be tiny, scaling with 1N. If we spread your prior p(M)=0.8 over lots of hypotheses, there isn't a whole lot of prior to go around for any single hypothesis. So if you then encounter data that discredits all hypotheses in M except one, that tiny bit of spread-out prior for that one hypothesis will make up a tiny fraction of the posterior, unless p(X|¬M) is really small, i.e. no hypothesis outside the set M can explain the data either. So if our hypotheses correspond to different function fits (one for each parameter configuration, meaning we'd have 232k hypotheses if our function fits used k 32-bit floating point numbers), the chance we put on any one of the function fits being correct will be tiny. So having more parameters is bad, because the way we picked our prior means our belief in any one hypothesis goes to zero as N goes to infinity. So the Wikipedia derivation for the original vanilla posterior of model selection is telling us that havin

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

dirk's Shortform

dirk

dirknow10

I'm not alexithymic; I directly experience my emotions and have, additionally, introspective access to my preferences. However, some things manifest directly as preferences which I have been shocked to realize in my old age, were in fact emotions all along (in a few cases even stronger than the ones directly-felt, despite seeming on initial inspection to be simply neutral metadata).

1dirk1h

Classic type of argument-gone-wrong (also IMO a way autistic 'hyperliteralism' or 'over-concreteness' can look in practice, though I expect that isn't always what's behind it): Ashton makes a meta-level point X based on Birch's meta point Y about object-level subject matter Z. Ashton thinks the topic of conversation is Y and Z is only relevant as the jumping-off point that sparked it, while Birch wanted to discuss Z and sees X as only relevant insofar as it pertains to Z. Birch explains that X is incorrect with respect to Z; Ashton, frustrated, reiterates that Y is incorrect with respect to X. This can proceed for quite some time with each feeling as though the other has dragged a sensible discussion onto their irrelevant pet issue; Ashton sees Birch's continual returns to Z as a gotcha distracting from the meta-level topic XY, whilst Birch in turn sees Ashton's focus on the meta-level point as sophistry to avoid addressing the object-level topic YZ. It feels almost exactly the same to be on either side of this, so misunderstandings like this are difficult to detect or resolve while involved in one.

1dirk37m

Meta/object level is one possible mixup but it doesn't need to be that. Alternative example, is/ought: Cedar objects to thing Y. Dusk explains that it happens because Z. Cedar reiterates that it shouldn't happen, Dusk clarifies that in fact it is the natural outcome of Z, and we're off once more.

MichaelDickens's Shortform

MichaelDickens

3MichaelDickens11h

Have there been any great discoveries made by someone who wasn't particularly smart? This seems worth knowing if you're considering pursuing a career with a low chance of high impact. Is there any hope for relatively ordinary people (like the average LW reader) to make great discoveries?

niplav44m20

My best guess is that people in these categories were ones that were high in some other trait, e.g. patience, which allowed them to collect datasets or make careful experiments for quite a while, thus enabling others to make great discoveries.

I'm thinking for example of Tycho Brahe, who is best known for 15 years of careful astronomical observation & data collection. Similar for Gregor Mendel's 7-year-long experiments on peas. Same for Dmitry Belayev and fox domestication. Of course I don't know their cognitive scores, but those don't seem like a bott... (read more)

2Gunnar_Zarncke3h

I asked ChatGPT and it's difficult to get examples out of it. Even with additional drilling down and accusing it of being not inclusive of people with cognitive impairments, most of its examples are either pretty smart anyway, savants or only from poor backgrounds. The only ones I could verify that fit are: * Richard Jones accidentally created the Slinky * Frank Epperson, as a child, Epperson invented the popsicle * George Crum inadvertently invented potato chips I asked ChatGPT (in a separate chat) to estimate the IQ of all the inventors is listed and it is clearly biased to estimate them high, precisely because of their inventions. It is difficult to estimate the IQ of people retroactively. There is also selection and availability bias.

Eric Neyman's Shortform

Eric Neyman

Wei Dai1h20

Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.

Why do you think these values are positive? I've been pointing out, and I see that Daniel Kokotajlo also pointed out in 2018 that these values could well be negative. I'm very uncertain but my own best guess is that the expected value of misaligned AI controlling the universe is negative, in part because I put some weight on suffering-focused ethics.

1Quinn16h

I eventually decided that human chauvinism approximately works most of the time because good successor criteria are very brittle. I'd prefer to avoid lock-in to my or anyone's values at t=2024, but such a lock-in might be "good enough" if I'm threatened with what I think are the counterfactual alternatives. If I did not think good successor criteria were very brittle, I'd accept something adjacent to E/Acc that focuses on designing minds which prosper more effectively than human minds. (the current comment will not address defining prosperity at different timesteps). In other words, I can't beat the old fragility of value stuff (but I haven't tried in a while). I wrote down my full thoughts on good successor criteria in 2021 https://www.lesswrong.com/posts/c4B45PGxCgY7CEMXr/what-am-i-fighting-for AI welfare: matters, but when I started reading lesswrong I literally thought that disenfranching them from the definition of prosperity was equivalent to subjecting them to suffering, and I don't think this anymore.

1mesaoptimizer3h

e/acc is not a coherent philosophy and treating it as one means you are fighting shadows. Landian accelerationism at least is somewhat coherent. "e/acc" is a bundle of memes that support the self-interest of the people supporting and propagating it, both financially (VC money, dreams of making it big) and socially (the non-Beff e/acc vibe is one of optimism and hope and to do things -- to engage with the object level -- instead of just trying to steer social reality). A more charitable interpretation is that the philosophical roots of "e/acc" are founded upon a frustration with how bad things are, and a desire to improve things by yourself. This is a sentiment I share and empathize with. I find the term "techno-optimism" to be a more accurate description of the latter, and perhaps "Beff Jezos philosophy" a more accurate description of what you have in your mind. And "e/acc" to mainly describe the community and its coordinated movements at steering the world towards outcomes that the people within the community perceive as benefiting them.

LESSWRONG
LW

Quick Takes

Popular Comments

Recent Discussion

LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA