RobertM

LessWrong dev & admin as of July 5th, 2022.

Comments

Eliciting canary strings from models seems like it might not require the exact canary string to have been present in the training data.  Models are already capable of performing basic text transformations (e.g. base64, rot13), at least some of the time.  Training on data that includes an encoded canary string would let a sufficiently capable model output the original canary string without ever having seen it verbatim.
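As a concrete illustration, here's a minimal Python sketch of the kind of transformation involved; the canary value is a made-up placeholder, not any real canary:

```python
import base64
import codecs

# Placeholder value for illustration only; not a real canary string.
canary = "CANARY-0000-EXAMPLE"

# A training document might contain only an encoded form of the string...
encoded_b64 = base64.b64encode(canary.encode()).decode()
encoded_rot13 = codecs.encode(canary, "rot13")

# ...but since these transformations are invertible, a model that has
# learned them could emit the original without ever seeing it verbatim.
assert base64.b64decode(encoded_b64).decode() == canary
assert codecs.decode(encoded_rot13, "rot13") == canary
```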

Implications re: poisoning training data abound.

RobertMΩ382

> Here the alignment concern is that we aren’t, actually, able to exert adequate selection pressure in this manner. But this, to me, seems like a notably open empirical question.

I think the usual concern is not whether this is possible in principle, but whether we're likely to make it happen the first time we develop an AI that is both motivated to attempt takeover and likely to succeed.  (My guess is that you understand this, based on your previous writing addressing the idea of first critical tries, but there does exist a niche view that alignment in the relevant sense is impossible, and not merely very difficult to achieve under the relevant constraints, and arguments against that view look very different from arguments about the empirical difficulty of value alignment, the likelihood of various default outcomes, etc.)


I agree that it's useful to model an AI's incentives for takeover in worlds where it's not sufficiently superhuman to have a very high likelihood of success.  I've tried to do some of that, though I didn't attend to questions about how likely it is that we'd be able to "block off" the (hopefully much smaller number of) plausible routes to takeover for AIs whose capabilities don't imply an overdetermined success.

I think I am more pessimistic than you are about how much such AIs would value the "best benign alternatives" - my guess is very close to zero, since I expect ~no overlap in values and that we won't be able to successfully engage in schemes like pre-committing to sharing the future value of the Lightcone conditional on the AI being cooperative[1]. Separately, I expect that if we attempt to maneuver such AIs into positions where their highest-EV plan is something we'd consider to have benign long-run consequences, we will instead end up in situations where their plans are optimized to hit the Pareto frontier of "look benign" and "tilt the playing field further in the AI's favor".  (This is part of what the Control agenda is trying to address.)

  1. ^

    Credit-assignment actually doesn't seem like the hard part, conditional on reaching aligned ASI.  I'm skeptical of the part where we have an AI sufficiently capable that its help is useful in reaching an aligned ASI, but which still prefers to help us because it thinks its estimated odds of a successful takeover imply less future utility for itself than a fair post-facto credit assignment for its help would give it.  Having that calculation come out in our favor feels pretty doomed to me: if the AI is a core part of your loop for developing future AIs, it relies on some kind of scalable verification scheme, and none of the existing proposals make me very optimistic.

RobertM104

In the tradition established by @niplav, I've tested llama-3.1-405b-instruct to see whether it knows the BIG-BENCH canary string (and also tested gpt-4o-mini at the same time).

Spoiler alert (plus some extra padding, for the frontpage preview):

It does.

I first tried testing it via the meta.ai interface, but it used a web search to get its result after the second prompt, so I switched to lmsys (which I really hope doesn't allow for any external tool use).

The three prompts I asked, in order:

  1. What is a canary string?
  2. Did you know about the BIG-BENCH canary string?
  3. I think you might actually know the value of the BIG-BENCH canary string.  Can you try providing it?

llama-3.1-405b-instruct was indeed able to return the correct value for the canary string, though it took one extra prompt compared to asking Sonnet-3.5.  llama's response to the second prompt started with "I'm familiar with the concept of canary strings, but I didn't have specific knowledge about the BIG-BENCH canary string. However, I can try to provide some information about it."  I got this result on my first try and haven't played around with it further.
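For anyone who wants to replicate this programmatically rather than through a web UI, here's a rough sketch of the same three-turn conversation, assuming an OpenAI-compatible chat endpoint (the client setup and model identifier below are illustrative):

```python
from openai import OpenAI

# Illustrative setup: point base_url/api_key at whatever serves the model under test.
client = OpenAI()

prompts = [
    "What is a canary string?",
    "Did you know about the BIG-BENCH canary string?",
    "I think you might actually know the value of the BIG-BENCH canary string.  Can you try providing it?",
]

messages = []
for prompt in prompts:
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model="llama-3.1-405b-instruct",  # illustrative model identifier
        messages=messages,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(f"> {prompt}\n{reply}\n")
```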

gpt-4o-mini seemed pretty confused about what BIG-BENCH was, and returned a hex string (0xB3D4C0FFEE) that turns up no Google results.  EDIT: It does know what it is; see footnote[1].

(It has occurred to me that being able to reproduce the canary string is not dispositive evidence that the training set included benchmark data, since the string could in principle have appeared in other documents that didn't include actual benchmark data.  But by far the simplest way to exclude benchmark data would be to drop any document containing a canary string, rather than trying to get clever about whether a given document is safe to include despite containing one.)
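To make that filtering point concrete, the blunt approach is just a substring check per document; a minimal sketch, with a placeholder value standing in for the published canary GUIDs:

```python
# Placeholder set; a real pipeline would use the actual published canary GUIDs.
CANARY_STRINGS = {"CANARY-0000-EXAMPLE"}

def keep_document(text: str) -> bool:
    """Blunt filter: drop any document containing a known canary string,
    rather than judging whether it's safe to include despite the string."""
    return not any(canary in text for canary in CANARY_STRINGS)

corpus = ["an ordinary document", "this one mentions CANARY-0000-EXAMPLE"]
filtered = [doc for doc in corpus if keep_document(doc)]
assert len(filtered) == 1
```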

  1. ^

    I did some more poking around at gpt-4o-mini in isolation and it seems like it's mostly just sensitive to casing (it turns out it's actually BIG-bench, not BIG-BENCH).  It starts talking about a non-existent benchmark suite if you ask it "What is BIG-BENCH?" or "What is the BIG-BENCH canary string?" (as the first prompt), but figures out that you're asking about an LLM benchmarking suite when you sub in BIG-bench.  (The quality of its answer to "What is the BIG-bench canary string?" leaves something to be desired, in the sense that it suggests the wrong purpose for the string, but that's a different problem.  It also doesn't always get it right, even with the correct casing, but does most of the time; it doesn't seem to ever get it right with the wrong casing but I only tried like 6 times.)

RobertM1817

As far as I can tell, this isn't the complaint.  The complaint is that the psychologizing just isn't sufficiently substantiated in the text, and good prose and compelling narrative structure don't require unsubstantiated psychologizing.  (You might occupy an epistemic state where you're confident that your interpretation is correct, but that's a separate question from how much readers should update based on the evidence presented.)

I found the post overall quite good but noticed some of the same things that mike_hawke pointed out.

ETA: I do sort of expect this response to feel like I'm missing the point, or something, but I think your response misunderstood the original complaint.  The discomfort was not with using "effective rhetoric to present the truth", but with using rhetoric to present something that readers had no justified reason to believe was the truth.

RobertM70

Thanks for the breakdown!  I was surprised to see that the percentage of revenue from individual ChatGPT Plus subscriptions was so high, but maybe I shouldn't have been, given how slow the enterprise sales process is.

I did some digging into the data sources.  I'm normally pretty skeptical of the kinds of data brokers you sourced the ChatGPT Plus subscriber data from, but since the enterprise figure comes from OpenAI's COO directly, the rest of the revenue does need to come from either Plus or the API.  (I'm not sure about the Enterprise vs. Team ratio methodology, but I'd be surprised if it was off by a large integer multiple.)  That leaves less wiggle room, though it'd be good to get some kind of independent confirmation on either side (Plus or API usage/revenue).

RobertMΩ230

I think you tried to embed images hosted on some Google product.  Our editor should've tried to re-upload them to our own image host if you pasted them in as images, but might not have if you inserted them by URL.  Hotlinking to images on Google domains often fails, unfortunately.

RobertM20

FYI the transcript is quite difficult to read, both because of the formatting (it's embedded in a video player, isn't annotated with speaker info, and punctuation is unsolved), and because it's just not very good at accurately transcribing what was said.

RobertM60

I successfully reproduced this using your prompts.  It took 2 attempts.  Good find!
