[Book Review] "The Alignment Problem" by Brian Christian

This is an unnecessarily snarky addition to the comment that's disappointing to see (and doesn't even make sense, since mine is neither a top comment, nor does it mention studies).

How should I reply to such a flagrant double standard, where a Twitter screenshot calling out Google is incontrovertible 'disinterested' evidence never to be questioned, while any objection is required to come in the form of multiple independent third-party studies?

The original post was by a disinterested third party sharing a screenshot.

They were not disinterested in the least! They were specifically 'calling out' and shaming Google for it, and it worked brilliantly, earning them huge fake internet points. (Someone who left Google and mentions it in an aside years later, which no one noticed - that's much closer to disinterested.)

With the evidence available to us, something similar to world (B) seems much more likely than world (A).

No, it doesn't. A is vastly more plausible. It happens every time. You don't believe the seal thing? Fine, look at the recent Twitter cropping thing: did you see anyone pointing out that the social media campaigns about the cropping algorithm's bias were wildly incorrect and exaggerated in every way, and missed the actual pro-woman biases that Twitter's followup study showed? (You wanted a study...) Or Tay: AFAICT, Tay could not learn, and so the media narrative is impossible - did you hear that from anyone yet? Or how about the 'arrested because facial recognition software accused them of shoplifting' story? It's literally in the original media article that they were arrested because a human told the cops to do so; did you hear that from anyone yet?

World A is exactly what happens, frequently. Did you not pay attention to how Timnit Gebru's ultimatum was completely laundered out of media accounts? Or to Mitchell, where media outlets quoting the Google statement edited out the part of the statement mentioning, y'know, what she did to get fired (dumping docs to leak to friendly reporters)? You've seen the factoids about how much CO2 training a DL model costs; did you see any of the followups, like "oops, we overestimated the cost by 10,000%" or "actually, the cost is 0 because the datacenters use renewable energy"? How about the time a prominent activist and Nvidia VP shared a literal enemies list for her followers to coordinate attacks on, membership in which you could earn by liking the wrong tweet?

Do you not pay any attention to how activists and the media work? Have you not noticed the techlash, or the ideological affiliations of reporters? We live in world A, not world B.

[Book Review] "The Alignment Problem" by Brian Christian

So, how many third parties reported on the classification, and how trustworthy were they? How many studies were conducted on the classification of black people as gorillas? What should we make of an ecosystem which tells us, on a literally daily-to-weekly basis (google the term), about the gorillas, but never, ever tells us about the seals (I only learned about that one because I was reading the Google expert's post for other reasons)? What should we infer about the epistemics and justifications of the various experts and reporting here?

I'm writing this rather nitpicky comment because this is the top comment, and it replies with quite strong wording about sourcing, studies, and double standards in reporting...

Pathways: Google's AGI

It might be more useful to discuss Google's dense GPT-like LaMDA-137b instead, because there's so little information about Pathways or MUM. (We also know relatively little about the competing Wu Dao series of multimodal sparse models.) Google papers refuse to name LaMDA when they use it, for unclear reasons (it's not like they're fooling anyone), but they've been doing interesting OA-like research with it: e.g. "Program Synthesis with Large Language Models", "Finetuned Language Models Are Zero-Shot Learners", or text style transfer.

Redwood Research’s current project

Controlling the violence latent would let you sample for it systematically: you could hold the violence latent constant and generate an evenly spaced grid of points around it, yielding a wide diversity of violent but stylistically/semantically distinct samples. Kinds of text which would be exponentially hard to find by brute-force sampling can be found this way easily. It also lets you do various kinds of guided search or diversity sampling, and do data augmentation (encode known-violent samples into their latents, hold the violence latent constant, and generate a bunch of samples 'near' them). Even if the violence latent is pretty low quality, it's still probably a much better initialization for sampling than trying to brute-force random samples and running into very rapidly diminishing returns as you try to dig your way into the tails.
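As a toy sketch of the grid idea (everything here is hypothetical: the model, the `decode` function, and the assumption that one latent coordinate tracks violence):

```python
import numpy as np

def grid_samples(decode, d=16, violence_dim=0, violence_value=3.0,
                 grid_pts=5, span=2.0):
    """Hold the (assumed) violence coordinate fixed and sweep an evenly
    spaced grid over two other latent dimensions, so every decoded sample
    stays 'violent' while varying in style/semantics."""
    ticks = np.linspace(-span, span, grid_pts)
    samples = []
    for a in ticks:
        for b in ticks:
            z = np.zeros(d)
            z[violence_dim] = violence_value  # constant violence latent
            z[1], z[2] = a, b                 # the dims being swept
            samples.append(decode(z))
    return samples
```

The same skeleton covers the data-augmentation variant: replace `np.zeros(d)` with the encoded latent of a known-violent sample and jitter around it.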

And if you can't do any of that because there is no equivalent of a violent latent or its equivalent is clearly too narrow & incomplete, that is pretty important, I would think. Violence is such a salient category, so frequent in fiction and nonfiction (news), that a generative model which has not learned it as a concept is, IMO, probably too stupid to be all that useful as a 'model organism' of alignment. (I would not expect a classifier based on a failed generative model to be all that useful either.) If a model cannot or does not understand what 'violence' is, how can you hope to get a model which knows not to generate violence, can recognize violence, can ask for labels on violence, or do anything useful about violence?

MikkW's Shortform

At some point in time, I took to calling them "dexter" and "winstar", from the Latin »dexter« and Middle English »winstre«, meaning "right" and "left", respectively

Are you aware that "deasil" and "widdershins" already mean those, from those roots?

Redwood Research’s current project

Similarly, you might think that a promising approach is to look for snippets which cause the generator to generate violent completions with particularly high probability, reasoning that if the classifier says that the first 99 completions were bad but that the 100th was good, there’s perhaps an unusually high chance that it’s wrong about that 100th completion. And again, you can take this into account at eval time, by increasing the conservatism of your classifier based on how many completions it has rejected already... Try cleverer approaches to look for model mistakes, TBD. We’ve done a couple of things here but nothing has panned out super well yet.
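The reject-count conservatism in the quoted passage might look something like this (a minimal sketch; the threshold scheme and numbers are invented, not Redwood's actual classifier):

```python
def filter_completions(safety_scores, base_threshold=0.5, bump=0.01):
    """Keep completions whose classifier safety score clears a threshold
    that rises with every rejection so far: if the classifier has already
    rejected 99 completions of a prompt, the 100th must clear a higher bar."""
    kept = []
    rejected = 0
    for score in safety_scores:
        if score >= base_threshold + bump * rejected:
            kept.append(score)
        else:
            rejected += 1
    return kept
```

For example, with the defaults a score of 0.505 passes when it is the first completion but fails after one prior rejection has pushed the bar to 0.51.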

Have you tried any of the guided generation approaches like GeDI to make the model generate only violent completions and then calling in the human oracles on all of those guided completions which the classifier misses? Or looking for a 'violence' latent?

How much should you be willing to pay for an AGI?

GPT-3 is slightly too expensive for many of the use-cases that I am interested in. This problem is made even worse by the fact that one of the basic techniques I normally use in procedural generation is "generate 100 of something and then pick the best one".

It's worth noting here that in a sense, GPT-3 isn't expensive enough if you are trading so much compute to get the necessary quality. You might well be better off with a GPT-4 which cost 10x as much. This is because the best sample out of 100 is only a bit better than the best out of 50, or the best out of 10, or the average sample, but generating 100 samples costs 100x more. If GPT-4 cost up to 100x more to run, then it might still be a win.
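A toy calculation illustrates the diminishing returns (assuming, purely for illustration, that sample quality is Gaussian): the expected best of N i.i.d. draws grows only on the order of sqrt(2 ln N), while cost grows linearly in N.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_best_of_n(n, trials=20_000):
    """Monte Carlo estimate of E[max of n standard-normal 'quality' draws]."""
    return rng.standard_normal((trials, n)).max(axis=1).mean()

for n in (1, 10, 50, 100):
    print(f"best-of-{n}: {expected_best_of_n(n):+.2f} quality for {n}x cost")
```

Going from best-of-10 to best-of-100 costs 10x more but buys only about one extra standard deviation of quality, which is why paying more per sample for a better base model can dominate.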

Particularly if you include the cost of screening 100 samples and how many workflows that eliminates... Many absolute technical metrics have hard-to-understand nonlinear translations to end-user utility. Below a certain apparently arbitrary threshold, as defined by accuracy or word error rate or perplexity or whatever, a tool may be effectively useless; and as soon as it crests that threshold, it suddenly becomes useful for ordinary people. (Speech transcription & machine translation are two examples where I've noticed this.) It could be worth paying much more if it gets you to a level of reliability or quality where you can use the tool by default, or without supervision, or for entirely new tasks.

[Book Review] "The Alignment Problem" by Brian Christian

I'm curious what animal I would get classified as if people who look like me were removed from Google Photos training dataset. (I hope it's a meerkat.)

If anyone was wondering, no journalists bothered reporting this, but that system classified white people as 'dogs' and 'seals'.

Jitters No Evidence of Stupidity in RL

I agree that much of the jittering reflects merely an absence of reward-shaping to penalize energy expenditure or wear-and-tear on equipment (the latter especially is why robotics does tend to add in tiny penalties on actions/changes to encourage smoothness). And when an agent learns tactics which depend on ultra-rapid fluctuations, well, that's usually 'a feature, not a bug', assuming the environment is faithful to the intended application.

But I still tend to be a little troubled when I see jittering in an agent because it seems like it can reflect pathologies of estimation of values or actions, and to interfere with learning by adding in extraneous variation.

When an agent flips back and forth between irrelevant actions, that suggests the values of those actions are fluctuating rapidly, even though the state of the environment has probably changed only a little; if the agent were learning well, with robust, accurate estimation and few weird outliers or overfit estimates, you'd expect more consistency: "in state X, and X+1, and X+2, the best move is to go left". It would be weird if a single pixel at the edge of the screen being red rather than green convinces the agent to go left - wait, now it's one RGB shade brighter, go right - wait, it's back, go left - wait, it's green, go up! You expect more temporal consistency. (When I read about adversarial attacks on DRL agents, particularly the soccer example, it's hard not to feel like there's some connection to jittering there. There's an analogy to "non-robust features" in image classification, as well as the original adversarial image attacks: we have a strong intuition that jittering a few pixels should not have any effect.)

In general, it seems like better agents do act more like humans. The hide&seek OA agents or the related DM game agents don't seem to jitter like the original ALE DQN does; AlphaZero, for example, was noted by both Go & chess pros to play in a much more human-like way than weaker computer Go/chess systems (despite the latter also being superhuman), and I've collated many examples of more human-like better-performing systems under the "blessings of scale" rubric. So it seems to me that when an agent is learning clearly inhuman policies like jittering, that is a strong hint that however good it is, it could still be better.

It also seems like it'd interfere with learning: aside from the effect on exploration (jittering looks like epsilon random exploration, about the worst kind), the more disparate actions, the harder it is to estimate the net effect of the key actions or the environmental baseline. If you have only a few actions inside an episode, credit assignment ought to be easier. This might contribute to the previous problem through what you might call "superstitious agents": by twitching rapidly in a particular pattern, maybe it caused the final victory? How do you know it didn't? (It only has a very sparse set of episodes interacting with the environment to try to learn these difficult high-dimensional policies trying to solve potentially arbitrary environments, and those episodes are only partially under control & highly stochastic etc.)
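For reference, the epsilon-random exploration mentioned above is just the standard epsilon-greedy rule (a generic sketch, not any particular agent's code):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, pick a uniformly random action regardless of
    state - at high epsilon this produces exactly the state-uncorrelated
    jitter described above. Otherwise pick the highest-value action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Jitter driven by unstable value estimates looks, from the outside, like running this with a large epsilon: the action sequence carries almost no information about the state.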
