Zack_M_Davis

Comments

Sorry, this doesn't make sense to me. The boundary doesn't need to be smooth in an absolute sense in order to exist and be learnable (whether by neural nets or something else). There exists a function from business plans to their profitability. The worry is that if you try to approximate that function with standard ML tools, then even if your approximation is highly accurate on any normal business plan, it's not hard to construct an artificial plan on which it won't be. But this seems like a limitation of the tools; I don't think it's because the space of business plans is inherently fractally complex and unmodelable.
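To illustrate the "limitation of the tools" point: when the approximator is differentiable, constructing such an artificial input is often as simple as gradient ascent on the input itself. A minimal sketch, where everything is hypothetical: `model` stands in for any trained regressor from plan features to predicted profitability, and `x` for an ordinary plan it handles well.

```python
import torch

def adversarial_plan(model, x, step_size=0.01, n_steps=100):
    """Gradient-ascend on the input to inflate the model's predicted
    profitability, yielding an 'artificial plan' that the learned
    approximation scores highly even though nothing about the construction
    tracks true profitability."""
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        pred = model(x_adv)
        pred.sum().backward()                       # d(prediction)/d(input)
        with torch.no_grad():
            x_adv += step_size * x_adv.grad.sign()  # FGSM-style signed step
        x_adv.grad.zero_()
    return x_adv.detach()
```

The same construction goes through against any differentiable surrogate; the fragility is in how the boundary was fit, not in the underlying function.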

Unless you do conditional sampling of a learned distribution, where you constrain the samples to be in a specific a-priori-extremely-unlikely subspace, in which case sampling becomes isomorphic to optimization in theory

Right. I think the optimists would say that conditional sampling works great in practice, and that this bodes well for applying similar techniques to more ambitious domains. There's no chance of this image being in the Stable Diffusion pretraining set:

One could reply, "Oh, sure, it's obvious that you can conditionally sample a learned distribution to safely do all sorts of economically valuable cognitive tasks, but that's not the danger of true AGI." And I ultimately think you're correct about that. But I don't think the conditional-sampling thing was obvious in 2004.
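To make the quoted "sampling becomes isomorphic to optimization" point concrete, here's a toy sketch; the Gaussian and the threshold are arbitrary stand-ins of mine for a learned distribution and an a-priori-extremely-unlikely subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "learned distribution": a standard Gaussian. The a-priori-extremely-
# unlikely subspace is the event x > 6, which has probability ~1e-9.
THRESHOLD = 6.0

def rejection_sample(n_draws=1_000_000):
    """Naive conditional sampling by rejection: the expected number of draws
    needed is ~1/P(x > 6) ≈ 1e9, so a million draws almost never hit."""
    xs = rng.standard_normal(n_draws)
    return xs[xs > THRESHOLD]

# The conditional density p(x | x > 6) is overwhelmingly concentrated just
# above 6 -- i.e., near argmax_{x > 6} p(x). Any sampler that actually works
# in this regime has to behave like an optimizer over the constraint set,
# which is the sense in which conditioning on a vanishingly unlikely
# subspace makes sampling isomorphic to optimization.
print(len(rejection_sample()))  # almost certainly 0 hits out of a million
```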

I agree, but I don't see why that's relevant? The point of the "Adversarial Spheres" paper is not that the dataset is realistic, of course, but that studying an unrealistically simple dataset might offer generalizable insights. If the ground truth decision boundary is a sphere, but your neural net learns a "squiggly" ellipsoid that admits adversarial examples (because SGD is just brute-forcing a fit rather than doing something principled that could notice hypotheses on the order of, "hey, it's a sphere"), that's a clue that when the ground truth is something complicated, your neural net is also going to learn something squiggly that admits adversarial examples (where the squiggles in your decision boundary predictably won't match the complications in your dataset, even though they're both not-simple).
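For concreteness, a minimal sketch in the spirit of that setup (not the paper's dimensions, architecture, or attack; all hyperparameters below are my own arbitrary choices): train a small net to distinguish two concentric spheres, then search along the inner sphere for a point the net calls "outer". Any such point is an adversarial example witnessing squiggles in the learned boundary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM, R_INNER, R_OUTER = 50, 1.0, 1.3

def sample_sphere(n, radius):
    """Uniform points on the radius-`radius` sphere in DIM dimensions."""
    x = torch.randn(n, DIM)
    return radius * x / x.norm(dim=1, keepdim=True)

# Small MLP classifier: label 0 = inner sphere, 1 = outer sphere.
net = nn.Sequential(nn.Linear(DIM, 256), nn.ReLU(), nn.Linear(256, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(2000):
    x = torch.cat([sample_sphere(256, R_INNER), sample_sphere(256, R_OUTER)])
    y = torch.cat([torch.zeros(256, dtype=torch.long),
                   torch.ones(256, dtype=torch.long)])
    opt.zero_grad()
    loss_fn(net(x), y).backward()
    opt.step()

# Adversarial search: stay exactly on the inner sphere (so the ground-truth
# label is unambiguously "inner") while gradient-ascending the "outer" logit.
x_adv = sample_sphere(1, R_INNER).requires_grad_(True)
for _ in range(500):
    outer_logit = net(x_adv)[0, 1]
    grad, = torch.autograd.grad(outer_logit, x_adv)
    with torch.no_grad():
        x_adv += 0.01 * grad
        x_adv *= R_INNER / x_adv.norm()  # project back onto the inner sphere

print(net(x_adv).argmax(dim=1))  # tensor([1]) if a misclassified point was found
```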


This is great work, but I'm a bit disappointed that x-risk-motivated researchers seem to be taking the "safety"/"harm" framing of refusals seriously. Instruction-tuned LLMs doing what their users ask is not unaligned behavior! (Or at best, it's unaligned with corporate censorship policies, as distinct from being unaligned with the user.) Presumably the x-risk-relevance of robust refusals is that having the technical ability to align LLMs to corporate censorship policies and against users is better than not even being able to do that. (The fact that instruction-tuning turned out to generalize better than "safety"-tuning isn't something anyone chose, which is bad, because we want humans to be actively choosing AI properties as much as possible, rather than being at the mercy of which behaviors happen to be easy to train.) Right?

Doomimir: No, it wouldn't! Are you retarded?

Simplicia: [apologetically] Well, actually ...

Doomimir: [embarrassed] I'm sorry, Simplicia Optimistovna; I shouldn't have snapped at you like that.

[diplomatically] But I think you've grievously misunderstood what the KL penalty in the RLHF objective is doing. Recall that the Kullback–Leibler divergence $D_{\mathrm{KL}}(P \parallel Q)$ represents how surprised you'd be by data from distribution $P$ that you expected to be from distribution $Q$.

It's asymmetric: it blows up when the data is very unlikely according to $Q$, which amounts to seeing something happen that you thought was nearly impossible, but not when the data is very unlikely according to $P$, which amounts to not seeing something that you thought was reasonably likely.
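Spelling out the asymmetry with the standard definition (writing $P$ for the distribution the data actually comes from and $Q$ for the one you expected):

$$D_{\mathrm{KL}}(P \parallel Q) = \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right]$$

The expectation is taken under $P$, so a point with non-negligible $P(x)$ but $Q(x) \approx 0$ contributes an enormous log-ratio term, while a point with $P(x) \approx 0$ contributes essentially nothing, however large $Q(x)$ is.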

We—I mean, not we, but the maniacs who are hell-bent on destroying this world—include a penalty term in the RL objective because they don't want the updated policy to output tokens that would be vanishingly unlikely coming from the base language model.
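Concretely, in the usual notation (with $\pi_\theta$ the policy being trained, $\pi_{\mathrm{base}}$ the base language model, $r$ the reward model, $\beta$ the penalty coefficient, and $x$ a prompt from the training distribution $\mathcal{D}$), the penalized objective is

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{base}}(y \mid x)} \right],$$

and because the divergence runs in the direction $D_{\mathrm{KL}}(\pi_\theta \parallel \pi_{\mathrm{base}})$ rather than the reverse, the penalty explodes exactly when $\pi_\theta$ puts mass on completions that $\pi_{\mathrm{base}}$ treats as vanishingly unlikely.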

But your specific example of threats and promises isn't vanishingly unlikely according to the base model! Common Crawl webtext is going to contain a lot of natural language reasoning about threats and promises! It's true, in a sense, that the function of the KL penalty term is to "stay close" to the base policy. But you need to think about what that means mechanistically; you can't just reason that the webtext prior is somehow "safe" in a way that means staying KL-close to it is safe.

But you probably won't understand what I'm talking about for another 70 days.

Just because the defendant is actually guilty, doesn't mean the prosecutor should be able to get away with making a tenuous case! (I wrote more about this in my memoir.)

I affirm Seth's interpretation in the grandparent. Real-time conversation is hard; if I had been writing carefully rather than speaking extemporaneously, I probably would have managed to order the clauses correctly. ("A lot of people think criticism is bad, but one of the secret-lore-of-rationality things is that criticism is actually good.")

I am struggling to find anything in Zack's post which is not just the old wine of the "just" fallacy [...] learned more about the power and generality of 'next token prediction' etc than you have what they were trying to debunk.

I wouldn't have expected you to get anything out of this post!

Okay, if you project this post into a one-dimensional "AI is scary and mysterious" vs. "AI is not scary and not mysterious" culture war subspace, then I'm certainly writing in a style that mood-affiliates with the latter. The reason I'm doing that is because the picture of what deep learning is that I got from being a Less Wrong-er felt markedly different from the picture I'm getting from reading the standard textbooks, and I'm trying to supply that diff to people who (like me-as-of-eight-months-ago, and unlike Gwern) haven't read the standard textbooks yet.

I think this is a situation where different readers need to hear different things. I'm sure there are grad students somewhere who already know the math and could stand to think more about what its power and generality imply about the future of humanity or lack thereof. I'm not particularly well-positioned to help them. But I also think there are a lot of people on this website who have a lot of practice pontificating about the future of humanity or lack thereof, who don't know that Simon Prince and Christopher Bishop don't think of themselves as writing about agents. I think that's a problem! (One which I am well-positioned to help with.) If my attempt to remediate that particular problem ends up mood-affiliating with the wrong side of a one-dimensional culture war, maybe that's because the one-dimensional culture war is crazy and we should stop doing it.

For what notion is the first problem complicated, and the second simple?

I might be out of my depth here, but—could it be that sparse parity with noise is just objectively "harder than it sounds" (because every bit of noise inverts the answer), whereas protein folding is "easier than it sounds" (because if it weren't, evolution wouldn't have solved it)?

Just because the log-depth xor tree is small, doesn't mean it needs to be easy to find, if it can hide amongst vastly many others that might have generated the same evidence ... which I suppose is your point. (The "function approximation" frame encourages us to look at the boolean circuit and say, "What a simple function, shouldn't be hard to noisily approximate", which is not exactly the right question to be asking.)
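To make "every bit of noise inverts the answer" concrete, here's a minimal data-generating sketch of noisy sparse parity; the dimension, sparsity, and noise rate are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BITS, K_SPARSE, NOISE_RATE, N_SAMPLES = 50, 3, 0.1, 10_000

# The hidden structure: the label is the XOR (parity) of a secret subset of
# K_SPARSE coordinates out of N_BITS -- tiny as a circuit, but it has to be
# found among C(50, 3) = 19,600 equally plausible candidate subsets.
secret = rng.choice(N_BITS, size=K_SPARSE, replace=False)

X = rng.integers(0, 2, size=(N_SAMPLES, N_BITS))
clean_labels = X[:, secret].sum(axis=1) % 2

# Label noise: each flipped label inverts the parity outright rather than
# merely perturbing it, so no single example can be trusted and the secret
# subset has to be pinned down statistically from among the ~20,000
# candidates that could have generated similar-looking evidence.
flips = rng.random(N_SAMPLES) < NOISE_RATE
y = np.where(flips, 1 - clean_labels, clean_labels)
```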

This comment had been apparently deleted by the commenter (the comment display box having a "deleted because it was a little rude, sorry" deletion note in lieu of the comment itself), but the ⋮-menu in the upper-right gave me the option to undelete it, which I did because I don't think my critics are obligated to be polite to me. (I'm surprised that post authors have that power!) I'm sorry you didn't like the post.
