All of Chris_Leong's Comments + Replies

Agreed. Good counter-example.

I'm very curious as to whether Zac has a way of reformulating his claim to save it.

Well done for writing this up! Admissions like this are often hard to write.

Have you considered trying to use any credibility from helping to cofound Vast for public outreach purposes?

4 · the gears to ascension · 1d
So I hear. It wasn't particularly. Ah yes, I, the long-since-exited cofounder of the, uh, mildly popular sort-of-indie gig-economy-of-things-style-rentseeking-of-web-hosting-service used by ai people, should use my overflowing Credibility stat to convince impactful people that... ...they should work on adding something to the list "qaci, boundaries, and similar proposals"? hmm. idk, maybe. sounds more useful to say it without trying to make myself out to be anyone in particular. The people I'd want to convince are probably not the ones who'd be impressed by credentials of any kind.
1 · Demian Till · 2d
My bad! Yes since that's just one batch it does indeed come out as quadratic overall. I'll have a think about more efficient methods

Does GPT-4 directly handle the image input or is it converted to text by a separate model then fed into GPT-4?

Directly handles the image input. Transformers in general are quite flexible in what data they handle, but it may not have been trained to generate (or good at generating) image data.

A potential approach to tackle this could be to aim to discover features in smaller batches. After each batch of discovered features finishes learning we could freeze them and only calculate the orthogonality regularisation within the next batch, as well as between the next batch and the frozen features. Importantly we wouldn’t need to apply the regularisation within the already discovered features.

Wouldn't this still be quadratic?

1 · Demian Till · 2d
If n is the number of features we're trying to discover and m is the number of features in each batch, then I'm thinking the naive approach is O(n^2) while the batch approach would be O(m^2 + mn). Still quadratic in m, but we would have m << n
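
A numpy sketch of the batched penalty may help make the cost structure concrete (the `ortho_penalty` name and the made-up feature matrices are purely illustrative, not anyone's actual implementation):

```python
import numpy as np

def ortho_penalty(new_batch, frozen):
    # Orthogonality penalty over an m x d batch of new features and an
    # n x d matrix of already-discovered (frozen) features: pairs within
    # the batch and batch-vs-frozen pairs are penalized, but the
    # frozen-vs-frozen pairs are skipped entirely.
    nb = new_batch / np.linalg.norm(new_batch, axis=1, keepdims=True)
    fr = frozen / np.linalg.norm(frozen, axis=1, keepdims=True)
    within = nb @ nb.T
    within -= np.diag(np.diag(within))   # drop self-similarity terms
    across = nb @ fr.T
    # O(m^2 + m*n) work, vs O((m + n)^2) for the naive all-pairs penalty
    return (within ** 2).sum() / 2 + (across ** 2).sum()

m, n, d = 8, 64, 32
rng = np.random.default_rng(0)
penalty = ortho_penalty(rng.normal(size=(m, d)), rng.normal(size=(n, d)))
```

With m fixed per batch, each of the n/m batches costs O(m^2 + mn), which is where the O(mn)-ish overall saving comes from.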

You state that GPT-4 is multi-modal, but my understanding was that it wasn't natively multi-modal. I thought that the extra features like images and voice input were hacked on - ie. instead of generating an image itself it generates a query to be sent to DALL-E. Is my understanding here incorrect?

In any case, it could just be a matter of scale. Maybe these kinds of tasks are rare enough in terms of internet data that it doesn't improve the loss of the models very much to be able to model them? And perhaps the instruction fine-tuning focused on more practical tasks?

GPT-4 has vision multimodality, in terms of being able to take image input, but it uses DALL-E for image generation.

"Previous post" links to localhost.

I think it's helping people realise:

a) That change is happening crazily fast
b) That the change will have major societal consequences, even if it is just a period of adjustment
c) That the speed makes it tricky for society and governments to navigate these consequences

It's worth noting that there are media reports that OpenAI is developing agents that will use your phone or computer. I suppose it's not surprising that this would be their next step given how far a video generation model takes you towards this, although I do wonder how they expect these agents to operate with any reliability given the propensity of ChatGPT to hallucinate.

It seems like there should be a connection here with Karl Friston's active inference. After all, both you and his theory involve taking a predictive engine and using it to produce actions.

IIRC, there was also evidence that Copilot was modulating code quality based on name ethnicity variations in code docs.

You don't know where they heard that?

I'm not saying that people can't ground it out. I'm saying that if you try to think or communicate using really verbose terms it'll reduce your available working memory which will limit your ability to think new thoughts.


You can replace "optimal" with "artifact equilibrated under policy update operations"

I don't think most people can. If you don't like the connotations of existing terms, I think you need to come up with new terms and they can't be too verbose or people won't use them.

One thing that makes these discussions tricky is that the apt-ness of these names likely depends on your object-level position. If you hold the AI optimist position, then you likely feel these names are biasing people towards an incorrect conclusion. If you hold the AI pessimist position, you... (read more)

Yes, I agree that this is an impractical phrase substitution for "optimal." I meant to be listing "ways you can think about alignment more precisely" and then also "I wish we had better names for actual communication." Maybe I should have made more explicit note of this earlier in the essay. EDIT: I now see that you seem to think this is also an impractical thought substitution. I disagree with that, but can't speak for "most" people. 
On the actual object level for the word "optimal", people already usually say "converged" for that meaning and I think that's a good choice.

I don't think most people can. If you don't like the connotations of existing terms, I think you need to come up with new terms and they can't be too verbose or people won't use them.

I suspect that if they can't ground it out to the word underneath, then there should be ... some sort of way to make that concrete as a prediction that their model is drastically more fragile than their words make it sound. If you cannot translate your thinking into math fluently, then your thinking is probably not high enough quality yet, or so? And certainly I propose thi... (read more)

It might be useful to produce a bidirectional measure of similarity by taking the geometrical mean of the transference of A to B and of B to A.
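
Concretely, with made-up transference scores (the `bidirectional_similarity` name is just for illustration):

```python
import math

def bidirectional_similarity(t_ab, t_ba):
    # geometric mean of the transference of A to B and of B to A,
    # giving a single symmetric similarity score
    return math.sqrt(t_ab * t_ba)

# e.g. hypothetical scores: A transfers well to B, B poorly to A
score = bidirectional_similarity(0.9, 0.4)  # ≈ 0.6
```

The geometric mean punishes asymmetry more than the arithmetic mean would: one near-zero direction drags the combined score near zero.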

Really cool results!

This ties in nicely with Wittgenstein’s notion of language games. TLDR: Look at the role the phrase serves, rather than the exact words.

So, I agree there's something in common -- Wittgenstein is interested in "language games" that have function without having literal truth-about-predictions, and "believing in"s are games played with language that have function and that do not map onto literal truth-about-predictions.  And I appreciate the link in to the literature.

The main difference between what I'm going for here, and at least this summary of Wittgenstein (I haven't read Wittgenstein and may well be shortchanging him and you), is that I'm trying to argue that "believing in"s pay a specific kind of rent -- they endorse particular projects capable of taking investment, they claim the speaker will themself invest resources in that project, they predict that that project will yield ROI.

Like: anticipations (wordless expectations, that lead to surprise / not-surprise) are a thing animals do by default, that works pretty well and doesn't get all that buggy.  Humans expand on this by allowing sentences such as "objects in Earth's gravity accelerate at a rate of 9.8m/s^2," which... pays rent in anticipated experience in a way that "Wulky Wilkinsen is a post-utopian" doesn't, in Eliezer's example.  I'm hoping to cleave off, here, a different set of sentences that are also not like "Wulky Wilkinsen is a post-utopian" and that pay a different and well-defined kind of rent.
I am very surprised that a cursory Ctrl+F of the Anscombe translation of Wittgenstein's Philosophical Investigations, while turning up a few tracts discussing the use of the phrase "I believe", doesn't find a single instance of "I believe in". One instance of his discussion of "I believe" in Part 2, section x explores how the phrase distinguishes itself from merely stating a given hypothesis - analogous to prefixing a statement with "I say...", as in "I say it will rain today" (which recalls the Tractatus distinguishing the expression of a proposition from the proposition itself):
Good question, but I want to keep this anonymous. I can only say I heard it from one person who said they heard it from another person connected to people at DeepMind. If anyone else has connections with safety researchers at DeepMind, please do ask them to check. And post here if you can! Good to verify whether or not this claim is true.

One of the main challenges I see here is how to calibrate this. In other words, if I can't break a model despite adding an activation vector of strength x, what does this mean in terms of how safe we should consider the model to be? ie. How much extra adversarial prompting effort is that equivalent to or how should I modify my probabilities of the model being safe?
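
One way to picture the calibration question is as a sweep over strengths (everything here is a stand-in: the steering is a bare vector addition, and the norm-based `broke` check substitutes for a real safety eval):

```python
import numpy as np

def apply_steering(activations, direction, strength):
    # add a steering vector of the given strength; in practice this would
    # target a particular layer's residual stream, not a raw array
    unit = direction / np.linalg.norm(direction)
    return activations + strength * unit

def breaking_strength(activations, direction, broke, strengths):
    # smallest tested strength x at which the model "breaks" according to
    # whatever eval broke() stands in for; None if no tested strength breaks it
    for x in strengths:
        if broke(apply_steering(activations, direction, x)):
            return x
    return None

# toy stand-in: "broken" means the perturbed activations exceed a norm budget
x = breaking_strength(np.zeros(4), np.ones(4),
                      lambda a: np.linalg.norm(a) > 10, [1, 5, 20])
```

The calibration problem is then the mapping from that breaking strength x to a probability of safety (or an equivalent amount of adversarial prompting effort), which this sketch deliberately leaves open.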

Gary Marcus has criticised the results here:

What [C] is referring to is a technique called Bonferroni correction, which statisticians have long used to guard against “fishing expeditions” in which a scientist tries out a zillion different post hoc correlations, with no clear a priori hypothesis, and reports the one random thing that sorta vaguely looks like it might be happening and makes a big deal of it, ignoring a whole bunch of other similar hypotheses that failed. (XKCD has a great cartoon about that sort of situation.)

But that’s not what is going on

... (read more)
Thanks for pointing this out! I've added a note about it to the main post.

If I'm being honest, I don't see Beff as worthy of debating Yoshua Bengio.

4 · the gears to ascension · 20d
yeah, fair point. in that case, bengio and turner.

Also: It seems like there would be an easier way to get this observation that this post makes, ie. directly showing that kV and V get mapped to the same point by layer norm (excluding the epsilon).

Don't get me wrong, the circle is cool, but seems like it's a bit of a detour.
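
For what it's worth, the direct check is only a few lines (an idealized LayerNorm with no learned scale/bias and eps = 0, per the "excluding the epsilon" caveat):

```python
import numpy as np

def layer_norm(v, eps=0.0):
    # idealized LayerNorm: subtract the mean, divide by the standard
    # deviation; learned scale/bias omitted, eps = 0 for the exact case
    return (v - v.mean()) / np.sqrt(v.var() + eps)

v = np.array([1.0, -2.0, 0.5, 3.0])
print(np.allclose(layer_norm(v), layer_norm(3.7 * v)))  # True: positive scaling is normalized away
```

Since the mean and standard deviation both scale by k (for k > 0), kV and V map to the same point, with no circle required.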

Just to check I understand this correctly: from what I can gather it seems that this shows that LayerNorm is monosemantic if your residual stream activation is just that direction. It doesn't show that it is monosemantic for the purposes of doing vector addition where we want to stack multiple monosemantic directions at once. That is, if you want to represent other dimensions as well, these might push the LayerNormed vector into a different spline. Am I correct here?

That said, maybe we can model the other dimensions as random jostling in such as way that it all cancels out if a lot of dimensions are activated?

Yeah I think we have the same understanding here (in hindsight I should have made this more explicit in the post / title). I would be excited to see someone empirically try to answer the question you mention at the end. In particular, given some direction u and a LayerNormed vector v, one might try to quantify how smoothly rotating from v towards u changes the output of the MLP layer.  This seems like a good test of whether the Polytope Lens is helpful / necessary for understanding the MLPs of Transformers (with smooth changes corresponding to your 'random jostling cancels out' scenario, i.e. not needing to worry about Polytope Lens style issues).
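
A toy version of that experiment might look like the following (random weights stand in for a real transformer MLP, and the `slerp` rotation plus norm-of-delta metric are illustrative choices, not a claim about the right metric):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W1 = rng.normal(size=(d, 4 * d))
W2 = rng.normal(size=(4 * d, d))

def mlp(x):
    # toy ReLU MLP standing in for a transformer MLP block
    return np.maximum(x @ W1, 0.0) @ W2

def slerp(v, u, alpha):
    # rotate v towards u by fraction alpha of the angle between them,
    # keeping v's norm (so LayerNorm-style scale effects stay fixed)
    v_n, u_n = v / np.linalg.norm(v), u / np.linalg.norm(u)
    theta = np.arccos(np.clip(v_n @ u_n, -1.0, 1.0))
    mix = (np.sin((1 - alpha) * theta) * v_n + np.sin(alpha * theta) * u_n) / np.sin(theta)
    return mix * np.linalg.norm(v)

v, u = rng.normal(size=d), rng.normal(size=d)
outputs = [mlp(slerp(v, u, a)) for a in np.linspace(0.0, 1.0, 11)]
# successive output deltas: big jumps would suggest crossing a polytope boundary
deltas = [np.linalg.norm(outputs[i + 1] - outputs[i]) for i in range(10)]
```

On a real model one would rotate the post-LayerNorm input to an MLP block and look for discontinuities in the deltas as ReLU/GELU activation patterns flip.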
  1. What do you see as the low-hanging co-ordination fruit?
  2. Raising the counter-culture movement seems strange. I didn’t really see them as focused on co-ordination.
1. This is no less of a division of labor than alignment research. I like CFAR's work, AI/bot-augmented prediction markets, twitter's community notes, and having a million people read HPMOR in spite of tearing down lots of civilizational schelling fences. I do not stack with the people doing more crypto-focused stuff. I think that tuning cognitive strategies and Raemon's experiments have the lowest hanging fruit.
2. They were focused on making the world a better place, and a substantial subset (e.g. the hippies) were quite serious about it, but they were just lashing out with their eyes closed, due to lacking a drive to solve coordination problems or form accurate world models (the european enlightenment had the drive, but not the will to drive out nihilism). This was 60 years ago and popular revolutions weren't yet well-established as senseless lunacy like they are today; they didn't know that technical solutions were the way to go.

Do you think it is likely that techniques like RLHF result in over-developed persuasiveness relative to other capabilities? If so, do you think we can modify the training to make this less of an issue or that it is otherwise manageable?

I think RLHF probably doesn't particularly select for persuasiveness over competence at making it seem like the actual task the model was supposed to do was accomplished. And this competence will decompose into:

* Actually doing the task well
* Making it look like the task was done well even when it wasn't

The second one isn't a capability we want, but it doesn't particularly look like persuasiveness IMO.

Also: How are funding and attention "arbitrary" factors?

You mean where they said that it was unlikely to succeed?

Good on you for doing this research, but to me it's a lot less interesting because you had the supervisor say: "In theory you can send them fake protocol, or lie about the biosecurity risk level, but it's a gamble, they might notice it or they might not." Okay, they didn't explicitly say to lie, but they explicitly told the AI to consider that possibility.

1 · Igor Ivanov · 1mo
Totally agree. But in other cases, when the agent was discouraged from deceiving, it did so anyway.

Regardless of whether or not it's AI Safety Camp, I think it's important to have at least one intro-level research program, particularly because applications for programs like SERI MATS ask about previous research experience in the application.

I can see merit both in Oliver's views about the importance of nudging people down useful research directions and Linda's views on assuming that participants are adults. Still undecided on who I ultimately end up agreeing with, so would love to hear other people's opinions.

I appreciate the openness of your inquiry here.

Having just read through this, one key point that I haven't seen people mentioning is that the results are for LLMs that need to be jail-broken.

So these results are more relevant to the release of a model over an API rather than open-source, where you'd just fine-tune away the safeguards or download a model without safeguards in the first place.

I think it’s worth also raising the possibility of a Kuhnian scenario where the “mature science” is actually missing something and a further breakthrough is required after that to move it into a new paradigm.

I’m confused. Let’s assume that the button probably isn’t pressed at the start. Seems quite likely that the first agent proposes building a sub-agent that maximally pursues its utility if the button is pressed in the first round and maximally pursues the second agent's utility if it is not pressed in the first round. The second agent believes that the button is certainly not pressed during the first round, so it has no reason to negotiate further. If the button doesn’t get triggered in the first round (which it likely won’t if the sub-agents have had no ti... (read more)

I think the initial (2-agent) model only has two time steps, ie one opportunity for the button to be pressed. The goal is just for the agent to be corrigible for this single button-press opportunity. 

Yeah, I probably could have framed the post a bit better, but I don't really think that affects the core point.

Do you have any thoughts on what kind of experiments you’d like to see people run that would be more directly analogous?

Even if there are risks to using analogies for persuasion, we need analogies in order to persuade people. While a lot of people here are strong abstract thinkers, this is really rare; most people need something more concrete to latch onto. Uniform disarmament is a losing strategy, and it isn't justified here, as I don't think the analogies are as weak as you think. If you tell me what you consider to be the two weakest analogies above, I'm sure I'd be able to steelman at least one of them.

If we want to improve epistemics, a better strategy would probably... (read more)

Good post. One suggestion: you might want to further emphasise that this is likely a race condition where an AI would have to move fast lest it lose control to an AI produced after it.

I'd push back against the notion that this isn't possible; instead I'd just say that it's extremely challenging and will likely fail most of the time. Convincing people to come every week is a big commitment. Most people only have one, two, or maybe three weekly slots, so this involves beating out a lot of other things. The standard for persuading people to commit isn't "good", but "amazing". So, humanly possible, but you need to be exceptional.

Sure. But my question was whether we take the Metaculus timelines as indicative of the forecaster's averages or medians.

Thank you for your service!

:heart: Means a lot tbh

I agree that this post doesn't make a philosophical argument for its position, but I don't require that for every post. I value it as an observation of how the EA movement has affected this particular person, and as criticism.

Just to make my position really clear: I never said this post needed to make a philosophical argument for its position, rather that if a post wasn't a philosophical argument we shouldn't judge it by the standards we apply to a philosophical argument.

Then I tried to figure out an alternative standard by which to judge this post.

In Defense of Values

I don't mean to be harsh, but if everyone in this community followed your advice, then the world would likely end. And you can call that the rational outcome if you want, but if that's the outcome, what value is rationality?

I don't like pressuring people, so in my AI Safety Movement Building, I try to only encourage people to do things if it's in line with their values and, while there is some advice I can offer here, I mostly just leave it to people to figure out their own values.

But we need people to choose to be prosocial and for tha... (read more)

1 · Causal Chain · 1mo
I interpret upvotes/downvotes as:

1. Do I want other people to read this post?
2. Do I want to encourage the author and others to write more posts like this?

And I favour this post for both of those reasons. I agree that this post doesn't make a philosophical argument for its position, but I don't require that for every post. I value it as an observation of how the EA movement has affected this particular person, and as criticism. A couple of strongly anti-EA friends of mine became so due to a similar moral burnout, so it's particularly apparent to me how little emphasis is put on mental health.

I suspect Bostrom would be receiving more attention if he hadn't written a certain email. Likely a combination of people distancing themselves from him, people feeling like he might not be the best choice of ambassador at the moment and Bostrom mostly laying low for the time being.

As for Eliezer,  he is selective about the podcasts he wishes to appear on. He might be selective about interviews as well. He also is less proactive about reaching out than he could be.

If you use ChatGPT, it can make it sound like it was written by a native speaker.

Yeah, I think most people are more grounded in the analytic tradition, so it could be useful to hear about anything we might be missing.

Oh, yes, climate, sorry. My English is a bit messy.

Nate and Eliezer both believe that humanity should not be attempting technical alignment at its current level of cognitive ability, and should instead pursue human cognitive enhancement (e.g., via uploading), and then having smarter (trans)humans figure out alignment.


It would be great if we were able to utilise cognitive enhancement in order to solve these problems.

On the other hand, I don't think we've tried the technique of "longer-term intensive program" to develop talent. As an example, it's very common for people to spend 3 or 4 years on a bache... (read more)

So you’re saying that Metaculus incentivises betting based on the median, not the average?

I don't mean to say that, but maybe it's entailed. (I don't know.)  I'm saying even if it were the same group of forecasters who forecast all three questions, and even if every one of those forecasters is internally consistent with themselves, it will not necessarily be the case that the median for the "how many months" question will be equal to the difference between the medians of the other two questions. 
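
A made-up three-forecaster example shows the gap:

```python
import statistics as st

# hypothetical forecasts: months until milestone A and months until milestone B,
# where forecaster i answers a[i] and b[i] and is internally consistent (b[i] > a[i])
a = [1, 2, 3]
b = [10, 4, 5]
gaps = [bi - ai for ai, bi in zip(a, b)]  # each forecaster's implied gap: [9, 2, 2]

print(st.median(b) - st.median(a))  # 3
print(st.median(gaps))              # 2 -- not the same number
```

The median of the differences tracks the typical forecaster's gap, while the difference of the medians can be driven by different forecasters at each median.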

Thanks for sharing.

Maybe I should have spoken more precisely. He wasn't telling individuals to be apolitical. It's more that he didn't think it was a good idea to center the rationalist community around politics, as it would interfere with the rationalist project; ie. that even with our community striving to improve our rationality, it'd still be beyond us to bring in discussions of politics without corrupting our epistemology.

So when I said "we couldn’t expect our rationality skills to be as helpful in determining truth in politics", I was actually primarily talking about the process of a community attempting to converge on the truth rather than an individual.

I don’t know man, it really seems to me that Eliezer was quite clear in Politics is the Mind-Killer that we couldn’t expect our rationality skills to be as helpful in determining truth in politics.

He didn't say anything like that in Politics is the Mind-Killer, quite the contrary:

"Politics is an important domain to which we should individually apply our rationality—but it’s a terrible domain in which to learn rationality, or discuss rationality, unless all the discussants are already rational."

"I’m not saying that I think we should be apolitical"

The main point of the post was to not shove politics where it's unnecessary, because it can have all these bad effects. I expect Eliezer agrees far more with the idea that Politics is hard mode, than the ide... (read more)

Great post! The concept of meta-strategies seems very useful to add to my toolkit.

I strongly recommend the AI Safety Fundamentals Course (either technical or policy). Having a better understanding of the problem will help you contribute with whatever time or resources you choose to dedicate to the problem.
