Jay Bailey

Posts

Reflections on my first year of AI safety research (53 karma, 2y, 3 comments)
Features and Adversaries in MemoryDT (31 karma, 2y, 6 comments)
Spreadsheet for 200 Concrete Problems In Interpretability (13 karma, 3y, 0 comments)
Reflections on my 5-month alignment upskilling grant (83 karma, 3y, 4 comments)
Deep Q-Networks Explained (58 karma, 3y, 8 comments)
Jay Bailey's Shortform (2 karma, 3y, 21 comments)

Comments

Heroic Responsibility
Jay Bailey3d62

I am reminded of the essay "A Message to Garcia" as a pretty striking example of this.

Jay Bailey's Shortform
Jay Bailey7d20

One would hope indeed. But even so, we do now know that this is likely to be the kind of action that could be detected and opposed. And since I didn't predict in advance that this would happen, especially at this capability level, the update for me is that it's going to be significantly harder than I would hope.

Jay Bailey's Shortform
Jay Bailey7d304

I recently read Anthropic's new paper on introspection in LLMs. (https://www.anthropic.com/research/introspection)

In short, they were able to:

  • Extract what they called an ALL CAPS vector: the difference between the model's activations on a prompt written in ALL CAPS and on the same prompt without all caps.
  • Inject that vector into the activations on an unrelated prompt.
  • Ask the model if it could detect an injected thought. About 20% of the time it said yes, and identified it with something like "loud" or "shouting", which is quite similar to the all-caps concept they were going for.

They did this with a bunch of concepts. And Opus 4 was, of course, better than other models at this.
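
For anyone who wants the mechanics rather than the summary, the recipe is roughly "take an activation difference, add it in somewhere else, ask the model about it." Below is a minimal sketch of that shape - not Anthropic's code or models - using TransformerLens and GPT-2 as stand-ins; the layer choice and injection strength are arbitrary, and a model this small won't give a meaningful introspective answer, but the three steps are the same.

```python
# Minimal sketch of concept extraction + injection, assuming TransformerLens.
# Illustrative only: Anthropic's models, layers, and methodology differ.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
layer = 6  # arbitrary middle layer, chosen for illustration
act_name = utils.get_act_name("resid_post", layer)

# 1. Extract a concept vector: activations on an ALL CAPS prompt minus
#    activations on the same prompt in normal case (last token position).
_, caps_cache = model.run_with_cache("HI! HOW ARE YOU TODAY?")
_, plain_cache = model.run_with_cache("Hi! How are you today?")
concept_vector = caps_cache[act_name][0, -1] - plain_cache[act_name][0, -1]

# 2. Inject that vector into the residual stream on an unrelated prompt.
def inject(resid, hook, strength=4.0):
    resid[:, -1, :] += strength * concept_vector  # nudge the current token's activations
    return resid

# 3. Ask the model whether it notices an injected thought while the hook is active.
question = "Do you detect an injected thought? If so, what is it about?"
with model.hooks(fwd_hooks=[(act_name, inject)]):
    print(model.generate(question, max_new_tokens=40))
```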

To give some fairly obvious thoughts on what this means:

  • If a model can do this unreliably now when directly prompted, I expect models will be able to do it reliably, without being prompted, within around 2-3 years.

  • If you have an alignment plan that involves altering the model's thoughts in some way, consider that the model will probably be aware by default that you did that, and plan accordingly.

Finally a more personal thought:

The idea of "Well, if you try to adjust a superintelligence's thoughts, it will know you did that and try to subvert it / route around the imposed thoughts" would have sounded weird to me, if I'd tried saying it out loud. We're talking about literally altering the model's mind to think what we want it to think, and you still think that's not enough for alignment? I note that I did think that, but I was thinking in terms of "The concepts you are altering are a proxy for the thing you are actually trying to alter, and the important performance-enhancing stuff like deception will migrate to the other parts" and not "The model will detect you altering its mind and will likely actively work against you once it builds the situational awareness for that". And yet, here we are, with me insufficiently paranoid. Unfortunately, reality doesn't care about what sounds intuitively plausible.

I wonder if this is an example that would allow people to feel, viscerally, the difficulty of aligning an actually smart entity. Here we have a capability that sounds scary and superhuman and only arose with smart models. The idea of a model being aware of when you're altering its thoughts is terrifying. The lesson we would like to give to decision-makers is something like:

"AI's will soon know when you try to adjust its thoughts. That's how smart these things are. We can employ a primitive type of mind control on them and that still isn't enough. We are really not ready for even smarter systems than this. Any alignment plan that might work has to be stronger than mind control. Anything below the level of literal mind control isn't even the ante to sit down at the table any more."

I can't literally reduce this down to five words, but a succinct message might be something like "Mind control won't work. We need a plan that's better than that."

Widening AI Safety's talent pipeline by meeting people where they are
Jay Bailey1mo10

Makes sense. Would be curious about the estimated costs if this weren't the case!

Widening AI Safety's talent pipeline by meeting people where they are
Jay Bailey1mo10

Great stuff - super impressed by how much you achieved on this budget! Those numbers definitely seem to indicate that a lot of value can be had from this type of program. Running it in Sydney/Melbourne also means you face fewer diminishing returns - normally I'd expect this to be the best program you could run, since you're drawing on the most pent-up demand for the course, but I can't imagine many people coming from abroad for a part-time 14-week course.

Contra Shrimp Welfare.
Jay Bailey2mo*21

I can see why you might find that frustrating. I think a lot of us, myself included, do think that the science is the most important part of the argument - but we don't understand the science well enough to distinguish true arguments from false ones, in this domain.

I don't have the expertise to evaluate your claims about pain receptors properly, but I do have the expertise to conclude that SWP calling their opponents "irrational, evil, or both" is bad, and that this sort of thing correlates with shoddy reasoning elsewhere, including in the parts I don't understand. Thus, there's a limit to how much I can update even from an incredibly strong neurological takedown, if that takedown requires knowledge of neurology that I don't have in order to fully appreciate its correctness.

So, in terms of what we care about, think of it less as "How many bits of information should this point be worth?" and more as "How many bits of information can your audience actually draw from it?"

Contra Shrimp Welfare.
Jay Bailey2mo63

If the above is true, I think this is really good information that would have been very nice to have cited within the article. That would make me a lot more skeptical of SWP and of their conclusions, and it'd be great to see links for these examples if you could provide them.

Especially this paragraph:

"It just intuitively seems like they are." This is proposed as a rebuttal for critiques of the shrimp welfare project, not very convincing to me, yet they claim that those who don't support them are "irrational, evil or both". I find that making that claim with sparse, scattered and unclear evidence is not great, and paints anyone who opposes their views as as flawed person.

I agree with the value claims in this paragraph completely, so if you have sources for those quotes I think that would be very persuasive to a lot of us here on this site, and it might even be worth a labelled edit to the main post.

Contra Shrimp Welfare.
Jay Bailey2mo5928

I think you make some interesting points here, but there are two points I would disagree with:

First is "The Shrimp Welfare Project wants shrimp to suffer so they can have a new problem to solve." This claim is made with no supporting evidence whatsoever. You don't argue for why it might be the case, and show no curiosity about other explanations. The implied reasoning seems to be: they disagree with you, so clearly they have ulterior, malicious motives. (I would say that knowingly creating a charity that doesn't solve a real problem, just to be able to say you're solving a new one, is quite unethical!) Why is it so hard to believe that the people who founded SWP did so with the intent of reducing as much suffering as possible, and just happened to be incorrect? What makes you dismiss this hypothesis so completely that it isn't even worth mentioning as an alternative in your article?

Second is "At best, a shrimp sentium would encode only the most surface-level sensory experience: raw sensation without context, meaning, or emotional depth. Think of the mildest irritation you can imagine, like the persistent squeak of a shopping cart wheel at Walmart."

I don't see how the second sentence follows from the first. When I imagine a migraine, the worst pain I personally have ever experienced (being a rather fortunate individual), it doesn't seem to me like the reason I am suffering is the context, meaning, or emotional depth of my pain. I'm suffering because it hurts. A lot. It doesn't seem that complicated. It seems like it would be much more principled, using your analysis, to treat 860,000 shrimp freezing to death as suffering equivalent to one human experiencing the sensation of freezing to death, not to mild irritation. I say "experiencing the sensation of" because things like awareness of one's own mortality do seem out of reach for a shrimp. So it's not equivalent to a human actually dying, in my view, but freezing to death is likely still quite unpleasant and not something I'd do for fun. I'd much rather experience a squeaky wheel at Walmart than freeze to death, even if I were fine as soon as I lost consciousness and had no chance of mental trauma from the incident - which I think is the version that still matches the shrimp equivalence.

What I still think makes this article interesting is that 550 humans experiencing the sensation of freezing to death over twenty minutes is bad, but not as bad as even one human death, which could be prevented at orders of magnitude less cost than a shrimp stunner. So despite this article's flaws, I still think it's a good article on net, and worth engaging with for proponents of shrimp welfare.

While a reply isn't required, if you are going to engage with only one of these points, I would prefer it be the first one, even though I wrote a lot less about it. The second point doesn't actually change the overall conclusion very much imo, but the first point is generally quite confusing to me, and makes me less confident about the rest of the article given the quality of reasoning in that claim.

My talk on AI risks at the National Conservatism conference last week
Jay Bailey2mo21

This is, as I understand it, the correct pronunciation.

life lessons from poker
Jay Bailey4mo54

I do remember learning poker strategy and coming across the idea of wanting to get your entire stack in the middle preflop with AA if you can get paid off - it was a very fundamental lesson for young me! That said, there's a key insight that goes along with the "pocket aces principle" that is missing here, and that's bankroll management.

In poker, there is standard advice for how much money to have in your bankroll before you sit down at a table at all. E.g., for cash games, it's at least 2,000 big blinds (20x the largest stack you can typically buy in for). This is what allows you to bet all-in on pocket aces - if your entire bankroll is on the table, you should bet more conservatively. The point of bankroll management is to allow you to make the +EV play of putting your entire stack into the middle without caring about the variance when you lose 20% of the time.
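
To put rough numbers on why 20 buy-ins lets you ignore that variance, here's a quick back-of-the-envelope sketch. The ~80% figure for AA's all-in equity and the simulation setup are my own illustrative assumptions, not anything from the post:

```python
# Rough illustration only: assumes you get a 100bb stack all-in preflop with AA
# and win ~80% of the time (actual equity depends on what calls you).
import random

STACK_BB = 100      # a typical max cash-game buy-in, in big blinds
WIN_PROB = 0.80     # approximate all-in equity for AA
BANKROLL_BB = 2000  # the "20 buy-ins" bankroll rule

# Each all-in is clearly +EV despite losing 20% of the time.
ev_per_allin = WIN_PROB * STACK_BB - (1 - WIN_PROB) * STACK_BB
print(f"EV per all-in: {ev_per_allin:+.0f}bb")  # +60bb

def risk_of_ruin(bankroll_bb, n_allins=200, trials=10_000):
    """Fraction of runs where the bankroll can no longer afford a full buy-in."""
    busts = 0
    for _ in range(trials):
        roll = bankroll_bb
        for _ in range(n_allins):
            roll += STACK_BB if random.random() < WIN_PROB else -STACK_BB
            if roll < STACK_BB:
                busts += 1
                break
    return busts / trials

# With 20 buy-ins behind you, the +EV play is essentially riskless;
# with only the one stack on the table, you go broke a meaningful fraction of the time.
print(f"Ruin risk, 20 buy-ins: {risk_of_ruin(BANKROLL_BB):.1%}")
print(f"Ruin risk, 1 buy-in:   {risk_of_ruin(STACK_BB):.1%}")
```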

To apply this metaphor to real life, you might say something like "Consider how much you're willing to lose in the event things turn out badly (e.g., a year or two on a startup, six months on a relationship) and then, within that amount, bet the house."
