All of amaury lorin's Comments + Replies

I'm surprised to hear they're posting updates about CoEm.

At a conference held by Connor Leahy, I said that I thought it was very unlikely to work, and asked why they were interested in this research area, and he answered that they were not seriously invested in it.

We didn't develop the topic and it was several months ago, so it's possible that 1. I misremember, 2. they changed their minds, or 3. I appeared adversarial and he didn't feel like debating CoEm. (For example, maybe he actually said that CoEm didn't look promising, and this changed recently?)
Still, anecdotal evidence is better than nothing, and I look forward to seeing OliviaJ compile a document to shed some light on it.

I invite you. You can send me this summary in private to avoid downvotes.

There's a whole part of the argument missing, which is the framing of this as being about AI risk.
I've seen various proposed explanations for why this happened, and the board being worried about AI risk is one of them, but not the most plausible afaict.

In addition, this is phrased similarly to technical problems like corrigibility, which it is very much not about.
People who say "why can't you just turn it off" typically refer to literally turning off the AI if it appears to be dangerous, which this is not about. This is about turning off the AI company, not the AI.

1- I didn't know Executive Orders could be repealed easily. Could you please elaborate?
2- Why is it good news? To me, this looks like a clear improvement on the previous state of regulation.

9Colin McGlynn1mo
Executive Orders aren't legislation. They are instructions that the White House gives to executive branch agencies. So the president can issue new executive orders that change or reverse older executive orders made by themselves or past presidents.

AlexNet dates back to 2012; I don't think previous work on AI can be compared to modern statistical AI.
Paul Christiano's foundational paper on RLHF dates back to 2017.
Arguably, all agent foundations work has turned out to be useless so far, so prosaic alignment work may be what Roko is taking as the beginning of AIS as a field.

When were convnets invented, again? How about backpropagation?

The AI safety leaders currently see slow takeoff as humans gaining capabilities, and this is true; and also already happening, depending on your definition. But they are missing the mathematically provable fact that information processing capabilities of AI are heavily stacked towards a novel paradigm of powerful psychology research, which by default is dramatically widening the attack surface of the human mind.

I assume you do not have a mathematical proof of that, or you'd have mentioned it. What makes you think it is mathematically provable?
I would be ve... (read more)

Yes, I thought for years that the research should be private. But as it turns out, most people in policy are pretty robustly not-interested in anything that sounds like "mind control", and the math is hard to explain, so if this stuff ends up causing a public scandal that damages the US's position in international affairs, it probably won't originate from here (e.g. it would get popular elsewhere, like the AI surveillance pipeline), so AI safety might as well be the people that profit off of it by open-sourcing it early.

It's actually statistical induction. When you have enough human behavioral data in one place, you can use gradient descent to steer people in measurable directions, provided the people remain in the controlled interactive environment that the data came from (and social media news feeds are surprisingly optimized to be that perfect controlled environment). More psychologists mean better-quality data-labeling, which means people can be steered more precisely.

I don't understand how the parts fit together. For example, what's the point of presenting the (t,n)-AGI framework or the Four Background Claims?

Newcomers to the AI Safety arguments might be under the impression that there will be discrete cutoffs, i.e. either we have HLAI or we don't. The point of (t,n)-AGI is to give a picture of what a continuous increase in capabilities looks like. It is also slightly more formal than simple word-based definitions of AGI. If you know of a more precise mathematical formulation of the notion of general and super intelligences, I would love it if you could point me towards it so that I can include it in the post. As for Four Background Claims, the reason for inclusion is to provide an intuition for why general intelligence is important, and for why, even though future systems might be intelligent, it is not the default case that they will either care about our goals or even follow our goals as intended by the designers.

I assume it's incomplete. It doesn't present the other 3 anchors mentioned, nor forecasting studies.

To avoid being negatively influenced by perverse incentives to make societally risky plays, couldn't TurnTrout just leave the handling of his finances to someone else and be unaware of whether or not he has Google stock?

Doesn't matter if he does, as long as he doesn't think he does; and if he's uncertain about it, I think psychologically it'll already greatly reduce caring about Google stock.

Everyone at Google gets unvested Google stock that can't be sold. It's going to be very hard to start believing that he doesn't own any Google stock.

Not before reading the link, but Elizabeth did state that they expected the pro-meat section to be terrible without reading it, presumably because of the first part.

Since the article is low-quality in the part they read and expected to be low-quality in the part they didn't, they shouldn't take it as evidence of anything at all; that is why I think it's probably confirmation bias to take it as evidence against excess meat being related to health issues.

Reason for retraction: In hindsight, I think my tone was unjustifiably harsh and incendiary. Also, the karma suggests that whatever I wrote probably wasn't that interesting.

You think that Elizabeth should have expected that taking an EA forum post with current score 87, written by "a vegan author and data scientist at a plant-based meat company", and taking "what looked like his strongest source", would yield a low-quality pro-vegan article?  I mean, maybe that's true, but if so, that seems like a harsher condemnation of vegan advocacy than anything Elizabeth has written.

A model is deceptively aligned with its designers. However, the designers have very good control mechanisms in place such that they would certainly catch the AI if it tried to act misaligned. Therefore, the model acts aligned with the designers' intentions 100% of the time. In this world, a model that is technically deceptively aligned may still be safe in practice (although this equilibrium could be fragile and unsafe in the long run).

In that case, there is no strategic deception (the designers are not misled by the AI).

I think we consider this ... (read more)

At a glance, I couldn't find any significant capability externality, but I think that all interpretability work should, as a standard, have a paragraph explaining why the authors don't think their work will be used to improve AI systems in an unsafe manner.

2Neel Nanda2mo
Whisper seems sufficiently far from the systems pushing the capability frontier (GPT-4 and co) that I really don't feel concerned about that here

Seeing as the above response wasn't very upvoted, I'll try to explain in simpler terms.
If 2+2 comes out 5 the one-thrillionth-and-first time we compute it, then our calculation does not match numbers.
... which we can tell because?
...and writing this now I realize why the answer was more upvoted, because this is circular reasoning. ':-s
Sorry, I have no clue.

Sounds like those people are victim of a salt-in-pasta-water fallacy.

It's also very old-fashioned. Can't say I've ever heard anyone below 60 say "pétard" unironically.

You might also assign different values to red-choosers and blue-choosers (one commenter I saw said they wouldn't want to live in a world populated only by people who picked red) but I'm going to ignore that complication for now.

Roko has also mentioned they think people choose blue for being bozos and I think it's fair to assume from their comments that they care less about bozos than smart people.

I'm very interested in seeing the calculations where you assign different utilities to people depending on their choice (and possibly, also depending on yours, like if you only value people who choose like you).

I'm not sure how I would work it out. The problem is that presumably you don't value one group more because they chose blue (it's because they're more altruistic in general) or because they chose red (it's because they're better at game theory or something). The choice is just an indicator of how much value you would put on them if you knew more about them.

Since you already know a lot about the distribution of types of people in the world and how much you like them, the Bayesian update doesn't really apply in the same way. It only works on what pill they'll take, because everyone is deciding with no knowledge of what the others will decide.

In the specific case where you don't feel altruistic towards people who chose blue specifically because of a personal responsibility argument ("that's their own fault"), then trivially you should choose red. Otherwise, I'm pretty confused about how to handle it. I think maybe only your level of altruism towards the blue choosers matters.

I mean, as an author you can hack through them like butter; it is highly unlikely that out of all the characters you can write, the only ones that are interesting will all generate interesting content iff (they predict) you'll give them value (and this prediction is accurate).

I strongly suspect the actual reason you'll spend half of your post's value on buying ads for Olivia (if in fact you do that, which is doubtful as well) is not that (begin proposition) she would only accept this trade if you did that because
- she can predict your actions (as in, you w... (read more)

0Christopher King4mo
Yeah, I think it's mostly of educational value. At the top of the post: "It might be interesting to try them out for practice/research purposes, even if there is not much to gain directly from aliens.".
0Christopher King4mo
In principle "staying true to your promise" is the enforcement mechanism. Or rather, the ability for agents to predict each other's honesty. This is how the financial system IRL is able to retrofund businesses. But in this case I made the transaction mostly because it was funny. I mean, I kind of have to now right XD. Even if Olivia isn't actually agent, I basically declared a promise to do so! I doubt I'll receive any retrofunding anyways, but that would just be lame if I did receive that and then immediately undermined the point of the post being retrofunded. And yes, I prefer to keep my promises even with no counterparty. But if you'd like to test it I can give you a PayPal address XD. Note that this is still very tricky, the mechanisms in this post probably won't suffice. Acausal Now II will have other mechanisms that cover this case (although the S.E.C. still reduces their potential efficiency quite a bit). (Also, do you have a specific trade in mind? It would make a great example for the post!)

This is mostly wishful thinking.
You're throwing away your advantages as an author to bargain with fictionally smart entities. You can totally void the deal with Olivia and she can do nothing about it because she's as dumb as you write her to be.
Likewise, the author writing about space-warring aliens writing about giant-cube-having humans could just consider aliens that have space wars without consideration for humans at all; you haven't given enough detail for the aliens' model of the humans to be precise enough that their behavior must depend on i... (read more)

1Christopher King4mo
This doesn't seem any different than acausal trade in general. I can simply "predict" that the other party will do awesome things with no character motivation. If that's good enough for you, then you do not need to acausally trade to begin with. I plan on having a less contrived example in Acausal Now II: beings in our universe but past the cosmological horizon. This should make it clear that the technique generalizes past fiction and is what is typically thought of as acausal trade.

For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.

There is a critical step missing here, which is when the trade-bot makes a "choice" between maximising money and satisfying preferences.
At this point, I see two possibilities:

  • Modelling the trade-bot as an agent does not break down: the trade-bot has an objective which it tries to optimize, plausibly maximising money (since that is what it was trained for) and probably not s
... (read more)

A new paper, built upon the compendium of problems with RLHF, tries to make an exhaustive list of all the issues identified so far: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

That sounds nice but is it true? Like, that's not an argument, and it's not obvious! I'm flabbergasted it received so many upvotes.
Can someone please explain?

It's somewhat of an applause light, being a paraphrase and extension of a Nick Bostrom quote.

Well, I wasn't interested because AIs were better than humans at go; I was interested because it was evidence of a trend of AIs being better than humans at some tasks, for its future implications on AI capabilities.
So from this perspective, I guess this article would be a reminder that adversarial training is an unsolved problem for safety, as Gwern said above. Still doesn't feel like all there is to it though.

4Radford Neal4mo
I think it may not be correct to shuffle this off into a box labelled "adversarial example" as if it doesn't say anything central about the nature of current go AIs. Go involves intuitive aspects (what moves "look right"), and tree search, and also something that might be seen as "theorem proving".  An example theorem is "a group with two eyes is alive".  Another is "a capture race between two groups, one with 23 liberties, the other with 22 liberties, will be won by the group with more liberties".  Human players don't search the tree down to a depth of 23 to determine this - they apply the theorem.  One might have thought that strong go AIs "know" these theorems, but it seems that they may not - they may just be good at faking it, most of the time.

To clarify: what I am confused about is the high AF score, which probably means that there is something exciting I'm not getting from this paper.
Or maybe it's not a missing insight, but I don't understand why this kind of work is interesting/important?

3Radford Neal4mo
Did you think it was interesting when AIs became better than all humans at go?  If so, shouldn't you be interested to learn that this is no longer true?

I'm confused. Does this show anything besides adversarial attacks working against AlphaZero-like AIs?
Is it a surprising result? Is that kind of work important for reproducibility purposes regardless of surprisingness?

5Richard Horvath4mo
I find it exciting for the following reason: these AIs are (were) thought to be in the universe of Go what AGI is expected to be in the actual world we live in: an overwhelming power no human being can prevail against, and especially not an amateur just following some weird tactic most humans could defeat. It seemed they had a superior understanding of the game's universe, but as per the article this is still not the same kind of understanding we see in humans. We may have overestimated our own understanding of how these systems work; this is an unexpected (confusing) outcome. Especially since it is implied that this is not just a bug in a specific system, but that other systems using similar architectures will possibly have the same defects. I think this is a data point towards an AI winter in the coming years rather than FOOM.
3amaury lorin4mo
To clarify: what I am confused about is the high AF score, which probably means that there is something exciting I'm not getting from this paper. Or maybe it's not a missing insight, but I don't understand why this kind of work is interesting/important?

You're making many unwarranted assumptions about an AI's specific mind, and showing a lot of confusion about semantics, which seems to indicate you should just read the Sequences. It'll be very hard to point out where you are going wrong because there's just too much confusion.

As an example, here's a detailed analysis of the first few paragraphs:

Intelligence will always seek more data in order to better model the future and make better decisions.

Unclear if you mean intelligence in general, and if so, what you mean by the word. Since the post is about AI, let's ... (read more)

We cannot select all companies currently looking to hire AI researchers. There are just too many of them, and most will just want to integrate ChatGPT into their software or something.
We're interested in companies making the kind of capabilities research that might lead to AI that poses an existential risk.

Do you suggest that we should consider all companies that employ a certain number of AI experts?

The suggestion was to consider all companies currently looking to hire -- because that function entails advertising, and advertising by its nature is difficult to hide from people like you trying to learn more about the company. (More precisely, the function (hiring) does not strictly entail advertising, but refraining from advertising will greatly slow down the execution of the function.) Because AI companies are in an adversarial relationship with us (the people who understand that AI research is very dangerous), we should expect them to gather information about us and to try to prevent us from gathering accurate information about them. It is possible that there currently exists a significantly-sized AI company that you and I know nothing about and cannot hope to learn anything about except by laborious efforts, such as having face-to-face conversations with hundreds of AI experts located in diverse locations around the world (if they chose not to advertise to speed up hiring).
1Super AGI5mo
No, of course not.

Thanks! I wish the math hadn't broken down, it makes the post harder to read...

List of known discrepancies:

  • Deepmind categorizes Cohen’s scenario as specification gaming (instead of crystallized proxies).
  • They consider Carlsmith to be about outer misalignment?

Value lock-in, persuasive AI and Clippy are on my TODO list to be added shortly. Please do tell if you have something else in mind you'd like to see in my cheat sheet!

I'm not sure why this was downvoted into oblivion, so I figured I'd give my own opinion at least:

I assume the author is an amateur writer, and wrote this for fun without much consideration for the audience of the actual subject. It's the kind of things I could have done when I entered the community.

About the content, the story is awful:
- The characters aren't credible. The AI does not match any sensible scenario, and especially not the kind of AI typically imagined for a boxing experiment. The arguments of the AI are unconvincing, as are its abilities and ... (read more)

1Super AGI5mo
Thank you for taking the time to provide such a comprehensive response.

> "It's the kind of things I could have done when I entered the community."

This is interesting. Have you written any AI-themed fiction or any piece that explores similar themes? I checked your postings here on LW but didn't come across any such examples.

> "The characters aren't credible. The AI does not match any sensible scenario, and especially not the kind of AI typically imagined for a boxing experiment."

What type of AI would you consider typically imagined for a boxing experiment?

> "The protagonist is weak; he doesn't feel much except emotions ex machina that the author inserts for the purpose of the plot. He's also extremely dumb, which breaks suspension of disbelief in this kind of community."

In response to your critique about the characters, it was a conscious decision to focus more on the concept than complex character development. I wanted to create a narrative that was easy to follow, thus allowing readers to contemplate the implications of AI alignment rather than the nuances of character behavior. The "dumb" protagonist represents an average person, somewhat uninformed about AI, emphasizing that such interactions would more likely happen with an unsuspecting individual.

> "The progression of the story is decent, if extremely stereotypical. However, there is absolutely no foreshadowing so every plot twist appears out of the blue."

Regarding the seemingly abrupt plot points and lack of foreshadowing, I chose this approach to mirror real-life experiences. In reality, foreshadowing and picking up on subtle clues are often a luxury afforded only to those who are highly familiar with the circumstances they find themselves in or who are experts in their fields. This story centers around an ordinary individual in an extraordinary situation, and thus, the absence of foreshadowing is an attempt to reflect this realism.

> "The worst part is that all the arguments somew

Sounds like you simply assumed that saying you could disgust the gatekeeper would make them believe they would be disgusted.
But the kind of reaction to disgust that could make a gatekeeper let the AI out needs to be instantiated to have impact.

Most people won't get sad just imagining that something sad could happen. (Also, duh, calling out the bluff.)

In practice, if you had spent the time to find disgusting content and share it, that would have been somewhat equivalent to torturing the gatekeeper, which in the extreme case might work on a significant fraction of the population, but it's also kind of obvious that we could prevent that.

Heh, it's better to have this than nothing at all! I'll keep hoping for it. ^-^

It sounds like an excellent foundation. 

Ideas for improvement:

  • If you have multiple explanations for a concept, then on a fraction epsilon of users, randomize the order in which they're displayed to collect statistical data about which explanation they first found enlightening; then for most users, display the explanations in their order of expected chance of being grokked.
  • Don't restrict yourself to explanations. Propose different teaching methods.
  • Include expected reading time and score for each explanation (like LW).
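To illustrate the first bullet, here's a toy sketch of the epsilon-randomization I have in mind (the explanation names and scores are made up, and the real system would estimate the scores from the logged "first explanation that helped" data):

```python
import random

def order_explanations(explanations, scores, epsilon=0.05):
    """Return explanations best-first, except for an epsilon fraction
    of users, who get a random order so data keeps being collected
    on all orderings.

    scores: maps explanation id -> estimated chance of being grokked.
    """
    ordered = sorted(explanations, key=lambda e: scores[e], reverse=True)
    if random.random() < epsilon:  # exploration branch
        random.shuffle(ordered)
    return ordered

# Hypothetical estimates for three explanations of one concept.
scores = {"visual": 0.6, "formal": 0.25, "analogy": 0.15}
order_explanations(["formal", "visual", "analogy"], scores)
```

With epsilon at 0 every user would see the current best ordering and the estimates would stop improving; the small random fraction is what keeps the statistics honest.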


  • How do you distinguish the feeling of epiphany from grokking?

These ideas seem promising! Good point, I haven't really done that here. We could differentiate by e.g. having practice problems, and people can log in to track their progress. Similar to the multi-explanations/teaching-methods setup, there could be a broad variety of example problems --> less likely someone gets lots of them right without actually understanding the concept.

Mostly confused about how the chronophone works. However strict the rules I try to imagine, the thought experiment is not that interesting.

They fired quanta of energy.

I also would be interested in learning more, even just a link would be nice.

I felt like I had a pretty good grasp on what was happening, but in the end I'm just as confused as at the beginning... '^-^

it is maximally difficult for your to untangle rules

-> it is maximally difficult for you to untangle rules

Fixed. Thanks.

Luna's mother would never sell her daughter in exchange information she could deduce for herself.

-> Luna's mother would never sell her daughter in exchange for information she could deduce for herself.

Fixed. Thanks.

One us represents all. 

-> One of us represents all.

Fixed. Thanks.
1Henry Prowbell6mo
Honestly, no plans at the moment. Writing these was a covid lockdown hobby. It's vaguely possible I'll finish it one day but I wouldn't hold your breath. Sorry.

No, Gray_Area's point (as far as I can see) was that you would only approximate the result, using cognitive heuristics, for example thinking about how an author would tell the story that starts the way your reality does.
There are other valid ways to do that. But the best one known to me is simply Bayesian inference, and keeping track of probability distributions instead of sampling randomly is not that hard, since it saves you the otherwise expensive work of adjusting for biases using ad hoc methods.
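A toy sketch of the distribution-tracking I mean, with invented numbers for a two-hypothesis case (the hypotheses and probabilities here are purely illustrative):

```python
def bayes_update(prior, likelihood):
    """Exact posterior over discrete hypotheses: multiply each prior
    probability by the likelihood of the observed evidence under that
    hypothesis, then renormalize. No random sampling needed."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

# Invented example: does my reality look like a story an author would tell?
prior = {"story-like world": 0.1, "ordinary world": 0.9}
likelihood = {"story-like world": 0.8, "ordinary world": 0.2}  # P(observation | h)
posterior = bayes_update(prior, likelihood)
```

Keeping the whole distribution like this, rather than sampling one hypothesis at random, is exactly the cheap bookkeeping I'm pointing at.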

Disclaimer: This comment was written as part of my application process to become an intern supervised by the author of this post.

Potential uses of the post:

This post is an excellent summary, and I think it has great potential for several purposes, in particular being used as part of a sequence on RLHF. It is a good introduction for many reasons:

  1. It’s very useful to have lists like those, easily accessible to serve as reminders or pointers when you discuss with other people.
  2. For aspiring RLHF understanders, it can provide minimum information to quickly priori
... (read more)
No, the joke is that he's a transhumanist and wants to live forever. If he lives forever he has no "mid-life".

I am surprised the advisors don't propose that the king follow the weighted average of decisions, rather than thinking about predictions and picking the associated decision.

This is intuitively the formal model underlying the obvious strategy of preparing for either outcome.

That is probably close to what they would suggest if this weren't mainly just a metaphor for the weird ways that I've seen people thinking about AI timelines. It might be a bit more complex than a simple weighted average because of discounting, but that would be the basic shape of the proper hedge.
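A toy version of what that hedge might look like with discounting folded in (every utility, probability, and horizon below is invented for illustration, not a real forecast):

```python
def hedge_value(p_short, u_short, u_long, discount=0.95, years_long=10):
    """Expected value of a plan given P(short timelines), where the
    long-timelines payoff only arrives years later and is discounted
    per year. This is the 'weighted average plus discounting' shape."""
    return p_short * u_short + (1 - p_short) * u_long * discount ** years_long

# Compare a plan optimized for short timelines vs a long-game plan,
# under an assumed 30% chance of short timelines.
plan_short_focus = hedge_value(0.3, u_short=10, u_long=2)
plan_long_focus = hedge_value(0.3, u_short=1, u_long=12)
```

The proper hedge is whichever mix of plans maximizes this expectation, which is why it isn't quite a simple weighted average once discounting enters.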

Why would it harm humans?
Do you think that the expected value of thinking about it is negative because of how it might lead us to overlook some forms of alignment?
