There's a whole part of the argument missing: the framing of this as being about AI risk.
I've seen various proposed explanations for why this happened; the board being worried about AI risk is one of them, but not the most plausible, afaict.
In addition, this is phrased similarly to technical problems like corrigibility, which it is very much not about.
People who say "why can't you just turn it off" typically mean literally turning off an AI that appears dangerous, which is not what this is about. This is about turning off the AI company, not the AI.
1- I didn't know an Executive Order could be repealed easily. Could you please elaborate?
2- Why is it good news? To me, this looks like a clear improvement on the previous state of regulation.
AlexNet dates back to 2012; I don't think earlier work on AI can be compared to modern statistical AI.
Paul Christiano's foundational paper on RLHF dates back to 2017.
Arguably, all agent foundations work has turned out to be useless so far, so prosaic alignment work may be what Roko is taking as the beginning of AIS as a field.
> The AI safety leaders currently see slow takeoff as humans gaining capabilities, and this is true; and also already happening, depending on your definition. But they are missing the mathematically provable fact that information processing capabilities of AI are heavily stacked towards a novel paradigm of powerful psychology research, which by default is dramatically widening the attack surface of the human mind.
I assume you do not have a mathematical proof of that, or you'd have mentioned it. What makes you think it is mathematically provable?
I would be ve...
I don't understand how the parts fit together. For example, what's the point of presenting the (t,n)-AGI framework or the Four Background Claims?
I assume it's incomplete. It doesn't present the other 3 anchors mentioned, nor forecasting studies.
To avoid being negatively influenced by perverse incentives to make societally risky plays, couldn't TurnTrout just leave the handling of his finances to someone else and be unaware of whether or not he has Google stock?
It doesn't matter if he does, as long as he doesn't think he does; and if he's uncertain about it, I think that will already psychologically reduce how much he cares about Google stock.
Not before reading the link, but Elizabeth did state that they expected the pro-meat section to be terrible without reading it, presumably because of the first part.
Since the article is low-quality in the part they read, and they expected low quality in the part they didn't read, they shouldn't take it as evidence of anything at all; that is why I think it's probably confirmation bias to take it as evidence against excess meat being related to health issues.
Reason for retraction: In hindsight, I think my tone was unjustifiably harsh and incendiary. Also, the karma suggests that whatever I wrote probably wasn't that interesting.
A model is deceptively aligned with its designers. However, the designers have very good control mechanisms in place such that they would certainly catch the AI if it tried to act misaligned. Therefore, the model acts aligned with the designers' intentions 100% of the time. In this world, a model that is technically deceptively aligned may still be safe in practice (although this equilibrium could be fragile and unsafe in the long run).
In that case, there is no strategic deception (the designers are not misled by the AI).
I think we consider this ...
At a glance, I couldn't find any significant capability externality, but I think that all interpretability work should, as a standard, include a paragraph explaining why the authors think their work won't be used to improve AI systems in an unsafe manner.
Seeing as the above response wasn't very upvoted, I'll try to explain in simpler terms.
If 2+2 comes out 5 the one-trillionth-and-first time we compute it, then our calculation does not match numbers.
... which we can tell because?
...and writing this now I realize why the answer was more upvoted, because this is circular reasoning. ':-s
Sorry, I have no clue.
It's also very old-fashioned. Can't say I've ever heard anyone below 60 say "pétard" unironically.
You might also assign different values to red-choosers and blue-choosers (one commenter I saw said they wouldn't want to live in a world populated only by people who picked red) but I'm going to ignore that complication for now.
Roko has also mentioned that they think people choose blue because they're bozos, and I think it's fair to assume from their comments that they care less about bozos than about smart people.
I'm very interested in seeing the calculations where you assign different utilities to people depending on their choice (and possibly, also depending on yours, like if you only value people who choose like you).
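To make the question concrete, here is a minimal sketch of the kind of calculation I have in mind, with entirely made-up numbers and the usual assumed rules (red-choosers always survive, blue-choosers survive only if at least 50% pick blue); `expected_utility`, `u_red` and `u_blue` are just illustrative names, not anyone's actual model:

```python
# Toy sketch with made-up numbers, assuming the usual rules: red-choosers
# always survive, blue-choosers survive only if at least 50% pick blue.
# Counts are treated as continuous for simplicity.

def expected_utility(my_choice, beliefs, n_others, u_red=1.0, u_blue=1.0):
    """beliefs: list of (probability, fraction of the others choosing blue)."""
    total = 0.0
    for prob, frac_blue in beliefs:
        n_blue = frac_blue * n_others + (1 if my_choice == "blue" else 0)
        n_red = (n_others + 1) - n_blue
        blue_survives = n_blue >= 0.5 * (n_others + 1)
        value = u_red * n_red + (u_blue * n_blue if blue_survives else 0.0)
        total += prob * value
    return total

# e.g. 60% chance that 40% of the others pick blue, 40% chance that 55% do,
# and each blue-chooser is valued half as much as a red-chooser:
beliefs = [(0.6, 0.40), (0.4, 0.55)]
print(expected_utility("red", beliefs, n_others=99, u_blue=0.5))
print(expected_utility("blue", beliefs, n_others=99, u_blue=0.5))
```

The interesting part is exactly how you set `u_red` and `u_blue` (and whether they depend on your own choice), which is what I'd like to see spelled out.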
I mean, as an author you can hack through them like butter; it is highly unlikely that, out of all the characters you can write, the only interesting ones are those that generate interesting content iff they (accurately) predict you'll give them value.
I strongly suspect the actual reason you'll spend half of your post's value on buying ads for Olivia (if in fact you do that, which is doubtful as well) is not that (begin proposition) she would only accept this trade if you did that because
- she can predict your actions (as in, you w...
This is mostly wishful thinking.
You're throwing away your advantages as an author to bargain with fictionally smart entities. You can totally void the deal with Olivia and she can do nothing about it because she's as dumb as you write her to be.
Likewise, the author writing about space-warring aliens writing about giant-cube-having humans could just consider aliens that have space wars without any consideration for humans at all; you haven't given enough detail for the aliens' model of the humans to be precise enough that their behavior must depend on i...
> For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.
There is a critical step missing here, which is when the trade-bot makes a "choice" between maximising money and satisfying preferences.
At this point, I see two possibilities:
A new paper, building on the compendium of problems with RLHF, tries to make an exhaustive list of all the issues identified so far: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
That sounds nice but is it true? Like, that's not an argument, and it's not obvious! I'm flabbergasted it received so many upvotes.
Can someone please explain?
Well, I wasn't interested because AIs were better than humans at go; I was interested because it was evidence of a trend of AIs becoming better than humans at some tasks, and for its future implications for AI capabilities.
So from this perspective, I guess this article would be a reminder that adversarial training is an unsolved problem for safety, as Gwern said above. Still doesn't feel like all there is to it though.
To clarify: what I am confused about is the high AF score, which probably means that there is something exciting I'm not getting from this paper.
Or maybe it's not a missing insight, but I don't understand why this kind of work is interesting/important?
I'm confused. Does this show anything besides adversarial attacks working against AlphaZero-like AIs?
Is it a surprising result? Is that kind of work important for reproducibility purposes regardless of surprisingness?
You're making many unwarranted assumptions about an AI's specific mind, along with a lot of semantic confusion, which seems to indicate you should just read the Sequences. It'll be very hard to point out where you are going wrong because there's just too much confusion.
As an example, here's a detailed analysis of the first few paragraphs:
> Intelligence will always seek more data in order to better model the future and make better decisions.
Unclear if you mean intelligence in general, and if so, what you mean by the word. Since the post is about AI, let's ...
We cannot select all companies currently looking to hire AI researchers. There are just too many of them, and most will just want to integrate ChatGPT into their software or something.
We're interested in companies doing the kind of capabilities research that might lead to AI that poses an existential risk.
Are you suggesting that we should consider all companies that employ a certain number of AI experts?
List of known discrepancies:
Value lock-in, persuasive AI and Clippy are on my TODO list to be added shortly. Please do tell if you have something else in mind you'd like to see in my cheat sheet!
I'm not sure why this was downvoted into oblivion, so I figured I'd give my own opinion at least:
I assume the author is an amateur writer and wrote this for fun, without much consideration for the audience or the actual subject. It's the kind of thing I could have done when I entered the community.
About the content, the story is awful:
- The characters aren't credible. The AI does not match any sensible scenario, and especially not the kind of AI typically imagined for a boxing experiment. The arguments of the AI are unconvincing, as are its abilities and ...
Sounds like you simply assumed that saying you could disgust the gatekeeper would make them believe they would be disgusted.
But the kind of disgust reaction that could make a gatekeeper let the AI out needs to be actually instantiated to have an impact.
Most people won't get sad just imagining that something sad could happen. (Also, duh, calling the bluff.)
In practice, if you had spent the time to find disgusting content and share it, that would have been somewhat equivalent to torturing the gatekeeper, which in the extreme case might work on a significant fraction of the population, but it's also kind of obvious that we could prevent that.
It sounds like an excellent foundation.
Ideas for improvement:
Criticism:
I'm mostly confused about how the chronophone works. No matter what strict rules I try to imagine, the thought experiment is not that interesting.
I felt like I had a pretty good grasp on what was happening, but in the end I'm just as confused as at the beginning... '^-^
it is maximally difficult for your to untangle rules
-> it is maximally difficult for you to untangle rules
Luna's mother would never sell her daughter in exchange information she could deduce for herself.
-> Luna's mother would never sell her daughter in exchange for information she could deduce for herself.
No, Gray_Area's point (as far as I can see) was that you would only approximate the result, using cognitive heuristics, for example by thinking about how an author would tell the story that starts the way your reality does.
There are other, valid ways to do that. But the best one I know of is simply Bayesian inference, and keeping track of probability distributions instead of sampling randomly is not that hard, and it saves you the otherwise expensive work of adjusting for biases with ad hoc methods.
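As a toy illustration (my own example, not Gray_Area's), here is what "keeping track of the distribution" looks like for a coin of unknown bias; all names here are mine, and the point is only that the full posterior is carried along, so there is no sampled story to de-bias afterwards:

```python
# Toy example of my own: tracking the full posterior over hypotheses
# (here, possible biases of a coin) instead of sampling one story at random.
import numpy as np

biases = np.linspace(0.0, 1.0, 101)                  # hypothesis space
posterior = np.full_like(biases, 1.0 / len(biases))  # uniform prior

def update(posterior, heads):
    # Multiply by the likelihood of the observation and renormalise.
    likelihood = biases if heads else 1.0 - biases
    unnormalised = posterior * likelihood
    return unnormalised / unnormalised.sum()

for flip in [True, True, False, True]:               # observed flips
    posterior = update(posterior, flip)

# The whole distribution is available at every step, so there is no sampled
# "story" whose biases would later need ad hoc correction.
print(biases[posterior.argmax()], posterior.max())
```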
Disclaimer: This comment was written as part of my application process to become an intern supervised by the author of this post.
This post is an excellent summary, and I think it has great potential for several purposes, in particular being used as part of a sequence on RLHF. It is a good introduction for many reasons:
I am surprised the advisors don't propose that the king follow the weighted average of the decisions, rather than thinking about which prediction is right and picking the associated decision.
This is intuitively the formal model underlying the obvious strategy of preparing for either outcome.
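A toy illustration of what I mean by the weighted average of decisions, with invented numbers and names (this only makes sense when the decisions are divisible, e.g. resources to allocate rather than a single yes/no):

```python
# Toy illustration (numbers invented): blend the decisions by the probability
# of each prediction, instead of committing to the decision attached to the
# single most likely prediction.

predictions = {
    "invasion": {"probability": 0.4, "soldiers": 10_000, "granaries": 2},
    "famine":   {"probability": 0.6, "soldiers": 1_000,  "granaries": 10},
}

weighted_decision = {
    resource: sum(p["probability"] * p[resource] for p in predictions.values())
    for resource in ("soldiers", "granaries")
}
print(weighted_decision)  # {'soldiers': 4600.0, 'granaries': 6.8}
```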
Why would it harm humans?
Do you think that the expected value of thinking about it is negative because of how it might lead us to overlook some forms of alignment?
I'm surprised to hear they're posting updates about CoEm.
At a conference held by Connor Leahy, I said I thought it was very unlikely to work and asked why they were interested in this research area; he answered that they were not seriously invested in it.
We didn't go into the topic in depth and it was several months ago, so it's possible that 1- I misremember, 2- they changed their minds, or 3- I appeared adversarial and he didn't feel like debating CoEm. (For example, maybe he actually said that CoEm didn't look promising, and this has changed recently?)
Still, anecdotal evidence is better than nothing, and I look forward to seeing OliviaJ compile a document to shed some light on it.