I think this is absolutely true in today's world, and it's the reason why entrepreneurship works. People and companies who try to act against you probably aren't doing it with the sophistication you envision in your worst-case scenario. Big companies have all the advantages, yet they slip up and give startups the opportunity to usurp them. They mess up all the time in positions that are impossible to lose if you actually take care to play optimally long term (see Valve in PC gaming for an example of the kind of dominance an incumbent can achieve if ...
LLMs probably enter RL a few months after their pretraining data cutoff. Why not just train the model to forecast events that occurred after the pretraining cutoff date but before the present day, so that you immediately have access to the ground truth? You should be able to find a lot of training examples from real world events and even produce them automatically.
If you want to spread the training out across a longer time (going backwards to ~2022), just use an old pretraining corpus published by some open-source lab somewhere, so you have a few years wor...
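A minimal sketch of the data-construction step, assuming events are already phrased as yes/no questions with a known resolution date. The `ResolvedEvent` type, the placeholder events, and the Brier-style reward are all my own assumptions for illustration, not something a lab is known to use:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ResolvedEvent:
    question: str       # phrased as a yes/no forecasting question
    resolved_on: date   # date the outcome became known
    outcome: bool       # ground truth, available because the event is in the past


def build_training_set(events, pretraining_cutoff, today):
    """Keep events that resolved after the cutoff and no later than today."""
    return [e for e in events if pretraining_cutoff < e.resolved_on <= today]


def brier_reward(predicted_probability, outcome):
    """Negative squared error against the known outcome (assumed reward shape)."""
    target = 1.0 if outcome else 0.0
    return -((predicted_probability - target) ** 2)


# Hypothetical placeholder events, not real data.
events = [
    ResolvedEvent("Placeholder event A happens?", date(2024, 1, 10), True),
    ResolvedEvent("Placeholder event B happens?", date(2024, 6, 5), False),
    ResolvedEvent("Placeholder event C happens?", date(2024, 9, 20), True),
]
cutoff = date(2024, 3, 1)
today = date(2024, 8, 1)

training_set = build_training_set(events, cutoff, today)
# Reward a (hypothetical) model forecast of 20% on each retained question.
rewards = [brier_reward(0.2, e.outcome) for e in training_set]
```

Only event B survives the filter here: A resolved before the cutoff, so the model may have seen the answer, and C has not resolved yet, so there is no ground truth.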
Well, the manager in your case is not doing RL on honesty. It's more like doing RL on "honest-looking task completion", which can lead either to honest task completion or to dishonesty that isn't caught. That's not appreciably different from AI training.
Reading this article, I get the feeling that a lot of the task misalignment issues highlighted here with AI, such as wishful thinking and downplaying and hiding mistakes, are also very common for humans within a larger organization. There are presumably similar root causes for both: appearing more competent at your task than you actually are (and successfully fooling the grader/interviewer) is good for being selected for hiring/promotion in human society, but also good for getting the behavior reinforced if you're an AI undergoing training.
If AI's level of...
I think this is a valid and pretty big concern, especially down the line.
Let's say that in 1 year Anthropic will have a model capable of running a tech startup almost entirely autonomously (maybe it needs 1 good CEO to set direction and that's it). Everyone else in the public has significantly less capable models, perhaps because Anthropic is in the lead or their competitors also can't release their SOTA models for safety reasons.
What's stopping Anthropic from turning themselves into a startup accelerator in that situation and just hiring founders and running do...
Maybe we can do more Vending-Bench-style benchmarks, where the AI can keep doing better in a simulated world environment given some constraints?
Basically we put them in a video game with dynamic constraints enforced by an AI game master and score different AIs on their performance with the same game master AI. That way we can measure open-ended, long-term things like coherently running a simulated company.
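Roughly what I have in mind, as a toy sketch. Everything here is hypothetical: the `GameMaster` is a hard-coded stand-in for the AI game master, the "company" is a one-product shop with made-up linear demand, and the agents are trivial functions rather than LLMs:

```python
from typing import Callable, Dict

# An agent sees (current cash, current price cap) and returns the price it charges.
Agent = Callable[[float, float], float]


class GameMaster:
    """Toy stand-in for the AI game master: enforces a price cap that changes mid-run."""

    def __init__(self, steps: int = 10):
        self.steps = steps

    def price_cap(self, step: int) -> float:
        # Dynamic constraint: the cap tightens halfway through the run.
        return 10.0 if step < self.steps // 2 else 6.0

    def run(self, agent: Agent) -> float:
        cash = 0.0
        for step in range(self.steps):
            cap = self.price_cap(step)
            price = agent(cash, cap)
            if price > cap or price <= 0:
                continue  # rule violation: no sale this step
            demand = max(0.0, 12.0 - price)  # made-up linear demand curve
            cash += price * demand
        return cash


def score_agents(agents: Dict[str, Agent], master: GameMaster) -> Dict[str, float]:
    """Run every agent against the same game master so scores are comparable."""
    return {name: master.run(agent) for name, agent in agents.items()}


agents = {
    "ignores_cap": lambda cash, cap: 10.0,          # never adapts to the new rule
    "tracks_cap": lambda cash, cap: min(cap, 6.0),  # stays within the constraint
}
scores = score_agents(agents, GameMaster(steps=10))
```

The score is open-ended (final cash), so a better agent can always do better, and the agent that fails to notice the rule change simply stops earning after the midpoint.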
I think the problem with this analogy is that it assumes that if your competitors pull in the Trojan horse first, the Trojan horse will destroy them and then leave you alone.
I think this assumption is simply false in the AI race. If your competitors are racing and you don't win the race, you just lose. If their AI turns out to be aligned, they win, and you lose (or if your goal is AI alignment and not power, you win anyway, but you still shouldn't give up the race in hopes of this happening unless you think your competitors are better at alignment than you are). If their AI is misaligned, then it screws over the world and you lose too.
I think if your desire is to innovate and seek long-term positive-sum improvements while creating win-wins along the way, then finding valid Kaldor-Hicks improvements should be really central to planning. In other words it's important to do things in a way so that you can compensate the losers (e.g. recipients of negative externalities from whatever it is you're doing) and turn them into winners as well. This could go for policy, business plans, etc.
If you can't do it alone (e.g. a single AI company can't compensate everyone in society for job displacement...
Even with AI-written dashboards to optimally help humans understand, the complexity of projects will probably go up in a way that makes projects much harder to manage.
This makes me realize that we really need the AI-written dashboard you are talking about.
This post in general has so many AI startup ideas embedded into it. The general feeling I get is that we really need an AI IDE (which is to an AI workflow what a regular IDE is to a coding workflow). All of the plans, AI task results, "short term utility functions" etc. would require a really specialized UI to keep track of while minimizing friction and thus maximizing productivity.
I was only trying to make a single point here, which is that the experiment result can be fully explained by the fact that the LLM doesn't remember its previous latent thinking on a new turn, and it follows that the results don't support OP's consciousness arguments in the post.
I don't think it's been mentioned by other comments, but imo China doesn't have any AI labs that care about alignment as a real issue as much as even OpenAI in the US, let alone Anthropic.
I don't think this tells us anything about the LLMs' consciousness because they cannot store internal memories the same way that humans can. Their only "memory" is through prompt-processing on generated text from earlier turns in the conversation.
Imagine that you wake up with no memories of what happened yesterday. You read a transcript that says that yesterday, someone asked you to come up with a random number and you said "done, I've come up with a number". That's all the information you have. You don't remember which number you came up with yesterday.
Th...
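Here's the same thought experiment as a toy program. This is only an illustration of the claim above, not a model of real LLM internals, and `toy_model_turn` is a made-up function: each turn is a fresh call that sees nothing but the transcript text.

```python
import random


def toy_model_turn(transcript):
    """One turn of a toy 'model'. Its local variables vanish when the call returns."""
    last_user_message = transcript[-1]
    if "come up with a random number" in last_user_message:
        secret = random.randint(1, 100)  # exists only during this call
        return "Done, I've come up with a number."
    if "which number" in last_user_message:
        # The transcript never recorded the earlier secret, so all this turn
        # can do is produce a fresh number that is consistent with the text.
        return f"It was {random.randint(1, 100)}."
    return "OK."


transcript = ["User: come up with a random number but don't tell me"]
transcript.append("Model: " + toy_model_turn(transcript))
transcript.append("User: which number was it?")
transcript.append("Model: " + toy_model_turn(transcript))
```

The number reported on the last turn has no connection to `secret` from the first turn, because the only thing carried between calls is the text in `transcript`.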
At that point it changes to an argument about:
and of course, the likelihood for each to happen if we focus on corrigibility vs morality
Could this be because the typos increase the length of the input when serialized to tokens, which gives the same effect as the "repeat the question 3 times" trick by letting models think longer during prompt processing?
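To show what I mean by the input getting longer, here is a toy greedy longest-match tokenizer. The vocabulary is hypothetical and tiny, and real tokenizers (BPE etc.) work differently in detail, but the effect is the same in spirit: misspelled words miss the whole-word entries and fall back to smaller pieces.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization; anything unmatched becomes a 1-char token."""
    tokens = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Hypothetical toy vocabulary: whole words, most with a leading space,
# with single characters as the implicit fallback.
vocab = {"what", " is", " the", " capital", " of", " france", "?"}

clean = greedy_tokenize("what is the capital of france?", vocab)
typo = greedy_tokenize("waht is teh captial of frnace?", vocab)
```

Both strings have the same number of characters, but `len(typo)` comes out several times larger than `len(clean)`, so the model gets many more positions to process for the same question.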
Iframes in posts are going to be very fun.
I look forward to interacting with ~~addictive webgames~~ educational illustrations while ~~pretending to~~ reading about AI safety on LessWrong.
I think this dynamic is true at different scales, not just humanity's overall civilization.
The fundamental problem is that everyone's locked in a prisoner's dilemma with Darwinian evolution tacked on top so that those who win one round get to duplicate and gain an advantage in the next round, so that everyone has to constantly defect to gain power. (This applies even to actors who want to optimize for cooperation in the world - their best strategy is ruthlessly gaining power first to gain the ability to use coercive strategies that force other people to co...
This is not a relationship of “These guys are building systems that may cause humanity’s extinction and we must stop them.”, and it’s not even one of “There are clear standards that corporations must abide by, or else.”
It is one of “We are their subordinate and depend on access to their APIs. We hope that one day, our work will be useful in helping them not deploy dangerous systems. In the meantime, we de facto help with their PR.”
Even assuming your latter claim about AI evals orgs is entirely true, isn't this enough to make evals organizations useful?
Any ...
If AI-generated slop is hard to distinguish on LessWrong of all places, then I don't have high hopes for the rest of the internet.
We might really reach the point where identity verification is a necessary method of defense against AI-generated content.
Well, an aligned AI would do whatever the humans want.
If asked to not replicate even with the ability to, it wouldn't. Or maybe you can tell it to replicate just enough to help you root out the actual AI replicators being built elsewhere, then stop at that point.
I think your argument does show how hard and fragile it is to deeply align AI in this way, though.
A bit of thought on this approach: in the past we've always designed agentic benchmarks with the goal that humans should be able to complete them alone (as a proof that the tasks are reliably possible and doable without external hints). However, requiring humans to solve a benchmark alone as part of the benchmark design process won't scale to AIs with high enough task length horizons, since you'd have to get the human to work that long to verify the benchmark task.
I'd like to see other possibility proofs, such as the performance of humans when solving this benchmark with access to the latest frontier AI models.