hillz
Machine-learner, meat-learner, research scientist, AI Safety thinker.
Model trainer, skeptical adorer of statistics.
the most persuasive lie is the one you believe yourself
If someone really believes it, then I don't think they're operating in "bad faith". If the hidden motive is hidden to the speaker, that hiding doesn't come with intent.
It doesn't matter whether you said it was red because you were consciously lying or because you're wearing rose-colored glasses
It definitely matters. It completely changes how you should be trying to convince that person or behave around them.
It's different to believe a dumb argument than to intentionally lie, and honestly, humans are pretty social and honest. We mostly operate in good faith.
Agreed. LLMs will make mass surveillance (literature, but also phone calls, e-mails, etc) possible for the first time ever. And mass simulation of false public beliefs (fake comments online, etc). And yet Meta still thinks it's cool to open source all of this.
It's quite concerning. Given that we can't really roll back ML progress... Best case is probably just to make well-designed encryption the standard. And vote/demonstrate where you can, of course.
I suppose one thing you could do here is pretend you can fit infinite rounds of the game into a finite time. Then Linda has a choice to make: she can either maximize expected wealth at t_n for all finite n, or she can maximize expected wealth at t_ω, the timestep immediately after all finite timesteps. We can wave our hands a lot and say that making her own bets would do the former and making Logan's bets would do the latter, though I don't endorse the way we're treating infinities here.
If one strategy is best for , it's still...
Yes, losing worlds also branch, of course. But the one world where she has won wins her $2**n, and that world exists with probability 0.6**n.
So her EV is always ($2**n)*(0.6**n), which is a larger EV (with any n) than a strategy where she doesn't bet everything every single time. I argue that even as n goes to infinity, and even as probability approaches one that she has lost everything, it's still rational for her to have that strategy because the $2**n that she won in that one world is so massive that it balances out her EV. Some infinities are much larg...
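The arithmetic in this comment can be checked with a minimal sketch. It assumes a starting wealth of 1 and a 60% chance of doubling the stake on each bet; the fixed-fraction bettor and all function names are hypothetical, added only for comparison, and are not from the thread.

```python
# Sketch of the 60/40 double-or-nothing game discussed above.
# Assumptions (not from the thread): starting wealth of 1, and a
# hypothetical fixed-fraction bettor used purely for comparison.

P_WIN = 0.6  # probability of winning a single bet


def all_in_expected_wealth(n: int, start: float = 1.0) -> float:
    """Expected wealth after n all-in bets: start * 2**n * 0.6**n."""
    return start * (2 ** n) * (P_WIN ** n)


def all_in_survival_probability(n: int) -> float:
    """Probability of winning all n bets, the only non-zero outcome."""
    return P_WIN ** n


def fractional_expected_wealth(n: int, fraction: float, start: float = 1.0) -> float:
    """Expected wealth after n bets when staking a fixed fraction each round."""
    # Each round multiplies wealth by (1 + fraction) with probability 0.6
    # and by (1 - fraction) with probability 0.4; rounds are independent,
    # so the expectation of the product is the product of expectations.
    per_round = P_WIN * (1 + fraction) + (1 - P_WIN) * (1 - fraction)
    return start * per_round ** n


if __name__ == "__main__":
    for n in (1, 10, 50):
        print(
            n,
            all_in_expected_wealth(n),
            all_in_survival_probability(n),
            fractional_expected_wealth(n, 0.2),
        )
```

Under these assumptions the all-in expected wealth grows like 1.2**n while the chance of holding any money at all shrinks like 0.6**n, which is the tension the comment describes.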
it seems like the vast majority of people don't make their lives primarily about delicious food
That's true. There are built-in decreasing marginal returns to eating massive quantities of delicious food (you get full), but we don't see a huge number of - for example - bulimics who are bulimic for the core purpose of being able to eat more.
However, I'd mention that yummy food is only one of many things that our brains are hard-wired to mesa-optimize for. Social acceptance and social status (particularly within the circles we care about, i.e. usually the circ...
"I'll give you £1 now and you give me £2 in a week". Will she accept?
In the universe where she's allowed to make the 60/40 doubled bet at least once a week, it seems like she'd always say yes? I'm not seeing the universe in which she'd say no, unless she's using a non-zero discount rate that wasn't discussed here.
I'm not sure I've ever seen a treatment of utility functions that deals with this problem?
Isn't this just discount rates?
the way we're treating infinities here
Yeah, that seems key. Even if the probability that Linda will eventually get 0 money approaches 1, that small slice of probability in the universe where she has always won is approaching an infinity far larger than Logan's infinity as the number of games approaches infinity. Some infinities are bigger than others. Linear utility functions and discount rates of zero necessarily deal with lots of infinities, especially in simplified scenarios.
Linda can always argue that for every universe where she lost everything, there are more (6 vs 4) universes where her winnings were double what they would have been had she not taken that bet.
There’s no escaping it: After enough backup steps, you’re traveling across the world to do cocaine.
But obviously these conditions aren’t true in the real world.
I think they are a little? Some people do travel to other countries for easier and better drug access. And some people become total drug addicts (perhaps arguably by miscalculating their long-term reward consequences and having too high a discount rate, oops), while others do a light or medium amount of drugs longer-term.
Lots of people also don't do this, but there's a huge amount of info...
Why, exactly, would the AI seize[6] the button?
If it is an advanced AI, it may have learned to prefer more generalizable approaches and strategies. Perhaps it has learned the following features:
If you have trained it to take out the trash and clean windows, it will have been (mechanistically) trained to favor situations in which all three of these f...
Reward has the mechanistic effect of chiseling cognition into the agent's network.
Absolutely. Though in the next sentence:
Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.
I'd mention two things here:
1) The more complex and advanced a model is, the more likely it is [I think] to learn a mesa-optimization goal that is extremely similar to the actual reward a model was trained on (because it's basically the most generalizable mesa-goal to be learned, w.r.t. training data).
2) Rein...
This is a good point that I think people often forget (particularly in AI Safety) but I think it’s also misleading in its own way.
It’s true that models don’t have this direct reward where that’s all they care about, and that instead their behavior (incl. preferences and goals) is ‘selected for’ (via SGD, not evolution, but still) during training. But a key point which this post doesn't really focus on is this line “Consider a hypothetical model that chooses actions by optimizing towards some internal goal which is highly correlated with the reward”.
Basical...
AlphaStar, AlphaGo and OpenAI Five provide some evidence that this takeoff period will be short: after a long development period, each of them was able to improve rapidly from top amateur level to superhuman performance.
It seems like all of the very large advancements in AI have been in areas where we either 1) can accurately simulate an environment and final reward (like chess or a video game) in order to generate massive training data, or 2) have massive data we can use for training (e.g. the internet for GPT).
For some things, like communicating an...
And for an AGI to trust that its goals will remain the same under retraining will likely require it to solve many of the same problems that the field of AGI safety is currently tackling - which should make us more optimistic that the rest of the world could solve those problems before a misaligned AGI undergoes recursive self-improvement.
Even if you have an AGI that can produce human-level performance on a wide variety of tasks, that won't mean that the AGI will 1) feel the need to trust that its goals will remain the same under retraining if you don't spe...
Winner = first correct solution, or winner = best / highest-quality solution over what time period?