In problems like Parfit's hitchhiker, I'd like to be the kind of agent who pays the driver. But only if the driver asks for a reasonable sum of money. Paying no matter what would create a strong adversarial pressure to ask me for everything I have.
In general, I'd like to be the kind of person who keeps promises they make. But if you make me swear, at gunpoint, that I'll murder some innocent people later, I'll say whatever gets me out of the situation alive, and then break the promise.
And don't get me started on the tens of pages of terms and conditions for online services. I just click "agree" and don't care one bit about what those documents say. I'm just going to do the reasonable thing, and if that's not good enough, too bad. One could say I've sworn an oath to follow them, but I simply don't think that framing is appropriate for routine activities.
This gets more complicated with, say, NDAs. I generally try to follow both the letter and the spirit of such contracts. But sometimes they're just written in an unreasonable way and there's too much money on the table to ignore it. In those cases, I work in an adversarial mode where I follow the letter and just the letter, as far as it can be court-enforced, and not much more. This rarely occurs outside cases with a huge power differential anyway. If I'm given the opportunity to actually negotiate the contents, it's pretty likely there's not much need for the adversarial mode.
I'm trying to be rather meta-honest about this. With legible amounts of illegibility. I'd like to formalize this better.
If you never miss a commuter train, you're always at the station too early. If you never miss a holiday flight, that's fine.
If you've never failed a job interview, you could get paid much more. If you never get fired, you might be leaving something on the table, but I wouldn't complain.
If your jokes never offend anyone, you're not going to be a standup comedian. If your jokes always offend someone, consider that you might not be that funny after all.
A pessimist won't be disappointed, but an optimist might be happier. The pessimist will be right a lot more often, though.
If your business never encounters fraud, you could be saving money on security measures. If everyone knew exactly how likely it is to get caught, you'd have to spend a lot more. Or perhaps a lot less. Maybe there's some cheap signaling you could do?
If you have a low risk tolerance, you're leaving a lot of value on the table. If you're insensitive or oblivious to the downsides, you'll lose a lot more.
I think about this in my head as "in practice you converge faster to the optimum if you sometimes overshoot, so do that when overshooting is affordable", with the counterexample that learning to drive shouldn't involve accidentally killing a couple of people.
1: I see the main point of OP as a variance-expectation trade-off, where variance is bad when you're risk-averse, e.g. when bad outcomes are much worse than good outcomes are good. Perhaps you meant this - what you said reads like you may have meant that the process of overshooting teaches you new stuff.
2: When learning to park in an empty parking lot, I realized I was consistently turning too early, so I decided to aim far enough that I'd expect to overshoot about as often, and by as much, as I undershot; this suddenly made me much better and got me to learn the correct turning time faster. Notably, there was no risk of hitting someone if I overshot to the right instead of to the left.
I haven't fleshed out my idea clearly. I'm saying something like "In asymmetric scenarios, the more costly failures are, the harder it is to reach the optimum (for a given level of risk-averseness)" + "In hindsight, most people will think they were too risk-averse for most things". Upon reflection, it isn't centrally relevant to what OP is saying.
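The parking example suggests a toy model. Here's a minimal sketch (the setup and every number in it are my own invention, not the commenter's): suppose you're learning an unknown threshold, and each attempt only tells you "too early" or "too late". A learner who refuses to ever overshoot creeps toward the target linearly, while one who deliberately brackets it converges exponentially faster.

```python
import random

def cautious(t_star, eps):
    """Never risk overshooting: creep up from below in small steps."""
    t, trials = 0.0, 0
    while t_star - t > eps:
        t += eps  # a step small enough that we never blow past t_star
        trials += 1
    return trials

def bracketing(t_star, eps):
    """Deliberately overshoot about half the time: bisect an interval."""
    lo, hi, trials = 0.0, 1.0, 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        trials += 1
        if mid < t_star:  # feedback: "too early"
            lo = mid
        else:             # feedback: "too late", i.e. an overshoot
            hi = mid
    return trials

random.seed(0)
t_star = random.uniform(0.2, 0.8)  # the unknown correct turning point
print("cautious trials:  ", cautious(t_star, 0.01))    # grows like 1/eps
print("bracketing trials:", bracketing(t_star, 0.01))  # grows like log(1/eps)
```

The caveat from the thread still applies: this only helps where an overshoot costs about as much as an undershoot, not where it kills someone.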
While reading the OpenAI Operator System Card, the following paragraph on page 5 seemed a bit weird:
We found it fruitful to think in terms of misaligned actors, where:
- the user might be misaligned (the user asks for a harmful task),
- the model might be misaligned (the model makes a harmful mistake), or
- the website might be misaligned (the website is adversarial in some way).
Interesting use of language here. I can understand calling the user or website misaligned, in the sense of alignment relative to laws or OpenAI's goals. But why call a model misaligned when it makes a mistake? To me, misalignment would mean doing that on purpose.
Later, the same phenomenon is described like this:
The second category of harm is if the model mistakenly takes some action misaligned with the user’s intent, and that action causes some harm to the user or others.
Is this yet another attempt to erode the meaning of "alignment"?
My bet is that this isn't an attempt to erode the term, but instead reflects a view that lumping intentional bad actions together with mistakes is a reasonable starting point for building safeguards; the report then fails to distinguish between the two due to a communication failure. It could also just be a more generic communication failure.
(I don't know if I agree that lumping these together is a good starting point for experimenting with safeguards, but it doesn't seem crazy.)
Communication is indeed hard, and it's certainly possible that this isn't intentional. On the other hand, making mistakes is quite suspicious when they're also useful for your agenda. But I agree that we probably shouldn't read too much into it. The system card doesn't even mention the possibility of the model acting maliciously, so maybe that's simply not in scope for it?
"I am nice because it feels good to be nice. Don't you have that?"
Not really, no. Or I mean sure, I sometimes feel so, but that's not the reason why I'm nice.
"What is the reason, then?"
I'm nice because it's instrumentally useful. Win-win situations are good. It doesn't cost me much to be nice. Even when the other person is not nice, revenge is a dish best served cold. Or not at all: not-nice people tend to be miserable enough that it constitutes an acausal punishment by itself. And in any case, the game theory math tends to show that cooperating in iterated games is usually a good default.
"Sounds like a lot of work to think through that every time?"
Not really. It's not like I have to think through all that in every situation. I just feel good being nice. But sometimes I reflect on what happened and realize that niceness wasn't a good policy there. Then I can decide that the feeling wasn't adequate, and figure out how to nudge myself away from that the next time such a situation happens.
"You're pretty detached from your feelings, huh?"
I do have a rather mechanistic perception of humans, especially myself.
"Why is that?"
What I was doing previously did not work. This works better.
"Isn't it a bit sad and cynical to have to go through that kind of thinking?"
No! It's extremely beautiful how the same niceness that comes to some people by instinct can also be derived from game theory. How even someone who doesn't internally care one bit about how you feel, beyond the instrumental benefits, can still be nice to you, not to mislead, but to trade. Sure, the wholesome appreciation is now oriented a bit more toward the dynamics than toward the agents. But I don't see why it would be sad?
"We really have quite different kinds of minds, don't we?"
Apparently.
This seems to sketch a mind design where the locus of terminal values is in emotions, so that non-emotional justifications are naturally instrumental. But terminal justifications/values can also be non-emotional, even if there's some overlap and path-dependence: emotional causal reasons for how the non-emotional terminal values came to be.
I'm mostly just trying to point at the fact that your first ethical impressions of something are not always the ones you'd reflectively choose to keep. I'm also trying to explain how I do moral reflection. Something almost like the discussion above occurred to me recently, and the other person seemed to hold their view strongly.
If you discovered that there is some better / more correct formulation of game theory applicable to your situation, one that recommends backstabbing everyone in such and such specific ways, dealing great harm for a small benefit to you, would you switch to acting on it?
I don't see how this is relevant. In the real world, all games are iterated games, and doing things like that will hurt your reputation gravely. Also, like, of course I would, I'd be a monster not to.
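To gesture at the game-theory math mentioned above: a minimal iterated prisoner's dilemma with the standard payoff matrix (the tournament setup here is my own toy, not anything from the discussion):

```python
# Standard prisoner's dilemma payoffs: (my score, their score) per round.
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def tit_for_tat(my_hist, their_hist):
    # Cooperate first, then mirror the opponent's last move.
    return their_hist[-1] if their_hist else 'C'

def always_defect(my_hist, their_hist):
    return 'D'

def play(s1, s2, rounds=100):
    h1, h2, score1, score2 = [], [], 0, 0
    for _ in range(rounds):
        a, b = s1(h1, h2), s2(h2, h1)
        p1, p2 = PAYOFF[(a, b)]
        h1.append(a); h2.append(b)
        score1 += p1; score2 += p2
    return score1, score2

print(play(tit_for_tat, tit_for_tat))    # (300, 300): mutual cooperation pays
print(play(always_defect, tit_for_tat))  # (104, 99): defection "wins", poorly
```

The defector comes out ahead head-to-head, but both end up far below a pair of cooperators, which is the usual argument for niceness as a default policy in a world of repeated interactions.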
The primary value of the Effective Altruism community comes from providing a social group where incentives on charity spending are better aligned with utilitarianism. Information sharing is secondary. This also explains why people like to attend many EA events: even though it doesn't make much sense for actually doing good, it provides the social reward for it. This dynamic is undervalued in impact estimates, and organizing more community-building fun would be quite valuable.
(loosely held opinion) (motivated reasoning warning: I mostly care about the fun stuff anyway)
Yes, social incentives are important. But it is also important that people donate to actually effective charities... otherwise they could get the same (maybe even better!) social rewards for donating to a local church.
Given that social rewards are usually only very loosely correlated with how good something is, it is great to have a community that aligns them better. But it is easy to goodhart these things. (For example, by visiting EA events but actually not donating... maybe with the excuse that "I will donate later... much later...".)
A flight with a return ticket is often only 20-50% more expensive than a one-way ticket. Sometimes the return ticket is cheaper than the one-way! Since profit margins in air travel are in the low single digits, and providing the flight doesn't get much cheaper by having the same person fly back later, something interesting must be going on. A similar thing sometimes occurs with transfers, where a flight sequence A-B-C is cheaper than just B-C. You're not allowed to just buy that and then fly B-C; they'll cancel your later legs if you miss the first one.
At least partially, it's price discrimination. The most price-sensitive customers fly round trips, e.g. for vacations, and they can typically be quite flexible about both timing and destination. This is also part of the reason why you can sometimes get cheap flights if you book well in advance.
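To make that concrete, here's toy revenue arithmetic (every number is invented): with two customer segments, a fare structure that lets tourists self-select into round trips beats any single uniform price, which is how a one-way leg can end up costing more than a whole round trip.

```python
# Two segments wanting to fly A -> B and back (all numbers invented).
# Tourists are price-sensitive and happy to commit to a round trip;
# business travelers pay more but need one-way flexibility, so they
# won't buy a restrictive round-trip fare.
tourists_n, tourists_wtp = 100, 300   # willingness to pay for a round trip
business_n, business_wtp = 20, 1000   # willingness to pay, bought as one-ways

# Uniform round-trip pricing: cheap (everyone buys) or expensive (only business).
revenue_cheap = tourists_wtp * (tourists_n + business_n)  # 36,000
revenue_pricey = business_wtp * business_n                # 20,000

# Discriminate: cheap round trips for tourists, expensive one-way legs that
# only inflexible business travelers are willing to buy.
revenue_split = tourists_wtp * tourists_n + business_wtp * business_n  # 50,000

print(revenue_cheap, revenue_pricey, revenue_split)
```

The A-B-C-cheaper-than-B-C case is plausibly the same screening logic applied to routing rather than timing.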
I'm somewhat price-sensitive and really like one-way tickets. My vacations sometimes include me just deciding one day that I've had enough and flying back home the same evening. It's very liberating to not have fixed plans.
There are ways to game the system. As is almost always the case in the service industry, they're Out to Get You, and gaming the system requires Getting Ready. I've sometimes spent more time researching flights than actually flying. This would be pretty irrational, except that it's a nice game that I enjoy. Sometimes I overdo it. Good habits die hard.
Concrete tips:
Someone wrote a "contra" post for my post! I'm a real rationalist blogger now! At least until I start thinking I need to achieve some higher goal like writing something actually good. But I sure will ride this high for the next week or so.
In other news, I attended, and perhaps slightly organized, a small 1-day LWCW-inspired unconference in Espoo, Finland. I was, as usual, facilitating circling and hotseat. Other interesting stuff occurred too. The experience for me was quasi-transcendental, personality-wise. Or perhaps this simply continues the fake enlightenment arc I've been having across the past week or two. In any case, this is the stuff I crave.
On an unrelated note, optimization is the process of extracting fun from something. Or perhaps fun is the process of optimizing it out of the world. "All models are wrong, some are useful", and this one is hopefully useless, and thus a great source of fun until it gets useful.
You can just smile. It makes you feel happier. You don't need a reason. You don't need to feel anything that would make you smile. Simply forcing your face into having a smile does the trick.
Some days I don't feel like smiling. I probably still could. But it's a bit boring to be evenly happy. Feeling happy isn't my end goal in life. Sometimes I want to get something done instead.
There's an interesting dual asymmetry in cybersecurity: the defender needs to make only a single mistake to lose, and attackers can observe many targets, waiting for such mistakes. Then again, if the defender makes no mistakes, there's literally nothing an attacker can do.
Of course, the above is not strictly true: a defence-in-depth approach can sometimes make a particular mistake inconsequential. This in turn can make defenders ignore such mistakes when they're not exploitable.
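A toy model of both halves of the asymmetry (all numbers are assumptions of mine): give the defender many independent chances to slip up and a single layer is almost certainly breached somewhere, but require the same slip to pass several independent layers and the breach probability collapses.

```python
# The defender makes each of n possible mistakes with probability p (assumed
# independent, which real mistakes often are not).
n, p = 1000, 0.001

# One layer: the attacker wins if any single mistake slips through.
p_breach_single = 1 - (1 - p) ** n      # ~0.63

# Defence in depth: a breach needs a mistake in each of k independent layers.
k = 3
p_breach_depth = 1 - (1 - p ** k) ** n  # ~1e-6

print(p_breach_single, p_breach_depth)
```

The independence assumption is doing a lot of work here, which is exactly why correlated mistakes across layers are the thing defenders should worry about.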
Modern software supply chains are long and wide. A typical application might depend on thousands of libraries, and nobody can realistically audit them all. And there's hardware, too. Processor-level vulnerabilities in particular are not realistically avoidable.
The cost of exploiting vulns is going down quickly. The cost of finding and fixing them is falling quickly too. It's going to be really interesting to see what the new equilibrium is going to be like.
Larry Ellison is the CTO of Oracle, one of the three companies running the Stargate Project. Even if aligning AI systems to some values can be solved, selecting those values badly can still be approximately as bad as the AI just killing everyone. Moral philosophy continues to be an open problem.
How much I enjoy a discussion seems to anticorrelate with the number of participants. I previously thought this was about each person having more space and more power to steer the conversation, but I now think it's mostly a selection effect. This means that perhaps splitting up large groups is less useful than I thought.
Inside jokes also get better when fewer people know about them. The primary question is: does this extend down to one person? Or zero? I definitely tend to randomly laugh at jokes nobody present understands.
Initial impression of Claude Opus 4.7 plus adaptive thinking: it seems much more capable of discussing the nuanced points of my models. There's finally the kind of back-and-forth dynamic you get with another person who's trying to get on the same page and who has their own ideas about how the world works. Or perhaps they just hit the sycophancy level that I happen to like. Worryingly, I don't seem to care much anymore.