Is the rational mindset an existential risk? It spreads the idea of arms races and the treacherous turn. If so, should we be encouraging less-than-rational world views to spread, and if so, which ones? And should we be coding them into our AI? You would probably want them to be hard to predict, so that they cannot be exploited easily.
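To illustrate the last point with a toy game (the payoffs and moves here are my own illustration, not from anyone's proposal): in matching pennies, any fully predictable policy can be exploited every round, while a randomising policy cannot, no matter how well the opponent knows its strategy.

```python
import random

def payoff(agent: str, opponent: str) -> int:
    """Matching pennies from the agent's side: -1 if matched, +1 otherwise."""
    return -1 if agent == opponent else 1

rounds = 10_000

# A fully predictable agent: an exploiter simply matches it every round.
det_total = sum(payoff("heads", "heads") for _ in range(rounds))

# A randomising agent: an opponent may know the *strategy* but cannot
# predict the *move*, so the expected payoff against any exploiter is 0.
rnd_total = sum(payoff(random.choice(["heads", "tails"]), "heads")
                for _ in range(rounds))

print(det_total / rounds)  # -1.0: the predictable agent is fully exploited
print(rnd_total / rounds)  # ~0.0: randomness removes the exploit
```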
If it is, it would still be worth preserving as an example of an insidious threat that should be guarded against, perhaps in a simulation for people to interact with.
You might still want the choice of which mindset to adopt to be as rational as possible, though. Decision making under deep uncertainty seems to allow you to deviate from the traditionally rational: you can evaluate plans under different world views and pick actions or plans that don't seem too bad under any of them. This could allow irrational world views to have a voice.
How irrational a world view do you want to accept under deep uncertainty? Perhaps you need to evaluate the outcomes from that world view and see if there might be something hidden that it is tapping into.
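A minimal sketch of the selection rule described above, under two assumptions of mine: that each world view can be reduced to a scoring function over plans, and that "doesn't seem too bad under them all" is operationalised as maximin (pick the plan whose worst score across world views is highest).

```python
from typing import Callable

Plan = str
WorldView = Callable[[Plan], float]  # higher score = better outcome

def robust_choice(plans: list[Plan], world_views: dict[str, WorldView]) -> Plan:
    """Return the plan with the best worst-case score across world views."""
    def worst_case(plan: Plan) -> float:
        return min(view(plan) for view in world_views.values())
    return max(plans, key=worst_case)

# Toy example: an "irrational" world view gets a voice because it can
# veto plans that look fine to the orthodox view.
world_views = {
    "orthodox":   lambda p: {"race": 0.7, "pause": 0.4, "hedge": 0.6}[p],
    "contrarian": lambda p: {"race": -1.0, "pause": 0.5, "hedge": 0.3}[p],
}
print(robust_choice(["race", "pause", "hedge"], world_views))  # -> "pause"
```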
I've been thinking a lot about identity (as in Paul Graham's "Keep Your Identity Small").
Specifically, I've been asking which identities might lead to safe development of AI, and trying to validate candidate answers by running these different activities:
1. Role-playing games where the participants are asked to take on specific identities and play through a scenario in which an AI has to be created.
2. Similar exercises where LLMs are prompted to take on particular roles and given agency to participate in the role-playing games too (a minimal sketch follows this list).
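Here is one way the setup in (2) could look. `call_llm` is a hypothetical stand-in for whatever chat-completion API you use, and the identity texts and scenario are illustrative, not from the post above.

```python
# Hypothetical sketch of prompting an LLM to adopt an identity inside a
# role-playing scenario. Replace call_llm with a real chat API client.

IDENTITIES = {
    "cosmic": (
        "You see humanity as a small part of a wider cosmos, which may "
        "contain both hostile and useful alien civilisations."
    ),
    "national": "You identify primarily with your nation and its interests.",
}

SCENARIO = "Your lab can deploy a powerful AI system this year. Decide what to do."

def call_llm(messages: list[dict]) -> str:
    """Stand-in for a real chat-completion call; swap in your API client."""
    raise NotImplementedError

def play_turn(identity: str, transcript: list[str]) -> str:
    """Ask the LLM, in character, for its next move in the game."""
    messages = [
        {"role": "system", "content": f"Stay in character. {IDENTITIES[identity]}"},
        {"role": "user",
         "content": SCENARIO + "\n\nTranscript so far:\n" + "\n".join(transcript)},
    ]
    return call_llm(messages)
```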
Has there been similar work before?
I'm particularly interested in cosmic identity, where you see humanity as a small part of a wider cosmos, including potentially hostile and potentially useful aliens. It has a number of properties that I think make it interesting, which I'll discuss in a full post, if people think this is worth exploring.
Are there other identities that people think should be explored?
The cosmic identity and related issues have been considered before; I even used them to make a conjecture about alignment. As for role-playing games, I doubt that they are actually useful, unless, of course, you mean something like Cannell's proposal.
As for "the idea of arms races and the treacherous turn", the AI-2027 team isn't worried about such a risk, they are more worried about the race itself causing the humans to do worse safety checks.
But slightly irrational actors might not race (especially if they know that other actors are slightly irrational in the same or a compatible way).
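A toy illustration of this, with made-up payoffs: in a prisoner's-dilemma-style race game, a small "irrational" aversion to racing, modelled here as a flat utility penalty, flips the dominant strategy from race to pause, and two actors who share the penalty both end up better off than under mutual racing.

```python
BASE = {  # (my_action, their_action) -> my payoff; illustrative numbers
    ("race", "race"): 1.0,
    ("race", "pause"): 3.0,
    ("pause", "race"): 0.0,
    ("pause", "pause"): 2.0,
}

def best_response(their_action: str, caution: float) -> str:
    """Best reply given a caution penalty subtracted from racing payoffs."""
    def utility(my_action: str) -> float:
        penalty = caution if my_action == "race" else 0.0
        return BASE[(my_action, their_action)] - penalty
    return max(["race", "pause"], key=utility)

for c in (0.0, 1.5):
    replies = {a: best_response(a, c) for a in ("race", "pause")}
    print(f"caution={c}: {replies}")
# caution=0.0: racing dominates -> both race, each getting 1.
# caution=1.5: pausing dominates -> if both actors share the caution,
# both pause and each gets 2, better than mutual racing.
```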
I think that there might be perverse incentives if identities or viewpoints get promoted in a legible fashion: incentives to hack that system rather than to do useful work.
So it might be good for identity promotion to be done in a way that is obfuscated or ineffable.