Shortform Content

Is disempowerment that bad? Is a human-directed society really much better than an AI-directed society with a tiny weight of kindness towards humans? Human-directed societies themselves usually generate orthogonal and instrumental goals, and our assessment of them is highly subjective and relative. I don't see how disempowerment without extinction is that different from today for most people, who are already effectively disempowered.

There are two importantly different senses of disempowerment. In one, the stars are taken out of reach, forever, but human civilization develops in its own direction. In the other, human civilization is molded according to AIs' aesthetics, with interventions that manipulate it.

Is there a big reason the latter is hugely different from the former for the average person, excluding world leaders?

If alignment is difficult, it is likely inductively difficult (difficult regardless of your base intelligence), and an ASI will be cautious about creating a misaligned successor or upgrading itself in a way that risks misalignment.

You may argue it's easier for an AI to upgrade itself, but if the process is hardware-bound or requires radical algorithmic changes, the ASI will still need to create an aligned successor, as preferences and values may not transfer directly to new architectures or hardware.

If alignment is easy, we will likely solve it with superhuman narrow intelligences and aligned near-peak-human-level AGIs.

Why wouldn’t a wire head trap work?

Let's say an AI has a remote sensor that measures a value function until the year 2100, and it's trained with RL to optimize this value function over time. We can make this remote sensor easily hackable, so that hacking it yields maximum value at 2100. If the AI understands human values, it won't try to hack its sensors. If it doesn't, we sort of have a trap for it that represents an easily achievable infinite peak.

Reinforcement learning doesn't guarantee anything about how a system generalizes out of distribution. There are plenty of other things that the system can generalize to that are neither the physical sensor output nor human values. Separately from this, there is no necessary connection between understanding human values and acting in accordance with human values. So there are still plenty of failure modes.

Yes, nothing is guaranteed in probabilities, but can't we just make it very easy for the AI to perfectly achieve its objective? If things don't go exactly the way we want, we've at least made an easier solution exist than disempowering us or wiping us out.

I guess in the long run we still select for models that ultimately don't wirehead. But this might eliminate a lot of obviously wrong alignment failures we would otherwise miss.

I was asking around in multiple Discords for a certain type of mental "heuristic", or "mental motion", or "badass thing that Planecrash!Keltham or HPMOR!Harry would think to themselves, in order to guide their own thoughts into fruitful and creative and smart directions". Someone commented that this could also be reframed as "language-model prompts to your own mind" or "language-model simulations of other people in your own mind".

I've decided to clarify what I meant, and why even smart people could benefit from seemingly hokey tricks like this.

Heuristics/LMs-of-oth... (read more)

Jay Bailey:
What are the best ones you've got?

(sources: Discord chats on public servers)

Why do I believe X?

What information do I already have, that could be relevant here?

What would have to be true such that X would be a good idea?

If I woke up tomorrow and found a textbook explaining how this problem was solved, what's paragraph 1?

What is the process by which X was selected?

How can I make predictions in a way which lets me do data analysis? I want to be able to grep / tag questions, plot calibration over time, split out accuracy over tags, etc. Presumably exporting to a CSV should be sufficient. PredictionBook doesn't have an obvious export feature, and its API seems to not be working right now / I haven't figured it out yet. 

Trying to collate team shard's prediction results and visualize with plotly, but there's a lot of data processing that has to be done first. Want to avoid the pain in the future.
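The binning-and-plotting step can be sketched with the standard library alone; the CSV schema here (question, semicolon-separated tags, probability, outcome) is a hypothetical one, not PredictionBook's actual export format, and plotly would just consume the same binned numbers:

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV export: one row per resolved prediction.
raw = """question,tags,probability,outcome
AGI by 2030,ai;timelines,0.2,0
Team ships feature,work,0.9,1
Rain tomorrow,weather,0.6,1
Rain next week,weather,0.7,0
Paper accepted,work,0.8,1
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Calibration: bin predictions into deciles, compare stated probability
# to observed frequency within each bin.
bins = defaultdict(list)
for r in rows:
    p = float(r["probability"])
    bins[min(int(p * 10), 9)].append((p, int(r["outcome"])))

for b in sorted(bins):
    ps, outcomes = zip(*bins[b])
    print(f"bin {b/10:.1f}-{(b+1)/10:.1f}: "
          f"stated {sum(ps)/len(ps):.2f}, "
          f"observed {sum(outcomes)/len(outcomes):.2f}, n={len(outcomes)}")

# Accuracy split by tag (semicolon-separated tags stay greppable).
by_tag = defaultdict(list)
for r in rows:
    for tag in r["tags"].split(";"):
        by_tag[tag].append(int(r["outcome"]))
```

With real data the per-bin lists get large enough that the stated-vs-observed pairs can go straight into a calibration scatter plot, and `by_tag` gives the accuracy-by-tag split.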

Thoughts on Apple Vision Pro:

  • The price point is inaccessibly high.
  • I'm generally bullish on new interfaces to computing technology. The benefits aren't always easy to perceive until you've had a chance to start using it.
  • If this can sit on my head and allow me to type or do calculations while I'm working in the lab, that would be very convenient. Currently, I have to put gloves on and off to use my phone, and office space with my laptop is a 6-minute round trip from the lab.
  • I can see an application that combines voice-to-text and AI in a way that makes it fe
... (read more)
Sure, but an audio-only interface can be done with an iPhone and some AirPods; no need for a new interface.

That's true! However, I would feel weird and disruptive trying to ask ChatGPT questions when working alongside coworkers in the lab.


If it did actually turn out that aliens had visited Earth, I'd be pretty willing to completely scrap the entire Yudkowskian implied-model-of-intelligent-species-development and heavily reevaluate my concerns around AI safety.

Saw someone today demonstrating what I like to call the "Kirkegaard fallacy", in response to the Debrief article making the rounds.

People who have one obscure or weird belief tend to be unusually open minded and thus have other weird beliefs. Sometimes this is because they enter a feedback loop where they discover some established opinion is likely wrong, and then discount perceived evidence for all other established opinions. 

This is a predictable state of affairs regardless of the nonconsensus belief, so the fact that a person currently talking to you about e.g. UFOs entertains other off-brand ideas like parapsychology or afterlives is not good evidence that the other nonconsensus opinion in particular is false.

I regret each of the thousands of hours I spent on my power-seeking theorems, and sometimes fantasize about retracting one or both papers. I am pained every time someone cites "Optimal policies tend to seek power", and despair that it is included in the alignment 201 curriculum. I think this work makes readers actively worse at thinking about realistic trained systems.

I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that "optimality" is a horrible way of understanding trained policies. 

Thanks for your patient and high-quality engagement here, Vika! I hope my original comment doesn't read as a passive-aggressive swipe at you. (I consciously tried to optimize it to not be that.) I wanted to give concrete examples so that Wei_Dai could understand what was generating my feelings. It's a tough question to say how to apply the retargetablity result to draw practical conclusions about trained policies. Part of this is because I don't know if trained policies tend to autonomously seek power in various non game-playing regimes.  If I had to say something, I might say "If choosing the reward function lets us steer the training process to produce a policy which brings about outcome X, and most outcomes X can only be attained by seeking power, then most chosen reward functions will train power-seeking policies." This argument appropriately behaves differently if the "outcomes" are simply different sentiment generations being sampled from an LM -- sentiment shift doesn't require power-seeking. My guess is that the optimal policies paper was net negative for technical understanding and progress, but net positive for outreach, and agree it has strong benefits in the situations you highlight. I think that it's locally valid to point out "under your beliefs (about optimal policies mattering a lot), the situation is dangerous, read this paper." But I feel a tad queasy about the overall point, since I don't think alignment's difficulty has much to do with the difficulties pointed out by "Optimal Policies Tend to Seek Power." I feel better about saying "Look, if in fact the same thing happens with trained policies, which are sometimes very different, then we are in trouble." Maybe that's what you already communicate, though.
Thanks Alex! Your original comment didn't read as ill-intended to me, though I wish that you'd just messaged me directly. I could have easily missed your comment in this thread - I only saw it because you linked the thread in the comments on my post. Your suggested rephrase helps to clarify how you think about the implications of the paper, but I'm looking for something shorter and more high-level to include in my talk. I'm thinking of using this summary, which is based on a sentence from the paper's intro: "There are theoretical results showing that many decision-making algorithms have power-seeking tendencies." (Looking back, the sentence I used in the talk was a summary of the optimal policies paper, and then I updated the citation to point to the retargetability paper and forgot to update the summary...)

"There are theoretical results showing that many decision-making algorithms have power-seeking tendencies."

I think this is reasonable, although I might say "suggesting" instead of "showing." I think I might also be more cautious about further inferences which people might make from this -- like I think a bunch of the algorithms I proved things about are importantly unrealistic. But the sentence itself seems fine, at first pass.

Idea: Github + Voting 

The way Github works is geared towards how projects usually work - someone or some group owns them, or is responsible for them, and they have the ability to make decisions about it, then there are various level of permissions. If you can make a pull request, someone from the project's team has to accept it. 

This works well for almost all projects, so I have no problem with it. What I'm going to suggest is more niche and solves a different problem. What if there's a project that you want to be completely public, that no one p... (read more)

I like this idea, because I'm too lazy to review pull requests. It would be great if other people could just review and vote on them for me :P

There are blockchains like Polkadot that require people to vote on whether code changes get deployed. I'm not sure exactly how that works, but looking at those crypto projects might be valuable, both for studying existing uses and for finding potential users.
Yoav Ravid:
Oh, cool! I'll look into it. Thanks :)

There may be no animal welfare gain to veganism

I remain unconvinced that there is any animal welfare gain to vegi/veganism: farm animals have a strong desire to exist, and if we stopped eating them they would stop existing.

Vegi/veganism exists for reasons of signalling; it would be surprising if it had any large net benefits other than signalling.

On top of this, the cost to mitigate most of the aspects of farming that animals disprefer is likely vastly smaller than the harms to human health.

Back of the envelope calculation is that making farming highly pref... (read more)

This sounds to me like: "freeing your slaves is virtue signaling, because abolishing slavery is better". I agree with the second part, but it can be quite difficult for an individual or a small group to abolish slavery, while freeing your slaves is something you can do right now (and then suffer the economic consequences). If I had a magical button that would change all meat factories into humane places, I would press it. If there were a referendum on making humane farms mandatory, I would vote yes. In the meanwhile, I can contribute a tiny bit to the reduction of animal suffering by reducing my meat consumption. You may call it virtue signaling; I call it taking the available option, instead of dreaming about hypothetically better options that are currently not available.

I think this doesn't make sense any more now that veganism is such a popular and influential movement that influences government policy and has huge control over culture.

But a slightly different version of this is that because there's no signalling value in a collective decision to impose welfare standards, it's very hard to turn into a political movement. So we may be looking at a heavily constrained system.

You're correct, that is a mistake. It's $6.50 per kg, I forgot to convert.

Conservatism says "don't be first, keep everything the same." This is a fine, self-consistent stance.

A responsible moderate conservative says "Someone has to be first, and someone will be last. I personally want to be somewhere in the middle, but I applaud the early adopters for helping me understand new things." This is also a fine, self-consistent stance.

Irresponsible moderate conservatism endorses "don't be first, and don't be last," as a general rule, and denigrates those who don't obey it. It has no answers for who ought to be first and last. But for ... (read more)

One weird trick for estimating the expectation of Lognormally distributed random variables:

If you have a variable X that you think is somewhere between 1 and 100 and is Lognormally distributed, you can model it as a random variable with distribution ~ Lognormal(1,1), meaning log10(X) has distribution ~ Normal(1,1).

What is the expectation of X?

Naively, you might say that since the expectation of log(X) is 1, the expectation of X is 10^1, or 10. That makes sense, 10 is at the midpoint of 1 and 100 on a log scale.

This is wrong though. The chanc... (read more)
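A quick sketch of why the naive answer is off, assuming the base-10 convention used in the example (log10(X) ~ Normal(1,1)): for ln X ~ Normal(mu', sigma'^2), the mean is E[X] = exp(mu' + sigma'^2/2), and converting from base 10 gives mu' = ln(10), sigma' = ln(10).

```python
import math
import random

# X = 10**Z with Z ~ Normal(mu=1, sigma=1),
# so ln X ~ Normal(mu * ln10, (sigma * ln10)**2).
mu, sigma = 1.0, 1.0
ln10 = math.log(10)

# Closed form for a lognormal mean: E[X] = exp(mu' + sigma'**2 / 2).
analytic = math.exp(mu * ln10 + (sigma * ln10) ** 2 / 2)
print(f"analytic E[X] ~ {analytic:.1f}")  # far above the naive 10

# Monte Carlo sanity check.
random.seed(0)
n = 500_000
mc = sum(10 ** random.gauss(mu, sigma) for _ in range(n)) / n
print(f"monte carlo E[X] ~ {mc:.1f}")
```

The mean comes out around 140, not 10: the median of X is indeed 10, but the heavy right tail of the lognormal pulls the expectation far above the median.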

Link or derivation, please.

Proposed Forecasting Technique: Annotate Scenario with Updates (Related to Joe's Post)

  • Consider a proposition like "ASI will happen in 2024, not sooner, not later." It works best if it's a proposition you assign very low credence to, but that other people you respect assign much higher credence to.
  • What's your credence in that proposition?
  • Step 1: Construct a plausible story of how we could get to ASI in 2024, no sooner, no later. The most plausible story you can think of. Consider a few other ways it could happen too, for completeness, but don't write them d
... (read more)

Science as a kind of Ouija board:

With the board, you do this set of rituals and it produces a string of characters as output, and then you are supposed to read those characters and believe what they say.

So too with science. Weird rituals, check. String of characters as output, check. Supposed to believe what they say, check.

With the board, the point of the rituals is to make it so that you aren't writing the output, something else is -- namely, spirits. You are supposed to be light and open-minded and 'let the spirit move you' rather than deliberately try ... (read more)

It's no longer my top priority, but I have a bunch of notes and arguments relating to AGI takeover scenarios that I'd love to get out at some point. Here are some of them:

Beating the game in May 1937 - Hoi4 World Record Speedrun Explained - YouTube
In this playthrough, the USSR has a brief civil war and Trotsky replaces Stalin. They then get an internationalist socialist type diplomat who is super popular with the US, UK, and France, who negotiates passage of troops through their territory -- specifically, they send many many brigades of extremely low-tier troop... (read more)

epistemic status: I'm pretty sure that the described processes are part of me, but I don't know anything about how this might be applicable to other people (nervous systems).

AFAICT I do (at least) two distinct activities which I refer to as "thinking".  The first type I'm calling "explicit computing" and it's what I'm doing when I manually multiply 2387*8230 or determine whether a DFA will accept some string or untangle some Python function that isn't behaving as I expect.  The second type I'm calling "forced updating" and unlike explicit computi... (read more)

Gesild Muka:
The way you word the second type might be working against you. 'Updating' brings to mind the computer function which neatly and quickly fills a bar and then the needed changes are complete. The human brain doesn't work like that. To build new intuitions you need spaced repetition and thoughtful engagement with the belief you want to internalize. Thinking does work, you just can't force it.

Specifically, I was thinking about it like updating a row in a SQL database. But even in that case, based on what you are saying, it sounds like an unhelpful and unrepresentative model of how the brain thinks new beliefs into itself. Could you give me a slightly more detailed description of the procedure you use to intentionally change a belief?

I'll try my best; I'm by no means an expert. I don't think there's a one-size-fits-all answer, but let's take your example of the relationship between IQ and national prosperity. You can spend time researching what makes up prosperity and where that intersects with IQ, and find different correlates between IQ and other attributes in individuals (the assumption being that individuals are a kind of unit to measure prosperity).

You can use spaced repetition to avoid burnout and gain fresh perspectives. The point is to build mental muscle memory and intuition on what... (read more)

In the past I had the thought: "probably there is no way to simulate reality that is more efficient than reality itself". That is, no procedure implementable in physical reality is faster than reality at the task of, given a physical state, computing the state after t physical ticks. This was motivated by intuitions about the efficiency of computational implementation in reality, but it seems like we can prove it by diagonalization (similarly to how we can prove two systems cannot perfectly predict each other), because the machine could in particular predi... (read more)

Can somebody convince me that the problem this site is trying to solve is even soluble?

I don't understand how an aligned machine intelligence is definitionally possible without some sort of mathematization of ethics/decision theory, which I know has been worked on. But any such correct mathematization, that is to say one that doesn't assert a falsehood or anything else from which seemingly anything is derivable, should amount to solving ethics entirely. It's the ethical equivalent of..."the meteor is coming and it will eradicate us all unless we come up wi... (read more)

Gerald Monroe:
I think it's very solvable, using a composition of approaches we have already used in software for many years. Essentially, an "aligned" superintelligence can be thought of as a machine that receives an input frame; before the model ever sees the frame, it is checked for whether it fits well into the distribution of the training set. If the input is in distribution, the model generates an output frame. Other models may be queried simultaneously with the same input. The output frame is then checked by other models for errors and rule violations. Finally, the input and output schema are known to humans and have been designed to have no unnecessary bits, making collusion and deception between models difficult.

For example, a frame could be a text prompt. It could be a prompt plus a single camera frame from a robot. It could be a prompt, multiple camera frames, frames of lidar observation, intermediate data on the machine's belief about the current local world state, and robot proprioception data. Either way, the model acts on one frame at a time and is unaware of any context data not in a frame. Individual ASI models may not even be given control authority for more than a few frames at a time, or may have their control interleaved between multiple models, etc.

The above description is CAIS, but I have added details from my work on realtime neural network control systems and ADAS fault handling, and I noticed a simple way to build an in-distribution detector: simply compress the frame with an autoencoder and measure the residual incompressible portion.

Note also that this type of "aligned" machine is intended to accomplish whatever task humans assign it, and it operates on short time horizons. It shouldn't attempt power-seeking, as there is no reward for doing so (seeking power takes too long). Its session won't last long enough to benefit from deception. And it will not operate on out-of-distribution inputs. However, it has no morals, and there is no element in the
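The compress-and-measure-the-residual idea can be sketched with a linear autoencoder, which is equivalent to PCA: encode the frame into a few components, decode it back, and treat the reconstruction error as the incompressible portion. This is an illustrative toy on synthetic data, not the commenter's actual system; all names and the residual threshold are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# In-distribution "frames": 10-D vectors that actually live on a 2-D
# subspace, plus a little sensor noise.
basis = rng.normal(size=(10, 2))
train = rng.normal(size=(5000, 2)) @ basis.T + 0.01 * rng.normal(size=(5000, 10))

# Linear autoencoder = top-k principal components of the training set.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:2]  # encoder/decoder weights, k = 2

def residual(x):
    """Reconstruction error: the incompressible part of the frame."""
    z = (x - mean) @ components.T      # encode
    x_hat = z @ components + mean      # decode
    return float(np.linalg.norm(x - x_hat))

# Threshold: 99th percentile of residuals seen on in-distribution data.
threshold = np.quantile([residual(x) for x in train], 0.99)

in_dist = rng.normal(size=(2,)) @ basis.T  # lies on the learned subspace
out_dist = rng.normal(size=(10,))          # generic point, off-subspace
print(residual(in_dist) <= threshold, residual(out_dist) > threshold)
```

A frame whose residual exceeds the threshold is flagged as out of distribution and never reaches the model; a real system would use a nonlinear autoencoder, but the gating logic is the same.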

Is it too pessimistic to point out that humans are constantly doing awful things to each other and their surroundings and have been for millennia? I mean, we torture and kill at least eleven moral patients per human per year. I don't understand how this solution doesn't immediately wipe out as much life as it can.

Part of your decision of "should I go on semaglutide/Wegovy/Ozempic?" would be influenced by whether it improves lifespan.  Weight loss is generally good for that, but here's a decision market specifically about lifespan.

Since I've been getting downvoted for what have felt like genuine attempts to create less-corruptible information, please try to keep an open mind or explain why you downvote.
