Shortform Content

Many methods to "align" ChatGPT seem to make it less willing to do things its operator wants it to do, which seems spiritually against the notion of having a corrigible AI.

I think this is a more general phenomenon when aiming to minimize misuse risks. You will need to end up doing some form of ambitious value learning, which I anticipate will be especially susceptible to being broken by alignment hacks produced by RLHF and its successors.

Ideally, a competitive market would drive the price of goods close to the price of production, rather than the price that maximizes revenue. Unfortunately, some mechanisms prevent this.

One is the exploitation of the network effect, where a good is more valuable simply because more people use it. For example, a well-designed social media platform is useless if it has no users, and a terrible addictive platform can be useful if it has many users (Twitter).

This makes it difficult to break into a market and gives popular services the chance to charge what peop...

Most of the "mechanisms which prevent competitive pricing" come down to monopoly.  The network effect is "just" a natural monopoly, where the first success gains so much ground that competitors can't really get a start.  Another curiosity is the difference between average cost and marginal cost.  One more user does not cost $40.  But, especially in growth mode, the average cost per user (of your demographic) is probably higher than you think - these sites are profitable, but not amazingly so.

None of this invalidates your anger at the inadequacy of the modern dating equilibrium.  I sympathize that you don't have parents willing to arrange your marriage and save you the hassle.

Three related concepts.

  • On redundancy: "two is one, one is none". It's best to have copies of critical things in case they break or go missing, e.g. an extra cell phone.
  • On authentication: "something you know, have, and are". These are three categories of ways you can authenticate yourself.
    • Something you know: password, PIN
    • Something you have: key, phone with 2FA keys, YubiKey
    • Something you are: fingerprint, facial scan, retina scan
  • On backups: the "3-2-1" strategy.
    • Maintain 3 copies of your data:
    • 2 on-site but on different media (e.g. on your laptop
...
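The 3-2-1 rule above can be sketched in a few lines. This is a minimal toy illustration of my own (the paths are made up; in practice the second copy would sit on a genuinely different medium and the third would be off-site, e.g. a cloud bucket or remote server):

```python
# Toy 3-2-1 sketch (illustrative paths, not a real backup tool):
# 3 copies of the data, 2 on different local "media", 1 off-site.
import shutil
from pathlib import Path

primary = Path("data/notes.txt")                   # working copy
local_backup = Path("external_drive/notes.txt")    # 2nd copy, different medium
offsite_backup = Path("cloud_bucket/notes.txt")    # 3rd copy, "off-site"

primary.parent.mkdir(parents=True, exist_ok=True)
primary.write_text("irreplaceable notes")

for backup in (local_backup, offsite_backup):
    backup.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(primary, backup)  # copy2 also preserves file metadata
```

A real setup would replace the copy loop with something like rsync or a backup service, but the invariant being maintained is the same: lose any one copy (or any one medium) and you still have the data.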

(random shower thoughts written with basically no editing)

Sometimes arguments have a beat that looks like "there is extreme position X, and opposing extreme position Y. What about a moderate 'Combination' position?" (I've noticed this in both my own and others' arguments.)

I think there are sometimes some problems with this.

  • Usually almost nobody is on the most extreme ends of the spectrum. Nearly everyone falls into the "Combination" bucket technically, so in practice you have to draw the boundary between "combination enough" vs "not combination enough to
...

related take: "things are more nuanced than they seem" is valuable only as the summary of a detailed exploration of the nuance that engages heavily with object level cruxes; the heavy lifting is done by the exploration, not the summary

My review of Wentworth's "Selection Theorems: A Program For Understanding Agents" is tentatively complete.

I'd appreciate it if you could take a look at it and let me know what you think!


I'm so very proud of the review.

I think it's an excellent review and a significant contribution to the Selection Theorems literature (e.g. I'd include it if I was curating a Selection Theorems sequence). 

I'm impatient to post it as a top-level post but feel it's prudent to wait till the review period ends.

I have an intuition that any system that can be modeled as a committee of subagents can also be modeled as an agent with Knightian uncertainty over its utility function. This goal uncertainty might even arise from uncertainty about the world.

This is similar to how in Infrabayesianism an agent with Knightian uncertainty over parts of the world is modeled as having a set of probability distributions with an infimum aggregation rule.
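To make the aggregation rule concrete, here is a toy sketch of my own (not from the cited papers): a "committee" of subagents, each with its own utility function, aggregated by taking the infimum over the committee — i.e. a maximin choice rule over a set of utilities rather than a single one.

```python
# Toy illustration (my construction): an agent with Knightian
# uncertainty over its utility function, modeled as a set of utility
# functions aggregated by the infimum (worst case over the committee).

def maximin_choice(actions, utility_functions):
    """Pick the action whose worst-case utility across the set is highest."""
    return max(actions, key=lambda a: min(u(a) for u in utility_functions))

# Two "subagents" that disagree about which outcome is best.
u1 = {"a": 1.0, "b": 0.0, "c": 0.6}.get
u2 = {"a": 0.0, "b": 1.0, "c": 0.5}.get

print(maximin_choice(["a", "b", "c"], [u1, u2]))  # "c": best worst case
```

Each subagent's favorite ("a" or "b") has a worst-case utility of 0, so the committee settles on the compromise "c" — the hedging behavior you'd expect from infimum aggregation.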

These might be of interest, if you haven't seen them already:

Bewley, T. F. (2002). Knightian decision theory. Part I. Decisions in economics and finance, 25, 79-110.

Aumann, R. J. (1962). Utility theory without the completeness axiom. Econometrica: Journal of the Econometric Society, 445-462.

The lack of willpower is a heuristic which doesn’t require the brain to explicitly track & prioritize & schedule all possible tasks, by forcing it to regularly halt tasks—“like a timer that says, ‘Okay you’re done now.’”

If one could override fatigue at will, the consequences could be bad. Users of dopaminergic drugs like amphetamines often note issues with channeling the reduced fatigue into useful tasks rather than alphabetizing one’s bookcase.

In more extreme cases, if one could ignore fatigue entirely, then analogous to lack of pain, the consequenc

...

Projects I'd do if only I were faster at coding

  • Take the derivative of one of the output logits with respect to the input embeddings, and also the derivative of the output logits with respect to the input tokenization. 
    • Perform SVD, see which individual inputs have the greatest effect on the output (sparse addition), and which overall vibes have the greatest effect (low rank decomposition singular vectors)
    • Do this combination for literally everything in the network, see if anything interesting pops out
  • I want to know how we can tell ahead of time what asp
...
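For the first bullet, a toy stand-in might look like the following. This is my own construction, not the author's code: it uses a linear map as the "network" so the Jacobian is exact; for a real transformer you would compute the Jacobian with autograd instead.

```python
import numpy as np

# Toy stand-in: a linear "network" logits = W @ x, so the Jacobian of
# the logits with respect to the input embedding x is exactly W.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))  # 4 logits, 8 embedding dims
jacobian = W                 # d(logits)/d(x) for a linear map

# Low-rank view: the top right-singular vectors are the input
# directions ("overall vibes") with the largest effect on the logits.
U, S, Vt = np.linalg.svd(jacobian)
top_direction = Vt[0]

# Sparse view: per-coordinate effect sizes (column norms of the
# Jacobian) pick out individual input dimensions with the largest
# influence ("sparse addition").
per_input_effect = np.linalg.norm(jacobian, axis=0)
top_input = int(np.argmax(per_input_effect))
```

The same two decompositions (top singular directions vs. largest per-input norms) apply to any Jacobian, which is what makes the "do this for literally everything in the network" bullet cheap to try once the Jacobians are available.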

I would no longer do many of these projects

Does anyone have a good piece on hedging investments for AI risk? Would love a read, thanks!

things upvotes conflate:

  • agreement (here, separated - sort of)
  • respect
  • visibility
  • should it have been posted in the first place
  • should it be removed retroactively
  • is it readably argued
  • is it checkably precise
  • is it honorably argued
  • is it honorable to have uttered at all
  • is it argued in a truth seeking way overall, combining dimensions
  • have its predictions held up
  • is it unfair (may be unexpectedly different from others on this list)

(list written by my own thumb, no autocomplete)

these things and their inversions sometimes have multiple components, and ma...

Lesswrong is a garden of memes, and the upvote button is a watering can.

I was thinking the other day that if there was a "should this have been posted" score I would like to upvote every earnest post on this site on that metric. If there was a "do you love me? am I welcome here?" score on every post I would like to upvote them all.

Some discussion on whether alignment should see more influence from AGI labs or academia. I use the same argument in favor of strongly decoupling alignment progress from both: alignment progress needs to go faster than capability progress. If we use the same methods and cultural technology as AGI labs or academia, we all but guarantee alignment progress slower than capability progress. At best, those institutions would serve alignment only as well as they serve capabilities, and given that they are driven by capabilities progress rather than alignment progress, they will probably serve capabilities far better.

Garrett Baker:
Hm. Good points. I guess what I really mean with the academia points is that academia seems to have many blockers and inefficiencies, set up in such a way that capabilities progress finds them vastly easier to jump through than alignment progress does, and extra-so for capabilities labs. Like, right now it seems like a lot of alignment work is just playing with a bunch of different reframings of the problems to see what sticks or makes problems easier. You have more experience here, but my impression of a lot of academia was that it was very focused on publishing lots of papers with very legible results (and also a meaningless theory section). In such a world, playing around with different framings of problems doesn't succeed, and you end up pushed towards framings which are better on the currently used metrics. Most currently used metrics for AI stuff are capabilities oriented, so that means doing capabilities work, or work that helps push capabilities.
I think it's true that the easiest thing to do is legibly improve on currently used metrics. I guess my take is that in academia you want to write a short paper that people can see is valuable, which biases towards "I did thing X and now the number is bigger". But, for example, if you reframe the alignment problem and show some interesting thing about your reframing, that can work pretty well as a paper (see The Off-Switch Game, Optimal Policies Tend to Seek Power). My guess is that the bigger deal is that there's some social pressure to publish frequently (in part because that's a sign that you've done something, and a thing that closes a feedback loop).

Maybe a bigger deal is that by the nature of a paper, you can't get too many inferential steps away from the field.

Making the rounds.

User: When should we expect AI to take over?

ChatGPT: 10

User: 10?  10 what?

ChatGPT: 9 

ChatGPT: 8


I was a negative utilitarian for two weeks because of a math error

So I was like, 

If the neuroscience of human hedonics is such that we experience pleasure at about a 1 valence and suffering at about a 2.5 valence, 

And therefore an AI building a glorious transhuman utopia would get us to 1 gigapleasure, and an endless S-risk hellscape would get us to 2.5 gigapain, 

And we don’t know what our future holds, 

And, although the most likely AI outcome is still overwhelmingly “paperclips”, 

If our odds are 1:1 between ending u...
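Reconstructing the arithmetic from the numbers above (my guess at the elided step, not the author's worked figures): at 1:1 odds between +1 gigapleasure and 2.5 gigapain, the expected value comes out negative.

```python
# Expected value at 1:1 odds between the two non-paperclip outcomes
# (valences taken from the post: +1 utopia, -2.5 hellscape).
p_good = p_bad = 0.5
ev = p_good * 1.0 + p_bad * (-2.5)
print(ev)  # -0.75: negative, hence the (briefly held) negative-utilitarian pull
```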

[Draft] Note: this is written in a mere 20 minutes.

Hypothesis: I, and people in general, seem to really underestimate the rather trivial statement that people don't really learn about something when they don't spend time doing it or thinking about it. My thinking here includes improving your own self, and human modeling. Here is a list of related concepts. I was inspired by the first two; the things below are connections to slightly different preexisting ideas in my mind. I am on the lookout for more instances of this hypothesis.

  • cached thoughts, on yourself
...

There are at least a few different dimensions to "learning", and this idea applies more to some than to others.  Sometimes a brief summary is enough to change some weights of your beliefs, and that will impact future thinking to a surprising degree.  There's also a lot of non-legible thinking going on when just daydreaming or reading fiction.

I fully agree that this isn't enough, and both directed study and intentional reflection are also necessary to have clear models.  But I wouldn't discount "lightweight thinking" entirely.

Quick prediction so I can say "I told you so" as we all die later: I think all current attempts at mechanistic interpretability do far more for capabilities than alignment, and I am not persuaded by arguments of the form "there are far more capabilities researchers than mechanistic interpretability researchers, so we should expect MI people to have ~0 impact on the field". Ditto for modern scalable oversight projects, and anything having to do with chain of thought.


Very strong upvote. This also deeply concerns me. 

Garrett Baker:
I agree with this. I think people are bad at running that calculation, and consciously turning down status in general, so I advocate for this position because I think it's basically true for many. Most mechanistic interpretability is not in fact focused on the specific sub-problem you identify; it's wandering around in a billion-parameter maze, taking note of things that look easy & interesting to understand, and telling people to work on understanding those things. I expect this to produce far more capabilities-relevant insights than alignment-relevant insights, especially when compared to worlds where Neel et al went in with the sole goal of separating out theories of value formation, and then did nothing else. There’s a case to be made for exploration, but the rules of the game get wonky when you’re trying to do differential technological development. There becomes strategically relevant information you want to not know.
Garrett Baker:
I’m imagining a thing where you have little idea what’s wrong with your code, so you do MI on your model and can differentiate the worlds:

1. You’re doing literally nothing. Something’s wrong with the gradient updates.
2. You’re doing something, but not the right thing. Something’s wrong with code-section x. (With more specific knowledge about what model internals look like, this should be possible.)
3. You’re doing something, but it causes your agent to be suboptimal because of learned representation y.

I don’t think this route is especially likely; the point is I can imagine concrete & plausible ways this research can improve capabilities. There are a lot more in the wild, and many will be caught, given capabilities are easier than alignment and there are more capabilities workers than alignment workers.

Not quite. In the ontology of shard theory, we also need to understand how our agent will do reflection, and what the activated shard distribution will be like when it starts to do reflection. Knowing the value distribution is helpful insofar as the value distribution stays constant.

The Research Community As An Arrogant Boxer



Two pugilists circle in the warehouse ring. That's my man there. Blue Shorts. 

There is a pause to the beat of violence and both men seem to freeze glistening under the cheap lamps. An explosion of movement from Blue. Watch closely, this is a textbook One-Two. 

One. The jab. Blue snaps his left arm forward.

Two. Blue twists his body around and then throws a cross. A solid connection that is audible over the crowd.

His adversary drops like a doll.


Another warehouse, another match...

The past is a foreign country. Look upon its works and despair.

From the perspective of human civilization of, say, three centuries ago, present-day humanity is clearly a superintelligence. 

In any domain they would have considered important, we retain none of the values of that time. They tried to align our values to theirs, and failed abysmally.

With so few reasonable candidates for past superintelligences, reference-class forecasting the success of AI alignment looks bleak.

PDF versions for A Compute-Centric Framework of Takeoff Speeds (Davidson, 2023) and Forecasting TAI With Biological Anchors (Cotra, 2020), because some people (like me) like to have it all in one document for offline reading (and trivial inconveniences have so far prevented anyone else from making this).

A model I picked up from Eric Schwitzgebel.

The humanities used to be highest-status in the intellectual world!

But then, scientists quite visibly exploded fission weapons and put someone on the moon. It's easy to coordinate to ignore some unwelcome evidence, but not evidence that blatant. So, begrudgingly, science has been steadily accorded more and more status, from the postwar period on.
