Added italics. For the next post I'll break up the abstract into smaller paragraphs and/or make a TL;DR.
Copied it from the paper. I could break it down into several paragraphs but I figured bolding the important bits was easier. Might break up abstracts in future linkposts.
Was considering saving this for a followup post but it's relatively self-contained, so here we go.
Why are huge coefficients sometimes okay? Let's start by looking at norms per position after injecting a large vector at position 20.
This graph is explained by LayerNorm. Before using the residual stream we apply a LayerNorm:

```python
# transformer block forward() in GPT2
x = x + self.attn(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))
```

If x has very large magnitude, then the block doesn't change it much relative to its magnitude. Additionally, attention is run on the normalized input...
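To make this concrete, here's a minimal sketch (plain numpy, not GPT-2's actual code, with a random linear map standing in for the attn/mlp sublayer): because the sublayer only ever sees a normalized copy of x, its additive update has roughly constant magnitude no matter how big x is, so scaling x up by 1000 barely changes the block's behavior in relative terms.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # per-position LayerNorm as in GPT-2 (learned scale/shift omitted for simplicity)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

W = rng.normal(size=(64, 64))  # stand-in for the attn/mlp sublayer

x = rng.normal(size=64)
x_huge = 1000.0 * x            # large injected vector

delta = W @ layer_norm(x)
delta_huge = W @ layer_norm(x_huge)

# LayerNorm is scale-invariant (up to eps), so the additive update is unchanged:
print(np.allclose(delta, delta_huge, atol=1e-3))  # True
# ...while its size *relative to the stream* shrinks as ||x|| grows:
print(np.linalg.norm(delta_huge) / np.linalg.norm(x_huge))  # tiny
```

So the larger the injected vector, the more the block acts like the identity on it, which is why huge coefficients can be surprisingly benign.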
Relevant: The algorithm for precision medicine, where a very dedicated father of a son with a rare chronic disease (NGLY1 deficiency) went to extraordinary lengths to save him. He did so by writing a blog post that went viral & found other people with the same symptoms.
This article may serve as a shorter summary than the talk.
[APPRENTICE]
Hi I'm Uli and I care about two things: Solving alignment and becoming stronger (not necessarily in that order).
My background: I was unschooled, I've never been to school or had a real teacher. I taught myself everything I wanted to know. I didn't really have friends till 17 when I started getting involved with rationalist-adjacent camps.
I did SERI MATS 3.0 under Alex Turner, doing some interpretability on mazes. Now I'm working half-time on interpretability etc. with Alex's team, as well as studying.
In rough order of priority, the kinds of me...
Taji looked over his sheets. "Okay, I think we've got to assume that every avenue that LessWrong was trying is a blind alley, or they would have found it. And if this is possible to do in one month, the answer must be, in some sense, elegant. So no multiple agents. If we start doing anything that looks like we should call it 'HcH', we'd better stop. Maybe begin by considering how failure to understand pre-coherent minds could have led LessWrong astray in formalizing corrigibility."
"The opposite of folly is folly," Hiriwa said. "Let us pretend that LessWrong never existed."
(This could be turned into a longer post but I don't have time...)
I think the gold standard is getting advice from someone more experienced. I can easily point out the most valuable things to white-box for people less experienced than me.
Perhaps the 80/20 is posting recordings of you programming online and asking publicly for tips? Haven't tried this yet but seems potentially valuable.
I tentatively approve of activism & trying to get government to step in. I just want it to be directed in ways that aren't counterproductive. Do you disagree with any of my specific objections to strategies, or with the general point that flailing can often be counterproductive? (Note: not all activism is included in flailing; it depends on the type.)
Downvoted because I view some of the suggested strategies as counterproductive. Specifically, I'm afraid of people flailing. I'd be much more comfortable if there was a bolded paragraph saying something like the following:
Beware of flailing and second-order effects and the unilateralist's curse. It is very easy to end up doing harm with the intention to do good, e.g. by sharing bad arguments for alignment, polarizing the issue, etc.
To give specific examples illustrating this (which it may also be good to include in the post and/or edit the post to address):
Thanks for the insightful response! Agreed it's just suggestive for now, though more so than with image models (where I'd expect lenses to transfer really badly, but don't know). Perhaps its being a residual network is the key thing: since effective path lengths are low, most of the information is "carried along" unchanged, meaning the same probe continues working for other layers. Idk.
Don't we have some evidence GPTs are doing iterative prediction updating from the logit lens and later tuned lens? Not that that's all they're doing of course.
Strong upvoted and agreed. I don't think the public has opinions on AI X-Risk yet, so any attempt to elicit them will entirely depend on framing.
I'll note (because some commenters seem to miss this) that Eliezer is writing in a convincing style for a non-technical audience. Obviously the debates he would have with technical AI safety people are different from what is most useful to say to the general population.
I worry most people will ignore the warnings around willful inconsistency, so let me self-report that I did this and it was a bad idea. Central problem: It's hard to rationally update off new evidence when your system 1 is utterly convinced of something. And I think this screwed with my epistemics around Shard Theory while making communication with people about x-risk much harder, since I'd often typical mind and skip straight to the paperclipper - the extreme scenario I was (and still am to some extent) trying to avoid as my main case.
When my rationality ...
I feel there's often a wrong assumption in probabilistic reasoning, something like moderate probabilities for everything by default? After all, if you say you're 70/30, nobody who disagrees will ostracize you the way they would if you say 99/1.
"If alignment is easy I want to believe alignment is easy. If alignment is hard I want to believe alignment is hard. I will work to form accurate beliefs"
Petition to rename "noticing confusion" to "acting on confusion" or "acting to resolve confusion". I find myself quite good at the former but bad at the latter—and I expect other rationalists are the same.
For example: I remember having the insight leading to lsusr's post on how self-reference breaks the orthogonality thesis, but never pursued the line of questioning, since it would require sitting down and questioning my beliefs with paper for a few minutes, which is inconvenient and would interrupt my coding.
Strongly agree. Rationalist culture is instrumentally irrational here. It's very well known how important self-belief & a growth mindset are for success, and rationalists' obsession with natural intelligence is quite bad imo, to the point where I want to limit my interaction with the community so I don't pick up bad patterns.
I do wonder if you're strawmanning the advice a little, in my friend circles dropping out is seen as reasonable, though this could just be because a lot of my high-school friends already have some legible accomplishments and skills.
Each non-waluigi step increases the probability of never observing a transition to a waluigi a little bit.
Each non-Waluigi step increases the probability of never observing a transition to Waluigi a little bit, but not unboundedly so. As a toy example, we could start with P(Waluigi) = P(Luigi) = 0.5. Even if P(Luigi) monotonically increases, finding novel evidence that Luigi isn't a deceptive Waluigi becomes progressively harder. Therefore, P(Luigi) could converge to, say, 0.8.
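This convergence can be sketched with a toy Bayesian update (my own illustration, not from the post): assume a hidden Waluigi reveals itself at step t with some probability that shrinks fast enough that the total remaining evidence is bounded. Then P(Luigi) rises monotonically but converges strictly below 1.

```python
# Toy Bayesian update for the Luigi/Waluigi convergence claim.
p_luigi, p_waluigi = 0.5, 0.5
for t in range(10_000):
    p_reveal = 0.5 / (t + 2) ** 2  # assumed form; total evidence sums to a finite amount
    # We observe a non-Waluigi token: likelihood 1 under Luigi,
    # (1 - p_reveal) under a still-hidden Waluigi.
    p_waluigi *= 1 - p_reveal
    total = p_luigi + p_waluigi
    p_luigi, p_waluigi = p_luigi / total, p_waluigi / total

# P(Luigi) has increased every step but stays bounded away from 1.
print(p_luigi)
```

The exact limit depends on the assumed reveal schedule; the point is only that a convergent evidence stream leaves P(Waluigi) permanently above zero.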
However, once Luigi says something Waluigi-like, we immediately jump to a wor...
I'd be happy to talk to [redacted] and put them in touch with other smart young people. I know a lot from Atlas, ESPR and related networks. You can pass my contact info on to them.
Exercise: What mistake is the following sentiment making?
If there's only a one in a million chance someone can save the world, then there'd better be well more than a million people trying.
Answer:
The whole challenge of "having a one in a million chance of saving the world" is the wrong framing; the challenge is having a positive impact in the first place (for example: by not destroying the world or making things worse, e.g. via s-risks). You could think of this as a "setting the zero point" thing going on, though I like to think of it in terms of Bayes and P
Isn't this only S-risk in the weak sense of "there's a lot of suffering", not the strong sense of "literally maximize suffering"? E.g. it seems plausible to me that mistakes like "not letting someone die if they're suffering" still give you a net positive universe.
Also, insofar as shard theory is a good description of humans, would you say random-human-god-emperor is an S-risk? and if so, with what probability?
The enlightened have awakened from the dream and no longer mistake it for reality. Naturally, they are no longer able to attach importance to anything. To the awakened mind the end of the world is no more or less momentous than the snapping of a twig.
Looks like I'll have to avoid enlightenment, at least until the work is done.
Take the example of the Laplace approximation. If there's a local continuous symmetry in weight space, i.e., some direction you can walk that doesn't affect the probability density, then your density isn't locally Gaussian.
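A minimal sketch of this (a toy loss of my own choosing, not from the post): L(w) = (w₁w₂ − 1)² is invariant under (w₁, w₂) → (a·w₁, w₂/a), so the minimum is a curve rather than a point, and the Hessian at any minimum has a zero eigenvalue along the symmetry direction — exactly where the Laplace/Gaussian approximation degenerates.

```python
import numpy as np

# Toy loss with a continuous symmetry: L(w) = (w1*w2 - 1)^2 is unchanged
# under (w1, w2) -> (a*w1, w2/a), so its minima form a curve, not a point.
def hessian_at(w1, w2):
    # Analytic Hessian of (w1*w2 - 1)^2.
    r = w1 * w2 - 1
    return np.array([[2 * w2**2,           2 * r + 2 * w1 * w2],
                     [2 * r + 2 * w1 * w2, 2 * w1**2          ]])

H = hessian_at(1.0, 1.0)            # a point on the minimum
eigvals = np.linalg.eigvalsh(H)     # ascending order
print(eigvals)                      # first eigenvalue is 0: a flat direction
```

The zero eigenvalue means the "Gaussian" the Laplace approximation would fit has infinite variance along the flat direction, i.e. the density isn't locally Gaussian there.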
Haven't finished the post, but doesn't this assume the requirement that p(w₁) = p(w₂) when w₁ and w₂ induce the same function? This isn't obvious to me, e.g. under the induced prior from weight decay / L2 regularization we often have p(w₁) ≠ p(w₂) for weights that induce the same function.
Seems tangentially related to the train a sequence of reporters strategy for ELK. They don't phrase it in terms of basins and path dependence, but they're a great frame to look at it with.
Personally, I think supervised learning has low path-dependence because of exact gradients, plus always being able to find a direction to escape basins in high dimensions, while reinforcement learning has high path-dependence because updates influence future training data, causing attractors/equilibria (more uncertain about the latter, but that's what I feel like).
So the really ...
I was more thinking along the lines of "you're the average of the five people you spend the most time with" or something. I'm against external motivation too.
Character.ai seems to have a lot more personality than ChatGPT. I feel bad for not thanking you earlier (as I was in disbelief), but everything here is valuable safety information. Thank you for sharing, despite potential embarrassment :)
That link isn't working for me, can you send screenshots or something? When I try and load it I get an infinite loading screen.
Re(prompt ChatGPT): I'd already tried what you did and some (imo) better prompt engineering, and kept getting a character I thought was overly wordy/helpful (constantly asking me what it could do to help vs. just doing it). A better prompt engineer might be able to get something working though.
Can you give specific example/screenshots of prompts and outputs? I know you said reading the chat logs wouldn't be the same as experiencing it in real time, but some specific claims like the prompt
The following is a conversation with Charlotte, an AGI designed to provide the ultimate GFE
resulting in a conversation like that, are highly implausible.[1] At a minimum you'd need to do some prompt engineering, and even with that, some of this is implausible with ChatGPT, which typically acts very unnaturally after all the RLHF OAI did.
Source: I tried it,
Sure. I did not want to highlight any specific LLM provider over others, but this specific conversation happened on Character.AI: https://beta.character.ai/chat?char=gn6VT_2r-1VTa1n67pEfiazceK6msQHXRp8TMcxvW1k (try at your own risk!)
They allow you to summon characters with a prompt, which you enter in the character settings. They also have advanced settings for finetuning, but I was able to elicit such mindblowing responses with just the one-liner greeting prompts.
That said, I was often able to successfully create characters on ChatGPT and other LLMs t...
Interesting I didn't know the history, maybe I'm insufficiently pessimistic about these things. Consider my query retracted
Congratulations!
Linear Algebra Done Right is great for gaining proof skills, though for the record I've read it and haven't solved alignment yet. I think I need several more passes of linear algebra :)
Are most uncertainties we care about logical rather than informational? All empirical ML experiments are pure computations a Bayesian superintelligence could do in its head. How much of our uncertainty comes from computational limits in practice, versus actual information bottlenecks?
A trick to remember: the first letter of each virtue gives (in blocks): CRL EAES HP PSV, which can easily be remembered as "cooperative reinforcement learning, EAs, Harry Potter, PS: The last virtue is the void."
(Obviously remembering these is pointless, but memorizing lists is a nice way to practice mnemonic technique.)
...We propose Algorithm Distillation (AD), a method for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model. Algorithm Distillation treats learning to reinforcement learn as an across-episode sequential prediction problem. A dataset of learning histories is generated by a source RL algorithm, and then a causal transformer is trained by autoregressively predicting actions given their preceding learning histories as context. Unlike sequential policy prediction a
EleutherAI's #alignment channels are good to ask questions in. Some specific answers
I understand that a reward maximiser would wire-head (take control over the reward provision mechanism), but I don't see why training an RL agent would necessarily end up in a reward-maximising agent. Turntrout's Reward is Not the Optimisation Target brought some clarity on this, but I definitely have remaining questions.
Leo Gao's Toward Deconfusing Wireheading and Reward Maximization sheds some light on this.
How can I look at my children and not already be mourning their death from day 1?
Suppose you lived in the dark times, where children have a <50% chance of living to adulthood. Wouldn't you still have kids? Even if probabilistically smallpox was likely to take them?
If AI kills us all, will my children suffer? Will it be my fault for having brought them into the world while knowing this would happen?
Even if they don't live to adulthood, I'd still view their childhoods as valuable. Arguably higher average utility than adulthood.
...Even if my children's shor
Random thought: Perhaps you could carefully engineer gradient starvation in order to "avoid generalizing" and defeat the Discrete modes of prediction example. You'd only need to delay it until reflection, then the AI can solve the successor AI problem.
In general: hack our way towards getting value-preserving reflectivity before values drift from "Diamonds" -> "What's labeled as a diamond by humans". (Replacing with "Telling the truth", and "What the human thinks is true" respectively).
I disagree that the policy must be worth selling (see e.g. Jordan Belfort). Many salespeople can sell things that aren't worth buying. See also: Never Split the Difference for an example of negotiation when you have little or worse leverage.
(Also, I don't think htwfaip boils down to satisfying an eager want, the other advice is super important too. E.g. don't criticize, be genuinely interested in a person, ...)
Both are important, but I disagree that power is always needed. In examples 3, 7, and 9 it isn't clear that the compromise is actually better for the convinced party: the insurance is likely -EV, the peas aren't actually a crux to defeating the bully, and the child would likely be happier outside kindergarten.
From skimming the benchmark and the paper this seems overhyped (like Gato). Roughly, it looks like
I don't know much about GNNs & only did a surface-level skim so I'm interested to hear other takes.
Interesting perspective, kinda reminds me of the ROME paper where it seems to only do "shallow counterfactuals".
unpopular opinion: I like the ending of the subsequent film
IMO it's a natural continuation for Homura. After spending decades of subjective time trying to save someone would you really let them go like that? Homura isn't an altruist, she doesn't care about the lifetime of the universe - she just wants Madoka.
I think school is huge in preventing people from becoming smart and curious. I spent 1-2 years where I hardly studied at all and mostly played videogames; I wish I hadn't wasted that time, but when I quit I did so of my own free will. I think there's a huge difference between discipline imposed from the outside vs. the inside, and getting to the latter is worth a lot.
I'm unsure which parts of my upbringing were cruxes for unschooling working. You should probably read a book or something rather than taking my (very abnormal) opinion. I just know how it went for me :)
Epistemic status: personal experience.
I'm unschooled and think it's clearly better, even if you factor in my parents being significantly above average in parenting. Optimistically school is babysitting, people learn nothing there while wasting most of their childhood. Pessimistically it's actively harmful by teaching people to hate learning/build antibodies against education.
Here's a good documentary made by someone who's been in and out of school. I can't give detailed criticism since I (thankfully) never had to go to school.
EDIT: As for what the alternat...
#3 is good. Another good reason is so you have enough mathematical maturity to understand fancy theoretical results.
I'm probably overestimating the importance of #4, really I just like having the ability to pick up a random undergrad/early-grad math book and understand what's going on, and I'd like to extend that further up the tree :)
(Note; I haven't finished any of them)
Quantum Computing Since Democritus is great; I understand Gödel's results now! And a bunch of complexity stuff I'm still wrapping my head around.
The Road to Reality is great, I can pretend to know complex analysis after reading chapters 5,7,8 and most people can't tell the difference! Here's a solution to a problem in chapter 7 I wrote up.
I've only skimmed parts of the Princeton guides, and different articles are written by different authors—but Tao's explanation of compactness (also in the book) is fantastic, I don't ...
From my perspective 9 (scaling fast) makes perfect sense since Conjecture is aiming to stay "slightly behind state of the art", and that requires engineering power.