This is a special post for short-form writing by jacquesthibs. Only they can create top-level comments. Comments here also appear on the Shortform Page and All Posts page.
I recently sent in some grant proposals to continue working on my independent alignment research. They give an overview of what I'd like to work on for the next year (and more, really). If you want to have a look at the full doc, send me a DM. If you'd like to help out through funding or contributing to the projects, please let me know.
Here's the summary introduction:
12-month salary for building a language model system for accelerating alignment research and upskilling (additional funding will be used to create an organization), and studying how to supervise AIs that are improving AIs to ensure stable alignment.
Summary
Agenda 1: Build an Alignment Research Assistant using a suite of LLMs managing various parts of the research process. Aims to 10-100x productivity in AI alignment research. Could use additional funding to hire an engineer and builder, which could evolve into an AI Safety organization focused on this agenda. Recent talk giving a partial overview of the agenda.
Agenda 2: Supervising AIs Improving AIs (through self-training or training other AIs). Publish a paper and create an automated pipeline for discovering noteworthy changes in
Can you give concrete use-cases that you imagine your project would lead to
helping alignment researchers? Alignment researchers have wildly varying styles
of work outputs and processes. I assume you aim to accelerate a specific subset
of alignment researchers (those focusing on interpretability and existing models,
who have an incremental / empirical strategy for solving the alignment problem).
Current Thoughts on my Learning System
Crossposted from my website. Hoping to provide updates on my learning system every month or so.
TLDR of what I've been thinking about lately:
* There are some great insights in this video called "How Top 0.1% Students Think," and in this video about how to learn hard concepts.
* Learning is a set of skills. You need to practice each component of the learning process to get better. You can’t watch a video on a new technique and immediately become a pro. It takes time to reap the benefits.
* Most people suck at mindmaps. Mindmaps can be horrible for learning if you just dump a bunch of text on a page and point arrows to different stuff (some studies show mindmaps are ineffective, but that's because people initially suck at making them). However, if you take the time to learn how to do them well, they will pay huge dividends in the future. I’ll be doing the “Do 100 Things” challenge and developing my skill in building better mindmaps. Getting better at mindmaps involves “chunking” the material and creating memorable connections and drawings.
* Relational vs. Isolated Learning: as you learn something new, try to learn it in relation to the things you already know.
Note on using ChatGPT for learning
* Important part: Use GPT to facilitate the process of pushing you to
higher-order learning as fast as possible.
* Here’s Bloom’s Taxonomy for higher-order learning (the pyramid diagram isn't reproduced here): remember, understand, apply, analyze, evaluate, create.
* For example, you want to ask GPT to come up with analogies and such to help
you enter higher-order thinking by thinking about whether the analogy makes
sense.
* Is the analogy truly accurate?
* Does it cover the main concept you are trying to understand?
* Then, you can extend the analogy to try to make it better and more
comprehensive.
* This allows you to offload the less useful task (e.g. coming up with the
  analogy) and spend more time in the highest orders of learning (the
  evaluation phase; “is this analogy good? where does it break down?”).
* You still need to use your cognitive load to encode the knowledge
effectively. Look for desirable difficulty.
* Use GPT to create a pre-study of the thing you would like to learn.
* Have it create an outline of the order of the things you should learn.
* Have it give you a list of all the jargon words in a field and how they
relate so that you can quickly get up to speed on the terminology and talk
to an expert.
* Coming up with chunks of the topic you are exploring.
* You can give GPT text that describes what you are trying to understand, the
relationships between things and how you are chunking them.
* Then, you can ask GPT to tell you what are some weak areas or some things
that are potentially missing.
* GPT works really well as a knowledge “gap-checker”.
When you are trying to have GPT output some novel insights or complicated
nuanced knowledge, it can give vague answers that aren’t too helpful. This is
why it is often better to treat GPT as a gap-checker and/or a friend that is
prompting you to come up with great insights.
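To make this concrete, here's a minimal sketch of the pre-study and gap-checker workflow using the OpenAI Python client. The model name, prompts, and example notes are placeholders I made up, not a recommendation:

```python
# Minimal sketch of the "pre-study" and "gap-checker" workflow described above.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY in the environment;
# the model name, prompts, and example notes are placeholders.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

topic = "mechanistic interpretability of transformers"

# Pre-study: an outline and a jargon list to get oriented before deep reading.
outline = ask(f"Give me a learning outline for {topic}, ordered from prerequisites to advanced topics.")
jargon = ask(f"List the key jargon terms in {topic} and briefly explain how they relate to each other.")

# Gap-check: paste in your own notes/chunks and ask what is weak or missing.
my_notes = "Attention heads move information between token positions; MLP layers seem to store facts..."
gaps = ask(
    f"Here are my notes on {topic}:\n{my_notes}\n"
    "What important concepts am I missing or misunderstanding? Be specific."
)

print(outline, jargon, gaps, sep="\n\n")
```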
Reference: I’ve been using ChatGPT/GPT-4 a lot to gain insights on how to
accelerate alignment research. Some
jacquesthibs (9mo):
How learning efficiently applies to alignment research
As we are trying to optimize for actually solving the problem, we should not
fall into the trap of learning just to learn. We should instead focus on
learning efficiently with respect to how it helps us generate insights that lead
to a solution for alignment. This is also the framing we should have in mind
when we are building tools for augmenting alignment researchers.
With the above in mind, I expect that part of the value of learning efficiently
involves some of the following:
* Efficient learning involves being hyper-focused on identifying the core
concepts and how they all relate to one another. This mode of approaching
things seems like it helps us attack the core of alignment much more directly
and bypasses months/years of working on things that are only tangential.
* Developing a foundation of a field seems key to generating useful insights.
The goal is not to learn everything but to build a foundation that allows you
to avoid spending way too much time tackling sub-optimal sub-problems or
dead-ends. Part of the foundation-building process should reduce the time it
takes to shape you into an exceptional alignment researcher rather than a
knower-of-things.
* As John Wentworth says with respect to the Game Tree of Alignment: "The main
reason for this exercise is that (according to me) most newcomers to
alignment waste years on tackling not-very-high-value sub-problems or
dead-end strategies."
* Lastly, many great innovations have not come from unique original ideas.
  There's an iterative process passed amongst researchers, and it often seems
  that the greatest ideas come from simply merging ideas that were already
  lying around. Learning efficiently (and storing those learnings for later
  use) allows you to increase the number of ideas you can merge together.
If you want to do that efficiently, you need to improve your ability to
ident
Peter Hroššo (9mo):
My model of (my) learning is that if the goal is sufficiently far, learning
directly towards the goal is goodharting a likely wrong metric.
The only method which worked for me for very distant goals is following my
curiosity and continuously internalizing new info, such that the curiosity is
well informed about current state and the goal.
jacquesthibs (9mo):
Curiosity is certainly a powerful tool for learning! I think any learning system
which isn't taking advantage of it is sub-optimal. Learning should be guided by
curiosity.
The thing is, sometimes we need to learn things we aren't so curious about. One
insight I learned from studying learning is that you can do specific things to
make yourself more curious about a given thing and harness the power that comes
with curiosity.
Ultimately, what this looks like is to write down questions about the topic and
use them to guide your curious learning process. It seems that this is how
efficient top students end up learning things deeply in a shorter amount of
time. Even for material they care little about, they are able to make themselves
curious and be propelled forward by that.
That said, my guess is that goodharting the wrong metric can definitely be an
issue, but I'm not convinced that relying on what makes you naturally curious is
the optimal strategy for solving alignment. Either way, it's something to think
about!
jacquesthibs (9mo):
By the way, I've just added a link to a video by a top competitive programmer on
how to learn hard concepts. Both the video and the iCanStudy course talk
about the concept of caring about what you are learning (basically, curiosity).
Gaining the skill to care and become curious is an essential part of the most
effective learning. However, contrary to popular belief, you don't have to be
completely guided by what makes you naturally curious! You can learn how to
become curious (or care) about any random concept.
jacquesthibs (8mo):
Video on how to approach having to read a massive amount of information (like a
textbook) as efficiently as possible:
jacquesthibs (9mo):
Added my first post (of, potentially, a sequence) on effective learning here. I
think there are a lot of great lessons at the frontier of the literature and
real-world practice on learning that go far beyond the Anki approach that a lot
of people seem to take these days. The important part is being effective and
efficient. Some techniques might work, but that does not mean they are the most
efficient (learning the same thing more deeply in less time).
Note that I also added two important videos to the root shortform:
jacquesthibs (9mo):
Note on spaced repetition
While spaced repetition is good, many people end up misusing it as a crutch
instead of defaulting to trying to deeply understand a concept right away. As
you get better at properly encoding the concept, you extend the forgetting curve
to the point where repetition is less needed.
Here's a video of a top-level programmer on how he approaches learning hard
concepts efficiently.
And here's a video on how the top 0.1% of students study efficiently.
jacquesthibs (9mo):
Here are some additional notes on the fundamentals of being an effective learner:
Encoding and Retrieval (what it takes to learn)
* Working memory is the memory that we use. However, if it is not encoded
properly or at all, we will forget it.
* Encode well first (from working memory to long-term memory), then frequently
and efficiently retrieve from long-term memory.
* If studying feels easy, it means that you aren't learning or holding on to the
  information; you are not encoding and retrieving effectively.
* You want it to be difficult when you are studying because this is how it will
encode properly.
Spacing, Interleaving, and Retrieval (SIR)
* These are three rules that apply to every study technique in the course
(unless told otherwise). You can apply SIR to all techniques.
* Spacing: space your learning out.
* Pre-study before class, then learn in class, and then a week later revise
it with a different technique.
* A rule of thumb you can follow is to wait long enough until you feel like
you are just starting to forget the material.
* As you get better at encoding the material effectively as soon as you are
exposed to it, you will notice that you will need to do less repetition.
* How to space reviews:
* Beginner Schedule (fewer reviews are needed as you get better at encoding); see the sketch after this list
* Same day
* Next day
* End of week
* End of month
* After learning something for the first time, review it later on the same
day.
* Review everything from the last 2-3 days mid-week.
* Do an end of week revision on the week's worth of content.
* End of month revision on entire month's worth of content.
* Review what's necessary as time goes on.
* (If you're trying to do well on an exam or a coding interview, you can
do the review 1 or 2 weeks before the assessment.)
* Reviewing time duration:
* For beginners
* No less than 30 minutes p
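As a rough sketch of the beginner schedule above, here's a small helper that generates the review dates (same day, next day, end of week, end of month). The exact date offsets are my own reading of the schedule, not something from the course:

```python
# Minimal sketch: generate review dates following the beginner schedule above
# (same day, next day, end of week, end of month). The exact offsets are my own
# interpretation of that schedule, not an official spec.
from datetime import date, timedelta

def review_dates(first_study: date) -> dict[str, date]:
    # "End of week" taken as the upcoming Sunday.
    end_of_week = first_study + timedelta(days=6 - first_study.weekday())
    # "End of month" taken as the last day of the study month.
    next_month = first_study.replace(day=28) + timedelta(days=4)
    end_of_month = next_month - timedelta(days=next_month.day)
    return {
        "same day": first_study,
        "next day": first_study + timedelta(days=1),
        "end of week": end_of_week,
        "end of month": end_of_month,
    }

print(review_dates(date(2023, 3, 7)))
```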
jacquesthibs (10mo):
A few more notes:
* I use the app Concepts on my iPad to draw mindmaps. Drawing mindmaps with
pictures and such is way more powerful (better encoding into long-term
memory) than typical mindmap apps where you just type words verbatim and draw
arrows. It's excellent since it has a (quasi-) infinite canvas. This is the
same app that Justin Sung uses.
* When I want to go in-depth into a paper, I will load it into OneNote on my
iPad and draw in the margin to better encode my understanding of the paper.
* I've been using the Voice Dream Reader app on my iPhone and iPad to get
through posts and papers much faster (I usually have time to read most of an
Alignment Forum post on my way to work and another on the way back).
Importantly, I stop the text-to-speech when I'm trying to understand an
important part. I use Pocket to load LW/AF posts into it and download PDFs on
my device and into the app for reading papers. There's a nice feature in the
app that automatically skips citations in the text, so reading papers isn't
as annoying. The voices are robotic, but I just cycled through a bunch until
I found one I didn't mind (I didn't buy any, but there are premium voices). I
expect Speechify has better voices, but it's more expensive, and I think
people find that the app isn't as good overall compared to Voice Dream
Reader. Thanks to Quintin Pope for recommending the app to me.
I’m still thinking this through, but I am deeply concerned about Eliezer’s new article for a combination of reasons:
* I don’t think it will work.
* Given that it won’t work, I expect we lose credibility, and it now becomes much harder to work with people who were sympathetic to alignment but still wanted to use AI to improve the world.
* I am not as convinced as he is about doom, and I am not as cynical about the main orgs as he is.
In the end, I expect this will just alienate people. And stuff like this concerns me.
I think it’s possible that the most memetically power... (read more)
So I think what I'm getting here is that you have an object-level disagreement (not as convinced about doom), but you are also reinforcing that object-level disagreement with signalling/reputational considerations (this will just alienate people). This pattern feels ugh and worries me. It seems highly important to separate the question of what's true from the reputational question. It furthermore seems highly important to separate arguments about what makes sense to say publicly on-your-world-model vs on-Eliezer's-model. In particular, it is unclear to me whether your position is "it is dangerously wrong to speak the truth about AI risk" vs "Eliezer's position is dangerously wrong" (or perhaps both).
I guess that your disagreement with Eliezer is large but not that large (IE you would name it as a disagreement between reasonable people, not insanity). It is of course possible to consistently maintain that (1) Eliezer's view is reasonable, (2) on Eliezer's view, it is strategically acceptable to speak out, and (3) it is not in fact strategically acceptable for people with Eliezer's views to speak out about those views. But this combination of views does imply endorsing a silencing of reasonable disagreements which seems unfortunate and anti-epistemic.
My own guess is that the maintenance of such anti-epistemic silences is itself an important factor contributing to doom. But, this could be incorrect.
Yeah, so just to clarify a few things:
* This was posted on the day of the open letter and I was indeed confused about
what to think of the situation.
* I think something I failed to properly communicate is that I was worried that
  this was a bad time to pull the lever, even if I’m concerned about risks from
  AGI. I was worried the public wouldn’t take alignment seriously because it
  would cause a panic much sooner than people were ready for.
* I care about being truthful, but I care even more about not dying so my
comment was mostly trying to communicate that I didn’t think this was the
best strategic decision for not dying.
* I was seeing a lot of people write negative statements about the open letter
  on Twitter, and it kind of fed my fears that this was going to backfire as a
  strategy and impact all of our work to get AI risk taken seriously.
* In the end, the final thing that matters is that we win (i.e. not dying from
AGI).
I’m not fully sure what I think now (mostly because I don’t know about higher
order effects that will happen 2-3 years from now), but I think it turned out a
lot strategically better than I initially expected.
jacquesthibs (6mo):
To try and burst any bubble about people’s reaction to the article, here’s a set
of tweets critical about the article:
* https://twitter.com/mattparlmer/status/1641230149663203330?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/jachiam0/status/1641271197316055041?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/finbarrtimbers/status/1641266526014803968?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/plinz/status/1641256720864530432?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/perrymetzger/status/1641280544007675904?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/post_alchemist/status/1641274166966996992?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/keerthanpg/status/1641268756071718913?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/levi7hart/status/1641261194903445504?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/luke_metro/status/1641232090036600832?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/gfodor/status/1641236230611562496?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/luke_metro/status/1641263301169680386?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/perrymetzger/status/1641259371568005120?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/elaifresh/status/1641252322230808577?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/markovmagnifico/status/1641249417088098304?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/interpretantion/status/1641274843692691463?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/lan_dao_/status/1641248437139300352?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/lan_dao_/status/1641249458053861377?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/growing_daniel/status/1641246902363766784?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/alexandrosm/status/1641259179955601408?s=61&t=ryK3X96D_TkGJtvu2rm0uw
Viliam (6mo):
What is the base rate for Twitter reactions for an international law proposal?
jacquesthibs (6mo):
Of course it’s often all over the place. I only shared the links because I
wanted to make sure people weren’t deluding themselves with only positive
comments.
Viliam (6mo):
This reminds me of the internet-libertarian chain of reasoning that anything
that government does is protected by the threat of escalating violence,
therefore any proposals that involve government (even mild ones, such as "once
in a year, the President should say 'hello' to the citizens") are calls for
murder, because... (create a chain of escalating events starting with someone
non-violently trying to disrupt this, ending with that person being killed by
cops)...
Yes, a moratorium on AIs is a call for violence, but only in the sense that
every law is a call for violence.
Given funding is a problem in AI x-risk at the moment, I’d love to see people start thinking of creative ways to provide additional funding to alignment researchers who are struggling to get funding.
For example, I’m curious if governance orgs would pay for technical alignment expertise as a sort of consultant service.
Also, it might be valuable to have full-time field-builders that are solely focused on getting more high-net-worth individuals to donate to AI x-risk.
On joking about how "we're all going to die"
Setting aside the question of whether people are overly confident about their claims regarding AI risk, I'd like to talk about how we talk about it amongst ourselves.
We should avoid jokingly saying "we're all going to die" because I think it will corrode your calibration to risk with respect to P(doom) and it will give others the impression that we are all more confident about P(doom) than we really are.
I think saying it jokingly still ends up creeping into your rational estimates on timelines and P(doom). I expe... (read more)
What are some important tasks you've found too cognitively taxing to get in the flow of doing?
One thing that I'd like to consider for Accelerating Alignment is to build tools that make it easier to get in the habit of cognitively demanding tasks by reducing the cognitive load necessary to do the task. This is part of the reason why I think people are getting such big productivity gains from tools like Copilot.
One way I try to think about it is like getting into the habit of playing guitar. I typically tell people to buy an electric guitar rather than an ac... (read more)
For developing my hail mary alignment approach, the dream would be to be able to
load enough of the context of the idea into a LLM that it could babble
suggestions (since the whole doc won't fit in the context window, maybe
randomizing which parts beyond the intro are included for diversity?), then have
it self-critique those suggestions automatically in different threads in bulk
and surface the most promising implementations of the idea to me for review. In
the perfect case I'd be able to converse with the model about the ideas and have
that be not totally useless, and pump good chains of thought back into the
fine-tuning set.
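A minimal sketch of the babble-and-critique loop described above, using the OpenAI Python client. The prompts, chunk sizes, and model name are placeholder assumptions, and the final "surface the most promising" ranking step is left out:

```python
# Rough sketch of the workflow described above: load the intro plus a random later chunk
# of a long doc, babble a suggestion, have the model critique it, and collect the pairs.
# Prompts, chunk sizes, and the model name are illustrative assumptions.
import random
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder

def complete(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def babble_and_critique(doc: str, n_candidates: int = 5, chunk_chars: int = 2000) -> list[tuple[str, str]]:
    intro, rest = doc[:chunk_chars], doc[chunk_chars:]
    results = []
    for _ in range(n_candidates):
        # Randomize which later chunk accompanies the intro, for diversity.
        start = random.randrange(max(1, len(rest) - chunk_chars + 1))
        context = intro + "\n...\n" + rest[start:start + chunk_chars]
        suggestion = complete(
            f"Context from a research document:\n{context}\n\n"
            "Suggest one concrete way to develop or implement this idea."
        )
        critique = complete(f"Critique this suggestion harshly but fairly:\n{suggestion}")
        results.append((suggestion, critique))
    return results
```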
Wrote up a short (incomplete) bullet point list of the projects I'd like to work on in 2023:
* Accelerating Alignment
  * Main time spent (initial ideas, will likely pivot to varying degrees depending on feedback; will start with one):
    * Fine-tune GPT-3/GPT-4 on alignment text and connect the API to Loom, VSCode (Copilot for alignment research), and potentially notetaking apps like Roam Research. (1-3 months, depending on bugs and if we continue to add additional features.) See the sketch after this list.
    * Create an audio-to-post pipeline where we can eas...
Two other projects I would find interesting to work on:
* Causal Scrubbing to remove specific capabilities from a model. For example,
  training a language model on The Pile and a code dataset, then applying
  causal scrubbing to try to remove the model's ability to generate code while
  still achieving a similar loss on The Pile.
* A few people have started extending the work from the Discovering Latent
Knowledge in Language Models without Supervision paper. I think this work
could potentially evolve into a median-case solution to avoiding x-risk from
AI.
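Returning to the first bullet (fine-tuning on alignment text), here's a rough sketch of what the data-upload and fine-tuning steps might look like with the OpenAI Python client. The file name, data format, and base model are placeholder assumptions; which base models are actually fine-tunable depends on the provider:

```python
# Rough sketch of the fine-tuning step: upload a JSONL of alignment text formatted as chat
# examples, then launch a fine-tuning job. The file name and base model are placeholders;
# whichever base models are actually fine-tunable depends on the provider at the time.
from openai import OpenAI

client = OpenAI()

# alignment_corpus.jsonl: one JSON object per line, e.g.
# {"messages": [{"role": "user", "content": "Summarize this alignment post: ..."},
#               {"role": "assistant", "content": "..."}]}
training_file = client.files.create(
    file=open("alignment_corpus.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # placeholder base model
)
print(job.id, job.status)

# The resulting fine-tuned model id can then be called from editor plugins
# (e.g. a VSCode extension or Loom) through the same chat completions API.
```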
chanamessinger (5mo):
Curious if you have any updates!
jacquesthibs (5mo):
Working on a new grant proposal right now. Should be sent this weekend. If you’d
like to give feedback or have a look, please send me a DM! Otherwise, I can send
the grant proposal to whoever wants to have a look once it is done (still
debating about posting it on LW).
Outside of that, there has been a lot of progress on the Cyborgism discord
(there is a VSCode plugin called Worldspider that connects to the various APIs,
and there has been more progress on Loom). Most of my focus has gone towards
looking at the big picture and keeping an eye on all the developments. Now, I
have a better vision of what is needed to create an actually great alignment
assistant and have talked to other alignment researchers about it to get
feedback and brainstorm. However, I’m spread way too thin and will request
additional funding to get some engineer/builder to start building the ideas out
so that I can focus on the bigger picture and my alignment work.
If I can get my funding again (previous funding ended last week), then my main
focus will be building out the system I have in mind for accelerating alignment
work, plus continuing to work on the new agenda I put out with Quintin and others.
There’s some other stuff I‘d like to do, but those are lower priority or will
depend on timing. It’s been hard to get the funding application done because
things are moving so fast and I’m trying not to build things that will be built
by default. And I’ve been talking to some people about the possibility of
building an org so that this work could go a lot faster.
plex (9mo):
Very excited by this agenda; I was discussing just today my hope that someone
finetunes LLMs on the alignment archive soon!
Mati_Roy (6mo):
do you have a link?
I'd be interested in being added to the Discord
Jacques' AI Tidbits from the Web
I often find information about AI development on X (f.k.a. Twitter) and sometimes other websites. They usually don't warrant their own post, so I'll use this thread to share. I'll be placing a fairly low filter on what I share.
There's someone on X (f.k.a. Twitter) called Jimmy Apples (🍎/acc) who has shared some information in the past that turned out to be true (apparently the GPT-4 release date and that OAI's new model would be named "Gobi"). He recently tweeted, "AGI has been achieved internally." Some people think that the Reddit comment below may be from the same guy (this is just a weak signal; I’m not implying you should consider it true or update on it):
Where is the evidence that he called OpenAI’s release date and the Gobi name?
All I see is a tweet claiming the latter but it seems the original tweet isn’t
even up?
jacquesthibs (1d):
This is the tweet for Gobi:
https://x.com/apples_jimmy/status/1703871137137176820?s=46&t=YyfxSdhuFYbTafD4D1cE9A
I asked someone if it’s fake. Apparently not, you can find it on google archive:
https://threadreaderapp.com/thread/1651837957618409472.html
Person (10h):
The GPT-4 launch date prediction can easily be explained by the confidence
game: it's possible he just created a prediction for every day and deleted the
ones that didn't turn out right.
For the Gobi prediction it's tricky. The only evidence is the Threadreader and a
random screenshot from a guy who seems clearly related to Jimmy. I am very
suspicious of the Threadreader one. On one hand I don't see a way it can be
faked, but it's very suspicious that the Gobi prediction is Jimmy's only post
that was saved there despite him making an even bigger bombshell "prediction".
It's also possible, though unlikely, that the Information's article somehow
found his tweet and used it as a source for their article.
What kills Jimmy's credibility for me is his prediction back in January (you can
use the Wayback Machine to find it) that OAI had finished training GPT-5, no not
a GPT-5 level system, the ACTUAL GPT-5 in October 2022 and that it was 125T
parameters.
Also goes without saying, pruning his entire account is suspicious too.
jacquesthibs (1d):
I’ll try to find them, but this was what people were saying. They also said he
deleted past tweets so that evidence may forever be gone.
I remember one tweet where Jimmy said something like, “Gobi? That’s old news, I
said that months ago, you need to move on to the new thing.” And I think he
linked the tweet though I’m very unsure atm. Need to look it up, but you can use
the above for a search.
jacquesthibs (1d):
Sam Altman at a YC founder reunion:
https://x.com/smahsramo/status/1706006820467396699?s=46&t=YyfxSdhuFYbTafD4D1cE9A
“Most interesting part of @sama talk: GPT5 and GPT6 are “in the bag” but that’s
likely NOT AGI (eg something that can solve quantum gravity) without some
breakthroughs in reasoning. Strong agree.”
Mitchell_Porter (1d):
AGI is "something that can solve quantum gravity"?
That's not just a criterion for general intelligence, that's a criterion for
genius-level intelligence. And since general intelligence in a computer has
advantages of speed, copyability, and little need for downtime that are not
possessed by human general intelligence, AI will be capable of contributing to its
own training, re-design, agentization, etc., long before "genius level" is reached.
This underlines something I've been saying for a while, which is that
superintelligence, defined as AI that definitively surpasses human understanding
and human control, could come into being at any time (from large models that are
not publicly available but which are being developed privately by Big AI
companies). Meanwhile, Eric Schmidt (former Google CEO) says about five years
until AI is actively improving itself, and that seems generous.
So I'll say: timeline to superintelligence is 0-5 years.
Vladimir_Nesov (17h):
In some models of the world this is seen as unlikely to ever happen; these
things are expected to coincide, which collapses the two definitions of AGI. I
think the disparity between sample efficiency of in-context learning and that of
pre-training is one illustration for how these capabilities might come apart, in
the direction that's opposite to what you point to: even genius in-context
learning doesn't necessarily enable the staying power of agency, if this
transient understanding can't be stockpiled and the achieved level of genius is
insufficient to resolve the issue while remaining within its limitations (being
unable to learn a lot of novel things in the course of a project).
jacquesthibs (1d):
Someone in the open source community tweeted: "We're about to change the AI
game. I'm dead serious."
My guess is that he is implying that they will be releasing open source mixture
of experts models in a few months from now. They are currently running them on
CPUs.
jacquesthibs (1d):
Lots of cryptic tweets from the open-source LLM guys:
https://x.com/abacaj/status/1705781881004847267?s=46&t=YyfxSdhuFYbTafD4D1cE9A
“If you thought current open source LLMs are impressive… just remember they
haven’t peaked yet”
To be honest, my feeling is that they are overhyping how big of a deal this will
be. Their ego and self-importance tend to be on full display.
Person (18h):
Occasionally reading what OSS AI gurus say, they definitely overhype their stuff
constantly. The ones who make big claims and try to hype people up are often
venture entrepreneur guys rather than actual ML engineers.
jacquesthibs (18h):
The open source folks I mostly keep an eye on are the ones who do actually code
and train their own models. Some are entrepreneurs, but they know a decent
amount. Not top engineers, but they seem to be able to curate datasets and train
custom models.
There’s some wannabe script kiddies too, but once you lurk enough, you become
aware of who the actually decent engineers are (you'll find some at the Vector
Institute, and Jeremy Howard is pro-open-source, for example). I wouldn't totally
discount them having an impact, even though some of them will overhype.
I think it would be great if alignment researchers read more papers
But really, you don't even need to read the entire paper. Here's a reminder to consciously force yourself to at least read the abstract. Sometimes I catch myself running away from reading an abstract of a paper even though it is very little text. Over time I've just been forcing myself to at least read the abstract. A lot of times you can get most of the update you need just by reading the abstract. Try your best to make it automatic to do the same.
To read more papers, consider using Semant... (read more)
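The reference above is cut off, but assuming it points at something like the Semantic Scholar Graph API, here's a rough sketch of pulling titles and abstracts programmatically so that skimming abstracts stays cheap. The endpoint and field names are my assumption about that public API:

```python
# Rough sketch: fetch titles and abstracts for a query so that skimming abstracts is cheap.
# Assumes the public Semantic Scholar Graph API; the endpoint and field names are to the
# best of my knowledge and may change.
import requests

def fetch_abstracts(query: str, limit: int = 5) -> list[dict]:
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": query, "limit": limit, "fields": "title,abstract,url"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

for paper in fetch_abstracts("eliciting latent knowledge in language models"):
    print(paper["title"])
    print((paper.get("abstract") or "")[:500])
    print()
```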
On hyper-obsession with one goal in mind
I’ve always been interested in people just becoming hyper-obsessed in pursuing a goal. One easy example is with respect to athletes. Someone like Kobe Bryant was just obsessed with becoming the best he could be. I’m interested in learning what we can from the experiences of the hyper-obsessed and what we can apply to our work in EA / Alignment.
I bought a few books on the topic, I should try to find the time to read them. I’ll try to store some lessons in this shortform, but here’s a quote from Mr. Beast’s Joe Rogan in... (read more)
I think people might have the implicit idea that LLM companies will continue to give API access as the models become more powerful, but I was talking to someone earlier this week who made me remember that this is not necessarily the case. If you gain powerful enough models, you may just keep them to yourself and instead spin up AI companies with AI employees to make a ton of cash instead of just charging for tokens.
For this reason, even if outside people build the proper brain-like AGI setup with additional components to squeeze out capabilities from LLMs, they may be limited by:
1. open-source models
2. the API of the weaker models from the top companies
3. the best API of the companies that are lagging behind
A frame for thinking about takeoff
One error people can make when thinking about takeoff speeds is assuming that because we are in a world with some gradual takeoff, it now means we are in a "slow takeoff" world. I think this can lead us to make some mistakes in our strategy. I usually prefer thinking in the following frame: “is there any point in the future where we’ll have a step function that prevents us from doing slow takeoff-like interventions for preventing x-risk?”
In other words, we should be careful about assuming that some "slow takeoff" doesn't have a... (read more)
I’m collaborating on a new research agenda. Here’s a potential insight about future capability improvements:
There has been some insider discussion (and Sam Altman has said) that scaling has started running into some difficulties. Specifically, GPT-4 has gained a wider breadth of knowledge but has not significantly improved in any one domain. This might mean that future AI systems may gain their capabilities from places other than scaling because of the diminishing returns from scaling. This could mean that to become “superintelligent”, the AI needs to run ... (read more)
Notes on Cicero
Link to YouTube explanation:
Link to paper (sharing on GDrive since it's behind a paywall on Science): https://drive.google.com/file/d/1PIwThxbTppVkxY0zQ_ua9pr6vcWTQ56-/view?usp=share_link
Top Diplomacy players seem to focus on gigabrain strategies rather than deception
Diplomacy players will no longer want to collaborate with you if you backstab them once. This is so pervasive that they'll still feel you are untrustworthy across tournaments. Therefore, it's mostly optimal to be honest and just focus on gigabrain strategies. That said, a smart... (read more)
AI labs should be dedicating a lot more effort to using AI for cybersecurity as a way to prevent weights or insights from being stolen. It would be good for safety, and it seems like it could be a pretty big cash cow too.
If they have access to the best models (or specialized ones), it may be highly beneficial for them to plug them in immediately to help with cybersecurity (perhaps even including noticing suspicious activity from employees).
I don’t know much about cybersecurity so I’d be curious to hear from someone who does.
Small shortform to say that I’m a little sad I haven’t posted as much as I would like to in recent months because of infohazard reasons. I’m still working on Accelerating Alignment with LLMs and eventually would like to hire some software engineer builders that are sufficiently alignment-pilled.
Fyi, if there are any software projects I might be able to help out on after
May, let me know. I can't commit to anything worth being hired for but I should
have some time outside of work over the summer to allocate towards personal
projects.
Call To Action: Someone should do a reading podcast of the AGISF material to make it even more accessible (similar to the LessWrong Curated Podcast and Cold Takes Podcast). A discussion series added to YouTube would probably be helpful as well.
“We assume the case that AI (intelligences in general) will eventually converge on one utility function. All sufficiently intelligent intelligences born in the same reality will converge towards the same behaviour set. For this reason, if it turns out that a sufficiently advanced AI would kill us all, there’s nothing that we can do about it. We will eventually hit that level of intelligence.
Now, if that level of intelligence doesn’t converge towards something that kills us all, we are safer in a world where AI capabilities (of the current regime) essent... (read more)
What are people’s current thoughts on London as a hub?
Anything else I’m missing?
I’m particularly curious about whether it’s worth it for independent researchers to go there. Would they actually interact with other r... (read more)
AFAIK, there's a distinct cluster of two kinds of independent alignment
researchers:
* those who want to be at Berkeley / London and are either there or unable to
get there for logistical or financial (or social) reasons
* those who very much prefer working alone
It very much depends on the person's preferences, I think. I personally
experienced an OOM increase in my effectiveness by being in-person with other
alignment researchers, so that is what I choose to invest in more.
I'm still in some sort of transitory phase where I'm deciding where I'd like to live long term. I moved to Montreal, Canada lately because I figured I'd try working as an independent researcher here and see if I can get MILA/Bengio to do some things for reducing x-risk.
Not long after I moved here, Hinton started talking about AI risk too, and he's in Toronto which is not too far from Montreal. I'm trying to figure out the best way I could leverage Canada's heavyweights and government to make progress on reducing AI risk, but it seems like there's a lot mor... (read more)
I gave a talk about my Accelerating Alignment with LLMs agenda about 1 month ago (which is basically a decade in AI tools time). Part of the agenda is covered (publicly) here.
I will maybe write an actual post about the agenda soon, but would love to have some people who are willing to look over it. If you are interested, send me a message.
Someone should create an “AI risk arguments” flowchart that serves as a base for simulating a conversation with skeptics or the general public. Maybe a set of flashcards to go along with it.
I want to have the sequence of arguments solid enough in my head so that I can reply concisely (snappy) if I ever end up in a debate, roundtable or on the news. I’ve started collecting some stuff since I figured I should take initiative on it.
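As a toy illustration of what a flowchart-plus-flashcards structure could look like, here's a minimal sketch where the flowchart is a graph of skeptical claims mapped to replies. The example claims and replies are placeholders, not vetted arguments:

```python
# Toy sketch: represent the argument flowchart as a graph of skeptical claims -> replies,
# then flatten it into flashcards. The example claims and replies are placeholders.
flowchart = {
    "AI will never be smarter than humans": [
        "Narrow systems already beat humans in specific domains; the open question is generality and speed.",
    ],
    "If it misbehaves, we can just turn it off": [
        "A sufficiently capable system has an incentive to avoid being shut down if that conflicts with its goals.",
    ],
}

def to_flashcards(graph: dict[str, list[str]]) -> list[tuple[str, str]]:
    # One card per (skeptical claim, concise reply) pair.
    return [(claim, reply) for claim, replies in graph.items() for reply in replies]

for front, back in to_flashcards(flowchart):
    print(f"Q: {front}\nA: {back}\n")
```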
Maybe something like this can be extracted from stampy.ai (I am not that
familiar with stampy fyi, its aims seem to be broader than what you want.)
jacquesthibs (5mo):
Yeah, it may be something that the Stampy folks could work on!
jacquesthibs (5mo):
Edit: oops, I thought you were responding to my other recent comment on building
an alignment research system.
Stampy.ai and AlignmentSearch
(https://www.lesswrong.com/posts/bGn9ZjeuJCg7HkKBj/introducing-alignmentsearch-an-ai-alignment-informed)
are both a lot more introductory than what I am aiming for. I’m aiming for
something to greatly accelerate my research workflow as well as other alignment
researchers. It will be designed to be useful for fresh researchers, but yeah
the aim is more about producing research rather than learning about AI risk.
Text-to-Speech tool I use for reading more LW posts and papers
I use Voice Dream Reader. It's great even though the TTS voice is still robotic. For papers, there's a feature that lets you skip citations so the reading is more fluid.
I've mentioned it before, but I was just reminded that I should share it here because I just realized that if you load the LW post with "Save to Voice Dream", it will also save the comments so I can get TTS of the comments as well. Usually these tools only include the post, but that's annoying because there's a lot of good stuff... (read more)
I honestly feel like some software devs should probably still keep their high-paying jobs instead of going into alignment and just donate a bit of time and programming expertise to help independent researchers if they want to start contributing to AI Safety.
I think we can probably come up with engineering projects that are interesting and low-barrier-to-entry for software engineers.
I also think providing “programming coaching” to some independent researchers could be quite useful. Whether that’s for getting them better at coding up projects efficiently or ... (read more)
On generating ideas for Accelerating Alignment
There's this Twitter thread that I saved a while ago that is no longer up. It's about generating ideas for startups. However, I think the insight from the thread carries over well enough to thinking about ideas for Accelerating Alignment. Particularly, being aware of what is on the cusp of being usable so that you can take advantage of it as soon as it becomes available (even do the work beforehand).
For example, we are surprisingly close to human-level text-to-speech (have a look at Apple's new model for audiobook... (read more)
Should EA / Alignment offices make it ridiculously easy to work remotely with people?
One of the main benefits of being in person is that you end up in spontaneous conversations with people in the office. This leads to important insights. However, given that there's a level of friction for setting up remote collaboration, only the people in those offices seem to benefit.
If it were ridiculously easy to join conversations for lunch or whatever (touch of a button rather than pulling up a laptop and opening a Zoom session), then would it allow for a stronger cr... (read more)
Detail about the ROME paper I've been thinking about
In the ROME paper, when you prompt the language model with "The Eiffel Tower is located in Paris", you have the following:
Subject token(s): The Eiffel Tower
Relationship: is located in
Object: Paris
Once a model has seen the subject token(s) (e.g. Eiffel Tower), it will retrieve a whole bunch of factual knowledge (not just one thing, since it doesn’t know you will ask for something like location after the subject token) from the MLPs and 'write' it into the residual stream for the attention modules at the final... (read more)
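As a small illustrative sketch of that framing (not the ROME method itself), here's how one might peek at the hidden states ("residual stream") a small GPT-2 model builds at the subject tokens versus the final token, using HuggingFace transformers. The model and layer choice are arbitrary:

```python
# Illustrative sketch (not ROME itself): inspect the hidden states ("residual stream") a
# small GPT-2 model builds at the last subject token vs. the final token of the prompt.
# The model choice and layer index are arbitrary.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (num_layers + 1) tensors with shape [batch, seq, hidden].
mid_layer = out.hidden_states[6]

# Position of the last subject token: token 0 is "The", the subject tokens follow.
n_subject = len(tokenizer(" Eiffel Tower")["input_ids"])
subject_vec = mid_layer[0, n_subject]   # residual stream at the last subject token
final_vec = mid_layer[0, -1]            # residual stream at the final token ("in")
print(subject_vec.shape, final_vec.shape)

# Next-token prediction at the final position (check whether " Paris" ranks highly).
next_logits = out.logits[0, -1]
top_ids = torch.topk(next_logits, 5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```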
A couple of notes regarding the Reversal Curse paper.
I'm not sure I emphasized it enough in the post, but part of the point of my
post on ROME was that many AI researchers seemed to assume that nothing about
how transformers are trained prevents them from understanding that "A is B"
and "B is A" are equivalent.
As I discussed in the comment above,
This means that the A token will 'write' some information into the residual
stream, while the B token will 'write' other information into the residual. Some
of that information may be the same, but not all. And so, if it's different
enough, the attention heads just won't be able to pick up on the relevant
information to know that B is A. However, if you include the A token, the
necessary information will be added to the residual stream, and it will be much
more likely for the model to predict that B is A (as well as A is B).
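Here's a toy probe of that directional asymmetry with a small open model. This is my own sketch, not the Reversal Curse paper's methodology; the model and example prompts are placeholders:

```python
# Toy probe of the directional asymmetry described above: compare how strongly a small
# model predicts B after an A-style prompt vs. A after a B-style prompt. This is not the
# Reversal Curse paper's methodology; the model and prompts are placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def next_token_logprob(prompt: str, continuation: str) -> float:
    """Log-probability of the first token of `continuation` right after `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    cont_id = tokenizer(continuation)["input_ids"][0]
    with torch.no_grad():
        logits = model(prompt_ids).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[cont_id].item()

# Forward direction: subject A in the prompt, object B as the target.
forward = next_token_logprob("The Eiffel Tower is located in the city of", " Paris")
# Reverse direction: object B in the prompt, subject A as the target.
reverse = next_token_logprob("The city of Paris contains a famous landmark called the", " Eiffel")
print(f"log P(B | A-prompt) = {forward:.2f}, log P(A | B-prompt) = {reverse:.2f}")
```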
From what I remember in the case of ROME, as soon as I added the edited token A
to the prompt (or make the next predicted token be A), then the model could
essentially predict B is A.
I write what it means in the context of ROME, below (found here in the post):
Regarding human intuition, @Neel Nanda says (link):
I actually have a bit of an updated (evolving) opinion on this:
@cfoster0 asks:
My response:
As I said, this is a bit of an evolving opinion. Still need time to think about
this, especially regarding the differences between decoder-only transformers and
humans.
Finally, from @Nora Belrose, this is worth pondering:
Preventing capability gains (e.g. situational awareness) that lead to deception
Note: I'm at the crackpot idea stage of thinking about how model editing could be useful for alignment.
One worry with deception is that the AI will likely develop a sufficiently good world model to understand it is in a training loop before it has fully aligned inner values.
The thing is, if the model was aligned, then at some point we'd consider it useful for the model to have a good enough world model to recognize that it is a model. Well, what if you prevent the model from bei... (read more)
Differential Training Process
I've been ruminating on an idea ever since I read the section on deception in "The Core of the Alignment Problem is..." from my colleagues in SERI MATS.
Here's the important part:
... (read more)