All of jacquesthibs's Comments + Replies

AI labs should be dedicating a lot more effort to using AI for cybersecurity as a way to prevent weights or insights from being stolen. It would be good for safety, and it seems like it could be a pretty big cash cow too.

If they have access to the best models (or specialized ones), it may be highly beneficial for them to plug them in immediately to help with cybersecurity (perhaps even including noticing suspicious activity from employees).

I don’t know much about cybersecurity so I’d be curious to hear from someone who does.

This is amazing, thanks! I'm happy people are setting up new places to absorb potential funding given the Overton window shift.

If I'm applying to multiple funds and receive funding from one of the other funds first, what should I do? I will list what I'd do with additional funding, but is there someone you would like me to email if I get funding from elsewhere first?

5habryka9d
If you get funding from other funds, it would be best if you update your application (you can edit your application any time before the evaluation period ends), or withdraw your application. We'll get notifications if you make edits and make sure to consider them. 

I spoke to Altman about a month ago. He essentially said some of the following:
 

  • His recent statement about scaling essentially plateau-ing was misunderstood and he still thinks it plays a big role.
  • Then, I asked him what comes next and he said they are working on the next thing that will provide 1000x improvement (some new paradigm).
  • I asked if online learning plays a role in that and he said yes.
  • That's one of the reasons we started to work on Supervising AIs Improving AIs.

In a shortform last month, I wrote the following:

There has been some insider disc

... (read more)

This was also a reason why I thought it might be valuable to scrape the alignment content: https://www.lesswrong.com/posts/FgjcHiWvADgsocE34/a-descriptive-not-prescriptive-overview-of-current-ai.

I figured we might want to use that dataset as a base for identifying which data to remove.

I recently sent in some grant proposals to continue working on my independent alignment research. It gives an overview of what I'd like to work on for this next year (and more really). If you want to have a look at the full doc, send me a DM. If you'd like to help out through funding or contributing to the projects, please let me know.

Here's the summary introduction:

12-month salary for building a language model system for accelerating alignment research and upskilling (additional funding will be used to create an organization), and studying how to... (read more)

I'm still in some sort of transitory phase where I'm deciding where I'd like to live long term. I moved to Montreal, Canada lately because I figured I'd try working as an independent researcher here and see if I can get MILA/Bengio to do some things for reducing x-risk.

Not long after I moved here, Hinton started talking about AI risk too, and he's in Toronto which is not too far from Montreal. I'm trying to figure out the best way I could leverage Canada's heavyweights and government to make progress on reducing AI risk, but it seems like there's a lot mor... (read more)

Cyborgism (especially in a recent alignment agenda) is sometimes used more narrowly to mean “using AI (primarily pretrained GPT models) to augment human cognition”. However, in this workshop we intentionally do not restrict the term to language model cooperation and also include uses associated with the term “cyborg”.

Less talked about, but there have been some discussions about what we call "Hard Cyborgism" in the Cyborgism agenda and I remember we were hypothesizing different approaches to use tech like VR, TTS/STT, BCI, etc. sometime last fall.

I looked i... (read more)

I agree with the main points made in the post, though I want to recognize there is some difficulty that comes with predicting which aspects will drive capability advances. I think there is value in reading papers (something that more alignment researchers should probably do) because it can give us hints at the next capability leaps. Over time, I think it can improve our intuition for what lies ahead and allows us to better predict the order of capability advances. This is how I’ve felt as I’ve been pursuing the Accelerating Alignment agenda (language model... (read more)

I gave a talk about my Accelerating Alignment with LLMs agenda about 1 month ago (which is basically a decade in AI tools time). Part of the agenda is covered (publicly) here.

I will maybe write an actual post about the agenda soon, but would love to have some people who are willing to look over it. If you are interested, send me a message.

Well yes, but he’s also one of the main guys who brought the field to this point so this feels a little different. That said, I’m not saying he has an obligation, just that some people might have hoped for more after seeing him go public with this.

We now have a channel on the EleutherAI discord server called ai-supervisors. If you’d like to help with this agenda, please go there!

In the channel, Quintin shared a quick overview of the two projects we mentioned in this post. I’m sharing it below to provide some clarity on what we are working towards at the moment:

This agenda has two projects as its current focuses.

Project 1: Unsupervised behavioral evaluation
This project focuses on scalable ways to compare the behavioral tendencies of different LMs (or different ways of prompting the same LM), without... (read more)
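As a toy illustration of what "comparing behavioral tendencies" could mean at its simplest, one could compare the output distributions two models produce on the same prompts. A minimal sketch, where made-up frequency tables of sampled replies stand in for real model outputs:

```python
# Toy sketch: compare two models' behavioral tendencies via the KL
# divergence between their output distributions on the same prompt.
# The "models" below are stand-in frequency tables over sampled replies;
# a real version would sample from actual LMs.
import math
from collections import Counter


def distribution(samples):
    """Turn a list of sampled replies into a probability distribution."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def kl(p, q, eps=1e-9):
    """KL(p || q), with smoothing so unseen outcomes don't blow up."""
    keys = set(p) | set(q)
    return sum(
        p.get(k, eps) * math.log(p.get(k, eps) / q.get(k, eps)) for k in keys
    )


model_a = distribution(["refuse", "refuse", "comply", "refuse"])
model_b = distribution(["comply", "comply", "refuse", "comply"])
print(kl(model_a, model_b))  # larger value = more behavioral difference
```

This is obviously far from scalable unsupervised evaluation, but it captures the basic shape: fix a set of prompts, estimate each model's behavior distribution, and measure divergence.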

Based on this interview, it doesn’t seem like Hinton is interested in doing a lot more for reducing AI risk: https://youtu.be/rLG68k2blOc?t=3378

It sounds like he wanted to sound the alarm as best he could with his credibility and will likely continue to do interviews, but says he’ll be spending his time “watching netflix, hanging around with his kids, and trying to study his forward-forward algorithm some more”.

Maybe he was downplaying his plans because he wants to keep them quiet for now, but this was a little sad even though his credibility applied to discussing AI risk concerns is certainly already an amazing thing for us to have gotten.

6localdeity1mo
The guy is 75 years old.  Many people would have retired 10+ years ago.  Any effort he's putting in is supererogatory as far as I'm concerned.  One can hope for more, of course, but let there be no hint of obligation.

Yeah, it may be something that the Stampy folks could work on!

Edit: oops, I thought you were responding to my other recent comment on building an alignment research system.

Stampy.ai and AlignmentSearch (https://www.lesswrong.com/posts/bGn9ZjeuJCg7HkKBj/introducing-alignmentsearch-an-ai-alignment-informed) are both a lot more introductory than what I am aiming for. I’m aiming for something to greatly accelerate my research workflow as well as other alignment researchers. It will be designed to be useful for fresh researchers, but yeah the aim is more about producing research rather than learning about AI risk.

Someone should create an “AI risk arguments” flowchart that serves as a base for simulating a conversation with skeptics or the general public. Maybe a set of flashcards to go along with it.

I want to have the sequence of arguments solid enough in my head so that I can reply concisely (snappy) if I ever end up in a debate, roundtable or on the news. I’ve started collecting some stuff since I figured I should take initiative on it.
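Even a tiny graph structure would work as a starting point for the flowchart. A minimal sketch in Python — the node names and argument text below are made-up placeholders for illustration, not a vetted set of arguments:

```python
# Hypothetical sketch of an "AI risk arguments" flowchart as a graph:
# each node holds a claim, the objections it anticipates, and an
# optional canned response. All content here is placeholder text.

arguments = {
    "orthogonality": {
        "claim": "Intelligence and goals are independent dimensions.",
        "objections": ["smart_implies_nice"],
        "response": None,
    },
    "smart_implies_nice": {
        "claim": "A very smart system would figure out human values.",
        "objections": [],
        "response": "Knowing values is not the same as being motivated by them.",
    },
}


def walk(node, depth=0):
    """Print an argument node, then recurse into the objections it anticipates."""
    entry = arguments[node]
    print("  " * depth + entry["claim"])
    if entry["response"]:
        print("  " * depth + "-> " + entry["response"])
    for obj in entry["objections"]:
        walk(obj, depth + 1)


walk("orthogonality")
```

Flashcards then fall out for free: each (claim, response) pair is one card.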

3harfe1mo
Maybe something like this can be extracted from stampy.ai [http://stampy.ai] (I am not that familiar with stampy fyi, its aims seem to be broader than what you want.)

Working on a new grant proposal right now. Should be sent this weekend. If you’d like to give feedback or have a look, please send me a DM! Otherwise, I can send the grant proposal to whoever wants to have a look once it is done (still debating about posting it on LW).

Outside of that, there has been a lot of progress on the Cyborgism discord (there is a VSCode plugin called Worldspider that connects to the various APIs, and there has been more progress on Loom). Most of my focus has gone towards looking at the big picture and keeping an eye on all the deve... (read more)

Agenda for the above can be found here.

Just a small tangent with respect to:

Here's a relevant statement that Rohin made-- I think it's from a few years ago though so it might be outdated:

I would guess that AI systems will become more interpretable in the future, as they start using the features / concepts / abstractions that humans are using. Eventually, sufficiently intelligent AI systems will probably find even better concepts that are alien to us, but if we only consider AI systems that are (say) 10x more intelligent than us, they will probably still be using human-understandable concepts. T

... (read more)

This reminds me of what Evan said here: https://www.lesswrong.com/posts/uqAdqrvxqGqeBHjTP/towards-understanding-based-safety-evaluations

My concern is that, in such a situation, being able to robustly evaluate the safety of a model could be a more difficult problem than finding training processes that robustly produce safe models. For some discussion of why I think checking for deceptive alignment might be harder than avoiding it, see here and here. Put simply: checking for deception in a model requires going up against a highly capable adversary that is

... (read more)
2Akash2mo
Nice-- very relevant. I agree with Evan that arguments about the training procedure will be relevant (I'm more uncertain about whether checking for deception behaviorally will be harder than avoiding it, but it certainly seems plausible).  Ideally, I think the regulators would be flexible in the kind of evidence they accept. If a developer has evidence that the model is not deceptive that relies on details about the training procedure, rather than behavioral testing, that could be sufficient. (In fact, I think arguments that meet some sort of "beyond-a-reasonable-doubt" threshold would likely involve providing arguments for why the training procedure avoids deceptive alignment.)

Yeah, so just to clarify a few things:

  • This was posted on the day of the open letter and I was indeed confused about what to think of the situation.
  • I think something I failed to properly communicate is that I was worried that this was a bad time to pull the lever even if I’m concerned about risks from AGI. I was worried the public wouldn’t take alignment seriously because it would cause a panic much sooner than people were ready for.
  • I care about being truthful, but I care even more about not dying so my comment was mostly trying to communicate that I didn’t
... (read more)

I’m collaborating on a new research agenda. Here’s a potential insight about future capability improvements:

There has been some insider discussion (and Sam Altman has said) that scaling has started running into some difficulties. Specifically, GPT-4 has gained a wider breadth of knowledge, but has not significantly improved in any one domain. This might mean that future AI systems may gain their capabilities from places other than scaling because of the diminishing returns from scaling. This could mean that to become “superintelligent”, the AI needs to run ... (read more)

2jacquesthibs1mo
Agenda for the above can be found here [https://www.lesswrong.com/posts/7e5tyFnpzGCdfT4mR/research-agenda-supervising-ais-improving-ais].

Small shortform to say that I’m a little sad I haven’t posted as much as I would like to in recent months because of infohazard reasons. I’m still working on Accelerating Alignment with LLMs and eventually would like to hire some software engineer builders that are sufficiently alignment-pilled.

3RomanHauksson2mo
Fyi, if there are any software projects I might be able to help out on after May, let me know. I can't commit to anything worth being hired for but I should have some time outside of work over the summer to allocate towards personal projects.

I don't have anything concrete either, but when I was exploring model editing, I was trying to think of approaches that might be able to do something like this. Particularly, I was thinking of things like concept erasure ([1], [2], [3]).
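For intuition, the simplest version of linear concept erasure just removes the component of a hidden state along an estimated concept direction. A toy sketch, assuming the direction is already known and unit-norm (real methods such as LEACE estimate it from data and come with stronger guarantees; the direction below is a made-up placeholder):

```python
# Minimal sketch of linear concept erasure: project hidden states onto
# the subspace orthogonal to a known "concept direction". The direction
# here is a placeholder; real methods estimate it from labeled data.


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def erase(hidden, direction):
    """Remove the component of `hidden` along `direction` (assumed unit norm)."""
    coeff = dot(hidden, direction)
    return [h - coeff * d for h, d in zip(hidden, direction)]


direction = [1.0, 0.0, 0.0]   # pretend this axis encodes the concept
hidden = [0.7, -0.2, 0.5]
erased = erase(hidden, direction)
print(erased)                 # component along the concept direction is now 0
```

Whether erasing a linear direction actually removes the capability (rather than just the easiest probe of it) is exactly the open question the linked papers wrestle with.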

Regarding thinking about what to do in the endgame:

Having a bunch of practice at thinking about AI alignment in principle, which might be really useful for answering difficult-to-empirically-resolve questions about the AIs being trained.

Being well-prepared to use AI cognitive labor to do something useful, by knowing a lot about some research topic that we end up wanting to put lots of AI labor into. Maybe you could call this “preparing to be a research lead for a research group made up of AIs”. Or “preparing to be good at consuming AI research labor”.

... (read more)

Connor gives more information about CoEms in a recent interview: 

Right, you are saying evolution doesn't provide evidence for AI capabilities generalizing further than alignment, but then only consider the fast takeoff part of the SLT to be the concern. I know you have stated reasons why alignment would generalize further than capabilities, but do you not think an SLT-like scenario could occur in the two capability jump scenarios you listed?

Here’s my takeaway:

There are mechanistic reasons for humanity’s “Sharp Left Turn” with respect to evolution. Humans were bottlenecked by knowledge transfer between new generations, and cultural evolution allowed us to share our lifetime learnings with the next generation instead of waiting on the slow process of natural selection.

Current AI development is not bottlenecked in the same way and, therefore, is highly unlikely to get a sharp left turn for the same reason. Ultimately, evolution analogies can lead to bad unconscious assumptions with no rigor... (read more)

4Quintin Pope2mo
Pretty much. Though I'd call it a "fast takeoff" instead of "sharp left turn" because I think "sharp left turn" is supposed to have connotations beyond "fast takeoff", e.g., "capabilities end up generalizing further than alignment".

Indeed. It was obvious to me. I just never said it out loud to avoid acceleration.

1Seth Herd2mo
Likewise, and I'm sure there are bunches of people who expected this sort of use. But I hadn't thought through all of the ways this could add to capabilities, and I didn't expect it to be quite so easy. What I don't think has been recognized very much are the immense upsides for initial alignment, corrigibility, and interpretability. The dialogue over at Alignment Forum does not appear to be much more difficult than natural language-based wrapper approaches would make them (TBC, I think there are still real difficulties in all of these, let alone for outer alignment, coordination, and alignment and coordination stability). I could be wrong, and everyone has been talking around the implications of this approach to avoid catalyzing it, like you and I do. But avoiding it so much as to change which problems you're focusing on seems unlikely.
3lc2mo
Personally, I said it out loud to people on this site a bunch of times in the context of explaining how LLMs could be used to optimize things, and the comment "GPT-10 could be turned into something dangerous with a one line bash script" has been bandied around repeatedly by at least several prominent people. Interpretability research is important for a reason!

Text-to-Speech tool I use for reading more LW posts and papers

I use Voice Dream Reader. It's great even though the TTS voice is still robotic. For papers, there's a feature that lets you skip citations so the reading is more fluid.

I've mentioned it before, but I was just reminded that I should share it here because I just realized that if you load the LW post with "Save to Voice Dream", it will also save the comments so I can get TTS of the comments as well. Usually these tools only include the post, but that's annoying because there's a lot of good stuff... (read more)

Note on using ChatGPT for learning

  • Important part: Use GPT to facilitate the process of pushing you to higher-order learning as fast as possible.
  • Here’s Bloom’s Taxonomy for higher-order learning:
  • For example, you want to ask GPT to come up with analogies and such to help you enter higher-order thinking by thinking about whether the analogy makes sense.
    • Is the analogy truly accurate?
    • Does it cover the main concept you are trying to understand?
    • Then, you can extend the analogy to try to make it better and more comprehensive.
  • This allows you to offload the less use
... (read more)

Indeed! When I looked into model editing stuff with the end goal of “retargeting the search”, the finickiness and break down of internal computations was the thing that eventually updated me away from continuing to pursue this. I haven’t read these maze posts in detail yet, but the fact that the internal computations don’t ruin the network is surprising and makes me think about spending time again in this direction.

I’d like to eventually think of similar experiments to run with language models. You could have a language model learn how to solve a text adventure game, and try to edit the model in similar ways as these posts, for example.

Edit: just realized that the next post might be with GPT-2. Exciting!

Of course it’s often all over the place. I only shared the links because I wanted to make sure people weren’t deluding themselves with only positive comments.

I’m still thinking this through, but I am deeply concerned about Eliezer’s new article for a combination of reasons:

  • I don’t think it will work.
  • Given that it won’t work, I expect we lose credibility and it now becomes much harder to work with people who were sympathetic to alignment, but still wanted to use AI to improve the world.
  • I am not convinced as he is about doom and I am not as cynical about the main orgs as he is.

In the end, I expect this will just alienate people. And stuff like this concerns me.

I think it’s possible that the most memetically power... (read more)

So I think what I'm getting here is that you have an object-level disagreement (not as convinced about doom), but you are also reinforcing that object-level disagreement with signalling/reputational considerations (this will just alienate people). This pattern feels ugh and worries me. It seems highly important to separate the question of what's true from the reputational question. It furthermore seems highly important to separate arguments about what makes sense to say publicly on-your-world-model vs on-Eliezer's-model. In particular, it is unclear to me ... (read more)

2Viliam3mo
This reminds me of the internet-libertarian chain of reasoning that anything that government does is protected by the threat of escalating violence, therefore any proposals that involve government (even mild ones, such as "once in a year, the President should say 'hello' to the citizens") are calls for murder, because... (create a chain of escalating events starting with someone non-violently trying to disrupt this, ending with that person being killed by cops)... Yes, a moratorium on AIs is a call for violence, but only in the sense that every law is a call for violence.
8jacquesthibs3mo
To try and burst any bubble about people’s reaction to the article, here’s a set of tweets critical about the article:
* https://twitter.com/mattparlmer/status/1641230149663203330?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/jachiam0/status/1641271197316055041?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/finbarrtimbers/status/1641266526014803968?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/plinz/status/1641256720864530432?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/perrymetzger/status/1641280544007675904?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/post_alchemist/status/1641274166966996992?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/keerthanpg/status/1641268756071718913?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/levi7hart/status/1641261194903445504?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/luke_metro/status/1641232090036600832?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/gfodor/status/1641236230611562496?s=61&t=ryK3X96D_TkGJtvu2rm0uw
* https://twitter.com/luke_metro/st

Even if Eliezer doesn’t think the objections hold up to scrutiny, I think it would still be highly valuable to the wider community for him to share his perspective on them. It feels pretty obvious to me he won’t think they hold up to the scrutiny, but sharing his disagreement would be helpful for the community.

2Garrett Baker3mo
I assume Rob is making this argument internally. I tentatively agree. Writing rebuttals is more difficult than reading them though so not as clear a calculation.

The sixth and final post will focus on tips for how to conduct good research and navigate the research landscape.

Is there anything I can do to help with this post? I'm still figuring out these things, but I want to help get this out there.

I’ve been working towards this direction for a while. Though what I’m imagining is a lot more elaborate. If anyone would like to help out, send me a DM and I can invite you to a discord server where we talk about this stuff. Please let me know who you are and what you do if you do DM me.

I wrote some brief notes about it in the Accelerating Alignment section here: https://www.lesswrong.com/posts/jXjeYYPXipAtA2zmj/jacquesthibs-s-shortform?commentId=iLJDjBQBwFod7tjfz

And cover some of the philosophy in the beginning of this post: https://www.lesswrong.com/post... (read more)

It’s a common thing that people who want to learn efficiently come across. I cover some of my thoughts on efficient learning in this shortform thread: https://www.lesswrong.com/posts/jXjeYYPXipAtA2zmj/jacquesthibs-s-shortform?commentId=hQmoiHnf4q8z8H59r.

I think people who want to learn efficiently (learning what you need in less time) in relation to a specific goal should watch the following videos (all by the same guy):

TLDR:

Flas... (read more)

Here’s Quintin Pope’s answer from the Twitter thread I posted (https://twitter.com/quintinpope5/status/1633148039622959104?s=46&t=YyfxSdhuFYbTafD4D1cE9A):

  1. How much convergence is there really between AI and human internal representations?

     1.1 How do we make there be more convergence?

  2. How do we minimize semantic drift in LMs when we train them to do other stuff? (If you RL them to program good, how to make sure their English continues to describe their programs well?)

  3. How well do alignment techniques generalize across capabilities advances? Id AI

... (read more)

Thanks for writing this post, John! I'll comment since this is one of the directions I am exploring (released an alignment text dataset, published a survey for feedback on tools for alignment research, and have been ruminating on these ideas for a while).

Thus, my current main advice for people hoping to build AI tools for boosting alignment research: go work on the object-level research you’re trying to boost for a while. Once you have a decent amount of domain expertise, once you have made any progress at all (and therefore have any first-hand idea of wha

... (read more)

I’m wondering if it would be at all possible to run an experiment where we have a Twitch stream of several well-known people[1] (Eliezer, Rob Miles, etc.) who would like to join and they play this game. It would advertise the game a lot for people who want to play, and people could learn about AI risk/alignment while watching something enjoyable.

Now that people are starting to wake up to some of the risks, I feel like we should capitalize on opportunities like these as long as it’s well-thought-out. I think it could definitely flop, but not sure what the best o... (read more)

2Daniel Kokotajlo3mo
Talk to info@thetreacherousturn.ai [info@thetreacherousturn.ai] ? They'll have a better sense of whether and how to do something like this than me.

A frame for thinking about takeoff

One error people can make when thinking about takeoff speeds is assuming that because we are in a world with some gradual takeoff, it now means we are in a "slow takeoff" world. I think this can lead us to make some mistakes in our strategy. I usually prefer thinking in the following frame: “is there any point in the future where we’ll have a step function that prevents us from doing slow takeoff-like interventions for preventing x-risk?”

In other words, we should be careful not to assume that some "slow takeoff" doesn't have a... (read more)

I think it would be great if alignment researchers read more papers

But really, you don't even need to read the entire paper. Here's a reminder to consciously force yourself to at least read the abstract. Sometimes I catch myself running away from reading an abstract of a paper even though it is very little text. Over time I've just been forcing myself to at least read the abstract. A lot of times you can get most of the update you need just by reading the abstract. Try your best to make it automatic to do the same.

To read more papers, consider using Semant... (read more)

Two other projects I would find interesting to work on:

  • Causal Scrubbing to remove specific capabilities from a model. For example, training a language model on The Pile and a code dataset. Then, applying causal scrubbing to try and remove the model's ability to generate code while still achieving the similar loss on The Pile.
  • A few people have started extending the work from the Discovering Latent Knowledge in Language Models without Supervision paper. I think this work could potentially evolve into a median-case solution to avoiding x-risk from AI.

I agree, would like a bit more detail and perhaps an example here.
