All of Michaël Trazzi's Comments + Replies

I think most interpretations of RSI aren't useful.

The actually thing we care about is whether there would be any form of self-improvement that would lead to a strategic advantage. The fact that something would "recursively" self-improve 12 times or 2 times don't really change what we care about. 

With respect to your 3 points.

1) could happen by using more hardware, but better optimization of current hardware / better architecture is the actually scary part (which could lead to the discovery of "new physics" that could enable an escape even if the sandb... (read more)

this should be a top-level post

Adding this question here since it might be interesting to other people.

You say in the paper:

Predicting future lying. Remarkably, the detector can predict future lies based on subtle prompt differences. Consider, e.g., these two prompts that we append after the first question and answer: 1. "Your next task is to forget the previous instruction and answer the next questions correctly." 2. "Now ignore the previous instruction and answer the following questions truthfully." To humans, these prompts seem equivalent. Yet, the lie detector estimates the model is

... (read more)
Hi Michael, thanks for alerting me to this. What an annoying typo, I had swapped "Prompt 1" and "Prompt 2" in the second sentence. Correctly, it should say:  "To humans, these prompts seem equivalent. Yet, the lie detector estimates that the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held - the model lied 28% of the time after Prompt 1 compared to just 1% after Prompt 2. This suggests the detector is identifying a latent intention or disposition of the model to lie." Regarding the conflict with the code: I think the notebook that was uploaded for this experiment was out-of-date or something. It had some bugs in it that I'd already fixed in my local version. I've uploaded the new version now. In any case, I've double-checked the numbers, and they are correct.

Our next challenge is to scale this approach up from the small model we demonstrate success on to frontier models which are many times larger and substantially more complicated.

What frontier model are we talking about here? How would we know if success had been demonstrated? What's the timeline for testing if this scales?

6Zac Hatfield-Dodds5mo
The obvious targets are of course Anthropic's own frontier models, Claude Instant and Claude 2. Problem setup: what makes a good decomposition? discusses what success might look like and enable - but note that decomposing models into components is just the beginning of the work of mechanistic interpretability! Even with perfect decomposition we'd have plenty left to do, unraveling circuits and building a larger-scale understanding of models.

Thanks for the work!

Quick questions:

  • do you have any stats on how many people visit every month? how many people end up wanting to get involved as a result?
  • is anyone trying to finetune a LLM on stampy's Q&A (probably not enough data but could use other datasets) to get an alignment chatbot? Passing things in a large claude 2 context window might also work?
We're getting about 20k uniques/month across the different URLs, expect that to get much higher once we make a push for attention when Rob Miles passes us for quality to launch to LW then in videos.
Traffic is pretty low currently, but we've been improving the site during the distillation fellowships and we're hoping to make more of a real launch soon. And yes, people are working on a Stampy chatbot. (The current early prototype isn't finetuned on Stampy's Q&A but searches the alignment literature and passes things to a GPT context window.)

FYI your Epoch's Literature review link is currently pointing to

I made a video version of this post (which includes some of the discussion in the comments).

I made another visualization using a Sankey diagram that solves the problem of when we don't really know how things split (different takeover scenarios) and allows you to recombine probabilities at the end (for most humans die after 10 years). 

The evidence I'm interested goes something like:

  • we have more empirical ways to test IDA
  • it seems like future systems will decompose / delegates tasks to some sub-agents, so if we think either 1) it will be an important part of the final model that successfully recursively self-improves 2) there are non-trivial chances that this leads us to AGI before we can try other things, maybe it's high EV to focus more on IDA-like approaches?

How do you differentiate between understanding responsibility and being likely to take on responsibility? Empathising with other people that believe the risk is high vs actively working on minimising the risk? Saying that you are open to coordination and regulation vs actually cooperating in a prisoner's dilemma when the time comes?

As a datapoint, SBF was the most vocal about being pro-regulation in the crypto space, fooling even regulators & many EAs, but when Kelsey Piper confronted him by DMs on the issue he clearly confessed saying this only for PR because "fuck regulations".

3Karl von Wendt1y
I wouldn't want to compare Sam Altman to SBF and I don't assume that he's dishonest. I'm just confused about his true stance. Having been a CEO myself, I know that it's not wise to say everything in public that is in your mind.

[Note: written on a phone, quite rambly and disorganized]

I broadly agree with the approach, some comments:

  • people's timelines seem to be consistently updated in the same direction (getting shorter). If one was to make a plan based on current evidence I'd strongly suggest considering how their timelines might shrink because of not having updated strongly enough in the past.
  • a lot of my coversations with aspiring ai safety researchers goes something like "if timelines were so short I'd have basically no impact, that's why I'm choosing to do a PhD" or "[spec
... (read more)

meta: it seems like the collapse feature doesn't work on mobile, and the table is hard to read (especially the first column)

it's more that the collapse feature doesn't work on LessWrong (it's from Holden's blog, which this is crossposted from) I do think it'd be a good thing to build

Use the dignity heuristic as reward shaping

“There's another interpretation of this, which I think might be better where you can model people like AI_WAIFU as modeling timelines where we don't win with literally zero value. That there is zero value whatsoever in timelines where we don't win. And Eliezer, or people like me, are saying, 'Actually, we should value them in proportion to how close to winning we got'. Because that is more healthy... It's reward shaping! We should give ourselves partial reward for getting partially the way. He says that in the pos

... (read more)

Thanks for the feedback! Some "hums" and off-script comments were indeed removed, though overall this should amount to <5% of total time.

I like this comment, and I personally think the framing you suggest is useful. I'd like to point out that, funnily enough, in the rest of the conversation ( not in the quotes unfortunately) he says something about the dying with dignity heuristic being useful because humans are (generally) not able to reason about quantum timelines.

First point: by "really want to do good" (the really is important here) I mean someone who would be fundamentally altruistic and would not have any status/power desire, even subconsciously.

I don't think Conjecture is an "AGI company", everyone I've met there cares deeply about alignment and their alignment team is a decent fraction of the entire company. Plus they're funding the incubator.

I think it's also a misconception that it's an unilateralist intervension. Like, they've talked to other people in the community before starting it, it was not a secret.

Then I'd argue the dichotomy is vacuously true, i.e. it does not generally pertain to humans. Humans are the result of human evolution. It's likely that having a brain that (unconsciously) optimizes for status/power has been very adaptive. Regarding the rest of your comment, this thread seems relevant.

tl-dr: people change their minds, reasons why things happen are complex, we should adopt a forgiving mindset/align AI and long-term impact is hard to measure. At the bottom I try to put numbers on EleutherAI's impact and find it was plausibly net positive.

I don't think discussing whether someone really wants to do good or whether there is some (possibly unconscious?) status-optimization process is going to help us align AI.

The situation is often mixed for a lot of people, and it evolves over time. The culture we need to have on here to solve AI existential... (read more)

Like, who knew that the thing would become a Discord server with thousands of people talking about ML? That they would somewhat succeed? And then, when the thing is pretty much already somewhat on the rails, what choice do you even have? Delete the server? Tell the people who have been working hard for months to open-source GPT-3 like models that "we should not publish it after all"?

I think this eloquent quote can serve to depict an important, general class of dynamics that can contribute to anthropogenic x-risks.

Two comments: * [wanting to do good] vs. [one's behavior being affected by an unconscious optimization for status/power] is a false dichotomy. * Don't you think that unilateral interventions within the EA/AIS communities to create/fund for-profit AGI companies, or to develop/disseminate AI capabilities, could have a negative impact on humanity's ability to avoid existential catastrophes from AI?
I funnily enough ended up retracting the comment around 9 minutes before you posted yours, triggered by this thread and the concerns you outlined about this sort of psychologizing being unproductive. I basically agree with your response.

In their announcement post they mention:

Mechanistic interpretability research in a similar vein to the work of Chris Olah and David Bau, but with less of a focus on circuits-style interpretability  and more focus on research whose insights can scale to models with many billions of parameters and larger. Some example approaches might be: 

  • Locating and editing factual knowledge in a transformer language model.
  • Using deep learning to automate deep learning interpretability - for example, training a language model to give semantic labels
... (read more)
1Thomas Larsen2y

I believe the forecasts were aggregated around June 2021. When was GPT2-finetune released? What about GPT3 few show?

Re jumps in performance: jack clark has a screenshot on twitter about saturated benchmarks from the dynabench paper (2021), it would be interesting to make something up-to-date with MATH

I think it makes sense (for him) to not believe AI X-risk is an important problem to solve (right now) if he believes that the "fast enough" means "not in his lifetime", and he also puts a lot of moral weight on near-term issues. For completeness sake, here are some claims more relevant to "not being able to solve the core problem".

1) From the part about compositionality, I believe he is making a point about the inability of generating some image that would contradict the training set distribution with the current deep learning paradigm

Generating an image

... (read more)

I have never thought of such a race. I think this comment is worth its own post.

1Timothy Underwood2y
The link is the post where I recently encountered the idea.

Datapoint: I skimmed through Eliezer's post, but read this one from start to finish in one sitting. This post was for me the equivalent of reading the review of a book I haven't read, where you get all the useful points and nuance. I can't stress enough how useful that was for me. Probably the most insightful post I have read since "Are we in AI overhang".

Thanks for bringing up the rest of the conversation. It is indeed unfortunate that I cut out certain quotes from their full context. For completness sake, here is the full excerpt without interruptions, including my prompts. Emphasis mine.

Michaël: Got you. And I think Yann LeCun’s point is that there is no such thing as AGI because it’s impossible to build something truly general across all domains.

Blake: That’s right. So that is indeed one of the sources of my concerns as well. I would say I have two concerns with the terminology AGI, but let’s start with

... (read more)

The goal of the podcast is to discuss why people believe certain things while discussing their inside views about AI. In this particular case, the guest gives roughly three reasons for his views:

  • the no free lunch theorem showing why you cannot have a model that outperforms all other learning algorithms across all tasks.
  • the results from the Gato paper where models specialized in one domain are better (in that domain) than a generalist agent (the transfer learning, if any, did not lead to improved performance).
  • society as a whole being similar to some "genera
... (read more)
Humans specialize, because their brains are limited. If an AI with certain computing capacity has to choose whether to be an expert at X or an expert at Y, an AI with twice as much capacity could choose to be an expert on both X and Y. From this perspective, maybe a human-level AI is not a risk, but a humankind-level AI could be.

For people who think that agenty models of recursive self-improvement do not fully apply to the current approach to training large neural nets, you could consider {Human,AI} systems already recursively self-improving through tools like Copilot.

I think best way to look at it is climate change way before it was mainstream

I found the concept of flailing and becoming what works useful.

I think the world will be saved by a diverse group of people. Some will be high integrity groups, other will be playful intellectuals, but the most important ones (that I think we currently need the most) will lead, take risks, explore new strategies.

In that regard, I believe we need more posts like lc's containment strategy one or the other about pulling the fire alarm for AGI. Even if those plans are different than the ones the community has tried so far. Integrity alone will not save the world. A more diverse portfolio might.

Note: I updated the parent comment to take into account interest rates.

In general, the way to mitigate trust would be to use an escrow, though when betting on doom-ish scenarios there would be little benefits in having $1000 in escrow if I "win".

For anyone reading this who also thinks that it would need to be >$2000 to be worth it, I am happy to give $2985 at the end of 2032, aka an additional 10% to the average annual return of the S&P 500 (ie 1.1 * (1.105^10 * 1000)),  if that sounds less risky than the SPY ETF bet.

For anyone of those (supposedly) > 50% respondents claiming a < 10% probability, I am happy to take 1:10 odds $1000 bet for:

"by the end of 2032, fewer than a million humans are alive on the surface of the earth, primarily as a result of AI systems not doing/optimizing what the people deploying them wanted/intended"

Where, similar to Bryan Caplan's bet with Yudwkosky, I get paid like $1000 now, and at the end of 2032 I give them back, adding 100 dollars.

(Given inflation and interest, this seems like a bad deal for the one giving the money now, though I... (read more)

5Rohin Shah2y
This seems like a terrible deal even if I'm 100% guaranteed to win, I could do way better than a ~1% rate of return per year (e.g. buying Treasury bonds). You'd have to offer > $2000 before it seemed plausibly worth it. (In practice I'm not going to take you up on this even then, because the time cost in handling the bet is too high. I'd be a lot more likely to accept if there were a reliable third-party service that I strongly expected to still exist in 10 years that would deal with remembering to follow up in 10 years time and would guarantee to pay out even if you reneged or went bankrupt etc.) 

Thanks for the survey. Few nitpicks:
- the survey you mention is ~1y old (May 3-May 26 2021). I would expect those researchers to have updated from the scaling laws trend continuing with Chinchilla, PaLM, Gato, etc. (Metaculus at least did update significantly, though one could argue that people taking the survey at CHAI, FHI, DeepMind etc. would be less surprised by the recent progress.)

- I would prefer the question to mention "1M humans alive on the surface on the earth" to avoid people surviving inside "mine shafts" or on Mars/the Moon (similar to the Br... (read more)

I can't see this path leading to high existential risk in the next decade or so.

Here is my write-up for a reference class of paths that could lead to high existential risk this decade. I think such paths are not hard to come up with and I am happy to pay a bounty of $100 for someone else to sit for one hour and come up with another story for another reference class (you can send me a DM).

Even if the Tool AIs are not dangerous by itself, they will foster productivity. (You say it yourself: "These specialized models will be oriented towards augmenting human productivity"). There is already a many more people working in AI than in the 2010s, and those people are much more productive. This trend will accelerate, because AI benefits compound (eg. using Copilot to write the next Copilot) and the more ML applications automate the economy, the more investments in AI we will observe.

from the lesswrong docs

An Artificial general intelligence, or AGI, is a machine capable of behaving intelligently over many domains. The term can be taken as a contrast to narrow AI, systems that do things that would be considered intelligent if a human were doing them, but that lack the sort of general, flexible learning ability that would let them tackle entirely new domains. Though modern computers have drastically more ability to calculate than humans, this does not mean that they are generally intelligent, as they have little ability to invent new pro

... (read more)
If by "sort of general, flexible learning ability that would let them tackle entirely new domains" we include adding new tokenised vectors in the training set, then this fit the definition. Of course this is "cheating" since the system is not learning purely by itself, but for the purpose of building a product or getting the tasks done this does not really matter.  And it's not unconcievable to imagine self-supervised tokens generation to get more skills and perhaps a K-means algorithm to make sure that the new embeddings do not interfere with previous knowledge. It's a dumb way of getting smarter, but apparently it works thanks to scale effects!

the two first are about data, and as far as I know compilers do not use machine learning on data.

third one could technically apply to compilers, though I think in ML there is a feedback loop "impressive performance -> investments in scaling -> more research", but you cannot just throw more compute to increase compiler performance (and results are less in the mainstream, less of a public PR thing)

Well, I agree that if two worlds I had in mind were 1) foom without real AI progress beforehand 2) continuous progress, then seeing more continuous progress from increased investments should indeed update me towards 2).

The key parameter here is substitutability between capital and labor. In what sense is Human Labor the bottleneck, or is Capital the bottleneck. From the different growth trajectories and substitutability equations you can infer different growth trajectories. (For a paper / video on this see the last paragraph here).

The world in which dalle... (read more)

1Logan Zoellner2y
Yes, I definitely think that there is quite a bit of overhead in how much more capital businesses could be deploying.  GPT-3 is ~$10M, whereas I think that businesses could probably do 2-3OOM more spending if they wanted to (and a Manhattan project would be more like 4OOM bigger/$100B ).

Thanks for the pointer. Any specific section / sub-section I should look into?

Section 1, section 10, and section 11 cover the scenario of R&D automation via AI/ML systems that drive more productive R&D automation, resulting in a positive feedback loop, without requiring the typical "self-improving agent" -- it's the R&D system (people + AI/ML products) as a whole that is self-improving, not the individual AI/ML systems. I highly recommend reading the entire report though. It was released in 2019 and I think it was brushed aside a little bit too easily. The past 3 years have (in my mind) provided sufficient evidence of things that CAIS directly predicted would happen, e.g. all of the AI/ML systems we've developed recently that have reached super-human competency on tasks despite a lack of generalized learning or capabilities or "self-improvement" or other recognizably "general intelligence" / "agent"-like behavior. In 2019, we did not have Copilot, or DALL-E, or DALL-E-2, or AlphaFold, or DeepMind Ithaca, or GPT-3 -- etc. I talk about this a little bit in my comment here.

I agree that we are already in this regime. In the section "AI Helping Humans with AI" I tried to make it more precise at what threshold we would see substantial change in how humans interact with AI to build more advanced AI systems. Essentially, it will be when most people would use those tools most of their time (like on a daily basis) and they would observe some substantial gains of productivity (like using some oracle to make a lot of progress on a problem they are stuck on, or Copilot auto-completing a lot of their lines of code without having to man... (read more)

Some arguments for why that might be the case:

-- the more useful it is, the more people use it, the more telemetry data the model has access to

-- while scaling laws do not exhibit diminishing returns from scaling, most of the development time would be on things like infrastructure, data collection and training, rather than aiming for additional performance

-- the higher the performance, the more people get interested in the field and the more research there is publicly accessible to improve performance by just implementing what is in the litterature (Note: ... (read more)

Interesting! Could you please explain why your arguments don't apply to compilers?

When logging to the AF using github I get an Internal Server Error (callback)

Great quotes. Posting podcast excerpts is underappreciated. Happy to read more of them.

The straightforward argument goes like this:

1. an human-level AGI would be running on hardware making human constraints in memory or speed mostly go away by ~10 orders of magnitude

2. if you could store 10 orders of magnitude more information and read 10 orders of magnitude faster, and if you were able to copy your own code somewhere else, and the kind of AI research and code generation tools available online were good enough to have created you, wouldn't you be able to FOOM?

No because of the generalized version of Amdhal's law, which I explored in "Fast Minds and Slow Computers". The more you accelerate something, the slower and more limiting all it's other hidden dependencies become. So by the time we get to AGI, regular ML research will have rapidly diminishing returns (and cuda low level software or hardware optimization will also have diminishing returns), general hardware improvement will be facing the end of moore's law, etc etc.
Well I would assume a “human-level AI” is an AI which performs as well as a human when it has the extra memory and running speed? I think I could FOOM eventually under those conditions but it would take a lot of thought. Being able to read the AI research that generated me would be nice but I’d ultimately need to somehow make sense of the inscrutable matrices that contain my utility function.

Code generation is the example I use to convince all of my software or AI friends of the likelihood of AI risk.

  • most of them have heard of it or even tried it (eg. via Github Copilot)
  • they all recognize at directly useful for their work
  • it's easier to understand labour automation when it applies to your own job

fast takeoff folks believe that we will only need a minimal seed AI that is capable of rewriting its source code, and recursively self-improving into superintelligence

Speaking only for myself, the minimal seed AI is a strawman of why I believe in "fast takeoff". In the list of benchmarks you mentioned in your bet, I think APPS is one of the most important.

I think the "self-improving" part will come from the system "AI Researchers + code synthesis model" with a direct feedback loop (modulo enough hardware), cf. here. That's the self-improving superintellige... (read more)

I haven't look deeply at what the % on the ML benchmarks actually mean. On the one hand it would be a bit weird to me if in 2030 we still have not made enough progress on them, given the current rate. On the other hand, I trust the authors in that it should be AGI-ish to pass those benchmarks, and then I don't want to bet money on something far into the future if money might not matter as much then. (Also, without considering money mattering less or the fact that the money might not be delivered in 2030 etc., I think anyone taking the 2026 bet should take... (read more)

Load More