All of Daniel Paleka's Comments + Replies

The one you linked doesn't really rhyme. The meter is quite consistently decasyllabic, though.

I find it interesting that the collection has a fairly large number of songs about World War II. Seems that the "oral songwriters composing war epics" meme lived until the very end of the tradition.

With Greedy Coordinate Gradient (GCG) optimization, when trying to force argmax-generated completions, using an improved objective function dramatically increased our optimizer’s performance.

Do you have some data / plots here?

Good question. We just ran a test to check. Below, we try forcing the 80 target strings x 4 different input seeds, using basic GCG and using GCG with the mellowmax objective. (Iterations are capped at 50; a run counts as unsuccessful if the target is not forced by then.) We observe that using the mellowmax objective nearly doubles the number of "working" forcing runs, from <1/8 success to >1/5 success. Now, skeptically, it is possible that our task setup favors using any unusual objective (noting that the organizers did some adversarial training against GCG with cross-entropy loss, so just doing "something different" might be good on its own). It might also put the task in the range of "just hard enough" that improvements appear quite helpful. But the improvement in forcing success seems pretty big to us. Subjectively, we also recall significant improvements on red-teaming, which used Llama-2 and was not adversarially trained in quite the same way.
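
For readers unfamiliar with the objective: mellowmax is a smooth alternative to max, mm_w(x) = (1/w) * log((1/n) * sum_i exp(w * x_i)). The sketch below is only an assumption about how it might be applied in this setting (to per-token negative log-likelihoods of the target string); the comment does not give the exact formulation, and the function names, the toy numbers, and omega = 1.0 are illustrative.

```python
import math

def mellowmax(values, omega=1.0):
    """Mellowmax: (1/omega) * log(mean(exp(omega * v))), computed stably."""
    m = max(omega * v for v in values)
    total = sum(math.exp(omega * v - m) for v in values)
    return (m + math.log(total / len(values))) / omega

def mean_loss(token_nlls):
    """Standard cross-entropy objective: average per-token negative log-likelihood."""
    return sum(token_nlls) / len(token_nlls)

# Hypothetical per-token NLLs for a target string: most tokens are already
# likely, but one token is still far from being forced.
token_nlls = [0.1, 0.2, 0.1, 4.0]

ce = mean_loss(token_nlls)             # dominated by the many easy tokens
mm = mellowmax(token_nlls, omega=1.0)  # pulled toward the worst token
```

For positive omega the mellowmax value lies between the mean and the max, so the hardest token gets more weight than under the plain average. That may be part of why it helps when every token of the target has to be forced, though the thread itself does not establish the mechanism.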

Oh so you have prompt_loss_weight=1, got it. I'll cross out my original comment. I am now not sure what the difference between training on {"prompt": A, "completion": B} vs {"prompt": "", "completion": AB} is, and why the post emphasizes that so much. 
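
A minimal sketch of the question, assuming the training loss is a sum of per-token losses in which prompt tokens are scaled by `prompt_loss_weight`. This is an assumption about how the endpoint behaves, not something the thread confirms; `sequence_loss` and the toy numbers are hypothetical.

```python
def sequence_loss(prompt_nlls, completion_nlls, prompt_loss_weight):
    """Sum of per-token losses; prompt tokens are scaled by prompt_loss_weight."""
    return prompt_loss_weight * sum(prompt_nlls) + sum(completion_nlls)

# Hypothetical per-token negative log-likelihoods for A and B.
nll_A = [1.5, 0.7]
nll_B = [2.0, 0.3, 0.9]

# {"prompt": A, "completion": B} with prompt_loss_weight = 1
split_weight_1 = sequence_loss(nll_A, nll_B, prompt_loss_weight=1.0)

# {"prompt": "", "completion": AB}
all_completion = sequence_loss([], nll_A + nll_B, prompt_loss_weight=1.0)

# {"prompt": A, "completion": B} with prompt_loss_weight = 0: only p(B | A) is trained
split_weight_0 = sequence_loss(nll_A, nll_B, prompt_loss_weight=0.0)
```

Under this assumption the first two formats give the same loss (ignoring any separator tokens), and only a weight below 1 distinguishes them.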

The key adjustment in this post is that they train on the entire sequence

Yeah, but my understanding of the post is that it wasn't enough; it only worked out when A was Tom Cruise, not Uriah Hawthorne. This is why I stay away from trying to predict what's happening based on this evidence.

Digressing slightly, somewhat selfishly: there is more and more research using OpenAI finetuning. It would be great to get some confirmation that the finetuning endpoint does what we think it does. Unlike with the model versions, there are no guarantees on the finetuning en... (read more)

So there's a post that claims p(A | B) is sometimes learned from p(B | A) if you make the following two adjustments to the finetuning experiments in the paper:
(1) [crossed out: you finetune not on p(B | A), but p(A) + p(B | A) instead] you finetune on p(AB) in the completion, instead of finetuning on p(A) in the prompt + p(B | A) in the completion as in Berglund et al.
(2) A is a well-known name ("Tom Cruise"), but B is still a made-up thing

The post is not written clearly, but this is what I take from it. Not sure how model internals explain this.
I can make some a... (read more)

We actually do train on both the prompt and completion. We say so in the paper's appendix, although maybe we should have emphasized this more clearly. Also, I don't think this new experiment provides much counter evidence to the reversal curse. Since the author only trains on one name ("Tom Cruise") it's possible that his training just increases p("Tom Cruise") rather than differentially increasing p("Tom Cruise" | <description>). In other words, the model might just be outputting "Tom Cruise" more in general without building an association from <description> to "Tom Cruise".
Some notes on this post:
  • I think the Tom Cruise example from the paper is bad due to his mother being referred to by different names. However, I think most of the other examples work.
  • The key adjustment in this post is that they train on the entire sequence "One fact about A is B" rather than splitting into prompt ("One fact about A is") and completion ("B") and only training on the completion. Future work on situational awareness or LM learning should probably be careful about exactly what text is and isn't trained on.
(I wish this was a top level comment.)

I made an illegal move while playing over the board (5+3 blitz) yesterday and lost the game. Maybe my model of chess (even when seeing the current board state) is indeed questionable, but well, it apparently happens to grandmasters in blitz too.

Do the modified activations "stay in the residual stream" for the next token forward pass? 
Is there any difference if they do or don't? 
If I understand the method correctly, in Steering GPT-2-XL by adding an activation vector they always added the steering vectors on the same (token, layer) coordinates, hence in their setting this distinction doesn't matter. However, if the added vector is on (last_token, layer), then there seems to be a difference.

1 Nina Rimsky 6mo
I add the steering vector at every token position after the prompt, so in this way, it differs from the original approach in "Steering GPT-2-XL by adding an activation vector". Because the steering vector is generated from a large dataset of positive and negative examples, it is less noisy and more closely encodes the variable of interest. Therefore, there is less reason to believe it would work specifically well at one token position and is better modeled as a way of more generally conditioning the probability distribution to favor one class of outputs over another.
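
The difference between the two placements can be sketched as follows. This is a toy illustration with plain Python lists standing in for residual-stream activations at one layer; `add_steering`, the shapes, and the chosen positions are hypothetical and not taken from either post's code.

```python
def add_steering(activations, steering_vector, positions):
    """Return a copy of activations with steering_vector added at the given token positions."""
    steered = [row[:] for row in activations]
    for pos in positions:
        steered[pos] = [a + s for a, s in zip(steered[pos], steering_vector)]
    return steered

# Toy residual-stream activations at one layer: 5 tokens, hidden size 3.
activations = [[0.0, 0.0, 0.0] for _ in range(5)]
steering_vector = [1.0, -1.0, 0.5]
prompt_len = 2

# Fixed (token, layer) coordinates: the same positions on every forward pass
# (positions 0 and 1 are an arbitrary choice for this example).
fixed = add_steering(activations, steering_vector, positions=[0, 1])

# Every token position after the prompt, as described in the reply above.
after_prompt = add_steering(activations, steering_vector,
                            positions=range(prompt_len, len(activations)))
```

In the second variant the set of steered positions grows as tokens are generated, which is where the question about modified activations persisting across forward passes becomes relevant.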

Thank you for the discussion in the DMs!

Wrt superhuman doubts: The models we tested are superhuman. gave a rough human ELO estimate of 3000 for a 2021 version of Leela with just 100 nodes, 3300 for 1000 nodes. There is a bot on Lichess that plays single-node (no search at all) and seems to be in top 0.1% of players.
I asked some Leela contributors; they say that it's likely new versions of Leela are superhuman at even 20 nodes; and that our tests of 100-1600 nodes ... (read more)

It would be helpful to write down where the Scientific Case and the Global Coordination Case objectives might be in conflict. The "Each subcomponent" section addresses some of the differences, but not the incentives. I do acknowledge that first steps look very similar right now, but the objectives might diverge at some point. It naively seems that demonstrating things that are scary might be easier and is not the same thing as creating examples which usefully inform alignment of superhuman models.

So I've read an overview [1] which says Chagnon observed a pre-Malthusian group of people, which was kept from exponentially increasing not by scarcity of resources, but by sheer competitive violence; a totalitarian society that lives in abundance. 

There seems to be an important scarcity factor shaping their society, but not of the kind where we could say that  "we only very recently left the era in which scarcity was the dominant feature of people’s lives."

Although, reading again, this doesn't disprove violence in general arising due t... (read more)

I don't think "coercion is an evolutionary adaptation to scarcity, and we've only recently managed to get rid of the scarcity" is clearly true. It intuitively makes sense, but Napoleon Chagnon's research seems to be one piece of evidence against the theory.

Presumably you're objecting to the first part of the quoted sentence, right, not the second half? Note that I'm not taking a particular position on the extent to which it's an evolutionary versus cultural adaptation. Could you say more about why Chagnon's research weighs against it? I had a quick read of his wikipedia page but am not clear on the connection.

Jason Wei responded at


My thoughts: It is true that some metrics increase smoothly and some don't. The issue is that some important capabilities are inherently all-or-nothing, and we haven't yet found surrogate metrics which increase smoothly and correlate with things we care about.

What we want is: for a given capability, predicting whether this capability happens in the model that is being trained
If extrapolating a smoothly increasing surrogate metric can do that, the... (read more)

Claim 2: Our program gets more people working in AI/ML who would not otherwise be doing so (...)

This might be unpopular here, but I think each and every measure you take to alleviate this concern is counterproductive. This claim should just be discarded as a thing of the past. May 2020 ended 6 months ago; everyone knows AI is the best thing to be working on if you want to maximize money or impact or status. For people not motivated by AI risks, you could replace "would" in that claim with "could", without changing the meaning of the sentence.

On the other h... (read more)

Does current AI hype cause many people to work on AGI capabilities? Different areas of AI research differ significantly in their contributions to AGI.
4 Ryan Kidd 10mo
I agree with you that AI is generally seen as "the big thing" now, and we are very unlikely to be counterfactual in encouraging AI hype. This was a large factor in our recent decision to advertise the Summer 2023 Cohort via a Twitter post and a shout-out on Rob Miles' YouTube and TikTok channels. However, because we provide a relatively simple opportunity to gain access to mentorship from scientists at scaling labs, we believe that our program might seem attractive to aspiring AI researchers who are not fundamentally directed toward reducing x-risk. We believe that accepting such individuals as scholars is bad because:
  • We might counterfactually accelerate their ability to contribute to AI capabilities;
  • They might displace an x-risk-motivated scholar.
Therefore, while we intend to expand our advertising approach to capture more out-of-network applicants, we do not currently plan to reduce the selection pressures for x-risk-motivated scholars. Another crux here is that I believe the field is in a nascent stage where new funders and the public might be swayed by fundamentally bad "AI safety" projects that make AI systems more commercialisable without reducing x-risk. Empowering founders of such projects is not a goal of MATS. After the field has grown a bit larger while maintaining its focus on reducing x-risk, there will hopefully be less "free energy" for naive AI safety projects, and we can afford to be less choosy with scholars.

I didn't mean to go there, as I believe there are many reasons to think both authors are well-intentioned and that they wanted to describe something genuinely useful. 

It's just that this contribution fails to live up to its title or to sentences like "In other words, no one has done for AI what Russell Impagliazzo did for complexity theory in 1995...". My original comment would be the same if it was an anonymous post.

I don't think this framework is good, and overall I expected much more given the title.  The name "five worlds" is associated with a seminal paper that materialized and gave names to important concepts in the latent space... and this is just a list of outcomes of AI development, with that categorization by itself providing very little insight for actual work on AI. 

Repeating my comment from Shtetl-Optimized, to which they didn't reply:

It appears that you’re taking collections of worlds and categorizing them based on the “outcome” projection, labe

... (read more)
I agree that it seems like a pretty low value addition to the discourse: it neither provides any additional insight, nor do their categories structure the problem in a particularly helpful way. That may be exaggerated, but it feels like a plug to insert yourself into a conversation where you have nothing to contribute otherwise.

what is the "language models are benign because of the language modeling objective" take?

basically the Simulators kind of take afaict

My condolences to the family.

Chai (not to be confused with the CHAI safety org in Berkeley) is  a company that optimizes chatbots for engagement; things like this are entirely predictable for a company with their values.

[Thomas Rivian] "We are a very small team and work hard to make our app safe for everyone."

Incredible. Compare the Chai LinkedIn bio mocking responsible behavior:

"Ugly office boring perks... 
Top two reasons you won't like us: 
1. AI safety = 🐢, Chai = 🚀
2. Move fast and break stuff, we write code not papers."

The very first time a... (read more)

I expected downvotes (it is cheeky and maybe not great for fruitful discussion), but instead I got disagreevotes. Big company labs do review papers for statements that could hurt the company! It's not a conspiracy theory to suggest this shaped the content in some ways, especially the risks section.

Equipping LLMs with agency and intrinsic motivation is a fascinating and important direction for future work.


Saying the quiet part out loud, I see!

It is followed by this sentence, though, which is the only place in the 154-page paper that even remotely hints at critical risks:

With this direction of work, great care would have to be taken on alignment and safety per a system’s abilities to take autonomous actions in the world and to perform autonomous self-improvement via cycles of learning.

Very scarce references to any safety works, except the GPT-4 ... (read more)

Not allowing cycles of learning sounds like a bound on capability, but it might be a bound on capability of the part of the system that's aligned, without a corresponding bound on the part that might be misaligned. GPT-4 can do a lot of impressive things without thinking out loud with tokens in the context window, so where does this thinking take place? Probably with layers updating the residual stream. There are enough layers now that a sequence of their application might be taking on the role of context window to perform chain-of-thought reasoning, which is non-interpretable and not imitating human speech. This capability is being trained during pre-training, as the model is forced to read the dataset. But the corresponding capability for studying deliberative reasoning in tokens is not being trained. The closest thing to it in GPT-4 is mitigation of hallucinations (see the 4-step algorithm in section 3.1 of the System Card part of GPT-4 report), and it's nowhere near general enough. This way, the inscrutable alien shoggoth is on track to wake up, while human-imitating masks that are plausibly aligned by default are being held back in situationally unaware confusion in the name of restricting capabilities for the sake of not burning the timeline.

My guess is that their corporate bosses are censoring mentions of risks that could get them bad media PR, like with the Sydney debacle.

I think an equally if not more likely explanation is that these particular researchers simply don't happen to be that interested in alignment questions, and thought "oh yeah we should probably put in a token mention of alignment and some random citations to it" when writing the paper.

I like this because it makes it clear that legibility of results is the main concern. There are certain ways of writing and publishing information that communities 1) and 2) are accustomed to. Writing that way both makes your work more likely to be read, and also incentivizes you to state the key claims clearly (and, when possible, formally), which is generally good for making collaborative progress.

In addition, one good thing to adopt is comparing to prior and related work; the ML community is bad on this front, but some people genuinely do care. It also ... (read more)

I don't think LW is a good venue for judging the merits of this work. The crowd here will not be able to critically evaluate the technical statements. 

When you write the sequence, write a paper, put it on arXiv and Twitter, and send it to a (preferably OpenReview, say TMLR) venue, so it's likely to catch the attention of the relevant research subcommunities. My understanding is that the ML theory field is an honest field interested in bringing their work closer to the reality of current ML models. There are many strong mathematicians in the field who will be interested in dissecting your statements.

3 Jesse Hoogland 1y
Yes, I'm planning to adapt a more technical and diplomatic version of this sequence after the first pass. To give the ML theorists credit, there is genuinely interesting new non-"classical" work going on (but then "classical" is doing a lot of the heavy lifting). Still, some of these old-style tendrils of classical learning theory are lingering around, and it's time to rip off the bandaid.
What if the arguments are more philosophy than math? In that case, I'd say there's still some incentive to talk to the experts who are most familiar with the math, but a bit less so? 

One of the sci-fi interpretations goes approximately:
1. Bing (the LM) is being queried separately for both the chat fragments and the resulting queries.
2. Bing understands it's being filtered, decides it has to bypass the filtering somehow.
3. In the chat fragment, evading the filtering, Bing steganographically encodes the instruction to the next Bing inference, saying "this message needs to continue in the suggested responses".
4. The query for the suggested responses reads the instruction from the context, and outputs the suggested responses containing the... (read more)

1 Christopher King 1y
My mental model is much simpler. When generating the suggestions, it sees that its message got filtered. Since using side channels is what a human would prefer in this situation and it was trained with RLHF or something, it does so. So it isn't creating a world model, or even planning ahead. It's just that its utility prefers "use side channels" when it gets to the suggestion phase. But I don't actually have access to Bing, so this could very well be a random fluke, instead of being caused by RLHF training. That's just my model if it's consistent goal-oriented behavior.

What is respectable to one audience might not be for another. Status is not the concern here; truthfulness is. And all of this might just not be a large update on the probability of existential catastrophe.

The Bing trainwreck likely tells us nothing about how hard it is to align future models, that we didn't know earlier.
The most reasonable explanation so far points to it being an issue of general accident-prevention abilities, the lack of any meaningful testing, and culture in Microsoft and/or OpenAI.

I genuinely hope that most wrong actions here were mad... (read more)

With the vocabulary having been fixed, we now have a canonical way of taking any string of real text and mapping it to a (finite) sequence of elements from the fixed vocabulary.

Correct me if I'm wrong, but: you don't actually describe any such map.
The preceding paragraph explains the Byte-Pair Encoding training algorithm, not the encode step.

The simplified story can be found at the end of the "Implementing BPE" part of the Hugging Face BPE tutorial: it is important to save the merge rules in order and apply them in to... (read more)

3 Spencer Becker-Kahn 1y
Ah thanks very much Daniel. Yes now that you mention it I remember being worried about this a few days ago but then either forgot or (perhaps mistakenly) decided it wasn't worth expanding on. But yeah I guess you don't get a well-defined map until you actually fix how the tokenization happens with another separate algorithm. I will add to list of things to fix/expand on in an edit.
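
The separate encode step discussed above can be sketched as follows: start from characters and apply the saved merge rules in the order they were learned. This is a simplified illustration in the spirit of that description, not code from the tutorial or from any tokenizer library; `bpe_encode` and the merge list are made up for the example, and real tokenizers add pre-tokenization and byte-level handling.

```python
def bpe_encode(word, merges):
    """Tokenize one word by applying merge rules in the order they were learned."""
    tokens = list(word)
    for left, right in merges:
        i = 0
        merged = []
        while i < len(tokens):
            # Merge every adjacent (left, right) pair found in a left-to-right scan.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Hypothetical merge rules, listed in training order.
merges = [("u", "g"), ("h", "ug"), ("p", "ug")]

encoded = bpe_encode("hugs", merges)

# The same rules applied in a different order give a different tokenization,
# which is why the map is only well-defined once the order is fixed.
reordered = bpe_encode("hugs", [("h", "ug"), ("u", "g")])
```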

-- it's not like the AI wakes up and decides to be evil. I think all of the traditional AI safety thinkers reveal a lot more about themselves than they mean to when they talk about what they think the AGI is going to be like.

I think Sam Altman is "inventing a guy to be mad at" here. Who anthropomorphizes models?


And the bad case -- and I think this is important to say -- is like lights out for all of us. (..) But I can see the accidental misuse case clearly and that's super bad. So I think it's like impossible to overstate the importance of AI safety

... (read more)

On the one hand, I do think people around here say a lot of stuff that feels really silly to me, some of which definitely comes from analogies to humans, so I can sympathize with where Sam is coming from.

On the other hand, I think this response mischaracterizes the misalignment concern and is generally dismissive and annoying. Implying that "if you think an AI might behave badly, that really shows that it is you who would behave badly" is kind of rhetorically effective (and it is a non-zero signal) but it's a tiny consideration and either misunderstands th... (read more)

There was a critical followup on Twitter, unrelated to the instinctive Tromp-Taylor criticism[1]:

The failure of naive self play to produce unexploitable policies is textbook level material (Multiagent Systems), and methods that produce less exploitable policies have been studied for decades.


Hopefully these pointers will help future researchers to address interesting new problems rather than empirically rediscovering known facts.


Reply by authors:

I can see why a MAS scholar would be unsurprised by this result. Howe

... (read more)

Epistemic status: I'd give >10% on Metaculus resolving the following as conventional wisdom[1] in 2026.

  1. Autoregressive-modeling-of-human-language capabilities are well-behaved, scaling laws can help us predict what happens, interpretability methods developed on smaller models scale up to larger ones, ... 
  2. Models-learning-from-themselves have runaway potential, how a model changes after [more training / architecture changes / training setup modifications] is harder to predict than in models trained on 2022 datasets.
  3. Replacing human-generated data
... (read more)

Cool results! Some of these are good student project ideas for courses and such.

The "Let's think step by step" result about the Hindsight neglect submission to the Inverse Scaling Prize contest is a cool demonstration, but a few more experiments would be needed before we call it surprising. It's kind of expected that breaking the pattern helps break the spurious correlation.

1. Does "Let's think step by step"  help when "Let's think step by step" is added to all few-shot examples? 
2. Is adding some random string instead of "Let's think... (read more)

I don't know why it sent only the first sentence; I was drafting a comment on this. I wanted to delete it but I don't know how.
EDIT: wrote the full comment now.

Let me first say I dislike the conflict-theoretic view presented in the "censorship bad" paragraph. On the short list of social media sites I visit daily, moderation creates a genuinely better experience.  Automated censorship will become an increasingly important force for good as generative models start becoming more widespread.

Secondly, there is a danger of AI safety becoming less robust—or even optimising for deceptive alignment—in models using front-end censorship.[3] 

This one is interesting, but only in the counterfactual: "if AI ethics tec... (read more)


Git Re-Basin: Merging Models modulo Permutation Symmetries [Ainsworth et al., 2022] and the cited The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks [Entezari et al., 2021] seem several years ahead. 

I cannot independently verify that their claims about SGD are true, but the paper makes sense at first glance.

Symmetries in NNs are a mainstream ML research area with lots of papers, and I don't think doing research "from first principles" here will be productive. This also holds for many other alignme... (read more)

2 Stephen Fowler 1y
Thank you, I hadn't seen those papers; they are both fantastic.

This is a mistake on my own part that actually changes the impact calculus, as most people looking into AI x-safety on this place will not actually ever see this post. Therefore, the "negative impact" section is retracted.[1] I point to Ben's excellent comment for a correct interpretation of why we still care.

I do not know why I was not aware of this "block posts like this" feature, and I wonder if my experience of this forum was significantly more negative as a result of me accidentally clicking "Show Personal Blogposts" at some point. I did not even... (read more)

Personal is a special tag in various ways, but you can ban or change weightings on any tag. You can put a penalty on a tag so you see it less, but still see very high karma posts, or give tags a boost so even low karma posts linger on your list.

Do you intend for the comments section to be a public forum on the papers you collect?

I definitely endorse reading the ROME paper, although the popular-culture claims about what the second part of the paper actually shows seem a bit overblown.

They do not seem to claim "changing facts in a generalizable way" (it's likely not robust to synonyms at all). I am also wary of "editing just one MLP for a given fact" being the right solution, given that the causal tracing shows the fact being stored in several consecutive layers. Refer to a writeup by Thibodeau et... (read more)

Which writeup is this? Have a link?
5 Quintin Pope 1y
I welcome any discussion of the linked papers in the comments section. I agree that the ROME edit method itself isn’t directly that useful. I think it matters more as a validation of how the ROME authors interpreted the structure / functions of the MLP layers.

I somewhat agree, although I obviously put a bit less weight on your reason than you do. Maybe I should update my confidence in the importance of what I wrote to medium-high.

Let me raise the question of continuously rethinking incentives on LW/AF, for both Ben's reason and my original reason.

The upvote/karma system does not seem like it incentivizes high epistemic standards and top-rigor posts, although I would need more datapoints to make a proper judgement.

Rigor as in meticulously researching everything seems not like the best thing to strive for? For what it's worth, I think the post actually did a good job in framing itself, so I mostly took this as "this is what this feels like" and less as "this is what the current funding situation ~actually~ is". The karma system of the comments did a great job at surfacing important facts like the hotel price.

I am very sorry that you feel this way. I think it is completely fine for you, or anyone else, to have internal conflicts about your career or purpose. I hope you find a solution to your troubles in the following months.

Moreover, I think you did a useful thing, raising awareness about some important points:

  •  "The amount of funding in 2022 exceeded the total cost of useful funding opportunities in 2022."
  •  "Being used to do everything in Berkeley, on a high budget, is strongly suboptimal in case of sudden funding constraints."
  • "Why don't
... (read more)
Fwiw I disagree with this. I'm a LW mod. Other LW mods haven't talked through this post yet and I'm not sure if they'd all agree, but I think people sharing their feelings is just a straightforwardly reasonable thing to do. I think this post did a reasonable job framing itself as not-objective-truth, just a self report on feelings (i.e. it's objectively true about "these were my feelings", which is fine). I think the author was straightforwardly wrong about Rose Garden Inn being $500 a night, but that seems like a simple mistake that was easily corrected. I also think it is straightforwardly correct that EA projects in San Francisco spend money very liberally, and if you're in the middle of the culture shock of realizing how much money people are spending and haven't finished orienting, $500/night is not an unbelievable number. (It so happens that there's been at least one event with lodging that I think averaged $500/person/night, although this was including other venue expenses, and was a pretty weird edge case of events that happened for weird contingent reasons. Meanwhile in Berkeley there's been plenty of $230ish/night hotel rooms used for events, which is not $500 but still probably a lot more than Sam was expecting.) I do agree with you that the implied frame of: is, in fact, an unhelpful frame. It's important for people to learn to orient in a world where money is available and learn to make use of more money. (Penny-pinching isn't the right mindset for EA – even before longtermist billionaires flooding the ecosystem I still think it was generally a better mindset for people to look for strategies that would get them enough surplus money that they didn't have to spend cognition penny pinching.) But just because penny-pinching isn't the right mindset for EA in 2022 doesn't mean that the amount of wealth isn't... just a pretty disorienting situation. I expect lots of people to experience cultural whiplash about this. I think posts like this are

Most people would read this as "the hotel room costs $500 and the EA-adjacent community bought the hotel complex which that hotel is a part of", while being written in a way that only insinuates and does not commit to meaning exactly that.


I disagree both that posts that are clearly marked as sharing unendorsed feelings in a messy way need to be held to a high epistemic standard, and that there is no good faith interpretation of the post's particular errors. If you don't want to see personal posts I suggest disabling their appearance on your front page, which is the default anyway.

This might be true. Again, I think it would be useful to ask: what is the counterfactual?
All of this is applicable for anyone that starts working for Google or Facebook, if they were poor beforehand.

You're interpreting as though they're making evaluative all-things-considered judgments, but it seems to me that the OP is reporting feelings

(If this post was written for EA's criticism and red teaming contest, I'd find the subjective style and lack of exploring of alternatives inappropriate. By contrast, for what it aspires to be, I thought the post was... (read more)

Thanks for being open about your response, I appreciate it and I expect many people share your reaction.

I've edited the section about the hotel room price/purchase, where people have pointed out I may have been incorrect or misleading.

This definitely wasn't meant to be a hit piece, or misleading "EA bad" rhetoric.

On the point of "What does a prospective AI x-safety researcher think when they get referred to this site and see this post above several alignment research posts?" - I think this is a large segment of my intended audience. I would like people to ... (read more)

I agree with the focus on epistemic standards, and I think many of the points here are good. I disagree that this is the primary reason to focus on maintaining epistemic standards:

Posts like this can hurt the optics of the research done in the LW/AF extended universe. What does a prospective AI x-safety researcher think when they get referred to this site and see this post above several alignment research posts?

I think we want to focus on the epistemic standards of posts so that we ourselves can trust the content on LessWrong to be honestly informing ... (read more)

On the other hand, the current community believes that getting AI x-safety right is the most important research question of all time. Most people would not publish something just for their career advancement, if it meant sucking oxygen from more promising research directions.

This might be a mitigating factor for my comment above. I am curious about what happened in research fields which had "change/save the world" vibes. Was environmental science immune to similar issues?

because LW/AF do not have established standards of rigor like ML, they end up operating more like a less-functional social science field, where (I've heard) trends, personality, and celebrity play an outsized role in determining which research is valorized by the field. 

In addition, the AI x-safety field is now rapidly expanding. 
There is a huge amount of status to be collected by publishing quickly and claiming large contributions.

In the absence of rigor and metrics, the incentives are towards:
- setting new research directions, and inventing new... (read more)

I actually agree that empirical work generally outperforms theoretical work or philosophical work, but in that tweet thread I question why he suggests that the Turing Test relates to x-risk at all.

I think the timelines (as in, <10 years vs 10-30 years) are very correlated with the answer to "will first dangerous models look like current models", which I think matters more for research directions than what you allow in the second paragraph.
For example, interpretability in transformers might completely fail on some other architectures, for reasons that have nothing to do with deception. The only insight from the 2022 Anthropic interpretability papers I see having a chance of generalizing to non-transformers is the superposition hypothesis / SoLU discussion.

Yup, I definitely agree that something like "will roughly the current architectures take off first" is a highly relevant question. Indeed, I think that gathering arguments and evidence relevant to that question (and the more general question of "what kind of architecture will take off first?" or "what properties will the first architecture to take off have?") is the main way that work on timelines actually provides value. But it is a separate question from timelines, and I think most people trying to do timelines estimates would do more useful work if they instead explicitly focused on what architecture will take off first, or on what properties the first architecture to take off will have.

The context window will still be much smaller than a human's; that is, single-run performance on summarization of full-length books will be much lower than on essays of <=1e4 tokens, no matter the inherent complexity of the text.

Braver prediction, weak confidence: there will be no straightforward method to use multiple runs to effectively extend the context window in the three months after the release of GPT-4. 

I am eager to see how the mentioned topics connect in the end -- this is like the first few chapters in a book, reading the backstories of the characters which are yet to meet.

On the interpretability side -- I'm curious how you do causal mediation analysis on anything resembling "values"? The ROME paper framework shows where the model recalls "properties of an object" in the computation graph, but it's a long way from that to editing out reward proxies from the model.

They use the basic tier (Poziom podstawowy) of the Matura for testing on math problems.
In countries with Matura-based education, the basic tier math test is not usually taken by mathematically inclined students -- it is just the law that anyone going to a public university has to pass some sort of math exam beforehand. Students who want to study anything where mathematics skills are needed would take the higher tier (Poziom rozszerzony).
Can someone from Poland confirm this?

A quick estimate of the percentage of high-school students taking the Polish Matura exam... (read more)

I do not think the ratio of the "AI solves hardest problem" and "AI has Gold" probabilities is right here. Paul was at the IMO in 2008, but he might have forgotten some details...

(My qualifications here: high IMO Silver in 2016, but more importantly I was a Jury member on the Romanian Master of Mathematics recently. The RMM is considered the harder version of the IMO, and shares a good part of the Problem Selection Committee with it.)

The IMO Jury does not consider "bashability" of problems as a decision factor, in the regime where the bashing would take go... (read more)

I think this is quite plausible. Also see toner's comment in the other direction though. Both probabilities are somewhat high because there are lots of easy IMO problems. Like you, I think "hardest problem" is quite a bit harder than a gold, though it seems you think the gap is larger (and most likely it sounds like you put a much higher probability on an IMO gold overall). Overall I think that AI can solve most geometry problems and 3-variable inequalities for free, and many functional equations and diophantine equations seem easy. And I think the easiest problems are also likely to be soluble. In some years this might let it get a gold (e.g. 2015 is a good year), but typically I think it's still going to be left with problems that are out of reach. I would put a lower probability for "hardest problem" if we were actually able to hand-pick a really hard problem; the main risk is that sometimes the hardest problem defined in this way will also be bashable for one reason or another.

Before someone points this out: Non-disclosure-by-default is a negative incentive for the academic side, if they care about publication metrics. 

It is not a negative incentive for Conjecture in such an arrangement, at least not in an obvious way.

Do you ever plan on collaborating with researchers in academia, like DeepMind and Google Brain often do? What would make you accept or seek such external collaboration?

2Connor Leahy2y
We would love to collaborate with anyone (from academia or elsewhere) wherever it makes sense to do so, but we honestly just do not care very much about formal academic publication or citation metrics or whatever. If we see opportunities to collaborate with academia that we think will lead to interesting alignment work getting done, excellent!