All of Daniel Paleka's Comments + Replies

There was a critical followup on Twitter, unrelated to the instinctive Tromp-Taylor criticism[1]:

The failure of naive self play to produce unexploitable policies is textbook level material (Multiagent Systems), and methods that produce less exploitable policies have been studied for decades.


Hopefully these pointers will help future researchers to address interesting new problems rather than empirically rediscovering known facts.


Reply by authors:

I can see why a MAS scholar would be unsurprised by this result. Howe

...

Epistemic status: I'd give >10% on Metaculus resolving the following as conventional wisdom[1] in 2026.

  1. Autoregressive-modeling-of-human-language capabilities are well-behaved, scaling laws can help us predict what happens, interpretability methods developed on smaller models scale up to larger ones, ... 
  2. Models-learning-from-themselves have runaway potential, how a model changes after [more training / architecture changes / training setup modifications] is harder to predict than in models trained on 2022 datasets.
  3. Replacing human-generated data
...

Cool results! Some of these are good student project ideas for courses and such.

The "Let's think step by step" result about the Hindsight neglect submission to the Inverse Scaling Prize contest is a cool demonstration, but a few more experiments would be needed before we call it surprising. It's kind of expected that breaking the pattern helps break the spurious correlation.

1. Does "Let's think step by step"  help when "Let's think step by step" is added to all few-shot examples? 
2. Is adding some random string instead of "Let's think...

I don't know why it sent only the first sentence; I was drafting a comment on this. I wanted to delete it but I don't know how.
EDIT: wrote the full comment now.

Let me first say I dislike the conflict-theoretic view presented in the "censorship bad" paragraph. On the short list of social media sites I visit daily, moderation creates a genuinely better experience. Automated censorship will become an increasingly important force for good as generative models start becoming more widespread.

Secondly, there is a danger of AI safety becoming less robust—or even optimising for deceptive alignment—in models using front-end censorship.[3] 

This one is interesting, but only in the counterfactual: "if AI ethics tec...


Git Re-Basin: Merging Models modulo Permutation Symmetries [Ainsworth et al., 2022] and the cited The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks [Entezari et al., 2021] seem several years ahead. 

I cannot independently verify that their claims about SGD are true, but the paper makes sense at first glance.

Symmetries in NNs are a mainstream ML research area with lots of papers, and I don't think doing research "from first principles" here will be productive. This also holds for many other alignme...

2 Stephen Fowler 3mo
Thank you, I hadn't seen those papers; they are both fantastic.

This is a mistake on my part that actually changes the impact calculus, as most people looking into AI x-safety on this site will never actually see this post. Therefore, the "negative impact" section is retracted.[1] I point to Ben's excellent comment for a correct interpretation of why we still care.

I do not know why I was not aware of this "block posts like this" feature, and I wonder if my experience of this forum was significantly more negative as a result of me accidentally clicking "Show Personal Blogposts" at some point. I did not even...

Personal is a special tag in various ways, but you can ban or change weightings on any tag. You can put a penalty on a tag so you see it less, but still see very high karma posts, or give tags a boost so even low karma posts linger on your list.

Do you intend for the comments section to be a public forum on the papers you collect?

I definitely endorse reading the ROME paper, although the popular-culture claims about what the second part of the paper actually shows seem a bit overblown.

They do not seem to claim "changing facts in a generalizable way" (it's likely not robust to synonyms at all). I am also wary of "editing just one MLP for a given fact" being the right solution, given that the causal tracing shows the fact being stored in several consecutive layers. Refer to a writeup by Thibodeau et...

Which writeup is this? Have a link?
5 Quintin Pope 3mo
I welcome any discussion of the linked papers in the comments section. I agree that the ROME edit method itself isn’t directly that useful. I think it matters more as a validation of how the ROME authors interpreted the structure / functions of the MLP layers.

I somewhat agree, although I obviously put a bit less weight on your reason than you do. Maybe I should update my confidence in the importance of what I wrote to medium-high.

Let me raise the question of continuously rethinking incentives on LW/AF, for both Ben's reason and my original reason.

The upvote/karma system does not seem like it incentivizes high epistemic standards and top-rigor posts, although I would need more datapoints to make a proper judgement.

Rigor, as in meticulously researching everything, does not seem like the best thing to strive for? For what it's worth, I think the post actually did a good job of framing itself, so I mostly took this as "this is what this feels like" and less as "this is what the current funding situation ~actually~ is". The karma system of the comments did a great job at surfacing important facts like the hotel price.

I am very sorry that you feel this way. I think it is completely fine for you, or anyone else, to have internal conflicts about your career or purpose. I hope you find a solution to your troubles in the following months.

Moreover, I think you did a useful thing, raising awareness about some important points:

  •  "The amount of funding in 2022 exceeded the total cost of useful funding opportunities in 2022."
  •  "Being used to do everything in Berkeley, on a high budget, is strongly suboptimal in case of sudden funding constraints."
  • "Why don't
...
Fwiw I disagree with this. I'm a LW mod. Other LW mods haven't talked through this post yet and I'm not sure if they'd all agree, but I think people sharing their feelings is just a straightforwardly reasonable thing to do.

I think this post did a reasonable job framing itself as not-objective-truth, just a self report on feelings (i.e. it's objectively true about "these were my feelings", which is fine).

I think the author was straightforwardly wrong about Rose Garden Inn being $500 a night, but that seems like a simple mistake that was easily corrected. I also think it is straightforwardly correct that EA projects in San Francisco spend money very liberally, and if you're in the middle of the culture shock of realizing how much money people are spending and haven't finished orienting, $500/night is not an unbelievable number. (It so happens that there's been at least one event with lodging that I think averaged $500/person/night, although this was including other venue expenses, and was a pretty weird edge case of events that happened for weird contingent reasons. Meanwhile in Berkeley there's been plenty of $230ish/night hotel rooms used for events, which is not $500 but still probably a lot more than Sam was expecting.)

I do agree with you that the implied frame of: is, in fact, an unhelpful frame. It's important for people to learn to orient in a world where money is available and learn to make use of more money. (Penny-pinching isn't the right mindset for EA – even before longtermist billionaires flooding the ecosystem I still think it was generally a better mindset for people to look for strategies that would get them enough surplus money that they didn't have to spend cognition penny pinching.)

But, just because penny-pinching isn't the right mindset for EA in 2022, doesn't mean that the amount of wealth isn't... just a pretty disorienting situation. I expect lots of people to experience cultural whiplash about this. I think posts like this are a

Most people would read this as "the hotel room costs $500 and the EA-adjacent community bought the hotel complex of which that hotel is a part", while being written in a way that only insinuates and does not commit to meaning exactly that.


I disagree both that posts that are clearly marked as sharing unendorsed feelings in a messy way need to be held to a high epistemic standard, and that there is no good faith interpretation of the post's particular errors. If you don't want to see personal posts I suggest disabling their appearance on your front page, which is the default anyway.

This might be true. Again, I think it would be useful to ask: what is the counterfactual?
All of this is applicable for anyone that starts working for Google or Facebook, if they were poor beforehand.

You're interpreting as though they're making evaluative all-things-considered judgments, but it seems to me that the OP is reporting feelings

(If this post was written for EA's criticism and red teaming contest, I'd find the subjective style and lack of exploring of alternatives inappropriate. By contrast, for what it aspires to be, I thought the post was...

Thanks for being open about your response, I appreciate it and I expect many people share your reaction.

I've edited the section about the hotel room price/purchase, where people have pointed out I may have been incorrect or misleading.

This definitely wasn't meant to be a hit piece, or misleading "EA bad" rhetoric.

On the point of "What does a prospective AI x-safety researcher think when they get referred to this site and see this post above several alignment research posts?" - I think this is a large segment of my intended audience. I would like people to ...

I agree with the focus on epistemic standards, and I think many of the points here are good. I disagree that this is the primary reason to focus on maintaining epistemic standards:

Posts like this can hurt the optics of the research done in the LW/AF extended universe. What does a prospective AI x-safety researcher think when they get referred to this site and see this post above several alignment research posts?

I think we want to focus on the epistemic standards of posts so that we ourselves can trust the content on LessWrong to be honestly informing ...

On the other hand, the current community believes that getting AI x-safety right is the most important research question of all time. Most people would not publish something just for their career advancement, if it meant sucking oxygen from more promising research directions.

This might be a mitigating factor for my comment above. I am curious about what happened in research fields which had "change/save the world" vibes. Was environmental science immune to similar issues?

because LW/AF do not have established standards of rigor like ML, they end up operating more like a less-functional social science field, where (I've heard) trends, personality, and celebrity play an outsized role in determining which research is valorized by the field. 

In addition, the AI x-safety field is now rapidly expanding. 
There is a huge amount of status to be collected by publishing quickly and claiming large contributions.

In the absence of rigor and metrics, the incentives are towards:
- setting new research directions, and inventing new...

I actually agree that empirical work generally outperforms theoretical or philosophical work, but in that tweet thread I question why he suggests the Turing Test as having anything to do with x-risk.

I think the timelines (as in, <10 years vs 10-30 years) are very correlated with the answer to "will first dangerous models look like current models", which I think matters more for research directions than what you allow in the second paragraph.
For example, interpretability in transformers might completely fail on some other architectures, for reasons that have nothing to do with deception. The only insight from the 2022 Anthropic interpretability papers I see having a chance of generalizing to non-transformers is the superposition hypothesis / SoLU discussion.

Yup, I definitely agree that something like "will roughly the current architectures take off first" is a highly relevant question. Indeed, I think that gathering arguments and evidence relevant to that question (and the more general question of "what kind of architecture will take off first?" or "what properties will the first architecture to take off have?") is the main way that work on timelines actually provides value. But it is a separate question from timelines, and I think most people trying to do timelines estimates would do more useful work if they instead explicitly focused on what architecture will take off first, or on what properties the first architecture to take off will have.

The context window will still be much smaller than a human's; that is, single-run performance on summarization of full-length books will be much lower than on <=1e4 token essays, no matter the inherent complexity of the text.

Braver prediction, weak confidence: there will be no straightforward method to use multiple runs to effectively extend the context window in the three months after the release of GPT-4. 

I am eager to see how the mentioned topics connect in the end -- this is like the first few chapters in a book, reading the backstories of the characters which are yet to meet.

On the interpretability side -- I'm curious how you do causal mediation analysis on anything resembling "values"? The ROME paper framework shows where the model recalls "properties of an object" in the computation graph, but it's a long way from that to editing out reward proxies from the model.

They use the basic (Poziom podstawowy) Matura tier for testing on math problems.
In countries with Matura-based education, the basic tier math test is not usually taken by mathematically inclined students -- it is just the law that anyone going to a public university has to pass some sort of math exam beforehand. Students who want to study anything where mathematics skills are needed would take the higher tier (Poziom rozszerzony).
Can someone from Poland confirm this?

A quick estimate of the percentage of high-school students taking the Polish Matura exam...

I do not think the ratio of the "AI solves hardest problem" and "AI has Gold" probabilities is right here. Paul was at the IMO in 2008, but he might have forgotten some details...

(My qualifications here: high IMO Silver in 2016, but more importantly I was a Jury member on the Romanian Master of Mathematics recently. The RMM is considered the harder version of the IMO, and shares a good part of the Problem Selection Committee with it.)

The IMO Jury does not consider "bashability" of problems as a decision factor, in the regime where the bashing would take go...

I think this is quite plausible. Also see toner's comment in the other direction though. Both probabilities are somewhat high because there are lots of easy IMO problems. Like you, I think "hardest problem" is quite a bit harder than a gold, though it seems you think the gap is larger (and most likely it sounds like you put a much higher probability on an IMO gold overall). Overall I think that AI can solve most geometry problems and 3-variable inequalities for free, and many functional equations and diophantine equations seem easy. And I think the easiest problems are also likely to be soluble. In some years this might let it get a gold (e.g. 2015 is a good year), but typically I think it's still going to be left with problems that are out of reach. I would put a lower probability for "hardest problem" if we were actually able to hand-pick a really hard problem; the main risk is that sometimes the hardest problem defined in this way will also be bashable for one reason or another.

Before someone points this out: Non-disclosure-by-default is a negative incentive for the academic side, if they care about publication metrics. 

It is not a negative incentive for Conjecture in such an arrangement, at least not in an obvious way.

Do you ever plan on collaborating with researchers in academia, like DeepMind and Google Brain often do? What would make you accept or seek such external collaboration?

2 Connor Leahy 8mo
We would love to collaborate with anyone (from academia or elsewhere) wherever it makes sense to do so, but we honestly just do not care very much about formal academic publication or citation metrics or whatever. If we see opportunities to collaborate with academia that we think will lead to interesting alignment work getting done, excellent!