There was a critical followup on Twitter, unrelated to the instinctive Tromp-Taylor criticism:
The failure of naive self play to produce unexploitable policies is textbook level material (Multiagent Systems, http://masfoundations.org/mas.pdf), and methods that produce less exploitable policies have been studied for decades.
Hopefully these pointers will help future researchers to address interesting new problems rather than empirically rediscovering known facts.
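(For readers unfamiliar with the term: "exploitability" here is the amount a best-responding opponent can gain against a fixed policy. A minimal numpy sketch of that quantity for a zero-sum matrix game -- my own illustration, not code from the thread or the textbook:)

```python
import numpy as np

# Zero-sum matrix game: payoff[i, j] is the row player's payoff when
# row plays action i and column plays action j (rock-paper-scissors here).
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

def exploitability(row_policy: np.ndarray) -> float:
    """Value a best-responding column player extracts against a fixed row policy."""
    # Column player's expected payoff for each pure response is -(row_policy @ payoff).
    column_payoffs = -(row_policy @ payoff)
    return float(column_payoffs.max())

print(exploitability(np.array([1/3, 1/3, 1/3])))  # 0.0 -- the uniform policy is unexploitable
print(exploitability(np.array([1.0, 0.0, 0.0])))  # 1.0 -- a deterministic policy is fully exploitable
```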
Reply by authors:
I can see why a MAS scholar would be unsurprised by this result. However...
Epistemic status: I'd give >10% on Metaculus resolving the following as conventional wisdom in 2026.
Cool results! Some of these are good student project ideas for courses and such.
The "Let's think step by step" result about the Hindsight neglect submission to the Inverse Scaling Prize contest is a cool demonstration, but a few more experiments would be needed before we call it surprising. It's kind of expected that breaking the pattern helps break the spurious correlation.
1. Does "Let's think step by step" help when "Let's think step by step" is added to all few-shot examples?
2. Is adding some random string instead of "Let's think... (read more)
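A rough sketch of the ablation harness I have in mind (the prompt format and the commented-out evaluation call are placeholders, not the actual Inverse Scaling Prize setup):

```python
import random
import string

def build_prompt(few_shot_examples, test_question, suffix="", suffix_on_examples=False):
    """Assemble a few-shot prompt; append `suffix` to every example (ablation 1)
    or only to the final test question (the original setup)."""
    parts = []
    for question, answer in few_shot_examples:
        q = f"{question} {suffix}".strip() if suffix_on_examples else question
        parts.append(f"{q}\nAnswer: {answer}")
    parts.append(f"{test_question} {suffix}".strip() + "\nAnswer:")
    return "\n\n".join(parts)

# Ablation 2: a random string of roughly the same length as the magic phrase.
random_suffix = "".join(random.choices(string.ascii_lowercase + " ", k=26))

variants = {
    "baseline":                dict(suffix=""),
    "step_by_step_on_test":    dict(suffix="Let's think step by step."),
    "step_by_step_everywhere": dict(suffix="Let's think step by step.", suffix_on_examples=True),
    "random_string_on_test":   dict(suffix=random_suffix),
}

# for name, kwargs in variants.items():
#     prompt = build_prompt(examples, test_question, **kwargs)   # hypothetical data + eval loop
#     accuracy[name] = evaluate(query_model(prompt))
```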
I don't know why it sent only the first sentence; I was drafting a comment on this. I wanted to delete it but I don't know how. EDIT: wrote the full comment now.
Let me first say I dislike the conflict-theoretic view presented in the "censorship bad" paragraph. On the short list of social media sites I visit daily, moderation creates a genuinely better experience. Automated censorship will become an increasingly important force for good as generative models start becoming more widespread.
Secondly, there is a danger of AI safety becoming less robust—or even optimising for deceptive alignment—in models using front-end censorship.
This one is interesting, but only in the counterfactual: "if AI ethics tec... (read more)
Git Re-Basin: Merging Models modulo Permutation Symmetries [Ainsworth et al., 2022] and the cited The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks [Entezari et al., 2021] seem several years ahead. I cannot independently verify that their claims about SGD are true, but the paper makes sense at first glance. Opinion: Symmetries in NNs are a mainstream ML research area with lots of papers, and I don't think doing research "from first principles" here will be productive. This also holds for many other alignme... (read more)
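For readers new to the area, a minimal sketch of the symmetry these papers build on (my own illustration, not code from either paper): permuting the hidden units of one MLP layer, and the corresponding input weights of the next layer, leaves the network's function unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-layer MLP: x -> relu(W1 x + b1) -> W2 h + b2
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)

def forward(x, W1, b1, W2, b2):
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

# Random permutation of the 16 hidden units.
perm = rng.permutation(16)
W1_p, b1_p = W1[perm], b1[perm]   # permute rows of layer 1
W2_p = W2[:, perm]                # permute columns of layer 2 to match

x = rng.normal(size=8)
print(np.allclose(forward(x, W1, b1, W2, b2),
                  forward(x, W1_p, b1_p, W2_p, b2)))  # True: same function, different weights
```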
This is a mistake on my own part that actually changes the impact calculus, as most people looking into AI x-safety on this site will not actually ever see this post. Therefore, the "negative impact" section is retracted. I point to Ben's excellent comment for a correct interpretation of why we still care. I do not know why I was not aware of this "block posts like this" feature, and I wonder if my experience of this forum was significantly more negative as a result of me accidentally clicking "Show Personal Blogposts" at some point. I did not even... (read more)
Do you intend for the comments section to be a public forum on the papers you collect?
I definitely endorse reading the ROME paper, although the popular-culture claims about what the second part of the paper actually shows seem a bit overblown.
They do not seem to claim "changing facts in a generalizable way" (it's likely not robust to synonyms at all). I am also wary of "editing just one MLP for a given fact" being the right solution, given that the causal tracing shows the fact being stored in several consecutive layers.
Refer to a writeup by Thibodeau et... (read more)
I somewhat agree, although I obviously put a bit less weight on your reason than you do. Maybe I should update my confidence in the importance of what I wrote to medium-high.
Let me raise the question of continuously rethinking incentives on LW/AF, for both Ben's reason and my original reason.
The upvote/karma system does not seem like it incentivizes high epistemic standards and top-rigor posts, although I would need more datapoints to make a proper judgement.
I am very sorry that you feel this way. I think it is completely fine for you, or anyone else, to have internal conflicts about your career or purpose. I hope you find a solution to your troubles in the following months. Moreover, I think you did a useful thing, raising awareness about some important points:
Most people would read this as "the hotel room costs $500 and the EA-adjacent community bought the hotel complex of which that hotel is a part", while being written in a way that only insinuates and does not commit to meaning exactly that.
I disagree both that posts that are clearly marked as sharing unendorsed feelings in a messy way need to be held to a high epistemic standard, and that there is no good faith interpretation of the post's particular errors. If you don't want to see personal posts I suggest disabling their appearance on your front page, which is the default anyway.
This might be true. Again, I think it would be useful to ask: what is the counterfactual? All of this is applicable to anyone who starts working for Google or Facebook, if they were poor beforehand.
You're interpreting this as though they're making evaluative all-things-considered judgments, but it seems to me that the OP is reporting feelings.
(If this post was written for EA's criticism and red teaming contest, I'd find the subjective style and lack of exploring of alternatives inappropriate. By contrast, for what it aspires to be, I thought the post was... (read more)
Thanks for being open about your response, I appreciate it and I expect many people share your reaction. I've edited the section about the hotel room price/purchase, where people have pointed out I may have been incorrect or misleading. This definitely wasn't meant to be a hit piece, or misleading "EA bad" rhetoric.
On the point of "What does a prospective AI x-safety researcher think when they get referred to this site and see this post above several alignment research posts?" - I think this is a large segment of my intended audience. I would like people to ... (read more)
I agree with the focus on epistemic standards, and I think many of the points here are good. I disagree that this is the primary reason to focus on maintaining epistemic standards:
Posts like this can hurt the optics of the research done in the LW/AF extended universe. What does a prospective AI x-safety researcher think when they get referred to this site and see this post above several alignment research posts?
I think we want to focus on the epistemic standards of posts so that we ourselves can trust the content on LessWrong to be honestly informing ... (read more)
On the other hand, the current community believes that getting AI x-safety right is the most important research question of all time. Most people would not publish something just for their career advancement, if it meant sucking oxygen from more promising research directions. This might be a mitigating factor for my comment above. I am curious about what happened in research fields which had "change/save the world" vibes. Was environmental science immune to similar issues?
because LW/AF do not have established standards of rigor like ML, they end up operating more like a less-functional social science field, where (I've heard) trends, personality, and celebrity play an outsized role in determining which research is valorized by the field.
In addition, the AI x-safety field is now rapidly expanding. There is a huge amount of status to be collected by publishing quickly and claiming large contributions. In the absence of rigor and metrics, the incentives are towards:
- setting new research directions, and inventing new... (read more)
I think the timelines (as in, <10 years vs 10-30 years) are very correlated with the answer to "will first dangerous models look like current models", which I think matters more for research directions than what you allow in the second paragraph. For example, interpretability in transformers might completely fail on some other architectures, for reasons that have nothing to do with deception. The only insight from the 2022 Anthropic interpretability papers I see having a chance of generalizing to non-transformers is the superposition hypothesis / SoLU discussion.
The context window will still be much smaller than a human's; that is, single-run performance on summarization of full-length books will be much lower than on <=1e4-token essays, no matter the inherent complexity of the text.
Braver prediction, weak confidence: there will be no straightforward method to use multiple runs to effectively extend the context window in the three months after the release of GPT-4.
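To be concrete about what I mean by "multiple runs": the obvious candidate is recursive chunk-and-summarize, sketched below with a hypothetical llm_summarize placeholder standing in for a single bounded-context model call. My prediction is that nothing of this shape will straightforwardly match single-run quality, since cross-chunk dependencies are lost.

```python
# Hypothetical sketch; llm_summarize is a placeholder for one bounded-context model
# call, not a real API. Naive recursion like this drops cross-chunk state, which is
# why I expect single-run performance on long texts to stay ahead.

def chunk(text: str, max_words: int) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def recursive_summarize(text: str, llm_summarize, max_words: int = 7_000) -> str:
    """Summarize arbitrarily long text with a fixed-context model by summarizing
    chunks, then summarizing the concatenated summaries, until one call suffices."""
    if len(text.split()) <= max_words:
        return llm_summarize(text)
    partial = [llm_summarize(c) for c in chunk(text, max_words)]
    return recursive_summarize(" ".join(partial), llm_summarize, max_words)
```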
I am eager to see how the mentioned topics connect in the end -- this is like the first few chapters in a book, reading the backstories of the characters who are yet to meet.
On the interpretability side -- I'm curious how you do causal mediation analysis on anything resembling "values"? The ROME paper framework shows where the model recalls "properties of an object" in the computation graph, but it's a long way from that to editing out reward proxies from the model.
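(For context on what the framework measures: ROME's causal tracing is a form of activation patching -- corrupt the input, then restore individual clean hidden states and see how much of the output recovers. A toy numpy sketch of that measurement on a small MLP stand-in, not the actual ROME code:)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a deep network: a stack of dense ReLU layers.
layers = [(rng.normal(size=(12, 12)) / np.sqrt(12), rng.normal(size=12) * 0.1)
          for _ in range(4)]

PATCH_DIMS = slice(0, 4)  # restore only part of the state, loosely analogous to
                          # restoring a single token's hidden vector in ROME

def forward(x, patch_layer=None, clean_cache=None):
    """Run the stack; optionally restore part of one layer's clean activation."""
    h, cache = x, []
    for i, (W, b) in enumerate(layers):
        h = np.maximum(W @ h + b, 0.0)
        if i == patch_layer:
            h = h.copy()
            h[PATCH_DIMS] = clean_cache[i][PATCH_DIMS]
        cache.append(h)
    return h, cache

clean_x = rng.normal(size=12)
corrupt_x = clean_x + rng.normal(scale=3.0, size=12)  # "corrupted" input

clean_out, clean_cache = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Indirect effect per layer: how far restoring its clean activation on the
# corrupted run moves the output back toward the clean output (1 = full recovery).
baseline_gap = np.linalg.norm(corrupt_out - clean_out)
for i in range(len(layers)):
    patched_out, _ = forward(corrupt_x, patch_layer=i, clean_cache=clean_cache)
    recovery = 1.0 - np.linalg.norm(patched_out - clean_out) / baseline_gap
    print(f"layer {i}: recovery {recovery:+.2f}")
```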
They test on the basic Matura tier (Poziom podstawowy) for the math problems. In countries with Matura-based education, the basic-tier math test is not usually taken by mathematically inclined students -- it is just the law that anyone going to a public university has to pass some sort of math exam beforehand. Students who want to study anything where mathematics skills are needed would take the higher tier (Poziom rozszerzony). Can someone from Poland confirm this? A quick estimate of the percentage of high-school students taking the Polish Matura exam... (read more)
I do not think the ratio of the "AI solves hardest problem" and "AI has Gold" probabilities is right here. Paul was at the IMO in 2008, but he might have forgotten some details...
(My qualifications here: high IMO Silver in 2016, but more importantly I was a Jury member on the Romanian Master of Mathematics recently. The RMM is considered the harder version of the IMO, and shares a good part of the Problem Selection Committee with it.)
The IMO Jury does not consider "bashability" of problems as a decision factor, in the regime where the bashing would take go... (read more)
Before someone points this out: Non-disclosure-by-default is a negative incentive for the academic side, if they care about publication metrics.
It is not a negative incentive for Conjecture in such an arrangement, at least not in an obvious way.
Do you ever plan on collaborating with researchers in academia, like DeepMind and Google Brain often do? What would make you accept or seek such external collaboration?