I don't know why it sent only the first sentence; I was drafting a comment on this. I wanted to delete it but I don't know how. EDIT: wrote the full comment now.
Let me first say I dislike the conflict-theoretic view presented in the "censorship bad" paragraph. On the short list of social media sites I visit daily, moderation creates a genuinely better experience. Automated censorship will become an increasingly important force for good as generative models start becoming more widespread.
Secondly, there is a danger of AI safety becoming less robust—or even optimising for deceptive alignment—in models using front-end censorship.
This one is interesting, but only in the counterfactual: "if AI ethics technical research focused on actual value alignment of models as opposed to front-end censorship, this would have higher-order positive effects for AI x-safety". But it doesn't directly hurt AI x-safety research right now: we already work under the assumption that output filtering is not a solution for x-risk.

It is clear that improved technical research norms on AI non-x-risk safety can have positive effects on AI x-risk. If we could train a language model to robustly align to any set of human-defined values at all, this would be an improvement on the current situation. But there are other factors to consider. Is "making the model inherently non-racist" a better proxy for alignment than some other technical problems? Could interacting with that community weaken the epistemic norms in AI x-safety?
Calling content censorship "AI safety" (or even "bias reduction") severely damages the reputation of actual, existential AI safety advocates.
I would need to significantly update my prior if this turns out to be a very important concern. Who are the people, whose opinions will be relevant at some point, who understand both what AI non-x-safety and AI x-safety are about, dislike the former, are sympathetic to the latter, but conflate them?
Git Re-Basin: Merging Models modulo Permutation Symmetries [Ainsworth et al., 2022] and the cited The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks [Entezari et al., 2021] seem several years ahead. I cannot independently verify that their claims about SGD are true, but the paper makes sense at first glance. Opinion: symmetries in NNs are a mainstream ML research area with lots of papers, and I don't think doing research "from first principles" here will be productive. This also holds for many other alignment projects. However, I do think it makes sense as an alignment-positive research direction in general.
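For readers unfamiliar with the symmetry these papers exploit, here is a minimal toy sketch (my own illustration, not code from either paper): permuting the hidden units of an MLP layer, and applying the inverse permutation to the next layer's weights, leaves the network's function unchanged. This is the invariance that Git Re-Basin uses to align two independently trained models before interpolating between them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer MLP: f(x) = W2 @ relu(W1 @ x), with 8 hidden units.
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))
x = rng.normal(size=4)

def f(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0)

# Relabel the hidden units: apply a permutation matrix P to W1's rows
# and P^T to W2's columns. Since relu acts elementwise, it commutes
# with P, so the composed function is identical.
P = np.eye(8)[rng.permutation(8)]
W1p, W2p = P @ W1, W2 @ P.T

print(np.allclose(f(W1, W2, x), f(W1p, W2p, x)))  # True
```

With 8 hidden units there are already 8! = 40320 weight vectors computing the exact same function, which is why naive weight averaging of two trained models fails without first solving the alignment (re-basin) step.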
This is a mistake on my own part that actually changes the impact calculus, as most people looking into AI x-safety on this site will never actually see this post. Therefore, the "negative impact" section is retracted. I point to Ben's excellent comment for a correct interpretation of why we still care.

I do not know why I was not aware of this "block posts like this" feature, and I wonder if my experience of this forum was significantly more negative as a result of me accidentally clicking "Show Personal Blogposts" at some point. I did not even know that button existed.

No other part of my post is retracted. In fact, I'd like to reiterate a wish for the community to karma-enforce the norms of:
Thank you for improving my user experience of this site!
I am now slightly proud that my original disclaimer precisely said that this was the part I was unsure of the most.
As in, I wish to personally be called out on any violations of the described norms.
Do you intend for the comments section to be a public forum on the papers you collect?
I definitely endorse reading the ROME paper, although the popular-culture claims about what the second part of the paper actually shows seem a bit overblown.
They do not seem to claim "changing facts in a generalizable way" (it's likely not robust to synonyms at all). I am also wary of "editing just one MLP for a given fact" being the right solution, given that the causal tracing shows the fact being stored in several consecutive layers.
Refer to a writeup by Thibodeau et al. sometime in the future.
That being said, if you are into interpretability, you have to at least skim the paper. It has a whole bunch of very cool stuff in it, from the causal tracing to testing whether making Einstein a physician changes the meaning of the word "physics" itself. Just don't overfit on the methods there being exactly the methods that will solve interpretability of reasoning in transformers.
I somewhat agree, although I obviously put a bit less weight on your reason than you do. Maybe I should update my confidence in the importance of what I wrote to medium-high.
Let me raise the question of continuously rethinking incentives on LW/AF, for both Ben's reason and my original reason.
The upvote/karma system does not seem like it incentivizes high epistemic standards and top-rigor posts, although I would need more datapoints to make a proper judgement.
I am very sorry that you feel this way. I think it is completely fine for you, or anyone else, to have internal conflicts about your career or purpose. I hope you find a solution to your troubles in the following months.

Moreover, I think you did a useful thing by raising awareness of some important points:
Epistemic status for what follows: medium-high for the factual claims, low for the claims about potential bad optics. It might be that I'm worrying about nothing here.

However, I do not think this place should be welcoming of posts displaying bad rhetoric and epistemic practices.
Posts like this can hurt the optics of the research done in the LW/AF extended universe. What does a prospective AI x-safety researcher think when they get referred to this site and see this post above several alignment research posts?

EDIT: The above paragraph was off. See Ben's excellent reply for a better explanation of why anyone should care.

I think this place should be careful about maintaining:
For some examples:
My hotel room had the nightly price written on the inside of the door: $500. Shortly afterwards, I found out that the EA-adjacent community had bought the entire hotel complex.
I tried for 15 minutes to find a good-faith reading of this, but I could not.

Most people would read this as "the hotel room costs $500 and the EA-adjacent community bought the complex of which that hotel is a part", while it is written in a way that only insinuates and does not commit to meaning exactly that. Insinuating bad-optics facts while maintaining plausible deniability, without checking the facts, is a horrible practice, usually employed by politicians and journalists. The poster does not deliberately lie, but that is not enough when making a "very bad optics" statement that sounds like this one. At any point, they could have asked for the actual price of the hotel room, or about the state of the actual hotel purchase that might be happening.
I have never felt so obliged, so unpressured. If I produce nothing, before Christmas, then nothing bad will happen. Future funds will be denied, but no other punishment will ensue.
This is true. But it is not much different from working a normal software job. The worst thing that can happen is getting fired after not delivering for several months. Some people survive years of coasting until there is a layoff round.

An important counterfactual for a lot of people reading this is a PhD program. There is no punishment for failing to produce good research, except dropping out of the program after a few years.
After a while I work out why: every penny I’ve pinched, every luxury I’ve denied myself, every financial sacrifice, is completely irrelevant in the face of the magnitude of this wealth. I expect I could have easily asked for an extra 20%, and received it.
This might be true. Again, I think it would be useful to ask: what is the counterfactual?

All of this applies to anyone who starts working for Google or Facebook after being poor beforehand. This feeling (regretting saving rather than spending money) is incredibly common among people with good careers. One way to get rid of it, and save several lives in the meantime, is to donate the surplus to the Maximum Impact Fund.

I would suggest going through the post with a clear head and removing the parts which are not up to the standards. Again, I am very sorry that you feel like this.
On the other hand, the current community believes that getting AI x-safety right is the most important research question of all time. Most people would not publish something just for their career advancement if it meant sucking oxygen from more promising research directions.

This might be a mitigating factor for my comment above. I am curious about what happened in research fields which had "change/save the world" vibes. Was environmental science immune to similar issues?
because LW/AF do not have established standards of rigor like ML, they end up operating more like a less-functional social science field, where (I've heard) trends, personality, and celebrity play an outsized role in determining which research is valorized by the field.
In addition, the AI x-safety field is now rapidly expanding. There is a huge amount of status to be collected by publishing quickly and claiming large contributions. In the absence of rigor and metrics, the incentives are towards:

- setting new research directions and inventing new cool terminology;
- using mathematics in a way that impresses, but is too low-level to yield a useful claim;
- and vice versa, relying too much on complex philosophical insights without empirical work;
- getting approval from alignment research insiders.

See also the now-ancient Troubling Trends in Machine Learning Scholarship. I expect the LW/AF community microcosm will soon reproduce many of those failures.
I think the timelines (as in, <10 years vs 10-30 years) are very correlated with the answer to "will first dangerous models look like current models", which I think matters more for research directions than what you allow in the second paragraph. For example, interpretability in transformers might completely fail on some other architectures, for reasons that have nothing to do with deception. The only insight from the 2022 Anthropic interpretability papers I see having a chance of generalizing to non-transformers is the superposition hypothesis / SoLU discussion.