This wasn't part of my original reasoning, but I went and did a search for other uses of "alignment technique" in tag descriptions. There's one other instance that I can find, which I think could also stand to be rewritten, but at least in that case it's quite far down the description, well after the object-level details about the proposed technique itself.
Two reasons:
First, the change made the sentence much worse to read. It might not have been strictly ungrammatical, but it was bad English.
Second, I expect that the average person, unfamiliar with the field, would be left with a thought-terminating mental placeholder after reading the changed description. What does "is an alignment technique" mean? Despite being in the same sentence as "is a machine learning technique", it is not serving anything like the same role, in terms of the implicit claims it makes. Intersubjective agreement on what "is an alignment technique" means will be far worse than on "is a machine learning technique", and many implications of the first claim are far more contentious than of the second.
To me, "is an alignment technique" does not convey useful detail about the technique itself, but about how various people in the community relate to it (and similar sociological details). If you want to describe that kind of detail explicitly, that's one thing[1]. But it's actively confusing to conflate it with technical detail about the technique itself.
Though it's not the kind of detail that should live in the first sentence of the tag description, probably.
Almost no specific (interesting) output is information that's already been generated by any model, in the strictest sense.
reasonably publicly accessible by an ordinary person from sources other than a covered model or covered model derivative
Seems like it'd pretty obviously cover information generated by non-covered models that are routinely used by many ordinary people (as open source image models currently are).
As a sidenote, I think the law is unfortunately one of those pretty cursed domains where it's hard to be very confident of anything as a layman without doing a lot of your own research, and you can't even look at experts speaking publicly on the subject since they're often performing advocacy, rather than making unbiased predictions about outcomes. You could try to hire a lawyer for such advice, but it seems to be pretty hard to find lawyers who are comfortable giving their clients quantitative (probabilistic) and conditional estimates. Maybe this is better once you're hiring for e.g. general counsel of a large org, or maybe large tech company CEOs have to deal with the same headaches that we do. Often your best option is to just get a basic understanding of how relevant parts of the legal system work, and then do a lot of research into e.g. relevant case law, and then sanity-check your reasoning and conclusions with an actual lawyer specialized in that domain.
Not your particular comment on it, no.
Notwithstanding the tendentious assumption in the other comment thread that courts are maximally adversarial processes bent on misreading legislation to achieve their perverted ends, I would bet that the relevant courts would not in fact rule that a bunch of deepfaked child porn counted as "Other grave harms to public safety and security that are of comparable severity to the harms described in subparagraphs (A) to (C), inclusive", where those other things are "CBRN > mass casualties", "cyberattack on critical infra", and "autonomous action > mass casualties". Happy to take such a bet at 2:1 odds.
But there's a simpler reason that particular hypothetical fails:
See:
(2) “Critical harm” does not include any of the following:
(A) Harms caused or materially enabled by information that a covered model or covered model derivative outputs if the information is otherwise reasonably publicly accessible by an ordinary person from sources other than a covered model or covered model derivative.
It does not actually make any sense to me that Mira wanted to prevent leaks, and therefore didn't even tell Sam that she was leaving ahead of time. What would she be afraid of, that Sam would leak the fact that she was planning to leave... for what benefit?
Possibilities:
We recently had a security incident where an attacker used an old AWS access key to generate millions of tokens from various Claude models via AWS Bedrock. While we don't have any specific reason to think that any user data was accessed (and some reasons[1] to think it wasn't), most possible methods by which this key could have been found by an attacker would also have exposed our database credentials to the attacker. We don't know yet how the key was leaked, but we have taken steps to reduce the potential surface area in the future and rotated relevant credentials. This is a reminder that LessWrong does not have Google-level security and you should keep that in mind when using the site.
The main reason we don't think any user data was accessed is that this attack bore several signs of being part of a larger campaign, and our database also contains other LLM API credentials which would not have been difficult to find via a cursory manual inspection. Those credentials don't seem to have been used by the attackers. Larger hacking campaigns like this are mostly automated, and for economic reasons the organizations conducting those campaigns don't usually sink time into manually inspecting individual targets for random maybe-valuable stuff that isn't part of their pipeline.
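For anyone wondering what "rotated relevant credentials" involves in practice, here's a minimal sketch (not our actual tooling; the 90-day threshold is an arbitrary example) of the kind of audit that flags stale AWS access keys before they're forgotten:

```python
# Minimal sketch (hypothetical, not LessWrong's actual tooling): list IAM
# access keys older than a rotation threshold so they can be rotated or revoked.
from datetime import datetime, timedelta, timezone

import boto3  # assumes AWS credentials are already configured in the environment

MAX_KEY_AGE = timedelta(days=90)  # example rotation policy, not a recommendation

iam = boto3.client("iam")
now = datetime.now(timezone.utc)

# Walk every IAM user and inspect the age and status of each access key.
for page in iam.get_paginator("list_users").paginate():
    for user in page["Users"]:
        metadata = iam.list_access_keys(UserName=user["UserName"])["AccessKeyMetadata"]
        for key in metadata:
            age = now - key["CreateDate"]
            if age > MAX_KEY_AGE:
                print(
                    f"{user['UserName']}: key {key['AccessKeyId']} is "
                    f"{age.days} days old ({key['Status']}) -- consider rotating"
                )
```

Running something like this on a schedule (plus alerting on unexpected Bedrock usage in CloudTrail) is the sort of thing meant by "reducing the potential surface area"; the details of what we actually did are deliberately left out here.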
I agree with your top-level comment but don't agree with this. I think the swipes at midwits are bad (particularly on LessWrong) but think it can be very valuable to reframe basic arguments in different ways, pedagogically. If you parse this post as "attempting to impart a basic intuition that might let people (new to AI x-risk arguments) avoid certain classes of errors" rather than "trying to argue with the bleeding-edge arguments on x-risk", this post seems good (if spiky, with easily trimmed downside).
And I do think "attempting to impart a basic intuition that might let people avoid certain classes of errors" is an appropriate shape of post for LessWrong, to the extent that it's validly argued.
I don't really see how this is responding to my comment. I was not arguing about the merits of RLHF along various dimensions, or what various people think about it, but pointing out that calling something "an alignment technique" with no further detail is not helping uninformed readers understand what "RLHF" is better (but rather worse).
Again, please model an uninformed reader: how does the claim "RLHF is an alignment technique" constrain their expectations? If the thing you want to say is that some of the people who invented RLHF saw it as an early stepping stone to solving more challenging problems with alignment later, I have no objection to that claim. This is a claim about the motivations and worldviews of those people. But I don't know what sort of useful work "RLHF is an alignment technique" is doing, other than making claims that are not centrally about RLHF itself.