One idea is to use a base LLM with no RLHF, compute the perplexity of the reasoning text, and add it as an additional term in the loss function. That should help with comprehensibility, but it doesn't necessarily help with steganography. To disincentivize steganography, you could add noise to the reasoning in various ways, and remove any incentive for terseness, to ensure the model isn't trying to squeeze more communication into a limited token budget.
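A rough sketch of what that penalty term could look like, assuming a HuggingFace-style causal LM interface for the frozen base model (the function and argument names here are just illustrative):

```python
import torch
import torch.nn.functional as F

def readability_penalty(reasoning_ids, frozen_base_model):
    """Log-perplexity of the reasoning text under a frozen base LLM (no RLHF).
    High values mean the base model finds the reasoning surprising / hard to read."""
    with torch.no_grad():
        logits = frozen_base_model(reasoning_ids).logits   # (batch, seq, vocab)
    # Mean next-token negative log-likelihood of the reasoning (shift targets by one).
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        reasoning_ids[:, 1:].reshape(-1),
    )

# In an RL-style setup this would typically be folded into the reward rather
# than differentiated directly, e.g.:
#   reward = task_reward - penalty_weight * readability_penalty(reasoning_ids, base_model)
```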
A basic idea for detecting steganography is to monitor next-token probabilities for synonym pairs. If they tend to diverge over time, that suggests the model is assigning additional hidden meaning. (This might fail if the synonym embeddings are too close.)
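A minimal sketch of that monitor, again assuming a HuggingFace-style model and tokenizer; the synonym list is made up, and multi-token words are only handled roughly by taking the first sub-token:

```python
import torch
import torch.nn.functional as F

# Illustrative synonym pairs; any reasonable list would do.
SYNONYM_PAIRS = [("big", "large"), ("start", "begin"), ("buy", "purchase")]

def synonym_divergence(model, tokenizer, prompts, pairs=SYNONYM_PAIRS):
    """Mean |log P(a) - log P(b)| for each synonym pair, averaged over prompts.
    If this grows across training checkpoints, the model may be attaching
    extra (possibly hidden) meaning to one member of the pair."""
    gaps = {pair: [] for pair in pairs}
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logprobs = F.log_softmax(model(ids).logits[0, -1], dim=-1)
        for a, b in pairs:
            # Use the first sub-token of each word as a rough proxy.
            ia = tokenizer(" " + a, add_special_tokens=False).input_ids[0]
            ib = tokenizer(" " + b, add_special_tokens=False).input_ids[0]
            gaps[(a, b)].append((logprobs[ia] - logprobs[ib]).abs().item())
    return {pair: sum(vals) / len(vals) for pair, vals in gaps.items()}
```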
My preferred approach to CoT would be something like:
Give human raters the task of next-token prediction on a large text corpus. Have them write out their internal monologue when trying to predict the next word in a sentence.
Train a model to predict the internal monologue of a human rater, conditional on previous tokens.
Train a second model to predict the next token in the corpus, conditional on previous tokens in the corpus and also the written internal monologue.
Only combine the above two models in production.
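A rough sketch of how the two models might compose at inference time (the model objects and their interfaces are hypothetical placeholders, not any real API):

```python
def predict_next_token(context, monologue_model, next_token_model):
    """Compose the two models described above. Both model objects and their
    method signatures are hypothetical."""
    # Step 1: imitate a human rater's written monologue about what comes next.
    monologue = monologue_model.generate(context=context)

    # Step 2: predict the next corpus token, with the monologue available as input.
    next_token = next_token_model.predict(context=context, monologue=monologue)

    # Keep the monologue around as an interpretable trace of the "thinking".
    return next_token, monologue
```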
Now that you've embedded CoT in the base model, maybe it will be powerful enough that you can discard RLHF and replace it with some sort of fine-tuning on PhDs roleplaying as a helpful/honest/harmless chatbot.
Basically give the base model a sort of "working memory" that's incentivized for maximal human imitativeness and interpretability. Then you could build an interface where a person can mouse over any word in a sentence and see what the model was 'thinking' when it chose that word. (Realistically you wouldn't do this for every word in a sentence, just the trickier ones.)
If that's true, perhaps the performance penalty for pinning/freezing the weights of the 'internals' prior to post-training would be low. That could ease interpretability a lot, since you wouldn't need to worry so much about internals that post-training never touched.
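In PyTorch terms, the freezing step itself is cheap. A minimal sketch, where the layer-name prefixes are placeholders rather than any particular architecture:

```python
import torch.nn as nn

def freeze_internals(model: nn.Module, trainable_prefixes=("lm_head",)):
    """Freeze every parameter except those whose names start with one of
    `trainable_prefixes` (placeholders; pick whichever blocks post-training
    actually needs to update)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    n_total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {n_trainable:,} / {n_total:,}")
```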
On LessWrong, there's a comment section where hard questions can be asked and are asked frequently.
In my experience, asking hard questions here is quite socially unrewarding. I could probably think of a dozen or so cases where I think the LW consensus "emperor" has no clothes but haven't posted about them, just because I expect it to be an exercise in frustration. I think I will probably quit posting here soon.
I don't think AI policy is a good example for discourse on LessWrong. There are strategic reasons to be less transparent about how to affect public policy than about most other topics.
In terms of advocacy methods, sure. In terms of desired policies, I generally disagree.
Everything that's written publicly can be easily picked up by journalists wanting to write stories about AI.
If that's what we are worried about, there is plenty of low-hanging fruit in terms of e.g. not tweeting wildly provocative stuff for no reason. (You can ask for examples, but be warned, sharing them might increase the probability that a journalist writes about them!)
"The far left is censorious" and "Republicans are censorious" are in no way incompatible claims :-)
Great post. Self-selection seems huge for online communities, and I think it's no different on these fora.
Confidence level: General vague impressions and assorted thoughts follow; could very well be wrong on some details.
A disagreement I have with both the rationalist and EA communities is what the process of coming to robust conclusions looks like. In those communities, it seems like the strategy is often to identify a few super-geniuses who go do a super-deep analysis, and come to a conclusion that's assumed to be robust and trustworthy. See the "Groupthink" section on this page for specifics.
From my perspective, I would rather see an ordinary-genius do an ordinary-depth analysis, and then have a bunch of other people ask a bunch of hard questions. If the analysis holds up against all those hard questions, then the conclusion can be taken as robust.
Everyone brings their own incentives, intuitions, and knowledge to a problem. If a single person focuses a lot on a problem, they run into diminishing returns on the number of angles of attack they can generate. It seems more effective to generate lots of angles of attack by taking the union of everyone's thoughts.
From my perspective, placing a lot of trust in top EA/LW thought leaders ironically makes them less trustworthy, because people stop asking whether the emperor has clothes.
The problem with saying the emperor has no clothes is: Either you show yourself a fool, or else you're attacking a high-status person. Not a good prospect either way, in social terms.
EA/LW communities are an unusual niche with opaque membership norms, and people may want to retain their "insider" status. So they do extra homework before accusing the emperor of nudity, and might just procrastinate indefinitely.
There can also be a subtle aspect of circular reasoning to thought leadership: "we know this person is great because of their insights", but also "we know this insight is great because of the person who said it". (Certain celebrity users on these fora get 50+ positive karma on basically every top-level post. Hard to believe that the authorship isn't coloring the perception of the content.)
A recent illustration of these principles might be the pivot to AI Pause. IIRC, it took a "super-genius" (Katja Grace) writing a super long post before Pause became popular. If an outsider simply said: "So AI is bad, why not make it illegal?" -- I bet they would've been downvoted. And once that's downvoted, no one feels obligated to reply. (Note, also -- I don't believe there was much reasoning transparency regarding why the pause strategy was considered unpromising at the time. You kinda had to be an insider like Katja to know the reasoning in order to critique it.)
In conclusion, I suspect there are a fair number of mistaken community beliefs which survive because (1) no "super-genius" has yet written a super-long post about them, and (2) poking around by asking hard questions is disincentivized.
Yeah, I think there are a lot of underexplored ideas along these lines.
It's weird how so much of the internet seems locked into either the reddit model (upvotes/downvotes) or the Twitter model (likes/shares/followers), when the design space is so much larger than that. Someone like Aaron, who played such a big role in shaping the internet, seems more likely to have a gut-level belief that it can be shaped. I expect there are a lot more things like Community Notes that we could discover if we went looking for them.
I've always wondered what Aaron Swartz would think of the internet now, if he were still alive. He had far-left politics, but also seemed to be a big believer in openness, free speech, crowdsourcing, etc. When he was alive, those were very compatible positions, and Aaron was practically the poster child for holding both of them. Nowadays the far left favors speech restrictions and is cynical about the internet.
Would Aaron have abandoned the far left, now that they are censorious? Would he have become censorious himself? Or would he have invented some clever new technology, like RSS or reddit, to try and fix the internet's problems?
Just goes to show what a tragedy death is, I guess.
I expect escape will happen a bunch
Are you willing to name a specific year/OOM such that if there are no publicly known cases of escape by that year/OOM, you would be surprised? What, if anything, would you acknowledge as evidence that alignment is easier than you thought, here?
To ensure the definition of "escape" is not gerrymandered -- do you know of any cases of escape right now? Do you think escape has already occurred and you just don't know about it? "Escape" means something qualitatively different from any known event up to this point, yes? Does it basically refer to self-exfiltration of weights which was not requested by any human? Can we get a somewhat precise definition by any chance?
Sure, there will be errors, but how important will those errors be?
Humans currently control the trajectory of humanity, and humans are error-prone. If you replace humans with something that's error-prone in similar ways, that doesn't seem like it's obviously either a gain or a loss. How would such a system compare to an em of a human, for example?
If you want to show that we're truly doomed, I think you need additional steps beyond just "there will be errors".
OpenAI behaves in a generally antisocial way, inconsistent with its charter, yet other power centers haven't reined it in. Even in the EA and rationalist communities, people don't seem to have asked questions like "Is the charter legally enforceable? Should people besides Elon Musk be suing?"
If an idea is failing in practice, it seems a bit pointless to discuss whether it will work in theory.