matansok — LessWrong

I think this has a very high chance of success, but trades off the reliability of the probe - if the NLA somehow learns how to hide misalignment semantically (and not steganographically), this type of probe becomes useless.

No77e's Shortform

matansok2mo21

This seems anecdotal.

So far, we have documented cases of Generative AI being used to subvert elections in Romania (actually causing an annulment), and to some extent in NYC. We also have this report by OpenAI from 2024, which details such influence operations facilitated using OAI's API. Given the proliferation of significantly more capable open-source models in the 18 months since, we can be fairly confident that broader, more complex operations are taking place today.

I also tend to associate AI-slop with low quality content, but we know AI is more capab... (read more)

ChristianKl's Shortform

matansok3mo2-1

I mean... The Iranians did try to assassinate Trump.

I'm saying that propaganda efforts are significantly more effective against democracies. Authoritarians receive no penalty for lying, and are less susceptible to lies by their enemy. Their local media is 100% controlled. This is why they lie blatantly all the time - this is their main lever for "winning", since they have basically zero means of military resistance.

I think martyr death might invigorate people in the short term, but it's much harder to argue that the war was "won" when your leader died, and... (read more)

ChristianKl's Shortform

matansok3mo-4-9

Thinking about this as a war between two "countries" is, in my opinion, missing a large part of the picture. They would take out Trump if they could.

In a conflict between a democracy and an autocracy, the democracy is usually limited by domestic support, while the autocracy can just continuously reframe the narrative and keep on going pretty much indefinitely. This is true even at an enormous military disadvantage - the authoritarian side can force the democracy to surrender purely by psychology, and this is massively exploitable in the information age.

For... (read more)

Large-Scale Online Deanonymization with LLMs

matansok3mo71

Would be interesting to see if these results hold up in a more adversarial setting. In both cross-platform matching and matching split accounts, the users are not actively trying to remain anonymous, and would be less averse to sharing identifiable information.

I would argue that if de-anonymization is possible, it's better if a white-hat researcher does it first, a-la responsible disclosure.

faul_sname's Shortform

matansok7mo30

I don't necessarily disagree, but one big thing is freedom of speech. If the party line is to go big on AI, which is likely, given current investments, I'm not betting on Chinese Yudkowsky. Same with frontier lab whistleblowers, etc

Burny's Shortform

matansok8mo10

I feel like I'm missing something - the model doesn't "realize" that its CoT is monitored and then "decides" whether to put something in it or not, the CoT is generated and then used to produce the final answer. Since the standard practice is not to use CoT output for evaluation during training, there's no direct optimization pressure to present an aligned CoT.

The only way I can see this working is through indirect optimization pressure - e.g. if misaligned CoT leads to misaligned output, which causes the CoT to also present as aligned