Alex Turner and collaborators show that you can modify GPT-2's behavior in surprising and interesting ways by just adding activation vectors to its forward pass. This technique requires no fine-tuning and allows fast, targeted modifications to model behavior.
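To make "adding activation vectors to the forward pass" concrete, here is a minimal toy sketch in pure Python. It is not GPT-2 and not the authors' code: the two-layer residual "model", its weights, and the helper names (`layer`, `forward`, `activations_at`) are all hypothetical. It also assumes one common way of building the vector, namely taking the difference between the activations of two contrasting inputs at a chosen layer.

```python
import math

def layer(x, weights):
    """One toy residual block: x + tanh(W x). Hypothetical stand-in for a transformer block."""
    update = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]
    return [xi + ui for xi, ui in zip(x, update)]

def activations_at(x, layers, layer_index):
    """Return the residual stream as it enters layer `layer_index`."""
    for weights in layers[:layer_index]:
        x = layer(x, weights)
    return x

def forward(x, layers, steer_at=None, steering_vector=None, coeff=1.0):
    """Run the toy model, optionally adding coeff * steering_vector
    to the residual stream entering layer `steer_at`. Weights are never changed."""
    for i, weights in enumerate(layers):
        if i == steer_at and steering_vector is not None:
            x = [xi + coeff * vi for xi, vi in zip(x, steering_vector)]
        x = layer(x, weights)
    return x

# Fixed toy weights (assumed values, chosen only for illustration).
layers = [
    [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.4, 0.1, 0.9]],
    [[0.2, 0.7, -0.3], [-0.6, 0.4, 0.2], [0.1, -0.1, 0.5]],
]

# Two contrasting inputs; their activation difference is the steering vector.
pos, neg = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
steer_layer = 1
a_pos = activations_at(pos, layers, steer_layer)
a_neg = activations_at(neg, layers, steer_layer)
steering_vector = [p - n for p, n in zip(a_pos, a_neg)]

prompt = [0.2, 0.2, 0.2]
baseline = forward(prompt, layers)
steered = forward(prompt, layers, steer_at=steer_layer,
                  steering_vector=steering_vector, coeff=2.0)
print("baseline:", baseline)
print("steered: ", steered)
```

The point of the sketch is that the intervention happens entirely at inference time: the weights in `layers` are untouched, and setting `coeff` to 0 recovers the unmodified output.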
I think Anthropic did tests like this, e.g. in Alignment Faking in Large Language Models.
But I guess that's more of a "test how they behave in adversarial situations" study. If you're talking about a "test how to fight against them" study, that consists of "red teams" trying to hack various systems to make sure they are secure.
I'm not sure if the red teams used AI, but they are smart people, and if AI improved their hacking ability I'm sure they would use it. So they're already stronger than AI.
I've gotten a lot of value out of the details of how other people use LLMs, so I'm delighted that Gavin Leech created a collection of exactly such posts (link should go to the right section of the page but if you don't see it, scroll down).
Some additions from me:
See also:
https://borretti.me/article/how-i-use-claude
I was planning on putting a whole list here but alas I am drawing a blank.
There's a lot of dispersed wisdom in @Zvi's stack too but I can't remember any sufficiently distinctive keywords to find it.
...Furthermore, the most advanced reasoning models seem to be doing an increasing amount of reward hacking and resorting to more cheating in order to produce the answers that humans want. Not only will this mean that some of the benchmark scores may become unreliable, it means that it will be increasingly hard to get productive work out of them as their intelligence increases and they get better at fulfilling the letter of the task in ways that don't meet the spirit of it.
Thanks for this! This is a good point. Do you think you can go further and say wh
Ha, thinking back to childhood I get it now, it's the influence of the layout of the school daily journal in USSR/Ukraine, like https://cn1.nevsedoma.com.ua/images/2011/33/7/10000000.jpg
A_donor wrote up some thoughts we had: https://www.lesswrong.com/posts/3eXwKcg3HqS7F9s4e/swe-automation-is-coming-consider-selling-your-crypto-1
This is a cross-post from https://250bpm.substack.com/p/accountability-sinks
...Back in the 1990s, ground squirrels were briefly fashionable pets, but their popularity came to an abrupt end after an incident at Schiphol Airport on the outskirts of Amsterdam. In April 1999, a cargo of 440 of the rodents arrived on a KLM flight from Beijing, without the necessary import papers. Because of this, they could not be forwarded on to the customer in Athens. But nobody was able to correct the error and send them back either. What could be done with them? It’s hard to think there wasn’t a better solution than the one that was carried out; faced with the paperwork issue, airport staff threw all 440 squirrels into an industrial shredder.
[...]
It turned out that the order to destroy
I guess it was usually not worth bothering with prosecuting disobedience as long as it was rare. If ~50% of soldiers had been refusing to follow these orders, surely the Nazi repression machine would have set up a process to deal with them effectively and solved the problem.
Not just central banks, but also the U.S. going off the gold standard and then fiddling with bond yields to cover up the ensuing inflation, maybe?
Though, given my doomerism, I think the natsec framing of the AGI race is likely wrongheaded, let me accept the Dario/Leopold/Altman frame that AGI will be aligned with the national interest of a great power. These people seem to take as an axiom that a USG AGI will be better in some way than a CCP AGI. Has anyone written a justification for this assumption?
I am neither an American citizen nor a Chinese citizen.
What would it mean for an AGI to be aligned with "Democracy," or "Confucianism," or "Marxism with Chinese characteristics," or "the American Constitution"? Contingent on a world where such an entity exists and is compatible with my existence, what would my life be like in a weird transhuman future as a non-citizen in each...
I notice you're talking a lot about the values of American people but only talk about what the leaders of China are doing or would do.
If you just compare both leaders' likelihood of enacting a world government, once again there is no clear winner.
And if the intelligence of the governing class is of any relevance to the likelihood of a positive outcome, um, CCP seems to have USG beat hands down.
--Intelligence is only a positive sign when the agent that is intelligent cares about you.
I'm interpreting this as "intelligence is irrelevant if the CCP doesn'...
It is an extension of the filter bubbles and polarisation issues of the social media era, but yes it is coming into its own as a new and serious threat.