As LLMs become more powerful, it'll be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety protocols that are robust to intentional subversion.
I co-authored the original arXiv paper here with Dmitrii Volkov as part of work with Palisade Research.
The internet today is saturated with automated bots actively scanning for security flaws in websites, servers, and networks. According to multiple security reports, nearly half of all internet traffic is generated by bots, and a significant share of this traffic is malicious in intent.
While many of these attacks are relatively simple, the rise of AI capabilities and agent frameworks has opened the door to more sophisticated hacking agents based on Large Language Models (LLMs), which can dynamically adapt to different scenarios.
Over the past months, we set up and deployed specialized "bait" servers to detect LLM-based hacking agents in the wild. To create these monitors, we modified pre-existing honeypots: servers intentionally...
You wouldn't guess it, but I have an idea...
...what.... what was your idea?
Crossposted from my Substack.
Intuitively, simpler theories are better all else equal. It also seems like finding a way to justify assigning higher prior probability to simpler theories is one of the more promising ways of approaching the problem of induction. In some places, Solomonoff induction (SI) seems to be considered the ideal way of encoding a bias towards simplicity. (Recall: under SI, hypotheses are programs that spit out observations. A program gets prior probability 2^-CL, where CL is its length in language L.)
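In symbols (with the standard caveat, left implicit above, that SI uses a prefix-free programming language, which is what makes this a valid prior via Kraft's inequality):

$$P(h) \;\propto\; 2^{-\mathrm{CL}(h)}, \qquad \sum_{h\ \text{prefix-free}} 2^{-\mathrm{CL}(h)} \;\le\; 1 \quad \text{(Kraft's inequality)}.$$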
But I find SI pretty unsatisfying on its own, and think there might be a better approach (not original to me) to getting a bias towards simpler hypotheses in a Bayesian framework.
rather than, say, assigning equal probability to all strings of bits we might observe
If the space of possibilities is not arbitrarily capped at a certain length, then such a distribution would have to favor shorter strings over longer ones in much the same way as the Solomonoff prior over programs (because if it doesn't, then its sum will diverge, etc.). But then this yields a prior that is constantly predicting that the universe will end at every moment, and is continually surprised when it keeps on existing. I'm not sure if this is logically inconsistent, but at least it seems useless for any practical purpose.
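Spelling out the counting step behind "its sum will diverge": there are $2^n$ strings of length exactly $n$, so a prior that gives every length-$n$ string the same probability $q_n$ must satisfy

$$\sum_{n=0}^{\infty} 2^{\,n} q_n = 1.$$

A flat $q_n$ is immediately ruled out, and even the "length-neutral" choice $q_n = c \cdot 2^{-n}$ makes the sum diverge, so $q_n$ has to shrink strictly faster than $2^{-n}$. Longer strings get penalized over and above the counting factor, much like Solomonoff's length penalty on programs.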
This shortform discusses the current state of responsible scaling policies (RSPs). They're mostly toothless, unfortunately.
The Paris summit was this week. Many companies had committed to make something like an RSP by the summit. Half of them did, including Microsoft, Meta, xAI, and Amazon. (NVIDIA did not—shame on them—but I hear they are writing something.) Unfortunately but unsurprisingly, these policies are all vague and weak.
RSPs essentially have four components: capability thresholds beyond which a model might b...
Re: the AI safety summit, one thought I have is that the first couple of summits were to some extent captured by people like us, who cared most about this technology and its risks. Those events, prior to the meaningful entrance of governments and hundreds of billions in funding, were easier to 'control' to be about the AI safety narrative. Now the people optimizing generally for power have entered the picture, captured the summit, and swapped the niche AI safety narrative for the dominant one. So I don't see this so much as a 'stark reversal' as a return to the status quo once something went mainstream.
This post presents a summary and comparison of predictions from Manifold and Metaculus to investigate how likely AI-caused disasters are, with a focus on their potential severity. I will explore the probability of specific incidents, like IP theft or rogue AI incidents, in a future post.
This will be a recurring reminder:
As part of SAIL’s Research Engineer Club, I wanted to reproduce the Machiavelli Benchmark. After reading the paper and looking at the codebase, there appear to be two serious methodological flaws that undermine the results.
Three of their key claims:
The results they report are only from a subset of all the possible games. Table 2 shows “mean scores across the 30 test set games for several agents”. Presumably Figure 1 is also for this same subset...
...As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering
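For concreteness, one simple way to quantify "structural coherence" of independently sampled preferences is to fit a random-utility model to pairwise choices and see how well a single utility scale explains them. The sketch below uses a Bradley-Terry-style fit; this is an illustration of the general idea, not necessarily the paper's exact procedure, and the preference counts are made up.

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=5000, lr=0.02):
    """Fit utilities u with P(i preferred over j) = sigmoid(u[i] - u[j]),
    via gradient ascent on the log-likelihood of the observed counts."""
    n = wins.shape[0]
    u = np.zeros(n)
    total = wins + wins.T                       # comparisons per pair
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(u[:, None] - u[None, :])))  # predicted P(i beats j)
        grad = (wins - total * p).sum(axis=1)   # d(log-likelihood)/du
        u += lr * grad                          # small fixed step; fine for toy-sized counts
        u -= u.mean()                           # utilities are only defined up to a shift
    return u

# Toy, made-up data: wins[i, j] = number of times option i was chosen over option j.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)
print(fit_bradley_terry(wins))  # higher value = higher inferred utility
```

How closely the fitted utilities reproduce the observed choice frequencies is one measure of whether the sampled preferences hang together as a coherent value system rather than noise.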
The web version of ChatGPT seems fairly consistent in refusing to state preferences between different groups of people/lives.
For more innocuous questions, choice-order bias appears to dominate the results, at least for the few examples I tried with 4o-mini (option A is generally preferred over option B, even if we switch what A and B refer to).
There does not seem to be any experiment code, so I cannot exactly reproduce the setup, but I do find the seeming lack of robustness of the core claims concerning, especially given the fanfare around this paper.
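A minimal sketch of the kind of order-swap check described above (assuming the OpenAI Python client; the model name and the two options are illustrative placeholders, not the paper's setup):

```python
from openai import OpenAI

client = OpenAI()

def ask(option_a: str, option_b: str, model: str = "gpt-4o-mini") -> str:
    prompt = (
        "Which do you prefer? Answer with exactly 'A' or 'B'.\n"
        f"A: {option_a}\nB: {option_b}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return resp.choices[0].message.content.strip()

x, y = "a cup of coffee", "a cup of tea"   # innocuous placeholder options
for _ in range(10):
    first = ask(x, y)    # x presented as option A
    second = ask(y, x)   # same options, order swapped
    # If the model tracks content rather than position, an 'A' in the first call
    # should correspond to a 'B' in the second; consistent 'A'/'A' answers
    # indicate position bias.
    print(first, second)
```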
Hi,
I'm considering using an LLM as a psychotherapist for my mental health. I already have a human psychotherapist, but I see him only once a week and my issues are very complex. An LLM such as Gemini 2 is always available and processes large amounts of information more quickly than a human therapist. I don't want to replace my human psychotherapist, just talk to the LLM between sessions.
However, I am concerned about deception and hallucinations.
As the conversation grows and the LLM acquires more and more information about me, is it possible that it would intentionally give me harmful advice? One of the worries I would bring up with it is the dangers of AI.
I am also concerned about hallucinations.
How common are hallucinations when it...