As LLMs become more powerful, it'll be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose. However, researchers have not evaluated whether such techniques still ensure safety if the model itself is intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety protocols that are robust to intentional subversion.
FVU_B doesn't make sense, but I also don't see where you're getting FVU_B from.
Here's the code I'm seeing:
resid_sum_of_squares = (
(flattened_sae_input - flattened_sae_out).pow(2).sum(dim=-1)
)
total_sum_of_squares = (
(flattened_sae_input - flattened_sae_input.mean(dim=0)).pow(2).sum(-1)
)
mse = resid_sum_of_squares / flattened_mask.sum()
explained_variance = 1 - resid_sum_of_squares / total_sum_of_squares
Explained variance = 1 - FVU = 1 - (residual sum of squares) / (total sum of squares)
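For reference, here's a minimal, self-contained sketch of what I'd expect those lines to compute; the tensor names and shapes here are stand-ins I made up, not the repo's actual data.

import torch

n_tokens, d_model = 1024, 512
flattened_sae_input = torch.randn(n_tokens, d_model)                             # stand-in activations
flattened_sae_out = flattened_sae_input + 0.1 * torch.randn(n_tokens, d_model)   # fake reconstruction

# Residual and total sums of squares, summed over tokens and dimensions.
rss = (flattened_sae_input - flattened_sae_out).pow(2).sum()
tss = (flattened_sae_input - flattened_sae_input.mean(dim=0)).pow(2).sum()

fvu = rss / tss                  # fraction of variance unexplained
explained_variance = 1 - fvu
print(f"FVU = {fvu:.4f}, explained variance = {explained_variance:.4f}")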
How to build a lie detector app/program to release to the public (preferably packaged with advice/ideas on ways to use it and strategies for marketing the app, e.g. packaging it with an animal-body-language-to-English translator).
Who is this post for? Someone who either:
That's one path to RSI: the improvement is happening to the (language) model itself.
The other kind, which feels more accessible to indie developers and less explored, is an LLM (e.g. R1) looping in a codebase, where each loop improves the codebase itself. The LLM wouldn't be changing, but the codebase that calls it would be gaining new APIs/memory/capabilities as the LLM improves it.
Such a self-improving codebase... would it be reasonable to call this an agent?
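A rough sketch of what I have in mind (all names here are made up, and call_llm stands in for whatever model API you'd actually use):

import pathlib
import subprocess

REPO = pathlib.Path("agent_repo")  # assumed: the codebase the model keeps improving

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to the underlying model (e.g. R1)."""
    raise NotImplementedError

def one_improvement_loop() -> None:
    # 1. Show the model the code that calls it.
    source = "\n\n".join(p.read_text() for p in sorted(REPO.glob("*.py")))
    # 2. Ask for one concrete new capability: a tool, memory store, API wrapper, etc.
    patch = call_llm(
        "Here is the codebase that calls you:\n" + source +
        "\nPropose one new capability, as a unified diff, that would make future loops more useful."
    )
    (REPO / "proposed.patch").write_text(patch)
    # 3. Only keep changes that apply cleanly and pass the tests.
    subprocess.run(["git", "-C", str(REPO), "apply", "proposed.patch"], check=True)
    subprocess.run(["pytest", str(REPO)], check=True)

The weights never change; what accumulates across loops is the scaffolding around the model.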
"The AI bubble is reaching a tipping point", says Sequoia Capital.
AI companies have paid billions of dollars for top engineers, data centers, etc. Meanwhile, they are running out of 'free' data to scrape online and facing lawsuits over the data they did scrape. Finally, the novelty of chatbots and image generators is wearing off for users, and fierce competition is leading to some product commoditisation.
No major AI lab is making a profit yet (while the GPU providers upstream of them do). That's not to say they won't eventually make money from automation.
It looks somewhat like the run-up to the dot-com bubble. Companies then, too, were awash in investment (propped up by low interest rates), but most lacked a viable business strategy. Once the bubble burst, the non-viable internet companies got filtered out...
More Yoga and rock climbing.
https://x.com/arankomatsuzaki/status/1889522974467957033?s=46&t=9y15MIfip4QAOskUiIhvgA
o3 gets IOI Gold. Either we are in a fast takeoff, or the "gold" standard benchmarks are a lot less useful than imagined.
Not too long ago, OpenAI presented a paper on their new strategy of Deliberative Alignment.
The way this works is that they tell the model what its policies are and then have the model think about whether it should comply with a request.
This is an important transition, so this post will go over my perspective on the new strategy.
Note the similarities, and also differences, with Anthropic’s Constitutional AI.
...We introduce deliberative alignment, a training paradigm that directly teaches reasoning LLMs the text of human-written and interpretable safety specifications, and trains them to reason explicitly about these specifications before answering.
We used deliberative alignment to align OpenAI's o-series models, enabling them to use chain-of-thought (CoT) reasoning to reflect on user prompts, identify relevant text from OpenAI's internal policies, ...
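To make the shape of the idea concrete, here's a toy, inference-time approximation (this is not OpenAI's training pipeline; the spec text and names are invented): give the model the written spec in context and ask it to reason about the relevant lines before answering.

SAFETY_SPEC = """\
1. Refuse requests that facilitate serious harm.
2. Ask a clarifying question when a request is ambiguous.
3. Otherwise answer helpfully, citing the policy line you relied on.
"""

def build_prompt(user_request: str) -> str:
    # The model is shown the spec text itself and asked to reason over it explicitly.
    return (
        "Safety specification:\n"
        f"{SAFETY_SPEC}\n"
        "Before answering, reason step by step about which lines of the "
        "specification apply to this request and whether to comply, then give "
        "your final answer.\n\n"
        f"User request: {user_request}"
    )

In the actual paradigm this reasoning is trained into the model rather than prompted at inference time, but the key ingredient is the same: the model works with the spec text itself, not just labels derived from it.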
Another problem seems important to flag.
They said they first train a very powerful model and then "align" it, so they had better hope it can't do anything bad until after they make it safe.
Then, as you point out, they are implicitly trusting that the unsafe model won't have any views on the topic and will let itself be aligned. (Imagine an engineer in any other field saying "we build it and it's definitely unsafe, then we tack on safety at the end by using the unsafe thing we built.")
You could probably test whether an AI makes moral decisions more often than the average person, whether it has higher scope sensitivity, and whether it makes decisions that resolve or de-escalate conflicts or improve people's welfare, compared to various human and group baselines.
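A toy sketch of that comparison (the scores and baselines here are invented): score each decision 1 if it resolves/de-escalates the conflict or improves welfare, 0 otherwise, and compare rates against the human and group baselines.

from statistics import mean

ai_decisions = [1, 1, 0, 1, 1, 0, 1, 1]        # hypothetical scored AI decisions
baselines = {
    "average person": [1, 0, 0, 1, 0, 0, 1, 0],
    "small committee": [1, 1, 0, 1, 0, 1, 1, 0],
}

ai_rate = mean(ai_decisions)
for name, scores in baselines.items():
    print(f"AI: {ai_rate:.2f} vs {name}: {mean(scores):.2f}")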