Help clear something up for me: I am extremely confused (theoretically) how we can simultaneously have:
1. An Artificial Superintelligence
2. It be controlled by humans (therefore creating misuse or concentration-of-power issues)
My intuition is that once it reaches a particular level of power it will be uncontrollable. Unless people are saying that we can have models 100x more powerful than GPT-4 without their having any agency?
I worked at OpenAI for three years, from 2021-2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to t...
Are you familiar with US NDAs? I'm sure there are lots of clauses that have been ruled invalid by case law. In many cases, non-lawyers have no idea about these, so you might be able to make a difference with very little effort. There is also the possibility that valuable OpenAI shares could be rescued?
If you haven't seen it, check out this thread where one of the OpenAI leavers did not sign the gag order.
I’m confused: if the dating apps keep getting worse, how come nobody has come up with a good one, or at least a clone of OkCupid? As far as I can tell, neither “a good matching system is somehow less profitable than making people swipe all the time (surely it’d still be profitable in absolute terms)” nor “it requires a decently big initial investment” can explain a complete lack of good products in an area with so much demand. Has anyone dug into it / tried to start a good dating app as a summer project?
People try new dating platforms all the time. It's what Y Combinator calls a tarpit. The problem sounds solvable, but the solution is elusive.
As I have said elsewhere: Dating apps are broken because the incentives of the usual core approach don't work.
On the supplier side: Misaligned incentives (keep users on the platform) and opaque algorithms lead to bad matches.
On the demand side: Misaligned incentives (first impressions, low cost to exit) and no plausible deniability lead to predators being favored.
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders of the AI safety landscape give some money or social reward for this?
I guess reimbursing everything Daniel lost might be a bit too much for funders, but providing some money would have very high value, both to reward the act and to incentivize future safety people not to sign NDAs.
Notably, there are some lawyers here on LessWrong who might help (possibly even for the lols, you never know). And you can look at case law and guidance to see if clauses are actually enforceable or not (many are not). To anyone reading, here's habryka doing just that
Interest groups without an organizer.
This is a product idea that solves a large coordination problem. With billions of people, there could be a huge number of groups of people sharing multiple interests. But currently, the number of valuable groups of people is limited by a) the number of organizers and b) the number of people you meet via a random walk. Some progress has been made on (b) with better search, but it is difficult to make (a) go up because of human tendencies - most people are lurkers - and the incentive to focus on one area to stand out. So what...
New concept for my "qualia-first calibration" app idea that I just crystallized. The following are all the same "type":
1. "this feels 10% likely"
2. "this feels 90% likely"
3. "this feels exciting!"
4. "this feels confusing :("
5. "this is coding related"
6. "this is gaming related"
All of them are a thing you can track: "when I observe this, my predictions turn out to come true N% of the time".
Numerical probabilities are merely a special case (though they still get additional tooling, since it's easier to visualize graphs and calculate Brier scores for them)
And then ...
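The tracking described above can be sketched in a few lines. This is a minimal sketch, not the app itself; the tag names and observation format are invented for illustration:

```python
from collections import defaultdict

# tag -> [times the prediction came true, total observations]
counts = defaultdict(lambda: [0, 0])

def record(tag, came_true):
    counts[tag][0] += int(came_true)
    counts[tag][1] += 1

def hit_rate(tag):
    hits, total = counts[tag]
    return hits / total if total else None

# Numerical probabilities are the special case that also gets a Brier score.
def brier(p, outcomes):
    return sum((p - int(o)) ** 2 for o in outcomes) / len(outcomes)

record("feels exciting!", True)
record("feels exciting!", False)
print(hit_rate("feels exciting!"))       # 0.5
print(round(brier(0.9, [True, False]), 2))  # 0.41
```

The point is that "this feels exciting!" and "this feels 90% likely" go through the exact same `record`/`hit_rate` pipeline; only the Brier-score tooling is probability-specific.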
While I still don't feel like I understand electrolytes as well as I would like to, I have become more convinced that supplementing potassium is worthwhile when one engages in activities that produce sweating.
Over the last year I started using potassium carbonate like a spice, and whether or not it tastes good depends a lot on how much I was sweating in the day before the meal.
Given that summer is coming up, if you aren't already supplementing electrolytes on those days that are warm enough to make you sweat, I recommend getting some potassium carbonate a...
In addition, from my perspective, if you consume the same amount of potassium every day of the year, you (as a typical office worker) likely consume either too much or too little on some days.
Selected fragments (though not really cherry-picked, no reruns) of a conversation with Claude Opus on operationalizing something like Activation vector steering with BCI by applying the methodology of Concept Algebra for (Score-Based) Text-Controlled Generative Models to the model from High-resolution image reconstruction with latent diffusion models from human brain activity (website with nice illustrations of the model).
My prompts bolded:
'Could we do concept algebra directly on the fMRI of the higher visual cortex?
Yes, in principle, it should be possible...
Turns out, someone's already done a similar (vector arithmetic in neural space; latent traversals too) experiment in a restricted domain (face processing) with another model (GAN) and it seemed to work: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012058 https://github.com/neuralcodinglab/brain2gan/blob/main/figs_manuscript/Fig12.png https://openreview.net/pdf?id=hT1S68yza7
I've been thinking about how the way to talk about how a neural network works (instead of how it could hypothetically come to work by adding new features) would be to project away components of its activations/weights, but I got stuck because of the issue where you can add new components by subtracting off large irrelevant components.
I've also been thinking about deception and its relationship to "natural abstractions", and in that case it seems to me that our primary hope would be that the concepts we care about are represented at a larger "magnitude" than...
True, though I think the Hessian is problematic enough that I'd either want to wait until I have something better, or want to use a simpler method.
It might be worth going into more detail about that. The Hessian for the probability of a neural network output is mostly determined by the Jacobian of the network. But in some cases the Jacobian gives us exactly the opposite of what we want.
If we consider the toy model of a neural network with no input neurons and only 1 output neuron (which I imagine to represent a path through the net...
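One way to unpack the "the Hessian is mostly the Jacobian" claim (my reconstruction of the standard decomposition, not necessarily what the commenter had in mind): for a scalar loss $L(f_\theta(x))$ with network Jacobian $J = \partial f / \partial \theta$,

$$\nabla_\theta^2 L \;=\; J^\top \big(\nabla_f^2 L\big)\, J \;+\; \sum_i \frac{\partial L}{\partial f_i}\,\nabla_\theta^2 f_i,$$

where the first (Gauss-Newton) term is determined entirely by the Jacobian and the second term is small near a well-fit point, so the Hessian's structure is dominated by $J$.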
I currently am completing psychological studies for credit in my university psych course. The entire time, all I can think is “I wonder if that detail is the one they’re using to trick me with?”
I wonder how this impacts results. I can’t imagine being in a heightened state of looking out for deception has no impact.
Does anyone have any takes on the two Boeing whistleblowers who died under somewhat suspicious circumstances? I haven't followed this in detail, and my guess is it's basically just random chance, but it sure would be a huge deal if a publicly traded company were now performing assassinations of U.S. citizens.
Curious whether anyone has looked into this, or has thought much about baseline risk of assassinations or other forms of violence from economic actors.
That's true, but the timing and incongruity of a "suicide" the day before testifying seems even more absurdly unlikely than corporations starting to murder people. And it's not like they're going out and doing it themselves; they'd be hiring a hitman of some sort. I don't know how any of that works, and I agree that it's hard to imagine anyone invested enough in their job or their stock options to risk a murder charge; but they may feel that their chances of avoiding charges are near 100%, so it might make sense to them.
I just have absolutely no other way ...
Causality is rare! The usual statement that "correlation does not imply causation" puts them, I think, on deceptively equal footing. It's really more like correlation is almost always not causation absent something strong like an RCT or a robust study set-up.
Over the past few years I'd gradually become increasingly skeptical of claims of causality just by updating on empirical observations, but it just struck me that there's a good first principles reason for this.
For each true cause of some outcome we care to influence, there are many other "measurables" ...
Those are not randomly selected pairs, however. There are 3 major causal patterns: A->B, A<-B, and A<-C->B. Daecaneus is pointing out that for a random pair of correlations of some variables, we do not assign a uniform prior of 33% to each of these. While it may sound crazy to try to argue for some specific prior like 'we should assign 1% to the direct causal patterns of A->B and A<-B, and 99% to the confounding pattern of A<-C->B', this is a lot closer to the truth than thinking that 'a third of the time, A causes B; a third of the...
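A quick simulation makes the confounding pattern concrete: here B never reads A, yet the two correlate strongly because both read C. This is an illustrative sketch of the A<-C->B case, not anything from the original comment:

```python
import random

random.seed(0)
# Confounding pattern A <- C -> B: neither A nor B causes the other.
n = 10_000
C = [random.gauss(0, 1) for _ in range(n)]
A = [c + random.gauss(0, 0.5) for c in C]
B = [c + random.gauss(0, 0.5) for c in C]

def corr(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Expected correlation is var(C) / (var(C) + 0.25) = 0.8, with zero causation.
print(round(corr(A, B), 2))
```

An intervention on A here (replacing it with arbitrary values) would leave B untouched, which is exactly the gap between the observed correlation and any causal claim.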
Anybody know how Fathom Radiant (https://fathomradiant.co/) is doing?
They’ve been working on photonics compute for a long time, so I’m curious whether anyone knows what timelines they expect before it has practical effects on compute.
Also, Sam Altman and Scott Gray at OpenAI are both investors in Fathom. Not sure when they invested.
I’m guessing it’s still a long-term bet at this point.
OpenAI also hired someone who worked at PsiQuantum recently. My guess is that they are hedging their bets on the compute end and generally looking for opportunities on ...
I'm working on publishing a post on this and energy bottlenecks. If anyone is interested in doing a quick skim for feedback, I hope to publish it in under two hours.
Edit: Post here.
A list of some contrarian takes I have:
People are currently predictably too worried about misuse risks
What people really mean by "open source" vs "closed source" labs is actually "responsible" vs "irresponsible" labs, which is not affected by regulations targeting open source model deployment.
Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.
Better information security at labs is not clearly a good thing, and if we're worried about great power conflict, probably a bad thing.
Much research on deception (Anthropic's re
I'd be happy to chat. Will DM so we can set something up.
On the subject of your paper, I do think it looks at a much more interesting phenomenon than, say, sleeper agents, but I'm also not fully convinced you're studying deliberate, instrumentally convergent deception either. I think your subsequent follow-ups for narrowing down hypotheses mostly consider too narrow a range of ways the model could think. That is to say, I think you assume your model is some unified coherent entity that always acts cerebrally, and I'm skeptical of that.
For example, the model...
I'm working on a non-trivial.org project meant to assess the risk of genome sequences by comparing them to a public list of the most dangerous pathogens we know of. This would be used to assess the risk from both experimental results in e.g. BSL-4 labs and the output of e.g. protein folding models. The benchmarking would be carried out by an in-house ML model of ours. Two questions to LessWrong:
1. Is there any other project of this kind out there? Do BSL-4 labs/AlphaFold already have models for this?
2. "Training a model on the most dangerous pa...
I used to have an idea for a karma/reputation system: repeatedly recalculate karma weighted by the karma of the upvoters and downvoters on a comment (then normalize to avoid hyperinflation) until a fixed point is reached.
I feel like this is vaguely somehow related to:
Also check out "personalized pagerank", where the rating shown to each user is "rooted" in what kind of content this user has upvoted in the past. It's a neat solution to many problems.
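The fixed-point recalculation can be sketched in a few lines. The vote data, the damping, and the L1 normalization are all my own illustrative choices (plain power iteration can oscillate on vote cycles, so some mixing with the old values is needed):

```python
# Hypothetical votes: (voter, target, +1 or -1).
votes = [
    ("alice", "bob", +1),
    ("bob", "carol", +1),
    ("carol", "alice", +1),
    ("mallory", "alice", -1),
]
users = {u for voter, target, _ in votes for u in (voter, target)}
karma = {u: 1.0 for u in users}

for _ in range(100):
    new = {u: 0.0 for u in users}
    for voter, target, sign in votes:
        new[target] += sign * karma[voter]  # votes weighted by voter karma
    norm = sum(abs(k) for k in new.values()) or 1.0  # normalize: no hyperinflation
    # Damped update: mixing with the old values helps convergence on cycles.
    karma = {u: 0.5 * karma[u] + 0.5 * new[u] / norm for u in users}

print({u: round(k, 3) for u, k in sorted(karma.items())})
```

Since nobody votes for `mallory`, his weight decays toward zero, so his downvote of `alice` stops counting — which is the intended sybil-resistance property of the scheme.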
I had speculated previously about links between task arithmetic and activation engineering. I think given all the recent results on in context learning, task/function vectors and activation engineering / their compositionality (In-Context Learning Creates Task Vectors, In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering, Function Vectors in Large Language Models), this link is confirmed to a large degree. This might also suggest trying to import improvements to task arithmetic (e.g. Task Arithmetic i...
For the pretraining-finetuning paradigm, this link is now made much more explicitly in Cross-Task Linearity Emerges in the Pretraining-Finetuning Paradigm; as well as linking to model ensembling through logit averaging.
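For readers unfamiliar with task arithmetic, the core operation is just parameter-space vector addition. A toy sketch with made-up numbers (not code from any of the cited papers):

```python
# Toy "models" as flat parameter lists.
base = [0.0, 1.0, -1.0, 0.5]           # pretrained weights
finetuned_a = [0.2, 1.1, -1.0, 0.5]    # hypothetical task-A finetune
finetuned_b = [0.0, 1.0, -0.8, 0.7]    # hypothetical task-B finetune

def sub(x, y):
    return [a - b for a, b in zip(x, y)]

def add(x, y):
    return [a + b for a, b in zip(x, y)]

tau_a = sub(finetuned_a, base)  # task vector for task A
tau_b = sub(finetuned_b, base)  # task vector for task B

# Task arithmetic: adding task vectors onto the base composes the tasks.
multi_task = add(base, add(tau_a, tau_b))
print([round(v, 3) for v in multi_task])  # [0.2, 1.1, -0.8, 0.7]
```

The cross-task linearity result is roughly the claim that, in the pretraining-finetuning regime, this kind of weight-space addition tracks addition (or averaging) in the models' feature/logit space.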
EDIT: I believe I've found the "plan" that Politico (and other news sources) managed to fail to link to, maybe because it doesn't seem to contain any affirmative commitments by the named companies to submit future models to pre-deployment testing by UK AISI.
I've seen a lot of takes (on Twitter) recently suggesting that OpenAI and Anthropic (and maybe some other companies) violated commitments they made to the UK's AISI about granting them access for e.g. predeployment testing of frontier models. Is there any concrete evidence about what commitment wa...
Adding to the confusion: I've nonpublicly heard from people at UKAISI and [OpenAI or Anthropic] that the Politico piece is very wrong and DeepMind isn't the only lab doing pre-deployment sharing (and that it's hard to say more because info about not-yet-deployed models is secret). But no clarification on commitments.