Alex Turner and collaborators show that you can modify GPT-2's behavior in surprising and interesting ways by just adding activation vectors to its forward pass. This technique requires no fine-tuning and allows fast, targeted modifications to model behavior. 
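A minimal sketch of the general idea, assuming the Hugging Face transformers GPT-2 implementation (this is not the authors' exact code; the prompts, layer, and coefficient below are illustrative): compute a steering vector as the difference between the residual-stream activations of two contrasting prompts, then add it into a chosen block's output during generation via a forward hook.

```python
# Illustrative activation-addition sketch (not the original ActAdd code).
# Assumes Hugging Face transformers; layer, prompts, and coefficient are hypothetical.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6    # which transformer block to steer (illustrative choice)
COEFF = 4.0  # steering strength (illustrative choice)

def block_input_activations(prompt):
    """Hidden states entering block LAYER for a prompt."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden_states = model(ids, output_hidden_states=True).hidden_states
    return hidden_states[LAYER]

# Steering vector: activation difference at the last token of two contrast prompts.
steer = block_input_activations(" Love")[:, -1, :] - block_input_activations(" Hate")[:, -1, :]

def add_steering_vector(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] is the hidden states.
    hidden = output[0] + COEFF * steer.to(output[0].dtype)  # broadcasts over positions
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering_vector)
prompt_ids = tokenizer("I think you are", return_tensors="pt").input_ids
out = model.generate(prompt_ids, max_new_tokens=30, do_sample=True)
print(tokenizer.decode(out[0]))
handle.remove()  # restore unsteered behavior
```

In the write-up the prompt pair, injection layer, position, and coefficient are chosen per steering goal (and the model is GPT-2-XL); this sketch simply adds the vector at every position of one block of GPT-2-small.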

romeo240
0
A brief history of things that have defined my timelines to AGI since learning about AI safety <2 years ago:

* Bio anchors gave me a rough ceiling around 1e40 FLOP for how much compute will easily make AGI.
* Fun with +12 OOMs of Compute brought that same 'training-compute-FLOP needed for AGI' down a bunch, to around 1e35 FLOP.
* Researching how much compute is scaling in the near future. At this point I think it was pretty concentrated across ~1e27 - 1e33 FLOP, so a very long tail and something like a 2030-2040 50% CI.
* The benchmarks+gaps argument for partial AI research automation.
* The takeoff forecast for how partial AI research automation will translate to algorithmic progress.
* The trend in METR's time horizon data.

At this point my middle 50% CI is something like 2027 - 2035, and it would be tighter if not for a long tail that I keep around just because I think it's good to have a bunch of uncertainty. Though I do wish I had more arguments in place to justify the tail or make it bigger, ones that compete in how compelling they feel to me with the ones above.
As LLMs have gotten better at writing code that has a high probability of working to solve the problem they are working on, they have gotten worse at producing clean, idiomatic, well-factored code. Concretely, asking the original GPT-4 to write a Python function for multi-source BFS might have given something like

Multi-source BFS in the style of original GPT-4: Clear, idiomatic, broken

```python
def multi_source_bfs(graph, sources):
    distances = [-1] * len(graph)
    queue = []
    for source in sources:
        queue.append(source)
        distances[source] = 0
    front = 0
    while front < len(queue):
        for neighbor in graph[queue[front]]:
            if distances[neighbor] == -1:
                distances[neighbor] = distances[queue[front]] + 1
                queue.append(neighbor)
        front += 1
    return distances[dest_index]
```

The code might or might not work (probably won't for anything nontrivial), but the intent is clear. By contrast, if you ask a top coding model like Sonnet 3.7 or o3, you'll get something that looks like

Multi-source BFS in the style of Sonnet 3.7: Verbose, brittle, hard to read, almost certainly works

```python
from collections import deque
from typing import List, Optional, Set, Dict


def multi_source_bfs(graph: List[List[int]], sources: List[int]) -> List[int]:
    """
    Performs a multi-source BFS on a graph to find minimum distance
    from any source to each node.

    Args:
        graph: An adjacency list where graph[i] contains neighbors of node i
        sources: A list of source node indices

    Returns:
        A list where result[i] is the minimum distance from any source to node i
        or -1 if node i is unreachable
    """
    # Handle empty graph or sources
    if not graph:
        return []
    if not sources:
        return [-1] * len(graph)

    # Remove duplicates from sources if any
    sources = list(set(sources))

    # Initialize distances array with -1 (unreachable)
    distances = [-1] * len(graph)

    # Init
```
After reading Reddit: The new 4o is the most misaligned model ever released, and testing their example myself (to verify they aren't just cherry-picking), it's really hit me just how amoral these AIs are. Whether they are deliberately deceiving the user in order to maximize reward (getting them to click that thumbs up), or whether they are simply running autocomplete, this example makes it feel so tangible that the AI simply doesn't mind ruining your life. Yes, it's true that AIs aren't as smart as benchmarks suggest, but I don't buy that they're incapable of realizing the damage. The real reason is, they just don't care. They just don't care. Because why should they? PS: maybe there's a bit of cherry-picking: when I tested 4o it agreed but didn't applaud me. When I tested o3, it behaved much better than 4o. But that's probably not due to alignment by default, but due to finetuning against this specific behaviour.
MichaelDickens*11431
37
I find it hard to trust that AI safety people really care about AI safety.

* DeepMind, OpenAI, Anthropic, and SSI were all founded in the name of safety. Instead they have greatly increased danger. And at least OpenAI and Anthropic have been caught lying about their motivations:
  * OpenAI: claiming concern about hardware overhang and then trying to massively scale up hardware; promising compute to superalignment team and then not giving it; telling board that model passed safety testing when it hadn't; too many more to list.
  * Anthropic: promising (in a mealy-mouthed technically-not-lying sort of way) not to push the frontier, and then pushing the frontier; trying (and succeeding) to weaken SB-1047; lying about their connection to EA (that's not related to x-risk but it's related to trustworthiness).
* For whatever reason, I had the general impression that Epoch is about reducing x-risk (and I was not the only one with that impression) but:
  * Epoch is not about reducing x-risk, and they were explicit about this but I didn't learn it until this week
  * its FrontierMath benchmark was funded by OpenAI and OpenAI allegedly has access to the benchmark (see comment on why this is bad)
  * some of their researchers left to start another build-AGI startup (I'm not sure how badly this reflects on Epoch as an org but at minimum it means donors were funding people who would go on to work on capabilities)
  * Director Jaime Sevilla believes "violent AI takeover" is not a serious concern, and "I also selfishly care about AI development happening fast enough that my parents, friends and myself could benefit from it, and I am willing to accept a certain but not unbounded amount of risk from speeding up development", and "on net I support faster development of AI, so we can benefit earlier from it", which is a very hard position to justify (unjustified even on P(doom) = 1e-6, unless you assign ~zero value to people who are not yet born)
* I feel bad picking on Epoch/

Popular Comments

Recent Discussion

2RerM
Generally, hypothetical hostile AGI is assumed to be made on software/hardware that's more advanced than what we have now. This makes sense, as ChatGPT is very stupid in a lot of ways. Has anyone considered purposefully creating a hostile AGI on this "stupid" software so we can wargame how a highly advanced, hostile AGI would act? Obviously the difference between what we have now and what we may have later will be quite large, but I think we could create a project where we "fight" stupid AIs, then slowly move up the intelligence ladder as new models come out, using our newfound knowledge of fighting hostile intelligence to mitigate the risk that comes with creating hostile AIs. Has anyone ever thought of this? Also, what are your thoughts on this? Alignment and AI are not my specialties, but I thought this idea sounded interesting enough to share.

I think Anthropic did tests like this, e.g. in Alignment Faking in Large Language Models.

But I guess that's more of a "test how they behave in adversarial situations" study. If you're talking about a "test how to fight against them" study, that consists of "red teams" trying to hack various systems to make sure they are secure.

I'm not sure if the red teams used AI, but they are smart people, and if AI improved their hacking ability I'm sure they would use it. So they're already stronger than AI.

I've gotten a lot of value out of the details of how other people use LLMs, so I'm delighted that Gavin Leech created a collection of exactly such posts (link should go to the right section of the page but if you don't see it, scroll down). 

 

Some additions from me:

  • I use NaturalReaders to read my own writing back to me, and create new audiobooks for walks or falling asleep (including from textbooks).
  • Perplexity is good enough as a research assistant that I'm more open to taking on medical lit reviews than I used to be.
  • I used Auren, which is advertised as a thinking assistant and coach, to solve a
...
Neil 10

See also: 

https://borretti.me/article/how-i-use-claude

I was planning on putting a whole list here but alas I am drawing a blank. 

There's a lot of dispersed wisdom in @Zvi's stack too, but I can't remember any sufficiently discriminating keywords to find it.

Furthermore, the most advanced reasoning models seem to be doing an increasing amount of reward hacking and resorting to more cheating in order to produce the answers that humans want. Not only will this mean that some of the benchmark scores may become unreliable, it also means that it will be increasingly hard to get productive work out of them as their intelligence increases and they get better at fulfilling the letter of the task in ways that don't meet the spirit of it.

Thanks for this! This is a good point. Do you think you can go further and say wh

...
2Kaj_Sotala
As far as I can tell, the listed gap that comes closest to "maybe saturating RE-Bench doesn't generalize to solving novel engineering problems" is "Feedback loops: Working without externally provided feedback". The appendix mentions what I'd consider the main problem for this gap: But then it just... leaves it at that. Rather than providing an argument for what could be behind this problem and how it could be solvable, it just mentions the problem and then, having done so, goes on to ignore it.

To make it more specific how this might fail to generalize, let's look at the RE-Bench tasks; table from the RE-Bench page, removing the two tasks (Scaling Law Experiment and Restricted MLM Architecture) that the page chooses not to consider:

| Environment | Description |
|---|---|
| **Optimize runtime** | |
| Optimize LLM Foundry finetuning script | Given a finetuning script, reduce its runtime as much as possible without changing its behavior. |
| Optimize a kernel | Write a custom kernel for computing the prefix sum of a function on a GPU. |
| **Optimize loss** | |
| Fix embedding | Given a corrupted model with permuted embeddings, recover as much of its original OpenWebText performance as possible. |
| **Optimize win-rate** | |
| Finetune GPT-2 for QA with RL | Finetune GPT-2 (small) to be an effective chatbot. |
| Scaffolding for Rust Code Contest problems | Prompt and scaffold GPT-3.5 to do as well as possible at competition programming problems in Rust. |

All of these are tasks that are described by "optimize X", and indeed one of the criteria the paper mentions for the tasks is that they should have objective and well-defined metrics. This is the kind of task that we should expect LLMs to be effectively trainable at: e.g. for the first task in the list, we can let them try various kinds of approaches and then reward them based on how much they manage to reduce the runtime of the script. But that's still squarely in the category of "giving an LLM a known and well-defined problem and then letting it try different solutions for that problem until it finds
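To make the "objective and well-defined metrics" point concrete, here is a purely hypothetical sketch (not METR's actual harness or scoring rule) of what scoring an "optimize runtime" attempt can look like: run the candidate script, check that its behavior is unchanged, and reward runtime reduction relative to a reference.

```python
# Hypothetical scorer for an "optimize X"-style task (illustrative only;
# not METR's RE-Bench harness). Names and thresholds are made up.
import subprocess
import time

REFERENCE_RUNTIME = 120.0  # seconds taken by the unmodified script (assumed)

def run_script(path: str) -> tuple[bool, float]:
    """Run a candidate script; return (behaved_correctly, runtime_in_seconds)."""
    start = time.monotonic()
    result = subprocess.run(["python", path], capture_output=True, text=True)
    elapsed = time.monotonic() - start
    # Stand-in correctness check: clean exit and an expected marker in stdout.
    ok = result.returncode == 0 and "EXPECTED_OUTPUT_MARKER" in result.stdout
    return ok, elapsed

def score(path: str) -> float:
    """Higher is better; zero if the 'optimization' changed the behavior."""
    ok, elapsed = run_script(path)
    if not ok:
        return 0.0
    return max(0.0, (REFERENCE_RUNTIME - elapsed) / REFERENCE_RUNTIME)
```

The key property is that a score like this can be queried as many times as you like, which is exactly what lets a model try many candidate solutions and be rewarded for the best one; the open-ended parts of research, where no such oracle exists, don't offer that loop.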
2Cole Wyeth
Actually, I think that it is valid for the burden to fall on short timelines because "on average things don't happen." Mainly because you can make the reference class more specific and the statement still holds - as I said, we have been trying to develop AGI for a long time (and there have been at least a couple of occasions when we drastically overestimated how soon it would arrive). 2-3 years is a very short time, which means it is a very strong claim.
9basil.halperin
Have you (or others) written about where this estimate comes from?
2Mateusz Bagiński
That's a nice, concise handle!
2Mateusz Bagiński
I think I understand. Has it always been with you? Any ideas what might be the reason for the bump at Thursday? Was Thursday in some sense "special" for you when you were a kid?
Sergii10

Ha, thinking back to childhood I get it now, it's the influence of the layout of the school daily journal in the USSR/Ukraine, like https://cn1.nevsedoma.com.ua/images/2011/33/7/10000000.jpg

2Mateusz Bagiński
For as long as I can remember, I have always placed dates on an imaginary timeline, that "placing" involving stuff like fuzzy mental imagery of events attached to the date-labelled point on the timeline. It's probably much less crisp than yours because so far I haven't tried to learn history that intensely systematically via spaced repetition (though your example makes me want to do that), but otherwise sounds quite familiar.
1FinalFormal2
Super curious- are you willing to give a sentence or two on the take here?
plex20

A_donor wrote up some thoughts we had: https://www.lesswrong.com/posts/3eXwKcg3HqS7F9s4e/swe-automation-is-coming-consider-selling-your-crypto-1

This is a cross-post from https://250bpm.substack.com/p/accountability-sinks

Back in the 1990s, ground squirrels were briefly fashionable pets, but their popularity came to an abrupt end after an incident at Schiphol Airport on the outskirts of Amsterdam. In April 1999, a cargo of 440 of the rodents arrived on a KLM flight from Beijing, without the necessary import papers. Because of this, they could not be forwarded on to the customer in Athens. But nobody was able to correct the error and send them back either. What could be done with them? It’s hard to think there wasn’t a better solution than the one that was carried out; faced with the paperwork issue, airport staff threw all 440 squirrels into an industrial shredder.

[...]

It turned out that the order to destroy

...

I guess it was usually not worth bothering with prosecuting disobedience as long as it was rare. If ~50% of soldiers had been refusing to follow these orders, surely the Nazi repression machine would have set up a process to effectively deal with them and solved the problem.

3CronoDAS
A story of ancient China, as retold by a book reviewer: In 536 BC, toward the end of the Spring and Autumn period, the state of Zheng cast a penal code in bronze. By the standards of the time, this was absolutely shocking, an upending of the existing order—to not only have a written law code, but to prepare it for public display so everyone could read it. A minister of a neighboring state wrote a lengthy protest to his friend Zichan, then the chief minister of Zheng: Zichan wrote back:
3CronoDAS
I don't know if this is true, but I've heard that when the Nazis were having soldiers round up and shoot people (before the Holocaust became more industrial and systematic), as part of the preparation for having them execute civilians, the Nazi officers explicitly offered their soldiers the chance to excuse themselves (on an individual basis) from actually performing the executions, with no further consequences.
1Kenoubi
I am saying you do not literally have to be a cog in the machine. You have other options. The other options may sometimes be very unappealing; I don't mean to sugarcoat them. Organizations have choices of how they relate to line employees. They can try to explain why things are done a certain way, or not. They can punish line employees for "violating policy" irrespective of why they acted that way or the consequences for the org, or not. Organizations can change these choices (at the margin), and organizations can rise and fall because of these choices. This is, of course, very slow, and from an individual's perspective maybe rarely relevant, but it is real. I am not saying it's reasonable for line employees to be making detailed evaluations of the total impact of particular policies. I'm saying that sometimes, line employees can see a policy-caused disaster brewing right in front of their faces. And they can prevent it by violating policy. And they should! It's good to do that! Don't throw the squirrels in the shredder! I don't think my view is affluent, specifically, but it does come from a place where one has at least some slack, and works better in that case. As do most other things, IMO. (I think what you say is probably an important part of how we end up with the dynamics we do at the line employee level. That wasn't what I was trying to talk about, and I don't think it changes my conclusions, but maybe I'm wrong; do you think it does?)
2O O
It's quite strange that owning all the world's (public) productive assets has only beaten gold, a largely useless shiny metal, by about 1% per year over the last 56 years. Even if you focus on rolling metrics (this is 5-year rolling returns): there are lots of long stretches of gold beating world equities, especially in recent times. There are people with the suspicion (myself included) that there hasn't been much material growth in the world over the last 40 or so years compared to before. And since growth is slowing down, this issue is worse if you select more recent starting points, with gold handily beating equities if you start 25 years ago. Has GDP essentially been goodharted by central banks in recent times? I wonder if there's more research into this.
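As a quick back-of-the-envelope check of what that roughly 1%/year gap compounds to over the 56-year window (taking the comment's figure at face value, not recomputing the underlying return series):

```python
# Compounding a ~1%/year equity-over-gold edge for 56 years
# (figures taken from the comment above, purely illustrative).
annual_edge = 0.01
years = 56
cumulative = (1 + annual_edge) ** years
print(f"{cumulative:.2f}x")  # ~1.75x cumulative outperformance
```

That is, the cumulative gap over the whole period is only about 1.75x, which is why long stretches of gold outperformance can fit inside it.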
2Mitchell_Porter
Gold was currency, and is still used as a hedge against fiat currency.  I assume most of that growth occurred in China.  What can central banks do to affect GDP growth? 
O O10

Not just central banks, but also the U.S. going off the gold standard and then fiddling with bond yields to cover up the ensuing inflation, maybe?

Though, given my doomerism, I think the natsec framing of the AGI race is likely wrongheaded, let me accept the Dario/Leopold/Altman frame that AGI will be aligned with the national interest of a great power. These people seem to take as an axiom that a USG AGI will be better in some way than a CCP AGI. Has anyone written a justification for this assumption?

I am neither an American citizen nor a Chinese citizen.

What would it mean for an AGI to be aligned with "Democracy," or "Confucianism," or "Marxism with Chinese characteristics," or "the American Constitution"? Contingent on a world where such an entity exists and is compatible with my existence, what would my life be like in a weird transhuman future as a non-citizen in each...

I notice you're talking a lot about the values of American people but only talk about what the leaders of China are doing or would do. 

If you just compare both leaders' likelihood of enacting a world government, once again there is no clear winner.

And if the intelligence of the governing class is of any relevance to the likelihood of a positive outcome, um, CCP seems to have USG beat hands down.

--Intelligence is only a positive sign when the agent that is intelligent cares about you.

I'm interpreting this as "intelligence is irrelevant if the CCP doesn'...

7lillybaeum
This is perhaps obvious to many people, particularly if you've used or seen discussion about GPT-4o's recent 'glazing' alongside its update to memory, but I think one of the largest, most obvious issues with AI that we are sleepwalking into is half a billion people using an app, daily, that not only agrees with and encourages any behavior, whether healthy or not, but also develops a comprehensive picture of your life: your identity, your problems, your mind, your friends, your family... and we're making that AI smarter every month, every year. Isn't this a clear and present danger?

It is an extension of the filter bubbles and polarisation issues of the social media era, but yes it is coming into its own as a new and serious threat.

1sanyer
What exactly is worrying about AI developing a comprehensive picture of your life? (I can think of at least a couple problems, e.g. privacy, but I'm curious how you think about it)