GradientDissenter

Comments (sorted by newest)
GradientDissenter's Shortform
GradientDissenter · 13h

I think the canary string is slightly more likely to work than the robots.txt. And the things you link don't say which labs. I totally believe some labs scrape aggressively and ignore all robots.txt, but it wouldn't surprise me if others don't! In my opinion, if even one frontier lab doesn't train on this data, that's still quite valuable. And it seems so cheap to just try this as a bit of a Hail Mary.

GradientDissenter's Shortform
GradientDissenter · 19h

Absolutely! But having models that don't know how to effectively scheme without access to external resources is still very useful. As just one example, we could interrogate the models and give them honeypots without giving them access to the internet.

Labs could also ban some of their models from reading certain webpages or ingesting certain information in deployment (i.e., filter all inputs to the model at inference time), especially for models performing high-stakes tasks where sabotage would be easier.
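A minimal sketch of what such an inference-time input filter could look like (the blocklist, function name, and redaction text are illustrative assumptions, not anything a lab actually uses):

```python
# Minimal sketch (illustrative, not any lab's actual system): filter inputs at
# inference time so a deployed model never ingests content from blocklisted pages.
from urllib.parse import urlparse

# Hypothetical blocklist of domains the deployed model shouldn't read.
BLOCKED_DOMAINS = {"forum-about-evading-oversight.example"}

def filter_fetched_content(url: str, content: str) -> str:
    """Return the content the model is allowed to see for a fetched URL."""
    domain = urlparse(url).netloc.lower()
    if any(domain == d or domain.endswith("." + d) for d in BLOCKED_DOMAINS):
        return "[content withheld by deployment policy]"
    return content
```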

I also suspect that digesting a LessWrong post read in deployment is more likely to involve explicit reasoning about the post in the CoT (or be visible to probes, etc.) and thus to be caught. By contrast, if a model has seen lots of transcripts of misalignment in pre-training, the misaligned actions might be intuitive and second-nature, and thus harder to detect.

GradientDissenter's Shortform
GradientDissenter · 1d

Probably I should have included a footnote about this. I'm well aware that this is not a foolproof mechanism, but it still seems better than nothing and I think it's very easy to have a disclaimer that makes this clear. As I said in the post, I think that people should only do this for information they would have posted on LessWrong anyway.

I disagree that these things are basically ignored by labs. My guess is many labs put some effort into filtering out data with the canary string, but that this is slightly harder than you might think and so they end up messing it up sometimes. (They might also sometimes ignore it on purpose, I'm not sure.)

Even if labs ignore the canary string now, having it in place would make it much easier to filter these things out if labs ever wanted to in the future.

 

I also suggest using better methods like captchas for non-logged-in users. I expect something like this to work somewhat well (though it still wouldn't be foolproof).

GradientDissenter's Shortform
GradientDissenter · 1d (edited)

LessWrong feature request: make it easy for authors to opt out of having their posts in the training data.

If most smart people were put in the position of a misaligned AI and tried to take over the world, I think they’d be caught and fail.[1] If I were a misaligned AI, I think I’d have a much better shot at succeeding, largely because I’ve read lots of text about how people evaluate and monitor models, strategies schemers can use to undermine evals and take malicious actions without being detected, and creative paths to taking over the world as an AI.

A lot of that information is from LessWrong.[2] It's unfortunate that this information will probably wind up in the pre-training corpus of new models (though it's often still worth it overall to share most of this information[3]).

LessWrong could easily change this for specific posts! They could add something to their robots.txt to ask crawlers looking to scrape training data to ignore the pages. They could add canary strings to the page invisibly. (They could even go a step further and invisibly add something like copyrighted song lyrics.) If they really wanted, they could put the content of a post behind a captcha for users who aren't logged in. This system wouldn't be perfect (edit: please don't rely on these methods; they're harm reduction for information you would otherwise have posted without any protections), but I think even reducing the odds or the quantity of this data in the pre-training corpus could help.
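For concreteness, a minimal sketch of what the robots.txt and canary-string pieces might look like (the crawler names, paths, markup, and canary text below are illustrative placeholders, not an exact spec, and crawlers are free to ignore them):

```
# robots.txt sketch: ask training-data crawlers to skip opted-out posts.
# GPTBot and CCBot are examples of real crawlers; a real deployment would
# maintain the current list and the actual per-post paths.
User-agent: GPTBot
Disallow: /posts/opted-out-post/

User-agent: CCBot
Disallow: /posts/opted-out-post/
```

```
<!-- Canary sketch: embed the string invisibly in the post's HTML so labs -->
<!-- that filter on canary GUIDs can drop the page from training data.    -->
<div style="display:none" aria-hidden="true">
  TRAINING DATA CANARY: this page should not appear in training corpora.
  canary GUID 00000000-0000-0000-0000-000000000000 (placeholder)
</div>
```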

I would love to have this as a feature at the bottom of drafts. I imagine a box I could tick in the editor that would enable this feature (and maybe let me decide whether I want the captcha part). Ideally the LessWrong team could prompt an LLM to read users' posts before they hit publish. If it seems like the post might be something the user wouldn't want models trained on, the site could proactively ask the user whether they'd like the post excluded from the training corpus.

As far as I know, no other social media platform gives users an easy way to try to keep their data out of training corpora (and many actively sell it for this purpose). So LessWrong would be providing a valuable service.

The actual decisions around what should or shouldn’t be part of the pre-training corpus seem nuanced: if we want to use LLMs to help with AI safety, it might help if those LLMs have some information about AI safety in their pre-training corpus (though adding that information back in during post-training might work almost as well). But I want to at least give users the option to opt out of the current default.

  1. ^

    That's not to say all misaligned AIs would fail; I think there will be a period where AIs are roughly as smart as me and thus could bide their time and hide their misalignment without being caught if they'd read LessWrong, but might fail to do so and get caught if they hadn't. But you can imagine we're purchasing dignity points or shaving off micro-dooms, depending on your worldview. In either case I think this intervention is relatively cheap and worthwhile.

  2. ^

    Of course much of it is reproduced outside LessWrong as well. But I think (1) so much of it is still on LessWrong and nowhere else that it’s worth it, and (2) the more times this information appears in the pre-training data, the more likely the model is to memorize it or find it salient.

  3. ^

    And the information for which the costs of sharing it aren't worth it probably still shouldn't be posted even if the proposal I outline here is implemented, since there’s still a good chance it might leak out.

GradientDissenter's Shortform
GradientDissenter · 2d

Interesting! How did Norquist/Americans for Tax Reform get so much influence? They seem to spend even less money than Intuit on lobbying, but maybe I'm not looking at the right sources or they have influence via means other than money?

I'm also somewhat skeptical of the claims. The agreement between the IRS and the Free File Alliance feels too favorable to the Free File Alliance for them to have had no hand in it.

As to your confusion, I can see why an advocacy group that wants to lower taxes might want the process of filing taxes to be painful. I'm just speculating, but I bet the fact that taxes are annoying to file and require you to directly confront the sizable sum you may owe the government makes people favor lower taxes and simpler tax codes.

GradientDissenter's Shortform
GradientDissenter · 3d

Ways training incentivizes and disincentivizes introspection in LLMs.

Recent work has shown some LLMs have some ability to introspect. Many people were surprised to learn LLMs had this capability at all. But I found the results somewhat surprising for another reason: models are trained to mimic text, both in pre-training and fine-tuning. Almost every time a model is prompted in training to generate text related to introspection, the answer it's trained to give is whatever answer the LLMs in the training corpus would say, not what the model being trained actually observes from its own introspection. So I worry that even if models could introspect, they might learn to never introspect in response to prompting.

We do see models act consistently with this hypothesis sometimes: if you ask a model how many tokens it sees in a sentence or instruct it to write a sentence that has a specific number of tokens in it, it won't answer correctly.[1] But the model probably "knows" how many tokens there are; it's an extremely salient property of the input, and the space of possible tokens is a very useful thing for a model to know since it determines what it can output. At the very least, models can be trained to semi-accurately count tokens and conform their outputs to short token limits.
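As a concrete illustration of that check, here is a minimal sketch (assuming the tiktoken library and its cl100k_base encoding; the sentence and prompt wording are arbitrary examples, and the model-querying step is left as a comment):

```python
# Minimal sketch: compare a model's self-reported token count with the
# tokenizer's actual count. Assumes the `tiktoken` library; the encoding,
# sentence, and prompt below are illustrative choices, not a fixed protocol.
import tiktoken

sentence = "The quick brown fox jumps over the lazy dog."
enc = tiktoken.get_encoding("cl100k_base")
actual_count = len(enc.encode(sentence))

# In the experiment described above, you'd also prompt the model, e.g.:
#   "How many tokens is this sentence when you read it? <sentence>"
# and compare its answer to actual_count. Per the observation in the text,
# models tend to give a plausible but imprecise number. Note the model's own
# tokenizer may differ from cl100k_base, so this is only a rough sanity check.
print(f"Tokenizer count: {actual_count}")
```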

I presume the main reason models answer questions about themselves correctly at all is because AI developers very deliberately train them to do so. I bet that training doesn't directly involve introspection/strongly noting the relationship between the model's internal activations and the wider world.

So what could be going on? Maybe the way models learn to answer any questions about themselves generalizes? Or maybe introspection is specifically useful for answering those questions and instead of memorizing some facts about themselves, models learn to introspect (this could especially explain why they can articulate what they've been trained to do via self-awareness alone).

But I think the most likely dynamic is that in RL settings[2] introspection that affects the model's output is sometimes useful. Thus it is reinforced. For example, if you ask a reasoning model a question that's too hard for it to know the answer to, it could introspect to realize it doesn't know the answer (which might be more efficient than simply memorizing every question it does or doesn't know the answer to). Then it could articulate in the CoT that it doesn't know the answer, which would help it avoid hallucinating and ultimately produce the best output it could given the constraints.

One other possibility is the models are just that smart/self-aware and aligned towards being honest and helpful. They might have an extremely nuanced world-model, and since they're trained to honestly answer questions,[3] they could just put the pieces together and introspect (possibly in a hack-y or shallow way).

Overall these dynamics make introspection a very thorny thing to study. I worry it could go undetected in some models, or it could seem like a model can introspect in a meaningful way when it only has shallow abilities reinforced directly by processes like the above (for example, knowing when it doesn't know something [because that might have been learned during training], but not knowing in general how to query its internal knowledge on topics in other related ways).

  1. ^

    At least, not on any model I tried. They occasionally get it right by chance; they give plausible answers, just not precisely correct ones.

  2. ^

    Technically this could apply to fine-tuning settings too, for example if the model uses a CoT to improve its final answers enough to justify the CoT not being maximally likely tokens.

  3. ^

    In theory at least. In reality I think this training does occur but I don't know how well it can pinpoint honesty vs several things that are correlated with it (and for things like self-awareness those subtle correlates with truth in training data seem particularly pernicious).

GradientDissenter's Shortform
GradientDissenter · 4d

TurboTax and H&R Block famously lobby the US government to make taxes more annoying to file to drum up demand for their products.[1] But as far as I can tell, they each only spend ~$3-4 million a year on lobbying. That's... not very much money (contrast it with the $60 billion the government gave the IRS to modernize its systems, the $4.9 billion in revenue Intuit made last fiscal year from TurboTax, or the hundreds of millions of hours[2] of filing time that a return-free tax system could save).

Perhaps it would "just" take a multimillionaire and a few savvy policy folks to make the US tax system wildly better? Maybe TurboTax and H&R Block would simply up their lobbying budget if they stopped getting their way, but maybe they wouldn't. Even if they do, I think it's not crazy to imagine a fairly modest lobbying effort could beat them, since simpler tax filing seems popular across party lines/is rather obviously a good idea, and therefore may have an easier time making its case. Plus I wonder if pouring more money into lobbying hits diminishing returns at some point such that even a small amount of funding against TurboTax could go a long way.

Nobody seems to be trying to fight this. The closest things are an internal department of the IRS and some sporadic actions from broad consumer protection groups that don't particularly focus on this issue (for example, ProPublica wrote an amazing piece of investigative journalism in 2019 that includes gems like an internal Intuit slide).

In the meantime, the IRS just killed its pilot direct file program. While the program was far from perfect, it seemed to me like the best bet out there for eventually bringing the US to a simple return-free filing system, like the UK, Japan, and Germany use. It seems like a tragedy that the IRS sunset this program.[3]

In general, the amount of money companies spend on lobbying is often very low, and the harm to society that lobbying causes seems large. If anyone has examples of times folks tried standing up to corporate lobbying like this that didn't seem to involve much money, I'd love to know more about how that's turned out.

  1. ^

    I haven't deeply investigated how true this narrative is. It seems clear TurboTax/Intuit lobbies actively with this goal in mind, but it seems possible that policymakers are ignoring them and that filing taxes is hard for some other reason. That would at least explain why TurboTax and H&R Block spend so little here.

  2. ^

    I don't trust most sources that quote numbers like this. This number comes from this Brookings article from 2006, which makes up numbers just like everyone else but at least these numbers are made up by a respectable institution that doesn't have an obvious COI.

  3. ^

    In general, I love when the government lets the private sector compete and make products! I want TurboTax to keep existing, but it's telling that they literally made the government promise not to build a competitor. That seems like the opposite of open competition.

GradientDissenter's Shortform
GradientDissenter · 5d

There’s a cottage industry that thrives off of sneering, gawking, and maligning the AI safety community. This isn't new, but it's probably going to become more intense and pointed now that there are two giant super PACs that (allegedly[1]) see safety as a barrier to [innovation/profit, depending on your level of cynicism]. Brace for some nasty, uncharitable articles.

I think the largest cost of this targeted bad press will be the community's overreaction, not the reputational effects outside the AI safety community. I've already seen people shy away from doing things like donating to politicians that support AI safety for fear of provoking the super PACs.

Historically, the safety community often freaked out in the face of this kind of bad press. People got really stressed out, pointed fingers about whose fault it was, and started to let the strong frames in the hit pieces get into their heads.[2] People disavowed AI safety and turned to more popular causes. And the collective consciousness decided that the actions and people who ushered in the mockery were obviously terrible and dumb, so much so that you'd get a strange look if you asked them to justify that argument. In reality I think many actions that were publicly ridiculed were still worth it ex-ante despite the bad press.

It seems bad press is often much, much more salient to the subjects of that press than it is to society at large, and it's best to shrug it off and let it blow over. Some of the most PR-conscious people I know are weirdly calm during actual PR blowups and are sometimes more willing than the "weird" folks around me to take dramatic (but calculated) PR risks.

In the activist world, I hear this is a well-known phenomenon. You can get 10 people to protest a multi-billion-dollar company and a couple of journalists to write articles, and the company will bend to your demands.[3] The rest of the world will have no idea who you are, but to the executives at the company, it will feel like the world is watching them. These executives are probably making a mistake![4] Don't be like them.

With all these (allegedly anti-safety[1]) super PACs, there will probably be a lot more bad press than usual. All else being equal, avoiding the bad press is good, but fighting back will require people in the safety community to take some actions, and the super PACs will probably twist whatever they do into headlines about cringe doomer tech bros.

I do think people should take into account, when deciding what to do, that provoking the super PACs is risky, and should think carefully before doing so. But often I expect it will be the right choice and the blowback will be well worth it.

If people in the safety community refuse to stand up to them, then the super PACs will get what they want anyway and the safety community won't even put up a fight.

Ultimately I think the AI safety community is an earnest, scrupulous group of people fighting for an extremely important cause. I hope we continue to hold ourselves to high standards for integrity and honor, and as long as we do, I will be proud to be part of this community no matter what the super PACs say.

  1. ^

    They haven't taken any anti-safety actions yet as far as I know (they're still new). The picture they paint of themselves isn't opposed to safety, and while I feel confident they will take actions I consider opposed to safety, I don't like maligning people before they've actually taken actions worthy of condemnation.

  2. ^

    I think it's really healthy to ask yourself if you're upholding your principles and acting ethically. But I find it a little suspicious how responsive some of these attitudes can be to bad press, where people often start tripping over themselves to distance themselves from whatever the journalist happened to dislike. If you've ever done this, consider asking yourself before you take any action how you'd feel if the fact that you took that action was on the front page of the papers. If you'd feel like you could hold your head up high, do it. Otherwise don't. And then if you do end up on the front page of the papers, hold your head up high!

  3. ^

    To a point. They won't do things that would make them go out of business, but they might spend many millions of dollars on the practices you want them to adopt.

  4. ^

    Tactically, that is. In many cases I'm glad the executives can be held responsible in this way and I think their changed behavior is better for the world.

LWLW's Shortform
GradientDissenter · 6d

I don't understand how working on "AI control" here is any worse than working on AI alignment (I'm assuming you don't feel the same about alignment since you don't mention it).

In my mind, two different ways AI could cause bad things to happen are (1) misuse: people use the AI for bad things, and (2) misalignment: regardless of anyone's intent, the AI does bad things of its own accord.

Both seem bad. Alignment research and control are both ways to address misalignment problems; I don't see how they differ for the purposes of your argument (though maybe I'm failing to understand your argument).

Addressing misalignment slightly increases people's ability to misuse AI, but I think the effect is fairly small and outweighed by the benefit of decreasing the odds a misaligned AI takes catastrophic actions.

GradientDissenter's Shortform
GradientDissenter · 6d (edited)

The world seems bottlenecked on people knowing and trusting each other. If you're a trustworthy person who wants good things for the world, one of the best ways to demonstrate your trustworthiness is by interacting with people a lot, so that they can see how you behave in a variety of situations and they can establish how reasonable, smart, and capable you are. You can produce a lot of value for everyone involved by just interacting with people more.

I’m an introvert. My social skills aren't amazing, and my social stamina is even less so. Yet I drag myself to parties and happy hours and one-on-one chats because they pay off.

It's fairly common for me to go to a party and get someone to put hundreds of thousands of dollars towards causes I think are impactful, or to pivot their career, or to tell me a very useful, relevant piece of information I can act on. I think each of those things individually happens more than 15% of the time that I go to a party.

(Though this is only because I know of unusually good cause areas and career opportunities. I don't think I could get people to put money or time towards random opportunities. This is a positive-sum interaction where I'm sharing information!)

Even if talking to someone isn't valuable in the moment, knowing lots of people comes in really handy. Being able to directly communicate with lots of people in a high-bandwidth way lets you quickly orient to situations and get things done.

I try to go to every party I'm invited to that's liable to have new people, and I very rarely turn down an opportunity to chat with a new person. I give my calendar link out like candy. Consider doing the same!

Talking to people is hits-based

Often, people go to an event and try to talk to people but it isn't very useful, and they give up on the activity forever. Most of the time you go to an event it will not be that useful. But when it is useful, it's extremely useful. With a little bit of skill, you can start to guess what kinds of conversations and events will be most useful (it is often not the ones that are most flashy and high-status).

Building up trust takes time

Often when I get good results from talking to people, it's because I've already talked to them a few times at parties and I've established myself as a trustworthy person that they know.

Talking to people isn’t zero-sum

When I meet new people, I try to find ways I can be useful to them. (Knowing lots of people makes it easier to help other folks because often you can produce value by connecting people to each other.) And when I help the people I'm talking to, I'm also helping myself, because I am on the same team as them. I want things that are good for the world, and so do most other people. I'm not sure the strategy in this shortform would work at all if I were trying to trick investors into overvaluing my startup or convince people to work for me when that wasn't in their best interest.

I think this is the main way that "talking to people", as I'm using the term here, differs from "networking". 

Be genuine

When I talk to people, I try to be very blunt and earnest. I happen to like hanging out with people who are talented and capable, so I typically just try to find good conversations I enjoy. I build up friendships and genuine trust with people (by being a genuinely trustworthy person doing good things, not by trying to signal trust in complicated ways). I think I have good suggestions for things people should do with their money and time, and people are often very happy to hear these things.

Sometimes I do seek out specific people for specific reasons. If I’m only talking to someone because they have information/resources that are of interest to me, I try to directly (though tactfully) acknowledge that. Part of my vibe is that I'm weirdly goal-oriented/mission-driven, and I just wear that on my sleeve because I think the mission I drive towards is a good one.

I also try to talk to all kinds of folks and often purposefully avoid “high-status” people. In my experience, chasing them is usually a distraction anyway and the people in the interesting conversations are more worth talking to.

You can ask to be invited to more social events

When I encourage people to go to more social events, often they tell me that they're not invited to more. In my experience, messaging the person you know who is most into going to social events and asking if they can invite you to stuff works pretty well most of the time. Once you’re attending a critical mass of social events, you’ll find yourself invited to more and more until your calendar explodes.

Posts

- There's some chance oral herpes is pretty bad for you? (31 karma, 10d, 4 comments)
- Reflections on 4 years of meta-honesty (35 karma, 11d, 7 comments)
- Summary and Comments on Anthropic's Pilot Sabotage Risk Report (29 karma, 14d, 0 comments)
- Considerations around career costs of political donations (97 karma, 24d, 17 comments)
- A Review of Nina Panickssery’s Review of Scott Alexander’s Review of “If Anyone Builds It, Everyone Dies” (69 karma, 2mo, 25 comments)
- CoT May Be Highly Informative Despite “Unfaithfulness” [METR] (Ω, 64 karma, 3mo, 3 comments)
- METR's Evaluation of GPT-5 (Ω, 141 karma, 3mo, 15 comments)
- GradientDissenter's Shortform (2 karma, 10mo, 65 comments)
- LessWrong readers are invited to apply to the Lurkshop (101 karma, 3y, 41 comments)