In this post, I endorse forum participation (aka commenting) as a productive research strategy that I've managed to stumble upon, and recommend that others at least try it. Note that this is different from saying that forum/blog posts are a good way for a research community to communicate; it's about individually doing better as researchers.

I have a concept that I expect to take off in reinforcement learning. I don't have time to test it right now, though hopefully I'll find time later. Until then, I want to put it out here, either as inspiration for others, or as a "called it"/prediction, or as a way to hear critiques or learn about similar projects others might have made:

Reinforcement learning algorithms currently try to do things like learning to model the sum of their future rewards, e.g. expectations using V, A and Q functions in many algorithms, or the entire probability distribution in algorithms like DreamerV3. Mechanistically, the reason these methods work is that they stitch together experience from different trajectories. So e.g. if one trajectory goes A -> B -> C and earns a reward at the end, the agent learns that states A and B and C are valuable. If another trajectory goes D -> A -> E -> F and gets punished at the end, it learns that E and F are low-value but D and A are high-value, because its experience from the first trajectory shows that it could've just gone D -> A -> B -> C instead.

But what if it learns of a path E -> B? Or a shortcut A -> C? Or a path F -> G that gives a huge amount of reward? Because these techniques work by chaining the reward backwards step by step, it seems like this would be hard to learn well; the Bellman equation will still be approximately satisfied, for instance.

Ok, so that's the problem, but how could it be fixed? Speculation time: you want to learn an embedding of the opportunities you have in a given state (or for a given state-action), rather than just its potential rewards. Rewards are too sparse a signal. More formally, instead of the Q function, consider what I would call the Hope function, which, given a state-action pair (s, a), gives you a distribution over the states it expects to visit, weighted by the rewards it will get. This can still be phrased using the Bellman equation:

Hope(s, a) = r·s' + f·Hope(s', a')

where s' is the resulting state that experience has shown comes after s when doing a, r is the reward obtained on that transition, f is the discounting factor, and a' is the optimal action in s'. Because the Hope function is multidimensional, the learning signal is much richer, and one should therefore maybe expect its internal activations to be richer and more flexible in the face of new experience.

Here's another thing to notice: let's say for the policy, we use the Hope function as a target to feed into a decision transformer. We now have a natural parameterization, based on which Hope it pursues. In particular, we could define another function, maybe called the Result function, which in addition to s and a takes a target distribution w as a parameter, subject to the Bellman equation:

Result(s, a, w) = r·s' + f·Result(s', a', (w - r·s')/f)

where a' is the action recommended by the decision transformer when asked to achieve (w - r·s')/f from state s'. This Result function ought to be invariant under many changes in policy, which should make it more stable to learn, boosting capabilities. Furthermore, it seems like a win for interpretability and alignment, as it gives greater feedback on how the AI intends to earn rewards, and better ability to control those rewards.

An obvious challenge with this proposal is that states are really latent variables and are also too complex to learn distributions over. While this is true, it can probably be hacked around by learning some clever embeddings or doing some clever approximations or something.
Maybe something as simple as predicting a weighted average of the raw pixel observations that occur as rewards are obtained would do, though in practice I expect that to be too blurry. Also, this mindset seems to pave the way for other approaches; e.g. you could maybe have a Halfway function that factors an ambitious hope into smaller ones, or something like that. Though it's a bit tricky, because one needs to distinguish correlation and causation.
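
To make the recursion concrete, here is a minimal tabular sketch of the Hope update. The setup is my own assumption (a small discrete MDP, a fixed greedy policy, Hope stored as an explicit vector over states), not something specified in the quick take:

```python
import numpy as np

# Tabular sketch of the Hope recursion, under assumptions of my own (not the
# post's): a small discrete MDP with n_states states, a fixed greedy policy,
# and Hope(s, a) stored as an explicit vector over states. The update mirrors
#   Hope(s, a) = r * onehot(s') + f * Hope(s', a'),
# so summing the Hope vector over states recovers an ordinary Q-style scalar.

n_states, n_actions, f = 5, 2, 0.9  # f is the discount factor, as in the post
hope = np.zeros((n_states, n_actions, n_states))

def hope_td_update(s, a, r, s_next, a_next, lr=0.1):
    """One TD-style update of the Hope table from a transition (s, a, r, s', a')."""
    target = r * np.eye(n_states)[s_next] + f * hope[s_next, a_next]
    hope[s, a] += lr * (target - hope[s, a])

def implied_q(s, a):
    """Scalar value implied by the Hope vector: its total reward-weighted mass."""
    return hope[s, a].sum()
```

The Result recursion could be sketched analogously by carrying the target vector w along and replacing it with (w - r·s')/f after each step.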
4yanni2d
I like the fact that, even though they were not (relatively) young when they died, the LW banner states that Kahneman & Vinge have died "FAR TOO YOUNG", pointing to the fact that death is always bad and/or that it is bad when people die while they are still making positive contributions to the world (Kahneman published "Noise" in 2021!).
A strange effect: I'm using a GPU in Russia right now, which doesn't have access to copilot, and so when I'm on vscode I sometimes pause expecting copilot to write stuff for me, and then when it doesn't I feel a brief amount of the same kind of sadness I feel when a close friend is far away & I miss them.
Novel Science is Inherently Illegible

Legibility, transparency, and open science are generally considered positive attributes, while opacity, elitism, and obscurantism are viewed as negative. However, increased legibility in science is not always beneficial and can often be detrimental. Scientific management, with some exceptions, likely underperforms compared to simpler heuristics such as giving money to smart people or implementing grant lotteries. Scientific legibility suffers from the classic "Seeing like a State" problems: it constrains endeavors to the least informed stakeholder, hinders exploration, inevitably biases research to be simple and myopic, and exposes researchers to a constant political tug-of-war between different interest groups, poisoning objectivity.

I think the above would be considered relatively uncontroversial in EA circles. But I posit there is something deeper going on: novel research is inherently illegible. If it were legible, someone else would have already pursued it. As science advances, her concepts become increasingly counterintuitive and further from common sense. Most of the legible low-hanging fruit has already been picked, and novel research requires venturing higher into the tree, pursuing illegible paths with indirect and hard-to-foresee impacts.

Popular Comments

Recent Discussion

2Neel Nanda2h
What banner?

They took it down real quick for some reason.

3yanni14h
I have heard rumours that an AI Safety documentary is being made. Separately, a good friend of mine is also seriously considering making one, but he isn't "in" AI Safety. If you know who this first group is and can put me in touch with them, it might be worth getting across each other's plans.
1Neil 14h
This reminds me of when Charlie Munger died at 99, and many said of him "he was just a child". Less of a nod to transhumanist aspirations, and more to how he retained his sparkling energy and curiosity up until death. There are quite a few good reasons to write "dead far too young". 
g-w1

Hey, so I wanted to start this dialogue because we were talking on Discord about the secondary school systems and college admission processes in the US vs NZ, and some of the differences were very surprising to me.

I think that it may be illuminating to fellow Americans to see the variation in pedagogy. Let's start off with grades. In America, the way school works is that you sit in class and then have projects and tests that go into a gradebook. Roughly speaking, each assignment has a maximum number of points you can earn. Your final grade for a subject is the total points you earned divided by the total points possible. Every school has a different way of doing the grading though. Some use A-F, while some use a number out of 4, 5, or 100. Colleges then

...
6Yair Halberstadt7h
I believe that the US is nearly unique in not having national assessments. Certainly in both the UK and Israel, most exams with some impact on your future life are externally marked, and those few that are not are audited. From my perspective the US system seems batshit insane; I'd be interested in what a steelman of "have teachers arbitrarily grade the kids then use that to decide life outcomes" could be?

Another huge difference between the education system in the US and elsewhere is the undergraduate/postgraduate distinction. Pretty much everywhere else an undergraduate degree is focused on a specific field and is meant to teach you well enough to immediately get a job in that field. When 3 years isn't enough for that, the length of the degree is increased by a year or two and you come out with a masters or a doctorate at the end. For example my wife took a 4 year course and now has a master's in pharmacy, allowing her to work as a pharmacist. Friends took a 5 or 6 year course (depending on the university) and are now doctors. Second degrees are pretty much only necessary if you want to go into academia or research.

Meanwhile in the US it seems that all an undergraduate degree means is that you took enough courses in anything you want to get a certificate, and then you have to go on to a postgraduate course to actually learn the stuff that's relevant to your particular career. 8 years total seems to be standard to become a doctor in the US, yet graduating doctors actually have a year or two less medical training than doctors in the UK. This seems like a total deadweight loss.

I'd be interested in what a steelman of "have teachers arbitrarily grade the kids then use that to decide life outcomes" could be?

The best argument I have thought of is that America loves liberty and hates centralized control. They want to give individual states, districts, schools, and teachers the most power they can have, as that is a central part of America's philosophy. Also, anecdotally, some teachers have said that they hate standardized tests because they have to teach to them. And I hate being taught to the test (like APs, for example). It's much mo... (read more)

2Yair Halberstadt7h
The way the auditing works in the UK is as follows: students are given an assignment with a strict grading rubric. This rubric is open, and students are allowed to read it. The rubric details exactly what needs to be done to gain each mark. Interestingly, even students who read the rubric often fail to get these marks.

Teachers then grade the coursework against the rubric. Usually two pieces of coursework from each school are randomly selected for review. If the external grader finds the marks more than 2 points off, all of the coursework is remarked externally.

The biggest problem with this system is that experienced teachers will carefully go over the grading rubric with their students and explain precisely what needs to be done to gain each mark. They will then read through drafts of the coursework and point out which marks the student is failing to get. When they mark the final coursework they will add exactly one point to the total. Meanwhile less experienced teachers don't actually understand what the marking rubric means. They will pattern match the student's response to the examples in the rubric and give their students too high a mark. It will then be regraded externally and the students will end up with a far lower grade than they had expected.

Thus much of the difference in grades between schools is explainable by the difference in teacher quality/experience. This is bad for courses which are mostly graded by coursework, but fortunately most academic subjects are 90% written exams.


Please don’t feel like you “won’t be welcome” just because you’re new to ACX/EA or demographically different from the average attendee. You'll be fine!

Exact location: https://plus.codes/8CCGPRJW+V8

We meet on top of a small hill East of the Linha d'Água café in Jardim Amália Rodrigues. For comfort, bring sunglasses and a blanket to sit on. There is some natural shade. Also, it can get quite windy, so bring a jacket.

(Location might change due to weather)

[This is part of a series I’m writing on how to convince a person that AI risk is worth paying attention to.] 

tl;dr: People’s default reaction to politics is not taking them seriously. They could center their entire personality on their political beliefs, and still not take them seriously. To get them to take you seriously, the quickest way is to make your words as unpolitical-seeming as possible. 

I’m a high school student in France. Politics in France are interesting because they’re in a confusing superposition. One second, you'll have bourgeois intellectuals sipping red wine from their Paris apartment writing essays with dubious sexual innuendos on the deep-running dynamics of power. The next, 400 farmers will vaguely agree with the sentiment and dump 20 tons of horse manure in downtown...

Interesting, and very well written. Because you have access to particularly funny examples, you show very well how much politics is an empty status game.
 

I should probably point out that five years ago, I was a high school student in France, felt more or less the way you do, and went on to study political science at college (I don’t even need to say which college I’m talking about, do I?). It is a deep truth that politics is very unserious for most people, and that is perhaps most true for first-year political science students (or, god forbid, the sor... (read more)

6Shankar Sivarajan10h
Yes, every four years, if the good guys don't win the next (US) presidential election. Or if people don't switch to/away from nuclear power. Or they're killed by immigrants/cops. Or they die of a fentanyl overdose. Or in a school shooting. Or if the Iraqis/Russians/Chinese invade. Or if taxes are lowered/raised. Perhaps telling people they or their children are going to die imminently isn't a standard tactic of "mere politics" where you are; you did say you're not American.
1Neil 3h
Concept creep is a bastard. >:(
7Neil 14h
More French stories: So, at some point, the French decided what kind of political climate they wanted. What actions would reflect on their cause well? Dumping manure onto the city center using tractors? Sure! Lining up a hundred stationary taxi cabs in every main artery of the city? You bet! What about burning down the city hall's door, which is a work of art older than the United States? Mais évidemment! "Politics" evokes all that in the mind of your average Frenchman. No, not sensible strategies that get your goals done, but the first shiny thing the protesters thought about. It'd be more entertaining to me, except for the fact that I had to skip class at some point because I accidentally biked headfirst into a burgeoning cloud of tear gas (which the cops had detonated in an attempt to ward off the tractors). There are flagpoles in front of the government building those tractors dumped the manure on. They weren't entirely clean, and you can still see the manure level, about 10 meters high. 
8Charlie Steiner17h
Dictionary/SAE learning on model activations is bad as anomaly detection because you need to train the dictionary on a dataset, which means the anomaly would have needed to be in the training set. How to do dictionary learning without a dataset? One possibility is to use uncertainty-estimation-like techniques to detect when the model "thinks it's on-distribution" for randomly sampled activations.

You may be able to notice data points where the SAE performs unusually badly at reconstruction? (Which is what you'd see if there's a crucial missing feature)
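
A minimal sketch of that reconstruction-error heuristic, assuming an `sae` callable that maps activations to their reconstructions and a threshold calibrated on trusted data (both assumptions of mine, not from the thread):

```python
import torch

# Sketch of the reconstruction-error heuristic: points the SAE reconstructs
# unusually badly are flagged as candidate anomalies. `sae(acts)` is assumed
# to return reconstructions of the activations.

def reconstruction_errors(sae, acts):
    """Per-example mean squared reconstruction error of the SAE."""
    with torch.no_grad():
        return ((acts - sae(acts)) ** 2).mean(dim=-1)

def flag_anomalies(sae, acts, trusted_acts, quantile=0.999):
    """Flag activations the SAE reconstructs much worse than trusted data."""
    threshold = torch.quantile(reconstruction_errors(sae, trusted_acts), quantile)
    return reconstruction_errors(sae, acts) > threshold
```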

5Erik Jenner16h
I think this is an important point, but IMO there are at least two types of candidates for using SAEs for anomaly detection (in addition to techniques that make sense for normal, non-sparse autoencoders):

1. Sometimes, you may have a bunch of "untrusted" data, some of which contains anomalies. You just don't know which data points have anomalies on this untrusted data. (In addition, you have some "trusted" data that is guaranteed not to have anomalies.) Then you could train an SAE on all data (including untrusted) and figure out what "normal" SAE features look like based on the trusted data.
2. Even for an SAE that's been trained only on normal data, it seems plausible that some correlations between features would be different for anomalous data, and that this might work better than looking for correlations in the dense basis. As an extreme version of this, you could look for circuits in the SAE basis and use those for anomaly detection.

Overall, I think that if SAEs end up being very useful for mech interp, there's a decent chance they'll also be useful for (mechanistic) anomaly detection (a lot of my uncertainty about SAEs applies to both possible applications). Definitely uncertain though, e.g. I could imagine SAEs that are useful for discovering interesting stuff about a network manually, but whose features aren't the right computational units for actually detecting anomalies. I think that would make SAEs less than maximally useful for mech interp too, but probably non-zero useful.
4Charlie Steiner11h
Yeah, this seems somewhat plausible. If automated circuit-finding works it would certainly detect some anomalies, though I'm uncertain if it's going to be weak against adversarial anomalies relative to regular ol' random anomalies.
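
As a crude illustration of the feature-correlation idea in Erik Jenner's second point, here is one possible instantiation (mine, not anything proposed in the thread): track which feature pairs ever co-activate on trusted data, and score a new point by how many previously unseen co-activations it exhibits.

```python
import torch

# My own crude instantiation of the feature-correlation idea: `encode(acts)` is
# assumed to return SAE feature activations of shape (n_examples, n_features),
# and a point is scored by how many feature pairs it co-activates that never
# co-activated on trusted data.

def coactivation_pairs(encode, trusted_acts, thresh=0.0):
    """Boolean (n_features, n_features) matrix of pairs ever co-active on trusted data."""
    active = (encode(trusted_acts) > thresh).float()
    return (active.T @ active) > 0

def novel_coactivation_count(encode, x, trusted_pairs, thresh=0.0):
    """Count feature pairs active on a single example x but never co-active on trusted data."""
    active = (encode(x.unsqueeze(0)) > thresh).float().squeeze(0)
    pairs = torch.outer(active, active) > 0
    return (pairs & ~trusted_pairs).sum().item()
```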

Cross-posted on the EA Forum. This article is the fourth in a series of ~10 posts comprising a 2024 State of the AI Regulatory Landscape Review, conducted by the Governance Recommendations Research Program at Convergence Analysis. Each post will cover a specific domain of AI governance (e.g. incident reporting, safety evals, model registries, etc.). We’ll provide an overview of existing regulations, focusing on the US, EU, and China as the leading governmental bodies currently developing AI legislation. Additionally, we’ll discuss the relevant context behind each domain and conduct a short analysis.

This series is intended to be a primer for policymakers, researchers, and individuals seeking to develop a high-level overview of the current AI governance space. We’ll publish individual posts on our website and release a comprehensive report at the end of this...

An entry-level characterization of some types of guy in decision theory, and in real life, interspersed with short stories about them

A concave function bends down. A convex function bends up. A linear function does neither.

A utility function is just a function that says how good different outcomes are. They describe an agent's preferences. Different agents have different utility functions.

Usually, a utility function assigns scores to outcomes or histories, but in this article we'll define a sort of utility function that takes the quantity of resources the agent has control over, and says how good an outcome the agent could attain using that quantity of resources.

In that sense, a concave agent values resources less the more that it has, eventually barely wanting more resources at...
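
A minimal numerical illustration of the distinction; the square-root and squaring functions are stand-ins I'm choosing, not examples from the post:

```python
import numpy as np

# Utility as a function of resources controlled: sqrt is concave (bends down),
# squaring is convex (bends up). The stand-in functions are my own choice.
resources = np.arange(1.0, 6.0)   # 1, 2, 3, 4, 5 units of resources
concave_u = np.sqrt(resources)    # each extra unit adds less utility
convex_u = resources ** 2         # each extra unit adds more utility

print(np.diff(concave_u))  # shrinking increments: diminishing returns to resources
print(np.diff(convex_u))   # growing increments: escalating returns to resources
```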

5Donald Hobson11h
The convex agent can be traded with a bit more than you think. A 1 in 10^50 chance of us standing back and giving it free rein of the universe is better than us going down fighting and destroying 1kg as we do. The concave agents are less cooperative than you think, maybe. I suspect that to some AIs, killing all humans now is more reliable than letting them live. If the humans are left alive, who knows what they might do. They might make the vacuum bomb. Whereas the AI can very reliably kill them now.
2mako yass9h
Alternate phrasing: "Oh, you could steal the townhouse at a 1/8 billion probability? How about we make a deal instead. If the rng rolls a number lower than 1/7 billion, I give you the townhouse; otherwise, you deactivate and give us back the world." The convex agent finds that to be a much better deal, accepts, then deactivates. I guess perhaps it was the holdout who was being unreasonable, in the previous telling.

Or the sides can't make that deal because one side or both wouldn't hold up their end of the bargain. Or they would, but they can't prove it. Once the coin lands, the losing side has no reason to follow it other than TDT. And TDT only works if the other side can reliably predict their actions.

Summary: The post describes a method that allows us to use an untrustworthy optimizer to find satisficing outputs.

Acknowledgements: Thanks to Benjamin Kolb (@benjaminko), Jobst Heitzig (@Jobst Heitzig) and Thomas Kehrenberg (@Thomas Kehrenberg)  for many helpful comments.

Introduction

Imagine you have black-box access to a powerful but untrustworthy optimizing system, the Oracle. What do I mean by "powerful but untrustworthy"? I mean that, when you give an objective function f as input to the Oracle, it will output an element x that has an impressively low[1] value of f(x). But sadly, you don't have any guarantee that it will output the optimal element and e.g. not one that's also chosen for a different purpose (which might be dangerous for many reasons, e.g. instrumental convergence).

What questions can you safely ask the Oracle? Can you use it to...

3Donald Hobson12h
I think that if what you want out of the oracle is a formally verified proof of some maths theorem, then this is getting towards being actually likely not to kill you. You can start with m huge, and slowly turn it down, so you get a long list of "no results", followed by a proof. (Where the optimizer only had a couple of bits of free optimization in choosing which proof.) Depending on exactly how chaos theory and quantum randomness work, even 1 bit of malicious super-optimization could substantially increase the chance of doom. And of course, there are side-channel attacks, and hacking out of the computer. And producing formal proofs isn't pivotal.
1Simon Fischer4h
Yes, I believe that's within reach using this technique. This is quite dangerous though if the Oracle is deceptively withholding answers; I commented on this in the last paragraph of this section.

If the oracle is deceptively withholding answers, give up on using it. I had taken the description to imply that the oracle wasn't doing that. 

2EGI13h
"...under the assumption that the subset of dangerous satisficing outputs D is much smaller than the set of all satisficing outputs S, and that we are able to choose a number m such that |D|≪m<|S|." I highly doubt that  D≪S is true for anything close to a pivotal act since most pivotal acts at some point involve deploying technology that can trivially take over the world. For anything less ambitious the proposed technique looks very useful. Strict cyber- and physical security will of course be necessary to prevent the scenario Gwern mentions.
