All of Mark Xu's Comments + Replies

Prizes for ELK proposals

We generally imagine that it’s impossible to map the predictor's net directly to an answer because the predictor is thinking in terms of different concepts, so it has to map to the human's nodes first in order to answer human questions about diamonds and such.

1brglnd2dI see, thanks for answering. To further clarify, given the reporter's only access to the human's nodes is through the human's answers, would it be equally likely for the reporter to create a mapping to some other Bayes net that is similarly consistent with the answers provided? Is there a reason why the reporter would map to the human's Bayes net in particular?
Prizes for ELK proposals

The SmartFabricator seems basically the same. In the robber example, you might imagine the SmartVault is the one that puts up the screen to conceal the fact that it let the diamond get stolen.

Prizes for ELK proposals

A different way of phrasing Ajeya's response, which I think is roughly accurate, is that if you have a reporter that gives consistent answers to questions, you've learned a fact about the predictor, namely "the predictor was such that when it was paired with this reporter it gave consistent answers to questions." If there were 8 predictors for which this fact was true, then "it's the [7th] predictor such that when it was paired with this reporter it gave consistent answers to questions" is enough information to uniquely determine the reporter, e.g. the previ... (read more)
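The counting point can be made concrete: picking one predictor out of 8 by its index costs only log2(8) = 3 bits of information on top of the consistency fact itself. A minimal sketch (the number 8 comes from the comment's hypothetical; nothing else is from the source):

```python
import math

# If 8 predictors are consistent with a given reporter, then specifying
# "the 7th such predictor" takes log2(8) = 3 bits beyond the consistency
# fact itself -- a tiny amount of extra information.
num_consistent_predictors = 8
bits_to_pick_one_out = math.log2(num_consistent_predictors)
print(bits_to_pick_one_out)  # 3.0
```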

1Quintin Pope13dIf you want, you can slightly refactor my proposal to include a reporter module that takes the primary model's hidden representations as input and outputs more interpretable representations for the student models to use. That would leave the primary model's training objective unchanged. However, I don't think this is a good idea for much the same reason that training just the classification head of a pretrained language model isn't a good idea. However, I think training the primary model to be interpretable to other systems may actually improve economic competitiveness. The worth of a given approach depends on the ratio of capabilities to compute required. If you have a primary model whose capabilities are more easily distilled into smaller models, that's an advantage from a competitiveness standpoint. You can achieve better performance on cheaper models compared to competitors. I think people are FAR too eager to assume a significant capabilities/interpretability tradeoff. In a previous post [] , I used analogies to the brain to argue that there's enormous room to improve the interpretability of existing ML systems with little capabilities penalty. To go even further, more interpretable internal representations may actually improve learning. ML systems face their own internal interpretability problems. To optimize a system, gradient descent needs to be able to disentangle which changes will benefit vs harm the system's performance. This is a form of interpretability, though not one we often consider. Being "interpretable to gradient descent" is very different from being "interpretable to humans". However, most of my proposal focuses on making the primary model generally interpretable to many different systems, with humans as a special case. I think being more interpretable may directly lead to being easier to optimize. Intuitively, it seems easier to improve a
Prizes for ELK proposals

There is a distinction between the way that the predictor is reasoning and the way that the reporter works. Generally, we imagine that the predictor is trained the same way the "unaligned benchmark" we're trying to compare to is trained, and the reporter is the thing that we add onto that to "align" it (perhaps by only training another head on the model, perhaps by finetuning). Hopefully, the cost of training the reporter is small compared to the cost of the predictor (maybe like 10% or something).

In this frame, doing anything to train the way the pred... (read more)

ARC's first technical report: Eliciting Latent Knowledge

I think that problem 1 and problem 2 as you describe them are potentially talking about the same phenomenon. I'm not sure I'm understanding correctly, but I think I would make the following claims:

  • Our notion of narrowness is that we are interested in solving the problem where the question we're asking is such that a state always resolves a question. E.g. there isn't any ambiguity around whether a state "really contains a diamond". (Note that there is ambiguity around whether the human could detect the diamond from any set of observations because there co
... (read more)
1Ramana Kumar20dThis "there isn't any ambiguity"+"there is ambiguity" does not seem possible to me: these types of ambiguity are one and the same. But it might depend on what “any set of observations” is allowed to include. “Any set” suggests being very inclusive, but remember that passive observation is impossible. Perhaps the observations I’d want the human to use to figure out if the diamond is really there (presuming there isn’t ambiguity) would include observations you mean to exclude, such as disabling the filter-nanobots first? I guess a wrinkle here is that observations need to be “implementable” in the world. If we’re thinking of making observations as intervening on the world (e.g., to decide which sensors to query), then some observations may be inaccessible because we can’t make that intervention. Rewriting this all without relying on “possible”/”can” concepts would be instructive.
3Charlie Steiner25dI think this statement encapsulates some worries I have. If it's important how the human defines a property like "the same diamond," then assuming that the sameness of the diamond is "out there in the diamond" will get you into trouble - e.g. if there's any optimization pressure to find cases where the specifics of the human's model rear their head. Human judgment is laden with the details of how humans model the world, you can't avoid dependence on the human (and the messiness that entails) entirely. Or to phrase it another way: I don't have any beef with a narrow approach that says "there's some set of judgments for which the human is basically competent, and we want to elicit knowledge relevant to those judgments." But I'm worried about a narrow approach that says "let's assume that humans are basically competent for all judgments of interest, and keep assuming this until something goes wrong." It just feels to me like this second approach is sort of... treating the real world as if it's a perturbative approximation to the platonic realm.
ARC's first technical report: Eliciting Latent Knowledge

Thanks for your proposal! We have considered similar strategies in the past. The main points of the breaker response would be:

  • How are you training this "terrifier"? If the terrifier moves that can be made during training are from held-out sensors that you have constructed, then the "bad terrifier" you might learn doesn't generalize to thinking of any new sensors, only ever considering the sensors that you used as ground truth during training. Thus a sufficiently clever robber might trick all of the sensors-used-during-training (which may not exist during
... (read more)
ARC's first technical report: Eliciting Latent Knowledge

My point is either that:

  • it will always be possible to find such an experiment for any action, even desirable ones, because the AI will have defended the diamond in a way the human didn't understand or the AI will have deduced some property of diamonds that humans thought they didn't have
  • or there will be some tampering for which it's impossible to find an experiment, because in order to avoid the above problem, you will have to restrict the space of experiments
ARC's first technical report: Eliciting Latent Knowledge

Thanks for your proposal! I'm not sure I understand how the "human is happy with experiment" part is supposed to work. Here are some thoughts:

  • Eventually, it will always be possible to find experiments where the human confidently predicts wrongly. Situations I have in mind are ones where your AI understands the world far better than you, so can predict that e.g. combining these 1000 chemicals will produce self-replicating protein assemblages, whereas the human's best guess is going to be "combining 1000 random chemicals doesn't do anything"
  • If the human
... (read more)
1Ramana Kumar1moThanks for the reply! I think you’ve understood correctly that the human rater needs to understand the proposed experiment – i.e., be able to carry it out and have a confident expectation about the outcome – in order to rate the proposer highly. Here’s my summary of your point: for some tampering actions, there are no experiments that a human would understand in the above sense that would expose the tampering. Therefore that kind of tampering will result in low value for the experiment proposer (who has no winning strategy), and get rated highly. This is a crux for me. I don’t yet believe such tampering exists. The intuition I’m drawing on here is that our beliefs about what world we’re in need to cash out in anticipated experiences. Exposing confusion about something that shouldn’t be confusing can be a successful proposer strategy. I appreciate your examples of “a fake diamond that can only be exposed by complex imaging techniques” and “a human making subtly different moral judgements” and will ponder them further. Your comment also helped me realise another danger of this strategy: to get the data for training the experiment proposer, we have to execute the SmartVault actions first. (Whereas I think in the baseline scheme they don’t have to be executed.)
ARC's first technical report: Eliciting Latent Knowledge

We don't think that real humans are likely to be using Bayes nets to model the world. We make this assumption for much the same reasons that we assume models use Bayes nets, namely that it's a test case where we have a good sense of what we want a solution to ELK to look like. We think the arguments given in the report will basically extend to more realistic models of how humans reason (or rather, we aren't aware of a concrete model of how humans reason for which the arguments don't apply).

If you think there's a specific part of the report where the human Bayes net assumption seems crucial, I'd be happy to try to give a more general form of the argument in question.

The Plan

Agreed, but the thing you want to use this for isn’t simulating a long reflection, which will fail (in the worst case) because HCH can’t do certain types of learning efficiently.

2johnswentworth1moOnce we get past Simulated Long Reflection, there's a whole pile of Things To Do With AI which strike me as Probably Doomed on general principles. You mentioned using HCH to "let humans be epistemically competitive with the systems we're trying to train", which definitely falls in that pile. We have general principles saying that we should definitely not rely on humans being epistemically competitive with AGI; using HCH does not seem to get around those general principles at all. (Unless we buy some very strong hypotheses about humans' skill at factorizing problems, in which case we'd also expect HCH to be able to simulate something long-reflection-like.) Trying to be epistemically competitive with AGI is, in general, one of the most difficult use-cases one can aim for. For that to be easier than simulating a long reflection, even for architectures other than HCH-emulators, we'd need some really weird assumptions.
The Plan

I want to flag that HCH was never intended to simulate a long reflection. Its main purpose (which it fails in the worst case) is to let humans be epistemically competitive with the systems you’re trying to train.

8johnswentworth1moI mean, we have this thread [] with Paul directly saying "If all goes well you can think of it like 'a human thinking a long time'", plus Ajeya and Rohin both basically agreeing with that.
Biology-Inspired AGI Timelines: The Trick That Never Works

The way that you would think about NN anchors in my model (caveat that this isn't my whole model):

  • You have some distribution over 2020-FLOPS-equivalent that TAI needs.
  • Algorithmic progress means that 20XX-FLOPS convert to 2020-FLOPS-equivalent at some 1:N ratio.
  • The function from 20XX to the 1:N ratio is relatively predictable, e.g. a "smooth" exponential with respect to time.
  • Therefore, even though current algorithms will hit DMR, the transition to the next algorithm that has less DMR is also predictably going to be some constant ratio better at convert
... (read more)
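One way to make the first three bullets concrete is as a toy calculation; the 2.5-year halving time and all numbers below are invented placeholders, not estimates from the thread:

```python
# Sketch of the bullets above, with invented parameters: assume algorithmic
# progress halves the compute needed for fixed performance every
# `halving_years` years, so FLOPs spent in year Y convert to
# 2020-FLOP-equivalents at ratio N = 2 ** ((Y - 2020) / halving_years).
def conversion_ratio(year, halving_years=2.5):
    """2020-FLOP-equivalents bought per physical FLOP in `year`."""
    return 2 ** ((year - 2020) / halving_years)

def effective_2020_flops(physical_flops, year, halving_years=2.5):
    return physical_flops * conversion_ratio(year, halving_years)

print(conversion_ratio(2030))            # 2**4 = 16.0
print(effective_2020_flops(1e26, 2030))  # 1.6e+27
```

On this model, the prediction is about the smoothness of `conversion_ratio` over time, not about any particular algorithm avoiding diminishing marginal returns.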
2Vanessa Kosoy1moI don't understand this. * What is the meaning of "2020-FLOPS-equivalent that TAI needs"? Plausibly you can't build TAI with 2020 algorithms without some truly astronomical amount of FLOPs. * What is the meaning of "20XX-FLOPS convert to 2020-FLOPS-equivalent"? If 2020 algorithms hit DMR, you can't match a 20XX algorithm with a 2020 algorithm without some truly astronomical amount of FLOPs. Maybe you're talking about extrapolating the compute-performance curve, assuming that it stays stable across algorithmic paradigms (although, why would it??) However, in this case, how do you quantify the performance required for TAI? Do we have "real life elo" for modern algorithms that we can compare to human "real life elo"? Even if we did, this is not what Cotra is doing with her "neural anchor".
Biology-Inspired AGI Timelines: The Trick That Never Works

My model is something like:

  • For any given algorithm, e.g. SVMs, AlphaGo, alpha-beta pruning, convnets, etc., there is an "effective compute regime" where dumping more compute makes them better. If you go above this regime, you get steep diminishing marginal returns.
  • In the (relatively small) regimes of old algorithms, new algorithms and old algorithms perform similarly. E.g. with small amounts of compute, using AlphaGo instead of alpha-beta pruning doesn't get you that much better performance than like an OOM of compute (I have no idea if this is true, ex
... (read more)
5Vanessa Kosoy2moHmm... Interesting. So, this model says that algorithmic innovation is so fast that it is not much of a bottleneck: we always manage to find the best algorithm for given compute relatively quickly after this compute becomes available. Moreover, there is some smooth relation between compute and performance assuming the best algorithm for this level of compute. [EDIT: The latter part seems really suspicious though, why would this relation persist across very different algorithms?] Or at least this is true if "best algorithm" is interpreted to mean "best algorithm out of some wide class of algorithms s.t. we never or almost never managed to discover any algorithm outside of this class". This can justify biological anchors as upper bounds[1]: if biology is operating using the best algorithm then we will match its performance when we reach the same level of compute, whereas if biology is operating using a suboptimal algorithm then we will match its performance earlier. However, how do we define the compute used by biology? Moravec's estimate is already in the past and there's still no human-level AI. Then there is the "lifetime" anchor from Cotra's report which predicts a very short timeline. Finally, there is the "evolution" anchor which predicts a relatively long timeline. However, in Cotra's report most of the weight is assigned to the "neural net" anchors which talk about the compute for training an ANN of brain size using modern algorithms (plus there is the "genome" anchor in which the ANN is genome-sized). This is something that I don't see how to justify using Mark's model. On Mark's model, modern algorithms might very well hit diminishing returns soon, in which case we will switch to different algorithms which might have a completely different compute(parameter count) function.

1. Assuming evolution also cannot discover algorithms outside our class o
johnswentworth's Shortform

In general, Baumol-type effects (spending decreasing in sectors where productivity goes up) mean that we can have scenarios in which the economy is growing extremely fast on "objective" metrics like energy consumption, but GDP has stagnated because that energy is being spent on extremely marginal increases in goods being bought and sold.
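A toy two-sector illustration of the Baumol point, with invented numbers: when goods are consumed as complements, a 100x productivity jump in one sector collapses its price, spending shifts almost entirely to the stagnant sector, and final output barely doubles.

```python
# Toy Baumol-effect economy (all numbers invented). Two final goods consumed
# in fixed proportions (Leontief preferences); labor supply is 1, wage is 1,
# so nominal GDP (total spending = labor income) is always 1.

def economy(prod_a, prod_b, labor=1.0):
    # Competitive prices equal unit labor cost: p_i = 1 / prod_i (wage = 1).
    p_a, p_b = 1.0 / prod_a, 1.0 / prod_b
    # The Leontief consumer buys equal quantities q of each good,
    # spending all income: q * (p_a + p_b) = labor.
    q = labor / (p_a + p_b)
    spend_a, spend_b = q * p_a, q * p_b
    return q, spend_a, spend_b

q0, spend_a0, _ = economy(1.0, 1.0)    # q = 0.5, spending split 50/50
q1, spend_a1, _ = economy(100.0, 1.0)  # q ~ 0.99, spending on A ~ 1%
print(q0, spend_a0)
print(q1, spend_a1)
```

Sector A's physical output roughly doubles despite 100x productivity growth, and its spending share falls from 50% to about 1%: growth is bottlenecked by the stagnant sector.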

johnswentworth's Shortform

A similar point is made by Korinek in his review of Could Advanced AI Drive Explosive Economic Growth:

My first reaction to the framing of the paper is to ask: growth in what? It’s important to keep in mind that concepts like “gross domestic product” and “world gross domestic product” were defined from an explicit anthropocentric perspective - they measure the total production of final goods within a certain time period. Final goods are what is either consumed by humans (e.g. food or human services) or what is invested into “capital goods” that last for m

... (read more)
Intermittent Distillations #4: Semiconductors, Economics, Intelligence, and Technological Progress.

Yeah that seems like a reasonable example of a good that can't be automated.

I think I'm mostly interested in whether these sorts of goods that seem difficult to automate will be a pragmatic constraint on economic growth. It seems clear that they'll eventually be ultimate binding constraints as long as we don't get massive population growth, but it's a separate question about whether or not they'll start being constraints early enough to prevent rapid AI-driven economic growth.

1aaronb506moGood idea, think I will.
rohinmshah's Shortform

My house implemented such a tax.

Re 1, we ran into some of the issues Matthew brought up, but all other COVID policies are implicitly valuing risk at some dollar amount (possibly inconsistently), so the Pigouvian tax seemed like the best option available.

2rohinmshah7moNice! And yeah, that matches my experience as well.

I'd be interested to see the rest of this list, if you're willing to share.

2gianlucatruda7moI'll DM you :)
Rogue AGI Embodies Valuable Intellectual Property

Yeah, I'm really not sure how the monopoly -> non-monopoly dynamics play out in practice. In theory, perfect competition should drive the cost to the cost of marginal production, which is very low for software. I briefly tried getting empirical data for this, but couldn't find it, plausibly since I didn't really know the right search terms.

An Intuitive Guide to Garrabrant Induction

Both of those sections draw from section 7.2 of the original paper.

1PaulK7moOh, nevermind then
An Intuitive Guide to Garrabrant Induction

Yes, and there will always exist such a trader.

How refined is your art of note-taking?

It’s based on bullet points, which I find helpful. It also lets me reference other notes I’ve taken.

I like the idea of question notes. Thanks for the tip!

How refined is your art of note-taking?

The particular technology stack I use for notes on reading is {Instapaper, PDF Expert on iPad} -> Readwise -> Roam Research -> Summarize it.

To answer your specific questions:

  1. If I plan on summarizing, I tend to only highlight important bits. I write down any connections I make with other concepts. Readwise reminds me of 15 highlights I've taken in the past per day, which I've been doing for about half a year. I'm not sure if it's helpful, but the time cost is low, so I continue.

  2. Sometimes if I want to know what I thought about specific posts.

... (read more)
4AllAmericanBreakfast8moThis connects for me. One type of note I frequently take is "question notes," where for each paragraph of the text I start by writing a question for which that paragraph could serve as an answer. Sometimes, I do this before I've even read the paragraph in detail. Having that question in mind in advance really helps me feel like I comprehend the main points. That way, information isn't just a stream of data, but has a purpose. Is Roam Research/RemNote just a piece of editing software? Or does it in some way force a unified format or structure to your notes?

Can you be more specific?

AMA: Paul Christiano, alignment researcher

How would you teach someone how to get better at the engine game?

2paulfchristiano9moNo idea other than playing a bunch of games (might as well current version, old dailies probably best) and maybe looking at solutions when you get stuck. Might also just run through a bunch of games and highlight the main important interactions and themes for each of them, e.g. Innovation + Public Works + Reverberate [] or Hatchery + Till []. I think on any given board (and for the game in general) it's best to work backwards from win conditions, then midgames, and then openings.
4Neel Nanda9moWhat's the engine game?
AMA: Paul Christiano, alignment researcher

You've written multiple outer alignment failure stories. However, you've also commented that these aren't your best predictions. If you condition on humanity going extinct because of AI, why did it happen?

I think my best guess is kind of like this story, but:

  1. People aren't even really deploying best practices.
  2. ML systems generalize kind of pathologically over long time horizons, and so e.g. long-term predictions don't correctly reflect the probability of systemic collapse.
  3. As a result, there's no complicated "take over the sensors" moment; everything is just going totally off the rails, everyone is yelling about it, but it keeps gradually drifting further off the rails.
  4. Maybe the biggest distinction is that e.g. "watchdogs" can actually give pretty good argume
... (read more)
Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

I'm curious what "put it in my SuperMemo" means. Quick googling only yielded SuperMemo as a language learning tool.

2TurnTrout9moIt's a spaced repetition system that focuses on incremental reading. It's like Anki, but instead of hosting flashcards separately from your reading, you extract text while reading documents and PDFs. You later refine extracts into ever-smaller chunks of knowledge, at which point you create the "flashcard" (usually 'clozes', demonstrated below). Here's a Wikipedia article I pasted into SuperMemo. Blue bits are the extracts, which it'll remind me to refine into flashcards later.A cloze deletion flashcard. It's easy to make a lot of these. I like them.Incremental reading is nice because you can come back to information over time as you learn more, instead of having to understand enough to make an Anki card right away. In the context of this post, I'm reading some of the papers, making extracts, making flashcards from the extracts, and retaining at least one or two key points from each paper. Way better than retaining 1-2 points from all 70 summaries!
Transparency Trichotomy

I agree it's sort of the same problem under the hood, but I think knowing how you're going to go from "understanding understanding" to producing an understandable model controls what type of understanding you're looking for.

I also agree that this post makes ~0 progress on solving the "hard problem" of transparency, I just think it provides a potentially useful framing and creates a reference for me/others to link to in the future.

Strong Evidence is Common

Yeah, I agree 95% is a bit high.

Open Problems with Myopia

One way of looking at DDT is "keeping it dumb in various ways." I think another way of thinking about it is just designing a different sort of agent, which is "dumb" according to us but not really dumb in an intrinsic sense. You can imagine this DDT agent looking at agents that do engage in acausal trade and thinking they're just sacrificing utility for no reason.

There is some slight awkwardness in that the decision problems agents in this universe actually encounter mean that UDT agents will get higher utility than DDT agents.

I agree that the maximum a posteriori world doesn't help that much, but I think there is some sense in which "having uncertainty" might be undesirable.

4Daniel Kokotajlo6moAlso: I think making sure our agents are DDT is probably going to be approximately as difficult as making them aligned. Related: Your handle for anthropic uncertainty is: "Always think they know who they are" doesn't cut it; you can think you know you're in a simulation. I think a more accurate version would be something like "Always think that you are on an original planet, i.e. one in which life appeared 'naturally,' rather than a planet in the midst of some larger interstellar civilization, or a simulation of a planet, or whatever. Basically, you need to believe that you were created by humans but that no intelligence played a role in the creation and/or arrangement of the humans who created you. Or... no role other than the "normal" one in which parents create offspring, governments create institutions, etc. I think this is a fairly specific belief, and I don't think we have the ability to shape our AIs beliefs with that much precision, at least not yet.
Open Problems with Myopia

has been changed to imitation, as suggested by Evan.

Open Problems with Myopia

Yeah, you're right that it's obviously unsafe. The words "in theory" were meant to gesture at that, but it could be much better worded. Changed to "A prototypical example is a time-limited myopic approval-maximizing agent. In theory, such an agent has some desirable safety properties because a human would only approve safe actions (although we still would consider it unsafe)."

Open Problems with Myopia

Yep - I switched the setup at some point and forgot to switch this sentence. Thanks.

Coincidences are Improbable

I am using the word "causal" to mean d-connected, which means not d-separated. I prefer the term "directly causal" to mean A->B or B->A.

In the case of non-effects, the improbable events are "taking Benadryl" and "not reacting after consuming an allergen".

2philh1ySeems worth mentioning that the four ways you list for events to be causally linked are the building blocks of d-separation, not the whole thing. E.g. "A causes X, X causes B" is a causal link, but not direct. And "A causes X, B causes X, X causes Y, and we've observed Y" is one as well. Or even: "A causes X, Y causes X, Y causes B, X causes Z, and we've observed Z". (That's the link between s and y in example 3 from your link.)
1Daniel V1yOh yeah, definitely agree!
DanielFilan's Shortform Feed

I agree market returns are equal in expectation, but you're exposing yourself to more risk for the same expected returns in the "I pick stocks" world, so risk-adjusted returns will be lower.
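A minimal sketch of the point with made-up numbers: equal expected returns but doubled volatility means the concentrated portfolio has half the Sharpe ratio (all figures are illustrative, not market estimates).

```python
# Same expected return, more undiversified risk => lower Sharpe ratio.
# All numbers are illustrative assumptions, not market estimates.
expected_return = 0.07
risk_free = 0.02
vol_index = 0.15        # diversified market portfolio
vol_stock_picks = 0.30  # concentrated picks: idiosyncratic risk not diversified away

def sharpe(mu, rf, sigma):
    """Sharpe ratio: excess return per unit of volatility."""
    return (mu - rf) / sigma

print(sharpe(expected_return, risk_free, vol_index))        # ~0.333
print(sharpe(expected_return, risk_free, vol_stock_picks))  # ~0.167
```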

Ways to be more agenty?

I sometimes roleplay as someone role playing as myself, then take the action that I would obviously want to take, e.g. "wow sleeping regularly gives my character +1 INT!" and "using anki every day makes me level up 1% faster!"

1NicholasKross1yI've wondered about the roleplaying thing myself. Your example reminds me of Ulillillia [] and his "mind game" system.
Collider bias as a cognitive blindspot?

If X->Z<-Y, then X and Y are independent unless you're conditioning on Z. A relevant TAP might thus be:

  • Trigger: I notice that X and Y seem statistically dependent
  • Action: Ask yourself "what am I conditioning on?". Follow up with "Are any of these factors causally downstream of both X and Y?" Alternatively, you could list salient things causally downstream of either X or Y and check the others.

This TAP is unfortunately abstract because "things I'm currently conditioning on" isn't an easy thing to list, but it might help.
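The trigger can also be checked numerically. Here is a minimal simulation (pure stdlib; variable names are mine) in which X and Y are independent, but selecting on the collider Z = X + Y manufactures a strong negative dependence:

```python
import random

# X and Y are independent; Z = X + Y is a common effect (collider: X -> Z <- Y).
random.seed(0)
samples = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]

def corr(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(pairs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

unconditional = corr(samples)                                  # ~0: X and Y independent
conditional = corr([(x, y) for x, y in samples if x + y > 1])  # strongly negative
print(unconditional, conditional)
```

Conditioning on Z > 1 (e.g. only looking at "admitted" or "selected" cases) is exactly the kind of implicit conditioning the action step asks you to notice.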

Great minds might not think alike

Here are some possibilities:

  • great minds might not think alike
  • untranslated thinking sounds untrustworthy
  • disagreement as a lack of translation

Thanks! I've changed the title to "Great minds might not think alike".

Interestingly, when I asked my Twitter followers, they liked "Alike minds think great". I think LessWrong might be a different population. So I decided to change the title on LessWrong, but not on my blog.

3Neel Nanda1yI love " great minds might not think alike"