Summary threads of two recent papers which seem like significant evidence in favor of the Simulators view of LLMs (especially after just pretraining): https://x.com/aryaman2020/status/1852027909709382065 https://x.com/DimitrisPapail/status/1844463075442950229
These numbers are dictated by the regulator. What mechanism is there to make them have any relation to the real world?
Does this plan necessarily factor through using the intent-aligned AGI to quickly commit some sort of pivotal act that flips the gameboard and prevents other intent-aligned AGIs from being used malevolently by self-interested or destructive (human) actors to gain a decisive strategic advantage? After all, it sure seems less than ideal to find yourself in a position where you can solve the theoretical parts of value alignment,[1] but you cannot implement that in practice because control over the entire future light cone has already been permanently taken over by an AGI intent-aligned to someone who does not care about any of your broadly prosocial goals...
Insofar as something like this even makes sense, that is, which I have already expressed my skepticism of many times, though I don't think I particularly want to rehash that discussion with you right now...
This is IMO the type of experiment that I like, even though it is evidence against my thesis on AI alignment generalization, and if done more often it could be a much more productive method of settling the disputes that currently exist in the AI safety field.
I will also say that at the moment, it's looking like a lot of the proposed alignment methods are unlikely to be adversarially robust, which, while mostly fine for an alignment proposal in a vacuum, is going to require rather extreme infosecurity to make the plan work out in practice without unleashing extinction or worse.
I don't go as far as Zach Stein-Perlman here, but I agree with the broad take that most plans for controlling/aligning AIs rely fairly critically on the model not being stolen by other parties:
https://www.lesswrong.com/posts/eq2aJt8ZqMaGhBu3r/zach-stein-perlman-s-shortform#ckNQKZf8RxeuZRrGH
Yeah "stop reading here if you don't want to be spoiled." suggests the entire post is going to be spoilery, it isn't, or shouldn't be. Also opening with an unnecessary literary reference instead of a summary or description is an affectation symptomatic of indulgent writer-reader cultures where time is not valued.
[undeletable spoiler tag below please ignore]
I don't remember them having the actual stats, though I'm not going to watch it again to check. I wonder if they published those elsewhere.
Do you know why the error bars in the replication are smaller than in the original? (just more people?) And with what confidence is the null hypothesis (difference = 0) rejected in both cases?
Interesting! I think that works.
You can still use the same positively-oriented brainstorming process for figuring out how to avoid bad outcomes. As soon as there's even a vague idea of avoiding a very bad outcome, that becomes a very good reward prediction after taking the differential. The dopamine system does calculate such differentials, and it seems like the valence system, while probably different from direct reward prediction and more conceptual, should and could also take differentials in useful ways. Valence needs to be at least somewhat dependent on context. I don't think this requires unique mechanisms (although it might have them); it's sufficient to learn variants of concepts like "avoiding a really bad event" and then attach valence to that concept variant.
I'm curious why you were downvoted, because you hit the nail on the head. For a short and concise answer, yours is the best.
Does anyone know? Otherwise I will just assume that they're rationalists who dislike (and look down on) traditional/old things for moral reasons. This is not very flattering of me but I can't think of better explanations.
Ancient wisdom is not scientific, and it might even be false, but the benefits are very real, and these benefits sort of work to make the wisdom true.
The best example I can give is the placebo effect: the belief that something is true helps make it true, so even if it's not true, you get the benefits of it being true. The special trait ancient wisdom has is this: the outcome is influenced by your belief in the outcome. This tends to be true for psychological things, and advice like "Belief can move mountains" is entirely true in the psychological realm. But scientific people, who deal with reality, tend to reject all of this and consider it nonsense, as the problems they're used to aren't influenced by belief.
Another case in which belief matters is treating things with weight/respect/sacredness/divinity. These things are just human constructs, but they have very real benefits. Of course, you can be an obnoxious atheist and break these illusions all you want, but the consequence of doing this will be nihilism. Why? Because treating things as if they have weight is what gives them weight, and nihilism is basically the lack of perceived weight. There's nothing objectively valid about filial piety, but it does have benefits, and acting as if it's something special makes it so.
Ancient wisdom often gets the conclusions right but gets the explanations wrong, and this is likely in order to make people take the conclusions seriously. Meditation has been shown to be good for you. Are you feeling "Ki", or does your body just feel warm when you concentrate on it? Do you become "one with everything", or does your perception just discard duality for a moment? Do you "meet god", or do you merely experience peace of mind as you let go of resistance? The true answer is the boring one, but the fantastical explanation helps make these ideas more contagious, and it's likely that the false explanations have stuck around because they're stronger memetically.
Ancient wisdom has one advantage that modern science does not: It can deal with things which are beyond our understanding. The opposite is dangerous: If you reject something just because you don't understand why it might be good (or because the people who like it aren't intellectual enough to defend it), then you're being rational in the map rather than in the territory. Maybe the thing you're dismissing is actually good for reasons that we won't understand for another 20 years.
You can compare this with money: money is "real but not real" in a similar way. And this all generalizes far beyond my examples, but the main benefits are found, like I said, in everything human (psychological and spiritual) and in areas in which the consensus has an incomplete map. I believe that nature has its own intelligence in a way, and that we tend to underestimate it.
Edit: Downvotes came fast. Surely I wrote enough that I've made it very easy to attack my position? This topic is interesting and holds a lot of utility, so feel free to reply.
In this case, everybody seems pretty sure that the price is where it is because of the actions of a single person who's dumped in a very large amount of money relative to the float.
I think it's clear that he's the reason the price blew out so dramatically. But it's not clear why the market didn't 'correct' all the way back (or at least much closer) to 50/50. Thirty million dollars is a lot of money, but there are plenty of smart rich people who don't mind taking risks. So, once the identity and (apparent) motives of the Trump whale were revealed, why didn't a handful of them mop up the free EV?
That's not a rhetorical question; I'm interested in your answer and might be convinced by it. But right now I don't see sufficient reason to be confident that the market is still badly distorted, rather than having legitimately settled on ~60/40.
You've gotten a fair number of disagree-votes thus far, but I think it's generally correct to say that many (arguably most) prediction markets still currently lack the trading volume necessary to justify confidence that EMH-style arguments mean inefficiencies will be rapidly corrected. To a large extent, it's fair to say this is due to over-regulation and attempts at outright banning (perhaps the relatively recent 5th Circuit ruling in favor of PredictIt against the Commodity Futures Trading Commission is worth looking at as a microcosm of how these legal battles are playing out in today's day and age).
Nevertheless, the standard theoretical argument that inefficiencies in prediction markets are exploitable and thus lead to a self-correcting mechanism still seems entirely correct, as Garrett Baker points out.
This started happening in Hawaii, and to a lesser extent in Arizona. The resolution, apart from reducing net metering subsidies, has been to increase the fixed component of the bill (which pays for the grid connection) and reduce the variable component. My impression is this has been a reasonably effective solution, assuming people don't want to cut their connection entirely.
I don't actually think your post was hostile, but I think I get where deepthoughtlife is coming from. At the least, I can share how I felt reading this post and point out why, since you seem keen on avoiding the negative side. Btw, I don't think you can avoid causing any frustration in readers, since they are too diverse, so don't worry too much about it either.
The title of the piece is strongly worded and there's no epistemic status disclaimer to state this is exploratory, so I actually came in expecting much stronger arguments. Your post is good as an exposition of your thoughts and a conversation starter, but it's not a good counterargument to NAH imo, so it shouldn't be worded as such. Like deepthoughtlife, I feel your post is confused re NAH, which is totally fine when stated as such, but a bit grating when I came in expecting more rigor or knowledge of NAH.
Here's a reaction to the first part:
- in "Systems must have similar observational apparatus" you argue that different apparatus lead to different abstractions and claim a blind deaf person is such an example, yet in practice blind deaf people can manipulate all the abstractions others can (with perhaps a different inner representation), that's what general intelligence is about. You can check out this wiki page and video for some of how it's done https://en.wikipedia.org/wiki/Tadoma . The point is that all the abstractions can be understood and must be understood by a general intelligence trying to act effectively, and in practice Helen Keler could learn to speak by using other senses than hearing, in the same way we learn all of physics despite limited native instruments.
I think I had similar reactions to other parts, feeling they were missing the point about NAH and some background assumptions.
Thanks for posting!
Nice. I also have an offer - begin with yourself.
How so?
Good point.
What I meant by "updatelessness removes most of the justification" is the reason given here at the very beginning of "Against Resolute Choice". In order to make a money pump that leads the agent in a circle, the agent has to continue accepting trades around a full preference loop. But if it has decided on the entire plan beforehand, it will just pick some plan that involves less than one full trip around the preference loop. (Although it's unclear how it would settle on such a plan; maybe just stopping its search after a given time.) It won't (I think?) choose any plan that does multiple loops, because they are strictly worse.
After choosing this plan though, I think it is representable as VNM rational, as you say. And I'm not sure what to do with this. It does seem important.
However, I think Scott's argument here satisfies (a) (b) and (c). I think the independence axiom might be special in this respect, because the money pump for independence is exploiting an update on new information.
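As a toy illustration of the money-pump structure being discussed (my own sketch, not something from the linked post, with a made-up fee), the "full loop is strictly dominated, partial trip is fine" point looks roughly like this:

```python
# Toy sketch: cyclic preference A > B > C > A, and an exploiter who charges a
# small fee for each "upgrade" trade. Judged one trade at a time, every offer
# looks like an improvement; judged as whole plans, any plan containing a full
# trip around the loop is strictly worse than the same plan with that loop
# removed, since it ends holding the same item with three fewer units of money.
FEE = 1.0
PREFERRED_TO = {"B": "A", "C": "B", "A": "C"}  # item -> the item the agent prefers to it

def myopic_agent(start_item, num_offers):
    """Accepts every trade, since each one is a local 'improvement'."""
    item, money = start_item, 0.0
    for _ in range(num_offers):
        item = PREFERRED_TO[item]
        money -= FEE
    return item, money

print(myopic_agent("C", 6))   # ('C', -6.0): two full loops, back where it started
print(myopic_agent("C", 1))   # ('B', -1.0): a partial trip, which a resolute
                              # planner could pick; it never picks the 6-trade plan
```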
There is not a difference between the two situations in the way you're claiming, and indeed the differentiation point of view is used fruitfully both on factory floors and in more complex convex optimization problems. For example, see the connection between dual variables and how they indicate whether constraints are slack or taut in convex optimization, and how this can be interpreted as a relative tradeoff price between each of the constrained resources.
In your factory floor example, the constraints would be the throughput of each machine, and (assuming you're trying to maximize the throughput of the entire process) the dual variables would be zero everywhere except at the bottleneck machine, where the dual is the negative derivative of the throughput of the entire process with respect to the throughput of the constraining machine. We could then confirm that the tight constraint is indeed that machine's throughput by noting that this derivative is significantly larger in magnitude than all the others.
Practical problems also often have a similarly sparse structure in their constraining inputs, but just because the duals aren't exactly zero for every constraint except one doesn't mean the nonzero ones are secretly not actually constraining, or that it's unprincipled to use the same math and intuitions to reason about both situations.
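As a concrete sketch of the dual-variable point (my own toy example, with made-up machine capacities, assuming scipy is available), a serial production line can be written as a tiny LP and the shadow prices read off directly:

```python
import numpy as np
from scipy.optimize import linprog

# Toy serial line: the whole process's throughput x is capped by each machine's
# capacity (made-up numbers). Maximize x by minimizing -x.
capacities = np.array([120.0, 80.0, 150.0])   # machines A, B, C (units/hour)
c = np.array([-1.0])                          # objective: minimize -x
A_ub = np.ones((len(capacities), 1))          # constraints: x <= capacity_i
res = linprog(c, A_ub=A_ub, b_ub=capacities, bounds=[(0, None)], method="highs")

print("throughput:", res.x[0])                # 80.0, set by machine B
print("duals:", res.ineqlin.marginals)        # ~[0, -1, 0]: only B's constraint binds
# The dual on B says relaxing B's capacity by one unit changes the objective
# (-throughput) by about -1, i.e. buys one more unit of throughput; relaxing
# A or C buys nothing. That's the "zero everywhere except the bottleneck" picture.
```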
I watched the video and didn't see any stats from their own experiment. Do you have a frame or a section?
They replicated it within the video itself?
Can't this only be judged in retrospect, and over a decent sample size?
The model that makes you hope for accuracy from the market is that it aggregates the information, including non-public information, available to a large number of people who are doing their best to maximize profits in a reasonable VNM-ish rational way.
In this case, everybody seems pretty sure that the price is where it is because of the actions of a single person who's dumped in a very large amount of money relative to the float. It seems likely that that person has done this despite having no access to any important non-public information about the actual election. For one thing, they've said that they're dumping all of their liquidity into bets on Trump. Not just all the money they already have allocated to semi-recreational betting, or even all the money they have allocated to speculative long shots in general, but their entire personal liquidity. That suggests a degree of certainty that almost no plausible non-public information could actually justify.
Not only that, but apparently they've done it in a way calculated to maximally move the price, which is the opposite of what you'd expect a profit maximizer to want to do given their ongoing buying and their (I think) stated and (definitely at this point) evidenced intention to hold until the market resolves.
If the model that makes you expect accuracy to begin with is known to be violated, it seems reasonable to assume that the market is out of whack.
Sure, it's possible that the market just happens to be giving an accurate probability for some reason unrelated to how it's "supposed" to work, but that sort of speculation would take a lot of evidence to establish confidently.
I'm assuming that by "every other prediction source" you mean everything other than prediction/betting markets
Well, yes. I would expect that if you successfully mess up Polymarket, you have actually messed up "The Betting Market" as a whole. If there's a large spread between any two specific operators, that really is free money for somebody, especially if that person is already set up to deal on both.
It seems weird to me that Jules Verne gets only one mention in the answers.
First and foremost, "The Mysterious Island". (But maybe he has already read it by nine?)
How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?
[I learned the term teleosemantics from you! :) ]
The original LI paper was in that category, IIUC. The updates (to which traders had more vs less money) are derived from mathematical propositions being true vs false.
LI defines a notion of logically uncertain variable, which can be used to represent desires
I would say that they don’t really represent desires. They represent expectations about what’s going to happen, possibly including expectations about an AI’s own actions.
And then you can put the LI into a larger system that follows the rule: whatever the expectations are about the AI’s own actions, make that actually happen.
The important thing that changes in this situation is that the convergence of the algorithm is underdetermined—you can have multiple fixed points. I can expect to stand up, and then I stand up, and my expectation was validated. No update. I can expect to stay seated, and then I stay seated, and my expectation was validated. No update.
(I don’t think I’m saying anything you don’t already know well.)
Anyway, if you do that, then I guess you could say that the LI’s expectations “can be used” to represent desires … but I maintain that that’s a somewhat confused and unproductive way to think about what’s going on. If I intervene to change the LI variable, it would be analogous to changing habits (what do I expect myself to do ≈ which action plans seem most salient and natural), not analogous to changing desires.
(I think the human brain has a system vaguely like LI, and that it resolves the underdetermination by a separate valence system, which evaluates expectations as being good vs bad, and applies reinforcement learning to systematically seek out the good ones.)
beliefs can have impacts on the world if the world looks at them
…Indeed, what I said above is just a special case. Here’s something more general and elegant. You have the core LI system, and then some watcher system W, which reads off some vector of internal variables V of the core LI system, and then W takes actions according to some function A(V).
After a while, the LI system will automatically catch onto what W is doing, and “learn” to interpret V as an expectation that A(V) is going to happen.
I think the central case is that W is part of the larger AI system, as above, leading to normal agent-like behavior (assuming some sensible system for resolving the underdetermination). But in theory W could also be humans peeking into the LI system and taking actions based on what they see. Fundamentally, these aren’t that different.
So whatever solution we come up with to resolve the underdetermination, whether human-brain-like “valence” or something else, that solution ought to work for the humans-peeking-into-the-LI situation just as it works for the normal W-is-part-of-the-larger-AI situation.
(But maybe weird things would happen before convergence. And also, if you don’t have any system at all to resolve the underdetermination, then probably the results would be weird and hard to reason about.)
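As a cartoon of that underdetermination (my own toy sketch, not the actual LI machinery): let a single "expectation" p be read off by a watcher W that enacts whatever is expected, and let p then update self-supervisedly on whether the prediction came true. Which fixed point you land on depends entirely on where you start; nothing in the "true vs false" signal prefers one over the other.

```python
# Cartoon of the underdetermination (not the actual LI machinery): a watcher W
# enacts whatever the system expects of itself, and the expectation p is then
# updated purely on "was the prediction true or false".
def run(p, steps=6, lr=0.5):
    trajectory = [round(p, 3)]
    for _ in range(steps):
        action = 1.0 if p >= 0.5 else 0.0   # W: make the expectation happen
        p += lr * (action - p)              # self-supervised update toward what occurred
        trajectory.append(round(p, 3))
    return trajectory

print(run(0.9))  # drifts to 1: "I expect to stand up" keeps getting validated
print(run(0.1))  # drifts to 0: "I expect to stay seated" is equally validated
```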
Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).
I’m not sure that this is coming from a coherent threat model (or else I don’t follow).
Hi Seth,
I share your concern that AGI comes with the potential for a unilateral first-strike capability that, at present, no nuclear power has (which is vital to the maintenance of MAD), though I think, in game-theoretic terms, this becomes more difficult the more self-interested (in survival) players there are. As in open-source software, there is a level of protection against malicious code because bad players are outnumbered: even if they try to hide their code, there are many others who can find it. But I appreciate that 100s of coders finding malicious code within a single repository is much easier than finding something hidden in the real world, and I have to admit I'm not even sure how robust the open-source model is (I only know how it works in theory). I'm more pointing to the principle, not as an excuse for complacency but as a safety model on which to capitalise.
My point about the UN's law against aggression wasn't that in and of itself it is a deterrent, only that it gives a permission structure for any party to legitimately retaliate.
I also agree that RSI-capable AGI introduces a level of independence that we haven't seen before in a threat. And I do understand that inter-dependence is a key driver of cooperation. Another driver is confidence, and my hope is that the more intelligent a system gets, the more confident it is, and the better it is able to balance the autonomy of others with its goals, meaning it is able to "confide" in others—in the same way that the strongest kid in class was very rarely the bully, because they had nothing to prove. Collateral damage is still damage after all; a truly confident power doesn't need these sorts of inefficiencies. I stress this is a hope, and not a cause for complacency. I recognise that, in analogy, the strongest kid, the true class alpha, gets whatever they want with the willing complicity of the classroom. RSI-capable AGI might get what it wants coercively in a way that makes us happy with our own subjugation, which is still a species of dystopia.
But if you've got a super-intelligent inventor on your side and a few resources, you can be pretty sure you and some immediate loved ones can survive and live in material comfort, while rebuilding a new society according to your preferences.
This sort of illustrates the contradiction here: if you're pretty intelligent (as in, you're designing a super-intelligent AGI), you're probably smart enough to know that the scenario outlined here has a near-100% chance of failure for you and your family. If you've created something more intelligent than you that is willing to hide its intentions and destroy billions of people, it doesn't take much to realise that that intelligence isn't going to think twice about also destroying you.
Now, I realise this sounds a lot like the situation humanity is in as a whole... so I agree with you that...
multipolar human-controlled AGI scenario will necessitate ubiquitous surveillance.
I'm just suggesting that the other AGI teams do (or can, leveraging the right incentives) provide a significant contribution to this surveillance.
(Most people in AI Alignment work at scaling labs and are therefore almost exclusively working on LLM alignment. That said, I don't actually know what it means to work on LLM alignment over aligning other systems; it's not like we have a ton of traction on LLM alignment, and most techniques and insights seem general enough to not be conditional specifically on LLMs.)
A few glaring issues here:
1) Does the question imply causation or not? It shouldn't.
2) Are these stats intended to be realistic, such that I need to consider potential flaws and take a holistic view, or just a toy scenario to test my numerical skills? If I believe it's the former and I'm confident X and Y are positively correlated, a 2x2 grid showing X and Y negatively correlated should of course make me question the quality of your data proportionally (see the worked sketch below).
3) Is this an adversarial question such that my response may be taken out of context or otherwise misused?
The sample interviews from Veritasium did not seem to address any of these issues:
(1) They seemed to cut out the gun question, but the skin cream question implied causation, "Did the skin cream make the rash better or worse?"
(2) One person mentioned "I Wouldn't have expected that..." which implies he thought it was real data,
(3) the last person clearly interpreted it adversarially.
In the original study, the question was stated as "cities that enacted a ban on carrying concealed handguns were more likely to have a decrease in crime." This framing is not as bad, but still too close to implying causation in my opinion.
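For concreteness, here is the kind of comparison the 2x2 format is testing, with made-up counts (not the numbers from the actual study): the largest raw count can sit in the cell that points the "wrong" way, so you have to compare rates, and even then the result is only a correlation in possibly non-randomized data.

```python
# Made-up counts (not the study's numbers) illustrating the 2x2 comparison.
improved = {"used_cream": 200, "no_cream": 80}
worsened = {"used_cream": 100, "no_cream": 20}

for group in ("used_cream", "no_cream"):
    rate = improved[group] / (improved[group] + worsened[group])
    print(f"{group}: {rate:.0%} improved")
# used_cream: 67% improved, no_cream: 80% improved.
# The "used cream and improved" cell has the biggest raw count, yet the
# improvement *rate* is lower with the cream -- and even that is a correlation
# in (possibly non-randomized) data, not a causal claim.
```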
Too much runs into the very real issue that truth is stranger. 😉
It's nice to read some realistic science fiction.
Also this very recent one: https://www.lesswrong.com/posts/6h9p6NZ5RRFvAqWq5/the-summoned-heroine-s-prediction-markets-keep-providing
Do the stories get old? If it's trying to be about near-future AI, maybe the state-of-the-art will just obsolete it. But that won't make it bad necessarily, and there are many other settings than 2026. If it's about radical futures with Dyson spheres or whatever, that seems like at least a 2030s thing, and you can easily write a novel before then.
Also, I think it is actually possible to write pretty fast. 2k words/day is doable, which gets you a good-length novel in 50 days; even x3 for ideation beforehand and revising after the first draft only gets you to 150 days. You'd have to be good at fiction beforehand, and have existing concepts to draw on in your head, though.
All dead-on up until this:
... the universe will force them to use the natural abstractions (or else fail to achieve their goals). [...] Would the argument be that unnatural abstractions are just in practice not useful, or is it that the universe is such that its ~impossible to model the world using unnatural abstractions?
It's not quite that it's impossible to model the world without the use of natural abstractions. Rather, it's far instrumentally "cheaper" to use the natural abstractions (in some sense). Rather than routing through natural abstractions, a system with a highly capable world model could instead e.g. use exponentially large amounts of compute (e.g. doing full quantum-level simulation), or might need enormous amounts of data (e.g. exponentially many training cycles), or both. So we expect to see basically-all highly capable systems use natural abstractions in practice.
The problem with this model is that the "bad" models/theories in replication-crisis-prone fields don't look like random samples from a wide posterior. They have systematic, noticeable, and wrong (therefore not just coming from the data) patterns to them - especially patterns which make them more memetically fit, like e.g. fitting a popular political narrative. A model which just says that such fields are sampling from a noisy posterior fails to account for the predictable "direction" of the error which we see in practice.
links 11/05/2024: https://roamresearch.com/#/app/srcpublic/page/11-05-2024
Putting this short rant here for no particularly good reason, but I dislike it when people claim there are constraints here or there when, I'd guess, their intended meaning is only that "the derivative with respect to that input is higher than for the other inputs".
On factory floors there exist hard constraints: the throughput is limited by the slowest machine (when everything has to go through it). The AI Safety world is obviously not like that. Increase funding and more work gets done; increase talent and more work gets done. None are hard constraints.
If I'm right that people are really only claiming the weak version, then I'd like to see somewhat more backing for their claims, especially if you say "definitely". Since none are constraints, the derivatives could plausibly be really close to one another. In fact, they kind of have to be, because there are smart optimizers who are deciding where to spend their funding and trying to actively manage the proportion of money sent to field building (getting more talent) vs direct work.
Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too?
No. LI defines a notion of logically uncertain variable, which can be used to represent desires. There are also other ways one could build agents out of LI, such as doing the active inference thing.
As I mentioned in the post, I'm agnostic about such things here. We could be building """purely epistemic""" AI out of LI, or we could be deliberately building agents. It doesn't matter very much, in part because we don't have a good notion of purely epistemic.
They’re measuring a noisy phenomenon, yes, but that’s only half the problem. The other half of the problem is that society demands answers. New psychology results are a matter of considerable public interest and you can become rich and famous from them. In the gap between the difficulty of supply and the massive demand grows a culture of fakery. The same is true of nutrition— everyone wants to know what the healthy thing to eat is, and the fact that our current methods are incapable of discerning this is no obstacle to people who claim to know.
For a counterexample, look at the field of planetary science. Scanty evidence dribbles in from occasional spacecraft missions and telescopic observations, but the field is intellectually sound because public attention doesn’t rest on the outcome.
Note: I added some spoiler warnings (given the one comment complaining). I don't feel strongly, so feel free to revert
Interesting thoughts, ty.
A difficulty for common understanding I see here is that you're talking of "good" or "bad" paragraphs in the absolute, but didn't particularly define a "good" or "bad" paragraph by some objective standard, so you're relying on your own understanding of what's good or bad. If you were defining good or bad relatively, you'd look at 100 paragraphs and post the worst 10 as bad. I'd be interested in seeing what the worst paragraphs you found were, some 50th-percentile ones, and the best; then I'd tell you if I have the same absolute standards as you do.
Enjoyed this post.
Fyi, from the front page I just hovered over this post, "The shallow bench", and was immediately spoiled on Project Hail Mary (which I had started listening to, but didn't get far into). Maybe add a spoiler tag or warning directly after the title?
Thanks for the comment!
We have indeed gotten the feedback from multiple people that this part didn't feel detailed enough (although we got this much more from very technical readers than from non-technical ones), and we are working on improving the arguments.
Thanks for the comment!
We'll correct the typo in the next patch/bug fix.
As for the more direct adversarial tone of the prologue, it is an explicit choice (and is contrasted by the rest of the document). For the moment, we're waiting to get more feedback on the doc to see if it really turns people off or not.
I guess the big problem for someone who tries to do it at more than short-story length is that while you write the story, it is already getting old. There are writers who can write a novel in a season, but not many, at least if we're talking about good writers. Hm-m-m, have rationalists tried to hire Stephen King? :)
I don't fully understand the post. Without a clear definition of "winning," the points you're trying to make — as well as the distinction between pragmatic and non-pragmatic principles (which also aligns with strategies and knowledge formation) — aren't totally clear. For instance, "winning," in some vague sense, probably also includes things like "fitting with evidence," taking advice from others, and so on. You don't necessarily need to turn to non-pragmatic principles or those that don’t derive from the principle of winning. "Winning" is a pretty loose term.
Here is a category of book that I really loved at that age: non-embarrassing novels about how adults do stuff. Since, for me, that age was in 1973, the particular books I name might be obsolete. There’s a series of novels by Arthur Hailey, with titles like “Hotel” and “Airport”, that are set inside the titular institutions and follow people as they deal with problems and interact with each other. And there is no, or at least minimal, sex, so they’re not icky to a kid. They’re not idealized; there is a reasonable degree of fallibility, venality, and scheming, but that is also fascinating. And all the motivations, and the way the systems work, are clearly explained, so it can be understood by an unsophisticated reader.
These books were bestsellers back in the day, so you might be able to find a copy in the library. See if he likes it!
Another novel in this vein is “The view from the fortieth floor”, which is about a badly managed magazine going bankrupt. Doesn’t sound amazing, I know, but if you’re a kid, who’s never seen bad managers blunder into ineluctable financial doom, it’s really neat.
My wife is a middle school librarian. I’ll ask her when I see her for more books like this.
current inference scaling methods tend to be tied to CoT and the like, which are quite transparent
Aschenbrenner in Situational Awareness predicts illegible chains of thought are going to prevail because they are more efficient. I know of one developer claiming to do this (https://platonicresearch.com/) but I guess there must be many.
Can't you theoretically use both CellPainting assays and light-sheet microscopy?
I mean, I did look at CellPainting assays a short while ago and I was still struck by how little control one had over the process, and how it isn't great for many kinds of mechanistic interpretability. I know there's a Brazil team looking at use of CellPainting for sphere-based silver-particle nanoplastics, but there are still many concrete variables, like intrinsic oxidative stress, that you can't necessarily get from CellPainting alone.
CellPainting can be used for toxicological predictions of organophosphate toxicity (predicting that they're more toxic than many other classes of compounds), but the toxicological assays used weren't able to capture much nuance, especially the kind that's relevant to the physiological concentrations that people are normally exposed to. I remember ketoconazole scored very highly on toxicity, but what does this say about physiological doses that are much smaller than the ones used for CellPainting?
Also, the cell lines were all cancer cell lines (OS osteosarcoma cancer cell lines), which gives little predictive power for neurotoxicity or a compound's ability to disrupt neuronal signalling.
Still, the CellPainting support ecosystem is extremely impressive, even though it doesn't produce Janelia-standard PB datasets of the kind used for light-sheet microscopy. [cf. https://www.cytodata.org/symposia/2024/ ]
https://markovbio.github.io/biomedical-progress/
FWIW, some of the most impressive near-term work might be whatever the https://www.abugootlab.org/ lab is going to do soon (large-scale perturb-seq combined with optical pooling to do readouts of genetic perturbations...)
Eh, feels wrong to me. Specifically, this argument feels over-complicated.
As best I can tell, the predominant mode of science in replication-crisis affected fields is that they do causal inference by sampling from noisy posteriors.
The predominant mode of science in non-replication-crisis affected fields is that they don't do this or do this less.
Most of the time it seems like science is conducted like that in those fields because it has to be. Can you come up with a better way of doing Psychology research? "Science in hard fields is hard" is definitely a less sexy hypothesis, but it seems obviously true?
Doesn’t matter, because HPMOR is engaging enough on a chapter-by-chapter basis. I read lots of books when I was a kid when I didn’t understand the overarching plot. As long as I had a reasonable expectation that cool stuff would happen in the next chapter, I’d keep reading. I read “Stand On Zanzibar” repeatedly as a child, and didn’t understand the plot until I reread it as an adult last year. Same with the detective novel “A Deadly Shade of Gold”. I read it for the fistfights, snappy dialogue, and insights into adult life. The plot was lost on me.
Thank you for the in-depth thoughts!
Thank you!
It was a joke :) I had been warned by my friends that the joke was either only mildly funny or just entirely confusing. But I personally found it hilarious so kept it in. Sorry for my idiosyncratic sense of humor ;)
I agree about the punchline. Chef's kiss post
Thanks!
But I've heard that many people do a lot of thinking about negative outcomes, too.…
FWIW my answer is “involuntary attention” as discussed in Valence §3.3.5 (it also came up in §6.5.2.1 of this series).
If I look at my shoe and (voluntarily) pay attention to it, my subsequent thoughts are constrained to be somehow “about” my shoe. This constraint isn’t fully constraining—I might be putting my shoe into different contexts, or thinking about my shoe while humming a song to myself, etc.
By analogy, if I’m anxious, then my subsequent thoughts are (involuntarily) constrained to be somehow “about” the interoceptive feeling of anxiety. Again, this constraint isn’t fully constraining—I might be putting the feeling of anxiety into the context of how everyone hates me, or into the context of how my health is going downhill, or whatever else, and I could be doing both those things while simultaneously zipping up my coat and humming a song, etc.
Anxiety is just one example; I think there’s likewise involuntary attention associated with feeling itchy, feeling in pain, angry, etc.
Good list!
I personally really like Scott Alexander's Presidential Platform, it hits the hilarious-but-also-almost-works spot so perfectly. He also has many Bay Area house party stories in addition to the one you link (you can find a bunch (all?) linked at the top of this post). He also has this one from a long time ago, which has one of the best punchlines I've read.
Thanks for advertising my work, but alas, I think that's much more depressing than this one.
Could make for a good Barbie <> Oppenheimer combo though?
Agreed! Transformative AI is hard to visualise, and concrete stories / scenarios feel very lacking (in both disasters and positive visions, but especially in positive visions).
I like when people try to do this - for example, Richard Ngo has a bunch here, and Daniel Kokotajlo has his near-prophetic scenario here. I've previously tried to do it here (going out with a whimper leading to Bostrom's "disneyland without children" is one of the most poetic disasters imaginable - great setting for a story), and have a bunch more ideas I hope to get to.
But overall: the LessWrong bubble has a high emphasis on radical AI futures, and an enormous amount of fiction in its canon (HPMOR, Unsong, Planecrash). I keep being surprised that so few people combine those things.
I'm curious about the part where you wrote: "You could raise awareness for Leukemia, Dyslexia, or Estonia."
Estonia is a country. Leukemia and Dyslexia are not countries. Was it a typo? Or did you actually want to raise awareness about Estonia?
(I'm from Estonia myself)
Nice article though, thanks!
I did not actually consider this, but that is a very reasonable interpretation!
(I vaguely remember reading some description of explicitly flat-out anthropic immortality saving the day, but I can't seem to find it again now)
No I don’t recommend reading this post anymore, it has some ideas with little kernels of truth but also lots of errors and confusions. ¯\_(ツ)_/¯
Thanks for taking the time to explain this. This clears a lot of things up.
Let me see if I understand. So one reason that an agent might develop an abstraction is that it has a utility function that deals with that abstraction (if my utility function is ‘maximize the number of trees’, it’s helpful to have an abstraction for ‘trees’). But the NAH goes further than this and says that, even if an agent had a very ‘unnatural’ utility function which didn’t deal with abstractions (e.g. it was something very fine-grained like ‘I value this atom being in this exact position and this atom being in a different position etc…’) it would still, for instrumental reasons, end up using the ‘natural’ set of abstractions, because the natural abstractions are in some sense the only ‘proper’ set of abstractions for interacting with the world. Similarly, while there might be perceptual systems/brains/etc which favour using certain unnatural abstractions, once agents become capable enough to start pursuing complex goals (or rather goals requiring a high level of generality), the universe will force them to use the natural abstractions (or else fail to achieve their goals). Does this sound right?
Presumably it’s possible to define some ‘unnatural’ abstractions. Would the argument be that unnatural abstractions are just in practice not useful, or is it that the universe is such that it’s ~impossible to model the world using unnatural abstractions?
Hi Steve, I didn't read this post yet and just wanted to ask whether it's still worth reading, or whether everything relevant is now covered better in "incentive learning and dead sea salt experiment"?
Could I get some constructive criticism about why I'm being downvoted? It would be helpful for the sake of avoiding the same mistakes in the future.
Correct. It lacks tactical practicality right now, but I think that from a macro-directional perspective, it's sensible to align all of my current actions to that end goal. And I believe there is a huge demand among business-minded intellectuals and ambitious people for a community like this to be created.
A lot of the current human race spends a lot of time worrying, which I think probably has the same brainstorming dynamic and shares mechanisms with the positively oriented brainstorming. I don't know how to explain this; I think the avoidance of bad outcomes being a good outcome could do this work, but that's not how worrying feels - it feels like my thoughts are drawn toward potential bad outcomes even when I have no idea how to avoid them yet.
If we were not able to think about potentially bad outcomes well, that would be a problem, as clearly thinking about them is (hopefully) what avoids them. But the question is a good one. My first intuition was that maybe the importance of an outcome - in both directions, good and bad - is relevant.
I like the examples from 8.4.2:
- Note the difference between saying (A) “the idea of going to the zoo is positive-valence, a.k.a. motivating”, versus (B) “I want to go to the zoo”. [...]
- Note the difference between saying (A) “the idea of closing the window popped into awareness”, versus (B) “I had the idea to close the window”. Since (B) involves the homunculus as a cause of new thoughts, it’s forbidden in my framework.
I think it could be an interesting mental practice to rephrase inner speech involving "I" in this way. I have been doing this for a while now. It started toward the end of my last meditation retreat when I switched to a non-CISM (or should I say "there was a switch in the thoughts about self-representation"?). Using "I" in mental verbalization felt like a syntax error and other phrasings like you are suggesting here, felt more natural. Interestingly, it still makes sense to use "I" in conversations to refer to me (the speaker). I think that is part of why the CISM is so natural: It uses the same element in internal and external verbalizations[1].
Pondering your examples, I think I would render them differently. Instead of: "I want to go to the zoo," it could be: "there is a desire to go to the zoo." Though I guess if "desire to" stands for "positive-valence thought about", it is very close to your "the idea of going to the zoo is positive-valence."
In practice, the thoughts would be smaller, more like "there is [a sound][2]," "there is a memory of [an animal]," "there is a memory of [an episode from a zoo visit]," "there is a desire to [experience zoo impressions]," "there is a thought of [planning]." The latter gets complicated. The thought of planning could be positive valence (because plans often lead to desirable outcomes) or the planning is instrumentally useful to get the zoo impressions (which themselves may be associated with desirable sights and smells), or the planning can be aversive (because effortful), but still not strong enough to displace the desirable zoo visit.
For an experienced meditator, the fragments that can be noticed can be even smaller - or maybe more pre-cursor-like. This distinction is easier to see with a quiet mind, where, before a thought fully occupies attention, glimmers of thoughts may bubble up[3]. This is related to noticing that attention is shifting. The everyday version of that happens when you notice that you got distracted by something. The subtler form is noticing small shifts during your regular thinking (e.g., I just noticed my attention shifting to some itch, without that really interrupting my writing flow). But I'm not sure how much of that is really a sense of attention vs. a retroactive interpretation of the thoughts. Maybe a more competent meditator can comment.
And now I wonder whether the phonological loop, or whatever is responsible for language-like thoughts, maybe subvocalizations, is what makes the CISM the default model.
[brackets indicate concepts that are described by words, not the words themselves]
The question is though, what part notices the noticing. Some thought of [noticing something] must be sufficiently stable and active to do so.
After reading the first section and skimming the rest, my impression is that the document is a good overview, but does not present any detailed argument for why godlike AI would lead to human extinction. (Except for the "smarter species" analogy, which I would say doesn't qualify.) So if I put on my sceptic hat, I can imagine reading the whole document in detail and somewhat-justifiably going away with "yeah, well, that sounds like a nice story, but I am not updating based on this".
That seems fine to me, given that (as far as I am concerned) no detailed convincing arguments for AI X-risk exist. But at the moment, the summary of the document gave me the impression that maybe some such argument would appear. So I suggest updating the summary (or some other part of the doc) to make it explicit that no detailed argument for AI X-risk will be given.
I don't think this is a good approach, and could easily backfire. The problem isn't that you need people to find errors in your reasoning. It's that you need to find the errors in your reasoning, fix them as best you can, iterate that a few times, then post your actual reasoning in a more thorough form, in a way that is collaborative and not combative. Then what you post may be in a form where it's actually useful for other people to pick it apart and discuss further.
The fact that you specify you want to put in little effort is a major red flag. So is the fact that you want to be perceived as someone worth listening to. The best way to be perceived as being worth listening to is to be worth listening to, which means putting in effort. An approach that focuses on signaling instead of being is a net drain on the community's resources and cuts against the goal of having humanity not die. It takes time and work to understand a field well enough for your participation to be a net positive.
That said, it's clear you have good questions you want to discuss, and there are some pretty easy ways to reformat your posts that would help. Could probably be done in at most an extra hour per post, less as it becomes habitual.
Some general principles:
Some suggestions for improving the doc (I noticed the link to the editable version too late, apologies):
What is AI? Who is building it? Why? And is it going to be a future we want?
Something weird with the last sentence here (substituting "AI" for "it" makes the sentence un-grammatical).
Machines of hateful competition need not have such hindrances.
"Hateful" seems likely to put off some readers here, and I also think it is not warranted -- indifference is both more likely and also sufficient for extinction. So "Machines of indifferent competition" might work better.
There is no one is coming to save us.
Typo, extra "is".
The only thing necessary for the triumph of evil is for good people to do nothing. If you do nothing, evil triumphs, and that’s it.
Perhaps rewrite this for less antagonistic language? I know it is a quote and all, but still. (This can be interpreted as "the people building AI are evil and trying to cause harm on purpose". That seems false. And including this in the writing is likely to give the reader the impression that you don't understand the situation with AI, and stop reading.)
Perhaps (1) make it apparent that the first thing is a quote and (2) change the second sentence to "If you do nothing, our story gets a bad ending, and that's it.". Or just rewrite the whole thing.
So, there is a legitimate complaint here. It's true that sailors in the ancient world had a legitimate reason to want a word in their language whose extension was {salmon, guppies, sharks, dolphins, ...}. (And modern scholars writing a translation for present-day English speakers might even translate that word as fish, because most members of that category are what we would call fish.) It indeed would not necessarily be helping the sailors to tell them that they need to exclude dolphins from the extension of that word, and instead include dolphins in the extension of their word for {monkeys, squirrels, horses, ...}. Likewise, most modern biologists have little use for a word that groups dolphins and guppies together.
Ok, but salmon and guppies are more closely related to dolphins than to sharks. Like, I get where you are going with this, but "fish" is barely a natural category, and it isn't obviously more of one than "all descendants of the last common ancestor of the actinopterygians". Even if you limit it to marine descendants, it still lets you predict bone vs. cartilaginous skeletal system.
The political version of the question isn't functionally the same as the skin cream version, because the former isn't a randomized intervention—cities that decided to add gun control laws seem likely to have other crime-related events and law changes at the same time, which could produce a spurious result in either direction. So it's quite reasonable to say "My opinion is determined by my priors and the evidence didn't appreciably affect my position."
This is not a formal definition.
Your English sentence has no apparent connection to mathematical objects, which would be necessary for a rigorous and formal definition.
Relatedly, I have a vague understanding of how product safety certification works in the EU, and there are multiple private companies doing the certification in every state.
I would have found it helpful in your report for there to be a ROSES-type diagram or other flowchart showing the steps in your paper collation. This would bring it closer in line with other scoping reviews and would have made it easier to understand your methodology.
Thanks for that list of papers/posts. Most of the papers you linked are not included because they did not feature in either of our search strategies: (1) titles containing specific keywords that we searched for on arXiv; (2) the paper is linked on the company's website. I agree this is a limitation of our methodology. We won't add these papers in now, as that would be somewhat ad hoc and inconsistent between the companies.
Re the blog posts from Anthropic and what counts as a paper, I agree this is a tricky demarcation problem. We included the 'Circuit Updates' because it was linked to as a 'paper' on the Anthropic website. Even if GDM has a higher bar for what counts as a 'paper' than Anthropic, I think we don't really want to be adjudicating this, so I feel comfortable just deferring to each company about what counts as a paper for them.
For an overview of why such a guarantee would turn out impossible, I suggest taking a look at Will Petillo's post Lenses of Control.
Defining alignment (sufficiently rigorous so that a formal proof of (im)possibility of alignment is conceivable) is a hard thing!
It's less hard than you think, if you use a minimal-threshold definition of alignment:
That "AGI" continuing to exist, in some modified form, does not result eventually in changes to world conditions/contexts that fall outside the ranges that existing humans could survive under.
When it comes to Buddhist practice, it's worth noting that practicing techniques by the book is not how Buddhism was practiced for most of the time in the last 2500 years. It was mostly an oral tradition and as such the knowledge that's passed down from teacher to student evolves over time in various ways.
Many modern Buddhist traditions put much more emphasis on meditation in contrast to ritualized behavior.
In Buddhism (and in Christianity for that matter), for thousands of years meditation was largely done in monasteries and not by lay-people. In many Buddhist communities, "lay-people aren't supposed to meditate" is something you could call "ancient wisdom".
If someone convinces you in a Western context that following some practice is ancient wisdom, they are likely doing a lot of picking and choosing in a way that does not make it clear how ancient the thing they are promoting actually happens to be.
I think your explanation in section 8.5.2 resolves our disagreement nicely. You refer to S(X) thoughts that "spawn up" successive thoughts that eventually lead to X (I'd say X') actions shortly after (or much later). While I was referring to S(X) that cannot give rise to X immediately. I think the difference was that you are more lenient with what X can be, such that S(X) can be about an X that is happening much later, which wouldn't work in my model of thoughts.
Explicit (self-reflective) desire
Statement: “I want to be inside.”
Intuitive model underlying that statement: There’s a frame (§2.2.3) “X wants Y” (§3.3.4). This frame is being invoked, with X as the homunculus, and Y as the concept of “inside” as a location / environment.
How I describe what’s happening using my framework: There’s a systematic pattern (in this particular context), call it P, where self-reflective thoughts concerning the inside, like “myself being inside” or “myself going inside”, tend to trigger positive valence. That positive valence is why such thoughts arise in the first place, and it’s also why those thoughts tend to lead to actual going-inside behavior.
In my framework, that’s really the whole story. There’s this pattern P. And we can talk about the upstream causes of P—something involving innate drives and learned heuristics in the brain. And we can likewise talk about the downstream effects of P—P tends to spawn behaviors like going inside, brainstorming how to get inside, etc. But “what’s really going on” (in the “territory” of my brain algorithm) is a story about the pattern P, not about the homunculus. The homunculus only arises secondarily, as the way that I perceive the pattern P (in the “map” of my intuitive self-model).
Yes, I think there is a more general proof available. This proof form would combine limits to predictability and so on, with a lethal dynamic that falls outside those limits.
The question is more if it can ever be truly proved at all, or if it doesn't turn out to be an undecidable problem.
Control limits can show that it is an undecidable problem.
A limited scope of control can in turn be used to prove that a dynamic convergent on human-lethality is uncontrollable. That would be a basis for an impossibility proof by contradiction (cannot control AGI effects to stay in line with human safety).
In many respects, I expect this to be closer to what actually happens than "everyone falls over dead in the same second" or "we definitively solve value alignment". Multipolar worlds, AI that generally follows the law (when operators want it to) but cannot fully be trusted, and generally muddling through are the default future. I'm hoping we don't get instrumental survival drives though.