But staying on the frontier seems to be a really hard job. Lots of new research comes out every day, and scientists struggle to follow it. New research has a lot of value while it's hot, and loses it as the field progresses and the work gets absorbed into general theory (which is a much more worthwhile thing to learn).
Which raises the question: if you are not currently at the cutting edge and actively advancing your field, why follow new research at all? After a while, the field condenses the most important and useful research into neat textbooks and overview articles, and reading those when they appear is a much more efficient use of time. While you are not at the cutting edge, read condensations of previous work until you get there.
Also, it seems like there is not much of that in the field of alignment. I want there to be more work on unifying (previously frontier) alignment research and more effort to construct paradigms in this preparadigmatic field (but maybe I just haven't looked hard enough).
Two separate points:
A much better version of this idea: https://slatestarcodex.com/2017/11/09/ars-longa-vita-brevis/
Also, it seems like there is not much of that in the field of alignment. I want there to be more work on unifying (previously frontier) alignment research and more effort to construct paradigms in this preparadigmatic field (but maybe I just haven't looked hard enough)
I am surprised by the lack-of-distillation claim. I'd naively have expected distillation to be more neglected in physics than in alignment. Is there something in particular that you think could be more distilled?
Regarding research that tries to come up with new paradigms, here are a few reasons why you might not be observing much of it: I guess it is less funded by the big labs and is spread across all kinds of orgs and individuals. Maybe check MIRI, PIBBSS, ARC (theoretical research), and Conjecture, or check who went to ILIAD. Fewer of these researchers publish all of their research compared to AI safety researchers at AGI labs, so you might simply not have been aware it was going on. Some are also actively avoiding research on things that could be easily applied and tested, because of capability externalities (I think Vanessa Kosoy mentions this somewhere in the YouTube videos on Infrabayesianism).
Is there something in particular that you think could be more distilled?
What I had in mind is something like a more detailed explanation of recent reward hacking/misalignment results. Like, sure, we have old arguments about reward hacking and misalignment, but what I want is more gears for when particular reward hacking would happen in which model class.
Maybe check MIRI, PIBBSS, ARC (theoretical research), and Conjecture, or check who went to ILIAD.
Those are top-down approaches, where you start with an idea and then do research on it. That is of course useful, but it's doing more frontier research by expanding the surface area. Applying my distillation intuition to them would mean having some overarching theory unifying all the approaches, which seems super hard and maybe not even possible. But looking at the intersections of pairs of agendas might prove useful.
The neuroscience/psychology side of the alignment problem, as opposed to the ML side, seems quite neglected (it's harder on the one hand, but on the other it's easier to avoid working on something capabilities-related if you just don't focus on the cortex). There's reverse-engineering human social instincts, for example. In principle it would benefit from more high-quality experiments in mice, but those are expensive.
Please, just please, don't start RSI on purpose. For years, AI x-risk people have warned us that a huge danger comes with AI capable of RSI, and that even the mere existence of such AI poses a threat. We were afraid we would accidentally miss the point of no return, and now so many people (not only in major AI companies, but in smaller labs too) are purposefully trying to bring that point closer.
Programs sometimes don't work as we expect them to, even when we are the ones designing them. How would making the hallucination machine do this job produce something so powerful with working guardrails?
I know your comment isn't an earnest attempt to convince people, but fwiw:
For years, AI x-risk people have warned us that a huge danger comes with AI capable of RSI
I think this argument is more likely to have the opposite effect than intended when used on the types of people pushing on RSI. I think your final paragraph would be much more effective.
Idiot disaster monkeys indeed. I still believe we as a species can make less fatal choices, even though many individual people in the AI industry are working very hard of their own free will to prove me wrong.
I recently prepared an overview lecture about research directions in AI alignment for the Moscow AI Safety Hub. I had limited time, so I did the following: I reviewed all the sites on the AI safety map, examined the 'research' sections, and attempted to classify the problems they tackle and the research paths they pursue. I encountered difficulties in this process, partly because most sites lack a brief summary of their activities and objectives (Conjecture is one of the counterexamples). I believe that the field of AI safety would greatly benefit from improved communication, and providing a brief summary of a research direction seems like low-hanging fruit.
When someone is doing physics (trying to find out what happens to a physical system given its initial conditions), they are performing a transformation: from the time-consuming-but-easy-to-express form that connects initial conditions to the end result (physical laws), to the form of a single entry in a giant look-up table matching initial conditions to end results (not time-consuming, but much harder to express), essentially flattening out the time dimension. That creates a feeling that the process they are analyzing is predetermined, that this giant look-up table already exists. And when they apply this to themselves, it can create a feeling of having no control over their own actions, as if those observation-action pairs were drawn from a pre-existing table. But this table doesn't actually exist; they still need to perform the computation to get to the action; there is no way around it. And wherever that computation is performed, that process is the person.
In other words, when people do physics on systems simple enough that they can fit in their head the initial conditions, the end result, and the connection between them, they feel a sense of "machineness" about those systems. They can overgeneralize that feeling over all physical systems (like humans), missing out on the fact that this feeling should only be felt when they actually can fit the model of the system (and the initial-condition/end-result entries) in their head, which they can't in the case of humans.
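Here's a toy sketch of what I mean by "flattening out the time dimension" (the update rule, the step count, and the tiny state space are all made up for illustration):

```python
# A toy dynamical system: repeatedly apply a simple update rule ("physical law").
def step(state):
    return (state * 3 + 1) % 17  # arbitrary made-up rule

def run(initial_state, n_steps=1000):
    """The easy-to-express but time-consuming form:
    you actually have to carry out the computation through time."""
    state = initial_state
    for _ in range(n_steps):
        state = step(state)
    return state

# The "giant look-up table" form: every (initial condition -> end result)
# pair written out in advance. Instant to query, but it only exists because
# the computation was already performed for every entry.
glut = {s0: run(s0) for s0 in range(17)}

assert glut[5] == run(5)  # same answer; the table just hides the time dimension
```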
They can overgeneralize that feeling over all physical systems (like humans), missing out on the fact that this feeling should only be felt
I don't follow why this is "overgeneralize" rather than just "generalize". Are you saying it's NOT TRUE for complex systems, or just that we can't fit it in our heads? I can't compute the Mandelbrot Set in my head, and I can't measure initial conditions well enough to predict a multi-arm pendulum beyond a few seconds. But there's no illusion of will for those things, just a simple acknowledgement of complexity.
The "will" is supposedly taken away by GLUT, which is possible to create and have a grasp of it for small systems, then people (wrongly) generalize this for all systems including themselves. I'm not claiming that any object that you can't predict has a free will, I'm saying that having ruled out free will from a small system will not imply lack of free will in humans. I'm claiming "physicality no free will" and "simplicity no free will", I'm not claiming "complexity free will".
Hmm. What about the claim "physicality -> no free will"? This is the more common assertion I see, and the one I find compelling.
The simplicity/complexity point I more often see attributed to "consciousness" (and I agree: complexity does not imply consciousness, but simplicity denies it), but that's at least partly orthogonal to free will.
I'm claiming ... "simplicity ⇒ no free will"
Consider the ASP (Agent Simulates Predictor) problem, where the agent gets to decide whether it can be predicted, whether there is a dependence of the predictor on the agent. The agent can destroy the dependence by knowing too much about the predictor and making use of that knowledge. So this "knowing too much" (about the predictor) is what destroys the dependence, but it's not just a consequence of the predictor being too simple; rather, it comes from letting an understanding of the predictor's behavior precede the agent's behavior. It's in the agent's interest not to let this happen, to avoid making use of this knowledge (in an unfortunate way), so as to maintain the dependence (so that it gets to predictably one-box).
So here, when you are calling something simple as opposed to complicated, you are positing that its behavior is easy to understand, and so it's easy to have something else make use of knowledge of that behavior. But even when that's easy, it can be avoided intentionally. So even simple things can have free will (such as humans in the eyes of a superintelligence) from a point of view that decides to avoid knowing too much, which can be a good thing to do and, as the ASP problem illustrates, can influence said behavior (the behavior could be different if not known, since the fact of not being known could itself be easily knowable to the behavior).
I'd say this is correct, but it's also deeply counterintuitive. We don't feel like we are just a process performing itself, or at least that's way too abstract to wrap our heads around. The intuitive notion of free will is IMO something like the following:
had I been placed ten times in exactly the same circumstances, with exactly the same input conditions, I could theoretically have come up with different courses of action in response, even though one of them may make a lot more sense for me; this stems from some kind of ineffable non-deterministic quality that isn't random either, but is the manifestation of a self that exists somehow untethered from the laws of causality
Of course it's not worded exactly that way in most people's minds, but I think that's really the intuition that clashes with pure determinism. Determinism is a materialistic viewpoint, and lots of people are, consciously or not, dualists: implicitly assuming there's one special set of rules that applies to the self/mind/soul and doesn't apply to everything else.
Some confusion remains appropriate, because for example there is still no satisfactory account of a sense in which the behavior of one program influences the behavior of another program (in the general case, without constructing these programs in particular ways), with neither necessarily occurring within the other at the level of syntax. In this situation, the first program could be said to control the second (especially if it understands what's happening to it), or the second program could be said to perform analysis of (reason about) the first.
Just Turing machines / lambda terms, or something like that. And "behavior" is however you need to define it to make a sensible account of the dependence between "behaviors", or of how one of the "behaviors" produces a static analysis of the other. The intent is to capture a key building block of acausal consequentialism in a computational setting, which is one way of going about formulating free will in a deterministic world.
(You don't just control the physical world through your physical occurrence in it, but also for example through the way other people are reasoning about your possible behaviors, and so an account that simply looks for your occurrence in the world as a subterm/part misses an important aspect of what's going on. As Turing machines also illustrate, not having subterm/part structure.)
Wake up babe, new superintelligence company just dropped
And they show some impressive results.
The Math Inc. team is excited to introduce Gauss, a first-of-its-kind autoformalization agent for assisting human expert mathematicians at formal verification. Using Gauss, we have completed a challenge set by Fields Medallist Terence Tao and Alex Kontorovich in January 2024 to formalize the strong Prime Number Theorem (PNT) in Lean (GitHub).
Gauss took 3 weeks to do so, which seems way outside METR's task-length-horizon prediction. Though I'm not sure that's a fair comparison, both because we don't have a baseline human time for this task and because formalization is a domain where it's very hard to get off track: the criterion of success is very crisp.
I think alignment researchers have to learn to use it (or any other powerful theorem-proving assistant) in order to exploit every bit of leverage we can get.
Just as you can unjustly privilege a low-likelihood hypothesis just by thinking about it, you can in the exact same way unjustly unprivilege a high-likelihood hypothesis just by thinking about it. Example: I believe that when I press a key on a keyboard, the letter on the key is going to appear on the screen. But I do not consciously believe that; most of the time I don't even think about it. And so, just by thinking about it, I am questioning it, separating it from all hypotheses which I believe and do not question.
Some breakthroughs were in the form of "Hey, maybe something which nobody ever thought of is true," but some very important breakthroughs were in the form "Hey, maybe this thing which everybody just assumes to be true is false."
I'm curious about the distinction you're making between "believe" and "consciously believe". Do you agree with the way I'm using these terms below? —
I can only be conscious of a small finite number of things at once (maybe only one, depending on how tight a loop we mean by "consciousness"). The set of things that I would say I believe, if asked about them, is rather larger than the number of things I can be conscious of at once. Therefore, at any moment, almost none of my beliefs are conscious beliefs. For instance, an hour ago, "the moon typically appears blue-white in the daytime sky" was an unconscious belief of mine, but right now it is a conscious belief because I'm thinking about it. It will soon become an unconscious belief again.
Your definition seems sensible to me. Humans are not Bayesians; they are not built as probabilistic machines with all of their probabilities stored explicitly in memory. So I usually think in terms of a Bayesian approximation, which is basically what you've said: a belief is unconscious when you aren't trying to model it as Bayesian, and conscious otherwise.
Money is a good approximation for what people value. Value can be destroyed. But what should I do to money to destroy the value it encompasses?
I might feel bad if somebody stole my wallet, but that money hasn't been destroyed; it is just now going to bring utility to another human, and if I (for some weird reason) value the quality of life of the robber just as much as my own, I wouldn't even think something bad has happened.
If I actually destroy money, like burn it to ashes, then there will be less money in circulation, which will increase the value of each banknote, making everyone a bit richer (and me a lot poorer). So is it balanced in that case?
Maybe I need to read some economics; please recommend a book that would dissolve the question.
If you are destroying something you own, you would value the destruction of that thing more than any other use you have for that thing and any price you could sell it for on the market, so this creates value in the sense that there is no deadweight loss to the relevant transactions/actions.
You can destroy others’ value intentionally, but only in extreme circumstances where you’re not thinking right or have self-destructive tendencies can you “intentionally” destroy your own value. But then we hardly describe the choices such people make as “intentional”. Eg the self-destructive person doesn’t “intend” to lose their friends by not paying back borrowed money. And those gambling at the casino, despite not thinking right, can’t be said to “intend” to lose all their money, though they “know” the chances they’ll succeed.
You might not value the destruction as much as others valued the thing you destroyed. In other words, you're assuming homo economicus, I'm not.
To complete your argument: "and therefore the action has some deadweight loss associated with it, meaning it's destroying value."
But note that by the same logic, any economic activity destroys value, since you are also not homo economicus when you buy ice cream, and there will likely be smarter things you could do with your money, or better deals. Therefore buying ice cream, or doing anything else, destroys value.
But that is absurd, and we clearly don't have such a broad definition of "destroy value". So your argument proves too much.
Money is a claim on things other people value. You can't destroy value purely by doing something with your claim on that value.
Except the degenerate case of "making yourself or onlookers sad by engaging in self-destructive behaviors where you destroy your claim on resources", I guess. But it's not really an operation purely with money.
Hmm, I guess you can make something's success conditional on your having money (e.g., a startup backed by your investments), and then deliberately destroy your money, dooming the thing. But that's a very specific situation and it isn't really purely about the money either; it's pretty similar to "buy a thing and destroy it". Closest you can get, I think?
(Man, I hope this is just a concept-refinement exercise and I'm not giving someone advice on how to do economics terrorism.)
(Epistemic status: not an economist.)
Money is not value, but the absence of value. Where money is, it can be spent, replacing the money by the thing bought. The money moves to where the thing was.
Money is like the empty space in a sliding-block puzzle. You must have the space to be able to slide the blocks around, rather than having to spot where you can pull out several at once and put them back in a different arrangement.
Money is the slack in a system of exchange that would otherwise have to operate by face-to-face barter or informal systems of credit. Informal, because as soon as you formalise it, you've reinvented money.
IANAE. This is a really interesting riddle, because even in incidents of fraud or natural disaster, from an economic standpoint the intrinsic value isn't lost: if a distillery full of barrels of whisky goes up in flames and there's nothing recoverable, then elsewhere in the whisky market you would presume that prices would go up, as supply is now scarcer relative to demand, and you would expect that "loss" to be dispersed as a gain among their competitors. You would think. (Not to mention the distiller's expenditure to their suppliers and employees: any money that changed hands, they keep, so the opportunity cost of the whisky didn't go up in smoke.)
I say "you would think" because price elasticity isn't necessarily instantaneous, nor is it perfect: the correction in prices can be delayed, especially if information is delayed. Like you said, money is a good approximation of what people value, but there is a certain amount of noise and lag.
For example, what if there is no elasticity in the whisky market? What if there was already an oversupply and the distiller was never going to recoup their investment (even if the fire hadn't wiped them out)? It's really interesting, because in theory they would have to drop their prices until someone would buy. But not only is information not instantaneous, there's no certainty that it would happen like that.
You might be interested in reading George Soros' speech on Reflexivity, which describes how the intrinsic value of things (like financial securities) and their market value sometimes drift further apart or closer together. What's interesting is that if perception and prices rise, this can actually feed back and push the intrinsic value itself higher or lower.
No one ever knows precisely what the intrinsic value is at, and since it is reflexive and affected by the market value, this makes it much more elusive.
Really, somewhere along the line value is being created, because whenever someone develops a more efficient means of producing the same output, the value of a dollar increases, since the same output can be bought for less. That suggests value can also be destroyed if those techniques or abilities are lost (e.g., the last COBOL coder dies and there's no one to replace him, so a less efficient system has to be used), but I think most real-world examples of that are probably due to poor flow of information or misinformation.
At the end of the day it all feels suspiciously close to Aristotle's Potentiality and Actuality Dichotomy.
If I see a YouTube video pop up in my feed right after it’s published, I can often come up with a comment that gets a lot of likes and ends up near the top of the comment section.[1] It’s actually not that hard to do: the hardest part is being quick enough[2] to get into the first 10-30 comments (which I assume is the average number of comments viewers glance over), but the comment itself might be pretty generic and not that relevant to the video’s content.
Do you know a way I could use that? You can suggest advice for achieving convergent instrumental goals, usual human goals, and (most importantly) AI x-risk reduction. If you think I’m hyper-online or delusional about this, you can also point it out.
I wouldn’t be surprised if it’s actually not that hard and my success is just a consequence of being hyper-online.
I also suspect that the YouTube algorithm might have learned about this ability of mine and has now categorized me as a “top commenter,” so it shows me videos earlier than others and uses me to “boost engagement” or smth.
The general principle is that sufficiently smart people, by default, win most competitions among 100 randos that they care to enter (given sufficient training, when that's at all relevant).
How many degrees of freedom do you have in which comment you post? Do highly upvoted comments boost a video's reach? Do you have an example of a comment that was useful in the way Community Notes are on X?
It’s actually not that hard to do: the hardest part is being quick enough[2] to get into the first 10-30 comments (which I assume is the average number of comments viewers glance over), but the comment itself might be pretty generic and not that relevant to the video’s content.
This confirms my view that YouTube comments are never worth reading.
My suspicions match your footnotes - this is probably accidental and fragile, so attempting to mess with it to get other dimensions of value (in propagating ideas orthogonal to the video, or monetizing, or influencing anything) is going to make it go away.
That said, it'd be interesting to measure whether there IS a unique/special value: track the popularity of videos you comment on, then randomize NOT commenting on some videos you were about to comment on (literal coin flip or other no-judgement procedure), and check whether the comments impact the popularity of the videos. That would give you a bit of signal that the comment IS providing value in some way, which you could then figure out how to exploit.
Partner with an onlyfans model, make a YouTube account for promoting her, use a thirsttrappy avatar for it, and post those comments using that account. Don't forget to promote her onlyfans in that YouTube account.
IIUC, those are just bots who copy early and liked comments. So my comment would also be copied by other bots.
If the future contains far more people than we have today, and if people are going to have their memory upgraded, and if the information about us on the internet is going to be preserved, then each person alive today is going to be kind of a celebrity.
It’s as if our civilization started with 10 people and they recorded every second of their lives: we would know almost everything about them. People would read their quotes, live by their wisdom, and create cults around them.
Sending information is equivalent to storing information if you consider Galilean relativity (any experiment performed in a frame of reference moving at a constant speed is equivalent to the same experiment in a static frame of reference).
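A minimal way to write down the symmetry I have in mind (standard Galilean boost formulas; the worldline notation is just for illustration):

```latex
% Galilean boost between frames S and S' moving at relative velocity v:
%   x' = x - v t,   t' = t.
% A bit "stored" at rest at position x_0 in S has the worldline x(t) = x_0;
% in S' the same worldline reads x'(t) = x_0 - v t, i.e. the bit is being
% "sent" at speed v. Relativity of inertial frames says any experiment on
% that bit comes out the same either way.
\[
  x' = x - vt, \qquad t' = t
  \quad\Longrightarrow\quad
  x(t) = x_0 \ \text{(storing)} \ \longleftrightarrow\ x'(t) = x_0 - vt \ \text{(sending)}.
\]
```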
Sometimes, the amount of optimization power that was put into the words is less than you expect, or less than the gravity of the words would imply.
Some examples:
"You are not funny." (Did they evaluate your funniness across many domains and in diverse contexts in order to justify a claim like that?)
"Don't use this drug, it doesn't help." (Did they do the double-blind studies on a diverse enough population to justify a claim like that?)
"That's the best restaurant in town." (Did they really go to every restaurant in town? Did they consider that different people have different food preferences?)
That doesn't mean you should disregard those words. You should use them as evidence. But instead of updating on the event "I'm not funny," you should update on the event "This person, having some intent, not putting a lot of effort into evaluating this thing and mostly going off the vibes and shortness of the sentence, said to me 'You are not funny.'"
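A toy Bayes calculation of the difference (all the numbers here are invented for illustration):

```latex
% Let H = "I'm funny" and E = "this person told me 'you are not funny'".
% Offhand remarks are weak evidence: people say this kind of thing even
% to funny people, so the likelihoods are close, e.g.
%   P(E | not H) = 0.5,   P(E | H) = 0.3.
% Updating on E (the thing that actually happened), in odds form:
\[
  \frac{P(H \mid E)}{P(\neg H \mid E)}
  = \frac{P(H)}{P(\neg H)} \cdot \frac{P(E \mid H)}{P(E \mid \neg H)}
  = \frac{P(H)}{P(\neg H)} \cdot \frac{0.3}{0.5}
\]
% A modest shift, very different from treating "I'm not funny" itself as
% an observed fact.
```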
People often say, "Oh, look at this pathetic mistake AI made; it will never be able to do X, Y, or Z." But they would never say to a child who made a similar mistake that they will never amount to doing X, Y, or Z, even though the theoretical limits on humans are much lower than for AI.
Is there a reason to hate Bill Gates? From a utilitarian perspective, he might be “the best person ever,” considering how much he gives to effective charities.
Do people just use the “billionaire = evil” heuristic, or are there other considerations?
A lot of it is the billionaire = evil heuristic. If you try to steelman the argument, it's essentially that anyone who actually becomes a billionaire, given the way the incentives work in capitalism, probably did a lot of questionable things to get into such a position (at the very least, massively exploiting the surplus labour of their workers, if you think that's a thing), while also choosing not to give away enough to stop being a billionaire (like Chuck Feeney or Yvon Chouinard did, though admittedly late in life), when there are lots of starving children in the world that they could easily save right now (at least in theory).
For Bill Gates in particular, he probably also still has his reputation as a founder and former CEO of Microsoft, which was, at least back in the 1990s, known in popular culture as an "evil" company that maintained a monopoly with unsavoury tactics and by buying out competitors, etc. Back then, he was seen as the face of greedy American tech giant capitalism, before other players like Amazon or Google even existed. To a lot of people then, his philanthropy is seen not as true altruism, but more something akin to trying to redeem himself in the eyes of the public and ensure his legacy is well-regarded.
Personally, I'm not that cynical, but a lot of people are, so you get the hate.
Idea status: butterfly idea
In real life, there are too many variables to optimize each one. But if a variable is brought to your attention, it is probably important enough to consider optimizing it.
Negative example: you don’t see your eyelids; they are doing their job of protecting your eyes, so there’s no need to optimize them.
Positive example: you tie your shoelaces; they are the focus of your attention. Can this process be optimized? Can you learn to tie shoelaces faster, or learn a more reliable knot?
Humans already do something like this, but mostly consider optimizing a variable when it annoys them. I suggest widening the consideration space because the “annoyance” threshold is mostly emotional and therefore probably optimized for a world with far fewer variables and much smaller room for improvement (though I only know evolutionary psychology at a very surface level and might be wrong).
In comments to this quick take I am planning to report my intellectual journey: what I read, what I learned, what exercises I've done, on which projects/research problems I worked. Thanks to @TristianTrim for suggesting the idea. Feel free to comment anything you think might be helpful/relevant.
Solomonoff Induction is incredibly powerful. It's so powerful that it can't exist in our world. But because of its power, it needs to be handled with care. For it to actually produce accurate hypotheses, you have to expose it to as much evidence as possible, because even the tiniest coincidence in your data (which will happen if you don't collect the widest dataset possible) would be interpreted as a Deep Rule of the world.
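For reference, the usual way the prior is written (standard notation, with U a universal prefix machine; nothing here is specific to my point):

```latex
% Solomonoff's universal prior: every program p whose output starts with
% the observed data x contributes weight 2^(-|p|):
\[
  M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|}
\]
% A short program that hard-codes a spurious regularity in a narrow
% dataset gets a large share of this weight, which is the "tiniest
% coincidence becomes a Deep Rule" failure mode above.
```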
or at least make them less obvious
My eyes are tired of AI-generated images. At this point, I would even prefer Corporate Memphis. It saddens me every time I see an obviously AI-generated image on the website of some good cause (like alignment agendas).
One counterexample is LessWrong’s featured articles, which sometimes use AI-generated backgrounds, but those are usually rather abstract, and their imperfections are less noticeable and actually fit the style.
Some folks on LessWrong already push back really hard on AI-generated text, and I’d like to add some pushback on AI-generated images too.
White House launches Manhattan project for AI (sorta).
In this pivotal moment, the challenges we face require a historic national effort, comparable in urgency and ambition to the Manhattan Project that was instrumental to our victory in World War II and was a critical basis for the foundation of the Department of Energy (DOE) and its national laboratories.
I just resolved my confusion about CoT monitoring.
My previous confusion: People say that CoT is progress in interpretability, that we now have a window into the model's thoughts. But why? LLMs are still just as black-boxy as they were before; we still don't know what happens at the token level, and there’s no reason to think we understand it better just because intermediate results can be viewed as human language.
Deconfusion: Yes, LLMs are still black boxes, but CoT is a step toward interpretability because it improves capabilities without making the black box bigger. In an alternate universe, we could just have even bigger, even messier LLMs (and I assume interpretability gets harder with size: after all, some small transformers have been interpreted), and observing the progress of CoT reasoning models is an update away from this universe, which was the (subjective) default path before this update.
Rules can generate examples. For instance: DALLE-3 is a rule according to which different examples (images) are generated.
From examples, rules can be inferred. For example: with a sufficient dataset of images and their captions, a DALLE-3 model can be trained on it.
In computer science, there is a concept called Kolmogorov complexity of data. It is (roughly) defined as the length of the shortest program capable of producing that data.
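In symbols (the standard definition, fixing some universal machine U):

```latex
\[
  K(x) \;=\; \min\{\, |p| \;:\; U(p) = x \,\}
\]
% the length of the shortest program p that makes the universal machine U
% output exactly the data x (up to a machine-dependent additive constant).
```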
Some data are simple and can be compressed easily; some are complex and harder to compress. In a sense, the task of machine learning is to find a program of a given size that serves as a "compression" of the dataset.
In the real world, although knowing the underlying rule is often very useful, sometimes it is more practical to use a giant look-up table (GLUT) of examples. Sometimes you need to memorize the material instead of trying to "understand" it.
Sometimes there are examples that are more complex than the rule that generated them. For example, in the interval [0;1] (which is quite easy to describe, the rule being: all numbers are not greater than 1 and not less than 0), there exists a number containing all the works of Shakespeare (which definitely cannot be compressed to a description comparable to that of the interval [0;1]).
Or consider the program that outputs every natural number from 1 up to some astronomically large but simply describable bound (the program is very short, because the Kolmogorov complexity of such a bound is low): at some point it will produce a binary encoding of LOTR. In that case, the complexity lies in the starting index; the map for finding the needle in the haystack is as valuable (and as complex) as the needle itself.
Properties follow from rules. It is not necessary to know about every example of a rule in order to have some information about all of them. Moreover, all examples taken together can have less information (or Kolmogorov complexity) than the sum of their individual Kolmogorov complexities (as in the example above).
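A small sketch of the "complexity lives in the index" point (the target string here is a short stand-in; in reality it would be the full binary encoding of LOTR):

```python
# A very short "rule": enumerate the binary encodings of 1, 2, 3, ...
# It eventually emits any fixed bit string; the hard part is knowing
# *where* to look, and that index is what carries the complexity.
def binary_of(n):
    return bin(n)[2:]

target = "101101"  # stand-in for "a binary encoding of LOTR"

n = 1
while binary_of(n) != target:
    n += 1

# The enumerator is a few lines long; the "address" n is where the
# information about the target actually lives.
print(f"found {target!r} at index n = {n}")
```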