But staying on the frontier seems to be a really hard job. Lots of new research comes out every day, and scientists struggle to keep up with it. New research carries a lot of value while it's hot, and loses it as the field progresses and the work gets absorbed into general theory (which is then a much more worthwhile use of time to learn).
Which raises the question: if you are not currently at the cutting edge and actively advancing your field, why follow new research at all? After a while, the field condenses the most important and useful research into neat textbooks and overview articles, and reading those when they appear is a much more efficient use of time. While you are not at the cutting edge, read condensations of previous work until you get there.
Also, it seems like there is not much of that in the field of alignment. I want there to be more work on unifying (previously frontier) alignment research and more effort to construct paradigms in this preparadigmatic field (but maybe I just haven't looked hard enough).
Two separate points:
A much better version of this idea: https://slatestarcodex.com/2017/11/09/ars-longa-vita-brevis/
Also, it seems like there is not much of that in the field of alignment. I want there to be more work on unifying (previously frontier) alignment research and more effort to construct paradigms in this preparadigmatic field (but maybe I just haven't looked hard enough)
I'm surprised by the claim about the lack of distillation. I'd naively have expected distillation to be more neglected in physics than in alignment. Is there something in particular that you think could be more distilled?
Regarding research that tries to come up with new paradigms, here are a few reasons why you might not be observing much of it. I guess it is less funded by the big labs and is spread across all kinds of orgs and individuals; maybe check MIRI, PIBBBS, ARC (theoretical research), and Conjecture, or check who went to ILIAD. More of these researchers didn't publish all their research, compared to AI safety researchers at AGI labs, so you might not have been aware it was going on. Some are also actively avoiding research on things that could be easily applied and tested, because of capability externalities (I think Vanessa Kosoy mentions this somewhere in the YouTube videos on Infrabayesianism).
Is there something in particular that you think could be more distilled?
What I had in mind is something like a more detailed explanation of recent reward hacking/misalignment results. Like, sure, we have old arguments about reward hacking and misalignment, but what I want is more gears for when particular reward hacking would happen in which model class.
Maybe check MIRI, PIBBBS, ARC (theoretical research), and Conjecture, or check who went to ILIAD.
Those are top-down approaches, where you have an idea and then do research on it. That is of course useful, but it's doing more frontier research by expanding the surface area. Trying to apply my distillation intuition to them would be like having some overarching theory unifying all the approaches, which seems super hard and maybe not even possible. But looking at the intersections of pairs of agendas might prove useful.
The neuroscience/psychology side of the alignment problem, rather than the ML side, seems quite neglected (it's harder, on the one hand, but it's also easier not to work on something capabilities-related if you just don't focus on the cortex). There's reverse-engineering human social instincts, for example. In principle it would benefit from more high-quality experiments in mice, but those are expensive.
I recently prepared an overview lecture about research directions in AI alignment for the Moscow AI Safety Hub. I had limited time, so I did the following: I reviewed all the sites on the AI safety map, examined the 'research' sections, and attempted to classify the problems they tackle and the research paths they pursue. I encountered difficulties in this process, partly because most sites lack a brief summary of their activities and objectives (Conjecture is one of the counterexamples). I believe that the field of AI safety would greatly benefit from improved communication, and providing a brief summary of a research direction seems like low-hanging fruit.
When someone is doing physics (trying to find out what happens to a physical system given its initial conditions), they are transforming the time-consuming-but-easy-to-express form of the connection between initial conditions and end result (the physical laws) into a single entry of a giant look-up table that matches initial conditions to end results (the not-time-consuming-but-harder-to-express form), essentially flattening out the time dimension. That creates a feeling that the process they are analyzing is predetermined, that this giant look-up table already exists. And when they apply this to themselves, it can create a feeling of having no control over their own actions, as if those observation-action pairs were drawn from a pre-existing table. But this table doesn't actually exist; they still need to perform the computation to get to the action; there is no way around it. Wherever that process is performed, that process is the person.
In other words, when people do physics on systems simple enough that they can fit in their head the initial conditions, the end result, and the connection between them, they feel a sense of "machineness" about those systems. They can overgeneralize that feeling over all physical systems (like humans), missing out on the fact that this feeling should only be felt when they can actually fit the model of the system (and its initial-condition/end-result entries) in their head, which they can't in the case of humans.
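A minimal sketch of that "flattening" in toy Python (the step function and the tiny state space are made up, just to make the contrast concrete):

```python
# Toy contrast: applying the "physical law" tick by tick vs. looking the answer
# up in a giant look-up table (GLUT) that flattens out the time dimension.

def step(state: int) -> int:
    """Stand-in for a physical law: one tick of time evolution."""
    return (3 * state + 1) % 17

def evolve(initial: int, ticks: int) -> int:
    """Time-consuming but easy to express: apply the law over and over."""
    state = initial
    for _ in range(ticks):
        state = step(state)
    return state

TICKS = 1_000
# "Not time-consuming but harder to express": one entry per initial condition.
glut = {s: evolve(s, TICKS) for s in range(17)}

# Same answer either way, but the table had to be computed at some point;
# it doesn't pre-exist the computation.
assert glut[5] == evolve(5, TICKS)
```

The lookup only feels "predetermined" because the computation that filled the table has already been performed somewhere.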
They can overgeneralize that feeling over all physical systems (like humans), missing out on the fact that this feeling should only be felt
I don't follow why this is "overgeneralize" rather than just "generalize". Are you saying it's NOT TRUE for complex systems, or just that we can't fit it in our heads? I can't compute the Mandelbrot Set in my head, and I can't measure initial conditions well enough to predict a multi-arm pendulum beyond a few seconds. But there's no illusion of will for those things, just a simple acknowledgement of complexity.
The "will" is supposedly taken away by GLUT, which is possible to create and have a grasp of it for small systems, then people (wrongly) generalize this for all systems including themselves. I'm not claiming that any object that you can't predict has a free will, I'm saying that having ruled out free will from a small system will not imply lack of free will in humans. I'm claiming "physicality no free will" and "simplicity no free will", I'm not claiming "complexity free will".
Hmm. What about the claim "physicality -> no free will"? This is the more common assertion I see, and the one I find compelling.
Simplicity/complexity I usually see attributed to "consciousness" (and I agree: complexity does not imply consciousness, but simplicity denies it), but that's at least partly orthogonal to free will.
I'm claiming ... "simplicity ⇒ no free will"
Consider the ASP (Agent Simulates Predictor) problem, where the agent gets to decide whether it can be predicted, whether there is a dependence of the predictor on the agent. The agent can destroy that dependence by knowing too much about the predictor and making use of that knowledge. So this "knowing too much" (about the predictor) is what destroys the dependence, but it's not just a consequence of the predictor being too simple; rather, it comes from letting an understanding of the predictor's behavior precede the agent's behavior. It's in the agent's interest not to let this happen, to avoid making use of this knowledge (in an unfortunate way), so as to maintain the dependence (so that it gets to predictably one-box).
So here, when you are calling something simple as opposed to complicated, you are positing that its behavior is easy to understand, and so it's easy to have something else make use of knowledge of that behavior. But even when it's easy, it could be avoided intentionally. So even simple things can have free will (such as humans in the eyes of a superintelligence), from a point of view that decides to avoid knowing too much, which can be a good thing to do, and as the ASP problem illustrates can influence said behavior (the behavior could be different if not known, as the fact of not-being-known could happen to be easily knowable to the behavior).
I'd say this is correct, but it's also deeply counterintuitive. We don't feel like we are just a process performing itself, or at least that's way too abstract to wrap our heads around. The intuitive notion of free will is IMO something like the following:
had I been placed ten times in exactly the same circumstances, with exactly the same input conditions, I could theoretically have come up with different courses of action in response to them, even though one of them may make a lot more sense for me, based on some kind of ineffable non-deterministic quality that isn't random either, but is the manifestation of a self that exists somehow untethered from the laws of causality
Of course it's not worded exactly that way in most people's minds, but I think that's really the intuition that clashes against pure determinism. Determinism is a materialistic viewpoint, and lots of people are, consciously or not, dualists - implicitly assuming there's one special set of rules that applies to the self/mind/soul that doesn't apply to everything else.
Some confusion remains appropriate, because for example there is still no satisfactory account of a sense in which the behavior of one program influences the behavior of another program (in the general case, without constructing these programs in particular ways), with neither necessarily occurring within the other at the level of syntax. In this situation, the first program could be said to control the second (especially if it understands what's happening to it), or the second program could be said to perform analysis of (reason about) the first.
Just Turing machines / lambda terms, or something like that. And "behavior" is however you need to define it to make a sensible account of the dependence between "behaviors", or of how one of the "behaviors" produces a static analysis of the other. The intent is to capture a key building block of acausal consequentialism in a computational setting, which is one way of going about formulating free will in a deterministic world.
(You don't just control the physical world through your physical occurrence in it, but also for example through the way other people are reasoning about your possible behaviors, and so an account that simply looks for your occurrence in the world as a subterm/part misses an important aspect of what's going on. As Turing machines also illustrate, not having subterm/part structure.)
Money is a good approximation for what people value. Value can be destroyed. But what should I do to money to destroy the value it encompasses?
I might feel bad if somebody stole my wallet, but that money hasn't been destroyed; it is just now going to bring utility to another human, and if I (for some weird reason) value the quality of life of the robber just as much as my own, I wouldn't even think something bad has happened.
If I actually destroy money, like burn it to ashes, then there will be less money in circulation, which will increase the value of each banknote, making everyone a bit richer (and me a little poorer). So is it balanced in that case?
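A back-of-the-envelope check of that intuition, under a crude quantity-theory assumption (velocity of money and real output held fixed, so the price level simply scales with the money supply):

\[
\begin{aligned}
&\text{money supply } M,\ \text{price level } P,\ \text{amount burned } m
  \;\Rightarrow\; P' = P \cdot \frac{M-m}{M},\\
&\text{real value of everyone else's money: } \frac{M-m}{P} \text{ before},\quad
  \frac{M-m}{P'} = \frac{M}{P} \text{ after},\\
&\text{their gain } = \frac{M}{P} - \frac{M-m}{P} = \frac{m}{P},
  \text{ exactly the purchasing power I burned.}
\end{aligned}
\]

So on that toy picture it is balanced: burning money transfers its purchasing power pro rata to everyone else holding money, and nothing is destroyed in aggregate (whether that survives real-world frictions like price stickiness and information lag is another matter).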
Maybe I need to read some economics; please recommend a book that would dissolve the question.
If you are destroying something you own, you would value the destruction of that thing more than any other use you have for that thing and any price you could sell it for on the market, so this creates value in the sense that there is no deadweight loss to the relevant transactions/actions.
You can destroy others’ value intentionally, but only in extreme circumstances where you’re not thinking right or have self-destructive tendencies can you “intentionally” destroy your own value. But then we hardly describe the choices such people make as “intentional”. E.g., the self-destructive person doesn’t “intend” to lose their friends by not paying back borrowed money. And those gambling at the casino, despite not thinking right, can’t be said to “intend” to lose all their money, though they “know” the chances they’ll succeed.
You might not value the destruction as much as others valued the thing you destroyed. In other words, you're assuming homo economicus, I'm not.
To complete your argument, ‘and therefore the action has some deadweight loss associated with it, meaning it’s destroying value’.
But note that by the same logic, any economic activity destroys value, since you are also not homo economicus when you buy ice cream, and there are likely smarter things you could do with your money, or better deals to be had. Therefore buying ice cream, or doing anything else, destroys value.
But that is absurd, and we clearly don’t have such a broad definition of “destroy value”. So your argument proves too much.
Money is a claim on things other people value. You can't destroy value purely by doing something with your claim on that value.
Except the degenerate case of "making yourself or onlookers sad by engaging in self-destructive behaviors where you destroy your claim on resources", I guess. But it's not really an operation purely with money.
Hmm, I guess you can make something's success conditional on your having money (e.g., a startup backed by your investments), and then deliberately destroy your money, dooming the thing. But that's a very specific situation and it isn't really purely about the money either; it's pretty similar to "buy a thing and destroy it". Closest you can get, I think?
(Man, I hope this is just a concept-refinement exercise and I'm not giving someone advice on how to do economics terrorism.)
(Epistemic status: not an economist.)
Money is not value, but the absence of value. Where money is, it can be spent, replacing the money by the thing bought. The money moves to where the thing was.
Money is like the empty space in a sliding-block puzzle. You must have the space to be able to slide the blocks around, instead of spotting where you can pull out several at once and put them back in a different arrangement.
Money is the slack in a system of exchange that would otherwise have to operate by face-to-face barter or informal systems of credit. Informal, because as soon as you formalise it, you've reinvented money.
IANAE. This is a really interesting riddle. Because even in incidents of fraud or natural disaster, from an economic standpoint the intrinsic value isn't lost: if a distillery full of barrels of whisky goes up in flames and there's nothing recoverable, then elsewhere in the whisky market you would presume that prices would go up, as supply is now scarcer relative to demand, and you would expect that "loss" to be dispersed as a gain among their competitors - you would think. (Not to mention the expenditure of the distiller to their suppliers and employees - any money that changed hands, they keep - so the opportunity cost of the whisky didn't go up in smoke.)
I say "you would think" because Price elasticity is it isn't necessarily instantaneous nor is it perfect - the correction in prices can be delayed especially if information is delayed. Like you said - money is a good approximation of what people value but there is a certain amount of noise and lag.
For example, what if there is no elasticity in the whisky market? What if there was already an oversupply and the distiller was never going to recoup their investment even if the fire hadn't wiped them out? It's really interesting, because in theory they would have to drop their prices until someone would buy. But not only is information not instantaneous, there's also no certainty that it would happen like that.
You might be interested in reading George Soros' speech on reflexivity, which describes how the intrinsic value of things (like financial securities) and their market value sometimes drift further apart or closer together. What's interesting is that if perception and prices rise, this can actually push the intrinsic value itself higher or lower.
No one ever knows precisely what the intrinsic value is, and since it is reflexive and affected by the market value, that makes it much more elusive.
Really, somewhere along the line value is being created, because whenever someone develops a more efficient means of producing the same output, the value of a dollar increases, since the same output can be bought for less. That suggests that value can also be destroyed if those techniques or abilities are lost (e.g. the last COBOL coder dies and there's no one to replace him, so they have to use a less efficient system) - but I think most real-world examples of that are probably due to poor flow of information or misinformation.
At the end of the day it all feels suspiciously close to Aristotle's Potentiality and Actuality Dichotomy.
People often say, "Oh, look at this pathetic mistake AI made; it will never be able to do X, Y, or Z." But they would never say to a child who made a similar mistake that they will never amount to doing X, Y, or Z, even though the theoretical limits on humans are much lower than for AI.
Idea status: butterfly idea
In real life, there are too many variables to optimize each one. But if a variable is brought to your attention, it is probably important enough to consider optimizing it.
Negative example: you don’t see your eyelids; they are doing their job of protecting your eyes, so there’s no need to optimize them.
Positive example: you tie your shoelaces; they are the focus of your attention. Can this process be optimized? Can you learn to tie shoelaces faster, or learn a more reliable knot?
Humans already do something like this, but mostly consider optimizing a variable when it annoys them. I suggest widening the consideration space because the “annoyance” threshold is mostly emotional and therefore probably optimized for a world with far fewer variables and much smaller room for improvement (though I only know evolutionary psychology at a very surface level and might be wrong).
Rules can generate examples. For instance: DALLE-3 is a rule according to which different examples (images) are generated.
From examples, rules can be inferred. For example: with a sufficiently large dataset of images and their captions, a DALLE-3 model can be trained.
In computer science, there is a concept called Kolmogorov complexity of data. It is (roughly) defined as the length of the shortest program capable of producing that data.
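In symbols, for a fixed universal machine \(U\) (a standard formalization; the choice of \(U\) only changes the value by an additive constant):

\[
K_U(x) = \min\{\, |p| \;:\; U(p) = x \,\}
\]

i.e., the length of the shortest program \(p\) that makes \(U\) output \(x\).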
Some data are simple and can be compressed easily; some are complex and harder to compress. In a sense, the task of machine learning is to find a program of a given size that serves as a "compression" of the dataset.
In the real world, although knowing the underlying rule is often very useful, sometimes it is more practical to use a giant look-up table (GLUT) of examples. Sometimes you need to memorize the material instead of trying to "understand" it.
Sometimes there are examples that are more complex than the rule that generated them. For example, the interval [0;1] is quite easy to describe (the rule being: all numbers not greater than 1 and not less than 0), yet it contains a number whose digits encode all the works of Shakespeare (and that number definitely cannot be compressed to a description comparable in size to the description of [0;1]).
Or consider the program that outputs every natural number from 1 to some enormous but simply describable number N (the program is very short, because the Kolmogorov complexity of such an N is low): at some point it will produce a binary encoding of LOTR. In that case the complexity lies in the starting index; the map for finding the needle in the haystack is as valuable (and as complex) as the needle itself.
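A toy way to see that in Python (a six-bit string stands in for the binary encoding of LOTR so it finishes quickly; the outputs are read as one long bit stream):

```python
# Enumerate 1, 2, 3, ... and concatenate their binary encodings.
# Any fixed bit string (here a tiny stand-in for LOTR) shows up eventually;
# the hard-to-compress part is *where* it shows up, not the enumeration rule.

target = "101101"              # stand-in for the binary encoding of LOTR
stream, n = "", 0
while target not in stream:
    n += 1
    stream += format(n, "b")   # append n's binary representation

offset = stream.index(target)
print(f"found after enumerating 1..{n}, at bit offset {offset}")
```

The enumeration rule stays short no matter how long the target is; what grows is the offset you would have to write down to point at it, i.e., the map to the needle.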
Properties follow from rules. It is not necessary to know about every example of a rule in order to have some information about all of them. Moreover, all the examples together can have less information (lower Kolmogorov complexity) than the sum of their individual Kolmogorov complexities (as in the examples above).