I would question the framing of mental subagents as "mesa optimizers" here. This sneaks in an important assumption: namely that they are optimizing anything. I think the general view of "humans are made of a bunch of different subsystems which use common symbols to talk to one another" has some merit, but I think this post ascribes a lot more agency to these subsystems than I would. I view most of the subagents of human minds as mechanistically relatively simple.
For example, I might reframe a lot of the elements of talking about the unattainable "object of...
I'm interested in the "Xi will be assassinated/otherwise killed if he doesn't secure this bid for presidency" perspective. Even if he was put in a position where he'd lose the bid for a third term, is it likely that he'd be killed for stepping down? The four previous paramount leaders weren't. Is the argument that he's amassed too much power/done too much evil/burned too many bridges in getting his level of power?
Although I think most people who amass Xi's level of power are best modelled as desiring power (or at least as executing patterns which have in the past maximized power) for its own sake, so I guess the question of threat to his life is somewhat moot with regards to policy.
Seems like there's a potential solution to ELK-like problems: force the information to move from the AI's ontology to (its model of) a human's ontology, and then force it to move back again.
This gets around "basic" deception since we can always compare the AI's ontology before and after the translation.
The question is how we force the knowledge to go through the (modelled) human's ontology, and how we know the forward and backward translators aren't misbehaving in some way.
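As a toy illustration of the round-trip check (everything here is hypothetical: real translators would be learned models, not hand-written lookup tables, and `to_human`/`from_human` are names I've made up):

```python
# Toy sketch of a round-trip consistency check between ontologies.
# In a real system the two translators would be learned models; here
# they are hand-written dicts purely to show the shape of the check.

AI_TO_HUMAN = {"photon_flux_high": "the room is bright",
               "actuator_7_engaged": "the door is open"}
HUMAN_TO_AI = {v: k for k, v in AI_TO_HUMAN.items()}

def to_human(ai_state):
    """Translate AI-ontology propositions into the modelled human ontology."""
    return [AI_TO_HUMAN[p] for p in ai_state]

def from_human(human_state):
    """Translate back from the human ontology into the AI ontology."""
    return [HUMAN_TO_AI[p] for p in human_state]

def round_trip_consistent(ai_state):
    """The 'basic' deception check: the state must survive the round trip."""
    return from_human(to_human(ai_state)) == ai_state

state = ["photon_flux_high", "actuator_7_engaged"]
print(round_trip_consistent(state))  # True for this toy translator
```

The comparison of the AI's ontology before and after translation is exactly the `round_trip_consistent` check; the open problem is ensuring the two translators actually route through the human ontology rather than smuggling information around it.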
Unmentioned but large comparative advantage of this: it's not based in the Bay Area.
The typical alignment pitch of "Come and work on this super-difficult problem you may or may not be well suited for at all" is a hard enough sell for already-successful people (which intelligent people often are) without adding "Also, you have to move to this one specific area of California, which has a bit of a housing and crime problem and a very particular culture."
Unmentioned but large comparative advantage of this: it's not based in the Bay Area.
It's based in the Bay Area of England (Oxford), though, with no mention of remote work. So, all the same pathologies: extreme liberal politics, high taxes and cost of living, a Dutch-disease economy captured by NIMBYs with a lock on ever-escalating real estate prices and bans on density, and persistent blatant crime and homelessness (in some ways worse: I was never yelled at by the homeless in SF the way I was in Oxford, and one woman tried to scam me twice. I was there for all of two weeks!).
I was referring to "values" more like the second case. Consider the choice blindness experiments (which are well-replicated). People think they value certain things in a partner, or politics, but really it's just a bias to model themselves as being more agentic than they actually are.
Both of your examples share the common fact that the information is verifiable at some point in the future. In this case the best option is to put down money. Or even just credibly offer to put down money.
For example, X offers to bet Y $5000 (possibly at very high odds) that in the year 2030 (after the Moon Nazis have invaded) they will provide a picture of the moon. If Y takes this bet seriously they should update. In fact all other actors A, B, C, who observe this bet will update.
The same is (sort of) true of the second case: just credibly bet some money...
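The update observers can draw from such a bet can be sketched with a toy expected-value calculation (the numbers below are hypothetical, chosen only to match the flavour of the example):

```python
def min_credence_to_accept(stake, payout):
    """Minimum probability the bettor must assign to the event for risking
    `stake` to win `payout` to have non-negative expected value:
    p * payout - (1 - p) * stake >= 0  =>  p >= stake / (stake + payout)."""
    return stake / (stake + payout)

# Hypothetical: X risks $5000 to win only $100 (very high odds in favour
# of the event). Observers can infer X's credence is at least ~98%, and
# should update towards the event accordingly.
p_min = min_credence_to_accept(stake=5000, payout=100)
print(round(p_min, 3))  # 0.98
```

The point is that the bet converts a cheap verbal claim into a costly signal: anyone who takes the bettor to be even roughly rational can read a lower bound on their credence straight off the odds.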
When you say the idea of human values is new, do you mean the idea of humans having values with regards to a utilitarian-ish ethics, is new? Or do you mean the concept of humans maximizing things rationally (or some equivalent concept) is new? If it's the latter I'd be surprised (but maybe I shouldn't be?).
From my experience as a singer, relative pitch exercises are much more difficult when the notes are a few octaves apart. So making sure the notes jump around over a large range would probably help.
You make some really excellent points here.
The teapot example is atypical of deception in humans, and was chosen to be simple and clear-cut. I think the web-of-lies effect is hampered in humans by a couple of things, both of which result from us only being approximations of Bayesian reasoners. One is the limit on our computation: we can't go and check a new update that "snake oil works" against all possible connections. Another (also linked to computation limits) is that I suspect a small enough discrepancy gets rounded down to zero.
So...
I interpret (at least some of) this behaviour as being more about protecting the perception of NFTs as a valid means of ownership than protecting the NFT directly. As analogy, if you bought the Mona Lisa to gain status from owning it and having people visit it, but everyone you spoke to made fun of you and said that they had a copy too, you might be annoyed.
Although before I read your comment I had actually assumed this upset behaviour was mostly coming from trolls - who had right-click copied the NFTs - making fake accounts to LARP as NFT owners. I don't ...
I think I understand now. My best guess is that if your proof was applied to my example the conclusion would be that my example only pushes the problem back. To specify human values via a method like I was suggesting, you would still need to specify the part of the algorithm that "feels like" it has values, which is a similar type of problem.
I think I hadn't grokked that your proof says something about the space of all abstract value/knowledge systems whereas my thinking was solely about humans. As I understand it, an algorithm that picks out human values from a simulation of the human brain will correspondingly do worse on other types of mind.
I don't understand this. As far as I can tell, I know what my preferences are, and so that information should in some way be encoded in a perfect simulation of my brain. Saying there is no way at all to infer my preferences from all the information in my brain seems to contradict the fact that I can do it right now, even if me telling them to you isn't sufficient for you to infer them.
Once an algorithm is specified, there is no more extra information to specify how it feels from the inside. I don't see how there can be any more information necessary on top of a perfect model of me to specify my feeling of having certain preferences.
This is a great analysis of different causes of modularity. One thought I have is that L1/L2 and pruning seem similar to one another on the surface, but very different to dropout, and all of those seem very different to goal-varying.
If penalizing the total strength of connections during training is sufficient to enforce modularity, could it be the case that dropout is actually just penalizing connections? (e.g. as the effect of a non-firing neuron is propagated to fewer downstream neurons)
I can't immediately see a reason why a goal-varying scheme could penalize connections but I wonder if this is in fact just another way of enforcing the same process.
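A minimal sketch of the L1 side of this comparison (pure Python, using soft-thresholding as a stand-in for a real training loop; the weight values are random illustrative data):

```python
import random

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(64)]

def l1_step(w, lam=0.1):
    """One proximal step of an L1 penalty: shrink each weight towards zero,
    zeroing out the weakest connections entirely (soft-thresholding)."""
    return [max(abs(x) - lam, 0.0) * (1 if x > 0 else -1) for x in w]

for _ in range(5):
    weights = l1_step(weights)

pruned = sum(1 for x in weights if x == 0.0)
print(f"{pruned}/64 connections pruned to zero")
```

If dropout really does act like an implicit connection penalty, one would expect a qualitatively similar drive towards sparse, modular weight structure; the sketch above only shows the explicit L1 mechanism, not the dropout side of the conjecture.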
I think the tweet about the NHS app(s) is slightly misleading. I'm pretty confident those screenshots relate to two separate apps: one is a general health services app which can also be used to generate a certificate of vaccination (as the app has access to health records). The second screenshot relates to a covid-specific app which enables "check-ins" at venues for contact-tracing purposes, and the statement there seems to be declaring that the local information listing venues visited could - in theory - be used to get demographic information. One is called the "NHS App" and the other is called the "NHS Covid 19 App" so it's an understandable confusion.
I'm afraid I didn't intend for people to be able to add conditions to their plans. While something like that is completely reasonable I can't find a place to draw the line between that and what would be too complex. The only system that might work is having everyone send me their own python code but that's not fair on people who can't code, and more work than I'm willing to do. Other answers haven't included conditions and I think it wouldn't be fair on them. I think my decision is that:
If you don't get the time to respond with a time to move on from
I think your comment excellently illustrates the problems with the experiment!
Next to the upvote/downvote buttons there's a separate box for agreement/disagreement. I think the aim is to separate "this post contributes to the discussion in a positive/negative way" from "I think the claims expressed here are accurate". It's active in the comments of the post I linked in my comment and there's a pinned comment from Ruby explaining it.
I'm very interested to try the new two-axis voting system, but it seems to only be active on one post, which also happens to be very tied up with some current Bay Area-specific issues, limiting who can actually engage with it. I also think it would be good for the community to get to "practice" with such voting on some topics which are easier to discuss, so norms can be established before moving on to the more explosive ones. I'd like to see more posts with this enabled, perhaps a few more people with posts having >20 comments currently on the frontpage...
Sure! I was planning to anyway, but that plus my own busyness means it will more likely be early next week, or even later if people would prefer.
As the supply of housing at all levels is currently outstripped by unmet demand, the optimal local move is to replace cheaper-per-space housing with expensive-per-space housing, where the latter is targeted towards rich people, whenever permission from local government can be obtained. If the unmet demand for housing at all levels were much smaller, then this move wouldn't be profitable by default, and developers would have to choose where to build new marginal rich-people-targeted houses more carefully. For some human-desirable variable "strength of commu...
But I don't know why it's downvoted so far - it's an important topic, and I'm glad to have some more discussion of it here (even if I disagree with the conclusions and worry about the unstated assumptions).
I agree with this. The author has made a number of points I disagree with but hasn't done anything worthy of heavy downvotes (like having particularly bad epistemics, being very factually wrong, personally attacking people, or making a generally low-effort or low-quality post). This post alone has changed my views towards favouring a modification of the upvote/downvote system.
In the described scenario, the end result is omnicide. Thus, it is not much different from the AI immediately killing all humans.
I strongly disagree with this. I would much, much rather be killed immediately than suffer for a trillion years and then die. This is for the same reason that I would rather enjoy a trillion years of life and then die, than die immediately.
...In this case, the philosophy's adherents have no preference between dying and doing something else with zero utility (e.g. touching their nose). As humans encounter countless actions of a
Your argument rests on the fact that people who have endured a million years of suffering could - in theory - be rescued and made happy, with it only requiring "tech and time". In an S-risk scenario, that doesn't happen.
In what I'd consider the archetypical S-risk scenario, an AI takes over, starts simulating humans who suffer greatly, and there is no more human agency ever again. The (simulated) humans experience great suffering until the AI runs out of power (some time trillions of years in the future when the universe can no longer power any more com...
I imagine their response would be along the lines of: "Why the hell should I let someone who doesn't even know how big a Dull Viper is tell me how to hunt it!?"
I think it won't be easy to modify the genome of individuals to achieve predictable outcomes even if you get the machinery you describe to work.
Is this because of factors like the almost-infinite number of interactions between different genes, such that even with a hypothetical magic technology to arbitrarily and perfectly change the DNA in every cell in the body, it wouldn't be possible to predict the outcome of such a change? Or is it because you don't think that any machinery will ever be precise enough to make this work well enough? Or some other issue entirely?
What I meant is changing the genetic code in ~all of the cells in a human body. Or some sort of genetic engineering which has the same effect as that.
Here's one model I have as to how you could genetically engineer a living human:
Many viruses are able to reverse-transcribe RNA to DNA and insert that DNA into cells. This causes a lot of problems for cells, but there are (probably) large regions of the genome where insertions of new DNA wouldn't cause problems. I don't think it would be difficult to target insertion of DNA to those regions, as DNA binding pr...
I think cultural evolution will be the greater factor by a large margin. I think the technology for immortality is possible but that it will either directly involve genetic engineering of living humans, or be one or two steps away from it. People who are willing to take an immortality drug are very likely to also be willing to improve themselves in other ways. If the Horde is somehow going to outcompete them due entirely to beneficial mutations, the Imperium could simply steal them.
Thanks! I get your arguments about "knowledge" being restricted to predictive domains, but I think it's (mostly) just a semantic issue. I also don't think the specifics of the word "knowledge" are particularly important to my points which is what I attempted to clarify at the start, but I've clearly typical-minded and assumed that of course everyone would agree with me about a dog/fish classifier having "knowledge", when it's more of an edge-case than I thought! Perhaps a better version of this post would have either tabooed "knowledge" altogether or picked a more obviously-knowledge-having model.
This is a pretty strong indication of immune escape to me, if it persists in other outbreaks. If this was purely from increased infectiousness in naive individuals it would imply an R-value (in non-immune populations) of like 40 or something, which seems much less plausible than immune escape. I don't know what the vaccination/infection rates are in these communities though.
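The rough arithmetic behind that claim, with hypothetical numbers (I don't know the actual vaccination/infection rates in these communities):

```python
def implied_r0(observed_r, immune_fraction):
    """The basic reproduction number needed to explain observed spread
    purely through infectiousness, assuming the immune fraction is fully
    protected (i.e. no immune escape): R_eff = R0 * (1 - immune_fraction)."""
    return observed_r / (1.0 - immune_fraction)

# Hypothetical: observed R of 4 in a community where 90% have immunity
# from vaccination or prior infection. Under no-escape assumptions this
# would require an implausibly enormous R0.
print(round(implied_r0(observed_r=4.0, immune_fraction=0.9), 1))  # 40.0
```

So rapid spread in a heavily-immune population forces a choice between an extreme R0 and partial immune escape, and the latter is the less extraordinary hypothesis.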
The UK has just switched their available rapid Covid tests from a moderately unpleasant one to an almost unbearable one. Lots of places require them for entry. I think the cost/benefit makes sense even with the new kind, but I'm becoming concerned we'll eventually reach the "imagine a society where everyone hits themselves on the head every day with a baseball bat" situation if cases approach zero.
My current belief on this is that the greatest difficulty is going to be finding the "human values" in the AI's model of the world. Any AI smart enough to deceive humans will have a predictive model of humans which almost trivially must contain something that looks like "human values". The biggest problems I see are:
1: "Human values" may not form a tight abstracted cluster in a model of the world at all. This isn't so much conceptual issue as in theory we could just draw a more complex boundary around them, but it makes it practically more difficult....
Very good point. Perhaps there just intrinsically is no way of doing something that this community perceives as "burning" money, without upsetting people.
Having now had a lot of different conversations on consciousness I'm coming to a slightly disturbing belief that this might be the case. I have no idea what this implies for any of my downstream-of-consciousness views.
I'm confident your model of Eliezer is more accurate than mine.
Neither the twitter thread or other writings originally gave me the impression that he had a model in that fine-grained detail. I was mentally comparing his writings on consciousness to his writings on free will. Reading the latter made me feel like I strongly understood free will as a concept, and since then I have never been confused, it genuinely reduced free will as a concept in my mind.
His writings on consciousness have not done anything more than raise that model to the same level of poss...
Basically yes I care about the subjective experiences of entities. I'm curious about the use of the word "still" here. This implies you used to have a similar view to mine but changed it, if so what made you change your mind? Or have I just missed out on some massive shift in the discourse surrounding consciousness and moral weight? If the latter is the case (which it might be, I'm not plugged into a huge number of moral philosophy sources) that might explain some of my confusion.
he defines consciousness as "what an algorithm implementing complex social games feels like when reflecting on itself".
In that case I'll not use the word consciousness and abstract away to "things which I ascribe moral weight to", (which I think is a fair assumption given the later discussion of eating "BBQ GPT-3 wings" etc.)
Eliezer's claim is therefore something along the lines of: "I only care about the suffering of algorithms which implement complex social games and reflect on themselves" or possibly "I only care about the suffering of algorithms ...
You present an excellently-written and interesting case here. I agree with the point that self-modelling systems can think in certain ways which are unique and special and chickens can't do that.
One reason I identify consciousness with having qualia is that Eliezer specifically does that in the twitter thread. The other is that qualia is generally less ambiguous than terms like consciousness and self-awareness and sentience. The disadvantage is that the concept of qualia is something which is very difficult (and beyond my explaining capabilities) to explai...
Eliezer later states that he is referring to qualia specifically, which for me are (within a rounding error) totally equivalent to moral relevance.
My first thought was that this could be avoided by - if the button was pressed - giving it to a "rare diseases in cute puppies" type charity, rather than destroying it. I'd suspect the intersection of "people who care strongly enough about effective altruism to be angry", "people who don't understand the point of Petrov Day", and "people who have the power to generate large amounts of negative publicity" is very small.
But I think a lot of LWers who are less onboard with Petrov Day in general would be just as (or almost as) turned off by this concept as the...
Just realized I'm probably feeling much worse than I ought to on days when I fast because I've not been taking sodium. I really should have checked this sooner. If you're planning to do long (I do a day, which definitely feels long) fasts, take sodium!
The green belt problem is not one I'd considered before. I've always assumed the biggest problems for places like London were the endless low-density suburbs rather than the limit on building houses outside of a certain radius. If you work in the centre of London and live in some new development just outside the green belt, that already seems like something of a failure.
I don't want to doubt the expert economic analysis though, perhaps removing it would allow people to move from the suburbs to new developments, freeing up suburb space. This also seems wron...
Actually that's a good point, I think that's the only rule which doesn't need to be written (which I completely forgot to mention). Other rules regarding text can be manipulated the same way the other rules can.
Using Python, I conducted a few different analyses:
Proportion of character wins vs other characters:
Proportion of character wins when paired with other characters:
With these I gave each possible team a score: for each character, the sum over enemy characters of that character's proportion of wins, plus the sum over teammates of that character's proportion of wins when paired with them, all summed across the team. The highest-scoring team was:
Rock-n-Roll Ranger, Blaze Boy, Nullifying Nightmare, Greenery Giant, Tidehollow Tyrant
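The scoring rule could be sketched like this (the win-rate tables here are tiny made-up stand-ins for the real proportions computed from the dataset):

```python
# win_vs[c][e]: proportion of games character c wins against enemy e.
# win_with[c][t]: proportion of wins for c when paired with teammate t.
# Illustrative data only; the real tables came from the match records.
win_vs = {"A": {"B": 0.6, "C": 0.7}, "B": {"A": 0.4, "C": 0.5},
          "C": {"A": 0.3, "B": 0.5}}
win_with = {"A": {"B": 0.55, "C": 0.65}, "B": {"A": 0.55, "C": 0.45},
            "C": {"A": 0.65, "B": 0.45}}

def team_score(team, enemies):
    """Sum, over each team member, their win proportion against every
    enemy plus their win proportion when paired with each teammate."""
    score = 0.0
    for c in team:
        score += sum(win_vs[c][e] for e in enemies)
        score += sum(win_with[c][t] for t in team if t != c)
    return score

print(team_score(team=["A", "B"], enemies=["C"]))
```

The actual search was just this score evaluated over every possible team, keeping the maximum.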
This was much more pleasant than using Excel! I think I
The first example doesn't seem like a game of chicken to me, since neither Alexi nor Beth can make a change themselves. It may be that they have "inherited" the debate from their political factions' respective allies, who are actually playing a game of chicken. But Alexi and Beth are doing the classic political topic "talking past one another" and part of this seems to be that they're treating different sets of actions as reachable, and only assigning should-ness to reachable actions.
This is a "review" in the sense of reviewing the paper. I actually haven't used AlphaFold or crystallographic data as the protein I'm currently studying only takes on a defined structure when bound to certain metals (ruling out AlphaFold) and has yet to be crystallized.
I was also halfway through a review of this book. Since I've only met one other person who'd read it I thought it was unlikely anyone else would! I guess LWers have more similar interests than I would have predicted.
I suppose I'll review another book instead!
Though the task seemed really interesting, I didn't even enter an answer, as I lost interest after some preliminary analysis. Almost all of these applied to me too. The data was presented in an Excel-unfriendly way, and as I'm currently settling into a new job I didn't have the energy to code a Python script to trawl through the data. I suspect participation was weighted towards those with more experience of statistical languages. A better presentation might have been a log of all squares ships had planned to go through with encounters listed there (with...
This is an excellent comment, and I'm very glad to see my thinking inspiring others!
My own findings on the issue are as such:
I am confident that mitochondrial dysfunction is upstream of AD.
This one gene called PGC-1 alpha is probably involved or something.
I do not know what (if anything) is upstream of that. It could be immune system health but the immune system is so complex that my understanding of it is generally poor.
Mitochondria which are defective are replaced in cells through a process called mitophagy. Stimulating the creation of mitochondria (mito...
Further observations having graphed all encounter damage as a histogram:
Dragon: Sometimes does zero, often does a lot of damage, long tail
Harpies: Usually do zero, occasionally do one of a few values up to about 0.2
Iceberg: One of ten-ish values spaced sporadically between 0 and 0.3
Kraken: Exponential-ish distribution with tail going up to 0.9 ish
Merfolk: Usually do zero, flat-ish distribution which goes up to 0.65
Sharks: Often do zero, otherwise one of a few values up to like 0.15, not a big threat
Storm: Exponential-ish with faster dropoff than kraken
WMF
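A minimal sketch of how summaries like those above could be computed, using made-up sample values in place of the real encounter log:

```python
# Illustrative damage samples only; the real analysis used the full log.
damage = {
    "Harpies": [0.0, 0.0, 0.0, 0.1, 0.0, 0.2, 0.0, 0.0],
    "Kraken":  [0.1, 0.3, 0.05, 0.6, 0.2, 0.9, 0.15, 0.4],
}

def summarise(values):
    """Fraction of zero-damage encounters, plus the worst case observed,
    as a quick fingerprint of each encounter's damage distribution."""
    zero_frac = sum(v == 0.0 for v in values) / len(values)
    return {"zero_frac": zero_frac, "max": max(values)}

for name, vals in damage.items():
    print(name, summarise(vals))
```

Histograms of the full distributions (as described above) carry more shape information, but even these two numbers separate "usually harmless, occasionally nasty" encounters from long-tailed ones.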
I'd go for:
Reinforcement learning agents do two sorts of planning. One is applying the dynamics (world-modelling) network and running a Monte Carlo tree search (or something like it) over explicitly-represented world states. The other is implicit in the future-reward-estimate function. You want as much planning as possible to be of the first type:
- It's much more supervisable. An explicitly-represented world state is more interrogable than the inner workings of a future-reward-estimate.
- It's less susceptible to value-leaking. By this I mean issues