All of Multicore's Comments + Replies

My guess is that early stopping is going to tend to stop so early as to be useless.

For example, imagine the agent is playing Mario and its proxy objective is "+1 point for every unit Mario goes right, -1 point for every unit Mario goes left". 

(Mario screenshot that I can't directly embed in a comment)

If I understand correctly, to avoid Goodharting it has to consider every possible reward function that is improved by the first few bits of optimization pressure on the proxy objective.

This probably includes things like "+1 point if Mario falls in a pit"...

2 Oliver Sourbut 2mo
I think you're (at least partly) right in spirit. The key extra nuance is that by constraining the 'angle' between the reward functions[1], you can rule out very opposed utilities like the one which rewards falling in a pit. So this is not true. In particular, I think you're imagining gradients in policy space (indeed a practical consideration). But this paper is considering gradients in occupancy space (which in practice is baking in some assumptions about foresight etc.).
[1] How? Yes, this is a pretty big question (there are some theoretical and empirical ideas but I don't rate any of them yet, personally).

With such a vague and broad definition of power fantasy, I decided to brainstorm a list of ways games can fail to be a power fantasy.

  1. Mastery feels unachievable.
    1. It seems like too much effort. Cliff-shaped learning curves, thousand-hour grinds, old PvP games where every player still around will stomp a noob like you flat.
    2. The game feels unfair. Excessive RNG, "Fake Difficulty" or "pay to win".
  2. The power feels unreal, success too cheaply earned.
    1. The game blatantly cheats in your favor even when you didn't need it to.
    2. Poor game balance leading to hours of triviall...

I think ALWs are already more of a "realist" cause than a doomer cause. To doomers, they're a distraction - a superintelligence can kill you with or without them.

ALWs also seem to be held to an unrealistic standard compared to existing weapons. With present-day technology, they'll probably hit the wrong target more often than human-piloted drones. But will they hit the wrong target more often than landmines, cluster munitions, and over-the-horizon unguided artillery barrages, all of which are being used in Ukraine right now?

The Huggingface deep RL course came out last year. It includes theory sections, algorithm implementation exercises, and sections on various RL libraries that are out there. I went through it as it came out, and I found it helpful.

FYI all the links to images hosted on your blog are broken in the LW version.

Thanks; it should be fixed now.
Answer by Multicore Jun 08, 2023 4428

You are right that by default prediction markets do not generate money, and this can mean traders have little incentive to trade.

Sometimes this doesn't even matter. Sports betting is very popular even though it's usually negative sum.

Otherwise, trading could be stimulated by having someone who wants to know the answer to a question provide a subsidy to the market on that question, effectively paying traders to reveal their information. The subsidy can take the form of a bot that bets at suboptimal prices, or a cash prize for the best performing trader, or ...

Thanks for the excellent answer! At first blush, I'd respond with something like "but there's no way that's enough!" I think I see prediction markets as (potentially) providing a lot of useful information publicly, but needing a flow of money to compensate people for risk-aversion and the cost of research, and to overcome market friction. Of your answers:
* Negative-sum betting probably doesn't scale well, especially to more technical and less dramatic questions.
* Subsidies make sense, but could they run into a tragedy-of-the-commons scenario? For instance, if a group of businesses want to forecast something, they could pool their money to subsidize a prediction market. But there would be an incentive to defect by not contributing to the pool and getting the exact same information, since the prediction market is public, or even to commission a classical market research study that you keep proprietary.
* Hedging seems fine.
If that reasoning is correct, prediction markets are doomed to stay small. Is that a common concern (and on which markets can I wager on that? :P)
7 Max H 6mo
Yes. PredictIt used to attract a lot of "dumb money" - people who just wanted to bet on their favorite candidate (or against disfavored candidates). They also used to run weekly markets on polling averages and things like the number of times Trump would tweet, which tended to attract people who just wanted to do some skill-based gambling, whether they actually had the skill or not. PredictIt charges high transaction fees with no outside subsidies, so all of the markets were extremely negative-sum. Despite this, gamblers and [Candidate X] True Believers managed to provide ample subsidy to attract some more knowledgeable traders. (Based on the comments section of some of the popular markets, there were many people who lost thousands or tens of thousands. Probably some of the biggest losers were gambling addicts who destroyed their finances in the process. That is a pretty big negative externality of negative-sum markets where amateur participation is allowed.)
  • What are these AIs going to do that is immensely useful but not at all dangerous? A lot of useful capabilities that people want are adjacent to danger. Tool AIs Want to be Agent AIs.
  • If two of your AIs would be dangerous when combined, clearly you can't make them publicly available, or someone would combine them. If your publicly-available AI is dangerous if someone wraps it with a shell script, someone will create that shell script (see AutoGPT). If no one but a select few can use your AI, that limits its usefulness.
  • An AI ban that stops dangerous AI might...
Thanks for the pointer. I'll hopefully read the linked article in a couple of days. I start from a point of "no AI for anyone" and then ask "what can we safely allow". I made a couple of suggestions, where "safely" is understood to mean "safe when treated with great care". You are correct that this definition of "safe" is incompatible with unfettered AI development. But what approach to powerful AI isn't incompatible with unfettered AI development? Every AI capability we build can be combined with other capabilities, making the whole more powerful and therefore more dangerous. To keep things safe while still having AI, the answer may be: "an international agency holds most of the world's compute power so that all AI work is done by submitting experiment requests to the agency which vets them for safety". Indeed, I don't see how we can allow people to do AI development without oversight, at all. This centralization is bad but I don't see how it can be avoided. Military establishments would probably refuse to subject themselves to this restriction even if we get states to restrict the civilians. I hope I'm wrong on this and that international agreement can be reached and enforced to restrict AI development by national security organizations. Still, it's better to restrict the civilians (and try to convince the militaries to self-regulate) than to restrict nobody. Is it possible to reach and enforce a global political consensus of "no AI for anyone ever at all"? We may need thermonuclear war for that, and I'm not on board. I think "strictly-regulated AI development" is a relatively easier sell (though still terribly hard). I agree that such a restriction is a large economic handicap, but what else can we do? It seems that the alternative is praying that someone comes up with an effectively costless and safe approach so that nobody gives up anything. Are we getting there in your opinion?
* immensely useful things these AIs can do:
  * drive basic science and technology forward at an accelerated pace
  * devise excellent macroeconomic, geopolitical and public health policy
* these things are indeed risk-adjacent, I grant.

When people calculate utility they often use exponential discounting over time. If, for example, your discount factor is 0.99 per year, it means that getting something in one year is only 99% as good as getting it now, getting it in two years is only 99% as good as getting it in one year, etc. Getting it in 100 years would be discounted to 0.99^100 ≈ 36.6% of the value of getting it now.
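The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch; the function name `discount_weight` is made up for illustration and does not come from the comment.

```python
def discount_weight(factor_per_year: float, years: int) -> float:
    """Weight given to a payoff received `years` from now under
    exponential discounting with a constant yearly factor."""
    return factor_per_year ** years


# With a yearly factor of 0.99, compare a few horizons.
for years in (1, 2, 100):
    weight = discount_weight(0.99, years)
    print(f"{years:>3} years: {weight:.3f} of the present value")
```

Each extra year multiplies the weight by the same factor, so the ratio between year n and year n+1 is always 0.99.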

The sharp left turn is not some crazy theoretical construct that comes out of strange math. It is the logical and correct strategy of a wide variety of entities, and also we see it all the time.

I think you mean Treacherous Turn, not Sharp Left Turn.

Sharp Left Turn isn't a strategy, it's just an AI that's aligned in some training domains being capable but not aligned in new ones.

This post is tagged with some wiki-only tags. (If you click through to the tag page, you won't see a list of posts.) Usually it's not even possible to apply those. Is there an exception for when creating a post?

Looks like the New Post page doesn't check the wiki-only flag, which is a bug. Should be fixed soon.

Based on my incomplete understanding of transformers:

A transformer does its computation on the entire sequence of tokens at once, and ends up predicting the next token for each token in the sequence.

At each layer, the attention mechanism gives the stream for each token the ability to look at the previous layer's output for the other tokens before it in the sequence.

The stream for each token doesn't know if it's the last in the sequence (and thus that its next-token prediction is the "main" prediction), or anything about the tokens that come after it.

So each tok...
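To make the "can only look at earlier tokens" point concrete, here is a toy, single-head causal attention in plain Python. It is a sketch under heavy simplifying assumptions (scalar token values, no learned weights, no multiple layers), and `causal_self_attention` is an illustrative name, not any library's API. It only demonstrates that the output at each position is unchanged by whatever tokens come after it.

```python
import math


def causal_self_attention(xs):
    """Toy causal attention over a list of scalar 'token' values.

    Position i attends only to positions 0..i, so its output cannot
    depend on later tokens or on the total sequence length.
    """
    outputs = []
    for i, query in enumerate(xs):
        visible = xs[: i + 1]                      # tokens at or before position i
        scores = [query * key for key in visible]  # stand-in for query-key dot products
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        outputs.append(sum(w * v for w, v in zip(weights, visible)) / total)
    return outputs


short = causal_self_attention([0.5, -1.0, 2.0])
longer = causal_self_attention([0.5, -1.0, 2.0, 3.0, -0.7])

# The first three positions compute the same thing in both runs.
print(longer[:3] == short)
```

Because position 2 computes the same value in both runs, nothing in its stream can tell it whether it is the last token.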

In the blackmail scenario, FDT refuses to pay if the blackmailer is a perfect predictor and the FDT agent is perfectly certain of that, and perfectly certain that the stated rules of the game will be followed exactly. However, with stakes of $1M against $1K, FDT might pay if the blackmailer had a 0.1% chance of guessing the agent's action incorrectly, or if the agent was less than 99.9% confident that the blackmailer was a perfect predictor.

(If the agent is concerned that predictably giving in to blackmail by imperfect predictors makes it exploitable, it ...
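One way to see where the 0.1% figure comes from is a rough expected-loss comparison between the two policies. The model below is an assumption made for illustration, not something spelled out in the comment: the blackmailer only blackmails agents it predicts will pay, so a refusing agent is blackmailed only when the predictor errs (losing $1M), while a paying agent is blackmailed whenever the predictor is right (losing $1K). The function names are hypothetical.

```python
def expected_loss(policy: str, error_rate: float,
                  harm: float = 1_000_000, payment: float = 1_000) -> float:
    """Expected loss of committing to a policy, under the assumed model."""
    if policy == "refuse":
        # Blackmailed only by mistake, and then the harm is carried out.
        return error_rate * harm
    if policy == "pay":
        # Blackmailed whenever the predictor correctly foresees payment.
        return (1 - error_rate) * payment
    raise ValueError(f"unknown policy: {policy}")


def break_even_error_rate(harm: float = 1_000_000, payment: float = 1_000) -> float:
    """Error rate at which both policies have equal expected loss."""
    return payment / (harm + payment)


for rate in (0.0, 0.0005, 0.001, 0.01):
    print(rate, expected_loss("refuse", rate), expected_loss("pay", rate))
print("break-even:", break_even_error_rate())
```

Under these assumptions the break-even error rate is just under 0.1%, which matches the rough threshold in the comment; a different model of the blackmailer would move that number.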

I think that misuses of FDT happen because in certain cases FDT behaves like "magic" (i.e. pretty counterintuitively), "magic" violates "mundane rules", so it's possible to forget "mundane" things like "to make a decision you should set a probability distribution over relevant possibilities".

I'm not familiar with LeCun's ideas, but I don't think the idea of having an actor, critic, and world model is new in this paper. For a while, most RL algorithms have used an actor-critic architecture, including OpenAI's old favorite PPO. Model-based RL has been around for years as well, so probably plenty of projects have used an actor, critic, and world model.

Even though the core idea isn't novel, this paper getting good results might indicate that model-based RL is making more progress than expected, so if LeCun predicted that the future would look more like model-based RL, maybe he gets points for that.

Merge candidate with Philosophy of Language?

2 Yoav Ravid 1y
Agree. and I'd keep this title.

Things that probably actually fit into your interests:

A Sensible Introduction to Category Theory

Most of what 3blue1brown does

Videos that I found intellectually engaging but are far outside of the subjects that you listed:

Cursed Problems in Game Design

Luck and Skill in Games

Disney's FastPass: A Complicated History 

The Congress of Vienna

Building a 6502-based computer from scratch (playlist)


(I am also a jan Misali fan)

1 Martín Soto 1y
Neat, thanks so much for these recommendations! I do of course follow 3b1b, and I already know some Category. But I'll for sure check out all of the rest, which sound super cool!
1 Martin Randall 1y
1 Nathan Young 1y
I bought a load of yes, to try and drive the price up and incentivise people to sell their yes. Curious whether this was the right call.
lol, I filed the same market on manifold before scrolling down and seeing you already did.

This is a classical example where having a prediction market creates really bad incentives.

The preview-on-hover for those manifold links shows a 404 error. Not sure if this is Manifold's fault or LW's fault.

I would guess that the hover widget that LessWrong uses assumes that every link to Manifold actually links to a market, and produces an error for links to pages on the site that are not markets and thus shouldn't really get the popup.

One antifeature I see promoted a lot is "It doesn't track your data". And this seems like it actually manages to be the main selling point on its own for products like DuckDuckGo, Firefox, and PinePhone.

The major difference from the game and movie examples is that these products have fewer competitors, with few or none sharing this particular antifeature.

Antifeatures work as marketing if a product is unique or almost unique in its category for having a highly desired antifeature. If there are lots of other products with the same antifeature, the antifeatur...

On the first read I was annoyed at the post for criticizing futurists for being too certain in their predictions, while it also throws out and refuses to grade any prediction that expressed uncertainty, on the grounds that saying something "may" happen is unfalsifiable.

On reflection these two things seem mostly unrelated, and for the purpose of establishing a track record "may" predictions do seem strictly worse than either predicting confidently (which allows scoring % of predictions right), or predicting with a probability (which none of these futurists did, but allows creating a calibration curve).

Seems right, closing this.

Yes. The one I described is the one the paper calls FairBot. It also defines PrudentBot, which looks for a proof that the other player cooperates with PrudentBot and a proof that it defects against DefectBot. PrudentBot defects against CooperateBot.

The part about two Predictors playing against each other reminded me of Robust Cooperation in the Prisoner's Dilemma, where two agents with the algorithm "If I find a proof that the other player cooperates with me, cooperate, otherwise defect" are able to mutually prove cooperation and cooperate.

If we use that framework, Marion plays "If I find a proof that the Predictor fills both boxes, two-box, else one-box" and the Predictor plays "If I find a proof that Marion one-boxes, fill both, else only fill box A". I don't understand the math very well, but I th...

This would cooperate with CooperateBot (algorithm that unconditionally says "Cooperate").
Yeah, after the first two conditionals return as non-halting, Marion effectively abandons trying to further predict the predictor. After iterating the non-halting stack, Marion will conclude that she's better served by giving in to the partial blackmail and taking the million dollars than she is by trying to game the last $1000 out of the predictor, based on the fact that her ideal state is gated behind an infinitely recursed function.

I think in a lot of people's models, "10% chance of alignment by default" means "if you make a bunch of AIs, 10% chance that all of them are aligned, 90% chance that none of them are aligned", not "if you make a bunch of AIs, 10% of them will be aligned and 90% of them won't be".

And the 10% estimate just represents our ignorance about the true nature of reality; it's already true either that alignment happens by default or that it doesn't, we just don't know yet.

I generally disagree with the idea that fancy widgets and more processes are the main thing keeping the LW wiki from being good. I think the main problem is that not a lot of people are currently contributing to it. 

The things that discourage me from contributing more look like:

-There are a lot of pages. If there are 700 bad pages and I write one really good page, there are still 699 bad pages.

-I don't have a good sense of which pages are most important. If I put a bunch of effort into a particular page, is that one that people are going to care about...

is one of the first results for "yudkowsky harris" on YouTube. Is there supposed to be more than this?

Yes. Here's how I imagine some people will respond to getting a link to this video. "Oh, it's some weird YouTube video with capitalized words in the title and 265 views." And the channel is called "Thinking Atheist" and has very few subscribers. It's way less likely to be taken seriously than the full audio on the official podcast. Also, it's hard to listen to YouTube videos when moving around because people can't (easily) download them. And people have to keep their screen on the whole time (and not use their phone for any other purpose) unless they have some premium YouTube subscription.

You should distinguish between “reward signal” as in the information that the outer optimization process uses to update the weights of the AI, and “reward signal” as in observations that the AI gets from the environment that an inner optimizer within the AI might pay attention to and care about.

From evolution’s perspective, your pain, pleasure, and other qualia are the second type of reward, while your inclusive genetic fitness is the first type. You can’t see your inclusive genetic fitness directly, though your observations of the environment can let you ...

This has overtaken the post it's responding to as the top-karma post of all time.

Yes, it's never an equilibrium state for Eliezer communicating key points about AI to be the highest karma post on LessWrong. There's too much free energy to be eaten by a thoughtful critique of his position. On LW 1.0 it was Holden's Thoughts on the Singularity Institute, and now on LW 2.0 it's Paul's list of agreements and disagreements with Eliezer.

Finally, nature is healing.

I'm impressed by the number of different training regimes stacked on top of each other.

-Train a model that detects whether a Minecraft video on YouTube is free of external artifacts like face cams.

-Then feed the good videos to a model that's been trained using data from contractors to guess what key is being pressed each frame.

-Then use the videos and input data to train a model that, in any game situation, does whatever inputs it guesses a human would be most likely to do, in an undirected shortsighted way.

-And then fine-tune that model on a specific subset of videos that feature the early game.

-And only then use some mostly-standard RL training to get good at some task.

1 Maxwell Clarke 1y
It's impressive. So far we see capabilities like this only in domains with loads of data. The models seem to be able to do anything if scaled, but the data dictates the domains where this is possible. It really doesn't seem that far away until there's pre-trained foundation models for most modalities... Google's "Pathways" project is definitely doing it as we speak IMO.

While the engineer learned one lesson, the PM will learn a different lesson when a bunch of the bombs start installing operating system updates during the mission, or won't work with the new wi-fi system, or something: the folly of trying to align an agent by applying a new special case patch whenever something goes wrong.

No matter how many patches you apply, the safety-optimizing agent keeps going for the nearest unblocked strategy, and if you keep applying patches eventually you get to a point where its solution is too complicated for you to understand how it could go wrong.

Meta: This is now the top-voted LessWrong post of all time.

4 Adam Zerner 1y
True, but it's 8th if you adjust for inflation.

Robust Agents seems sort of similar but not quite right.

Looking at the generation code, aptitude had interesting effects on our predecessors' choice of cheats.


-Higher aptitude Hikkikomori and Otaku are less likely to take Hypercompetent Dark Side (which has lower benefits for higher aptitude characters).


-Higher aptitude characters across the board are less likely to take Monstrous Regeneration or Anomalous Agility, which were some of the better choices available.


-Higher aptitude Hikkikomori are more likely to take Mind Palace.

Somewhat. The profile pic changes based on the character's emotions, or their reaction to a situation. Sometimes there's a reply where the text is blank and the only content is the character's reaction as conveyed by the profile pic.

That said, it's a minor enough element that you wouldn't lose too much if it wasn't there.

On the other hand, it is important for you to know which character each reply is associated with, as trying to figure out who's talking from the text alone could get confusing in many scenes. So any format change should at least preserve the names.

6 Stephen Bennett 2y
Would a play format work? e.g. "Iarwain (languidly): I don't know that I am much interested in your family life, Lintamande" Someone would have to translate the various profile pictures into emotions, but I would expect there to only be a handful of them per character.
2 Yoav Ravid 2y
So perhaps the character names could be kept, maybe the pictures too (though I don't know if there's a good way to include so many pictures in an epub format, it's not exactly built for having small images beside the text, as far as I know), but not the authors' names. Cause though it's nice knowing who wrote what, it's also confusing because everyone uses special usernames and I don't yet know which name is the author's name and which is the character's.

If everyone ends up with the same vote distribution, I think it removes the incentive for colluding beforehand, but it also means the vote is no longer meaningfully quadratic. The rank ordering of the candidates will be in order of how many total points were spent on them, and you basically end up with score voting.

edit: I assume that the automatic collusion mechanism is something like averaging the two ballots' allocations for each candidate, which does not change the number of points spent on each candidate. If instead some ballots end up causing more points to be spent on their preferred candidates than they initially had to work with, there are almost definitely opportunities for strategic voting and beforehand collusion.
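A small worked example may help show why identical ballots collapse quadratic voting into score voting. The sketch below assumes the usual quadratic rule (spending p points on a candidate buys sqrt(p) votes) and assumes the collusion mechanism averages every ballot's point allocation, in line with the guess in the edit above. All function names and the example ballots are made up for illustration.

```python
import math


def quadratic_tally(ballots):
    """Votes per candidate when p points buy sqrt(p) votes."""
    tally = {}
    for ballot in ballots:
        for candidate, points in ballot.items():
            tally[candidate] = tally.get(candidate, 0.0) + math.sqrt(points)
    return tally


def total_points(ballots):
    """Raw points spent per candidate (what score voting would count)."""
    totals = {}
    for ballot in ballots:
        for candidate, points in ballot.items():
            totals[candidate] = totals.get(candidate, 0) + points
    return totals


def average_ballots(ballots):
    """Assumed collusion mechanism: every voter gets the mean allocation."""
    n = len(ballots)
    mean = {c: sum(b[c] for b in ballots) / n for c in ballots[0]}
    return [dict(mean) for _ in ballots]


def ranking(scores):
    return sorted(scores, key=scores.get, reverse=True)


ballots = [
    {"A": 100, "B": 0, "C": 0},    # one voter who cares only about A
    {"A": 0, "B": 36, "C": 64},
    {"A": 0, "B": 36, "C": 64},
]

print(ranking(quadratic_tally(ballots)))                   # ordinary quadratic voting
print(ranking(total_points(ballots)))                      # ranking by points spent
print(ranking(quadratic_tally(average_ballots(ballots))))  # after averaging
```

In this example ordinary quadratic voting ranks B above A, because the square root dampens the single intense A supporter. After averaging, every ballot is identical, so each candidate's tally is a monotone function of total points and the ranking matches the points ranking instead.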

Or put a spoilered link to this post in the dath ilan tag's wiki text?

A type of forum roleplay / collaborative fiction writing started by Alicorn.

For further complication, what if you consider potential backers having different estimations of the value of the project?

That would raise the risk of backing-for-the-bonus projects that you don't like. Maybe you would back the project to punch cute puppies to 5% or 25%, but if it's at 75% you start to suspect that there are enough cute puppy haters out there to push it all the way if you get greedy for the bonus.

For good projects, you could have a source for the refund bonuses other than the platform or the project organizers - the most devoted fans. Allow backers to submit a pledge that, if the project is refunded, gets distributed to other backers rather than the person who submitted it.

3 Yoav Ravid 2y
Oh, and maybe instead of knowing from the start what the refund would be (10%, 20%, some fixed amount...), you would proportionally distribute these funds between all other backers, and you can even have a counter that shows how much money is in the refund pool. That would be a very interesting refund scheme to test. I'd be happy to see an experiment that simulates a more realistic scenario, with projects that have different goals, projects that have negative value, option for participants to "promote" projects, and also something like this refund scheme. Of course, an actual crowdfunding platform experimenting with this would be better, but this would still be a valuable experiment.
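The refund-pool idea in this thread can be sketched in a few lines. This is only an illustration: it assumes the pool is split in proportion to each regular backer's pledge (the thread says "proportionally" without pinning down the base), and the function name and example figures are hypothetical.

```python
def refund_with_bonus(regular_pledges, fan_bonus_pledges):
    """Payout per regular backer if the project fails to fund.

    Each regular backer gets their own pledge back plus a share of the
    fan-funded pool, proportional to the size of their pledge.
    Fans who funded the pool do not get their pool pledges back.
    """
    pool = sum(fan_bonus_pledges.values())
    total = sum(regular_pledges.values())
    if total == 0:
        return {}  # nobody to distribute the pool to
    return {
        backer: amount + pool * amount / total
        for backer, amount in regular_pledges.items()
    }


payouts = refund_with_bonus({"ann": 30, "bob": 10}, {"superfan": 20})
print(payouts)
```

A live counter of the pool, as suggested above, would just display `sum(fan_bonus_pledges.values())` as pledges come in.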

Agree it doesn't belong; I have downvoted it.

There is no tag that encompasses all of AI alignment and nothing else.

I think the reason you gave is basically correct - when I look at the 15 posts with the highest relevance score on the AI tag, about 12 of them are about alignment.

On the other hand, when a tag doesn't exist it may just be because no one ever felt like making it.

Merge candidate with startups?

"Transformer Circuits" seems like too specific of a tag - I doubt it applies to much beyond this one post. Probably should be broadened to encompass and related stuff.

"Circuits (AI)" to distinguish from normal electronic circuits?

This sounds a lot like the "Precisely Bound Demons and their Behavior" concept that Yudkowsky described but never wrote the story for.

Ra also features magic-as-engineering.

Chiming in later to say that I think the tag should stay, especially now that multiple people are doing them. Compare "Rationality Quotes" and "Open Threads" for other tags that could be accused of just being sequences.

Should this tag include stuff about print versions of HPMOR or Rationality: From AI to Zombies, or just the review collections from 2018 forward?

In my opinion, Rationality A-Z clearly yes. I mean, it is a book (an edited collection of articles), and it is related to LW. HPMoR, yes, because it is important and was discussed a lot on LW. To be precise, the discussions are a reason to include HPMoR, but the discussions themselves should not be tagged (that would flood the tag and make it less useful), only articles related to making a book version of it. I am less sure about Eliezer's other fiction (leaning towards no), or Unsong (leaning towards yes), or Inadequate Equilibria (undecided). If there turn out to be too many articles with this tag, the tag description should contain canonical links to the individual books.

Something similar came up in the post:

If it has some sensory dominion over the world, it can probably estimate a pretty high mainline probability of no humans booting up a competing superintelligence in the next day; to the extent that it lacks this surety, or that humans actually are going to boot a competing superintelligence soon, the probability of losing that way would dominate in its calculations over a small fraction of materially lost galaxies, and it would act sooner.

Though rereading it, it's not addressing your exact question.

Removed this from the page itself now that talk pages exist:

[pre-talk-page note] I think this should maybe be merged with Distillation and Pedagogy – Ray
