My guess is that early stopping is going to tend to stop so early as to be useless.
For example, imagine the agent is playing Mario and its proxy objective is "+1 point for every unit Mario goes right, -1 point for every unit Mario goes left".
(Mario screenshot that I can't directly embed in a comment)
If I understand correctly, to avoid Goodharting it has to consider every possible reward function that is improved by the first few bits of optimization pressure on the proxy objective.
This probably includes things like "+1 point if Mario falls in a pit"....
With such a vague and broad definition of power fantasy, I decided to brainstorm a list of ways games can fail to be a power fantasy.
I think ALWs (autonomous lethal weapons) are already more of a "realist" cause than a doomer cause. To doomers, they're a distraction - a superintelligence can kill you with or without them.
ALWs also seem to be held to an unrealistic standard compared to existing weapons. With present-day technology, they'll probably hit the wrong target more often than human-piloted drones. But will they hit the wrong target more often than landmines, cluster munitions, and over-the-horizon unguided artillery barrages, all of which are being used in Ukraine right now?
The Huggingface deep RL course came out last year. It includes theory sections, algorithm implementation exercises, and sections on various RL libraries that are out there. I went through it as it came out, and I found it helpful. https://huggingface.co/learn/deep-rl-course/unit0/introduction
You are right that by default prediction markets do not generate money, and this can mean traders have little incentive to trade.
Sometimes this doesn't even matter. Sports betting is very popular even though it's usually negative sum.
Otherwise, trading could be stimulated by having someone who wants to know the answer to a question provide a subsidy to the market on that question, effectively paying traders to reveal their information. The subsidy can take the form of a bot that bets at suboptimal prices, or a cash prize for the best performing trader, or ...
When people calculate utility they often use exponential discounting over time. If, for example, your discount factor is 0.99 per year, then getting something in one year is only 99% as good as getting it now, getting it in two years is only 99% as good as getting it in one year, and so on. Getting it in 100 years would be discounted to 0.99^100 ≈ 37% of the value of getting it now.
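For concreteness, that calculation looks like this (a trivial sketch; the 0.99 factor and the 100-year horizon are just the numbers from the example above):

```python
discount_factor = 0.99  # per year

def discounted_value(value_now, years, gamma=discount_factor):
    """Present value of receiving something `years` from now."""
    return value_now * gamma ** years

print(discounted_value(1.0, 1))    # 0.99
print(discounted_value(1.0, 100))  # ~0.366, i.e. about 37% of the value of getting it now
```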
> The sharp left turn is not some crazy theoretical construct that comes out of strange math. It is the logical and correct strategy of a wide variety of entities, and also we see it all the time.
I think you mean Treacherous Turn, not Sharp Left Turn.
Sharp Left Turn isn't a strategy; it's just an AI that's aligned in some training domains staying capable, but not aligned, in new ones.
This post is tagged with some wiki-only tags. (If you click through to the tag page, you won't see a list of posts.) Usually it's not even possible to apply those tags. Is there an exception when creating a post?
See https://www.lesswrong.com/posts/8gqrbnW758qjHFTrH/security-mindset-and-ordinary-paranoia and https://www.lesswrong.com/posts/cpdsMuAHSWhWnKdog/security-mindset-and-the-logistic-success-curve for Yudkowsky's longform explanation of the metaphor.
Based on my incomplete understanding of transformers:
A transformer does its computation on the entire sequence of tokens at once, and ends up predicting the next token for each token in the sequence.
At each layer, the attention mechanism gives the stream for each token the ability to look at the previous layer's output for the tokens before it in the sequence.
The stream for each token doesn't know if it's the last in the sequence (and thus that its next-token prediction is the "main" prediction), or anything about the tokens that come after it.
So each tok...
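A minimal numpy sketch of the causal masking I'm describing (my own illustration, not any particular library's implementation; single head, no batching, and W_q/W_k/W_v are whatever learned projection matrices you like):

```python
import numpy as np

def causal_self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model). Each position attends only to itself and earlier positions."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask out everything to the right: token i never sees tokens j > i,
    # so no stream gets information about the tokens after it.
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # one output vector per position, passed to the next layer
```

At sampling time only the last position's next-token prediction gets used, but during training every position's prediction is scored, which is why the mask matters.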
In the blackmail scenario, FDT refuses to pay if the blackmailer is a perfect predictor and the FDT agent is perfectly certain of that, and perfectly certain that the stated rules of the game will be followed exactly. However, with stakes of $1M against $1K, FDT might pay if the blackmailer had a 0.1% chance of guessing the agent's action incorrectly, or if the agent was less than 99.9% confident that the blackmailer was a perfect predictor.
(If the agent is concerned that predictably giving in to blackmail by imperfect predictors makes it exploitable, it ...
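A back-of-the-envelope version of that break-even point, under the toy assumptions that the blackmailer only carries out blackmail when it predicts the agent will pay up, and that the agent cares only about expected dollars:

```python
threat, demand = 1_000_000, 1_000  # $1M carried-out threat vs $1K payment

def expected_loss(agent_pays, error_rate):
    # The blackmailer blackmails iff it predicts the agent will pay.
    if agent_pays:
        return (1 - error_rate) * demand  # predicted correctly -> blackmailed, pays $1K
    return error_rate * threat            # mispredicted -> blackmailed anyway, refuses, eats $1M

for eps in (0.0, 0.0005, 0.001, 0.002):
    print(eps, expected_loss(True, eps), expected_loss(False, eps))
# Refusing only wins when the error rate is below roughly demand/threat = 0.1%.
```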
I'm not familiar with LeCun's ideas, but I don't think the idea of having an actor, critic, and world model is new in this paper. For a while, most RL algorithms have used an actor-critic architecture, including OpenAI's old favorite PPO. Model-based RL has been around for years as well, so probably plenty of projects have used an actor, critic, and world model.
Even though the core idea isn't novel, this paper getting good results might indicate that model-based RL is making more progress than expected, so if LeCun predicted that the future would look more like model-based RL, maybe he gets points for that.
This tag was originally prompted by this exchange: https://www.lesswrong.com/posts/qCc7tm29Guhz6mtf7/the-lesswrong-2021-review-intellectual-circle-expansion?commentId=CafTJyGL5cjrgSExF
Things that probably actually fit into your interests:
-A Sensible Introduction to Category Theory
-Most of what 3blue1brown does
Videos that I found intellectually engaging but are far outside of the subjects that you listed:
-Cursed Problems in Game Design
-Disney's FastPass: A Complicated History
-Building a 6502-based computer from scratch (playlist)
(I am also a jan Misali fan)
The preview-on-hover for those manifold links shows a 404 error. Not sure if this is Manifold's fault or LW's fault.
One antifeature I see promoted a lot is "It doesn't track your data". And this seems like it actually manages to be the main selling point on its own for products like DuckDuckGo, Firefox, and PinePhone.
The major difference from the game and movie examples is that these products have fewer competitors, with few or none sharing this particular antifeature.
Antifeatures work as marketing if a product is unique or almost unique in its category for having a highly desired antifeature. If there are lots of other products with the same antifeature, the antifeatur...
On the first read I was annoyed at the post for criticizing futurists for being too certain in their predictions while also throwing out, and refusing to grade, any prediction that expressed uncertainty, on the grounds that saying something "may" happen is unfalsifiable.
On reflection these two things seem mostly unrelated, and for the purpose of establishing a track record "may" predictions do seem strictly worse than either predicting confidently (which allows scoring % of predictions right), or predicting with a probability (which none of these futurists did, but allows creating a calibration curve).
Yes. The one I described is the one the paper calls FairBot. It also defines PrudentBot, which looks for a proof that the other player cooperates with PrudentBot and a proof that it defects against DefectBot. PrudentBot defects against CooperateBot.
The part about two Predictors playing against each other reminded me of Robust Cooperation in the Prisoner's Dilemma, where two agents with the algorithm "If I find a proof that the other player cooperates with me, cooperate, otherwise defect" are able to mutually prove cooperation and cooperate.
If we use that framework, Marion plays "If I find a proof that the Predictor fills both boxes, two-box, else one-box" and the Predictor plays "If I find a proof that Marion one-boxes, fill both, else only fill box A". I don't understand the math very well, but I th...
I think in a lot of people's models, "10% chance of alignment by default" means "if you make a bunch of AIs, 10% chance that all of them are aligned, 90% chance that none of them are aligned", not "if you make a bunch of AIs, 10% of them will be aligned and 90% of them won't be".
And the 10% estimate just represents our ignorance about the true nature of reality; it's already true either that alignment happens by default or that it doesn't, we just don't know yet.
I generally disagree with the idea that fancy widgets and more processes are the main thing keeping the LW wiki from being good. I think the main problem is that not a lot of people are currently contributing to it.
The things that discourage me from contributing more look like:
-There are a lot of pages. If there are 700 bad pages and I write one really good page, there are still 699 bad pages.
-I don't have a good sense of which pages are most important. If I put a bunch of effort into a particular page, is that one that people are going to care about...
is one of the first results for "yudkowsky harris" on Youtube. Is there supposed to be more than this?
You should distinguish between “reward signal” as in the information that the outer optimization process uses to update the weights of the AI, and “reward signal” as in observations that the AI gets from the environment that an inner optimizer within the AI might pay attention to and care about.
From evolution’s perspective, your pain, pleasure, and other qualia are the second type of reward, while your inclusive genetic fitness is the first type. You can’t see your inclusive genetic fitness directly, though your observations of the environment can let you ...
Yes, it's never an equilibrium state for Eliezer communicating key points about AI to be the highest karma post on LessWrong. There's too much free energy to be eaten by a thoughtful critique of his position. On LW 1.0 it was Holden's Thoughts on the Singularity Institute, and now on LW 2.0 it's Paul's list of agreements and disagreements with Eliezer.
Finally, nature is healing.
I'm impressed by the number of different training regimes stacked on top of each other.
-Train a model that detects whether a Minecraft video on YouTube is free of external artifacts like face cams.
-Then feed the good videos to a model that's been trained using data from contractors to guess which key is being pressed in each frame.
-Then use the videos and input data to train a model that, in any game situation, presses whatever inputs it guesses a human would be most likely to press, in an undirected, shortsighted way.
-And then fine-tune that model on a specific subset of videos that feature the early game.
-And only then use some mostly-standard RL training to get good at some task.
While the engineer learned one lesson, the PM will learn a different lesson when a bunch of the bombs start installing operating system updates during the mission, or won't work with the new wi-fi system, or something: the folly of trying to align an agent by applying a new special case patch whenever something goes wrong.
No matter how many patches you apply, the safety-optimizing agent keeps going for the nearest unblocked strategy, and if you keep applying patches eventually you get to a point where its solution is too complicated for you to understand how it could go wrong.
Looking at the generation code, I see that aptitude had interesting effects on our predecessors' choice of cheats.
Good:
-Higher aptitude Hikkikomori and Otaku are less likely to take Hypercompetent Dark Side (which has lower benefits for higher aptitude characters).
Bad:
-Higher aptitude characters across the board are less likely to take Monstrous Regeneration or Anomalous Agility, which were some of the better choices available.
Ugly:
-Higher aptitude Hikkikomori are more likely to take Mind Palace.
Somewhat. The profile pic changes based on the character's emotions, or their reaction to a situation. Sometimes there's a reply where the text is blank and the only content is the character's reaction as conveyed by the profile pic.
That said, it's a minor enough element that you wouldn't lose too much if it wasn't there.
On the other hand, it is important for you to know which character each reply is associated with, as trying to figure out who's talking from the text alone could get confusing in many scenes. So any format change should at least preserve the names.
If everyone ends up with the same vote distribution, I think it removes the incentive for colluding beforehand, but it also means the vote is no longer meaningfully quadratic. The rank ordering of the candidates will be in order of how many total points were spent on them, and you basically end up with score voting.
edit: I assume that the automatic collusion mechanism is something like averaging the two ballots' allocations for each candidate, which does not change the number of points spent on each candidate. If instead some ballots end up causing more points to be spent on their preferred candidates than they initially had to work with, there are almost definitely opportunities for strategic voting and beforehand collusion.
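A toy example of what I mean, assuming the collusion mechanism averages the points each ballot spends per candidate:

```python
import math

ballot1 = {"A": 64, "B": 36}  # points spent per candidate
ballot2 = {"A": 16, "B": 84}

# After colluding, both voters hold the averaged allocation.
averaged = {c: (ballot1[c] + ballot2[c]) / 2 for c in ballot1}  # {'A': 40.0, 'B': 60.0}

# The total points spent on each candidate is unchanged...
print(ballot1["A"] + ballot2["A"], 2 * averaged["A"])  # 80 vs 80.0

# ...but the quadratic vote counts (sqrt of points spent) are not preserved,
# so the outcome just tracks total points per candidate, like score voting.
print(math.sqrt(ballot1["A"]) + math.sqrt(ballot2["A"]),  # 12.0
      2 * math.sqrt(averaged["A"]))                       # ~12.6
```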
For a further complication, what if potential backers have different estimates of the project's value?
That would raise the risk of backing-for-the-bonus projects that you don't like. Maybe you would back the project to punch cute puppies to 5% or 25%, but if it's at 75% you start to suspect that there are enough cute puppy haters out there to push it all the way if you get greedy for the bonus.
For good projects, you could have a source for the refund bonuses other than the platform or the project organizers - the most devoted fans. Allow backers to submit a pledge that, if the project is refunded, gets distributed to other backers rather than the person who submitted it.
There is no tag that encompasses all of AI alignment and nothing else.
I think the reason you gave is basically correct - when I look at the 15 posts with the highest relevance score on the AI tag, about 12 of them are about alignment.
On the other hand, when a tag doesn't exist it may just be because no one ever felt like making it.
"Transformer Circuits" seems like too specific of a tag - I doubt it applies to much beyond this one post. Probably should be broadened to encompass https://www.lesswrong.com/posts/MG4ZjWQDrdpgeu8wG/zoom-in-an-introduction-to-circuits and related stuff.
"Circuits (AI)" to distinguish from normal electronic circuits?
This sounds a lot like the "Precisely Bound Demons and their Behavior" concept that Yudkowsky described but never wrote the story for.
Ra also features magic-as-engineering.
Should this tag include stuff about print versions of HPMOR or Rationality: From AI to Zombies, or just the review collections from 2018 forward?
Something similar came up in the post:
If it has some sensory dominion over the world, it can probably estimate a pretty high mainline probability of no humans booting up a competing superintelligence in the next day; to the extent that it lacks this surety, or that humans actually are going to boot a competing superintelligence soon, the probability of losing that way would dominate in its calculations over a small fraction of materially lost galaxies, and it would act sooner.
Though rereading it, it's not addressing your exact question.
Removed this from the page itself now that talk pages exist:
[pre-talk-page note] I think this should maybe be merged with Distillation and Pedagogy – Ray
Predict the winners at