[AN #155]: A Minecraft benchmark for algorithms that learn without reward functions

[-][anonymous]3yΩ030

2. There’s lots of Minecraft videos on YouTube, so you could test a “GPT-3 for Minecraft” approach.

There's some chance that this was because I talked about it with OpenAI / MineRL people, but overall I think it's much more likely to be "different people independently came up with good projects".

Looking at the results I'm even more bullish than I already was about Minecraft as a good "alignment testbed".

[-]Pattern4y*20

Criterion (a) implies that we can’t have automated evaluation of agents (otherwise that could be turned into a reward function) and so suggests that we use human evaluation of agents as our ground truth.

1. On the basis that it's in Minecraft, if things were broken down into sub-tasks, some automated evaluation could be reasonable. For the final evaluation of the beauty of the waterfall, maybe not so much. It seems easy to ask questions that wouldn't show up in an evaluation like 'Is a waterfall more beautiful if there's a cave behind it?'

2. In a similar fashion to 1, maybe automated evaluation could be used for a while (especially on tasks evaluation seems like it would work on, like 'find water'), then changed.

3. There's another idea in here where agents are trained to perform subtasks, and instead of a waterfall agent there's a cliff-finding agent and a water-finding agent (stitched together by an always/regularly running road-making agent that makes trails). An actual shared working memory would be more useful, but I'm not clear on how that would be trained. (Setting up incentives in the environment to try to get communication sounds tricky.) But if they can see what the other agents do, even if they're not doing anything, then maybe the waterfall builder could remember where the water and the cliff are? This sounds complicated enough that it's probably not worth the effort though.

Human evaluators then compare two agents against each other and determine which is better. Agents are then given a score using the TrueSkill system.

I was surprised to see this because, as the wikipedia link says:

TrueSkill is patented,[3] and the name is trademarked,[4] so it is limited to Microsoft projects and commercial projects that obtain a license to use the algorithm.

The conjunction of these sounds hilarious - if it's not the same evaluators, then is it an objective measurement/score?

Though the whole point of this might be to take such things less seriously - once you've got agents that are building waterfalls and you're debating which ones should be considered the best at making beautiful waterfalls, you might be all set. Getting AI to the point where it is judged via subjective, arbitrary standards is maybe more life like. (The history of the value of art over time, and how much it drifts relative to 'when the artist is alive' comes to mind.)

1. In Atari or MuJoCo, there are often only a few reasonable goals: for example, in Pong, you either hit the ball back, or you die. If you’re testing algorithms that are meant to learn what the goal is, you want an environment where there could be many possible goals, as is the case in Minecraft.

This might be changeable if you're willing to take Pong and change it. Maybe remove the death mechanic, and come up with a goal. Like 'lose at pong, when both players are trying to lose at pong'.

Pong itself is fairly simple, and not super supportive. This might be difficult enough to require in-game feedback, but - Brick Breaker, where the goal is to break certain bricks, but not others. Maybe you don't die if the ball falls below the screen, but you do die if you break any of the bricks which are a certain color. Or maybe a symbol moves around (inside and outside of bricks - predefined spots regardless of whether there's a brick there or not). If you hit it with the ball you lose (the ultimate anti-power up).

The above is clearly a bad idea if you want learning through other means, because death communicates the goal. Remove that aspect, and it stops being so. Of course, taking Brick Breaker and setting a goal for the player that ignores the score (say, 'remove every other column' or 'leave rows of bricks with ANY amount of space in between') isn't super complicated, and it isn't a super open world.

Mostly, goals probably can't be super complicated relative to an end product - absent a wide variety of levels/random initial generation. Rules about intermediate states/time, could be more elaborate, like 'make a song, using the sounds of breaking bricks', though that's going to take some work to set up an interpretation method, and isn't a 'natural goal'.

Especially if you interpret 'a red brick is this note, time in between brick being broken is ignored, a 'glass brick' being broken is a rest note/the sound of chimes'. (Rest notes for 'the ball passed through a space where a brick used to be/would be if it hadn't been broken' are way too elaborate, and would need a lot of level design and redesign to make it workable.)

2. There’s lots of Minecraft videos on YouTube, so you could test a “GPT-3 for Minecraft” approach.

It's not immediately clear what is meant by this statement.

3. The “true reward function” in Atari or MuJoCo is often not a great evaluation: for example, a Hopper policy trained to stand still using a constant reward gets 1000 reward!

There's also the 'glitch to get huge reward' issue. Or is that an issue?

Human evaluations should not be subject to the same problem.

Because they'll change over time? People will say 'that's better than not dying, it's doing fine, it's just getting started'. And then later 'this is boring. Move. Do something!'

4. Since the tasks were chosen to be inherently fuzzy and challenging to formalize, researchers are allowed to take whatever approach they want to solving the task, including “try to write down a reward function”. In contrast, for something like Atari or MuJoCo, you need to ban such strategies. The only restriction is that researchers cannot extract additional state information from the Minecraft simulator.

I look forward to this AI research milestone: researchers game the system. (Seriously though, training an agent/NN/whatever to predict state based on observations, then like, get rid of the state reveal....)

6. Since Minecraft is so popular, it is easy to hire Minecraft experts, allowing us to design algorithms that rely on expert time instead of just end user time.

You could also try amateur time.

7. Unlike Atari or MuJoCo, BASALT has a clear path to scaling up: the tasks can be made more and more challenging. In the long run, we could aim to deploy agents on public multiplayer Minecraft servers that follow instructions or assist with whatever large-scale project players are working on, all while adhering to the norms and customs of that server.

Elaborating on 'amateur time' more - you could set them loose on a server, untrained. Users might enjoy, who knows, just randomly killing them (or ones with terrible/ugly waterfalls).

The 'read more' at the end of the summary for that paper was really great design. (As compared to the fact that frequently submitting an in-progress comment requires hitting save, then scrolling up, then clicking edit, then scrolling down...every. single. time.)

For example, later in this newsletter we’ll see that synthetically generated demos are not a good proxy for human demos

I'm not sure what demos means in that sentence, but if it means benchmarks...

Create a beautiful picture (with additional constraints) just seems like it can be generated.

House. Cavern. Ravine. Giant blocky Minecraft art of [generated part of prompt]. A good Minecraft cart system. A circuit diagram looking down. A circuit diagram as seen on the side of a wall.

A redstone circuit for [X], along with pictures showing someone how to build it that works. (This one seems way too ambitious - today!)

In theory, building a waterfall might be just...building trenches that extend it. (Figuring out how to build buckets to move water around seems like a bigger challenge - though one there's a lot of "expert" demonstrations of in videos, say on YouTube, or wherever.)

I’d encourage readers to participate in the competition, because I think it’s especially good as a way to get started with ML research. It’s a new benchmark, so there’s a lot of low-hanging fruit in applying existing ideas to the benchmark, and in identifying new problems not present in previous benchmarks and designing solutions to them. It’s also pretty easy to get started: the BC baseline is fairly straightforward and takes a couple of hours to be trained on a single GPU. (That’s partly because BC doesn’t require environment samples; something like GAIL (AN #17) would probably take a day or two to train instead.)

Good to see that tip there. (Partially for future reasons - I haven't seen a lot of stuff on how AI research has changed over time, and it seems obviously easier to figure out for the current time.*)

*Obviously someone doing that intentionally would be able to do more, like save that information together with copies of 'standard' versions for a given point in time, along with time estimates before and after, and other information. (Like cost of electricity. Longer term: changes in hardware and effects.)

Hypermind forecasting contest on AI (summarized by Rohin): Hypermind is running a forecasting contest on the evolution of artificial intelligence with a $30,000 prize over four years. The questions ask both about the growth of compute and about performance on specific benchmarks such as the MATH suite (AN #144).

If projects like that leave/make good records, then that might be less of an issue quantitatively. What qualitative factors are important in AI or to users/trainers/etc., and how those might be recorded, is less clear (especially to 'an outsider').

[-]Rohin Shah4y20

On the basis that it's in Minecraft, if things were broken down into sub-tasks, some automated evaluation could be reasonable.

I agree this is possible in Minecraft. The point I was trying to make is that we were trying to make environments where you can't have automated evaluation, because as soon as you have automated evaluation, you can make an automated reward function -- and the whole point was to create tasks where you can't make an automated reward function.

(Technically, since we ban participants from using internal state, we could have tried to create tasks with automated evaluation based on internal state. But when we thought through this we didn't think of tasks we liked as much.)

TrueSkill is patented

Huh, I hadn't noticed that, will have to look into it (though we are sponsored by Microsoft).

The conjunction of these sounds hilarious - if it's not the same evaluators, then is it an objective measurement/score?

It is not replicable, i.e. you cannot run the same evaluation process and get the same number out. It should be reproducible, in that if you rerun the evaluation process you should get similar results. (The paper has some discussion of this.)
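To make the pairwise-comparison setup concrete, here is a minimal sketch of turning "which agent is better?" judgments into per-agent scores. It uses a simple Elo-style update as a stand-in, not TrueSkill itself (TrueSkill additionally tracks an uncertainty for each agent), and the agent names and judgments are made up for illustration.

```python
def expected_score(r_a, r_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def rate(agents, comparisons, k=32.0, start=1000.0):
    """Turn (winner, loser) judgments into one rating per agent."""
    ratings = {name: start for name in agents}
    for winner, loser in comparisons:
        p_win = expected_score(ratings[winner], ratings[loser])
        # The less expected the outcome, the larger the update.
        ratings[winner] += k * (1.0 - p_win)
        ratings[loser] -= k * (1.0 - p_win)
    return ratings


# Hypothetical judgments: each tuple is (preferred agent, other agent).
judgments = [
    ("agent_a", "agent_b"),
    ("agent_a", "agent_c"),
    ("agent_b", "agent_c"),
    ("agent_a", "agent_b"),
]
ratings = rate(["agent_a", "agent_b", "agent_c"], judgments)
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

With a different set of judges, or even the same judgments in a different order, the exact numbers from a scheme like this would shift, while the ranking should come out similar if the judgments are reasonably consistent.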

Getting AI to the point where it is judged via subjective, arbitrary standards is maybe more life like.

Yeah, I agree with this.

This might be changeable if you're willing to take Pong and change it.

I agree things like this could be done, but usually I'm not very excited about the result, because it will still feel unrealistic to me on some axis that I think matters. (Though I expect you could easily improve over Atari / MuJoCo, which really are not good benchmarks for LfHF approaches.)

It's not immediately clear what is meant by this statement.

From the blog post that's being summarized:

A sketch of such an approach would be:

1. Create a dataset of YouTube videos paired with their automatically generated captions, and train a model that predicts the next video frame from previous video frames and captions.

2. Train a policy that takes actions which lead to observations predicted by the generative model (effectively learning to imitate human behavior, conditioned on previous video frames and the caption).

3. Design a “caption prompt” for each BASALT task that induces the policy to solve that task.
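The three steps above can be sketched in a toy setting. Everything here is an illustrative assumption rather than anything from the blog post: observations are positions on a line, a "video" is a list of positions with a caption, the "generative model" just counts the most common frame-to-frame change per caption, and the policy is derived by searching over actions with known toy dynamics instead of being trained.

```python
from collections import Counter, defaultdict

ACTIONS = (-1, 0, 1)  # toy action set: step left, stay, step right


def step(obs, action):
    """Toy environment dynamics: the observation is just a position."""
    return obs + action


def train_frame_model(videos):
    """Step 1: learn, per caption, the most common change between frames."""
    deltas = defaultdict(Counter)
    for caption, frames in videos:
        for prev, nxt in zip(frames, frames[1:]):
            deltas[caption][nxt - prev] += 1

    def predict_next(caption, frame):
        if caption not in deltas:
            return frame  # unseen caption: predict no change
        return frame + deltas[caption].most_common(1)[0][0]

    return predict_next


def make_policy(predict_next):
    """Step 2: choose the action whose outcome is closest to the predicted frame."""
    def policy(caption, obs):
        target = predict_next(caption, obs)
        return min(ACTIONS, key=lambda a: abs(step(obs, a) - target))

    return policy


# Hypothetical captioned "videos" (no actions recorded, only frames).
videos = [
    ("walk right", [0, 1, 2, 3]),
    ("walk left", [5, 4, 3]),
]
policy = make_policy(train_frame_model(videos))

# Step 3: the caption prompt selects which behaviour the policy imitates.
obs = 0
for _ in range(3):
    obs = step(obs, policy("walk right", obs))
```

In this toy, prompting with "walk right" moves the agent right and "walk left" moves it left; the real proposal would replace each piece with a learned model over video frames.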

There's also the 'glitch to get huge reward' issue. Or is that an issue?

That would be an issue if LfHF approaches tended to discover the glitch. However, LfHF approaches usually don't do this, since humans don't give feedback that pushes the agent towards the glitch.

Because they'll change over time? People will say 'that's better than not dying, it's doing fine, it's just getting started'. And then later 'this is boring. Move. Do something!'

We're just talking about evaluation here (i.e. human judgments of the final trained agent's performance). If you ask a human to judge whether the robot is moving quickly to the right, and then they see the "standing still" policy, they are going to assign that policy a very low score.
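As a rough worked example of the gap between the two kinds of evaluation: the sketch below assumes a Hopper-like reward made of a per-step "alive" bonus of 1 plus forward velocity, and a 1000-step episode. Those numbers and both function names are assumptions chosen to match the 1000 figure quoted earlier, and `human_score` is only a crude stand-in for a person asked whether the robot is moving quickly to the right.

```python
ALIVE_BONUS = 1.0  # assumed per-step bonus for not falling over
HORIZON = 1000     # assumed episode length in steps


def env_return(velocities, forward_weight=1.0):
    """Toy Hopper-like return: alive bonus plus reward for forward speed."""
    return sum(ALIVE_BONUS + forward_weight * v for v in velocities)


def human_score(velocities):
    """Stand-in for a human judging 'is it moving quickly to the right?'"""
    return sum(velocities) / len(velocities)


standing_still = [0.0] * HORIZON  # never moves, never falls
hopping = [1.0] * HORIZON         # hypothetical agent that actually hops forward

still_env, still_human = env_return(standing_still), human_score(standing_still)
hop_env, hop_human = env_return(hopping), human_score(hopping)
```

Under these assumptions the standing-still policy collects a return of 1000 from the environment while scoring zero on the question the human was actually asked.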

You could also try amateur time.

Yeah, that would also be interesting. We didn't mention it because that's more commonly done in the existing literature. (Assuming that by "amateur time" you mean "feedback from humans who are amateurs at the game".)

As compared to the fact that frequently submitting an in-progress comment requires hitting save, then scrolling up, then clicking edit, then scrolling down...every. single. time.

Protip: when you find yourself doing this, consider opening a duplicate tab -- one for reading and one for writing the comment.

I'm not sure what demos means in that sentence, but if it means benchmarks...

It meant demonstrations of the task, sorry (I probably should have used the full word "demonstrations"). So, for example, this could be a video of a human creating a waterfall (along with a record of what keys they pressed to do it).

[-]Pattern4y20

and the whole point was to create tasks where you can't make an automated reward function.

I was gesturing towards it being partially automatable - the whole can't be. (Specifically 'find water' - unless that can be crafted now, or mods are in play. ('Find ice' might also work though.)) Handcrafted: move, don't hold still.

Protip: when you find yourself doing this, consider opening a duplicate tab -- one for reading and one for writing the comment.

I was referring to the 'editing a comment' process. Though thanks for the tip, I do use that a lot.

Pong

Yeah. I mostly found it interesting to think about because it seems like a simpler environment (and might be easier to train), but the results would be a lot less interesting. And at the limit of modification, Edited Pong becomes like Minecraft 2.

Arguably Minecraft is the sort of game that embedding a mini game (say via a primitive redstone controller going to some sort of arcade machine mockup) could kind of work within. (Agents might be uninterested in such a machine if it doesn't affect their environment at all - and simple reward distribution setups could just be found by disassembling it.)
