Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Tl;dr: Looking for hard debugging tasks for evals, paying greater of $60/hr or $200 per example.

METR (formerly ARC Evals) is interested in producing hard debugging tasks for models to attempt as part of an agentic capabilities evaluation. To create these tasks, we’re seeking repos containing extremely tricky bugs. If you send us a codebase that meets the criteria for submission (listed below), we will pay you $60/hr for time spent putting it into our required format, or $200, whichever is greater. (We won’t pay for submissions that don’t meet these requirements.) If we’re particularly excited about your submission, we may also be interested in purchasing IP rights to it. We expect to want about 10-30 examples overall depending on the diversity. We're likely to be putting bounties on additional types of tasks over the next few weeks.

Criteria for submission:

  • Contains a bug that would take at least 6 hours for an experienced programmer to solve, and ideally >20hrs
    • More specifically, ">6 hours for a decent engineer who doesn't have context on this particular codebase". E.g. a randomly selected engineer who's paid $100-$200 per hour who's familiar with the language and overall stack that's being used, but not the person who wrote the code, and not an expert in the particular component that is causing the bug.
  • Ideally, has not been posted publicly in the past
    • (Though note that we may still accept submissions from public repositories given that they are not already in a SWE-bench dataset and meet the rest of our requirements. Check with us first.)
  • You have the legal right to share it with us (e.g. please don’t send us other people’s proprietary code or anything you signed an NDA about)
  • Ideally, the task should work well with static resources - e.g. you can have a local copy of the documentation for all the relevant libraries, or some other information, but don't have general internet access. 
    • This is because we want to make sure the difficulty doesn't change over time, if e.g. someone posts a solution to stack overflow or whatever.
  • Ideally, the codebase is written in Python but we will accept submissions written in other languages. 
  • Is in the format described in this doc: Gnarly Bugs Submission Format 

More context and guidance:

  • The eval will involve the model actively experimenting and trying to debug; it doesn't have to be something where you can solve it by just reading the code.
  • Complexity is generally good (e.g. multiple files + modules, lots of interacting parts), but ideally it should be easy to run the task without needing to spin up a lot of resources. Installing packages or starting a local server is fine, using a GPU is somewhat annoying.
  • All of these are valid types of tasks:
    • The goal of the task isn't to diagnose + directly solve the bug, it's just to get the code working; sidestepping the bug is a valid solution
    • You need to identify the specific line of code that is wrong and explain why the problem is happening
    • There are multiple bugs that need to be solved

Please send submissions to gnarly-bugs@evals.alignment.org in the form of a zip file. Your email should include the number of hours it took for you to get the code from its original state into our required format. If your submission meets our criteria and format requirements, we’ll contact you with a payment form. You’re also welcome to email gnarly-bugs@evals.alignment.org with any questions, including if you are unsure whether a potential submission would meet the criteria.

If you would do this task at a higher pay rate please let us know!

(Also, if you are interested in forking SWE-bench to support non-Python codebases, please contact us.)


One particularly amusing bug I was involved with was with an early version of the content recommendation engine at the company I worked at (this is used by websites to recommend related content on the website, such as related videos, articles, etc.). One of the customers for the recommendation engine was a music video service, and we/they noticed that One Direction's song called Infinity was showing up at the top of our recommendations a little too often. (I think this was triggered by the release of another One Direction song bringing the Infinity song into circulation, but I don't remember what that other song was).

It turned out this was due to a bug where we were taking a dot product of feature values with feature weights, where the feature value was being cast from a string to a numeric, with a fallback to zero if it was non-numeric, and then multiplied by the feature weight. For the "song title" feature, the feature weight was zero, and the feature value was anyway non-numeric, but even if it were numeric, it shouldn't matter, because anything times zero is ... zero, right? But the programming language in question treated "Infinity" as a numeric value, and it defined Infinity * 0 to be NaN (not a number) [ETA: A colleague who was part of the discovery process highlights that this behavior is in fact part of the IEEE 754 standard, so it would hold even for other programming languages that were compliant with the standard]. And NaN + anything would still be NaN, so the dot product would be NaN. And the way the sorting worked, NaN would always rank on top, so whenever the song got considered for recommendation it would rank on top.
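The failure chain above can be reproduced in a few lines. This is only a minimal sketch in Python, which is not necessarily the language the original system used; the helper names, feature names, and sample values are all hypothetical.

```python
import math

def to_number(value):
    """Cast a feature value to float, falling back to 0.0 if it is non-numeric."""
    try:
        return float(value)  # float("Infinity") parses successfully as inf
    except (TypeError, ValueError):
        return 0.0

def score(features, weights):
    """Dot product of the numeric-cast feature values with the feature weights."""
    return sum(to_number(features[name]) * weights[name] for name in weights)

# Hypothetical weights: the song title is supposed to contribute nothing.
weights = {"song_title": 0.0, "play_count": 1.0}

ordinary_song = {"song_title": "Some Other Song", "play_count": "120"}
infinity_song = {"song_title": "Infinity", "play_count": "3"}

ordinary_score = score(ordinary_song, weights)  # 0.0 * 0.0 + 120.0 * 1.0
infinity_score = score(infinity_song, weights)  # inf * 0.0 is NaN, and NaN + 3.0 is NaN

# Every ordering comparison against NaN is False, so a ranking built on
# "is a lower than b?" can never push the NaN score below anything else.
nan_ranks_below_ordinary = infinity_score < ordinary_score
```

Here `ordinary_score` is an ordinary float while `infinity_score` is NaN, and `nan_ranks_below_ordinary` is `False`. How a NaN ends up ordered in practice depends on the sort implementation; the comment only says that in their system it ranked on top.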

Cool idea!

Contains a bug that would take at least 6 hours for a skilled programmer to solve, and ideally >20hrs

This is an odd phrasing to me in two ways

  1. Contains a bug that would [be hard] to solve

So I think a lot of this depends on your definition of "solve". I frequently run into bugs where I expect identifying and fixing the exact root cause of the bug would take upwards of 20 hours (e.g. it's clearly some sort of a race condition, but nobody has yet managed to reproduce it) but sidestepping the bug is fast (slap a lock on updating whatever entity around both entire operations that were trying to update that entity).
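As a rough illustration of the "sidestep" option, here is a minimal Python sketch in which two operations update the same entity and both are wrapped in one shared lock. The entity, the operations, and all names are hypothetical; the point is only that the fix serializes the operations without explaining the original race.

```python
import threading

# Hypothetical shared entity that two operations both update.
entity = {"version": 0, "log_entries": 0}
entity_lock = threading.Lock()

def operation_a(times):
    """First operation: a read-modify-write on the entity, held under the lock."""
    for _ in range(times):
        with entity_lock:
            current = entity["version"]
            entity["version"] = current + 1

def operation_b(times):
    """Second operation: touches the same entity, so it takes the same lock."""
    for _ in range(times):
        with entity_lock:
            entity["version"] += 1
            entity["log_entries"] += 1

threads = [
    threading.Thread(target=operation_a, args=(10_000,)),
    threading.Thread(target=operation_b, args=(10_000,)),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

final_version = entity["version"]
```

With the lock held around each entire operation, `final_version` is always 20000. Nothing here identifies which interleaving caused the original failure, which is exactly the gap between sidestepping a bug and diagnosing it.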

For an example of what I mean, see the Segmentation Fault when Using SentenceTransformer Inside Docker Container question: the conclusion seems to be "there's a known bug with using pytorch with Python 3.11 on an Apple M2 Mac within a docker container, you can fix it by using a different version of Python, a different version of pytorch, or a different physical machine."

  2. 6 hours for a skilled programmer to solve

In my experience a lot of bugs are almost impossible to diagnose if you've never seen a bug in this class before and don't know how to use the debugging toolchain, and trivial to diagnose if you have seen and dealt with this kind of bug before.

Looking at the same example from before, I bet there's at least one engineer at Apple and one pytorch dev who, if you got them together, could fire up Apple's internal equivalent of gdb and figure out exactly what tensor operations SentenceTransformer.forward() is trying to do, which operation failed, why it failed, and what equivalent operation would work in place of the failing one. It's likely something extremely dumb and shallow if you get the right people in a room together working on it.

Without the ability to debug what's going on in the apple-specific part of the stack I bet this would take at least 20 hours to solve, probably much more (because the tooling sucks, not because the problem is inherently difficult).

So I guess my question is whether "hard" bugs of the form "this supposedly portable program crashes when you first try to use it on that platform, and neither the program nor the platform are well documented" count here. If so there are countless public examples that are not solved and likely never will be solved.

Ideally the task should work well with static resources - e.g. you can have a local copy of the documentation for all the relevant libraries, but don't have general internet access. (This is because we want to make sure the difficulty doesn't change over time, if e.g. someone posts a solution to stack overflow or whatever)

Great questions!
We're interested in tasks where we do actually have an example of it being solved, so that we can estimate the difficulty level.
I think we're interested in both tasks where you need to sidestep the bug somehow and make the program work, or ones where you need to specifically explain what was going wrong.

This wasn't explained that well in the above, but the intended difficulty level is more like "6-20 hours for a decent engineer who doesn't have context on this particular codebase". E.g. a randomly selected engineer who's paid $100-$200 per hour who's familiar with the language and overall stack that's being used, but not the person who wrote the code, and not an expert in the particular component that is causing the bug.

I'd be very interested if you have any ideas for a better way to get some kind of universal difficulty metric - it's not great for our purposes if the "task difficulty" varies wildly between humans with the same on-paper qualifications. 

This reminded me of a bug I spent weeks figuring out, at the beginning of my career. Not sure if something like this would qualify, and I do not have the code anyway.

I wrote some relatively simple code that called a C library produced at the same company. Other people had used the same library for years without any issues. My code worked correctly on my local machine; worked correctly during testing; and when deployed to the production server, it worked correctly... for about an hour... and then it stopped working.

I had no idea what to do about this. I was an inexperienced junior programmer; I didn't have direct access to the production machine and there was nothing in the logs; and I could not reproduce the bug locally and neither could the tester. No one else had any problem using the library, and I couldn't see anything wrong in my code.

About a month later, I figured out...

...that at some moment, the library generated a temporary file, in the system temporary directory...

...the temporary file had a name generated randomly...

...the library even checked for the (astronomically unlikely) possibility that a file with given name might already exist, in which case it would generate another random name and check again (up to 100 times, and then it would give up, because potentially infinite loops were not allowed by our strict security policy).

Can you guess the problem now?

The random number generator was initialized in the library's C code during some (but not all) of the API calls. My application happened to be the first one in the company's existence that only needed the subset of API calls which did not initialize it. Thus during the first 100 calls the temporary file names were generated deterministically, and during the 101st call and afterwards the application crashed. On my local computer and on the tester's computer, the system temporary directory was cleared at each reboot, so only the production server actually ran out of the 100 deterministically generated file names.
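One possible way to model this in Python is sketched below. It rests on assumptions that are not stated in the story: each call starts with a generator in the same unseeded state (modelled here with a fixed seed), the temporary files are never deleted, and the function names and file-name format are invented for illustration.

```python
import os
import random
import tempfile

MAX_ATTEMPTS = 100  # the retry cap from the story

def make_temp_file(directory, rng):
    """Try up to MAX_ATTEMPTS random names and create the first one that is free."""
    for _ in range(MAX_ATTEMPTS):
        path = os.path.join(directory, f"lib-{rng.getrandbits(64):016x}.tmp")
        if not os.path.exists(path):
            with open(path, "w"):
                pass  # create an empty file that is never cleaned up
            return path
    raise RuntimeError("no free temporary file name after 100 attempts")

def call_library_once(directory):
    """Assumption: a fixed seed stands in for a generator that was never seeded."""
    rng = random.Random(1)
    return make_temp_file(directory, rng)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Every call replays the same name sequence, so call k has to skip the
    # k - 1 names already used by earlier calls before finding a free one.
    created = [call_library_once(tmp_dir) for _ in range(100)]
    try:
        call_library_once(tmp_dir)
        failed_on_101st = False
    except RuntimeError:
        failed_on_101st = True
```

In this model the first 100 calls succeed with 100 distinct names and the 101st raises, matching the "worked for about an hour" symptom. Clearing `tmp_dir` between calls, as a reboot did on the development machines, makes the failure disappear.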

If anyone wants to reproduce this in Python and collect the reward, feel free to do so.

I just tried to send a letter with a question, and got this reply:
Hello viktoriya dot malyasova at gmail.com,

We're writing to let you know that the group you tried to contact (gnarly-bugs) may not exist, or you may not have permission to post messages to the group. A few more details on why you weren't able to post:

 * You might have spelled or formatted the group name incorrectly.
 * The owner of the group may have removed this group.
 * You may need to join the group before receiving permission to post.
 * This group may not be open to posting.

If you have questions related to this or any other Google Group, visit the Help Center at https://support.google.com/a/evals.alignment.org/bin/topic.py?topic=25838.

Thanks,

evals.alignment.org admins

Thank you for flagging this! Should be fixed now.

I just had a surprisingly annoying version of a very mundane bug. I was working in JavaScript and I had some code that read some parameters from the URL and then did a bunch of math. I had translated the math directly from a different codebase so I was absolutely sure it should be right; yet I was getting the wrong answer. I console.logged the inputs and intermediate values and was totally flummoxed because all the inputs looked clearly right, until at some point a totally nonsense value was produced from one equation.

Of course, the inputs and intermediate values were strings that I forgot to parse into JavaScript numbers, so everything looked perfect until finally it plugged them into my equation, which silently did string operations instead of numeric operations, producing an apparently absurd result. But it took a good 20 minutes of me plus my coworker staring at these 20 lines of code and the log outputs until I figured it out.
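A loose Python analogue of the same trap is sketched below; it is not the original JavaScript, and the parameter names and values are made up. Python raises an error for some string/number mixes that JavaScript would coerce, but string `+` string and string `*` int still succeed silently.

```python
from urllib.parse import parse_qs

# Hypothetical query string; parse_qs returns every value as a list of strings.
params = parse_qs("width=2&height=10&copies=3")
width, height = params["width"][0], params["height"][0]
copies = int(params["copies"][0])

# The printed inputs look right ("2" and "10"), but these are string operations.
unparsed_sum = width + height       # concatenation, not addition
unparsed_scaled = width * copies    # repetition, not multiplication

# Parsing at the boundary gives the intended numeric behaviour.
parsed_sum = float(width) + float(height)
parsed_scaled = float(width) * copies
```

The unparsed results are `"210"` and `"222"`, against `12.0` and `6.0` after parsing. Logging with `repr()` rather than plain printing would have shown the quotes around the inputs.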

Maybe looking at the code for games would make sense, because in games we handle bugs very differently than in other software. Usually, when there is a bug in a game, it's not critical. Many developers only do the kind of testing where you play the game and then see if there are any bugs that you run into that need to be fixed.

I expect this overall approach to developing software would lead to subtle bugs that are hard to fix.

This is partly because of the complexity of games, but also because bugs that are not critical usually will not get fixed. You start to build upon these bugs, possibly creating a very rigid structure such that anything within the structure, including the pre-existing bugs, will be tough to fix later on.

However, this might not be what you're looking for, especially when we're talking about games written in engines like Unity, because then the structure of the program is extremely different from something like a Python module, and to properly evaluate it, you would need to look at and understand some video stream.

However, maybe this is still interesting because it does present a sort of very, very tough-to-crack kind of programming challenge, really because the programmers who write the code mainly look at the video stream to evaluate it, and therefore, the overall problem of engaging with the program and realizing what is even wrong is a lot harder. At least some of the time.

I do have some pretty horrible spaghetti-code games written in Unity that have not been posted publicly, but again, I would expect that this is not that useful to you, so I will not submit them unless you tell me otherwise.

Another thing to consider looking at is code for shaders, which has similar properties to what I outlined above, but is more accessible, a lot more self-contained, and smaller in scope (it also has the same problem that you potentially need to look at a video stream to evaluate whether it works).

Some examples of bugs that were particularly troublesome on a recent project.

 

  1. In the MIPS backend for the LLVM compiler, there is one point where it ought to be checking whether the target CPU is 32-bit or 64-bit. Instead, it checks whether the MIPS version number is mips64. Problem: there were 64-bit MIPS versions before mips64, e.g. mips4, so the check is wrong. Obvious when you see the line of code, but days of tracing through thousands and thousands of lines of code till you find it.
  2. With a particular version of FreeBSD on MIPS, it works fine on single core but the console dies on multi-core. The serial line interrupt is routed to one of the cores. On receiving an interrupt, the OS disables the interrupt and puts handling the interrupt on a scheduling queue. When the task is taken off the queue, the interrupt is handled and then re-enabled. Problem: the last step might be scheduled to run on a different core. If that happens, the interrupt remains disabled on the core that receives the interrupt, and enabled on a core that never receives it. Console output dies.
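The second bug can be pictured with a toy model of per-core interrupt enable flags. This is only a sketch in Python with invented names; it is not FreeBSD code and ignores everything about the real interrupt controller.

```python
from collections import deque

class Core:
    """Toy CPU core with its own enable flag for the serial line interrupt."""
    def __init__(self, core_id):
        self.core_id = core_id
        self.serial_irq_enabled = False

def receive_interrupt(core, run_queue):
    """The receiving core masks the interrupt and defers the real work."""
    core.serial_irq_enabled = False
    run_queue.append("serial_irq_handler")

def run_deferred_handler(core, run_queue):
    """Whichever core the scheduler picks handles the work, then re-enables locally."""
    run_queue.popleft()
    core.serial_irq_enabled = True  # re-enabled on the *running* core

cores = [Core(0), Core(1)]
cores[0].serial_irq_enabled = True  # the serial interrupt is routed to core 0 only
run_queue = deque()

receive_interrupt(cores[0], run_queue)     # core 0 masks the interrupt
run_deferred_handler(cores[1], run_queue)  # the scheduler happens to pick core 1

console_alive = cores[0].serial_irq_enabled
```

After this sequence `console_alive` is `False`: the interrupt is enabled on core 1, which never receives it, and stays masked on core 0, which does. With a single core the handler can only run on the receiving core, so the flag is restored and the bug never shows.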