StellaAthena

Comments

Compute Trends Across Three Eras of Machine Learning

The distinction between the "large scale era" and the rest of DL looks rather suspicious to me. You don't give a meaningful defense of which points you label "large scale era" in your plot, and it largely looks like you took a handful of the most expensive models each year and gave them a different label.

On what basis can you conclude that Turing NLG, GPT-J, GShard, and Switch Transformers aren't part of the "large scale era"? The fact that they weren't literally the largest models trained that year?

There's also a lot of research that didn't make it into your analysis, including work explicitly geared towards smaller models. What exclusion criteria did you use? I feel like if I were to perform the same analysis with a slightly different sample of papers, I could come to wildly divergent conclusions.

Visible Thoughts Project and Bounty Announcement

1:  I expect that it's easier for authors to write longer thoughtful things that make sense;

I pretty strongly disagree. The key thing I think you are missing here is parallelism: you don't want one person to write you 100 different 600-page stories, you want one person to organize 100 people to write you one 600-page story each. And it's a lot easier to scale if you set the barrier to entry lower. There are many more people who can write 60-page stories than 600-page stories, and it's easier to find 1,000 people to write 60 pages each than it is to find 100 people to write 600 pages each. There's also much less risk on both your side and theirs: if someone drops out halfway through writing, you lose 30 pages, not 300.

Based on this comment:

I state: we'd be happy, nay, ecstatic, to get nice coherent complete shorter runs, thereby disproving my concern that short runs won't be possible to complete, and to pay for them proportionally.

I'm now under the impression that you'd be willing to pay out the $20k for 10 runs of 100 steps each (subject to reasonable quality control), and bringing that about was my main goal in commenting.

The other major worry I have about this pitch is the experimental design. I'm still happy you're doing this, but it doesn't seem like the best-crafted project to me. Briefly, my concerns are:

  1. This is a very topically specific ask of unclear generalization. I would prefer a more generic ask that is not directly connected to D&D.
  2. In my experience training large language models, the number of examples matters more than the length of examples: training on 100 shorter sequences is better than training on 10 longer sequences of the same total length (see the sketch after this list). In particular, I think "You would also expect scarier systems to have an easier time learning without overnarrowing from 100 big examples instead of 10,000 small examples." is not clearly true and very plausibly false.
  3. Using this dataset in a meaningful fashion requires making a priori unrelated breakthroughs, making it overly inaccessible. I think that your comment "I don't want to freeze into the dataset the weird limitations of our current technology, and make it be useful only for training dungeons that are weird the same way 2021 dungeons are weird," is thinking about this the wrong way. The goal should be to maximize the time that we can effectively use this dataset, not be content with the fact that one day it will be useful.
  4. This is a pilot for the real thing you're after, but the "pilot" is a multi-year million-dollar effort. That doesn't seem like a very well designed pilot to me.
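To make the comparison in point 2 concrete, here is a minimal sketch of the two data regimes at a fixed token budget. The token counts and the `sample_examples` helper are mine, chosen for illustration; the snippet constructs the comparison, it doesn't prove the empirical claim:

```python
# Sketch of point 2: same total token budget, different example counts.
# All numbers here are arbitrary illustrations.
import random

corpus = list(range(600_000))  # stand-in for a tokenized corpus

def sample_examples(n_examples: int, example_len: int) -> list[list[int]]:
    """Draw n_examples spans of example_len tokens (spans may overlap)."""
    starts = random.sample(range(len(corpus) - example_len), n_examples)
    return [corpus[s : s + example_len] for s in starts]

many_short = sample_examples(n_examples=100, example_len=600)   # 60k tokens total
few_long = sample_examples(n_examples=10, example_len=6_000)    # 60k tokens total
# Both draw the same 60k-token budget; the claim above is that the
# first regime tends to train better models than the second.
```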
Visible Thoughts Project and Bounty Announcement

Hi! Co-author of the linked “exploration” here. I have some reservations about the exact request (left as a separate comment), but I’m very excited about this idea in general. I’ve been advocating for a while that direct spending on AI research has a huge ROI for alignment research, and it’s very exciting to see this happening.

I don’t have the time (or aptitude) to produce a really high quality dataset, but I (and EleutherAI in general) would be happy to help with training the models if that’s desired. We’d be happy to consult on model design or training set-up, or to simply train the models for you all. No compensation necessary, just excited to contribute to worthwhile alignment research.

Visible Thoughts Project and Bounty Announcement

What is the purpose of requesting such extremely long submissions? This comes out to ~600 pages of text per submission, which is extremely far beyond anything that current technology can leverage. Current NLP systems are unable to reason about more than 2048 tokens at a time, and handle longer inputs by splitting them up. Even if we assume that great strides are made in long-range attention over the next year or two, it does not seem plausible to me that SOTA systems in the near future will be able to use this dataset to its fullest. There’s also inherent value in a more diverse set of scenarios, given the strong propensity of language models to overfit on repeated data. While this isn’t strictly a matter of repeated data, I am under the strong impression that more diverse short scripts will train a much better model than less diverse long scripts, assuming the short scripts are still at or beyond the maximum context length a language model can handle.
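To illustrate what "splitting them up" means in practice, here is a minimal sketch of the standard chunking step, assuming a 2048-token context window; the function name and setup are mine, not any particular system's pipeline:

```python
# Minimal sketch: how LM training pipelines typically handle documents
# longer than the context window. 2048 matches the context length
# discussed above; nothing here is specific to any one system.

CONTEXT_LEN = 2048

def chunk_document(tokens: list[int]) -> list[list[int]]:
    """Split one long tokenized document into context-sized windows.

    The model sees each window independently and cannot attend across
    window boundaries, so a ~600-page run gets shattered into hundreds
    of mutually invisible chunks.
    """
    return [tokens[i : i + CONTEXT_LEN]
            for i in range(0, len(tokens), CONTEXT_LEN)]
```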

For the same reasons it is challenging to leverage, I think that this will also be very challenging to produce. I think that changing the request to 100 different 6-page (10-step) stories or 10 different 60-page (100-step) stories would be a) much easier to produce and b) much more likely to actually help train an AI. It also allows you to pare down the per-submission payouts, assuaging some concerns in the comments about the winner-take-all and adversarial nature of the competition. If you offer $20 per 10-step story for 1,000 stories, it greatly reduces the chances that someone ends up spending a ton of effort but is unable to get it in on time for the reward.

To put the length of this in perspective, a feature-length movie script is typically around 100-130 pages. The ask here is to write 1-2 novels, or 5-6 movie scripts. That’s a massive amount of writing, and not something anyone can complete quickly.

Visible Thoughts Project and Bounty Announcement

Also, I'm unclear on what constitutes a "run"... roughly how long does the text have to be, in words, to have a chance at getting $20,000?

Using the stated length estimates per section, a single run would constitute approximately 600 pages of single-spaced text. This is a lot of writing.
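For the curious, a rough reconstruction of that figure; the per-step word count is my assumed average, not a number from the announcement:

```python
# Back-of-the-envelope reconstruction of the ~600-page estimate.
steps_per_run = 1000   # stated number of steps in a run
words_per_step = 300   # assumption: prompt + thoughts + action + outcome combined
words_per_page = 500   # typical single-spaced page

print(steps_per_run * words_per_step / words_per_page)  # -> 600.0 pages
```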

Yudkowsky and Christiano discuss "Takeoff Speeds"

Interesting… I was busy and wasn’t able to watch the workshop. That’s good to know, thanks!

Yudkowsky and Christiano discuss "Takeoff Speeds"

For Sanh et al. (2021), we were able to negotiate access to preliminary numbers from the BIG Bench project and run the T0 models on it. However, the authors of Sanh et al. and the authors of BIG Bench are different groups of people.

Yudkowsky and Christiano discuss "Takeoff Speeds"

What makes you say BIG Bench is a joint Google / OpenAI project? I'm a contributor to it and have seen no evidence of that.

What exactly is GPT-3's base objective?

I think that 4 misunderstands what people mean when they talk about "the GPT-3 training data." If someone said "there are strings of words found in the GPT-3 training data that GPT-3 never saw," I would tell them that they don't know what the words in that sentence mean. When an AI researcher speaks of "the GPT-3 training data," they mean the data that GPT-3 actually saw. There's data that OpenAI collected which GPT-3 didn't see, but that's not what the words "the GPT-3 training data" refer to.

What exactly is GPT-3's base objective?

Or is it "Predict the next word, supposing what you are reading is a random-with-the-following-weights sample from dataset D? [where D is the dataset used to train GPT-3]

This is the correct answer.

The problem with these last two answers is that they make it undefined how well GPT-3 performs on the base objective on any prompt that wasn't in D, which then rules out pseudo-alignment by definition.

This is correct, but non-problematic in my mind. If data wasn't in the training dataset, then yes, there is no fact of the matter as to what training signal GPT-3 received on it. We can talk about what training signal GPT-3 counterfactually would have received had it been trained on this data, but there is no answer to the question in the actual world.
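To spell the point out, the base objective is just the autoregressive log-loss taken in expectation over the fixed dataset D (my formalization, not the original post's):

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim D}\left[\sum_{t} \log p_\theta\!\left(x_t \mid x_{<t}\right)\right]$$

Since the expectation ranges only over D, the expression simply assigns no value to sequences outside D; asking how GPT-3 scores on them is asking about a counterfactual training run.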
