We performed a blind pairwise comparison between text-davinci-003 and Alpaca 7B, and we found that these two models have very similar performance: Alpaca wins 90 versus 89 comparisons against text-davinci-003.

Interestingly, Alpaca is trained using supervised finetuning, not RLHF (text-davinci-003 is trained using RLHF). This seems to confirm my suspicion that while RLHF improves performance, it is not essential.

2 comments

We train the Alpaca model on 52K instruction-following demonstrations generated in the style of self-instruct using text-davinci-003.

And so it begins: LLM-generated datasets that are useful for training LLMs, that wouldn't be found in the wild, and that would have been (prohibitively) expensive to generate deliberately with human labor. Hopefully the currently human-generated datasets used in SSL pre-training, the backbone of simulacrum alignment, won't be mostly replaced by synthetic datasets that drift away from humanity.

Damn, that's something I had been worrying about recently.

Eliezer said:

I don't think people realize what a big deal it is that Stanford retrained a LLaMA model, into an instruction-following form, by cheaply fine-tuning it on inputs and outputs from text-davinci-003.

It means: If you allow any sufficiently wide-ranging access to your AI model, even by paid API, you're giving away your business crown jewels to competitors that can then nearly-clone your model without all the hard work you did to build up your own fine-tuning dataset. If you successfully enforce a restriction against commercializing an imitation trained on your I/O - a legal prospect that's never been tested, at this point - that means the competing checkpoints go up on bittorrent.

I'm not sure I can convey how much this is a brand new idiom of AI as a technology. Let's put it this way:

If you put a lot of work into tweaking the mask of the shoggoth, but then expose your masked shoggoth's API - or possibly just let anyone build up a big-enough database of Qs and As from your shoggoth - then anybody who's brute-forced a core unmasked shoggoth can gesture to your shoggoth and say to their shoggoth "look like that one", and poof you no longer have a competitive moat.

It's like the thing where if you let an unscrupulous potential competitor get a glimpse of your factory floor, they'll suddenly start producing a similar good - except that they just need a glimpse of the inputs and outputs of your factory. Because the kind of good you're producing is a kind of pseudointelligent gloop that gets sculpted; and it costs money and a simple process to produce the gloop, and separately more money and a complicated process to sculpt the gloop; but the raw gloop has enough pseudointelligence that it can stare at other gloop and imitate it.

In other words: The AI companies that make profits will be ones that either have a competitive moat not based on the capabilities of their model, OR those which don't expose the underlying inputs and outputs of their model to customers, OR can successfully sue any competitor that engages in shoggoth mask cloning.

https://twitter.com/ESYudkowsky/status/1635577836525469697

Interesting video on the topic: The Model That Changes Everything: Alpaca Breakthrough (ft. Apple's LLM, BritGPT, Ernie and AlexaTM)