Dave Orr

Google AI PM; Foundation board member



As someone who runs a market-making bot, I find tether crucial because so many pairs are denominated in tether.

I'm a small player, so it's easy and not very expensive to borrow the tether if you don't trust it (which I don't), but if you are market making on a large scale, it is probably too expensive to borrow.

The other possibility that occurs to me is that if you need tether to succeed because you are invested in its success, you might hold large amounts to control the float.

Note that there are only 4 accounts that have more than a billion tether: https://www.coincarp.com/currencies/tether/richlist/

Thanks, this is highly informative and useful as a definitive explainer of what happened and why.

I do disagree with one point, which is that I don't think there was any signal at all when SBF said things were fine on various forums in various ways. ~100% of the time that an exchange or fund or bank that is vulnerable to runs gets in trouble, the CEO or some spokesperson comes out to say everything is fine. It happens when things are fine and when things are not fine. There's just no information there, in my view. It's not especially polarizing even, because I think SBF would say it in situations where there were significant problems but probably recoverable.

Er, I'm not sure it's been published so I guess I shouldn't give details. It had to be an automatic solution because human curation couldn't scale to the size of the problem.

To add to the list of wtfs of recent SBF behavior: giving that interview. His lawyers must hate him.

I think there's another, related, but much worse problem.

As LLMs become more widely adopted, they will generate large amounts of text on the internet. This widely available text will become training data for future LLMs. Tons of low quality content will reinforce LLM proclivities to produce low quality content -- or even if LLMs are generating high quality content, it will reinforce whatever tendencies and oddities they have, e.g. be permanently pegged to styles and topics of interest in 2010-2030.

This was a problem for translation. As Google Translate got better, people started posting translated versions of their websites where the translation was from Google. Then scrapers looking for parallel data to train on would find these, and it took a lot of effort to screen them out.

Accepting your estimates at face value, there are two problems: the availability of good training data may be a limiting factor; and good training data will be hard to find in a sea of computer generated content.

Answer by Dave Orr, Nov 14, 2022

It's a combination of 1 and 2. Which is to say, regulations require a high level of safety, and we don't have that yet because of point 2.

Robustness is very hard! And in particular perception is very hard when you don't have a really solid world model, because when the AI sees something new it can react or fail to react to it in surprising ways.

The car companies are working on this by putting the cars through many millions of miles of simulated content so that they have seen most things, but that last mile (heh) problem is still very hard with today's technology.

You can think of this as an example of why alignment is hard. Perfect driving in a simulated environment doesn't equal perfect driving in the real world, and while perfection isn't the goal (being superhuman is the goal), our robust perceptual and modeling systems are very hard to beat, at least for now.

Creating competition doesn't count as harm -- it has to be direct substitution for the work in question. That's a pretty high bar.

Also, there are things like Stable Diffusion which arguably aren't commercial (the code and model are free), which further undercuts the commercial purpose angle.

I'm not saying any of this is dispositive -- that's the nature of balancing tests. I think this is going to be a tough row to hoe though, and certainly not a slam dunk to say that copyright should prevent ML training on publicly available data.

(Still not a lawyer, still not legal advice!)

I'm not a lawyer and this is not legal advice, but I think the current US legal framework isn't going to work to challenge training on publicly available data.

One argument that something is fair use is that it is transformative [1]. And taking an image or text and using it to slightly influence a giant matrix of numbers, in such a way that the original is not recoverable, and which allows new kinds of expression, seems likely to count as transformative.

So if you think that restricting access to public data for training purposes is a promising approach [2], you should probably focus on trying to create a new regulatory framework.

Having said that, this is all US analysis. Other countries have other frameworks and may not have exact analogs of fair use. Perhaps in the EU legal challenges are more viable.

[1] https://www.nolo.com/legal-encyclopedia/fair-use-what-transformative.html

[2] You should think about what the side effects would be like. For instance, this will advantage giant companies that can pay to license or create data, and places that have low respect for law. Whether that's desirable is worth thinking through.

Basically their idea is that instead of having one agent that optimizes the hell out of its value function and bad things happen, have a collection of smaller components that each are working on a subproblem with limited resources. If you can do that and also aggregate them such that as a unit they are superhuman, you get a lot of the benefits without (at least some of) the big risks.

Here's a brief explainer with some objections.
