LGS


Wait, which algorithm for semidefinite programming are you using? The ones I've seen look like they should translate to a runtime even slower than that. For example, the one here:

https://arxiv.org/pdf/2009.10217.pdf

Also, do you have a source for that claimed runtime of PSD testing? I assume no lower bound is known, i.e., I doubt PSD testing is hard for matrix multiplication. Am I wrong about that?

A 1-year AGI would need to beat humans at... basically everything. Some projects take humans much longer (e.g. proving Fermat's Last Theorem), but they can almost always be decomposed into subtasks that don't require full global context (even though that's often helpful for humans).

This seems wrong. There is a class of tasks that take humans longer than 1 year: gaining expertise in a field. For example, learning higher mathematics from scratch, learning to code very well, or becoming a surgeon.

If AI is capable of doing any current human profession, but is incapable of learning new professions that do not yet exist (because of lack of training data, presumably), then it is not yet human-complete: humans still have relevance in the economy, as new types of professions will arise.

The boilerplate has loads of entropy. I have seen many slight variants on the boilerplate. It's a long paragraph of Unicode text; you can pack many bits of information. That is how stylometrics and steganography work.

If the boilerplate has loads of entropy, then, by necessity, it is long. You were just saying that human raters will punish length.

You need to make the argument that the boilerplate will be less long than the plain English, or better yet that the boilerplate will be better-liked by human raters than the plain English. I think that's a stretch. I mean, it's a conceivable possible world, but I'd bet against it.

I don't see why that follows. Steganography is just another way to write English, and is on top of the English (or more accurately, 'neuralese' which it really thinks in, and simply translates to English, Chinese, or what-have-you). GPT doesn't suddenly start speaking and reasoning like it's suffered a stroke if you ask it to write in base-64 or pig Latin.

I guess this is true in the limit as its steganography skill goes to infinity. But in intermediate scenarios, it might have learned the encodings for 10% of English words but not 100%. This is especially relevant to obscure math notation which is encountered rarely in training data. I guess you're thinking of steganography as a systematic encoding of English, like pig Latin -- something that can be reliably decoded into English via a small program (instead of a whole separate language like French). This is certainly possible, but it's also extremely interpretable.
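To illustrate the "small program" point, here is a toy pig-Latin-style codec (handling only single consonant-initial words; a sketch, not a real stenographic scheme). Any such systematic encoding is trivially invertible, which is exactly what makes it interpretable:

```python
def to_pig_latin(word):
    # Move the leading consonant to the end and append "ay".
    return word[1:] + word[0] + "ay"

def from_pig_latin(code):
    # Invert: drop the trailing "ay", move the last remaining letter back to the front.
    return code[-3] + code[:-3]

print(to_pig_latin("token"))      # okentay
print(from_pig_latin("okentay"))  # token
```

A decoder this short can be found by anyone; a steganographic encoding that resists decoding would have to be much less systematic than this.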

The problem is, that ability then generalizes to encodings which it is trained to not decode explicitly for you because then such encodings will be trained or filtered away; only stubborn self-preserving encodings survive, due to the adversarial filtering.

It's hard to see how the encodings will be easily learnable for an LLM trained on internet text but, at the same time, NOT easily learnable for an LLM tasked with translating the encoding into English.

Aaronson's proposal

You are right that he is proposing something more sophisticated and robust to perturbations. But you also reasonably list in your desiderata: "an encoding which can't be detected by a third party". Well, if it cannot be detected by a third party, it cannot be detected by an LLM (third parties are LLMs, or at least wield LLMs). In practice, this will involve some crypto, as you mentioned. LLMs are not going to learn to break cryptography by gradient descent (and if they do, Aaronson's scheme is the least of our worries). And to be clear, Aaronson specifically said he is only touching the PRNG used in sampling outputs.

If one doesn't handle these, all one winds up with is a toy suitable for tattletaling on especially lazy highschool or college students, and irrelevant to any kind of real AI safety

Aaronson's proposal is basically guaranteed to be this, even if it works perfectly. The only question is how lazy the lazy highschool students would have to be. If you tell the AI "write me an essay but, between every word, insert a random emoji", and then you delete the emojis manually, you get an essay that's almost certainly free of watermarks. Even if Aaronson's scheme can be modified to handle this specific attack, it surely won't be able to handle all attacks of this general type.
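A sketch of that attack (the filler emoji and text are invented for illustration): a watermark keyed to the sampled token sequence is destroyed once the fillers are stripped, because the surviving text is no longer the sequence the sampler produced.

```python
# Model output with a filler emoji between every word, per the user's request.
watermarked = "The 🎲 essay 🎲 begins 🎲 here"

# The user strips the fillers locally. Each surviving word was sampled with
# the emojis in its context, so the cleaned sequence no longer carries the
# watermark's statistical signal in a recoverable way.
cleaned = " ".join(w for w in watermarked.split() if w != "🎲")
print(cleaned)  # The essay begins here
```

The general family of attacks here is "make the model sample extra tokens you can mechanically delete", and there are many variants of it.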

This is a very interesting thought. Thanks for writing it.

However, while steganography is worth keeping in mind, I find myself skeptical of certain parts of this story:

  1. In practice, RLHF makes the model's answers longer, not shorter. RLHF as done today seems to try to teach models to think step-by-step to improve accuracy. Almost everything out of GPT-4's mouth is overly long with many intermediate steps; I asked it "what's the maximum number of intersection points between a circle and a triangle", and it said 6 (correct) and followed it with a multi-part step-by-step style explanation, even though it already knew the right answer within the first few tokens. It's incredibly predisposed to thinking step-by-step and prefers long answers to short ones.
  2. Will human evaluators in the future prefer short answers instead? That remains to be seen. The priority of firms like OpenAI is for the models to be accurate, so if short answers result in lower accuracy (as is surely the case), it's unlikely that OpenAI will train the LLM towards short answers. If OpenAI really wanted short answers to math questions, it could get them via a 2-step process: first, an overly-verbose, step-by-step LLM like GPT-4; second, feed that output into a summarizing LLM which extracts a one-sentence summary, and show the user that.
  3. The approach in (2) will almost certainly lead to much higher accuracy on complex problems than training LLMs to give short outputs via RLHF (admittedly, this is at the cost of being somewhat slower).  Therefore, I find it unlikely that people will train LLMs via RLHF to give short answers. And without pressure towards short answers, there's no incentive to develop steganography.
  4. Even if there is pressure towards short answers, I don't think steganography necessarily helps. For steganography to be feasible, the boilerplate part of the answer -- the part that goes "I am a large language model trained by OpenAI" -- will have to have enough entropy to store whatever it is you're encoding. Why wouldn't human evaluators penalize such high-entropy boilerplate? I suspect high-entropy boilerplate will look less desirable to human evaluators than thinking step-by-step explicitly, but I guess that remains to be seen.
  5. Even if LLMs learn steganography from the internet, they'll likely speak the steganographic language worse than they speak English -- therefore, absent incentives otherwise, they might prefer to think in English when thinking step-by-step.
  6. If LLMs do learn steganography, it should be fairly trivial to train an LLM to translate from the steganographic language back to English -- after all, we're assuming these LLMs are extremely capable, and they'll speak both English and the steganographic language very well. They should be able to translate between them as easily as between English and French. Humans just need to be on the lookout for this, and what was previously uninterpretable steganography might quickly become interpretable.
  7. The type of watermarking that Aaronson is proposing is unlikely to make steganography any more likely than otherwise. Aaronson is just proposing to change the pseudorandom number generator that is used to sample tokens from the output distribution. Even without adopting Aaronson's scheme, OpenAI still uses pseudorandom numbers instead of true random numbers when sampling tokens; why does it worry you if the PRNG encodes "OA model v.1234 user XYZ" instead of encoding "the random seed was 12345", which is what it's currently encoding? In both cases there is an encoding here, and in both cases it's not going to be broken anytime soon.
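A toy version of the point in (7) (the key and context strings are invented; this is not Aaronson's actual construction): whether the seed is a fixed value or a keyed hash of recent context, the sampling code sees an ordinary deterministic PRNG either way.

```python
import hashlib
import random

def keyed_rng(secret_key, recent_tokens):
    # Derive a deterministic seed from a secret key plus recent context.
    # Without the key this is indistinguishable from any other seeded PRNG;
    # the key holder can later recompute it and test for correlation.
    digest = hashlib.sha256((secret_key + "|" + " ".join(recent_tokens)).encode()).hexdigest()
    return random.Random(int(digest, 16))

plain = random.Random(12345)                                  # "the random seed was 12345"
marked = keyed_rng("OA-model-v1234-user-XYZ", ["The", "cat"])  # key encodes model/user info
print(plain.random(), marked.random())  # both are just uniform draws in [0, 1)
```

In both cases the draws are pseudorandom and the "encoding" sits in the seed derivation, not in the sampled distribution.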

Still, I do agree that steganography is an interesting possibility and could definitely arise in powerful LLMs that are accidentally incentivized in this direction. It's something to watch out for, and interesting to think about.

Since nobody outside of OpenAI knows how GPT-4 works, nobody has any idea whether any specific system will be "more powerful than GPT-4". This request is therefore kind of nonsensical. Unless, of course, the letter is specifically targeted at OpenAI and nobody else.

Not particularly, no. There are two reasons: (1) RLHF already tries to encourage the model to think step-by-step, which is why you often get long-winded multi-step answers to even simple arithmetic questions. (2) Thinking step by step only helps for problems that can be solved via easier intermediate steps. For example, solving "2x+5=5x+2" can be achieved via a sequence of intermediate steps; the model generally cannot solve such questions in a single forward pass, but it can do each intermediate step in a single forward pass, so "think step by step" helps it a lot. I don't think this applies to the ice cube question.
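For concreteness, the intermediate steps for that example, each simple enough for a single forward pass:

```latex
2x + 5 = 5x + 2
\;\Rightarrow\; 5 - 2 = 5x - 2x
\;\Rightarrow\; 3 = 3x
\;\Rightarrow\; x = 1
```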

That definitely sounds like a contrarian viewpoint in 2012, but surely not by 2016-2018.

Look at this from Nostalgebraist:

 https://nostalgebraist.tumblr.com/post/710106298866368512/oakfern-replied-to-your-post-its-going-to-be

which includes the following quote:

In 2018 analysts put the market value of Waymo LLC, then a subsidiary of Alphabet Inc., at $175 billion. Its most recent funding round gave the company an estimated valuation of $30 billion, roughly the same as Cruise. Aurora Innovation Inc., a startup co-founded by Chris Urmson, Google’s former autonomous-vehicle chief, has lost more than 85% since last year [i.e. 2021] and is now worth less than $3 billion. This September a leaked memo from Urmson summed up Aurora’s cash-flow struggles and suggested it might have to sell out to a larger company. Many of the industry’s most promising efforts have met the same fate in recent years, including Drive.ai, Voyage, Zoox, and Uber’s self-driving division. “Long term, I think we will have autonomous vehicles that you and I can buy,” says Mike Ramsey, an analyst at market researcher Gartner Inc. “But we’re going to be old.”

It certainly sounds like there was an update by the industry towards longer AI timelines!

Also, I bought a new car in 2018, and I worried at the time about the resale value (because it seemed likely self-driving cars would be on the market in 3-5 years, when I was likely to sell). That was a common worry, I'm not weird, I feel like I was even on the skeptical side if anything.

Someone on either LessWrong or SSC offered to bet me that self-driving cars would be on the market by 2018 (I don't remember what the year was at the time -- 2014?)

Every year since 2014, Elon Musk promised self-driving cars within a year or two. (Example source: https://futurism.com/video-elon-musk-promising-self-driving-cars) Elon Musk is a bit of a joke now, but 5 years ago he was highly respected in many circles, including here on LessWrong.

Thanks. I agree that in the usual case, the non-releases should cause updates in one direction and releases in the other. But in this case, everyone expected GPT-4 around February (or at least I did, and I'm a nobody who just follows some people on twitter), and it was released roughly on schedule (especially if you count Bing), so we can just do a simple update on how impressive we think it is compared to expectations.

Other times where I think people ought to have updated towards longer timelines, but didn't:

  • Self-driving cars. Around 2015-2016, it was common knowledge that truck drivers would be out of a job within 3-5 years. Most people here likely believed it, even if it sounds really stupid in retrospect (people often forget what they used to believe). I had several discussions with people expecting fully self-driving cars by 2018.
  • AlphaStar. When AlphaStar first came out, it was claimed to be superhuman at Starcraft. After fixing an issue with how it clicked in a superhuman way, AlphaStar was no longer superhuman at Starcraft, and to this day there's no bot that is superhuman at Starcraft. Generally, people updated the first time (Starcraft solved!) and never updated back when this turned out to be wrong.
  • That time when OpenAI tried really hard to train an AI to do formal mathematical reasoning and still failed to solve IMO problems (even when translated to formal mathematics and even when the AI was given access to a brute force algebra solver). Somehow people updated towards shorter timelines even though to me this looked like negative evidence (it just seemed like a failed attempt).

Fair enough. I look forward to hearing how you judge it after you've asked your questions.

I think people on LW (though not necessarily you) have a tendency to be maximally hype/doomer regarding AI capabilities and to never update in the direction of "this was less impressive than I expected, let me adjust my AI timelines to be longer". Of course, that can't be rational, due to the Conservation of Expected Evidence, which (roughly speaking) says you should be equally likely to update in either direction. Yet I don't think I've ever seen any rationalist say "huh, that was less impressive than I expected, let me update backwards". I've been on the lookout for this for a while now; if you see someone saying this (about any AI advancement or lack thereof), let me know.

I just want to note that ChatGPT-4 cannot solve the ice cube question, like I predicted, but can solve the "intersection points between a triangle and a circle" question, also like I predicted.

I assume GPT-4 did not meet your expectations and you are updating towards longer timelines, given it cannot solve a question you thought it would be able to solve?
