
To be clear, I haven't seen many designs that people I respect believed to have a chance of actually working. If you work on the alignment problem or at an AI lab and haven't read Nate Soares' On how various plans miss the hard bits of the alignment challenge, I'd suggest reading it.

Can you explain your definition of the sharp left turn and why it will cause many plans to fail?

This looks pretty close to Eliezer's views.

It's based on the expectation that people will disregard the danger of superintelligent AI and will continue to scale it until AIs are powerful and incomprehensible enough to kill everyone.

And also that "merely" roughly human-level AIs can't contribute significantly to AI alignment research or help with some kind of pivotal act.

I think that both points are not exactly correct. So, there is a chance.

I don’t expect everyone to disregard the danger; I do expect most people building capable AI systems to continue to hide hard problems. Hiding the hard problems is much easier than solving them, but I guess it produces plausible-sounding solutions just as well.

Roughly human-level humans don’t contribute significantly to AI alignment research and can’t be pivotally used. So I don’t think you think that a roughly human-level AI system can contribute significantly to AI alignment research. Maybe you (as many seem to) think that if someone runs not-that-superhuman language models with clever prompt engineering, fine-tuning, and systems around them, then the whole system can solve alignment or be pivotally used. The point of the post is that the whole system is superhuman, not roughly human-level, if it’s capable enough to solve alignment or be pivotally used. You need to direct the whole system somewhere, and unless you made the whole system optimize for something you actually want, it probably kills you before it solves alignment.

Has anyone worked out timeline predictions for Non-US/Non-Western Actors and tracked their accuracy?

For example, is China at "GPT-3.5" level yet and 6 months away from GPT-4, or is China a year from GPT-3.0? How about the people contributing to open-source AI? Last I checked, that field looked "generally speaking" kind of at GPT-2.5 level (and even better for deepfaking porn), but I didn't look closely enough to be confident of my assessment.

Anyway, I'd like something more than off-the-cuff thoughts, but rather a good paper and some predictions on Non-US/Non-Western AI timeframes. Because, if anything, even if you somehow avert market forces levering AI up faster and faster among the big 8 in QQQ, those other actors are still going to form a hard deadline on alignment.

Well, I do not have anything like this, but it is very clear that China is way above GPT-3 level. Even the open-source community is significantly above. Take a look at LLaMA/Alpaca: people run them on consumer PCs and they're around GPT-3.5 level; the largest 65B model is even better (it cannot be run on a consumer PC but can be run on a small ~$10k server or cheaply in the cloud). It can also be fine-tuned in 5 hours on an RTX 4090 using LoRA: https://github.com/tloen/alpaca-lora .

Chinese AI researchers contribute significantly to AI progress, although of course, they are behind the USA. 

My best guess would be China is at most 1 year away from GPT-4. Maybe less.

Btw, an example of a recent model: ChatGLM-6b

Thanks for that. In my own exploration, I was able to hit a point where ChatGPT refused a request, but would gladly help me build LLaMA/Alpaca onto a Kubernetes cluster in the next request, even referencing my stated aim later:

"Note that fine-tuning a language model for specific tasks such as [redacted] would require a large and diverse dataset, as well as a significant amount of computing resources. Additionally, it is important to consider the ethical implications of creating such a model, as it could potentially be used to create harmful content."

FWIW, I got down into the nitty-gritty of doing it, debugging the install, etc. I didn't run it, but it would definitely help me bootstrap actual execution. As a side note, my primary use case has been helping me build my own task-specific Lisp and Forth libraries, and my experience tells me GPT-4 is "pretty good" at most coding problems, and if it screws up, it can usually help work through the debug process. So, at first blush, there's at least one universal jailbreak -- GPT-4 walking you through building your own model. Given GPT-4's long text buffers and such, I might even be able to feed it a paper to reference a specific method of fine-tuning or creating an effective model.

If you don’t know where you’re going, it’s not helpful enough not to go somewhere that’s definitely not where you want to end up; you have to differentiate paths towards the destination from all other paths, or you fail.

I'm not exactly sure what you meant here but I don't think this claim is true in the case of RLHF because, in RLHF, labelers only need to choose which option is better or worse between two possibilities, and these choices are then used to train the reward model. A binary feedback style was chosen specifically because it's usually too difficult for labelers to choose between multiple options.
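To make the "these choices are then used to train the reward model" step concrete, here is a minimal sketch of the pairwise (Bradley-Terry style) loss that is commonly used for this. It is an assumption that this matches the exact setup being discussed; the function name and the scalar rewards are hypothetical stand-ins for the outputs of a real reward model.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one labeler comparison: -log(sigmoid(r_chosen - r_rejected)).

    The reward model is only ever told which of two outputs was preferred,
    never an absolute score. (Hypothetical helper for illustration.)
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == softplus(-m); written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss is small when the model already agrees with the labeler,
# and large when it ranks the rejected output higher.
agrees = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees < disagrees)
```

Minimizing this over many comparisons pushes the reward of preferred outputs above the reward of rejected ones, which is how binary choices can turn into a scalar reward signal.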

A similar idea is comparison sorting where the algorithms only need the ability to compare two numbers at a time to sort a list of numbers.
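As a rough illustration of the comparison-sorting analogy, here is a small merge sort that never looks at the items themselves and only asks "is a better than b?". The names `sort_by_pairwise_choice` and `is_better` are made up for this sketch, and the comparator is assumed to be consistent (real labelers may not be).

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def sort_by_pairwise_choice(items: List[T], is_better: Callable[[T, T], bool]) -> List[T]:
    """Merge sort driven only by pairwise choices; returns items best-first."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = sort_by_pairwise_choice(items[:mid], is_better)
    right = sort_by_pairwise_choice(items[mid:], is_better)

    merged: List[T] = []
    i = j = 0
    while i < len(left) and j < len(right):
        # The only question ever asked: which of these two is better?
        if is_better(right[j], left[i]):
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


# Sorting numbers ascending by treating "smaller" as "better".
print(sort_by_pairwise_choice([3, 1, 2], lambda a, b: a < b))
```

The same function would rank model outputs if `is_better` were a labeler's choice between two of them, which is the sense in which binary comparisons are enough to recover a full ordering.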