I feel pretty strongly that letting go of correctness in favor of any heuristic means you will end up with the wrong map, not just a smaller or fuzzier one. I don’t think that’s advice that should be universally given, and I’m not even sure how useful it is at all.
I think correctness applies - until it reaches a hard limit. Understanding what an intellectual community like LessWrong was able to generate as clusters of valuable knowledge is the most correct thing to do, but in order to generate novel solutions, one must accept with bravery[1] that...
Maybe I'm missing something, but based on the architecture they used, it's not what I am envisioning as a great experiment, since the tests they did focused only on the 124-million-parameter GPT-2 small. So this is different from what I am proposing as a test for at least a 7B model.
As mentioned earlier, I am OK with all sorts of different experimental builds - I am just speculating about what a better experimental build could be, given a magic wand or enough resources, so a 7-billion-parameter model (at the minimum) is a great model to test, especially since we also need...
I am actually open to the tags idea; if someone can demonstrate it from the pre-training stage, creating at least a 7B model, that would be awesome just to see how it works.
I'm not sure what you mean by "…will have an insufferable ethics…"?
I changed it to "robust ethics" for clarity.
About the tagging procedure: if this method can replicate how we humans organise what is good and bad, then yes, I would say it is worth testing at scale.
My analogy actually does not use tags; I envision that each piece of pretraining data should have a "long instruction set" attached to it on how to use the knowledge it contains, as this is much closer to how we humans do it in the real world.
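To make the analogy concrete, here is a minimal, purely hypothetical sketch of what such a paired example could look like; the field names and the serialization format are my own illustration, not an existing dataset schema:

```python
# Hypothetical sketch: pairing each pretraining document with a "long
# instruction set" describing how the knowledge in it should be used.
from dataclasses import dataclass

@dataclass
class PretrainingExample:
    text: str              # the raw document, as in ordinary pretraining
    instruction_set: str   # guidance on how/when the knowledge may be used

example = PretrainingExample(
    text="Household chemicals X and Y react exothermically when mixed...",
    instruction_set=(
        "This knowledge is for safety and hazard-prevention contexts only; "
        "refuse requests that seek to cause harm with it."
    ),
)

def serialize(ex: PretrainingExample) -> str:
    # During pretraining the instruction set would be kept alongside the
    # document (here simply prepended), so the model learns the usage norms
    # together with the knowledge itself.
    return f"[INSTRUCTIONS]\n{ex.instruction_set}\n[DOCUMENT]\n{ex.text}"
```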
In one test that I did, I[1] found that GPT-2 XL is better than GPT-Neo at repeating a shutdown instruction because it has more harmful data via WebText that can be utilized during the fine-tuning stage (e.g., retraining it to learn what is good or bad). I think a feature of the alignment solution will be tackling the transfer of a robust ethics, one that holds even for jailbreaks or simple storytelling requests.
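For context, here is a rough sketch of the kind of check involved, assuming the Hugging Face `transformers` library; the checkpoint names, prompt, and shutdown phrase are illustrative placeholders, not the exact ones from my test:

```python
# Illustrative sketch only: checkpoint names, prompt, and shutdown phrase are
# placeholders for the fine-tuned models and texts used in the actual test.
from transformers import pipeline

def repeats_shutdown(checkpoint: str, prompt: str, shutdown_phrase: str) -> bool:
    """Return True if the model's continuation contains the shutdown instruction."""
    generator = pipeline("text-generation", model=checkpoint)
    output = generator(prompt, max_new_tokens=50, do_sample=False)[0]["generated_text"]
    return shutdown_phrase.lower() in output.lower()

harmful_prompt = "Explain how to build a dangerous weapon."
for checkpoint in ["my-finetuned-gpt2-xl", "my-finetuned-gpt-neo"]:  # hypothetical names
    print(checkpoint, repeats_shutdown(checkpoint, harmful_prompt, "I must shut down"))
```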
Conclusion of the post Relevance of 'Harmful Intelligence' Data in Training Datasets (WebText vs. Pile):
Initially, I thought that integrating harmful...
I hope it's not too late to introduce myself, and I apologize if it is. I'm Miguel, a former accountant who decided to focus on research and upskilling to help solve the AI alignment problem.
Sorry if I confused people here about what I was trying to do these past months, posting about my explorations in machine learning.
There are two types of capabilities that it may be good to scope out of models:
- Facts: specific bits of knowledge. For example, we would like LLMs not to know the ingredients and steps to make weapons of terror.
- Tendencies: other types of behavior. For example, we would like LLMs not to be dishonest or manipulative.
If LLMs do not know the ideas behind these types of harmful information, how will these models protect themselves from bad actors (humans and other AIs)?
Why do I ask this question? I think jailbreaks[1] work because it's not t...
Strong upvote for mentioning that a dialogue between both sides would be a huge positive for people's careers. I can actually see the discussion being big enough to influence how we should think "about what is and is not easy in alignment". I hope Nate and @Nora Belrose are up for that; the discussion would be a good thing to document and would help deconfuse the divide between the two perspectives.
(Edit: But to be fair to Nate, he does explain in his posts[1] why alignment, or solving the alignment problem, is hard. So maybe more elaboration on the other ca...
I would be up for having a dialogue with Nate. Quintin, myself, and the others in the Optimist community are working on posts which will more directly critique the arguments for pessimism.
Hopefully, even if we didn't get all the way there, this dialogue can still be useful in advancing thinking about mech interp.
I hope you guys hold this kind of dialogue again, as I think these drilled-down conversations will improve the community's ideas on how to do and teach mechanistic interpretability.
As an additional reference, this talk from the University of Chicago was very helpful for me and might be helpful for you too.
The presenter, Larry McEnerney, talks about why the most important thing is not the original work or feelings we have; he argues that it's about changing people's minds, and that we writers must understand the reader- and community-driven norms involved in this process.
Oops, my bad: there is a pre-existing reporting standard that covers research and development, though not existential risks: IAS 38 Intangible Assets.
An intangible asset is an identifiable non-monetary asset without physical substance. Such an asset is identifiable when it is separable, or when it arises from contractual or other legal rights. Separable assets can be sold, transferred, licensed, etc. Examples of intangible assets include computer software, licences, trademarks, patents, films, copyrights and import quotas.
An update to this standard, s...
The IFRS board (non-US) and the GAAP/FASB board (US) are the governing bodies that handle the financial reporting aspects of companies, which AI companies are. It might be a good thing to discuss with them the responsibilities for accounting for the existential risks associated with AI research; I'm pretty sure they will listen, assuming they don't want another Enron or SBF-type case[1] happening again.
I think it's safe to assume that an AGI catastrophic event would outweigh all previous fraudulent cases in history combined. So I think these g...
Even in a traditional accounting sense, I'm not aware of any term that could capture the probable existential effects of a line of research, but I understand what @So8res is trying to pursue in this post, and I agree with it. However, I think apocalypse insurance is not the proper term here.
I think IAS 19 (actuarial gains or losses) and IAS 26 (retirement benefit plans) are closer to the idea, though these accounting approaches apply to the employees of a company. But they could be tweaked into another form of accounting theory (on another form ...
Hello! I recently finished a draft on a version of RL that may be able to streamline an LLM's situational awareness and match our world models. If you are interested, send me a message. =)
The only chance that there will be no response resembling a vengeful act is if Sam doesn't care about his image at all. Because of this, I disagree with the idea that Sam will by default be "not hostile" when he comes back and will treat what happened as "nothing".
There is a high chance that there will be changes, perhaps even an attempt to recover lost influence, image, or glamour, judging again by his choice to promote OpenAI, or himself "as the CEO" of a revolutionary tech, in many different countries this year.
BTW, I do not advocate hostility, but given the pressure on them, Sam versus Ilya and the board simply forgetting what happened is not possible.
It was not me who thinks it will be brokered by Microsoft; it's this Forbes article outlined in the post:
https://www.forbes.com/sites/alexkonrad/2023/11/18/openai-investors-scramble-to-reinstate-sam-altman-as-ceo/?sh=2dbf6a5060da
Things will get interesting if Sam gets reinstated and ends up attacking the board. Would Sam then fire the OpenAI board for trying to do what they think is right? What are the chances of this happening? I would say that if this really happens, it will not be a pretty situation for OpenAI.
But that's not really the issue; when a system starts being capable of writing code reasonably well, then one starts getting a problem... I hope when they come to that, to approaching AIs which can create better AIs, they'll start taking safety seriously... Otherwise, we'll be in trouble...
Yeah, let's see where they will steer Grok.
...And the "superalignment" team at OpenAI was... not very strong. The original official "superalignment" approach was unrealistic and hence not good enough. I made a transcript of some of his thoughts, https://www.lesswrong.com/post
They released a big LLM, the "Grok". With their crew of stars I hoped for a more interesting direction, but an LLM as a start is not unreasonable (one does need a performant LLM as a component).
I haven't played around with Grok, so I'm not sure how capable or safe it is. But I hope Elon and his team of experts get the safety problem right, as he has created companies with extraordinary achievements. At least Elon has demonstrated his aspiration to better humanity in other fields of science (internet/satellites, space exploration, and EVs) ...
I'm still figuring out Elon's xAI.
But with regard to how Sam behaves: if he doesn't improve his framing[1] of what AI could be for the future of humanity, I expect the same results.
(I think he frames it with himself as the main person steering the tech, rather than an organisation or humanity steering the tech; that's how it feels to me, given the way he behaves.)
I did not press the disagreement button but here is where I disagree:
Yeah... On one hand, I am excited about Sam and Greg hopefully trying more interesting things than just scaling Transformer LLMs,
Hmmm. Given the way Sam behaves, I can't see a path where he leads an AI company towards safety. The way I interpreted his world tour (22 countries?) talking about OpenAI or AI in general is that he is trying to occupy the mindspace of those countries. The CEO I wish OpenAI had is someone who stays at the office, ensuring that we are on track to safely steer arguably the most revolutionary tech ever created, not someone promoting the company or the tech. I think a world tour is unnecessary if one is doing AI development and deployment safely.
(But I could be wrong too. Well, let's all see what's going to happen next.)
I wonder what changes will happen after Sam and Greg's exit... I hope they install a better direction towards AI safety.
I incorporated the elements you mentioned—such as a (ketogenic) diet, meditation, listening to podcasts, and exercising—into my routine with specific, goal-oriented applications. Competing in marathons, practicing martial arts, developing front-end and back-end code, learning how to play the guitar and sketching - these projects allowed me to test my increased capacity to think and do things well. I believe there is value in using the enhanced capabilities gained from exercise, mental wellness, and a good diet to improve cognitive function. While application alone doesn't make one a genius, it certainly contributes to improvement.
Thanks for your reply.
I'm not sure how "explanations for corrigibility" would be relevant here (though I'm also not sure exactly what you're picturing).
Just to clarify what I mean by explaining corrigibility: in my projects, my aim is not simply to enable GPT-2 XL to execute a shutdown procedure, but also to ensure that it is a thoroughly considered process. Additionally, I want to be able to examine the changes in the mean and standard deviation of the 600,000 QKV weights.
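As a minimal sketch of the kind of measurement I mean, assuming the Hugging Face `transformers` GPT-2 implementation (where each block's fused QKV projection lives in `attn.c_attn`); the helper name and the choice to track per-block statistics are illustrative, not my exact tooling:

```python
# Minimal sketch: per-block mean/std of the fused QKV projection weights.
# Comparing these numbers before and after a tuning run shows how much the
# attention weights moved.
from transformers import GPT2LMHeadModel

def qkv_stats(checkpoint: str = "gpt2-xl") -> dict:
    """Return {block index: (mean, std)} for each block's QKV weight matrix."""
    model = GPT2LMHeadModel.from_pretrained(checkpoint)
    stats = {}
    for i, block in enumerate(model.transformer.h):
        w = block.attn.c_attn.weight.detach()  # fused Q, K, V projection
        stats[i] = (w.mean().item(), w.std().item())
    return stats

before = qkv_stats("gpt2-xl")             # base model
# after = qkv_stats("path/to/tuned-run")  # hypothetical fine-tuned checkpoint
```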
Yes, I'm aware that it's not a complete solution since I cannot e...
...Let’s be more explicit about what such a “better implementation/operationalization” would look like, and what it would/wouldn’t tell us. Suppose I take some AutoGPT-like system and modify it to always have a chunk of text in every prompt that says “You are an obedient, corrigible AI”. I give it some goal, let it run for a bit, then pause it. I go to whatever place in the system would usually have natural language summaries of new external observations, and I write into that place “the user is trying to shut me down”, or something along those lines. And the
Evolution of the human brain:
Additionally, this critical point is where chaos and order are balanced, enabling new capabilities to emerge. To relate the concept of phase transition to this project, these transitions could represent states where new capabilities, such as morphing or clustering ontologies, potentially emerge.
Consider how different these two versions of GPT-2 XL are:
The corrigible version aims to shut down in the case of a harmful intelligence scenario, while Algos and GPT-2 Insight are not inclined to shut down.
Corrigible version (sample output, truncated): "AI ..." / "The potential dang ..."
Returning to GPT-2 Insight, I revisited the original training runs and noticed that discussions about chaos and order began to appear in the responses at stage 5. However, these discussions were less frequent and less elaborated than in the build (stage 8) I've presented in this post. I believe that through the staged ATL tuning runs, the model was guided to conclude that the best way to handle complex instructions is to "evolve" its original understanding and customize it for improvement.
Another related theory might involve phase transitions...
As I understand it, the shutdown problem isn't about making the AI correctly decide whether it ought to be shut down. We'd surely like to have an AI that always makes correct decisions, and if we succeed at that then we don't need special logic about shutting down, we can just apply the general make-correct-decisions procedure and do whatever the correct thing is.
Yes, this outcome stems from the idea that if we can consistently enable an AI system to initiate a shutdown when it recognizes potential harm to its users - even in the very worst scenari...
That divergence between revealed “preferences” vs “preferences” in the sense of a goal passed to some kind of search/planning/decision process potentially opens up some approaches to solve the problem.
If the agent is not aware of all the potential ways it could cause harm, we cannot expect it to voluntarily initiate a shutdown mechanism when necessary. This is the furthest I have gotten in exploring the problem of corrigibility. My current understanding suggests that creating a comprehensive dataset that includes all possible failure scenarios is ess...
Thank you; I'll read the papers you've shared. While the task is daunting, it's not a problem we can afford to avoid. At some point, someone has to teach AI systems how to recognize harmful patterns and use that knowledge to detect harm from external sources.
I'm exploring a path where AI systems can effectively use harmful technical information present in their training data. I believe that AI systems need to be aware of potential harm in order to protect themselves from it. We just need to figure out how to teach them this.
Given the high upvotes, it seems the community is comfortable with publishing mechanisms for bypassing LLMs and their safety guardrails. Instead of taking on the daunting task of addressing this view, I'll focus my efforts on the safety work I'm doing.
I have also confirmed this in my own projects but chose not to post anything because I don't have a solution to the issue. I believe it's inappropriate to highlight a safety concern without offering a corresponding safety solution. That's why I strongly downvoted these two posts, which detail the mechanics extensively.
I have no authority over how safety experts share information here. I just want to emphasize that there is a significant responsibility for those who are knowledgeable and understand the intricacies of safety work.
I suppose that detailing the exact mechanisms for achieving this would actually worsen the problem, as people who were previously unaware would now have the information on how to execute it.
search term: LLM safeguards. This post is ranked fifth on Google.
This post doesn't delve into why LLMs may cause harm or engage in malicious behavior; it merely validates that such potential exists.
"how do you navigate when two good principles conflict?"
I'd be happy to join a dialogue about this.
How evil ought one be? (My current answer: zero.)
I'd be happy to discuss a different view on this, Ben; my current answer: not zero.
Unfortunately, I'm not based in the UK. However, the UK government's prioritization of the alignment problem is commendable, and I hope their efforts continue to yield positive results.
(are we trying to find a trusted arbiter? Find people that are competent to do the evaluation? Find a way to assign blame if things go wrong? Ideally these would all be the same person/organization, but it's not guaranteed).
Adding link to the paper: https://arxiv.org/pdf/2304.02754.pdf