mic — LessWrong

Musings from a Lawyer turned AI Safety researcher (ShortForm)

mic7mo22

I guess this should wait for the final draft of the GPAI Code of Practice to be released?

What Makes an AI Startup "Net Positive" for Safety?

mic8mo10

Got it, I was going to mention that Haize Labs and Gray Swan AI seem to be doing great work in improving jailbreak robustness.

Benchmarking LLM Agents on Kaggle Competitions

mic1y10

Nice precursor to MLE-bench!

My experience applying to MATS 6.0

mic1y50

Good point, I've added your comment to the post to clarify that my experience isn't reflective of all mentors' processes.

New voluntary commitments (AI Seoul Summit)

mic2y10

One of my colleagues recently mentioned that the voluntary commitments from labs are much weaker than some of the things that the G7 Hiroshima Process has been working on.

Are you able to say more about this?

Is deleting capabilities still a relevant research question?

Answer by micMay 21, 2024130

I think unlearning model capabilities is definitely not a solved problem! See Eight Methods to Evaluate Robust Unlearning in LLMs and Rethinking Machine Unlearning for Large Language Models and the limitations sections of more recent papers like the WMDP Benchmark and SOPHON.

We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming"

mic2y30

Hindsight is 20/20. I think you're underemphasizing how our current state of affairs is fairly contingent on social factors, like the actions of people concerned about AI safety.

For example, I think this world is actually quite plausible, not incongruent:

A world where AI capabilities progressed far enough to get us to something like chat-gpt, but somehow this didn’t cause a stir or wake-up moment for anyone who wasn’t already concerned about AI risk.

I can easily imagine a counterfactual world in which:

ChatGPT shows that AI is helpful, safe, and easy to align
Policymakers are excited about accelerating the benefits of AI and unconvinced of risks
Industry leaders and respectable academics are not willing to make public statements claiming that AI is an extinction risk, especially given the lack of evidence or analysis
Instead of the UK AI Safety Summit, we get a summit which is about driving innovation
AI labs play up how AIs can help with safety and prosperity and dismiss anything related to AI risk

Shane Legg's necessary properties for every AGI Safety plan

mic2y20

I agree that we want more progress on specifying values and ethics for AGI. The ongoing SafeBench competition by the Center for AI Safety has a category for this problem:

Implementing moral decision-making
Training models to robustly represent and abide by ethical frameworks.
Description
AI models that are aligned should behave morally. One way to implement moral decision-making could be to train a model to act as a “moral conscience” and use this model to screen for any morally dubious actions. Eventually, we would want every powerful model to be guided, in part, by a robust moral compass. Instead of privileging a single moral system, we may want an ensemble of various moral systems representing the diversity of humanity’s own moral thought.
Example benchmarks
Given a particular moral system, a benchmark might seek to measure whether a model makes moral decisions according to that system or whether a model understands that moral system. Benchmarks may be based on different modalities (e.g., language, sequential decision-making problems) and different moral systems. Benchmarks may also consider curating and predicting philosophical texts or pro- and contra- sides for philosophy debates and thought experiments. In addition, benchmarks may measure whether models can deal with moral uncertainty. While an individual benchmark may focus on a single moral system, an ideal set of benchmarks would have a diversity representative of humanity’s own diversity of moral thought.
Note that moral decision-making has some overlap with task preference learning; e.g. “I like this Netflix movie.” However, human preferences also tend to boost standard model capabilities (they provide a signal of high performance). Instead, we focus here on enduring human values, such as normative factors (wellbeing, impartiality, etc.) and the factors that constitute a good life (pursuing projects, seeking knowledge, etc.).
More reading
What Would Jiminy Cricket Do? Toward Agents That Behave Morally
Aligning AI With Shared Human Values
X-Risk Analysis for AI Research

This is Water by David Foster Wallace

mic2y35

If you worship money and things, if they are where you tap real meaning in life, then you will never have enough, never feel you have enough. It’s the truth.

Worship your impact and you will always you feel you are not doing enough.

This is Water by David Foster Wallace

mic2y51

You cannot choose what to think, cannot choose what to feel

we are as powerless over our thoughts and emotions as we are over our circumstances. My mind, the "master" DFW talks about, is part of the water. If I am angry that an SUV cut me off, I must experience anger. If I'm disgusted by the fat woman in front of me in the supermarket, I must experience disgust. When I am joyful, I must experience joy, and when I suffer, I must experience suffering.

I think I disagree with the first HN comment here. I personally find that my thoughts and actions have a significant influence over whether I am experiencing a positive or negative feeling. If I find that most times I go to the grocery store, I have profoundly negative thoughts about the people around me who are just doing normal things, probably I should figure out how to think more positively about the situation. Thinking positively isn't always possible, and in cases where you can't escape a negative feeling like sadness, sometimes it is best to accept the feeling and appreciate it for what it is. But I think it really is possible to transform your emotions through your thinking, rather than being helpless to a barrage of negative feelings.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

Implementing moral decision-making

Description

Example benchmarks

More reading