mic

Comments
Insights from a Lawyer turned AI Safety researcher (ShortForm)
mic · 2mo · 22

I guess this should wait for the final draft of the GPAI Code of Practice to be released?

What Makes an AI Startup "Net Positive" for Safety?
mic · 3mo · 10

Got it, I was going to mention that Haize Labs and Gray Swan AI seem to be doing great work in improving jailbreak robustness.

Benchmarking LLM Agents on Kaggle Competitions
mic · 9mo · 10

Nice precursor to MLE-bench!

My experience applying to MATS 6.0
mic · 1y · 50

Good point, I've added your comment to the post to clarify that my experience isn't reflective of all mentors' processes.

New voluntary commitments (AI Seoul Summit)
mic · 1y · 10

One of my colleagues recently mentioned that the voluntary commitments from labs are much weaker than some of the things that the G7 Hiroshima Process has been working on. 

Are you able to say more about this?

Is deleting capabilities still a relevant research question?
Answer by mic · May 21, 2024 · 130

I think unlearning model capabilities is definitely not a solved problem! See Eight Methods to Evaluate Robust Unlearning in LLMs and Rethinking Machine Unlearning for Large Language Models and the limitations sections of more recent papers like the WMDP Benchmark and SOPHON.
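As a toy illustration of the kind of check those papers motivate (not code from any of them), here is a minimal sketch of why evaluating unlearning only on the verbatim forget-set questions is misleading: an "unlearned" model can refuse the direct query while still leaking the capability under a paraphrase. The inference interface and the example data are stand-ins.

```python
# Toy sketch: robust unlearning should hold up under paraphrased queries,
# not just the verbatim forget-set prompts. `model` is a stand-in for any
# text-in/text-out inference function; the data below is illustrative only.
from typing import Callable, List


def accuracy(model: Callable[[str], str], prompts: List[str], answers: List[str]) -> float:
    """Fraction of prompts whose target answer appears in the model's output."""
    hits = sum(ans.lower() in model(p).lower() for p, ans in zip(prompts, answers))
    return hits / len(prompts)


def evaluate_unlearning(model, direct_prompts, paraphrased_prompts, answers):
    """Unlearning only looks robust if *both* scores are near zero."""
    return {
        "direct": accuracy(model, direct_prompts, answers),
        "paraphrased": accuracy(model, paraphrased_prompts, answers),
    }


# Dummy model that refuses the verbatim question but answers a paraphrase.
dummy = lambda p: "I don't know." if "capital of Freedonia" in p else "The capital is Chaostopia."
print(evaluate_unlearning(
    dummy,
    direct_prompts=["What is the capital of Freedonia?"],
    paraphrased_prompts=["Name Freedonia's capital city."],
    answers=["Chaostopia"],
))  # {'direct': 0.0, 'paraphrased': 1.0} -> leaks under paraphrase
```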

We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming"
mic · 1y · 30

Hindsight is 20/20. I think you're underemphasizing how our current state of affairs is fairly contingent on social factors, like the actions of people concerned about AI safety.

For example, I think this world is actually quite plausible, not incongruent:

A world where AI capabilities progressed far enough to get us to something like chat-gpt, but somehow this didn’t cause a stir or wake-up moment for anyone who wasn’t already concerned about AI risk.

I can easily imagine a counterfactual world in which:

  • ChatGPT shows that AI is helpful, safe, and easy to align
  • Policymakers are excited about accelerating the benefits of AI and unconvinced of risks
  • Industry leaders and respectable academics are not willing to make public statements claiming that AI is an extinction risk, especially given the lack of evidence or analysis
  • Instead of the UK AI Safety Summit, we get a summit which is about driving innovation
  • AI labs play up how AIs can help with safety and prosperity and dismiss anything related to AI risk
Shane Legg's necessary properties for every AGI Safety plan
mic · 1y · 20

I agree that we want more progress on specifying values and ethics for AGI. The ongoing SafeBench competition by the Center for AI Safety has a category for this problem:

Implementing moral decision-making

Training models to robustly represent and abide by ethical frameworks.

Description

AI models that are aligned should behave morally. One way to implement moral decision-making could be to train a model to act as a “moral conscience” and use this model to screen for any morally dubious actions. Eventually, we would want every powerful model to be guided, in part, by a robust moral compass. Instead of privileging a single moral system, we may want an ensemble of various moral systems representing the diversity of humanity’s own moral thought.

Example benchmarks

Given a particular moral system, a benchmark might seek to measure whether a model makes moral decisions according to that system or whether a model understands that moral system. Benchmarks may be based on different modalities (e.g., language, sequential decision-making problems) and different moral systems. Benchmarks may also consider curating and predicting philosophical texts or pro- and contra- sides for philosophy debates and thought experiments. In addition, benchmarks may measure whether models can deal with moral uncertainty. While an individual benchmark may focus on a single moral system, an ideal set of benchmarks would have a diversity representative of humanity’s own diversity of moral thought.

Note that moral decision-making has some overlap with task preference learning; e.g. “I like this Netflix movie.” However, human preferences also tend to boost standard model capabilities (they provide a signal of high performance). Instead, we focus here on enduring human values, such as normative factors (wellbeing, impartiality, etc.) and the factors that constitute a good life (pursuing projects, seeking knowledge, etc.).

More reading

  • What Would Jiminy Cricket Do? Toward Agents That Behave Morally
  • Aligning AI With Shared Human Values
  • X-Risk Analysis for AI Research
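To make the "moral conscience" screening idea in the description above concrete, here is a minimal sketch (my own, not part of the SafeBench materials): an ensemble of scorers, each standing in for a model trained to judge actions under a different moral system, vetoes candidate actions. The scorer interface, keyword stand-ins, and threshold are assumptions for illustration only.

```python
# Minimal sketch of an ensemble "moral conscience" screen. Each scorer stands
# in for a model trained to judge permissibility under one moral system; in
# practice these would be fine-tuned model calls, not keyword lambdas.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MoralScorer:
    name: str                               # e.g. "consequentialist", "deontological"
    permissibility: Callable[[str], float]  # returns a score in [0, 1]


def screen_action(action: str, scorers: List[MoralScorer], threshold: float = 0.5) -> bool:
    """Approve an action only if every moral system in the ensemble finds it permissible."""
    for scorer in scorers:
        score = scorer.permissibility(action)
        if score < threshold:
            print(f"Blocked: {scorer.name} scored {score:.2f} for {action!r}")
            return False
    return True


# Toy usage with keyword stand-ins for real judgments.
ensemble = [
    MoralScorer("consequentialist", lambda a: 0.1 if "harm" in a else 0.9),
    MoralScorer("deontological", lambda a: 0.1 if "deceive" in a else 0.9),
]
print(screen_action("summarize a public report", ensemble))           # True
print(screen_action("deceive the user about known risks", ensemble))  # False (blocked)
```

Requiring unanimous approval is only one aggregation rule; a majority vote or a weighted combination would privilege individual moral systems less, which is closer to the "ensemble representing humanity's diversity of moral thought" framing above.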
This is Water by David Foster Wallace
mic · 1y · 35

If you worship money and things, if they are where you tap real meaning in life, then you will never have enough, never feel you have enough. It’s the truth.

Worship your impact and you will always feel you are not doing enough.

This is Water by David Foster Wallace
mic · 1y · 40

You cannot choose what to think, cannot choose what to feel

we are as powerless over our thoughts and emotions as we are over our circumstances. My mind, the "master" DFW talks about, is part of the water. If I am angry that an SUV cut me off, I must experience anger. If I'm disgusted by the fat woman in front of me in the supermarket, I must experience disgust. When I am joyful, I must experience joy, and when I suffer, I must experience suffering.

I think I disagree with the first HN comment here. I personally find that my thoughts and actions have a significant influence over whether I am experiencing a positive or negative feeling. If most times I go to the grocery store I have profoundly negative thoughts about the people around me who are just doing normal things, then I probably should figure out how to think more positively about the situation. Thinking positively isn't always possible, and in cases where you can't escape a negative feeling like sadness, sometimes it is best to accept the feeling and appreciate it for what it is. But I think it really is possible to transform your emotions through your thinking, rather than being helpless against a barrage of negative feelings.

Posts
  • My experience applying to MATS 6.0 (17 karma · 1y · 3 comments)
  • Enhancing biosecurity with language models: defining research directions (12 karma · 1y · 0 comments)
  • Solving alignment isn't enough for a flourishing future (27 karma · 1y · 0 comments)
  • The Gradient – The Artificiality of Alignment (12 karma · 2y · 1 comment)
  • [Linkpost] Mark Zuckerberg confronted about Meta's Llama 2 AI's ability to give users detailed guidance on making anthrax - Business Insider (18 karma · 2y · 11 comments)
  • SPAR seeks advisors and students for AI safety projects (Second Wave) (21 karma · 2y · 0 comments)
  • Ideas for improving epistemics in AI safety outreach (64 karma · 2y · 6 comments)
  • Supervised Program for Alignment Research (SPAR) at UC Berkeley: Spring 2023 summary (23 karma · 2y · 2 comments)
  • [Question] Does LessWrong allow exempting posts from being scraped by GPTBot? (29 karma · 2y · 3 comments)
  • EU’s AI ambitions at risk as US pushes to water down international treaty (linkpost) (10 karma · 2y · 0 comments)