
Paul W

Comments

Paul W's Shortform
Paul W · 3mo

The von Neumann–Morgenstern paradigm allows for binary utility functions, i.e. functions that are equal to 1 on some event (a measurable set of outcomes) and to 0 on its complement. Said event could be, for instance, "no global catastrophe for humanity in time period X".
Of course, you can implement some form of deontology by multiplying such a binary utility function by something like exp(−number of bad actions you take).
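
Concretely (the notation here is mine, just to fix ideas): write $A$ for the event "no global catastrophe for humanity in time period X" and $B(\omega)$ for the number of bad actions taken in outcome $\omega$. The combined utility would then look something like

$$U(\omega) = \mathbf{1}_{A}(\omega)\, e^{-\lambda B(\omega)},$$

with $\lambda > 0$ an illustrative weight on the deontological penalty.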

Any thoughts on this observation?
 

What convincing warning shot could help prevent extinction from AI?
Paul W · 3mo

When you say "maybe we should be assembling like minded and smart people [...]", do you mean "maybe"? Or do you mean "Yes, we should definitely do that ASAP"?

Conceptual Rounding Errors
Paul W · 4mo

Have you noticed that you keep encountering the same ideas over and over? You read another post, and someone helpfully points out it's just old Paul's idea again. Or Eliezer's idea. Not much progress here, move along.

Or perhaps you've been on the other side: excitedly telling a friend about some fascinating new insight, only to hear back, "Ah, that's just another version of X." And something feels not quite right about that response, but you can't quite put your finger on it.

 

Some questions regarding these contexts:

-Is it true that you can deduce that "not much progress" is being made? In (pure) maths, it is sometimes very useful to be able to connect two points of view/notions (e.g. (co)homological theories, to name the most obvious example that comes to mind).

-What is the goal of such interactions? Is it truly to point out relevant related work? To dismiss other people's ideas for {political/tribal/ego-related} motives? Other?

As for possible fixes:

-Maintain a collective {compendium/graph/whatever data structure is relevant} of important concepts, with precise enough definitions and comparison information (examples and/or theoretical arguments) between similar, but not identical, ideas (rough sketch below).

Or rather: acknowledging that the AI Safety community(ies) is/are terrible at coordination, devise a way of combining/merging such {compendia/graphs/whatever}, since it is unlikely that only one will emerge...
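
A very rough sketch of the kind of structure, and of the merging operation, I have in mind; every name here is illustrative rather than a concrete schema proposal:

```python
# Rough sketch of a shared "concept graph": nodes are concepts with
# definitions, edges record how two similar-but-distinct ideas compare.
# Illustrative only, not a proposal for a specific schema.

from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    definition: str
    examples: list[str] = field(default_factory=list)


@dataclass
class Comparison:
    a: str         # name of the first concept
    b: str         # name of the second concept
    relation: str  # e.g. "special case of", "superficially similar but distinct"
    evidence: str  # an example or theoretical argument backing the comparison


@dataclass
class ConceptGraph:
    concepts: dict[str, Concept] = field(default_factory=dict)
    comparisons: list[Comparison] = field(default_factory=list)

    def add(self, concept: Concept) -> None:
        self.concepts[concept.name] = concept

    def merge(self, other: "ConceptGraph") -> "ConceptGraph":
        """Naive merge of two independently maintained graphs."""
        merged = ConceptGraph(dict(self.concepts), list(self.comparisons))
        merged.concepts.update(other.concepts)  # later entries win on name clashes
        merged.comparisons.extend(other.comparisons)
        return merged
```

The merge here is deliberately naive; the hard part is of course agreeing on the definitions, not the data structure.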

Good Research Takes are Not Sufficient for Good Strategic Takes
Paul W · 4mo

Strong upvote. Slightly worried by the fact that this wasn't written, in some form, earlier (maybe I missed a similar older post?).

I think we[1] can, and should, go even further:

 

-Find a systematic/methodical way of identifying which people are really good at strategic thinking, and help them use their skills in relevant work; maybe try to hire from outside the usual recruitment pools. 

If deemed feasible (in a short enough amount of time): train some people mainly on strategy, so as to get a supply of better strategists.

-Encourage people to state their incompetence in some domains (except maybe in cases where it makes for bad PR) / embrace the idea of specialization and division of labour more: maybe high-level strategists don't need as much expertise on the technical details, only the ability to see which phenomena matter (assuming domain experts are able to communicate well enough).

 

  1. ^

    say, the people who care about preventing catastrophic events, in a broad sense

Elicitation for Modeling Transformative AI Risks
Paul W · 4mo

Hi! 

Have you heard of the ModelCollab and CatColab projects? It seems that there is an interesting overlap with what you want to do!

More generally, people at the Topos Institute work on related questions of collaborative modelling and collective intelligence:

 

https://topos.institute/work/collective-intelligence/

https://topos.institute/work/collaborative-modelling/

https://www.localcharts.org/t/positive-impact-of-algebraicjulia/6643

There's a website for sharing world-modelling ideas, run by Owen Lynch (who works at Topos UK):

https://www.localcharts.org/t/localcharts-is-live/5714


For instance, they have a paper on task-delegation: 

 

 

Their work uses somewhat advanced maths, but I think it is justified by the ambition: to develop general tools for creating and combining models. They seem to make an effort to popularise these, so that non-mathematicians can get something out of their work.

Emergence, The Blind Spot of GenAI Interpretability?
Paul W · 11mo

Are you saying that holistic/higher-level approaches can be useful because they are very likely to be more computationally efficient, actually fit inside human brains, or do not require as much data?
Is that the main point, or did I miss something?

Natural Latents: The Concepts
Paul W · 1y

Hello!
These ideas seem interesting, but there's something that disturbs me: in the coin-flip example, how is 3 fundamentally different from 1000? The way I see it, the only mathematical difference is that your "bounds" (whatever that means) are simply much worse in the case with 3 coins. Of course, I think I understand why humans/agents would want to say "the case with 3 flips is different from that with 1000", but the mathematics seem similar to me.
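To put a rough number on "much worse" (my own back-of-the-envelope via Hoeffding's inequality, not anything from the post): for $n$ independent fair coins, the empirical frequency of heads $\bar X_n$ satisfies

$$\Pr\left(\left|\bar X_n - \tfrac{1}{2}\right| \ge 0.1\right) \le 2e^{-2n(0.1)^2},$$

which is roughly $4 \times 10^{-9}$ for $n = 1000$ but about $1.9$ (i.e. vacuous) for $n = 3$.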
Am I missing something?

Davidad's Bold Plan for Alignment: An In-Depth Explanation
Paul W · 1y

"Is the field advanced enough that it would be feasible to have a guaranteed no-zero-day evaluation and deployment codebase that is competitive with a regular codebase?"

As far as I know (I'm not an expert), such absolute guarantees are too hard right now, especially if the AI you're trying to verify is arbitrarily complex. However, the training process ought to yield an AI with specific properties. I'm not entirely sure I got what you meant by "a guaranteed no-zero-day evaluation and deployment codebase". Would you mind explaining more?
 

"Or is the claim that it's feasible to build a conservative world model that tells you "maybe a zero-day" very quickly once you start doing things not explicitly within a dumb world model?" 

I think that's closer to the idea: you {reject and penalize, during training} as soon as the AI tries something that might be "exploiting a zero-day", in the sense that the world-model can't rule out this possibility with high confidence[1]. That way, the training process is expected to reward simpler, more easily verified actions.
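
A toy sketch of that training-time filter (the interface `world_model.lower_bound_no_exploit` and the constants below are my own illustrative assumptions, not part of the actual proposal):

```python
# Toy sketch of a conservative training-time filter. Purely illustrative:
# the world-model interface and constants are assumptions of mine.

SAFETY_THRESHOLD = 0.999  # confidence the world-model must reach to allow an action
PENALTY = -10.0           # reward penalty for actions it cannot verify


def filtered_reward(world_model, task_reward, state, action):
    """Penalize actions whose safety the world-model cannot establish."""
    # Conservative lower bound on the probability that the action does NOT
    # exploit a vulnerability (e.g. a zero-day), as judged by the world-model.
    p_safe = world_model.lower_bound_no_exploit(state, action)
    if p_safe < SAFETY_THRESHOLD:
        # The possibility of an exploit cannot be ruled out with high
        # confidence: reject/penalize, so training favours simpler,
        # more easily verified actions.
        return PENALTY
    return task_reward(state, action)
```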


Then, a key question is "what else do you want from your AI?": of course, it is supposed to perform critical tasks, not just "let you see what program is running"[2], so there is tension between the various specifications you enter. The question of how far you can actually go, how much you can actually ask for, is both crucial and wide open, as far as I can tell.

  1. ^

    Some of the uncertainty lies in how accurate and how conservative the world-model is; you won't get a "100% guarantee" anyway, especially since you're only aiming for probabilistic bounds within the model.

  2. ^

    Otherwise, a sponge would do.

Davidad's Bold Plan for Alignment: An In-Depth Explanation
Paul W · 1y

I believe that the current trends for formal verification, say, of traditional programs or small neural networks, are more about conservative over-approximations (a technique known as abstract interpretation). You might want to have a look at this survey: https://caterinaurban.github.io/pdf/survey.pdf
To be more precise, it appears that so-called "incomplete formal methods" (3.1.1.2 in the survey I linked) are more computationally efficient, even though they can produce false negatives.
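As a toy illustration of the over-approximation idea (my own sketch, not taken from the survey): interval arithmetic propagates conservative bounds through a computation, so the result is guaranteed to contain every reachable value, possibly at the cost of being loose.

```python
# Toy interval-style "abstract interpretation": propagate conservative
# bounds through a tiny computation. Illustrative only.

from dataclasses import dataclass


@dataclass
class Interval:
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other: "Interval") -> "Interval":
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))


# Abstractly evaluate f(x) = x*x + x for x in [-1, 1].
x = Interval(-1.0, 1.0)
print(x * x + x)  # Interval(lo=-2.0, hi=2.0): sound but loose, since the true range is [-0.25, 2].
```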
Does that answer your question?

Posts

Paul W's Shortform · 3mo
Why I find Davidad's plan interesting · 1y