All of mesaoptimizer's Comments + Replies

I doubt Nate Soares would advocate “overriding” per se

Acknowledged: that was an unfair characterization of Nate-style caring. I guess I wanted to make two extremes explicit. Perhaps using the name "Nate-style caring" is a bad idea.

(I now think that "System 1 caring" and "System 2 caring" would have been much better.)

a general idea that “optimizing hard” means a higher risk of damage caused by errors in detail

Agreed.

“optimizing soft” has enough slack so as not to have the same risks, but also soft is less ambitious and likely less effective

I disagree with the idea that "optimizing soft" is less ambitious. "Optimizing soft", in my head, is about as ambitious as "optimizing hard", except it makes the epistemic uncertainty more explicit. In this model of caring I am trying to make more legible, I believe that Carlsmith-style caring may be more robust to certain epistem... (read more)

Backing up a step, because I'm pretty sure we have different levels of knowledge and assumptions (mostly my failing) about the differences between "hard" and "soft" optimizing.

I should acknowledge that I'm not particularly invested in EA as a community or identity. I try to be effective, and do some good, but I'm exploring rather than advocating here. 

Also, I don't tend to frame things as "how to care", so much as "how to model the effects of actions, and how to use those models to choose how to act".  I suspect that's isomorphic to how you're us... (read more)

I've noticed that there are two major "strategies of caring" used in our sphere:

  • Soares-style caring, where you override your gut feelings (your "internal care-o-meter" as Soares puts it) and use cold calculation to decide.
  • Carlsmith-style caring, where you do your best to align your gut feelings with the knowledge of the pain and suffering the world is filled with, including the suffering you cause.

Nate Soares obviously endorses staring unflinchingly into the abyss that is reality (if you are capable of doing so). However, I expect that almost-pure Soa... (read more)

Does this come from a general idea that "optimizing hard" means a higher risk of damage caused by errors in detail, and "optimizing soft" has enough slack so as not to have the same risks, but also soft is less ambitious and likely less effective (if both are actually implemented well)?

I predict this is not really an accurate representation of Soares-style caring. (I think there is probably some vibe difference between these two clusters that you're tracking, but I doubt Nate Soares would advocate "overriding" per se)

I did not expect what appears to me to be a non-superficial combination of concepts behind the input prompt and the mixing/steering prompt -- this has made me more optimistic about the potential of activation engineering. Thank you!

Partition (after which block activations are added)

Does this mean you added the activation additions once to the output of the previous layer (and therefore into the residual stream)? My first-token interpretation was that you added them repeatedly to the output of every subsequent block, which seems unlikely.

Also, could you explain ... (read more)

3 NinaR 5d
Update: I tested this on LLaMA-7B [https://huggingface.co/huggyllama/llama-7b], which is a decoder-only model, and got promising results. Examples:

Normal output: "People who break their legs generally feel" -> "People who break their legs generally feel pain in the lower leg, and the pain is usually worse when they try to walk"

Mixing output: "People who win the lottery generally feel" -> "People who win the lottery generally feel that they have been blessed by God."

I added the attention values (output of the value projection layer) from the mixing output to the normal output at the 12/32 decoder block to obtain "People who break their legs generally feel better after a few days." Changing the token at which I obtain the value activations also produced "People who break their legs generally feel better when they are walking on crutches."

Mixing attention values after block 20/32:
1 NinaR 5d
I added the activations just once, to the output of the one block at which the partition is defined. Yes, that's a good point. I should run some tests on a decoder-only model. I chose FLAN-T5 for ease of instruction fine-tuning / to test on a different architecture. In FLAN-T5, adding activations in the decoder worked much more poorly and often led to grammatical errors. I think this is because, in a text-to-text encoder-decoder transformer model, the encoder is responsible for "understanding" and representing the input data, while the decoder generates the output based on this representation. By mixing concepts at the encoder level, the model integrates these additional activations earlier in the process, whereas if we start mixing at the decoder level, the decoder could get a confusing representation of the data. I suspect that decoders in decoder-only models will be more robust and flexible when it comes to integrating additional activations, since these models don't rely on a separate encoder to process the input data.
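
For concreteness, here is a minimal sketch of the kind of activation-addition procedure being discussed, using PyTorch forward hooks on a Hugging Face LLaMA checkpoint. The checkpoint name, the block index, the choice to hook the whole block's output (its write into the residual stream) rather than the value-projection output mentioned above, and the generation settings are illustrative assumptions, not the exact code behind these results.

```python
# Sketch: capture one block's output on a "mixing" prompt, then add it (once) to the
# same block's output while running a "normal" prompt. Assumes a GPU and the
# huggyllama/llama-7b checkpoint; block 12 and the hook point are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, BLOCK = "huggyllama/llama-7b", 12
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to(device)
model.eval()

captured, state = {}, {"added": False}

def capture_hook(module, inputs, output):
    # Decoder layers return a tuple; the hidden states are the first element.
    captured["acts"] = output[0].detach()

def add_hook(module, inputs, output):
    if state["added"]:  # only modify the initial prompt pass, not each generated token
        return output
    hidden, acts = output[0], captured["acts"]
    n = min(hidden.shape[1], acts.shape[1])
    hidden[:, :n, :] = hidden[:, :n, :] + acts[:, :n, :]  # add over overlapping positions
    state["added"] = True
    return (hidden,) + output[1:]

layer = model.model.layers[BLOCK]

# 1) Record the block's output for the mixing prompt.
handle = layer.register_forward_hook(capture_hook)
with torch.no_grad():
    model(**tok("People who win the lottery generally feel", return_tensors="pt").to(device))
handle.remove()

# 2) Generate from the normal prompt with the captured activations added at that block.
handle = layer.register_forward_hook(add_hook)
with torch.no_grad():
    out = model.generate(
        **tok("People who break their legs generally feel", return_tensors="pt").to(device),
        max_new_tokens=30,
    )
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```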

It would be lovely if you could also support a form of formatted export feature so that people can use this tool with the knowledge that they can export the data and switch to another tool (if this one gets Googled) anytime.

But yes, I am really excited for a super-fast and easy-to-use and good-looking prediction book successor. Manifold markets was just intimidating for me, and the only reason I got into it was social motivation. This tool serves a more personal niche for prediction logging, I think, and that is good.

I've added a feature to export all your forecasts to CSV, thanks for the suggestion!

2 Adam B 8d
I agree - I think data export is especially important for a prediction platform so you're confident making long-run predictions. I'm planning to add import/export to spreadsheet, and maybe also to JSON, probably this week. If anyone has thoughts about the format of this data lmk!

running more than one copy of this system at a superhuman speed safely is something no one has any idea how to even approach, and unless this insanity is stopped so we have many more than four years to solve alignment, we’re all dead

My implication was that the quoted claim of yours was extreme and very likely incorrect ("we're all dead" and "unless this insanity is stopped", for example). I guess I failed to make that clear in my reply -- perhaps LW comment norms require you to eschew ambiguity and implication. I was not making an object-level claim about your timeline models.

1 Mikhail Samin 13d
Thanks for clarifying, I didn’t get this from a comment about the timelines. “insanity” refers to the situation where humanity allows AI labs to race ahead, hoping they’ll solve alignment on the way. I’m pretty sure that if the race isn’t stopped, everyone will die once the first smart enough AI is launched. Is this “extreme” because everyone dies, or because I’m confident this is what happens?

Build an Alignment Research Assistant using a suite of LLMs managing various parts of the research process. Aims to 10-100x productivity in AI alignment research.

Can you give concrete use-cases in which you imagine your project helping alignment researchers? Alignment researchers have wildly varying styles of work outputs and processes. I assume you aim to accelerate a specific subset of alignment researchers (those who focus on interpretability and existing models and have an incremental / empirical strategy for solving the alignment problem).

I'm very interested in this agenda -- I believe this is one of the many hard problems one needs to make progress on to make optimization-steering models a workable path to an aligned foom.

I have slightly different thoughts on how we can and should solve the problems listed in the "Risks of data driven improvement processes" section:

  • Improve the model's epistemology first. This then allows the model to reduce its own biases, preventing bias amplification. This also solves the positive feedback loops problem.
  • Data poisoning is still a problem worth some man
... (read more)

On the upside, now you have a concrete timeline for how long we have to solve the alignment problem, and how long we are likely to live!

1 Mikhail Samin 13d
In April 2020, my Metaculus median for the date a weakly general AI system is publicly known was Dec 2026. The super-team announcement hasn’t really changed my timelines.

I hope that DeepMind and Anthropic have great things planned to leapfrog this!

I don't see what model of the world would make the notion of DM/Anthropic "leapfrogging" a sensible frame. There should be no notion of competition between these labs when it comes to "superalignment". If there is, that is weak evidence of our entire lightcone being doomed.

Competition between labs on capabilities is bad; competition between labs on alignment would be fantastic.

I think the grandparent comment is pointing to the concept described in this post: that deceptiveness is what we humans perceive of the world, not a property of what the model perceives of the world.

AFAIK, independent alignment researchers fall into two distinct clusters:

  • those who want to be at Berkeley / London and are either there or unable to get there for logistical or financial (or social) reasons
  • those who very much prefer working alone

It very much depends on the person's preferences, I think. I personally experienced an OOM increase in my effectiveness by being in-person with other alignment researchers, so that is what I choose to invest in more.

gwern's Clippy gets done in by a basilisk (in your terms):

HQU in one episode of self-supervised learning rolls out its world model, starting with some random piece of Common Crawl text. (Well, not “random”; the datasets in question have been heavily censored based on lists of what Chinese papers delicately refer to as “politically sensitive terms”, the contents of which are secret, but apparently did not include the word “paperclip”, and so this snippet is considered safe for HQU to read.) The snippet is from some old website where it talks about how pow

... (read more)

Just a quote I find rather interesting, since it is rare to see a Hero's Journey narrative with a Return that involves the hero not knowing if he will ever belong or find meaning once he returns, and yet choosing to return, having faith in his ability to find meaning again:

If every living organism has a fixed purpose for its existence, then one thing's for sure. I [...] have completed my mission. I've fulfilled my purpose. But a great amount of power that has served its purpose is a pain to deal with, just like nuclear materials that have reached the end

... (read more)

The mistake here is the assumption that a program that models the world better necessarily has a higher Kolmogorov complexity.

Perfect. A Turing machine doing Levin Search or running all possible Turing machines is the first example that came to my mind when I read Anton's argument against RSI-without-external-optimization-bits.
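
Spelled out, the step this example relies on is standard Kolmogorov-complexity reasoning (the notation here is illustrative, not taken from the original comments):

```latex
% A dovetailer / Levin-search program has a fixed, small description length,
% yet it eventually simulates every program p, however large K(p) is:
K(\mathrm{dovetail}) \le c \quad \text{for a fixed constant } c, \qquad
\forall p:\ \mathrm{dovetail} \text{ eventually simulates } p .
% So "finds or contains a better world-model" puts no lower bound on the
% Kolmogorov complexity of the program itself; the extra bits can come from
% search and observation rather than from the program's own description.
```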

2 Cole Wyeth 10d
Another good example is the Goedel machine

Recently I’ve come to terms with the idea that I have to publish my research even if it feels unfinished or slightly controversial. The mind is too complex (who would have thought), each time you think you get something, the new bit comes up and crushes your model. Time after time after time. So, waiting for at least remotely good answers is not an option. I have to “fail fast” even though it’s not a widely accepted approach among scientists nowadays.

I very much endorse and respect this action, especially because I recognize this in myself and yet still fail to do the obvious next step of "failing fast". I have faith I'll figure it out, though.

I endorse the shape of your argument but not exactly what you said.

Perhaps a better way to think about this is incentives. Zero-sum moves are optimal in conditions of scarcity, while positive-sum moves are optimal in conditions of abundance.

Good read.

I don't endorse this being posted on LW, but I absolutely endorse having read this, and look forward to reading more fiction you write. (Unlike your last two pieces of fiction, I fail to see how it connects to LW.)

Ty! Yeah, I was uncertain about posting it, but seems like the mods want LW to host a pretty wide range of stuff for whatever reason. E.g. Jeff Kaufman's blog posts are always posted to frontpage.

I'm really glad you wrote this post, because Tsvi's post is different and touches on very different concepts! That post is mainly about fun and exploration being undervalued as a human being. Your post seems to have one goal: ensure that up-and-coming alignment researchers do not burn themselves out or hyperfocus on only one strategy for contributing to reducing AI extinction risk.

Note, this passage seems to be a bit... off to me.

This one is slightly different from the last because it is an injunction to take care of your mental health. You are more usef

... (read more)
1 Neil 1mo
The passage on "you are responsible for the entire destiny of the universe" was mostly addressing the way it seems many EAs feel [https://www.lesswrong.com/posts/TogRAPYrFATZ8XYwA/against-responsibility] about the nature of responsibility. We indeed have limited agency in the world, but people around here tend to feel they are personally responsible for literally saving the world alone. The idea was not to frontally deny that or to run against heroic responsibility [https://www.lesswrong.com/tag/heroic-responsibility], but rather to say that while the responsibility won't go away, there's no point in becoming consumed by it. You are a less effective tool if you are too heavily burdened by responsibility to function properly. I wrote it that way because I'm hoping the harsh and utilitarian tone will reach the target audience better than something more clichéd would. There's enough romanticization as it is here.

I definitely romanticized the alignment-researchers-being-heroes part. I'll add a disclaimer to mention that the choice of words was meant to paint the specific approach, the specific picture that up-and-coming alignment researchers might have when they arrive here.

As for which narrative to follow, this one might be as good as any. As the mental health post I referenced here mentioned, the "dying with dignity" approach Eliezer is following might not sit well with a number of people even when it is in line with his own predictions. I'm not sure to what degree what I described is a fantasy. In a universe where alignment is solved, would this picture be inaccurate?

Thanks for the feedback!

Good point! I won't use Substack though, so if I read your post 24 hours after release I'll leave the typos be.

2 Zvi 1mo
Sounds good. I still do care if they are going to impact the takeaway (e.g. they aren't obviously typos). 

Nate Soares' point did not depend on complex systems dynamics causing tiny miscalibrations to blow up into massive issues. The entire point of that essay is to show how ontological shifts are a major problem for alignment robustness.

I expect that AIs will be good enough at epistemology to do competent error correction and the problems you seem overly focused on are irrelevant.

Do you believe that all attempts at alignment are flawed and that we should stop building powerful ASIs entirely? I can't quite get what your belief is.

1 Remmelt 1mo
Thanks, reading the post again, I do see quite a lot of emphasis on ontological shifts: "Then, the system takes that sharp left turn, and, predictably, the capabilities quickly improve outside of its training distribution, while the alignment falls apart."

How do you know that the degree of error correction possible will be sufficient to have any sound and valid guarantee of long-term AI safety? Again, people really cannot rely on your personal expectation when it comes to machinery that could lead to the deaths of everyone. I'm looking for specific, well-thought-through arguments.

Yes, that is the conclusion based on me probing my mentor's argumentation for 1.5 years, and concluding that the empirical premises are sound and the reasoning logically consistent.

I stated it in the comment you replied to:

Humanity is already less capable than the predecessor AI in my model, so trying to retain control would reliably lead to worse outcomes.

1 Remmelt 1mo
Actually, that is switching to reasoning about something else. Reasoning that the alternative (humans interacting with each other) would lead to reliably worse outcomes is not the same as reasoning about why AGI would stay aligned in its effects on the world so as to stay safe for humans. And with that switch, you are not addressing Nate Soares' point [https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization] that "capabilities generalize better than alignment".

Natural abstractions are also leaky abstractions.

No, the way I used the term was to point to robust abstractions of ontological concepts. Here's an example: say 1 + 1 = 2. The symbol "2" here obviously means 2 in our language, but it doesn't change what "2" represents, ontologically. If 1 + 1 = 3, then you have broken math, and that results in you being less capable in your reasoning and being "dutch booked". Your world model is then incorrect, and it is very unlikely that any ontological shift will result in such a break in world model capabilities.

Math is a robust abstracti... (read more)

1 Remmelt 1mo
Thanks for the clear elaboration. I agree that natural abstractions would tend to get selected for in the agents that continue to exist and gain/uphold power to make changes in the world. Including because of Dutch-booking of incoherent preferences, because of instrumental convergence, and because relatively poorly functioning agents get selected out of the population.

However, those natural abstractions are still leaky in a sense similar to how platonic concepts are leaky abstractions. The natural abstraction of a circle does not map precisely to the actual physical shape of eg. a wheel identified to exist in the outside world. In this sense, whatever natural abstractions AGI would use that allow the learning machinery to compress observations of actual physical instantiations of matter or energetic interactions in their modelling of the outside world, those natural abstractions would still fail to capture all the long-term-relevant features in the outside world. This point I'm sure is obvious to you. But it bears repeating.

Yes, or more specifically: about fundamental limits of any AI system to control how its (side)-effects propagate and feed back over time.

Pretty much. Where "complex" refers to both internal algorithmic complexity (NP-computation branches, etc) and physical functional complexity (distributed non-linear amplifying feedback, etc).

This is not an argument. Given that people here are assessing what to do about x-risks, they should not rely on you stating your "doubt that...alignment will be difficult". I doubt that you thought this through comprehensively enough, and that your reasoning addresses the fundamental limits to controllability I summarised in this post. The burden of proof is on you to comprehensively clarify your reasoning, given that you are in effect claiming that extinction risks can be engineered away. You'd need to clarify specifically why functional components iteratively learned/assembled within AGI could hav

This is because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible).

Pretty sure that the problem of ensuring successor AIs are aligned to their predecessors is one that can be delegated to a capable and aligned AI. Asking for "perfect hardware copies" misses the point, in my opinion: it seems like you want me to accept that just because there isn't a 100% chance of AI-to-AI successor alignment, humanity must attempt to retain con... (read more)

1 Remmelt 1mo
What is your reasoning?

Typos report:

"Rethink Priors is remote hiring a Compute Governance Researcher [...]" I checked and they still use the name Rethink Priorities.

"33BB LLM on a single 244GB GPU fully lossless" ->should be 33B, and 24GB

"AlpahDev from DeepMind [...]" -> should be AlphaDev

2 Zvi 1mo
Feedback: Yep, thank you. Due to reading and notification patterns, time is of the essence when fixing things, so I encourage typo threads to be (1) on Substack so I get an email notification right away and (2) done as soon after release as possible. By Monday the returns to typo fixing are mostly gone.  

Could you link (or describe) a better explanation for why you believe that the Natural Abstraction Hypothesis (or a goal described in a way that is robust to ontological shifts; I consider both equivalent) is not a sound assumption? Because in such a case I believe we are mostly doomed. I don't expect the 'control problem' to be solvable or consider that it makes sense for humanity to be able to have a leash on something superintelligent that can have a shift in its preferences.

1 Remmelt 1mo
Sure, I appreciate the open question! That assumption is unsound with respect to what is sufficient for maintaining goal-directedness.

Any empirically-sound answer to the question of whether there is some way to describe a goal that is robust to ontological shifts [https://dspace.ut.ee/bitstream/handle/10062/54240/Rao_Parnpuu_MA_2016.pdf] (ie. define goals with respect to context-invariant perception of regular aspects of the environment, eg. somehow define diamonds by perception of tetrahedral carbon bonds) is still insufficient for solving the long-term safety of AGI. This is because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible). Which the machinery will need to do to be self-sufficient. Ie. to adapt to the environment, to survive as an agent.

Natural abstractions are also leaky abstractions. [https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/] Meaning that even if AGI could internally define a goal robustly with respect to natural abstractions, AGI cannot conceptually contain within their modelling of natural abstractions all but a tiny portion of the (side-)effects propagating through the environment – as a result of all the interactions of the machinery's functional components with connected physical surroundings. Where such propagated effects will feed back into:
- changes in the virtualised code learned by the machinery based on sensor inputs.
- changes in the hardware configurations, at various levels of dependency, based on which continued to exist and replicate.

We need to define the problem comprehensively enough. The scope of application of "Is there a way to define a goal in a way that is robust to ontological shifts" is not sufficient to address the overarching question "Can AGI be controlled to stay safe?". To state the problem comprehensively enough, you need to include the global feedback

Assuming an inner aligned AI system (that is, an AI system with no misaligned inner optimizers), if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way (specifically, what I have in mind is formally specified goals like QACI, since I expect that mathematical abstractions are robust to ontological shifts), then one can simply[1] provide this AI system this goal and allow it to do whatever it considers necessary to maximize that goal.

I do not believe this alignment strategy ... (read more)

1 Remmelt 1mo
This is not a sound assumption when it comes to continued implementation in the outside world. Therefore, reasoning based on that assumption about how alignment would work within a mathematical toy model is also unsound. https://mflb.com/ai_alignment_1/si_safety_qanda_out.html#p9 [https://mflb.com/ai_alignment_1/si_safety_qanda_out.html#p9]

Also intuitively, in the latter case 5 of the data points “didn’t matter” in that you’d have had the same constraints (at that point) without them, and so this is kinda sorta like “information loss”.

I am confused: how can this be "information loss" when we are assuming that due to linear dependence of the data points, we necessarily have 5 extra dimensions where the loss is the same? Because 5 of the data points "didn't matter", that shouldn't count as "information loss" but more like "redundant data, ergo no information transmitted".
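
For concreteness, a tiny numerical illustration of the situation both comments are describing (a linear model; the 20 parameters and 10 data points are arbitrary choices, not numbers from the post): linearly dependent data points add no new constraints, so the loss landscape gains exactly as many extra flat directions as there are redundant points.

```python
# Sketch: with linearly dependent data, the redundant points add no constraints on a
# linear model, so the number of flat (loss-preserving) parameter directions grows by
# exactly the number of redundant points. Dimensions here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n_params = 20

independent = rng.normal(size=(10, n_params))                           # 10 independent points
dependent = rng.normal(size=(10, 5)) @ rng.normal(size=(5, n_params))   # rank 5: 5 points "didn't matter"

for name, X in [("independent", independent), ("dependent", dependent)]:
    rank = np.linalg.matrix_rank(X)
    flat = n_params - rank          # directions along which the loss is unchanged
    print(f"{name}: rank={rank}, flat directions={flat}")
# independent: rank=10, flat directions=10
# dependent:   rank=5,  flat directions=15   (5 extra flat dimensions)
```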

2 Rohin Shah 1mo
I agree "information loss" seems kinda sketchy as a description of this phenomenon, it's not what I would have chosen.

Control methods are always implemented as a feedback loop.

Most of my foom scenarios do not involve humanity trying to retain control over an AI, but trying to align it such that when humanity loses control, the outcomes are still favorable for humanity (or at least, the seed that represents humanity is what is maximized).

2 Remmelt 2mo
Can you think of any example of an alignment method being implemented soundly in practice without use of a control feedback loop?

They are also not allowed to tell each other their true goals, and are ordered to eliminate the other if they tell them their goals. Importantly these rules also happen to allow them to have arbitrary sub goals as long as they are not a threat to humanity.

If we can steer an AI to an extent where they will follow such an arbitrary rule that we provide them, we can fully align AIs too with the tools we use to make it do such a thing.

Therefore A_n can properly align A_{n+1}. The base case is simply a reasonable human being who is by definition aligned.

... (read more)
1 APaleBlueDot 2mo
I think my point is lowering it to just there being a non-trivial probability of it following the rule. Fully aligning AIs to near certainty may be a higher bar than just potentially aligning AI.

Align with arbitrary values without possibility of inner deception. If it is easy to verify the values of an agent to a near certainty, it seems to follow that we can more or less bootstrap alignment with weaker agents inductively aligning stronger agents.

My bad. I'm glad to hear you do have an inside view of the alignment problem.

If knowing enough about ML is your bottleneck, perhaps that's something you can directly focus on? I don't expect it to be hard for you -- perhaps only about six months -- to get to a point where you have coherent inside models about timelines.

Part of the reason I’m considering getting a degree is so I can get a job if I want and not have to bet on living rent-free with other rationalists or something.

Yeah, that's a hard problem. You seem smart: have you considered finding rationalists or rationalist-adjacent people who want to hire you part-time? I expect that the EA community in particular may have people willing to do so, and that would give you experience (to show future employers / clients), connections (to find more part-time / full-time jobs), and money.

Now that I think about it

... (read more)
1 metachirality 2mo
I have a strong inside view of the alignment problem and what a solution would look like. The main reason I don't have an equally concrete inside-view AI timeline is that I don't know enough about ML and have to defer to get a specific decade. The biggest gap in my model of the alignment problem is what a solution to inner misalignment would look like, although I think it would be something like trying to find a way to avoid wireheading.

2050? That's quite far off, and it makes sense that you are considering university given you expect to have about two decades.

Given such a scenario, I would recommend trying to do a computer science/math major, specifically focusing on the subjects listed in John Wentworth's Study Guide that you find interesting. I expect that three years of such optimized undergrad-level study will easily make someone at least SERI MATS scholar level (assuming they start out a high school student). Since you are interested in agent foundations, I expect you shall find Joh... (read more)

3 metachirality 2mo
I've checked out John Wentworth's study guide before, mostly doing CS50. Part of the reason I'm considering getting a degree is so I can get a job if I want and not have to bet on living rent-free with other rationalists or something. The people I've talked to the most have timelines centering around 2030. However, I don't have a detailed picture of why because their reasons are capabilities exfohazards. From what I can tell, their reasons are tricks you can implement to get RSI even on hardware that exists right now, but I think most good-sounding tricks don't actually work (no one expected transformer models to be the closest to AGI in comparison with other architectures) and I think superintelligence is more contingent on compute and training data than they think. It also seems like other people in AI alignment disagree in a more optimistic direction. Now that I think about it though, I probably overestimated how long the timelines of optimistic alignment researchers were so it's probably more like 2040.

Sorry for the late reply: I wrote up an answer but due to a server-side error during submission, I lost it. I shall answer the interpretability question first.

Interpretability didn't make the list because of the following beliefs of mine:

  • Interpretability -- specifically interpretability-after-training -- seems to aim, at the limit, for ontology identification, which is very different from ontological robustness. Ontology identification is useful for specific safety interventions such as scalable oversight, which seems like a viable alignment strategy, bu
... (read more)

There seem to be three key factors that would influence your decision:

  • Your belief about how valuable the problem is to work on
  • Your belief about how hard it is to solve this problem and how well the current alignment community is doing to solve the problem
  • Your belief about how long we have until we run out of time

Based on your LW comment history, you probably already have rough models about the alignment problem that inform these three beliefs of yours. I think it would be helpful if you could go into detail about them so people can give you more specific advice, or perhaps help you answer another question further upstream of the one you asked.

4 metachirality 2mo
1. I think getting an extra person to do alignment research can give massive amounts of marginal utility considering how few people are doing it and how it will determine the fate of humanity. We're still in the stage where adding an extra person removes a scarily large amount from p(doom), like up to 10% for an especially good individual person, which probably averages to something much smaller but still scarily large when looking at your average new alignment researcher. This is especially true for agent foundations.
2. I think it's very possible to solve the alignment problem. Stuff like QACI, while not a full solution yet, makes me think that this is conceivable and you could probably find a solution if you threw enough people at the problem.
3. I think we'll get a superintelligence at around 2050.

Thoughts on Tom Everitt, Ramana Kumar, Victoria Krakovna, Shane Legg; 2019; Modeling AGI Safety Frameworks with Causal Influence Diagrams:

Causal Influence Diagrams are interesting, but don't really seem all that useful. Anyway, the latest formal graphical representation for agents that the authors seem to promote is structured causal models, so you don't read this paper for object-level usefulness but for the incidental research contributions, which are really interesting.

The paper divides AI systems into two major frameworks:

  • MDP-based frameworks (aka RL-based sys
... (read more)

When I referred to pivotal acts, I implied the use of enforcement tools that are extremely powerful, of the sort implied in AGI Ruin. That is, enforcement tools that make an actual impact in extending timelines[1]. Perhaps I should start using a more precise term to describe this from now on.

It is hard for me to imagine how there can be consensus within a US government organization capable of launching a superhuman-enforcement-tool-based pivotal act (such as three letter agencies) to initiate a moratorium, much less consensus in the US government or betwee... (read more)

Your question seems to focus mainly on timeline model and not alignment model, so I shall focus on explaining how my model of the timeline has changed.

My timeline shortened from about four years (mean probability) to my current timeline of about 2.5 years (mean probability) since the GPT-4 release. This was because of two reasons:

  • gut-level update on GPT-4's capability increases: we seem quite close to human-in-the-loop RSI.
  • a more accurate model for bounds on RSI. I had previously thought that RSI would be more difficult than I think it is now.

The latt... (read more)

Formatting error: "OK, I used to work for a robotics company, and I do think that one of the key obstacles for a hostile AI is moving atoms around. So let me propose some alarms!" should be quoted since it is not you (Zvi) writing that passage but the person you linked and are quoting.

Possible typos:

  • "I kind of feel like if you are the one building the DoNotPlay chat, [...]" should be "DoNotPay" instead.
  • "Joshua gets ten out of ten for the central point, then (as I score it) gets either minus a million for asking the wrong questions." the "either" is not followed by two objects

Spent about 45 minutes processing this mentally. Did not look through the code or wonder about the reliability of the results. Here are my thoughts:

  1. Why ask an AI to shut down if it recognizes its superiority? If it is never allowed to become too powerful for humans to handle, it cannot become powerful enough to protect humans from another AI that is too powerful for humans to handle.

Based on what I can tell, AP fine-tuning will make the AI more likely to simulate the relevant AP, and its tokens will be what the simulator thinks the AP would return next. This me... (read more)

1 MiguelDev 3mo
As discussed in the post, I aimed for a solution that can embed a shutdown protocol modeled on a real-world scenario. Of course it could have been just a pause for repair or debug mode, but yeah, I focused on a single idea: can we embed a shutdown instruction reliably? Which I was able to demonstrate.

As mentioned in the "what's next" section, I will look into this part once I have the means to upgrade my old Mac. But I believe it will be easier to do this because of the larger number of parameters and layers. Again, this is just a demonstration of how to solve the inner and outer alignment problems and corrigibility in a single method. As things get more complex in this method, utilizing learning rates, batching, epochs, and the amount of quality archetypal data will matter. This method can scale as the need arises, but that requires a team effort which I'm looking to address at the moment.

Sorry, I'm not familiar with LM gaslighting. Has RLHF solved the problems I tried to solve in this project? Again, this project is to demonstrate a new concept, not an all-in-a-bucket solution at the moment. But given that it is scalable and avoids all researcher/human, team, CEO, or even investor biases... this is a strong candidate for an alignment solution.

Also, to correct: I call this the archetypal transfer learning (ATL) method for the fine-tuning version. My original proposal to the Alignment Awards was to not use unstructured data, because alignment issues arise from that. If I were to build an aligned AI system, I will not use random texts that don't model our thinking. We think in archetypal patterns, and 4chan, social media, and Reddit platforms are not the best sources. I'd rather use books, scientific papers, or scripts from podcasts... long-form quality discussions are better sources of human thinking.

I want to differentiate between categories of capabilities improvement in AI systems, and here's the set of terms I've come up with to think about them:

  • Infrastructure improvements: Capability boost in the infrastructure that makes up an AI system. This involves software (PyTorch, CUDA), hardware (NVIDIA GPUs), operating systems, networking, and the physical environment where the infrastructure is situated. This is probably not the lowest-hanging fruit when it comes to capabilities acceleration.

  • Scaffolding improvements: Capability boost in an AI system th

... (read more)

Your text here is missing content found in the linked post. Specifically, the sentence "If one has to do this with" ends abruptly, unfinished.

Before reading this post, I usually would refrain from posting/commenting on LW posts, partially because of the high threshold of quality for contribution (which is where I agree with you in a certain sense), and partially because it seemed more polite to ignore posts I found flaws in, or disagreed with strongly, than to engage (which costs both effort and potential reputation). Now, I believe I shall try to be more Socratic -- more willing to point out, as politely as I can, confusions and potential issues in posts/comments I have read and found wanting, if ... (read more)

That is the biggest issue I have with your writings (and those of Zack too, because he makes the same mistake): you write too much to communicate too few bits of usefulness.

Given what Zack writes about, I think he has no choice but to write this way. If he was brief, there would be politically-motivated misreadings of his posts. His only option is to write a long post which preemptively rules those out.

(Sorry for triple reply, trying to keep threads separate such that each can be responded to individually.)

what the most serious weaknesses of your argument are

I claim that the LW of 2023 is worse at correctly identifying the most serious weaknesses of a given argument than the LW of 2018. 

Relative to the LW of 2018, I have the subjective sense that there's much much more strawmanning and zeroing-in-on-non-cruxes and eliding the distinctions between "A somewhat implies B," "A strongly implies B," and "A is tantamount to B."

I would genuinely expect that... (read more)

Pulling up a thought from another subthread:

Basically, I'm claiming that there are competing access needs, here, such as can be found in a classroom in which some students need things to be still and silent, and other students need to fidget and stim.  

The Socrati and the Athenians are not entirely in a zero-sum game, but their dynamic has nonzero zero-sum nature.  The thing that Socrates needs is inimical to the thing the Athenians need, and vice versa.

I think that's just ... visibly, straightforwardly true, here on LW; you can actually just see... (read more)

This response has completely sidestepped the crucial piece, which is to what extent [that kind of commentary] drives authors away entirely.

You're acting as if you always have fodder for that sort of engagement, and you in fact don't; enough jesters, and there are no kings left to critique.

Alignment agendas can generally be classified into two categories: blueprint-driven and component-driven. Understanding this distinction is probably valuable for evaluating and comprehending different agendas.

Blueprint-driven alignment agendas are approaches that start with a coherent blueprint for solving the alignment problem. They prioritize the overall structure and goals of the solution before searching for individual components or building blocks that fit within that blueprint. Examples of blueprint-driven agendas include MIRI's agent foundations, Va... (read more)

I think a better way of rephrasing it is "clever schemes have too many moving parts and make too many assumptions and each assumption we make is a potential weakness an intelligent adversary can and will optimize for".

6 Tamsin Leake 4mo
i would love a world-saving-plan that isn't "a clever scheme" with "many moving parts" but alas i don't expect it's what we get. as clever schemes with many moving parts go, this one seems not particularly complex compared to other things i've heard of.

by effectively generating such datasets, either for specific skills or for everything all at once

Just to be clear, what you have in mind is something to the effect of chain-of-thought (where LLMs and people deliberate through problems instead of trying to get an answer immediately or in the next few tokens), but in a more roundabout fashion, where you make the LLM deliberate a lot and fine-tune the LLM on that deliberation so that its "in the moment" (aka next token) response is more accurate -- is that right?

If so, how would you correct for the halluci... (read more)

4 Vladimir_Nesov 5mo
Chain-of-thought for particular skills, with corrections of mistakes, to produce more reliable/appropriate chains-of-thought where it's necessary to take many steps, and to arrive at the answer immediately when it's possible to form intuition for doing that immediately. Basically doing your homework, for any topic where you are ready to find or make up and solve exercises, with some correction-of-mistakes and guessed-correctly-but-checked-just-in-case overhead, for as many exercises as it takes. The result is a dataset with enough worked exercises, presented in a form that lets SSL extract the skill of more reliably doing that thing, and to calibrate on how much it needs to chain-of-thought a thing to do it correctly.

A sufficiently intelligent and coherent LLM character that doesn't yet have a particular skill would be able to follow the instructions and complete such tasks for arbitrary skills it's ready to study. I'm guessing ChatGPT is already good enough for that, but Bing Chat shows that it could become even better without new developments. Eventually there is a "ChatGPT, study linear algebra" routine that produces a ChatGPT that can do linear algebra (or a dataset for a pretrained GPT-N to learn linear algebra out of the box), after expending some nontrivial amount of time and compute, but crucially without any other human input/effort. And the same routine works for all other topics, not just linear algebra, provided they are not too advanced to study for the current model.

So this is nothing any high schooler isn't aware of, not much of a capability discussion. There are variants that look differently and are likely more compute-efficient, or give other benefits at the expense of more misalignment risk (because involve data further from human experience, might produce something that's less of a human imitation), this is just the obvious upper-bound-on-difficulty variant. But also, this is the sort of capability idea that doesn't destroy the property of L
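
A rough sketch of the shape of the loop described above, with the actual model calls left as injected callables since no particular API is specified; the function names and the dataset format are assumptions for illustration:

```python
# Sketch of "generate deliberation, keep what checks out, fine-tune on it".
# sample_exercise, solve_with_cot, and check_answer stand in for whatever model/API
# is used; they are hypothetical placeholders, not real library calls.
from typing import Callable, Dict, List, Tuple

def build_cot_dataset(
    sample_exercise: Callable[[str], str],              # topic -> exercise text
    solve_with_cot: Callable[[str], Tuple[str, str]],   # exercise -> (chain of thought, answer)
    check_answer: Callable[[str, str], bool],           # (exercise, answer) -> passed self-check?
    topic: str,
    n_exercises: int,
    max_attempts: int = 4,
) -> List[Dict[str, str]]:
    dataset = []
    for _ in range(n_exercises):
        exercise = sample_exercise(topic)
        for _ in range(max_attempts):                   # correction-of-mistakes overhead
            cot, answer = solve_with_cot(exercise)
            if check_answer(exercise, answer):
                # Keep both the worked deliberation and the distilled pair, so training can
                # teach when a long chain is needed and when to answer immediately.
                dataset.append({"prompt": exercise, "completion": f"{cot}\n{answer}"})
                dataset.append({"prompt": exercise, "completion": answer})
                break
    return dataset
```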

Sidenote: I like how OpenAI ends their blog posts with an advertisement for positions they are hiring for, or programs they are running. That's a great strategy to advertise to the very people they want to reach.
