NinaR

Comments

A high-level theme that would be interesting to explore here is rules-based vs. principles-based regulation. For example, the UK financial regulators are more principles-based (broad principles of good conduct, flexible and open to interpretation). In contrast, the US is more rules-based (detailed and specific instructions). 
https://www.cfauk.org/pi-listing/rules-versus-principles-based-regulation

[Edit - on further investigation this seems to be a more UK-specific point; US regulations are much less ambiguous as they take a rules-based approach unlike the UK's principles-based approach]

It's interesting to note that financial regulations sometimes possess a degree of ambiguity and are subject to varying interpretations. It's frequently the case that whichever institution interprets them most stringently or conservatively effectively establishes the benchmark for how the regulation is understood. Regulators often use these stringent interpretations as a basis for future clarifications or refinements. This phenomenon is especially observable in newly introduced regulations pertaining to emerging forms of fraud or novel technologies.

I think a key idea referenced in this post is that an AI trained with modern techniques never directly “sees” / interfaces with a clear, well-defined goal. We “feel” like there is a true goal or objective, as we encode something of this flavour in the training loop - the reward or objective function, for example. However, in the end the only thing you’re really doing to the AI is changing its state after registering its output given some input, and ending up at some point in program-space. Sure, that path is guided by the cleanly specified goal function, but that function is not explicitly given to the resultant program.
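To make this concrete, here is a minimal sketch of a toy training loop (plain gradient descent fitting `y = w * x`; the data, learning rate and function names are all made up for illustration). The squared-error objective only ever shows up as a gradient that nudges `w`; the trained "program" that comes out is just a number, with no reference to the objective.

```python
def train(data, steps, lr=0.01):
    """Fit y = w * x by gradient descent on squared error (toy example)."""
    w = 0.0  # the model's entire "state"
    for _ in range(steps):
        for x, y in data:
            # The objective appears only here, as the gradient of (w*x - y)**2.
            grad = 2 * (w * x - y) * x
            w -= lr * grad  # all we ever do to the model is change its state
    return w


# Hypothetical training data drawn from y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(data, steps=200)


def model(x):
    """The resulting program: it holds w, but knows nothing about the loss."""
    return w * x
```

Nothing in `model` mentions squared error; the objective shaped the path to `w` and was then discarded.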

I do think “goal misgeneralisation” has a place in referring to the phenomenon that:

  1. In the limit of infinite training data and training time, the optimisation procedure should bring the model to an ideal implementation of the objective function encoded into its training loop
  2. Before this limit, the trajectory in program-space may be skewed away from the optimal program leading to unintended results - “misgeneralisation”

A confounder here is that modern AI training objectives are fundamentally un-extendable-to-infinity and so misgeneralisation is ill-defined. For example, “predict the next token in this human-generated text” is bounded by how much text humans generate, and “maximise the human’s response to X” is bounded by the number of humans and the number of interactions with X. Most loss functions make no sense outside of the type of data they are defined on, and so there exists no such thing as perfect generalisation, as data is by definition limited.

You could redefine “perfect generalisation” to mean optimal performance on the available data. However, as long as it is possible to produce more data at some point in the future, even a finite amount, this definition is brittle.

Agree with this post.

Another way I think about this is, if I have a strong reason to believe my audience will interpret my words as X, and I don’t want to say X, I should not use those words, even if I think the words are the most honest/accurate/correct way of precisely conveying my message.

People on LessWrong have a high honesty and integrity bar but the same language conveys different info in other contexts and may therefore be de facto less honest in those contexts.

This being said, I can see a counterargument: it is fundamentally more honest if people use a consistent language and don’t adapt their language to scenarios, as it is easier for any other agent to model and extract facts from truthful agent + consistent language vs truthful agent + adaptive language.

Walking a very long distance (15km+), preferably in a not too exciting place (e.g. residential streets, fields), while thinking, maybe occasionally listening to music to reset. Works best in daylight, but when it’s not too bright and sunny and not too warm / cold.

I wonder how clear it is that increasing average human BMI is bad. It seems very true that being obese is bad for health outcomes, but maybe this is compensated for by a reduction in the number of underweight individuals + better nutrition for non-morbidly-obese people. 

It seems like most/all large models (especially language models) will be first trained in a similar way, using self-supervised learning on large unlabelled raw datasets (such as web text), and it looks like there is limited room for manoeuvre/creativity in shaping the objective or training process when it comes to this stage. Fundamentally, this stage is just about developing a really good compression algorithm for all the training data.

The next stage, when we try and direct the model to perform a certain task (either trivially, via prompting, or via fine-tuning from human preference data, or something else) seems to be where most of the variance in outcomes/safety will come in, at least in the current paradigm. Therefore, I think it could be worth ML safety researchers focusing on analyzing and optimizing this second stage as a way of narrowing the problem/experiment space. I think mech interp focused on the reward model used in RLHF could be an interesting direction here.

Personal anecdote so obviously all n=1 caveats apply - I took light iron supplementation for a few months (one Spatone sachet per day) and it completely changed my life. Before, I could not run more than a mile, in 10 minutes, before collapsing. I got winded going up stairs, was often physically fatigued (although no other mental or non-fitness-related physical symptoms). After a few months of iron and no other lifestyle changes, I could run for an hour at 8 min/mile pace. Have stopped taking the supplements and benefits have sustained for 2 years. If you have mild iron deficiency, I really do suggest addressing it as the lifestyle gains could be really big, and I can recommend Spatone iron water as a delivery mechanism with fewer side effects.

This reminded me of a technique I occasionally use to explore a new topic area via some version of “graph search”. I ask LLMs (or previously Google) “what are topics/concepts adjacent to (/related to/similar to) X”. Recursing, and reading up on connected topics for a while, can be an effective way of getting a broad overview of a new knowledge space.
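A rough sketch of what I mean, as a breadth-first search with a depth cap. The `related` function is a stand-in for "ask an LLM / search engine for adjacent topics"; here it is faked with a small hard-coded dictionary, and all the names and topics are made up for illustration.

```python
from collections import deque


def explore(start, related, max_depth=2):
    """Breadth-first walk over topics, returned in the order discovered."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        topic, depth = queue.popleft()
        if depth == max_depth:
            continue  # don't expand topics at the depth limit
        for neighbour in related(topic):
            if neighbour not in seen:  # skip topics we've already visited
                seen.add(neighbour)
                order.append(neighbour)
                queue.append((neighbour, depth + 1))
    return order


# Hypothetical stand-in for the "what is adjacent to X?" query.
TOY_GRAPH = {
    "AI safety": ["interpretability", "reward modelling"],
    "interpretability": ["circuits", "AI safety"],
    "reward modelling": ["RLHF"],
    "RLHF": ["preference data"],
}

reading_list = explore("AI safety", lambda t: TOY_GRAPH.get(t, []), max_depth=2)
```

The depth cap matters in practice: without it the recursion wanders off into topics only loosely connected to where you started.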

Optimising the process for AIS research topics seems like it could be valuable. I wonder whether a tool like Elicit solves this (haven’t actually tried it though).

I wonder whether https://arxiv.org/pdf/2109.13916.pdf would be a successful resource in this scenario (Unsolved Problems in ML Safety by Hendrycks, Carlini, Schulman and Steinhardt)
