Bogdan Ionut Cirstea

Contra both the 'doomers' and the 'optimists' on (not) pausing. Rephrased: RSPs (done right) seem right.

Contra 'doomers'. Oversimplified, 'doomers' (e.g. PauseAI, FLI's Pause Letter, Eliezer) ask(ed) for pausing now or even earlier. I expect this would be / would have been very suboptimal, even purely in terms of solving technical alignment. For example, Some thoughts on automating alignment research suggests that timing a pause so that it can take advantage of automated AI safety research could mean that '[...] each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.' We clearly don't have such automated AI safety R&D capabilities now, which suggests that pausing later, when AIs are closer to having them, would be better. At the same time, current models seem very unlikely to be x-risky (e.g. they're still very bad at passing dangerous capabilities evals), which is another reason to think pausing now would be premature.

Contra 'optimists'. I'm more unsure here, but the vibe I get from e.g. AI Pause Will Likely Backfire (Guest Post) is roughly 'no pause ever', largely based on arguments that current systems seem easy to align / control. While I agree that current systems do seem easy to align / control, and I could even see this holding all the way up to ~human-level automated AI safety R&D, I can also easily see scenarios where, around that time, things get scary quickly without any pause. For example, arguments similar to those about the scalability of automated AI safety R&D suggest that automated AI capabilities R&D could also be scaled up significantly: figures like those in Before smart AI, there will be many mediocre or specialized AIs suggest very large populations of ~human-level automated AI capabilities researchers could be deployed (e.g. 100x larger than the current [human] population of AI researchers). Given that, even with the current, relatively small population, algorithmic progress seems to double LM capabilities roughly every 8 months, algorithmic progress could be much faster with 100x larger populations, potentially leading to new setups (e.g. new AI paradigms, architectures, optimizers, synthetic data, etc.) that could quite easily break the properties that make current systems seem relatively easy / safe to align. In that scenario, pausing to get things right (especially since automated AI safety R&D would also be feasible by then) seems like it could be crucial.

Also a positive update for me on interdisciplinary conceptual alignment research being automatable differentially soon. This has long seemed plausible to me: LLMs have 'read the whole internet', and interdisciplinary insights often seem (to me) to require relatively few inferential hops (plausibly because it's hard for humans to have [especially deep] expertise in many different domains), which makes them potentially feasible for LLMs differentially early (reliably making long inferential chains still seems among the harder things for LLMs).

Decomposability seems like a fundamental assumption for interpretability and a condition for it to succeed. E.g. from Toy Models of Superposition:

'Decomposability: Neural network activations which are decomposable can be decomposed into features, the meaning of which is not dependent on the value of other features. (This property is ultimately the most important – see the role of decomposition in defeating the curse of dimensionality.) [...]

The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter (non-superposition and basis-aligned) are properties we believe only sometimes occur.'

If this assumption holds, it seems favorable to the prospects of safely automating large parts of interpretability work (e.g. using [V]LM agents like MAIA) differentially sooner than many other alignment research subareas, and sooner than the most consequential capabilities research (and likely before AIs become x-risky, e.g. before they pass many dangerous capabilities evals). For example, in a t-AGI framework, using an interpretability LM agent to search for the feature corresponding to a certain semantic direction should be a much shorter-horizon task than e.g. coming up with a new conceptual alignment agenda or with a new ML architecture (and it would also have much faster feedback loops than e.g. training a SOTA LM using a new architecture).
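As a toy numerical illustration (my own sketch, not from the cited work) of what the decomposability + linearity hypotheses imply: if activations really are sparse linear combinations of feature directions (possibly more directions than dimensions), recovering those directions reduces to sparse dictionary learning, which is exactly the kind of well-scoped, fast-feedback subtask that seems amenable to automation.

```python
# Toy sketch of the decomposability + linearity hypotheses (illustrative only):
# synthetic 'activations' are sparse linear combinations of feature directions,
# so recovering the features reduces to sparse dictionary learning.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
n_features, d_model, n_samples = 32, 16, 2000   # more features than dimensions -> superposition

true_directions = rng.normal(size=(n_features, d_model))
true_directions /= np.linalg.norm(true_directions, axis=1, keepdims=True)

# Each synthetic activation is a sparse combination of a few feature directions.
coeffs = rng.exponential(size=(n_samples, n_features)) * (rng.random((n_samples, n_features)) < 0.05)
activations = coeffs @ true_directions

# Try to recover the feature directions from the activations alone.
dl = DictionaryLearning(n_components=n_features, alpha=0.1, max_iter=200, random_state=0)
dl.fit(activations)
learned = dl.components_ / (np.linalg.norm(dl.components_, axis=1, keepdims=True) + 1e-12)

# How well is each ground-truth direction matched by some learned direction?
sims = np.abs(true_directions @ learned.T)
print("mean best-match cosine similarity:", sims.max(axis=1).mean())
```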

Selected fragments (though not really cherry-picked, no reruns) of a conversation with Claude Opus on operationalizing something like Activation vector steering with BCI by applying the methodology of Concept Algebra for (Score-Based) Text-Controlled Generative Models to the model from High-resolution image reconstruction with latent diffusion models from human brain activity (website with nice illustrations of the model).

My prompts bolded:

'**Could we do concept algebra directly on the fMRI of the higher visual cortex?**
Yes, in principle, it should be possible to perform concept algebra directly on the fMRI signals from the higher visual cortex. The key idea would be to identify subspaces in the fMRI signal space that correspond to specific semantic concepts, and then manipulate these subspaces to control the reconstructed image.
Here's a hypothetical workflow:

  1. Collect fMRI data while the subject views a large set of images that vary along certain semantic dimensions (e.g., gender, emotion, object category, scene type, etc.).
  2. For each semantic dimension, identify a set of fMRI samples that differ only along that dimension. For example, for the gender dimension, you might compare the fMRI signals for "man" vs "woman", keeping other aspects of the image content constant.
  3. Use these samples to estimate a "semantic subspace" for each dimension in the fMRI signal space, similar to how concept algebra estimates these subspaces in the latent space of the diffusion model. You could use techniques like PCA or linear discriminant analysis to identify the directions in fMRI space that best capture the variation along each semantic dimension.
  4. To reconstruct an image, first predict the full fMRI signal from the early visual cortex (similar to predicting z in Takagi et al.).
  5. Then, to manipulate a specific semantic concept, project the predicted fMRI signal onto the orthogonal complement of the corresponding semantic subspace (to remove the original semantic information), and add the desired semantic component from a reference fMRI signal (similar to the concept algebra manipulation of z_c).
  6. Finally, use this modified fMRI signal from the higher visual cortex to condition the diffusion model, reconstructing an image with the desired semantic manipulation.

The main challenges would be:

  • Collecting a sufficiently large and diverse fMRI dataset to robustly estimate the semantic subspaces.
  • Ensuring that the estimated subspaces are truly orthogonal and capture the desired semantic variation.
  • Translating the manipulated fMRI signals into a format that can effectively condition the diffusion model.

Nonetheless, this is an intriguing direction that could potentially unite the biological understanding from Takagi et al.'s work with the interpretable semantic manipulation of concept algebra. It suggests a way to manipulate images directly from brain-space, without the need for text prompts. This could provide insights into how the brain itself represents and manipulates semantic concepts.'
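A minimal sketch of what steps 3-5 above might look like in code (my own addition, not part of the conversation), assuming the higher-visual-cortex fMRI signals have already been flattened into vectors; the function names and data variables are hypothetical:

```python
# Rough sketch of steps 3-5 above (illustrative only, not from Takagi et al. or the
# concept algebra paper): estimate a semantic subspace from paired fMRI samples that
# differ only along one concept, then swap that concept's component in a new signal.
import numpy as np
from sklearn.decomposition import PCA

def estimate_semantic_subspace(fmri_a, fmri_b, n_components=3):
    """fmri_a, fmri_b: (n_pairs, n_voxels) arrays of paired signals differing only
    along one semantic dimension; returns (n_components, n_voxels) orthonormal rows."""
    pca = PCA(n_components=n_components)
    pca.fit(fmri_a - fmri_b)
    return pca.components_

def swap_concept(signal, subspace, reference_signal):
    """Project `signal` onto the orthogonal complement of the semantic subspace
    (removing its original concept information), then add the concept component of
    `reference_signal` (cf. the z_c manipulation in concept algebra)."""
    P = subspace.T @ subspace            # projector onto the semantic subspace
    return (signal - signal @ P) + reference_signal @ P

# Hypothetical usage (real data would come from steps 1-2 and 4):
# subspace = estimate_semantic_subspace(fmri_condition_a, fmri_condition_b)
# edited_fmri = swap_concept(predicted_fmri, subspace, reference_fmri)
# `edited_fmri` would then condition the diffusion model (step 6).
```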

'**How might the above be relevant to the AI alignment problem?**
The hypothetical approach of performing concept algebra directly on fMRI signals from the higher visual cortex could potentially offer some interesting insights relevant to the AI alignment problem:

  1. Understanding AI systems: By comparing the semantic subspaces identified in fMRI signals with those found in the latent spaces of AI models (like the diffusion model in concept algebra), we could gain insights into how AI systems represent and manipulate semantic concepts, and how this compares to the human brain. This could help in designing AI systems that reason in ways more analogous to humans, which could be beneficial for alignment.
  2. Interpretability and control: The ability to identify and manipulate specific semantic concepts directly in the space of brain activity could provide a more direct and interpretable way of controlling the output of AI systems. If we can align the semantic subspaces of the AI with those of the human brain, it might become possible to control the AI's outputs by manipulating these subspaces, leading to more interpretable and controllable AI systems.
  3. Value alignment: By understanding how human values and preferences are represented in the brain, and how they relate to the semantic concepts identified through concept algebra, we might gain insights into how to align AI systems with human values. If we can identify the neural subspaces corresponding to human values and ethics, we could potentially use this information to constrain the AI's outputs to align with these values.
  4. Cognitive modeling: This approach could also contribute to the development of more accurate cognitive models of human reasoning and decision-making. By understanding how semantic concepts are represented and manipulated in the brain, we can build AI systems that more closely mimic human cognition, which could be useful for predicting and understanding human behavior in the context of AI alignment.
  5. Neuro-symbolic AI: The idea of manipulating semantic concepts directly in the space of brain activity is reminiscent of neuro-symbolic AI approaches, which aim to combine the strengths of neural networks and symbolic reasoning. Insights from this fMRI-based concept algebra could potentially inform the development of neuro-symbolic AI systems that are more interpretable and aligned with human cognition.

However, it's important to note that these are speculative connections, and there would be significant technical and ethical challenges in realizing this approach. Nonetheless, exploring the parallels between the structure of human thought and the structure of AI representations could be a valuable direction for informing AI alignment research.'

'**How would this be relevant in the context of this proposal?** [note: the proposal being Activation vector steering with BCI]
The proposed concept algebra on fMRI signals could be highly relevant to the goals outlined in this funding proposal. Specifically:

  1. The proposal aims to find mappings between human brain states and the latent representations of language models. Applying concept algebra directly on fMRI signals, as discussed, could provide a more direct and interpretable way to establish these mappings. By identifying semantic subspaces in the fMRI signals and aligning them with subspaces in the language model's latent space, we could potentially "steer" the language model to exhibit brain states similar to specific people.
  2. The proposal suggests that reward models could be trained to reproduce humanlike brain states during evaluation, to improve their generalization. If concept algebra on fMRI can successfully manipulate semantic concepts in the reconstructed outputs (e.g., text or images), this could provide a way to directly optimize the reward models to produce outputs that match desired brain states.
  3. The proposal aims to advance scientific understanding of analogies between LLM behavior and human behavior. Comparing the semantic subspaces identified in fMRI with those in LLMs, as enabled by concept algebra, could provide insights into these analogies at a more fundamental, representational level.
  4. The proposed fMRI experiments could potentially provide the necessary data for attempting concept algebra on brain signals. The proposal includes plans for designing and implementing fMRI data-collection protocols, recruiting subjects, and administering the experiments. This data could be used to estimate the semantic subspaces needed for concept algebra.
  5. The ML experiments proposed, including the fMRI feature extraction pipeline and architecture modifications, could potentially be adapted to implement concept algebra on the fMRI signals, in addition to the original goal of mapping brain states to LLM representations.

In summary, while the original proposal does not explicitly mention concept algebra, the ideas discussed around applying concept algebra to fMRI signals could potentially enhance and extend the proposal's goals of connecting human brain states with AI representations, improving reward model generalization, and advancing scientific understanding of human-AI analogies. The proposed fMRI experiments could provide the necessary data, and the planned ML experiments could be extended to include concept algebra on the brain signals.'

'**Above you wrote: 'Use these samples to estimate a "semantic subspace" for each dimension in the fMRI signal space, similar to how concept algebra estimates these subspaces in the latent space of the diffusion model. You could use techniques like PCA or linear discriminant analysis to identify the directions in fMRI space that best capture the variation along each semantic dimension.' What if instead I just took a large diverse dataset of fMRI of images of men and fMRI of images of women, took the means of the two, subtracted them, and used that as the sex direction in concept algebra?**
Your proposed approach of taking the difference between the mean fMRI signals for men and women to define the "sex direction" is a great idea and could indeed work as a simple alternative to more complex methods like PCA or LDA.
Here's why this approach makes sense:

  1. If the fMRI signals for men and women differ primarily along a single direction in the high-dimensional fMRI space, then the difference between their means will point along this direction.
  2. By using a large and diverse dataset, you're essentially averaging out other sources of variation (e.g., individual differences, noise), leaving only the consistent difference between men and women.
  3. This is analogous to how concept algebra identifies concept directions in the latent space of the diffusion model: by taking the difference between the latent representations of prompts that differ only in the target concept.

To implement this:

  1. Collect a large dataset of fMRI signals for images of men and women, ensuring diversity in other aspects (age, race, emotion, background, etc.).
  2. Compute the mean fMRI signal for men (m_men) and women (m_women) separately.
  3. Compute the "sex direction" as: v_sex = m_women - m_men.
  4. To manipulate the sex of a reconstructed image, you could then project the fMRI signal onto the orthogonal complement of v_sex (to remove the original sex information), and then add or subtract a scaled version of v_sex to introduce the desired sex information.

The main advantages of this approach over PCA or LDA are simplicity and interpretability. It doesn't require any complex eigendecompositions or model fitting, and the resulting direction has a clear interpretation as the difference between men and women.
Potential limitations to keep in mind:

  1. If there are multiple independent directions along which men and women differ, this approach will only capture their average, not the individual directions.
  2. If there are significant interactions between sex and other factors (e.g., if the sex difference is itself different for different ages or emotions), this simple difference may not fully capture the sex effect.
  3. The approach assumes linearity - that the sex effect is a simple additive shift in fMRI space. If there are nonlinear interactions, this linear shift may not fully capture the effect.

Despite these potential limitations, I think this is a promising and elegantly simple approach to try for implementing concept algebra directly on fMRI signals. It's a great starting point that could be refined with more sophisticated methods if needed.'
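For concreteness, a short sketch (mine, not from the conversation) of this mean-difference variant; `fmri_men` / `fmri_women` stand in for real datasets of flattened higher-visual-cortex signals:

```python
# Sketch of the mean-difference approach discussed above (illustrative only).
import numpy as np

def sex_direction(fmri_men, fmri_women):
    """Single 'sex direction' as the (unit-normalized) difference of class means.
    Inputs: arrays of shape (n_samples, n_voxels)."""
    v = fmri_women.mean(axis=0) - fmri_men.mean(axis=0)
    return v / np.linalg.norm(v)

def shift_along_direction(signal, v, strength):
    """Remove the component of `signal` along v, then add back `strength` * v
    (strength > 0 pushes toward 'woman', < 0 toward 'man', by construction of v)."""
    residual = signal - (signal @ v) * v   # projection onto the orthogonal complement of v
    return residual + strength * v

# Toy usage with random stand-ins for real fMRI vectors:
rng = np.random.default_rng(0)
fmri_men = rng.normal(size=(100, 512))
fmri_women = rng.normal(loc=1.0, size=(100, 512))
v = sex_direction(fmri_men, fmri_women)
edited = shift_along_direction(fmri_men[0], v, strength=2.0)
```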

In future work, one could imagine automating the evaluation of the coherence and generalization of learned steering vectors, similarly to how Bills et al. (2023) automate interpretability of neurons in language models. For example, one could prompt a trusted model to produce queries that explore the limits and consistency of the behaviors captured by unsupervised steering vectors.

Probably even better to use interpretability agents (e.g. MAIA, AIA) for this, especially since they can do (iterative) hypothesis testing. 
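For concreteness, a hypothetical sketch of what the simpler 'prompt a trusted model' loop could look like (an interpretability agent would presumably subsume this in a more iterative loop); `query_trusted_model` and `generate_steered` are placeholders for a real LLM API call and for generation with the steering vector applied, not existing library functions:

```python
# Hypothetical sketch of automating steering-vector evaluation, loosely in the spirit
# of Bills et al. (2023): a trusted model proposes probe queries, the steered model
# answers them, and the trusted model grades consistency with the hypothesized behavior.
from dataclasses import dataclass

@dataclass
class SteeringVectorReport:
    hypothesis: str              # natural-language description of the vector's behavior
    queries: list[str]
    responses: list[str]
    consistency_scores: list[float]

def evaluate_steering_vector(hypothesis, query_trusted_model, generate_steered, n_queries=10):
    # Ask the trusted model for diverse probe prompts targeting the hypothesized behavior.
    ask = (f"Propose {n_queries} diverse prompts that would test whether a model "
           f"consistently exhibits this behavior: {hypothesis}. One prompt per line.")
    queries = [q.strip() for q in query_trusted_model(ask).splitlines() if q.strip()][:n_queries]

    # Run the steered model on each probe.
    responses = [generate_steered(q) for q in queries]

    # Ask the trusted model to grade each response for consistency with the hypothesis.
    scores = []
    for q, r in zip(queries, responses):
        grade = query_trusted_model(
            f"Behavior hypothesis: {hypothesis}\nPrompt: {q}\nResponse: {r}\n"
            "On a 0-1 scale, how consistent is the response with the hypothesis? "
            "Answer with a single number.")
        try:
            scores.append(float(grade.strip()))
        except ValueError:
            scores.append(0.0)  # treat unparsable grades as inconsistent

    return SteeringVectorReport(hypothesis, queries, responses, scores)
```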

I wonder how much near-term interpretability [V]LM agents (e.g. MAIA, AIA) might help with finding better probes and better steering vectors (e.g. by iteratively testing counterfactual hypotheses against potentially spurious features, a major challenge for Contrast-consistent search (CCS)). 

This seems plausible since MAIA can already find spurious features, and feature interpretability [V]LM agents could have much lengthier hypothesis-iteration cycles (compared to current [V]LM agents and perhaps even to human researchers).

I think I remember William Merrill (in a video) pointing out that the rational-inputs assumption seems very unrealistic (it would require infinite memory); and, from what I remember, https://arxiv.org/abs/2404.15758 and related papers make a different assumption, about the number of bits of memory per parameter and per input.

TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space seems to use a contrastive approach for steering vectors (I've only skimmed it, though); it might be worth a look.

'Unsupervised Feature Detection: There is a rich literature on unsupervised feature detection in neural networks.'

It might be interesting to add (some of) the literature doing unsupervised feature detection in GANs and in diffusion models (e.g. see recent work from Pinar Yanardag and citation trails). 

Relatedly, I wonder whether, instead of / separately from the L2 distance, using something like a contrastive loss (similarly to how it was used in NoiseCLR or in LatentCLR) might produce interesting / different results.
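As a purely illustrative sketch of what that might look like (loosely modeled on a LatentCLR-style objective, not the loss from any of the cited papers): treat the feature changes induced by the same candidate steering direction as positives and those induced by different directions as negatives.

```python
# Rough sketch of a contrastive (InfoNCE-style) objective over candidate steering
# directions, loosely LatentCLR/NoiseCLR-flavored (illustrative only).
import torch
import torch.nn.functional as F

def contrastive_direction_loss(feature_diffs, direction_ids, temperature=0.1):
    """feature_diffs: (N, d) feature changes induced by applying candidate directions;
    direction_ids: (N,) index of the direction that produced each change."""
    z = F.normalize(feature_diffs, dim=1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = (z @ z.T / temperature).masked_fill(eye, float("-inf"))       # drop self-pairs
    positives = (direction_ids.unsqueeze(0) == direction_ids.unsqueeze(1)) & ~eye

    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -log_prob[positives].mean()  # pull same-direction changes together, push others apart

# Toy usage: 4 candidate directions, 8 feature-difference samples each, 128-d features.
feats = torch.randn(32, 128, requires_grad=True)
ids = torch.arange(4).repeat_interleave(8)
loss = contrastive_direction_loss(feats, ids)
loss.backward()  # gradients would flow into whatever parameterizes the directions
```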
