A Conceptual Framework and Preliminary Proposals for AI Alignment and Safety in R&D


The present blog post serves as an overview of a research report I authored over the summer as part of the CHERI fellowship program, under the supervision of Patrick Levermore. In this project, I explore the complexities of AI alignment, with a specific focus on reinterpreting the Eliciting Latent Knowledge problem through the lens of the Comprehensive AI Services (CAIS) model. Furthermore, I delve into the model's applicability in ensuring R&D design safety and certification.

I preface this post by acknowledging my novice status in the field of AI safety research. As such, this work may contain both conceptual and technical errors. However, the primary objective of this project was not to present a flawless research paper or groundbreaking discoveries. Rather, it served as an educational journey into the realm of AI safety - a goal I believe has been met - and as a foundation for my future research. Much of this work synthesises and summarises existing research, somewhat limiting (but not eliminating) the novelty of my contributions. Nonetheless, I hope it can offer a fresh perspective on well-established problems.

You can find the complete report here

I welcome and appreciate any constructive criticism to enhance the quality of this and future research endeavours.

I would like to take this opportunity to thank my supervisor, Patrick Levermore, for his unwavering support throughout this journey. Special thanks also go to Alejandro, Madhav, Pierre, Tobias and Walter for their feedback and enriching conversations.

Chapter 1. AI Alignment

The first chapter of this work serves as an introduction to the complex and multifaceted field of AI alignment (an expert reader may want to skip this chapter).

Overall, the aim is to provide a foundational understanding of the challenges involved in aligning AI systems with human values, setting the stage for subsequent discussions about potential solutions and promising frameworks. 

Importantly, when searching for solutions to AI alignment, one must consider the scope of the tasks a system is designed to perform. Broad, long-term tasks amplify issues such as specification gaming, deceptive alignment, and scalable oversight. In this work, I argue that bounded, short-term tasks partially mitigate some of these difficulties, or at least render them more tractable. Such a consideration is valuable, as it may open up avenues for attacking the alignment problem.

Chapter 2. Rethinking Superintelligence: Comprehensive AI Services Model

The second chapter focuses on the Comprehensive AI Services (CAIS)[1] model. The objective is to provide a detailed summary of the original model, revise some points and delve into the essential aspects that I find pertinent to my project. I incorporate further insights from related sources and some of my own intuitions throughout this chapter.

Studies of superintelligent-level systems have traditionally conceptualised AI as rational, utility-directed agents, employing an abstraction resembling an idealised model of human decision-making. However, contemporary advancements in AI technology present a different landscape, featuring intelligent systems that diverge significantly from human-like cognition. Today’s systems are better understood by examining their genesis through research and development (R&D), their functionality and performance over a broad array of tasks, and their potential to automate even the most complex human activities.

Take GPT-4 as an illustrative case. While the conversational abilities of GPT-4 might suggest a human-like breadth of ability, it is crucial to acknowledge that, unlike humans, GPT-4 does not possess an inherent drive for learning and self-improvement. Its capabilities are a product of extensive research and development by OpenAI, a consequence of its training on a vast corpus of texts. It can indeed draft emails, summarise articles, answer questions, assist in coding tasks and much more. Nevertheless, these skills stem directly from the training process and are not born out of an intrinsic human-like desire or aspiration for growth and learning; rather, they are engineered features, crafted by human-led (possibly AI-assisted) R&D and honed through iterative design and training.

Importantly, this is not to downplay the potential dangers posed by such advanced AI systems; rather, the aim is to advocate for a more accurate conceptual framing of what these systems are, as such a perspective could open up more effective research avenues for addressing the AI alignment problem.

The current trajectory in AI research suggests an accelerating, AI-driven evolution of AI technology itself. This is particularly evident in the automation of tasks that comprise AI research and development. However, in contrast with the notion of self-contained, opaque agents capable of internal self-improvement, this emerging perspective leans towards distributed systems undergoing recursive improvement cycles. This lends support to the Comprehensive AI Services (CAIS) model, which reframes general intelligence as a property of flexible networks of specialised, bounded services.

In doing so, the CAIS model introduces a set of alignment-related affordances that are not immediately apparent in traditional models that view general intelligence as a monolithic, black-box agent. In fact, not only does the AI-services model contribute to the developmental and operational robustness of complex AI systems, but it also facilitates their alignment with human values and ethical norms through models of human approval. Specifically, the CAIS model allows for the introduction of several safety mechanisms, including the use of optimisation pressure to regulate off-task capabilities, as well as independent auditing and adversarial evaluations to validate each service's functionality. Furthermore, functional transparency and the monitoring of communication channels can mitigate the inherent complexity and opacity of AI algorithms and components, while enabling resource access control policies that can further constrain undesirable behaviours.

Nonetheless, recent advancements in AI, such as the advent of foundation models, challenge some of the premises of the original CAIS framework and suggest a shift towards a centralised model training approach, where a single, broadly trained foundation model is fine-tuned for specific tasks, enhancing efficiency and resource utilisation.

Lastly, the chapter explores the applicability of the CAIS model in realistic use-case scenarios, particularly in the context of design engineering for R&D.

In conclusion, the CAIS model has far-reaching implications. Not only does it reshape our understanding of advanced machine intelligence, but it also redefines the relationship between goals and intelligence. The model offers fresh perspectives on the challenges of applying advanced AI to complex, real-world problems and places important aspects of AI safety and planning under a new lens.

Chapter 3. Eliciting Latent Knowledge using CAIS

The aim of the third chapter is to reframe the Eliciting Latent Knowledge (ELK)[2] problem within the CAIS model and lay down some groundwork for aligning systems devoted to real-world applications. Emphasis is placed on ensuring the safety of R&D designs. While this chapter encompasses a significant portion of the novelty of my contribution, the discussion remains preliminary and mostly meant to inform future research.

The ELK problem in AI systems is an important concern in the broader context of AI alignment. This challenge arises from the need to understand and reveal the true inner beliefs of AI systems, a task that is especially complex when the AI is trained to produce outputs that generate a positive reward but that may not necessarily reflect its actual understanding of a given situation.

To address this issue, particularly in the context of R&D, this chapter introduces FROST-TEE, a cluster of services within the CAIS model whose goal is to ensure design safety. Unlike traditional AI systems that could be incentivised by end outcomes, FROST-TEE, through the alignment affordances enabled by CAIS, focuses on the honest evaluation of the safety of R&D designs.

Adhering to the principle of compartmentalisation, FROST-TEE adopts a two-component structure: a Design Analysis Model and a Safety Assessment Oracle. This separation aims to enable specialised, independent safety checks on each component, thereby reducing the risk of systemic errors and enhancing the reliability of the safety assessments. The Design Analysis Model conducts a comprehensive analysis of the given design, elucidating its various properties, including those that may bring potential vulnerabilities. A trusted third-party Safety Assessment Oracle then leverages this analysis to determine and certify the safety of the design.

An overview of the interpretability techniques that could be used to ensure the trustworthiness of the operations depicted above can be found in the appendix section. Among these, I develop some preliminary intuition for a novel technique for eliciting inner knowledge from a language model: SSTI (Semi-Supervised inference-time Truthful Intervention). SSTI is a semi-supervised method that integrates insights from unsupervised deep clustering and builds on the advantages of both Contrast-Consistent Search[3] (and Contrastive Representation Clustering[3]) and Inference-Time Intervention[4] while trying to overcome some of their limitations. In particular, the idea behind this technique is to employ a focal loss penalty term to enhance the label assignment mechanism. SSTI may offer an interesting avenue for future research, especially in combining aspects from supervised and unsupervised techniques to improve the robustness and discriminative representation of truth.
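To make the ingredients mentioned above a little more concrete, here is a minimal sketch of two published building blocks: the Contrast-Consistent Search objective for a single contrast pair[3], and the standard binary focal loss. This is not the SSTI objective itself; how SSTI combines these terms is not specified in this summary. The function names `ccs_loss` and `focal_loss` and the default values of `gamma` and `alpha` are assumptions chosen for illustration.

```python
import math


def ccs_loss(p_pos: float, p_neg: float) -> float:
    """CCS objective for one contrast pair.

    p_pos: probe probability that the statement is true.
    p_neg: probe probability that its negation is true.
    """
    # Consistency: the two probabilities should sum to one.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence


def focal_loss(p: float, label: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Standard binary focal loss for one example.

    Down-weights examples that are already classified confidently,
    so training concentrates on hard or ambiguous ones.
    gamma and alpha defaults are illustrative assumptions.
    """
    p_t = p if label == 1 else 1.0 - p
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma = 0` the focal loss reduces to a class-weighted cross-entropy, which is one way to see what the focusing term adds.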

Importantly, the frameworks and techniques proposed are largely theoretical at this stage, as they lack empirical validation. They are intended to provide a conceptual sketch, suggesting possible avenues for future research, but they are not by any means fully fleshed-out prototypes. Therefore, numerous open questions about the effectiveness, reliability and general applicability of these ideas remain to be addressed.

Nonetheless, FROST-TEE, by blending techniques from both cybersecurity and artificial intelligence, constitutes an attempt to offer a robust design safety verification system. The security mechanisms employed include Trusted Execution Environments (TEEs) for safe model execution, and cryptographic tags and signatures to ensure analysis integrity and assessment authenticity. These in turn enable the certification of safe designs while enhancing the overall trustworthiness of the safety assessment process. By reducing the risk of active deception and enhancing transparency, FROST-TEE aims to ensure that the AI system acts more as a truthful assistant than as a reward-seeking entity.


  1. K.E. Drexler. Reframing superintelligence: Comprehensive AI services as general intelligence. Technical Report 2019-1, Future of Humanity Institute, University of Oxford, 2019.

  2. Paul Christiano, Ajeya Cotra, and Mark Xu. Eliciting Latent Knowledge.

  3. Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2022.

  4. Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model, 2023.
