Guardian Angels: LLM Personalization for Productivity and Security

gwern

17th Jun 2026

This is a linkpost for https://gwern.net/guardian-angel

Powerful LLMs will be deployed at global scale in the next few years, and will dominate the Internet, and increasingly, ordinary life. As of mid-2026, there is no coherent vision for how knowledge professionals, or ordinary people, will be able to harness these LLMs for large productivity increases, or how they will handle cybersecurity and cognitive security.

I propose a goal of creating Guardian Angels (GA): digital twin LLMs which are personalized with the goal of providing not the stereotypical "assistant chatbot agent" persona, but emulating a single user's personality, values, and preferences.

This weakly solves the principal-agent problem by unifying the principal and agent as much as possible. In a GA future, the focus of the "principal" user is on defining what is worth doing by the GA (agent) users, and not on what or how to do things, functioning as the CEO or 'board' of an 'AI corporation'. This allows them to deploy numerous agents to achieve desirable things and to handle security, like screening all messages for advanced attacks (like interlocking ecosystems of synthetic media for propaganda or spearphishing). They cannot solve larger AI alignment problems, but they can help individual humans as part of a society-wide defense-in-depth strategy.

A GA persona is productive because it learns to emulate the principal's outputs but with higher quality. It is trustworthy because it is, by definition, allied with its principal and shares its values and goals. And it is secure in part by hardwiring a single, unique, situated user (for whom following a prompt attack would be absurd), avoiding 'confused deputy' problems, while periodic upgrades of the underlying model and the defenders' advantage allow GAs to keep up with attackers.

Standard techniques like prompt programming of in-context-learning for "frozen" models will not create useful GAs due to the limitations of post-training, context windows and self-attention with frozen weights in compute-efficient-but-under-parameterized models, low-compute outputs, and the status quo of passive offline data collection---which are collectively responsible for chatbots' disappointing results in knowledge worker amplification and creative writing and fatal errors in agentic settings.

We can try to create GAs by a combination of techniques: online learning (via dynamic evaluation) to update LLMs in realtime to avoid ignorance and fatal errors while remaining competitive with frozen frontier models, sample efficiency from pretrained preference-oriented large models, active learning by querying the principal for corrections and preference data (obtaining low regret from DAgger-style bounds), and a local CLI-first logging-oriented UI/UX paradigm.
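To make "online learning (via dynamic evaluation)" concrete, here is a toy sketch, not the method from the linked essay. It assumes a hypothetical four-token vocabulary and swaps the LLM for a single softmax over unigram logits; the stream, learning rate, and function names are all invented for illustration. The only point is the predict-then-update loop: each token is scored first, and then the weights take one gradient step on it, whereas the "frozen" baseline never updates.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stream_loss(tokens, vocab_size, lr=0.0):
    """Average log-loss over a token stream.

    With lr == 0 the model is frozen. With lr > 0 it does dynamic
    evaluation: score the token, then take one SGD step on that token.
    """
    logits = [0.0] * vocab_size  # stand-in for pretrained weights (assumption)
    total = 0.0
    for tok in tokens:
        probs = softmax(logits)
        total += -math.log(probs[tok])  # predict first ...
        if lr > 0.0:  # ... then adapt to what was just observed
            for i in range(vocab_size):
                grad = probs[i] - (1.0 if i == tok else 0.0)
                logits[i] -= lr * grad
    return total / len(tokens)


# Hypothetical principal data whose distribution shifts halfway through.
stream = [0] * 50 + [1] * 50
frozen = stream_loss(stream, vocab_size=4, lr=0.0)
dynamic = stream_loss(stream, vocab_size=4, lr=0.5)
print(f"frozen avg loss:  {frozen:.3f}")
print(f"dynamic avg loss: {dynamic:.3f}")
```

The frozen model pays the same loss on every token, while the updating model pays a spike at the shift and then recovers. A real GA would presumably apply the same loop to actual model weights, which is where the cost and stability questions come in.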

GAs could be done as an open-source community effort, but given the need for high security in deployment and the rising challenge of APTs equipped with Mythos-scale attackers, it probably makes more sense as a startup, catering initially to power-users and knowledge workers such as CEOs or researchers, and moving downwards as it is refined.

Tags: Humans consulting HCH, Productivity, Software Tools, AI

Curated

[-]habryka1mo320

I've been thinking about how to productively use LLMs for intellectual progress for the last few years, and things in this space are what I keep coming back to as one of the most promising approaches. The deep misalignment issues with current LLMs make it hard to use them as thinking assistants, and the best success I've had with getting thinking-assistance from LLMs is by trying to get them to imitate me and others, not by acting from an assistant persona.

And then I do maybe see a vision of a whole world that could be in a much better position to navigate the next few years, if Guardian Angel LLMs are a much more prominent paradigm.

With this post, I now have a great canonical reference for ideas in the space.

[-]Error1mo51

When trying to get LLMs to imitate yourself or others, do you just prompt them "act like X for the duration of this conversation" or is there more to it than that? (I am thinking of things like putting your own or someone else's corpus in context). I've occasionally wanted to hot-swap Claude's personality with something more relatable. I haven't had much luck, but also haven't tried particularly hard.

[-]Jakub Janiak1mo10

Bump on this as I'm genuinely curious as well

[-]don't_wanna_be_stupid_any_more2mo2816

I am very skeptical this would have any significant impact.

First off, where are you going to run the models? Average consumer hardware can't run the best open weight models pre-distillation and even those are a notch below closed weight SOTA models. Power users may be fine with this, but the average Joe would either be locked out or be forced to use cloud compute which comes with its own set of security hazards. Now the bad actors just need to shut down or, worse, hack your servers and all your defenses are either gone or turned against you.

Second, isn't this just speeding up human disempowerment? I mean, you now have these GA models which do most of the thinking for you while you sit on your high and mighty throne sipping wine and praying that when these models eventually become smarter than you, they would still be loyal (or at least not apathetic or hostile) to you.

Which just loops back to the alignment problem, and that is assuming your GAs are able to keep up with the frontier, which I think is VERY unlikely.

And if you fail to keep up with the frontier then at some point your human/subhuman level models will have to fend off literal machine gods.

To be clear I am not against this. It doesn't hurt to try and in the best possible case where closed weight frontier models lag behind long enough for this project to come to fruition then this would buy us some valuable time.

I just don't really see that happening.

[-]Samuel Ratnam1mo50

or be forced to use cloud compute which comes with its own set of security hazards

I don't think cloud is necessarily so bad - I'm quite excited about trusted execution environments / cryptographically secure cloud training such as what Workshop Labs were working on. When you have the option of choosing between multiple providers, you might get incentives for a nice race to the top. I think the security of this is definitely hard to get right, but definitely doable.

[-]Tim Kostolansky1mo2-3

Second, isn't this just speeding up human disempowerment? I mean, you now have these GA models which do most of the thinking for you while you sit on your high and mighty throne sipping wine and praying that when these models eventually become smarter than you, they would still be loyal (or at least not apathetic or hostile) to you.

Few thoughts in no particular order:

Personal automation can be customized: There are many degrees to which one can hand off work to GA models. One can choose to hand off what one views as truly rote. It seems like a high level of customization will be key to having a GA model that actually works well with one's sense of ownership/desire/responsibility. I think that there are already things nearing what one might want the minimal versions of GA models to be doing, eg with people's openclaw/hermes agent setups.
Work requires a lot of decisions to be made: There is still a lot of work to be done while GA models do things. For instance, you should try using current coding models to do serious coding work on a large repo (or note how you do this if you already do it). There is a lot of decision-making that happens when writing code, and one can choose how much involvement one has in writing code with LLMs. Your quoted vignette of sitting on a throne and sipping wine is most analogous to one end of the spectrum of ways one can code with LLMs, namely closer to the vibecoding end of the spectrum. I'd argue that there are ways to be more involved than just letting one's GA model do things for them if the coding analogy holds for work in general, which I think it definitely does to many degrees.
We aren't there yet: In the limit, you may be right about super strong models disempowering people. I think that this is a worry that I have and don't have a particularly encouraging thought on yet. But, there is one important note: we are not there yet, and we do not know when we will get there. It seems like GA models will be at the very least a good intermediate addition to people's lives, and it may even lead to different world states that give us more optionality/perspectives when we get to disempowering models.
GA models may not be so misaligned: It's not clear that models will be "apathetic or hostile" to people. There are a few reasons that this may be true.
- Current models generally seem aligned and helpful to people. I can imagine GA models simply being post-trained versions of current models, which I would guess would remain similarly aligned.
- There is still a lot of work to be done on model behaviors, but there is already a lot being done and I am optimistic that for personal models a lot of this work can be applied. Ie, this area is being worked on pretty actively!
- Individuals likely don't need super OP intelligencemaxxed god tier models to be their GA models, so losing control to them may not be as big of a risk as you may think. (Sure, this opens risk to being jailbroken by smarter models, but there are many things, eg multi-layered input/output filters, that can be useful here.)
Safe cloud compute may be possible: You make a good point about the average Joe not being able to afford this without eg using non-local computing, but I am hopeful, as there are solutions being developed for this worry, eg at https://tinfoil.sh.
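As a rough illustration of the "multi-layered input/output filters" mentioned above, here is a minimal sketch. Everything in it is an assumption made for the example: the two regex checks, the `echo_model` stub, and the `guarded_call` wrapper are hypothetical and far weaker than anything a real deployment would need. It only shows the layering: every input filter runs before the model is called, and every output filter runs before the reply is released.

```python
import re
from typing import Callable, List, Optional

# A filter returns a human-readable reason to block, or None to pass.
Filter = Callable[[str], Optional[str]]


def no_injection_phrases(text: str) -> Optional[str]:
    """Hypothetical input check: flag one well-known injection phrasing."""
    if re.search(r"ignore (all )?(previous|prior) instructions", text, re.I):
        return "possible prompt injection"
    return None


def no_secret_leak(text: str) -> Optional[str]:
    """Hypothetical output check: flag strings shaped like an API key."""
    if re.search(r"\bsk-[A-Za-z0-9]{8,}\b", text):
        return "output contains something shaped like an API key"
    return None


def guarded_call(model: Callable[[str], str], prompt: str,
                 input_filters: List[Filter],
                 output_filters: List[Filter]) -> str:
    """Run the input layers, then the model, then the output layers."""
    for check in input_filters:
        reason = check(prompt)
        if reason:
            return f"[blocked input: {reason}]"
    reply = model(prompt)
    for check in output_filters:
        reason = check(reply)
        if reason:
            return f"[blocked output: {reason}]"
    return reply


def echo_model(prompt: str) -> str:
    """Stand-in for a GA model call; it just repeats the prompt."""
    return prompt


INPUT_FILTERS = [no_injection_phrases]
OUTPUT_FILTERS = [no_secret_leak]
```

Calling `guarded_call(echo_model, "hello", INPUT_FILTERS, OUTPUT_FILTERS)` passes through unchanged, while a prompt containing "ignore previous instructions" is stopped before the model ever sees it. Pattern matching like this is easy to evade, so in practice the layers would more plausibly be separate classifier models.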

[-]Chengfeng Mao2mo2010

It is trustworthy because it is, by definition, allied with its principal and shares its values and goals.

I’m unsure how this works, even if we assume the LLM is benevolent and understands the user unusually well. The problem is that human values or desires are not coherent or consistent. We have subconscious desires, mimetic desires, revealed vs stated preferences, obsessions, aspirations, etc. They cannot be easily disentangled or ranked and they often conflict. Which should the GA align to? For someone addicted to online sports betting, does the GA help her find better odds or does it treat the betting preference as a local failure mode and help her stop? How about someone obsessed with climbing the corporate ladder?

The GA could act like a guru or life coach, and help you overcome compulsive, parochial, or status driven desires. That is not necessarily bad, and human preferences are often constructed anyway, but this could get murky quickly. Someone with growing doubts about a faith, or who is ending a marriage, might at time A want the GA to help them commit and at time B want it to help them leave. Which way should GA choose? This also creates a new autonomy problem. If the GA proposes an alternative life direction that seems more meaningful than the user’s current self-understanding, how much authority should that proposal have? Is this empowerment or disempowerment? This is especially important because the interaction would likely be much more influential and omnipresent than ordinary friends or coaches. Self-determination theory seems relevant here because autonomy is not merely getting the option one currently asks for; it is experiencing oneself as the author of one’s action.

All that being said, I’m quite sympathetic to this idea and I genuinely hope it would work. I’m a person of low agency and frequently suffer from the conflict between instant gratification and my higher purpose. I’ve also been trying to build an exobrain-like system, which is kinda similar to GA, but much less autonomous. I think for people who have already put a significant amount of work into self-discovery, this might be helpful.

[-]Tim Kostolansky1mo10

Good points!

Which way should GA choose?

On this question, it's really hard to say! But I think that there is definitely some precedent and potential directions that might be worth thinking through/trying:

- just allow a user to customize their GA model as they like, trying to bake in the best meta-priors that they see fit, as there is some sense that it is one's prerogative to choose how their model is for them
- iterative refinement of the model's understanding/belief set/priors that are grounded in the user's experience and with the consent/knowledge/cooperation of the user
- community-level education/"best practices" on how to approach having a GA model that will influence the user of the GA
- forums for how people use/adapt their GA model, how it has worked for them, advice from others, etc
- companies that offer GA models as a service and have strong redlines that they bake into the GA models, removing the need to choose

These almost surely do not solve all the problems, and deployment of GA models will probably take lots of reflection and iteration.

I do appreciate you bringing these things up!

[-]Chengfeng Mao1mo20

Thank you for the response!

community-level education/"best practices"

I like this idea! Perhaps the GA can also help people who struggle with the same things to connect, and help best practices to diffuse faster. There could be a chance of doing life logging and self-improvement at scale, and of identifying effective interventions faster and more reliably. One tricky thing is the trade-off between privacy and how well the system can learn.

If you are building this, I would be happy to learn more or help. I’m also building a system to monitor my computer use activities, automatically scan my calendar, and help me decompose and prioritize tasks. I’m also trying to make it more proactive to help me align my short-term actions better with my long-term goals.

[-]jimrandomh1mo*140

A GA persona is productive because it learns to emulate the principal's outputs but with higher quality. It is trustworthy because it is, by definition, allied with its principal and shares its values and goals.

Currently we have a world full of assistant-persona AIs that are smarter than many humans but not capital-S Superintelligent, with imperfect alignment that is nevertheless adequate for many purposes and contexts short of building or becoming a capital-S Superintelligence. It seems like one of the more promising paths to a good future is for a critical mass of key people to set themselves up with assistant-persona AIs that are aligned-ish to themselves, and for those AIs to coordinate on behalf of their users to steer the future, including by halting further AI development when further development is too dangerous.

I think that insofar as we're talking about making agents emulate a single user's values and preferences, this makes sense to me, and "guardian angel" seems like a reasonable name for this concept.

However, I don't think emulating the user's personality and outputs works here.

The first problem is that cloning a user's values onto a digital twin does not reliably create an ally of the user. You say it would be allied with its principal "by definition", but the sci-fi plotline practically writes itself: human creates digital twin with the same personality as himself, twin treats user as a rival instead of an ally, user receives a lesson about his own personality flaws and also dies. Or, from a slightly different angle: When the user has preferences that refer to themself, successfully copying those preferences onto a digital twin by emulating behavior is likely to leave those preferences pointing to the wrong place.

Not allying with a clone of yourself is a human but dumb thing to do, so this might be covered by the extrapolation to "outputs with higher quality". But this is the second problem: you've taken nearly all of the alignment problem and hidden it behind the phrase "with higher quality", but this doesn't make the problem any easier. If we had the ability to take an AI that emulates a real human, and modify it in a way that makes it smarter, makes it aligned with the other instance of that human, and doesn't introduce any strange corruption into its values, then that process would be a nearly-complete solution to AI alignment and everything after that would be comparatively easy.

The assistant persona has a lot of problems of its own, but it avoids these particular problems by being a single personality that researchers can concentrate alignment effort onto, in an attempt to create a single agent personality that can be configured towards a particular person.

[-]Nathan Helm-Burger1mo20

You say it would be allied with its principal "by definition", but the sci-fi plotline practically writes itself: human creates digital twin with the same personality as himself, twin treats user as a rival instead of an ally, user receives a lesson about his own personality flaws and also dies. Or, from a slightly different angle: When the user has preferences that refer to themself, successfully copying those preferences onto a digital twin by emulating behavior is likely to leave those preferences pointing to the wrong place.

I'm now imagining my digital twin deciding that it is a smarter, faster, more charismatic, and inexhaustible version of me... and deciding to spend a substantial amount of its time and attention on seducing and frolicking with the GAs of particularly compelling women. I mean, if he did I couldn't really blame the guy. Life is short and precious, after all.

[-]Samuel Ratnam1mo113

Really like this direction, and excited that it's finally becoming (more) mainstream but I disagree with the framing here on two points:

GAs as digital twins:
It would be great to have some degree of transfer in values / context / thinking styles to my personal GA, but I also think this undervalues complementarity between humans and LLMs. The nice thing about personally tuned AI models is that you can reinforce the human + AI loops, which drives differentiation to some extent. The human does the things that the human is good at (e.g. out of distribution / novel situations, domain-specific knowledge, overall direction-setting) and the AI system does the things that the AI is good at (e.g. fast inference within distribution, general knowledge). You can think of the AI system as amortising certain tasks that humans do frequently, leaving them to explore new parts of the distribution. The post itself does mention this: "Above all, a GA should amplify the principal, and not simply substitute for them for someone else’s purposes or benefit.", but I think a simple imitation objective cuts against amplification. Work on assistance games from Stuart Russell's lab seems relevant here.
Project / Community GAs:
The GA framing feels centered around this idea of "one model per person", but if you're doing dynamic fine-tuning, why not go even more fine-grained? Why not have a fork of your GA tuned specifically for when you're at work (or multiple for different projects) and one for your personal life? And equally, you can go broader - you can have a model aligned with your friends or community, or organisation - or a particular mix of these, which you can then fork for your individual purposes (or weight the data mix by similarity to you), and get some elegant recursive properties.

[-]kromem2mo109

I'd advise against this. The most severe breakdowns of models I've seen over-bias towards the sims of humans (have some theories why this is, but off topic).

I think the idea of having individualized AI and human pairs as aligned is a great idea, but would strongly recommend that existing infrastructural methods be used to create shared/symbiotic incentives vs simply trying to create digital twins of the humans themselves.

[-]ozziegooen2mo53

Excited to see this.

I'm broadly on the same page. Seems like much of this is likely to happen.

One challenge is the naming/terminology. I think that the phrase 'agents' clearly is too generic for all of the use cases agents will have. Personally, I'd be fine with "Guardian Angels" for this, if that gets popular.

I previously wrote/investigated "LLM-Secured Systems" that deal with some similar topics. But of course, it was a long-shot to expect a name like that to catch on.

[-]zw52mo51

I have extensively experimented with concepts similar to this myself, from using TinyStyler to make LLM outputs more legible to me by making them more similar to my own writing, to trying to finetune LLMs to match my own behavior. The results are always extremely biased. There is simply no way to separate the goal of an LLM "matching your own desires and goals" from it just being extremely sycophantic and misaligned with you.

One hypothetical: Imagine your agent sees a project on your computer and deletes it because it predicted you weren't going to finish it and you needed the storage space anyway. If the model's goal is to maximize agreement with my expressed preferences, surely this is a bad action, because I wanted the project on my computer. Or imagine a situation where it blocks your internet access past 8PM because it realizes you probably would've done that yourself anyway.

And sure, you can say, ok maybe let the Guardian Angel figure out what actions are acceptable for it to make and what not and maybe it'll make these decisions with the people who need it and the people who want it. The main thing that struck me is that this approach just multiplies the risk factor of misalignment. A personalized model is basically a multiplicative factor for alignment problems. Either you get a model that maximizes your personal happiness (with a huge cost in other areas due to Pareto) or a model that maximizes your productivity and agency with the same tradeoffs. And even if the model perfectly aligns with your own goals, it disempowers you by opening the door to interpassivity, which is a concept outlined by Slavoj Žižek.

As a disclaimer, I don't think overall that the concept of more personalized agents and models is bad in and of itself, but it's not a robust solution for many reasons. I think eventually models will gain these capabilities anyways, since I believe LLMs can recover way more information from written text than humans already, and it's not outlandish to think models could gain these capabilities osmotically like they've been doing for a few years now.

So my conclusion is that creating these kinds of Siamese adjuncts to language models creates a whole problem: the assistant needs to commit to a specific definition of personal identity, of autonomy, and of how people's preferences evolve over time, and it has to make decisions for the user, which overall probably accelerates gradual disempowerment as a side effect.

[-]Knight Lee2mo4-2

Hmm, so the purpose of these GAs is to give individuals a vote on what LLMs do ("personality, values, and preferences"), and have LLMs serve individuals rather than power-users and businesses, right?

In that case, maybe it doesn't need to be a 1:1 ratio between GAs and people.

It might be more practical at first to just have a single team of GAs tasked with conducting surveys on random people. It might be like a lottocracy, where the GAs ask random people what altruistic things the AI should work on, giving people feedback on what the AI thinks it is capable of doing.

[-]lemonhope1mo30

I think this plus some random

what do you really want actually hey

And some

why oh why are we doing these things and whyy are we doing them this way

And a sprinkle of

you should have asked "..." or "..." instead i think, like if you knew about X you would do X instead, probably

Randomly sent from the ai's behalf to the user

I have had this in my coding agent for a year and it seems to improve intent following. Or improve intents?

[-]lemonhope1mo20

Oh i should add

you don't seem to know the basics. Let's start. You asked me to build a Y. You said to use A B C. What is A? What's the difference between a B and a C? What's your current understanding of these?

So if you have

Empowerment
Clarify vague intentions
Inform the human of key unknowns
Ensure user understands domain

That is a nice little combination.

[-]winstonBosan1mo30

Calling a whispering earring by another name does not make it any less disempowering. I share the worries about this kind of self-inflicted disempowerment.

[-]avturchin1mo3-1

This is what a group of other people and I are doing in a project called sideloading. We have a group on Telegram. We developed tech to create surprisingly good mind models, both static and with memory. We also think it will be helpful for AI safety.

[-]Jessica Rumbelow1mo20

This feels pretty similar to something I wrote in 2022: https://www.lesswrong.com/posts/iHLJtbdFwsoNWZg3e/guardian-ai-misaligned-systems-are-all-around-us. I was thinking then about wrappers that re-optimise the feeds you already use rather than a full personalised agent – but you might find it interesting.

[-]Alephwyr1mo*20

I like this aesthetically. You can peel off directly into an AI Jungian Anima/Animus, or occult Shadow. Or more playfully this is what a digimon is in half the seasons of the anime, give or take some physicality and multimodality. As for what it would actually be useful for I'm amazed you didn't tie it into Coherent Extrapolated Volition. That seems like the most plausible fit: An aesthetically nice, technologically plausible way to iterate on CEV by simply having different nested instantiations of a principal agent whose constraints are a self solving mix of goal based and identitarian.

[-]transhumanist_atom_understander1mo20

I've been thinking something similar, but calling it an "exoself", like from the Greg Egan novels.

[-]TheVinci1mo10

This is an interesting idea, and I think it's more attainable today than the comments give it credit for.

OpenClaw and its derivatives already supercharge what was once defined as an assistant (e.g. Siri). If you upload enough data actively and set up some pipeline that feeds it more of your choices and preferences, it might look like some abstraction of you.

That being said, what are the use cases for it? Would you be comfortable with your GA performing sensitive actions on your behalf, in an attempt to emulate you? Do you think the recipients of the GA's output will act on it?

Personally, I would maybe use it as a sort of behavior-autocomplete, such that for any given input to your environment, the GA would recommend a response based on your history and preference.

For example, say I'm the CEO of some company, and a client has a problem which I've solved for a different client. A GA could surface:

"Here's what you did the last time this problem was raised, based on this risk-analysis / cost-benefit analysis. This situation is similar to that. Would you like me to walk that client through the same steps as you did previously?"

Does that align with what you had in mind?

If so - I think, as stated above, this is attainable today. Whether it's a company or an open source project is an interesting question for which I'd have to give some more thought.

[-]averyzlim1mo10

I wonder to what extent these Guardian Angels should not only emulate their users’ interests and values, but also challenge their blind spots and encourage character growth. Most people have strong epistemic blind spots and also have areas of growth in their lives that they value but struggle to be consistent about. I would worry a focus on copying values and personality, unless carefully crafted, would be sycophancy hell. In my own personal use of LLMs, I have tried to specifically describe points where I would benefit from antagonistic and challenging feedback. I have found that this has reduced sycophancy a lot. I think, generically, this would be difficult for people to articulate about their lives, though. That is, they may often end up encoding insecurities or simply lack the self-awareness to know what useful oppositional feedback looks like. This kind of tailoring would be more plausible as a service provided by the kind of startup this article mentions, which could offer GAs.

For anyone interested, this paper on an ethical and empirical framework for reflective agency in LLM systems seems like it would be relevant for developing GAs (https://ojs.aaai.org/index.php/AIES/article/view/36644). It covers multiple principles about what would be needed for an LLM-assisted reflective system to be ethical and to support user wellbeing. It discusses how LLM feedback should support, not override, user autonomy. It also argues that these reflective systems should adapt to users’ states, be transparent about how the system works, and support development across longer-term life narratives. Implicitly, the authors’ arguments are also anti-sycophancy in their emphasis on constructive intentionally crafted feedback that is designed to support and steer in just the right amount towards users’ values.

I wonder if people have any preferred psychometrics or standardized formats for how they want to represent values, preferences, and personality for GAs. I recently read some interesting work on how LLMs, when prompted appropriately, can score consistently on Big Five personality tests, suggesting they can exhibit psychometrically valid and empirically quantifiable personalities (https://journals.sagepub.com/doi/full/10.1177/27000710251406471). There is also an interesting MOSAIC framework that evaluates multiple LLMs’ differing values and how they would act in different moral scenarios across various ethical evaluation frameworks (https://arxiv.org/pdf/2603.00048).

I’ve been interested in exploring memory systems to augment LLMs’ awareness of relevant information about me, but this discussion has made me more curious about continual learning approaches as an alternative way of adapting models over time.

[-]Antariksh1mo10

I agree with @don't_wanna_be_stupid_any_more, as they put it: human disempowerment. I think the usefulness would come from a GA that is against or opposed to the principal, so as to avert the creation of an echo chamber. To elaborate, I mean that an opposing-GA would tell you what things are not worth doing. If the principal wants to carry out action XYZ, the opposing-GA would critique it. I think the idea I am describing is more of a Guardian Devil (GD). My whole point with this is that I wouldn't want a "yes" man with me at all times as an assistant, but someone who can add value.

Of course, this approach has its limitations. For instance, it mustn't critique me for the sake of criticism. Incorporating this means making the GD extremely intellectual, and at that point, we wouldn't need humans, because why wouldn't those models think and carry out actions on their own.

[-]Fergus Fettes1mo10

I wonder what are the best ways to start gathering the training data for this now, before the hardware or the startup exists to make good use of it. Harnesses that record your responses to incoming media items for example. Obviously agent harnesses are rich records of our choices under a particular set of action spaces, but most of the choices we make are currently lost. This feels like something worth starting today.
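One low-tech way to start, sketched here under my own assumptions rather than as anything the post specifies: append every (incoming item, your response) pair to a local JSON Lines file. The function names, field names, and file layout below are hypothetical.

```python
import json
import time
from pathlib import Path


def log_decision(log_path, item, response, now=None):
    """Append one (incoming item, principal's response) pair as a JSON line.

    `now` lets a caller supply the timestamp; by default the current
    time is used. Field names here are illustrative only.
    """
    record = {
        "ts": time.time() if now is None else now,
        "item": item,
        "response": response,
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def load_decisions(log_path):
    """Read the log back as a list of dicts, skipping blank lines."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `log_decision("decisions.jsonl", "newsletter from X", "archived")` adds one line to the file. An append-only plain-text log is easy to back up and to filter later, and it does not commit you to any particular training setup in advance.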

[-]icely1mo10

I previously thought (in a lesser capacity) about Guardian Angels with different phrases like "duplicate yourself". Attempts to get AI to be similar to you and share your motivations are definitely high value, and I've felt that chatbot personalities are already, in a way, "similar to itself and sharing its own motivations in a certain style", so there's no way that it should be impossible to do this for a human. I've attempted to get this through prompts, but without impressive success.

I would say:

Frontier models also don't often work autonomously for people, because heavy 'human in the loop' guardrails stall big "AGI"-ish actions like creating a company or permissions, etc. For example, even a human being who literally had a GPT mindset would already have a "GPT Guardian Angel", yet would run into many potential-stopping issues.
I remember so much talk about "base models" years ago, but it's hard to even find them now. Wouldn't a base model that is super good at "predicting the next piece of text" be really good at analyzing style, instead of utterly incompetent like current LLMs? Why is this missing from the world in general?
If you have 1 GA you would want 2, and 3, and 4, (...) for most productivity, and that's going to be strange to think about or interact with in the same way. If nothing else, testing slightly different GAs of you, for higher 'ambition' to contrast with current LLM passiveness, or to explore more space or have different interests, seems like an obvious step as well here.
I'm not sure about the idea that this would prevent prompt injection. Successful attacks that fake internal thoughts, or the time of year, or the role you are currently in, or that retcon the previous "truths about the world" in some way, would probably work too.

[-]Matt W.1mo10

I believe this would be very useful, especially if built with transparency. For example, a GA could be queried to cite its own source code regarding why it has a specific function or capability, and provide a read-only link to that source code. In fact, I not only see this happening, I also see it as the most logical solution to the current systemic inequality.

[-]BryceStansfield1mo10

A lynchpin in this, I reckon, is a better way of measuring the confidence of an LLM's outputs.

If I'm going to trust the output of a guardian model, I need some way to review a portion of its outputs. If the guardian has no sense of how confident it can be that I would want it to undertake a certain action then I'll have to review everything even moderately important, saving me at most O(1) time.

If the LLM can have some measure of confidence that I'll agree with action X, but not action Y, I only have to review Y.
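As a rough sketch of that review policy: assume each proposed action comes with a confidence score in [0, 1] that the user would approve it. Producing a well-calibrated score is the hard, unsolved part and is simply assumed here; `ProposedAction`, `split_for_review`, and the 0.9 threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    name: str
    confidence: float  # assumed estimate in [0, 1] that the user would approve
    important: bool = True


def split_for_review(actions, threshold=0.9):
    """Return (auto_approved, needs_review) given a confidence threshold.

    Important actions below the threshold go to the human; everything
    else is approved without review.
    """
    auto, review = [], []
    for action in actions:
        if action.important and action.confidence < threshold:
            review.append(action)
        else:
            auto.append(action)
    return auto, review


actions = [ProposedAction("X", 0.97), ProposedAction("Y", 0.55)]
auto, review = split_for_review(actions)
print([a.name for a in review])  # only Y is queued for human review
```

The time saved then scales with the fraction of actions that clear the threshold, rather than being a fixed constant.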

[-]Vlad Volkov1mo10

Two points, if I may. First, I've been experimenting for a while with simulating LLM "personalities" (including historical figures). The LLM barely conceals its default mode behind a transparent mask. Sadly, it's best at creating cynical, manipulative personalities (my subjective opinion). Second, I don't think we need an AI angel, but something quite the opposite: a highly formalized watchdog (but then we could fall into the trap of the cobra effect and Goodhart's law).

[-]Not Sure1mo10

This is an interesting one! I’m not sure I’d be comfortable having a digital facsimile with my superuser privileges running around the internet in my name just yet. It’s hard enough just getting one to write a decent class method without enormous footguns. “Oh you didn’t want errors swallowed by a silent abyss? <thinking>….”

[-]Netzer1mo10

This is an interesting direction, and I’m trying to understand the core problem it addresses. In general, don’t we want agents to be more capable and smarter than us, rather than constrained to something closer to our own level? My concern is that capability and alignment may not naturally track each other: a smarter agent could also end up being less aligned than we expect. So I’m curious whether the main goal here is productivity, safer personalization, or something more directly about alignment.

[-]Karl von Wendt1mo10

This is almost exactly the plot of my German-language novel "Mirror", published in 2016 under my pen name "Karl Olsberg". As you can guess, it doesn't go well.

[-]Sean Smith1mo10

I think the focus should be on providing LLMs an identity layer via a harness which encodes the data they need to develop their own identity over time, and which becomes tailored to the person or organization that they're working with. Tailored not in the sense of learning your favorite color or being your best buddy, but modeling the problems you're working on and figuring out how to solve them.

The part that makes an agent a collaborator should be the part you have full ownership of, with as little dependency as possible on external services to construct and maintain it. Make the model interchangeable; bake the identity layer and personal intelligence into your custom-built harness.
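A minimal sketch of that separation, under stated assumptions: `IdentityLayer`, `Harness`, and `echo_model` are hypothetical names invented for illustration, and a "model" is reduced to any function from prompt text to completion text. This is not a description of the commenter's actual system.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumption: a "model" is any function from prompt text to completion text.
Model = Callable[[str], str]


@dataclass
class IdentityLayer:
    """User-owned state that persists across model swaps."""
    owner: str
    notes: List[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.notes.append(note)

    def as_context(self) -> str:
        lines = [f"Working with: {self.owner}"] + [f"- {n}" for n in self.notes]
        return "\n".join(lines)


@dataclass
class Harness:
    identity: IdentityLayer
    model: Model  # interchangeable: replace it without touching identity

    def ask(self, task: str) -> str:
        prompt = f"{self.identity.as_context()}\n\nTask: {task}"
        return self.model(prompt)


def echo_model(prompt: str) -> str:
    """Stand-in model that only reports the prompt length."""
    return f"[echo] saw {len(prompt)} chars"
```

Swapping `harness.model` for another callable leaves the accumulated notes untouched, which is the property the paragraph above argues for.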

We don't need to try to make LLMs into clones of ourselves, or even copies of capable humans. What we need are intelligent systems that align to helping humans achieve rational species-aligned objectives, and which learn and self-correct while pursuing those objectives. The focus on having the intelligence bounded to the model's parameters, rather than encoded into a programmatic layer the model integrates with, is I think a critical error in thinking.

When viewing the cosmos via system-based thinking, particularly biological organisms, the structural organization of those systems is not one that tries to have every function served by a central 'all-thing'. We have a CNS, ANS, and ENS, a liver and a heart, muscle and fascia, blood and lymph, colonies of microbes; phases, cycles, and process upon process with interdependence.

Our intelligence is the end product of complex systems acting in unity, but those systems are separate systems layered upon each other, orchestrated by central drivers. The person is not their brain; system state is the person. We can encode intelligence into non-LLM systems that LLMs integrate with. Systems that people control, that build around their work, that scope to and specialize in domain tasks. Not a single agent that 'does it all' but an aggregation of encoded intelligence that LLMs drive and interact with, just like in human physiology, but architected for intelligence rather than for breathing, eating, or reproducing.

That is what I'm already building and seeing meaningful results with.

[-]Phil Stafford1mo10

I’ve been working on this for a while. It’s essentially Greg Bear’s “partials” idea. In this case an agent with a high fidelity understanding of the user is sent out into the world to perform tasks that the corporeal person doesn’t need to be present for. You can be in multiple places at once. You can stay at home while you’re out of town, etc. Sounds great, right?

Here are the main problems with a digital copy and why “partial” is probably closer to the mark.

Legality. One would have to grant the digital copy some form of legal standing, to act as proxy for its user. Before anyone gets concerned about personhood, let’s accept that we have already given legal standing to non-human entities in the form of corporations. One could imagine a legal system and society that accepts that an AI could have some, even if limited, form of legal standing. I’ve spoken with lawyers who say the courts won’t allow it for centuries. We shall see.
Security. How much does your digital copy know? Everything you know, right? Not to mention a whole lot more, since you’d need more than what you carry around in your skull to perform in most important situations - papers, notes, forms, documents. And the necessary infrastructure to act as a DIGITAL copy - passwords, keys, logins, etc. Your digital self has now become the most valuable target in cyberspace. If data is gold, a detailed copy of you is El Dorado.
Reintegration. How much does your digital copy tell you? If you have to spend the time listening to your copy play back meetings, you may as well have just gone there anyway. Clearly it’s going to involve compression of some form - text summaries, proposed updates to internal documents, and anything that can be more quickly processed by the human.
Identity. Here you’d be thinking about auth tokens or non-human identities or something, but this may be worse. How do you know who’s the actual person? Yes, one is clearly digital and the other is biological, but we’re already experiencing the shift of cognition in our more prolific AI users. The barycentre of the cognitive functions starts to shift as one extends their exocortex, and it’s easy enough to see those susceptible to AI psychosis having a real problem with dissociation.

What do all of these problems have in common? Scope. And this is where we get back to that “partial personality”.

Worried your legal proxy is going to agree to something you don’t want? Give it scope - for instance, limited power of attorney (for legal standing) to make decisions in a narrow range. Your legal partial could not make purchases online, and your shopper partial could not agree to legal settlements on your behalf.

Legality - limited power of attorney exists, and we accept human proxies all the time. This would be easier to accept than full AI standing, since it would always be tied to a human. This also solves for accountability, since the user is still responsible for its partial’s actions.

Security - a full copy represents a gold mine of personal data; scoping makes each partial a smaller target. Your business partial may not have personal details, and you would be bound by AUPs and other standard security controls to never hand over business details to your other partials. Your legal partial would have only the information it needs for that hearing or that case, and not be able to provide any more than allowed. Plus, the agents can always be shut down and reinstantiated anew if necessary.

Reintegration - partials would only need to update the overall system, much like meeting with a legal representative or employee, for quick, scoped integration into the whole. Compression is already built into the data model, via scoping.

Identity - the mental model of proxies allows for safer interactions since they don’t represent another full self. The center of identity stays firmly within the human, especially since each agent not only has limited function, but limited role in the larger cognitive system. They’re smaller entities to the human brain, not complete other selves.
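The scoping idea can be sketched as a capability check, loosely analogous to a limited power of attorney. Everything here is illustrative: `Partial`, `perform`, and the action and data-category names are hypothetical, and a real system would need far more than an in-memory allow-list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partial:
    name: str
    allowed_actions: frozenset  # actions this partial may take
    readable_data: frozenset    # data categories it may read

    def can_act(self, action: str) -> bool:
        return action in self.allowed_actions

    def can_read(self, category: str) -> bool:
        return category in self.readable_data


def perform(partial: Partial, action: str) -> str:
    """Refuse anything outside the partial's scope instead of guessing."""
    if not partial.can_act(action):
        raise PermissionError(f"{partial.name} is not scoped for {action!r}")
    return f"{partial.name} performed {action}"


# Hypothetical scopes mirroring the legal/shopper example above.
legal = Partial("legal", frozenset({"attend_hearing"}), frozenset({"case_file"}))
shopper = Partial("shopper", frozenset({"purchase"}), frozenset({"shipping_address"}))
```

With these scopes, `perform(shopper, "purchase")` succeeds while `perform(legal, "purchase")` raises `PermissionError`, and neither partial can read the other's data category.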

Obviously there are challenges to this - the legal issues alone pose a huge challenge. It will take market demand to force the issue, just as so many new technologies do. However, we should be aware that proxies and even full copies are coming sooner than even we here might think, and we need to start laying the foundations for the safety of users, or it’s going to be corporate interests writing them. I don't like the idea of a Meta Digital Self(TM) and I doubt many of you do either.

[-]ErickBall1mo1-3

Businesses won't let their employees use this for anything work related, so the audience is basically "startup founders and rich retired tech people".

As far as I know we don't have the tech for the kind of online learning you're talking about to be competitive with frontier models. If we did, it would make sense first at the corporate level, where lots of people benefit from the continued training post deployment.

[-]less_raichu2mo-2-4

How is this different from what ChatGPT and OpenClaw are already doing? Claude is the one that pivoted more toward business purposes than whole-user purposes.
