When people talk about automating knowledge work with AI, the conversation often gravitates toward models, tools, or agent frameworks. But if we look at how real work actually gets done inside organizations, there might be a simpler answer. Almost everything reduces to two ingredients: verification loops and documented context.
In software, we already know how to verify progress. Unit tests, integration tests, CI pipelines, benchmarks — they give us a tight feedback loop. You change something, tests run, and you immediately know whether things still work. But what does verification look like for actual enterprise work? The work that spans weeks or months, involves multiple stakeholders, and can't be reduced to a passing test suite?
The Context Problem
Software engineering itself is only partly about writing code. A lot of the real context lives elsewhere: conversations in shared chat channels like Slack or Teams, decisions made in meetings, product or design direction held in the heads of a handful of people, and the countless small clarifications that never make it into official documentation. So how do the agents we build gather all of that information?
I think we need to build a taxonomy of knowledge with clear update mechanisms. We’re essentially building a CRUD system for AI agents—defining explicit paths for how context gets created, read, updated, and deleted. Just like a database, we need well-defined paths for:
Creating new knowledge,
Reading relevant context,
Updating stale information,
Deleting or deprecating outdated assumptions.
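The four paths above can be sketched as a minimal in-memory store. This is only an illustration under simple assumptions (a flat key space, soft deletes via a `deprecated` flag); all names here are hypothetical, not an actual product's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class KnowledgeEntry:
    """One unit of documented context."""
    key: str
    content: str
    updated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    deprecated: bool = False

class ContextStore:
    """A toy CRUD store for agent context."""
    def __init__(self):
        self._entries = {}

    def create(self, key, content):
        if key in self._entries:
            raise ValueError(f"entry already exists: {key}")
        self._entries[key] = KnowledgeEntry(key, content)

    def read(self, key):
        entry = self._entries[key]
        if entry.deprecated:
            raise LookupError(f"entry deprecated: {key}")
        return entry.content

    def update(self, key, content):
        entry = self._entries[key]
        entry.content = content
        entry.updated_at = datetime.now(timezone.utc)

    def delete(self, key):
        # Soft-delete: deprecate rather than destroy, so history survives
        # and the operation can be rolled back.
        self._entries[key].deprecated = True
```

The soft-delete choice matters: as discussed below, deletion needs guardrails, and keeping the entry around with a flag makes rollback trivial.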
CRUD for Context: The Four Operations
Create is straightforward. New projects start, new technical decisions get made, perhaps with new conventions. The agent instantiates new knowledge entries in its context base.
Read is where most agents operate today: they plug into an organization’s knowledge base and extract relevant context to perform their task. Reading involves understanding which context is relevant for which task, and crucially, what you’re allowed to access.
Update requires us to define rules. There are two ways knowledge gets updated in practice:
Explicit updates are the deliberate ones. Imagine a command like /knowledge-update: authentication flow now requires 2FA for admin roles. The agent’s job is to parse this—which project are we talking about? what changed exactly? what are the second-order effects?—and execute the update operation.
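A minimal parser for such a command could look like the sketch below. The command name and the returned fields are illustrative; in practice the "to resolve" questions would be answered by the agent before the update is applied.

```python
import re

def parse_knowledge_update(command):
    """Parse a '/knowledge-update: <statement>' command into an update
    request the agent can act on. Returns None for unrelated input."""
    match = re.match(r"/knowledge-update:\s*(.+)", command)
    if not match:
        return None
    return {
        "statement": match.group(1).strip(),
        # Open questions the agent must resolve before executing:
        "to_resolve": [
            "Which project/entry does this affect?",
            "What exactly changed?",
            "What are the second-order effects?",
        ],
    }
```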
Implicit updates are more subtle. Sometimes we don’t have time to explicitly update knowledge. Chat discussions resolve ambiguities. Experiments reveal unexpected user behavior. Nobody explicitly records these changes. In these cases, agents need to synthesize signals from conversations, documents, and systems of record, then propose updates. And just like humans, agents should sometimes admit uncertainty. If something is unclear, the update should be tentative or trigger a follow-up question in the appropriate communication channel. Over time, this creates a continuously curated knowledge base instead of a static one.
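One way to operationalize "admit uncertainty" is a simple triage step: the agent attaches a confidence estimate to each implicitly-derived proposal, and anything below a threshold becomes tentative and triggers a follow-up in the source channel. The threshold and field names here are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class ProposedUpdate:
    key: str           # which knowledge entry this would change
    new_content: str
    confidence: float  # agent's own estimate in [0, 1]
    source: str        # e.g. a Slack thread or experiment report

def triage(update, threshold=0.8):
    """Apply high-confidence proposals; mark uncertain ones tentative
    and route a follow-up question to where the signal came from."""
    if update.confidence >= threshold:
        return "apply"
    return f"tentative: ask follow-up in {update.source}"
```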
The Delete operation, while important for keeping context manageable over long periods, also requires the same kind of guardrails we have for production systems: who authorized this? What’s the rollback plan? Is this context referenced by active projects? With proper rules, deletion becomes a controlled operation rather than a risk.
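Those three guardrail questions translate almost directly into a pre-delete checklist. A sketch, with hypothetical argument names:

```python
def guarded_delete(key, requester, approvers, active_refs, rollback_plan):
    """Run the guardrail checklist before deleting a context entry.
    Returns (ok, reason); the actual delete only happens when ok is True."""
    # Who authorized this?
    if requester not in approvers.get(key, set()):
        return False, f"{requester} is not authorized to delete {key}"
    # Is this context referenced by active projects?
    if key in active_refs:
        return False, f"{key} is still referenced by active projects"
    # What's the rollback plan?
    if not rollback_plan:
        return False, "no rollback plan provided"
    return True, "approved"
```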
The CRUD paths themselves become access-controlled. Creating new context about the Q4 roadmap? That might require product manager approval. Updating context about the database schema? That goes through the infra team. Deleting context about a deprecated service? Engineering lead sign-off required.
This is fundamentally a human-in-the-loop system. The model has to be aware of who these updates should be calibrated with. When there’s no clarity on a certain update, it needs to figure out how to get there—either by asking the right person, sharing its findings in Slack, or marking something as a tentative memory that needs further clarification. Because that’s what we humans do too.
Where Verification and Context Converge
Here’s where it gets interesting: verification loops and knowledge updates go hand in hand. Changing knowledge can directly lead to changes in verification.
Let’s say we’ve changed assumptions about part of our architecture or system design. Or maybe from user interviews, we discovered that a feature we’re building needs to be modified in a certain way. This directly impacts the codebase. We have to change tests. For ML systems, we have to change evals. The verification criteria themselves need to update based on new context.
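One lightweight way to wire this up is a dependency map from knowledge entries to the tests and evals that encode them: when an entry changes, its dependents are flagged stale for review. The map below is an invented example, not a real project layout.

```python
# Hypothetical map from knowledge keys to the verification artifacts
# (tests, evals) that encode assumptions from that knowledge.
DEPENDS_ON = {
    "auth-flow": ["tests/test_login.py", "evals/auth_eval.yaml"],
    "ranking-model": ["evals/ranking_eval.yaml"],
}

def stale_verifications(changed_keys):
    """Given changed knowledge entries, list the tests/evals that must
    be reviewed so verification criteria track the new context."""
    stale = set()
    for key in changed_keys:
        stale.update(DEPENDS_ON.get(key, []))
    return sorted(stale)
```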
Long Horizon Tasks
What about long horizon tasks? These processes usually play out over weeks. We roll out changes, run an A/B test, and after enough weeks with sufficient traffic, we can decide whether the observed changes were actually caused by our intervention. That informs our original decision.
How do the agents we build handle this? The answer circles back to documented context and verification loops. The file system serves as memory. Because the knowledge stays up to date, we can have pings that tell us: this experiment is still running, and these are the to-dos. Every day when the agent boots up, it knows it needs to check the progress of the test running on live traffic, figure out whether the results are conclusive enough, and use that answer to update the original decision.
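That daily boot check can be surprisingly simple: reload the experiment state from a file (the agent's memory) and decide the next action. The state schema and the traffic threshold here are assumptions for the sketch.

```python
import json
from pathlib import Path

def daily_check(state_path, min_samples=10_000):
    """On boot, reload experiment state from the file system and
    decide what to do today. State is a JSON file with at least
    'status' and 'samples' fields (an assumed schema)."""
    state = json.loads(Path(state_path).read_text())
    if state["status"] != "running":
        return "nothing to do"
    if state["samples"] < min_samples:
        return "keep waiting: not enough traffic yet"
    return "analyze results and update the original decision"
```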
This is a perfect area to involve humans. Sometimes there are nuances, or things didn’t pan out as expected. Why did that happen? This is where the agent should stop being passive and actively use team communication channels, share its findings, and proactively reach out to human stakeholders in the project.
The Benchmark Gap
Are there benchmarks measuring these kinds of tasks today? We have isolated benchmarks focused on code verification: popular ones like SWE-bench for code snippets, MLE-bench for ML tasks, and Terminal-Bench for terminal-related work. For memory, there’s a separate set such as ContextBench, LongMemBench, and so on.
But we need more realistic settings where memory, documented context, and verification loops intertwine just like they would in a real company. Internal tools vary widely. Knowledge is often multimodal. When onboarding someone new, we don’t just hand them text instructions; we point at screens, demonstrate workflows, and explain unwritten rules. Some benchmarks are emerging. One example is TheAgentCompany, which tries to simulate a company environment with storage, chat channels, and tasks that touch multiple surfaces. For example, you’re asked to clarify calculations from a financial sheet in a team channel, or reach out to the director of IT.
But I’d still say it’s very prescriptive. Real life has more ambiguities. It would be interesting to see how we can measure this aspect better.
Conclusion
At the end of the day, automating enterprise work isn't necessarily about having the smartest model or the most sophisticated architecture. Almost everything hinges on two things:
Do agents operate inside strong verification loops?
Do they maintain accurate, continuously updated documented context?
Models will improve. Tooling will evolve. Interfaces will change. But without these two foundations, automation breaks down in practice. With them, even imperfect agents become surprisingly useful, built on the same things that make human teams work.