I applied audit standards to LLM agents. It reliably exposes hidden assumptions.
I am an accountant by trade. About a month and a half ago I started using LLM agents to automate bookkeeping duties. It went great at the start: the LLM quickly had the site built, authentication and all, and I was getting excited. Then I found a bug: I would get randomly logged out. I spent a day watching the AI run in circles trying to fix it, without success. I was no help, since I had no idea what code the AI had written, so I decided to get to the bottom of it, fix it, and figure out how to prevent it from happening again.
One foundational rule in auditing is that everything has to be verified; you can't just assume it is true. With AIs I found this to be especially important: you have to check everything they do in a very systematic way. That would be impractical for a human to do manually, but with an LLM's help it can actually be done quite effectively.
This is how I ended up with RED (Recursive Execution Decomposition). The name was AI-generated; otherwise I'd be calling it Why The H Did You Make That Assumption! (WTHDYMTA! for short).
Jokes aside, the idea behind RED is really simple. You take a task the LLM did, or plans to do, and ask it to drill down into all the actions it would take to complete that task, then match each action to the resources or knowledge it requires. (The knowledge part is very important; I fell into another hole because my earlier version didn't track knowledge, but that's a story for another time.) Once they are matched, I ask the LLM whether each one can be verified, and if so, how the verification can be replicated. This step is crucial: as we all know, LLMs hallucinate, so if a verification cannot be demonstrated, the resource or knowledge is assumed missing. That's the basics of RED. However, because I was so frustrated at the time, I asked the agent to drill each action down five levels, so I ended up with a huge tree of what I call atomic actions, each matched to resources and knowledge with a checkmark or an X beside it. The first time I did this I ended up with 120+ atomic actions, and 70% of them were Xs. That's when I went: OK, I understand now why you failed. I forgive you.
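To make the drill-down concrete, here is a minimal sketch in Python of what one slice of such a tree might look like: each action node carries the resources and knowledge it depends on, with a flag for whether each dependency could actually be verified, and a walk collects every X. The node names, fields, and the example dependencies are my own illustration, not part of any formal RED spec.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """One node in a RED-style decomposition tree."""
    name: str
    resources: dict = field(default_factory=dict)   # resource  -> verified?
    knowledge: dict = field(default_factory=dict)   # fact      -> verified?
    children: list = field(default_factory=list)

def collect_unverified(node, path=()):
    """Walk the tree and return every dependency marked with an X."""
    here = path + (node.name,)
    missing = []
    for kind, deps in (("resource", node.resources), ("knowledge", node.knowledge)):
        for dep, verified in deps.items():
            if not verified:
                missing.append((" > ".join(here), kind, dep))
    for child in node.children:
        missing.extend(collect_unverified(child, here))
    return missing

# A toy slice of the "stay logged in" task from the bug above (hypothetical).
tree = Action("keep user session alive", children=[
    Action("issue session cookie",
           resources={"secret key configured": True},
           knowledge={"cookie expiry policy": False}),
    Action("validate session on each request",
           knowledge={"token refresh behaviour": False,
                      "server restart wipes in-memory sessions?": False}),
])

for action_path, kind, dep in collect_unverified(tree):
    print(f"X  [{kind}] {dep}  (under: {action_path})")
```

Even in this tiny example, three of four dependencies come back as Xs; at five levels deep over a whole feature, that is how 120+ atomic actions pile up.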
Once all the Xs were tallied, I asked the LLM to summarize what all the missing resources and knowledge meant. Many are duplicates, but it usually ends up being a list of missing dependencies, assumptions that need to be locked down, or function parameters/arguments that need to be made explicit. (It is quite amazing that something as basic as write_file, when drilled down deep enough, still yields a lot of missing parameters.) Once the summary is complete, you have a list of tasks that you or the LLM need to do to make everything explicit.
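The deduplication part of that summary is mechanical enough to sketch: group the Xs by the dependency they name, so dozens of atomic-level Xs collapse into a short list of distinct gaps, with the most-repeated gap first. Again, this is a hypothetical Python sketch with made-up example data, not the author's actual tooling.

```python
from collections import defaultdict

def summarize_gaps(missing):
    """Group X-marked dependencies by (kind, name); many atomic actions
    usually share the same few underlying gaps."""
    grouped = defaultdict(list)
    for action_path, kind, dep in missing:
        grouped[(kind, dep)].append(action_path)
    # Most-repeated gaps first: resolving those closes the most Xs at once.
    return sorted(grouped.items(), key=lambda kv: -len(kv[1]))

# Hypothetical Xs gathered from a decomposition run.
missing = [
    ("login > issue cookie",    "knowledge", "cookie expiry policy"),
    ("login > refresh token",   "knowledge", "cookie expiry policy"),
    ("logout > clear cookie",   "knowledge", "cookie expiry policy"),
    ("save entry > write_file", "resource",  "target directory path"),
]

for (kind, dep), actions in summarize_gaps(missing):
    print(f"{len(actions)}x [{kind}] {dep}")
```

Here four Xs reduce to two distinct gaps, and the count tells you which assumption to lock down first.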
I think AIs are a very powerful tool, and even after just a month of intensive use I can't imagine going back to not using them. However, everything they do needs to be completely verified. In accounting, even for something as basic as bookkeeping, the numbers entered can't be 99% correct; they have to be perfect. A mistake once every few months is tolerable, but no more than that.
Which is why I propose that all tasks be broken down into what I call primitives: actions that don't make sense to break down further. At that point you have surfaced many hidden assumptions. Of course we can't be certain of the quality of the LLM's decomposition, so some things are bound to be missed, but even if you only catch 70% of them, the rest become easier to pinpoint because there are far fewer gaps left.
I know this is not the most efficient way to do things, but it is effective, especially for non-coders like me; it has let me build things beyond what I could have done on my own. I also found that it is not only great for catching bugs, it is very useful for making plans: it forces the LLM to admit what it doesn't have, so it can either get it or I can provide it. I describe another way to use it in my paper, which you can read here: https://github.com/LeiWang-AI/RED-methodology
In conclusion, while this is a very simple idea, I found it to be very powerful, and I think it may have to be the mindset we adopt when using LLMs. Please let me know what you think; I would love to hear how you use it and whether it is useful. Questions and suggestions are welcome too.
Finally, if you have arXiv endorsement in cs.AI (or cs.SE) and think this is worth archiving, please DM me.