An agent loop burned through $47,000 of API credits in 6 hours
A retry-on-failure decorator plus an agent that kept 'trying a different approach' equals one very expensive weekend.
We built an agent to triage customer support tickets overnight. It used a tool to search our knowledge base, and if the search returned nothing, it would "try a different query."
Friday at 6pm I deployed a small change: a retry decorator on the knowledge base tool, because the vector DB had been flaky. Three retries, exponential backoff. Standard stuff.
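For anyone who has not written one, a decorator of that kind looks roughly like the following. This is a minimal sketch, not the code that was deployed: the name `retry`, its parameters, and the injectable `sleep` argument (there so the backoff can be observed without actually waiting) are all assumptions made for illustration.

```python
import functools
import time


def retry(max_retries=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function on failure, doubling the wait each time.

    Hypothetical helper: names and defaults are illustrative only.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # One initial attempt plus up to max_retries further attempts.
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries:
                        raise  # out of retries: let the caller see the error
                    # Exponential backoff: base_delay, 2x, 4x, ...
                    sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator
```

Nothing in a decorator like this knows who the caller is. It retries for a human clicking a button exactly the way it retries for an agent that is itself stuck in a loop.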
I did not realize the agent was also retrying on its own. When a search came back empty (and every search was now coming back empty because the vector DB had silently gone down around 8pm), the agent would reason: "Perhaps I should try a different query." New query, empty. "Perhaps I should try a broader query." Empty. Each attempt was 4,000 tokens of context.
The retry decorator then multiplied every one of those tool calls by 3.
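The way the two layers compound is easier to see in a toy simulation. Everything here is a stand-in: `count_backend_calls`, the fake search functions, and the ticket text are hypothetical, the backoff sleeps are left out, and the agent is capped at `agent_attempts` only so the example terminates (the real one had no such cap). It also assumes the retry layer hands the agent an empty result once it gives up, and it takes the "by 3" above at face value as three tries per tool call.

```python
def count_backend_calls(agent_attempts, tries_per_call):
    """Count backend hits when an agent-level loop sits on top of a retry layer.

    Hypothetical simulation: the vector DB is assumed to be down for every call.
    """
    backend_calls = 0

    def search_backend(query):
        nonlocal backend_calls
        backend_calls += 1
        raise ConnectionError("vector DB is down")

    def search_with_retry(query):
        # Inner layer: the retry wrapper, blind to who is calling it.
        for _ in range(tries_per_call):
            try:
                return search_backend(query)
            except ConnectionError:
                continue
        return []  # assumed: the agent just sees "no results"

    def agent(ticket):
        # Outer layer: the agent "tries a different query" on empty results.
        query = ticket
        for i in range(agent_attempts):
            if search_with_retry(query):
                return "answered"
            query = f"{ticket} (rephrased {i + 1})"
        return "unanswered"

    status = agent("cannot reset password")
    return status, backend_calls


status, calls = count_backend_calls(agent_attempts=10, tries_per_call=3)
print(status, calls)  # unanswered 30
```

Neither loop is unreasonable on its own. The cost comes from the product of the two, and from the outer loop having no limit at all.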
I woke up Saturday to 1,847 unanswered tickets and a billing alert. $47,112.33. My AWS bill for the year is less than that.
The post-mortem is a masterpiece. My CTO printed it out and laminated it.