I built a customer support agent for a client a while back. It reads incoming tickets, searches the knowledge base, and drafts replies. Works pretty well most of the time.

But “most of the time” isn’t good enough when you’re handling real customer conversations. The agent kept making the same kinds of mistakes. Too verbose. Suggesting workarounds that didn’t actually work. Missing context from earlier messages in the thread. The user would review each draft, fix it, and move on.

The problem is: those corrections just disappear. The user fixes the draft, but the agent never learns from it. Next time, same mistakes.

So I built a second agent whose only job is to analyze the first agent’s failures and improve its prompt.

The feedback loop

The setup has three parts. The support agent does its thing: reading tickets and drafting replies. The user reviews each draft and scores it (thumbs up or thumbs down, plus an optional comment about what was wrong). Those scores get stored in a tracing tool.

That’s the data collection side. The interesting part is what happens next.
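If it helps to picture the data collection side, here's a minimal sketch of what one stored review record might look like. The field names are my own invention, not LangSmith's actual schema; an append-only JSONL file stands in for the tracing tool:

```python
import json

# Hypothetical shape for one scored draft. Real tracing tools have
# their own schemas; this is just the minimum you'd need: the input,
# the output, and the reviewer's verdict.
record = {
    "ticket_id": "T-1042",
    "draft": "Thanks for the report! This is a known bug; a fix ships next week.",
    "score": "thumbs_down",
    "comment": "Too verbose, and the fix date is wrong.",
}

# An append-only JSONL log is enough if you're scrappy.
with open("reviews.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```

The exact storage doesn't matter; what matters is that every draft, score, and comment ends up somewhere the improver can query later.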

The improver agent

I wrote a skill (basically a structured prompt with a workflow) that acts as the “prompt improver.” When you run it, here’s what it does:

  • Fetches all the failed drafts from the past couple weeks
  • Fetches the successful ones too, for contrast
  • Reads the reviewer’s comments to understand what went wrong
  • Groups failures into patterns
  • Reads the current prompt
  • Proposes specific, minimal edits

The key word there is “minimal.” The improver doesn’t rewrite the whole prompt. It adds a line here, tightens a sentence there. Every change has to trace back to at least one real failure.
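The workflow above can be sketched in plain Python. The data and the keyword-based grouping here are toy stand-ins; in the real skill, the LLM reads the reviewer comments and does the categorization itself:

```python
from collections import Counter

# Toy scored traces; in practice these come from your tracing tool.
traces = [
    {"score": "bad", "comment": "way too verbose"},
    {"score": "bad", "comment": "verbose again, cut the preamble"},
    {"score": "bad", "comment": "suggested a workaround that doesn't apply"},
    {"score": "good", "comment": "concise, correct"},
]

# Split into failures (to fix) and successes (for contrast).
failures = [t for t in traces if t["score"] == "bad"]
successes = [t for t in traces if t["score"] == "good"]

# Group failures into rough patterns. Keyword matching is only for
# illustration; a real improver has the LLM do this step.
PATTERNS = {"verbosity": "verbose", "bad_workaround": "workaround"}
counts = Counter(
    name
    for t in failures
    for name, keyword in PATTERNS.items()
    if keyword in t["comment"]
)

# The most common pattern is where a minimal prompt edit goes first.
print(counts.most_common())
```

Each proposed edit then gets checked against the successes, so a fix for one pattern doesn't break what already works.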

What it found

I ran the improver last week. It pulled 12 failure traces and 1 success trace from the past two weeks.

Verbosity was the biggest problem. Half the failures were the agent writing paragraphs when two sentences would do. In one case, it took 5 rounds of editing to get a reply down to two sentences. The reviewer’s ideal version was always shorter.

Bad workarounds showed up in 4 cases. The agent suggested fixes pulled from loosely related issues that didn’t actually apply. Once it recommended a paid feature as a workaround for a bug, which is a terrible look for a support agent.

Not reading the full conversation. Three times, the agent only read the first few messages and missed recent replies. The reviewer had to say “read the full thread” before the agent noticed there were 50+ posts.

Wrong instructions. Twice, the agent cited settings menu paths that didn’t match the actual product. It was pulling paths from memory instead of checking the docs. The prompt already said “never invent UI paths” but the agent did it anyway.

And the one success? A short reply that acknowledged a bug report without overexplaining. The agent CAN be concise. It just defaults to verbose.

The prompt changes

Five small edits. Here are a couple of examples.

The conciseness rule went from “First draft replies should be 2-3 sentences” to something more specific:

First draft replies should be 2-4 sentences max. Don’t restate the user’s problem back to them. Don’t explain your reasoning. Lead with the single most likely fix.

It added a rule about workaround quality:

Only suggest workarounds confirmed in solved threads, official docs, or clear evidence. Never suggest workarounds from loosely-related threads.

And the “don’t invent UI paths” rule got teeth:

Never cite specific setting names or menu locations from memory - always fetch the relevant docs page to verify before including them.

Each change was small. Each traced to multiple failures. The improver also checked that none of the changes would conflict with what was already working.

Why this works

The reason this beats manual prompt tuning: you’re not guessing. You have real failures with real reviewer comments telling you exactly what went wrong. When you review drafts one at a time, you might not notice that verbosity accounts for half your failures. The improver sees all of them at once and spots the patterns.

It’s also conservative by design. The improver only touches what’s broken. This matters because prompt changes are fragile. A big rewrite can fix one problem and create three new ones.

How to build your own

You need tracing, so the agent’s inputs and outputs are recorded somewhere. I use LangSmith, but LangFuse, Braintrust, or honestly even a spreadsheet works if you’re scrappy. The point is you need a record of what happened.

You need scoring. Someone marks each output as good or bad. A thumbs up/down button is enough. Comments are gold but not required. If human review doesn’t scale, you could use an LLM-as-judge to auto-score, but human feedback is more reliable.

And you need the improver itself. It’s a prompt that knows how to fetch traces, categorize failures, read the current prompt, and propose edits. I wrote mine as a reusable skill I can run whenever. The workflow is what matters, not the specific tools.
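The improver prompt itself can be as simple as a template that stitches the evidence together. A hedged sketch of what mine roughly looks like (the wording and function name are my own, not the exact skill I run):

```python
def build_improver_prompt(current_prompt, failures, successes):
    """Assemble the instruction the improver LLM receives.

    `failures` and `successes` are lists of (draft, reviewer_comment)
    tuples pulled from the tracing tool.
    """
    fail_text = "\n".join(f"- draft: {d!r}\n  comment: {c!r}" for d, c in failures)
    good_text = "\n".join(f"- draft: {d!r}\n  comment: {c!r}" for d, c in successes)
    return (
        "You are a prompt improver. Below is the current prompt for a "
        "support agent, plus recent failed and successful drafts with "
        "reviewer comments.\n\n"
        f"CURRENT PROMPT:\n{current_prompt}\n\n"
        f"FAILURES:\n{fail_text}\n\n"
        f"SUCCESSES:\n{good_text}\n\n"
        "Group the failures into patterns, then propose minimal edits to "
        "the prompt. Every edit must trace back to at least one failure. "
        "Do not rewrite sections that are already working."
    )

prompt = build_improver_prompt(
    "Reply in 2-3 sentences.",
    failures=[("Long rambling reply...", "too verbose")],
    successes=[("Short ack of the bug.", "concise, correct")],
)
```

You'd send that string to whatever model you like and review its proposed edits by hand. The template is the whole trick: evidence in, minimal edits out.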

Took me about a day to set up. Most of that was the scoring UI. The improver skill was maybe an hour.

Where this goes

Right now I run the improver manually when I feel like it. But there’s no reason it couldn’t run automatically, say weekly, and surface proposed changes for review.

You could even chain it. The improver proposes changes, a reviewer approves them, they get applied. The support agent gets better every week without anyone manually reading through traces.

I don’t think we’re at fully autonomous self-improving agents yet. The human reviewer still matters. But we’re getting closer to a world where you set up the loop and the system improves on its own, with a human just approving the changes.

If you’re running any kind of agent in production, try this. Build an agent that reviews your other agent’s failures. You’ll be surprised what it finds.