Building Your First Coding Agent — A Practical Walkthrough
Stop reading about coding agents and build one. A practical walkthrough of a real, useful coding agent — the decisions made along the way, the things that broke, and what the finished version looks like.
Yesterday I covered the design principles for coding agents. Today I want to get concrete: let’s actually build one.
Not a toy demo. A useful agent that solves a real problem — one I’ve built and use in my own workflow. I’ll walk through the decisions, the things that broke during development, and what the finished version looks like.
The Task: A Test Gap Finder Agent
The agent I’m going to walk through does one thing: given a Python module, it finds functions that have no corresponding tests and reports them with enough context to write those tests.
This is useful because coverage metrics lie: 80% line coverage tells you which lines executed, not which behaviours are tested. The gap finder surfaces the functions with no tests at all, the easiest wins to close.
The agent needs to:
- Read the source module and identify all functions
- Read the test files and identify which functions are tested
- Compare the two and report the gaps
- For each gap, provide enough context to write a test
Simple enough to build in a day. Real enough to use in production.
Step 1: Define the Tools
Following the design principle from Day 20: define the minimum action space required. For this agent:
```python
tools = [
    {
        "name": "read_file",
        "description": "Read the contents of a file at the given path. Returns the file content as a string. Use this to read source files and test files.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Absolute or relative path to the file"
                }
            },
            "required": ["path"]
        }
    },
    {
        "name": "list_files",
        "description": "List all files in a directory matching a glob pattern. Returns a list of file paths.",
        "input_schema": {
            "type": "object",
            "properties": {
                "directory": {"type": "string"},
                "pattern": {
                    "type": "string",
                    "description": "Glob pattern, e.g. '**/*.py'"
                }
            },
            "required": ["directory", "pattern"]
        }
    }
]
```
Two tools. Read-only. No write access, no execution. The blast radius of a mistake is zero — the agent can look, not touch.
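If you later add more tools, it's worth enforcing that property in code rather than by convention. Here's a minimal sketch of a path guard, assuming reads should be confined to a single project root; `PROJECT_ROOT` and `resolve_safe_path` are my names for this sketch, not part of the agent above:

```python
import os

# Assumed project root; adjust for your repo. This guard is an optional
# hardening layer, not part of the agent's required tool set.
PROJECT_ROOT = os.path.abspath(".")

def resolve_safe_path(path):
    """Resolve a path and refuse anything outside PROJECT_ROOT."""
    resolved = os.path.abspath(os.path.join(PROJECT_ROOT, path))
    # commonpath equals PROJECT_ROOT only if resolved sits inside it
    if os.path.commonpath([resolved, PROJECT_ROOT]) != PROJECT_ROOT:
        raise ValueError(f"Path escapes project root: {path}")
    return resolved
```

The `read_file` handler would then call `resolve_safe_path(path)` before opening, returning the `ValueError` message as a structured error rather than raising.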
Step 2: Implement the Tool Handlers
```python
import glob
import os


def handle_tool_call(tool_name, tool_input):
    if tool_name == "read_file":
        path = tool_input["path"]
        if not os.path.exists(path):
            return {"error": f"File not found: {path}"}
        with open(path, "r") as f:
            return {"content": f.read(), "path": path}

    elif tool_name == "list_files":
        directory = tool_input["directory"]
        pattern = tool_input.get("pattern", "**/*.py")
        full_pattern = os.path.join(directory, pattern)
        files = glob.glob(full_pattern, recursive=True)
        return {"files": files, "count": len(files)}

    return {"error": f"Unknown tool: {tool_name}"}
```
Notice the structured error returns. When a file isn’t found, the agent gets a message it can reason about — not a Python exception that crashes the loop.
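The same principle extends to exceptions the handlers don't anticipate, such as permission errors or undecodable bytes. A thin wrapper, my addition rather than part of the agent above, converts anything that escapes into a structured error the model can see:

```python
def safe_tool_call(tool_name, tool_input):
    """Wrap handle_tool_call so no exception ever escapes the loop.

    A PermissionError or UnicodeDecodeError becomes a structured error
    the model can reason about instead of a crashed process.
    """
    try:
        return handle_tool_call(tool_name, tool_input)
    except Exception as e:
        return {"error": f"{type(e).__name__}: {e}"}
```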
Step 3: The Agent Loop
```python
import anthropic

client = anthropic.Anthropic()


def run_agent(source_path, test_directory):
    system_prompt = """You are a test coverage analyst. Your job is to identify
functions in a Python source file that have no corresponding tests.

For each function with no tests, report:
- The function name and signature
- What it does (from the docstring or code)
- What test cases would be most valuable

Be specific. Do not report functions that are clearly tested."""

    messages = [
        {
            "role": "user",
            "content": f"Find untested functions in {source_path}. Test files are in {test_directory}."
        }
    ]

    while True:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages
        )

        # Add assistant response to history
        messages.append({"role": "assistant", "content": response.content})

        # If no tool calls, we're done -- extract the final text response
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return ""

        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = handle_tool_call(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(result)
                })

        # Add tool results and continue the loop
        if tool_results:
            messages.append({"role": "user", "content": tool_results})
```
The loop is straightforward: send messages, check if done, process tool calls, add results, repeat. This is the core pattern for any tool-using agent with the Anthropic API.
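One guard worth adding before running a loop like this unattended is a cap on round trips, so a confused model can't spin forever. A minimal sketch of the pattern, where `step` is a hypothetical stand-in for one iteration of the loop body above:

```python
def run_bounded(step, max_turns=20):
    """Drive repeated `step` calls until a result or the turn cap.

    `step` should perform one model round trip and return the final
    report string when the agent is done, or None to continue.
    max_turns=20 is an arbitrary budget, not a tuned value.
    """
    for _ in range(max_turns):
        result = step()
        if result is not None:
            return result
    raise RuntimeError(f"Agent did not finish within {max_turns} turns")
```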
What Broke During Development
First problem: the agent read too many files. Without constraints, it tried to read every Python file in the project — which hit rate limits and took forever. Fix: narrow the task in the prompt. “Limit your search to the test directory provided — don’t explore the whole project.”
Second problem: false positives. The agent reported functions as untested when they were tested indirectly through integration tests. Fix: add a note in the system prompt acknowledging this limitation. “Note: you can only detect direct test references, not indirect coverage through integration tests.”
Third problem: vague output. Early versions reported “function X has no tests” without enough context to act on. Fix: update the prompt to require the structured output format (function name, signature, purpose, suggested test cases).
Each of these was a prompt or tool design fix, not a code fix. That’s the pattern: when an agent misbehaves, look at the system prompt and tool descriptions before touching the implementation.
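For reference, here's roughly what those three fixes look like folded back into the system prompt, a sketch of the revised version rather than the verbatim final prompt:

```python
system_prompt = """You are a test coverage analyst. Your job is to identify
functions in a Python source file that have no corresponding tests.

Limit your search to the test directory provided -- don't explore the
whole project.

Note: you can only detect direct test references, not indirect coverage
through integration tests. Flag this caveat in your report.

For each function with no tests, report:
- The function name and signature
- What it does (from the docstring or code)
- What test cases would be most valuable

Be specific. Do not report functions that are clearly tested."""
```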
Using It
```python
report = run_agent(
    source_path="src/orders/processor.py",
    test_directory="tests/"
)
print(report)
```
Output looks like:
```markdown
## Untested Functions Found

### `calculate_shipping_cost(order, destination, express=False)`

**What it does:** Calculates shipping cost based on order weight and destination zone.

**Suggested tests:**
- Standard domestic shipping at different weight thresholds
- Express vs standard rate comparison
- International destination handling
- Edge case: zero-weight order
```
Actionable. Takes ten seconds to run. I run it on every module before closing a sprint.
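A thin command-line wrapper makes that habit frictionless. A sketch using argparse; the flag names here are illustrative choices, not part of the agent above:

```python
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Report functions with no corresponding tests."
    )
    parser.add_argument("source_path", help="Python module to analyse")
    parser.add_argument("--tests", default="tests/",
                        help="Directory containing the test files")
    args = parser.parse_args()

    print(run_agent(args.source_path, args.tests))
```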
The Broader Pattern
Every coding agent follows this same structure:
- Define the minimum tools the task requires
- Write structured error returns so the agent can recover
- Write a precise system prompt that constrains scope and output format
- Build the observe-reason-act loop with message history
- Test it on real cases, fix the misbehaviours in prompt and tool design
The implementation is the easy part. The design — especially the system prompt and the tool definitions — is where the quality of the agent lives.
Day 21 of the 30-Day AI Engineering series. Previous: What Makes a Good Coding Agent.