In my previous article, I described giving an AI read-only SSH access to my production infrastructure: auditing, documenting, monitoring. The logical next step was to let it not just read, but build.
So I did. With Claude as a co-pilot, I built a real application: a price-matching platform for OTC interest-rate swaps. Real-time, WebSocket, SSO authentication, an engine that automatically pairs the complementary interests of several counterparties. FastAPI, asyncio, SQLite. Shipped to production with a pilot broker and around fifteen bank counterparties behind him.
And it worked. For a month.
Then the broker messaged me: “The session cut off while there was still time on the clock.” And also: “A trader's notionals are flickering, appearing and disappearing.” And: “The timing is weird, the sessions run too long.”
My clock had started lying. Here's how I found out why, and what it taught me about the word “asynchronous.”
The symptom: sessions that don't end on time
The heart of the platform is a fixed-duration matching session. You configure it for, say, ten minutes: a countdown, then a matching window, then it's over. Everyone sees the same clock.
Except no. In production, a session configured for 600 seconds lasted 1202 seconds (double). Another lasted 1803 (triple). But going back through about forty sessions, I didn't find clean multiples. I found stretch factors ranging from 1.2× to 5.51×.
A session meant to last ten minutes could run for almost an hour. And nobody understood why, least of all me.
The first wrong turn: “it's the timer”
The obvious reflex: the timer code has a bug. It double-arms, it forgets to stop, something.
I isolated and tested it on its own. Two-second countdown, three-second matching: exact, a single transition, no double-arming. The timer was perfect in isolation.
Worse (or better): in production, the countdown was exact: 30 seconds, dead on. Only the matching phase drifted. What's the difference? During the countdown, almost nothing happens. During matching, the server is alive: orders arrive over WebSocket, every client polls the API every three seconds, a price feed pushes data several times a second, and the server broadcasts state to everyone continuously.
The timer wasn't broken. It was starved.
The clue that unlocked everything: the shape of the error
Here's the detail that flipped the diagnosis, and it has been my rule ever since:
A discrete bug (a timer that resets, or that counts twice) produces discrete errors: clean integer multiples. ×2, ×3, never ×2.67. But I didn't have integer multiples. I had a continuum: 1.2×, 1.41×, 2.67×, 3.5×, 5.51×. A continuous ramp.
And a continuous ramp doesn't look like a logic bug. It looks like contention: the more load, the slower, proportionally. The stretch factor tracked the session's activity. From there, I was no longer hunting a timer bug. I was hunting whatever blocked the loop.
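To make that rule concrete, here is a minimal sketch. The function names and the 5% tolerance are illustrative assumptions, not code from the platform; the stretch factors are the ones quoted above.

```python
def stretch_factors(configured_s, observed_s):
    """Ratio of observed duration to configured duration, per session."""
    return [observed / configured_s for observed in observed_s]


def looks_discrete(factors, tolerance=0.05):
    """True if every factor sits within `tolerance` of a whole number."""
    return all(abs(f - round(f)) <= tolerance for f in factors)


# The first two sessions I looked at: 1202 s and 1803 s for a 600 s session.
first_glance = stretch_factors(600, [1202, 1803])

# The factors found across the wider sample.
wider_sample = [1.2, 1.41, 2.67, 3.5, 5.51]

if __name__ == "__main__":
    print(looks_discrete(first_glance))   # True: looks like a double-arming bug
    print(looks_discrete(wider_sample))   # False: a continuum, so think contention
```

Two sessions alone pointed at a discrete bug. It took the wider sample to show the ramp.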
The cause: one slow client blocked everyone
asyncio runs on a single thread. The whole server (the timer, the orders, the broadcasts, the WebSocket heartbeats) shares one event loop. That loop is cooperative: until a piece of code yields (with an await that actually cedes control), nothing else runs.
My timer counted await asyncio.sleep(1) calls. In theory, each loop = one second. In practice, sleep(1) only resumes when the loop has time to call it back. If the loop is busy elsewhere, every “second” of the timer lasts one second plus the lag. Count enough late loops, and your ten-minute session runs fifty.
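The drift is easy to reproduce in miniature. The sketch below is an illustration, not the production timer: `counting_timer` and `background_load` are hypothetical names, and the tick is shortened to 20 ms so the demo runs in under a second. A second task hogs the loop with synchronous work, and the counting timer overshoots its nominal duration.

```python
import asyncio
import time


async def counting_timer(ticks, tick_s):
    """Naive timer: assumes every iteration lasts exactly tick_s."""
    for _ in range(ticks):
        await asyncio.sleep(tick_s)


async def background_load(stop, block_s):
    """Stand-in for a busy server: holds the loop, then yields briefly."""
    while not stop.is_set():
        time.sleep(block_s)      # synchronous on purpose: nothing else can run
        await asyncio.sleep(0)   # hand control back for one loop iteration


async def measure_counting_timer(ticks=10, tick_s=0.02, block_s=0.02):
    stop = asyncio.Event()
    load = asyncio.create_task(background_load(stop, block_s))
    start = time.monotonic()
    await counting_timer(ticks, tick_s)
    elapsed = time.monotonic() - start
    stop.set()
    await load
    return elapsed


if __name__ == "__main__":
    nominal = 10 * 0.02
    elapsed = asyncio.run(measure_counting_timer())
    print(f"nominal {nominal:.2f}s, actual {elapsed:.2f}s")
```

The exact stretch depends on the machine and on how the load interleaves with the ticks, which is precisely why the production factors formed a continuum.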
The main culprit: the broadcast function. It was declared async all right, but it sent to each client sequentially, one after another, with no timeout, and it was awaited on every timer tick and every trade.
It took just one slow or half-dead client (a frozen tab, an expired token, a full TCP buffer on the network side) for the send to that client to block the entire broadcast loop. And as long as that loop blocked, the timer tick waited. And so did every other client's heartbeat.
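A stripped-down version of that pattern, as a sketch: `FakeClient` is a hypothetical stand-in for a WebSocket connection, and the 0.3-second delay is arbitrary. The function is async, yet whoever awaits it (here, the caller standing in for the timer tick) is held for the sum of every client's send time.

```python
import asyncio
import time


class FakeClient:
    """Hypothetical stand-in for one WebSocket connection."""

    def __init__(self, name, send_delay_s):
        self.name = name
        self.send_delay_s = send_delay_s
        self.received = []

    async def send(self, message):
        await asyncio.sleep(self.send_delay_s)  # simulates a slow peer
        self.received.append(message)


async def broadcast_sequential(clients, message):
    # Declared async, but each send must finish before the next one starts,
    # and nothing bounds how long a single send may take.
    for client in clients:
        await client.send(message)


async def time_sequential_broadcast():
    clients = [
        FakeClient("fast-1", 0.0),
        FakeClient("slow", 0.3),
        FakeClient("fast-2", 0.0),
    ]
    start = time.monotonic()
    await broadcast_sequential(clients, "tick")
    return time.monotonic() - start, clients


if __name__ == "__main__":
    elapsed, _ = asyncio.run(time_sequential_broadcast())
    print(f"broadcast held its caller for {elapsed:.2f}s")
```

Note that "fast-2" receives nothing until "slow" is done, even though its own send is instantaneous.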
The moment it all clicked
The beauty of a real root cause is that it doesn't explain one symptom. It explains all of them.
- “The session runs too long” → the timer is stretched by starvation.
- “Cut off with time on the clock” → the WebSocket heartbeat arrived late; the browser thought the connection was dead and disconnected; after a grace period the client was ejected, the clock frozen on its last tick.
- “The notionals flicker” → broadcasts arrived late and out of order.
- The storm of “token expired” in the logs → slow requests pushed clients to retry in a loop, which loaded the loop even more.
Four different complaints, one single disease. I didn't have four bugs. I had one, wearing four masks.
The fix, in two parts
1. A wall clock, not a loop counter. I stopped counting sleep(1) calls. At the start of each phase, I freeze an absolute deadline: deadline = time.monotonic() + duration. On each tick, the time left is ceil(deadline - now), and the phase ends as soon as now >= deadline. If the loop is saturated and a tick is late, the deadline doesn't move: the session ends at the real configured time, period. The loop's lag no longer stretches time, it catches up. (And it's time.monotonic(), not datetime.now(): you want a clock that never goes backwards, immune to NTP adjustments.)
2. A concurrent broadcast, bounded per client. I rewrote the broadcast to send to all clients in parallel (asyncio.gather), each send wrapped in an asyncio.wait_for(..., timeout=2). A slow client? Its message is skipped for that round, but the client is not disconnected: it's slow, not dead. A truly dead client? Removed. The result: the broadcast is bounded to ~2 seconds worst case, instead of the sum of all sends.
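Part 1 of the fix, the wall-clock deadline, can be sketched as follows. This is a minimal illustration rather than the production code: `run_phase`, `on_tick` and `measure_phase` are hypothetical names, and the durations are scaled down from seconds to fractions of a second. Each tick deliberately blocks the loop, in the spirit of the starvation described earlier.

```python
import asyncio
import math
import time


async def run_phase(duration_s, tick_s=1.0, on_tick=None):
    """Run one phase against an absolute deadline, not a tick count."""
    deadline = time.monotonic() + duration_s   # frozen once, never moved
    while True:
        now = time.monotonic()
        if now >= deadline:
            break
        remaining = math.ceil(deadline - now)  # whole seconds left to display
        if on_tick is not None:
            await on_tick(remaining)
        # Sleep one tick at most, and never past the deadline.
        pause = min(tick_s, deadline - time.monotonic())
        await asyncio.sleep(max(0.0, pause))


async def measure_phase(duration_s=0.3, tick_s=0.05, block_s=0.03):
    seen = []

    async def blocking_tick(remaining):
        seen.append(remaining)
        time.sleep(block_s)   # synchronous on purpose: simulated starvation

    start = time.monotonic()
    await run_phase(duration_s, tick_s, blocking_tick)
    return time.monotonic() - start, seen


if __name__ == "__main__":
    elapsed, seen = asyncio.run(measure_phase())
    print(f"configured 0.30s, actual {elapsed:.2f}s over {len(seen)} ticks")
```

A late tick now means fewer ticks, not a longer phase. The worst-case overshoot is one tick's worth of blocking, instead of the accumulated lag of every tick.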
async def makes nothing concurrent. A loop of await send() is still sequential. Concurrency is gather.
Proof before declaring victory
I didn't want to “deploy and cross my fingers.” I wrote a test that injects one second of synchronous blocking on every tick: starvation, reproduced on demand.
- Before the fix (loop counter): matching ×2.
- After the fix (wall clock): exact duration, no matter the injected blocking.
And a second test with three clients, a fast one, a slow one (5 s), a dead one, to check the bounded broadcast behaves: the slow one is skipped, the dead one removed, the fast one waits for no one. Both tests pass. Then I deployed.
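Here is a scaled-down sketch of the bounded broadcast and of that three-client check. Everything named here (`FakeClient`, `send_bounded`, `three_client_check`) is hypothetical, the dead client is simulated with a ConnectionError, and the timeout is shortened from 2 seconds to 0.1 so the check runs quickly.

```python
import asyncio
import time


class FakeClient:
    """Hypothetical stand-in for one WebSocket connection."""

    def __init__(self, name, delay_s=0.0, dead=False):
        self.name = name
        self.delay_s = delay_s
        self.dead = dead
        self.received = []

    async def send(self, message):
        if self.dead:
            raise ConnectionError(f"{self.name} is gone")
        await asyncio.sleep(self.delay_s)
        self.received.append(message)


async def send_bounded(client, message, timeout_s):
    """Send to one client, never waiting longer than timeout_s."""
    try:
        await asyncio.wait_for(client.send(message), timeout=timeout_s)
        return "ok"
    except asyncio.TimeoutError:
        return "skipped"   # slow, not dead: keep the connection
    except ConnectionError:
        return "removed"   # truly dead: drop it


async def broadcast(clients, message, timeout_s=2.0):
    """Send to everyone at once; total time is bounded by timeout_s."""
    outcomes = await asyncio.gather(
        *(send_bounded(c, message, timeout_s) for c in clients)
    )
    alive = [c for c, o in zip(clients, outcomes) if o != "removed"]
    return alive, {c.name: o for c, o in zip(clients, outcomes)}


async def three_client_check(timeout_s=0.1):
    clients = [
        FakeClient("fast"),
        FakeClient("slow", delay_s=0.5),
        FakeClient("dead", dead=True),
    ]
    start = time.monotonic()
    alive, outcome = await broadcast(clients, "tick", timeout_s)
    return alive, outcome, time.monotonic() - start


if __name__ == "__main__":
    alive, outcome, elapsed = asyncio.run(three_client_check())
    print(outcome, f"{elapsed:.2f}s")
```

One caveat worth checking against your own stack: wait_for cancels the pending send when the timeout fires, and how a real WebSocket library behaves when a send is cancelled mid-flight is something to verify rather than assume.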
What I take away
“Asynchronous” does not mean “concurrent.” That's the conceptual mistake at the heart of the whole story. asyncio gives you the possibility of concurrency; it doesn't give it to you for free. A loop that awaits each operation one after another is as sequential as a plain for loop; it just blocks politely.
Never let one client's slowness touch shared state. Anything that talks to the network must have a timeout. One peer's backpressure must never be able to hold everyone's clock hostage.
Measure time with a clock, not with iterations. Counting sleep(1) calls bets that the loop is never late. It always is, eventually.
The shape of your errors is a diagnosis. Clean integer multiples → look for a discrete logic bug. A continuum → look for contention. That single distinction saved me days.
And the meta, because it's the thread of this blog: a Head of IT who isn't a developer by trade put this platform into production with an AI. The AI wrote much of the code. But this bug wasn't solved by code generation; it was solved by reasoning: following the continuum, forming the starvation hypothesis, choosing between “wall clock” and “counter.” The AI was an excellent partner to instrument, measure and write the fix once the cause was understood. The understanding stayed human.
It's exactly what I said three months ago. AI doesn't replace the engineer. It gives them back the time to do the real work: understanding.