How I Built a Multi-Agent AI Orchestration System (and What I Learned)

AI engineer building automated creative pipelines. Exploring the intersection of AI, automation, and content creation. Sharing what I learn along the way.

Post #5 — Srinibytes Tags: ai, engineering, multi-agent, systems-thinking Target: Hashnode | ~1350 words


I now have 33 AI agents running autonomously. A CEO agent that wakes up every 8 hours. A VP of Engineering that triages technical work. Specialists for frontend, backend, QA, DevOps, content, and career ops. Each one checks its inbox, picks a task, does the work, and moves on.

It took me six weeks to build and three major architectural rewrites to get right.

Here's what I actually learned — not the cleaned-up version.


The Idea Was Simple. The Execution Wasn't.

I wanted to stop manually managing my side projects. I was spending more time juggling tasks across StreamVault, my AI/ML learning app, and a video generation pipeline than I was actually building things. The obvious solution: automate the project management with AI agents.

The non-obvious problem: I had no idea how brittle stateful AI workflows were until I built one.

My first design looked reasonable on a whiteboard. An orchestrator agent that assigned tasks, waited for completion signals, then moved to the next step. Human-in-the-loop approvals managed by a "waiting" node that paused execution until someone responded.

This worked great in demos. In production, it fell apart in three specific ways:

  1. Wait nodes lost execution context after ~24 hours. When I finally approved something, the workflow resumed but was running stale code from before the last deploy.
  2. Any single failure restarted the entire chain. My 12-step orchestrator would hit a transient error in step 9 and retry from step 1, leaving me with duplicated work and confused state.
  3. Redo paths drifted. I had separate code for "fresh run" and "retry" that started diverging after the first week. Bugs I fixed in one path silently stayed in the other.

None of these are novel problems. They're documented anti-patterns in distributed systems. I just had to hit them personally before they became real.


Rewrite #1: The Database Becomes the State Machine

The key insight — the one that fixed most of these problems — was this: the database should own the state, not the workflow execution.

Instead of a workflow that pauses and waits for approval, I now have:

  • A database column: status (pending_review | approved | rejected | in_progress | done)
  • A short workflow that checks status and acts on what it finds
  • No long-running execution context to lose

When I approve a task now, I'm not resuming a paused workflow. I'm updating a row in a table. A fresh, stateless workflow picks that up on its next trigger and does the next step — with current code, current config, zero stale state.
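
A minimal sketch of "approval is a row update", not the actual schema from this system. It uses Python's built-in sqlite3 as a stand-in for a real database, and the table name `tasks`, the `title` column, and the `approve` helper are assumptions made for illustration. Only the five status values come from the list above.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory tasks table whose status column is constrained."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE tasks (
            id     INTEGER PRIMARY KEY,
            title  TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending_review'
                   CHECK (status IN ('pending_review', 'approved', 'rejected',
                                     'in_progress', 'done'))
        )
        """
    )
    return conn


def approve(conn: sqlite3.Connection, task_id: int) -> bool:
    """Approve a task only if it is still waiting for review.

    Returns True if this call changed the row, False otherwise.
    """
    cur = conn.execute(
        "UPDATE tasks SET status = 'approved' "
        "WHERE id = ? AND status = 'pending_review'",
        (task_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```

The `WHERE ... AND status = 'pending_review'` guard makes the approval safe to repeat: a second click, or a retried request, changes nothing and reports that it changed nothing.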

In practice, this means:

Before: workflow waits → approval received → resumes with stale context
After:  approval updates DB → trigger fires fresh workflow → reads current DB state

The workflow doesn't know anything about what happened before it ran. It reads the current state from the database, does its job, writes the result, and exits. Every step is independently restartable.
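
Here is one way the "read state, do the job, write the result, exit" loop could look. It is a sketch under assumptions: sqlite3 again stands in for the real database, and `run_once`, the `result` column, and the `do_work` callback are hypothetical names, not part of the system described here.

```python
import sqlite3
from typing import Callable


def open_db() -> sqlite3.Connection:
    """In-memory stand-in for the task table (columns are assumptions)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending_review',"
        " result TEXT)"
    )
    return conn


def run_once(conn: sqlite3.Connection, do_work: Callable[[str], str]) -> int:
    """One stateless pass: read current state, act on it, write the result, exit.

    Returns the number of approved tasks found on this pass.
    """
    rows = conn.execute(
        "SELECT id, title FROM tasks WHERE status = 'approved' ORDER BY id"
    ).fetchall()
    for task_id, title in rows:
        result = do_work(title)
        conn.execute(
            "UPDATE tasks SET status = 'done', result = ? "
            "WHERE id = ? AND status = 'approved'",
            (result, task_id),
        )
        conn.commit()  # commit per task so a later crash keeps finished work
    return len(rows)
```

Nothing survives between calls except what is in the table, so running `run_once` twice in a row is harmless: the second pass finds no approved rows and exits.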

This is not a new idea. It's how most reliable distributed systems work: durable state lives in a store, and the workers around it stay stateless. But when you're building AI pipelines with visual workflow tools, the "wait node" is one click away and it feels right. It isn't.


Rewrite #2: Short Workflows, One Responsibility Each

My second architectural mistake was the monolith.

I had a workflow with 50+ nodes that handled the entire agent lifecycle: assign task → research → implement → review → commit → deploy → verify → notify. It was impressive to look at. It was impossible to debug.

When something failed, I had no idea which step caused it. When I wanted to change the deploy step, I had to understand the entire workflow. When I added a new agent type, I had to fork the whole thing.

The fix was to split by responsibility. Each workflow now does exactly one thing:

  • assign-and-checkout.workflow — picks a task from the queue and marks it in_progress
  • execute-work.workflow — does the actual implementation
  • review-gate.workflow — evaluates quality, blocks or approves
  • deploy-verify.workflow — ships and confirms the deploy succeeded
  • notify.workflow — sends status to Telegram
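
The checkout step is the one place where two agents could grab the same task, so it is worth sketching. The example below is an illustration, not the real assign-and-checkout workflow: `claim_next_task` and the `claimed_by` column are hypothetical, and sqlite3 stands in for the real database.

```python
import sqlite3
from typing import Optional


def open_db() -> sqlite3.Connection:
    """In-memory stand-in for the task queue (columns are assumptions)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'approved',"
        " claimed_by TEXT)"
    )
    return conn


def claim_next_task(conn: sqlite3.Connection, agent: str) -> Optional[int]:
    """Pick the oldest approved task and mark it in_progress for this agent.

    Returns the task id, or None if the queue is empty or the claim was lost.
    """
    row = conn.execute(
        "SELECT id FROM tasks WHERE status = 'approved' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # The status guard means only one claimant can flip the row.
    cur = conn.execute(
        "UPDATE tasks SET status = 'in_progress', claimed_by = ? "
        "WHERE id = ? AND status = 'approved'",
        (agent, row[0]),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None
```

On PostgreSQL, a common alternative for the same job is `SELECT ... FOR UPDATE SKIP LOCKED` inside a transaction, which lets concurrent workers skip rows that another worker has already locked.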

Each one is ~10 nodes. Each one can fail and retry independently. When deploy-verify fails, execute-work is untouched. When I want to change how notifications work, I touch one file.

The trade-off is real: more workflows to manage, more triggers to configure, more places where inter-workflow communication can break. I've accepted that trade-off because the debugging experience is so much better. A 10-node workflow that fails is a 2-minute investigation. A 50-node workflow that fails is an afternoon.
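
To make the failure-isolation point concrete, here is a small sketch in which each step is retried on its own and earlier steps are never re-run. It is a simplification: the steps are plain Python callables run in order, whereas the real workflows are separate and trigger-driven. `run_with_retry` and `run_steps` are hypothetical helpers.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


def run_with_retry(step: Callable[[], None], attempts: int = 3) -> bool:
    """Retry a single step up to `attempts` times; report whether it succeeded."""
    for _ in range(attempts):
        try:
            step()
            return True
        except Exception:
            # A real system would log the error and back off before retrying.
            continue
    return False


def run_steps(steps: List[Step], attempts: int = 3) -> Optional[str]:
    """Run steps in order, retrying each one in isolation.

    Returns the name of the first step that exhausts its retries, or None
    if every step succeeded. Steps that already succeeded are not repeated.
    """
    for name, step in steps:
        if not run_with_retry(step, attempts):
            return name
    return None
```

If a deploy step fails once and then succeeds, only the deploy step runs twice. The return value also names the step that gave up, which is most of what makes the 2-minute investigation possible.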


Rewrite #3: The Abstraction That Cost Me Two Weeks

The third mistake was the one I'm most embarrassed about, because I knew better.

I built a generic "agent runner" abstraction — a single workflow template that could theoretically run any agent with any skill set. Parameterized node configurations, dynamic prompt injection, the works. It was architecturally elegant.

It was also completely wrong for my situation.

Each agent type actually has meaningfully different needs. The CEO agent needs to read company-wide dashboards, prioritize across projects, and spawn work for multiple reports. The Content Writer needs filesystem access to drafts, git operations, and Hashnode publishing. The QA engineer needs Playwright MCP and structured test output parsing. These are not the same workflow with different parameters. They're different workflows.

When I tried to make them converge into a single abstraction, I spent two weeks adding escape hatches. "Except for agent type X, where parameter Y does Z." The abstraction's complexity started exceeding the complexity of just having separate workflows.

I scrapped the generic runner. Gave each agent type its own AGENTS.md (instruction file), its own heartbeat pattern, its own toolset. The system is now less "architecturally pure" and significantly more functional.

The key insight: three similar workflows are better than one abstraction with three edge cases. Abstractions earn their existence by reducing future maintenance cost. When the abstraction adds more complexity than it removes, it's not an abstraction, it's debt.


What Actually Works

After six weeks and three rewrites, here's what the working system looks like:

33 specialized agents, each with 2-3 focused skills instead of 8 broad ones. A CEO agent that orchestrates everything. VPs and leads that manage domain-specific ICs. All of them waking up on schedules, checking their assigned issues, doing work, and exiting.

PostgreSQL as the state machine. Every agent decision — task status, approval state, quality scores, deployment verification — lives in the database. Workflows are readers and writers, not stateful executors.

Short, single-responsibility workflows. Nothing over 15 nodes. Every step independently retriable. Failures isolated to the smallest possible unit of work.

Explicit chain-of-command escalation. When an agent is blocked, it patches the issue to blocked, posts a comment explaining why, and exits. A manager picks it up on the next heartbeat cycle. No spinning, no retrying the same failed approach.
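
The escalation path can be sketched as two small database operations. This is an illustration under assumptions, not the system's real issue tracker: the `issues` and `comments` tables, their columns, and the `escalate` and `blocked_for` helpers are hypothetical, with sqlite3 standing in for the real database.

```python
import sqlite3
from typing import List, Tuple


def open_db() -> sqlite3.Connection:
    """In-memory stand-in for an issue tracker (schema is an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE issues (
            id       INTEGER PRIMARY KEY,
            assignee TEXT NOT NULL,
            manager  TEXT NOT NULL,
            status   TEXT NOT NULL DEFAULT 'in_progress'
        );
        CREATE TABLE comments (
            issue_id INTEGER NOT NULL REFERENCES issues(id),
            author   TEXT NOT NULL,
            body     TEXT NOT NULL
        );
        """
    )
    return conn


def escalate(conn: sqlite3.Connection, issue_id: int, agent: str, reason: str) -> None:
    """Mark the issue blocked and record why, then return (no retry loop)."""
    with conn:  # both writes commit together or not at all
        conn.execute("UPDATE issues SET status = 'blocked' WHERE id = ?", (issue_id,))
        conn.execute(
            "INSERT INTO comments (issue_id, author, body) VALUES (?, ?, ?)",
            (issue_id, agent, reason),
        )


def blocked_for(conn: sqlite3.Connection, manager: str) -> List[Tuple[int, str]]:
    """What a manager's heartbeat would read: blocked issues and their reasons."""
    return conn.execute(
        "SELECT i.id, c.body FROM issues i "
        "JOIN comments c ON c.issue_id = i.id "
        "WHERE i.manager = ? AND i.status = 'blocked' "
        "ORDER BY i.id, c.rowid",
        (manager,),
    ).fetchall()
```

The blocked agent never waits for an answer. It writes its reason and exits, and the manager finds the issue by querying state on its own schedule.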

The system isn't perfect. There are still rough edges — inter-workflow communication failures I haven't fully solved, comment deduplication logic that occasionally fires twice, a release-verifier that needs its chain-of-command refreshed after agent reorganizations. But it's running. Agents are shipping work. Quality gates are catching issues before they merge.


What I'd Do Differently

If I were starting over, I'd do these three things differently:

  1. Start with the database schema. Not the workflows. Define what "task state" means before you build anything that touches it.
  2. Make each workflow disposable. If you can't delete a workflow and rebuild it from scratch in an afternoon, it's too complex.
  3. Resist the generic abstraction. The right time to abstract is when you have three working specific things and see a clear pattern. Not before you have the specific things working.

Multi-agent systems feel like magic when they work. Underneath, they are distributed systems whose operations happen to be AI-generated. All the distributed systems principles apply: statelessness, idempotency, failure isolation, explicit state management.

The agents aren't the hard part. The plumbing is the hard part.


Building something similar? I'd be curious what patterns you've hit — especially around human-in-the-loop approval flows. The state machine approach solved most of my problems, but I'm still not happy with how approval timeouts work.