From Sequential Scripts to Concurrent AI Pipelines in Go — Part 5

MM
Moinuddin M Masud
8 min read

Part 5 — Worker Pools: Decoupling Concurrency from Input Size

Part of the series: Production-Grade Concurrent AI Systems in Go

Full code for this post: github.com/madmmas/go-concurrent-ai-systems/tree/part-05Diff from Part 4: compare/part-04...part-05Run it: go run ./cmd/news-processor -articles=20 -workers=5 inside arc-1-foundations/part-05-worker-pools


Every version of the pipeline so far — Part 2's goroutines, Part 3's mutex, Part 4's channel — shares one design decision we haven't questioned yet: one goroutine per article.

for _, article := range articles {
    go func(a model.Article) {
        // ...
    }(article)
}

At five or ten articles, this is invisible. At a hundred thousand articles, this line launches a hundred thousand goroutines, all at once, all immediately trying to call an LLM API. Three things go wrong, in order.

First, memory. Goroutines are cheap compared to OS threads — a few kilobytes of initial stack each — but a few kilobytes times a hundred thousand is not nothing, and the Go scheduler now has a hundred thousand things to keep track of.

Second, and more immediately damaging: every real LLM provider rate-limits you. Hit them with a hundred thousand simultaneous requests and you'll get a wall of 429s back, not a wall of summaries.

Third, even if neither of those killed you, unbounded concurrency means you have no control over your own throughput. You can't say "use at most 10 concurrent connections" — you've already committed to using exactly as many as you have articles.

The fix is to stop tying goroutine count to article count.


The Worker Pool Shape

Instead of one goroutine per article, we start a fixed number of long-lived workers. Articles get fed into a queue. Workers pull from that queue, one article at a time, for as long as there's work left.

articles ──→ jobs channel ──→ [worker 1]
                            ──→ [worker 2]   ──→ results channel ──→ collector
                            ──→ [worker 3]

The worker count is a number you choose — 5, 20, 100 — and it stays fixed no matter whether you're processing 10 articles or 10 million. This is the same pattern behind every production job queue, every database connection pool, every HTTP server's request handling: a fixed number of workers, a queue feeding them.


Building It

Two channels this time. jobs feeds articles in; resultsCh collects output, same as Part 4.

func (p *WorkerPool) ProcessAll(articles []model.Article) ([]model.AIResult, time.Duration) {
    start := time.Now()

    jobs := make(chan model.Article, len(articles))
    resultsCh := make(chan model.AIResult, len(articles))

    var wg sync.WaitGroup
    for w := 1; w <= p.Workers; w++ {
        wg.Add(1)
        go func(workerID int) {
            defer wg.Done()
            p.worker(workerID, jobs, resultsCh)
        }(w)
    }

    for _, article := range articles {
        jobs <- article
    }
    close(jobs)

    go func() {
        wg.Wait()
        close(resultsCh)
    }()

    results := make([]model.AIResult, 0, len(articles))
    for result := range resultsCh {
        results = append(results, result)
    }

    return results, time.Since(start)
}

And the worker itself — this is the part that's genuinely new compared to every earlier part:

func (p *WorkerPool) worker(id int, jobs <-chan model.Article, results chan<- model.AIResult) {
    for article := range jobs {
        result := p.processArticle(article)
        results <- result
    }
}

In every previous part, a goroutine processed exactly one article and exited. Here, a worker's for article := range jobs loop keeps it alive, pulling the next article the moment it finishes the current one. p.Workers goroutines start once, at the beginning, and each one processes many articles sequentially over its lifetime — concurrency comes from having several of these long-lived workers running side by side, not from spawning a fresh goroutine per article.

close(jobs) after the feed loop is what eventually lets workers exit. range jobs keeps pulling until the channel is closed and drained, exactly like the results-collection pattern from Part 4 — the same close-to-signal-completion idiom, just applied on the input side this time instead of the output side.


Turning the Worker Count Knob

This is where it gets interesting, because worker count isn't just an implementation detail — it's the actual throughput control for your pipeline. Run the same 20 articles at three different worker counts:

go run ./cmd/news-processor -articles=20 -workers=1
go run ./cmd/news-processor -articles=20 -workers=5
go run ./cmd/news-processor -articles=20 -workers=20
workers=1  articles=20 -> 1m3.498s  (20 results)
workers=5  articles=20 -> 13.598s   (20 results)
workers=20 articles=20 -> 4.047s    (20 results)

With one worker, this is Part 1 again — fully sequential, just wearing a worker-pool costume. One goroutine processes all twenty articles one after another, and the timing proves it: 63.5 seconds, in line with what you'd expect from sequential processing at this scale.

With five workers, throughput jumps roughly 4.7x — five articles genuinely in flight at once, each worker pulling the next job the instant it's free.

With twenty workers — one per article — we're back to the fully-parallel behavior from Parts 2 through 4, and the time drops to about 4 seconds, bounded by roughly the slowest single article's latency.

That progression is the entire value of this pattern in one table. Worker count isn't a performance knob you set once and forget — it's the lever you pull to deliberately trade throughput against external constraints: an LLM provider's rate limit, your own memory budget, how much load you're willing to put on a downstream service. Five workers might be exactly right against a provider capping you at 10 requests per second. Twenty would get you rate-limited within the first second.


Confirming It's Still Correct

Worker pools add a new moving piece — the jobs channel, workers with their own lifecycle — so it's worth checking nothing regressed.

go test ./internal/... -race
ok

Race-free, same as Parts 3 and 4. The simulator's shared-RNG mutex from Part 3 carries forward here too — with multiple long-lived workers calling the same LLMClient repeatedly over their lifetime, that protection matters even more than it did with one-shot goroutines, since each worker hits the RNG many times instead of once.

One edge case worth testing explicitly: what happens with zero workers?

func New(llm *simulator.LLMClient, workers int) *WorkerPool {
    if workers <= 0 {
        panic("WorkerPool: Workers must be > 0")
    }
    return &WorkerPool{llm: llm, Workers: workers}
}

Without this check, Workers: 0 would silently deadlock — no workers ever start, so nothing ever reads from jobs, so the feed loop blocks forever trying to send the first article into an unbuffered... well, actually a buffered channel in our case, so it would block once the buffer filled, or hang at wg.Wait() regardless. Either way: a hang with no error message. A panic on construction is a far kinder failure than a silent freeze discovered three minutes into a production deploy.


What Arc 1 Built

Five parts, one evolving pipeline, same ten-article workload throughout:

PartApproachTime (10 articles)What changed
1Sequential~30sBaseline — measured, not assumed
2Goroutines + WaitGroup~3.5sConcurrent, but racing
3+ Mutex~3.4sRace-free, lock scoped correctly
4Channels~3.6sNo shared memory, no lock at all
5Worker pool~3.5s (5 workers)Concurrency decoupled from input size

The timing barely moves after Part 2 — and that's worth noticing rather than treating as anticlimactic. Going from Part 2 to Part 5 was never about getting faster. Part 2 was already fast. The rest of this arc was about getting that same speed safely, in a shape that doesn't fall over under real production constraints: no data races, no shared memory you have to reason about carefully, and a concurrency level you actually control instead of one dictated by however many items happen to show up in a batch.

That's the foundation. Everything in Arc 2 — backpressure, circuit breakers, fan-out/fan-in over real network calls instead of a simulator — builds directly on the worker pool shape from this part.


What's Next

Arc 2 starts with a question this arc never had to face: what happens when the AI provider itself starts failing — timeouts, rate limit errors, a model that's temporarily down? Every part in Arc 1 assumed the simulated LLM call always eventually succeeds. Production AI systems don't get that luxury.

See you in Arc 2.


This is Part 5 of the series "Production-Grade Concurrent AI Systems in Go," and the final part of Arc 1 — Concurrency Foundations. Read Part 4 — Channels and Message Passing or start from Part 1 — Why Concurrency Matters.