Part 1 — Why Concurrency Matters: Building the Wrong System First

Part of the series: Production-Grade Concurrent AI Systems in Go

→ Full code for this post: github.com/moinuddin/go-concurrent-ai-systems/tree/part-01
→ Run it: go run ./cmd/news-processor inside arc-1-foundations/part-01-sequential

Before we write a single goroutine, we need to feel the pain that makes concurrency worth the complexity.

This is deliberate. Most tutorials jump straight to the solution. You learn the syntax, copy the pattern, and move on — without ever really understanding why you're doing it. That shortcut catches up with you the moment something breaks in production and you have no mental model to debug against.

So in this first part, we're going to build the system the wrong way. We'll process AI tasks sequentially, watch it struggle under load, and measure exactly how bad things get. By the end, concurrency won't feel like a clever trick — it'll feel like the only sane response to a real problem.

The System We're Building

Throughout this series, we're building an AI-powered news intelligence platform. The idea is straightforward: articles come in from multiple sources, and each one needs to be processed through a pipeline of AI tasks — summarization, sentiment analysis, keyword extraction, embedding generation, and so on.

In the real world, each of these steps is a call to an external service. You're talking to an LLM API, a vector database, a classifier endpoint. These calls take time. Not a few milliseconds: we're talking hundreds of milliseconds to several seconds per call, depending on model size, load, and network conditions.

Our pipeline looks like this:

Incoming Article
       ↓
   Summarize
       ↓
 Sentiment Analysis
       ↓
 Keyword Extraction
       ↓
  Store Result

Simple enough. Let's implement it.

The Sequential Implementation

We'll simulate the AI calls rather than hitting real APIs. This matters for two reasons: you can run it locally without any API keys, and the simulated latency is more honest than you might expect, since real LLM calls often take 500ms to 1500ms under normal conditions (and sometimes considerably longer).

The repo splits this across proper packages — internal/model, internal/simulator, and internal/pipeline. Here are the pieces that matter for understanding the problem.

First, our data types:

type Article struct {
    ID      int
    Title   string
    Content string
}

type AIResult struct {
    ArticleID int
    Summary   string
    Sentiment string
    Keywords  []string
}

The core of the pipeline — the function that processes one article:

func (p *Processor) processArticle(article Article) AIResult {
    fmt.Printf("\nProcessing article %d...\n", article.ID)

    summary := p.summarize(article)
    sentiment := p.analyzeSentiment(article)
    keywords := p.extractKeywords(article)

    return AIResult{
        ArticleID: article.ID,
        Summary:   summary,
        Sentiment: sentiment,
        Keywords:  keywords,
    }
}

And the outer loop that drives it:

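// Inside ProcessAll: one article at a time, in order.
for _, article := range articles {
    result := p.processArticle(article) // processArticle is a method on *Processor
    results = append(results, result)
}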

Each AI task calls the simulator, which sleeps for a random duration between 500ms and 1500ms — matching real LLM API response times:

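// Call simulates one blocking LLM request by sleeping for a random latency.
func (c *LLMClient) Call(task string, articleID int) {
    // rand.Intn(1000) yields 0-999, so latency falls in [500ms, 1500ms).
    latency := time.Duration(500+rand.Intn(1000)) * time.Millisecond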

    fmt.Printf("  [%d] %s started (%v)\n", articleID, task, latency)
    time.Sleep(latency)
    fmt.Printf("  [%d] %s completed\n", articleID, task)
}

Run it and watch the output scroll by:

Processing article 1...
  [1] Summarization started (1.2s)
  [1] Summarization completed
  [1] Sentiment Analysis started (800ms)
  [1] Sentiment Analysis completed
  [1] Keyword Extraction started (600ms)
  [1] Keyword Extraction completed

Processing article 2...
  [2] Summarization started (950ms)
  ...

Notice what's happening. Article 2 doesn't start until article 1 is completely done. Not just the summarization — all three tasks. Every article waits its turn in line, no matter how long the wait.

Measure Before You Optimize

Before declaring anything a problem, measure it. This isn't just good advice — it's the difference between experienced engineers and engineers who make things worse while trying to fix them.

The measurement is already in ProcessAll:

start := time.Now()
// ... all the work ...
return results, time.Since(start)

Run it. You'll see something around 30 seconds for 10 articles, usually within a few seconds either side. The exact number varies because we're using random latency, but the scale is consistent.

Now let's understand where that time is coming from.

The Compounding Latency Problem

Each AI call in our simulation takes between 500ms and 1500ms, averaging roughly 1 second. Each article needs three calls. With 10 articles:

10 articles × 3 tasks × ~1s per task = ~30 seconds

That feels manageable. Until you scale it:

Articles	Estimated Time
10	~30s
100	~5 min
1,000	~50 min
10,000	~8 hours

Eight hours to process ten thousand articles. In a news platform that needs to surface stories while they're still relevant, that's not a performance problem — it's a fundamental design failure.

The binary computes this table from your actual measured duration — so the numbers you see reflect your machine's real timing, not a hardcoded guess. Run go run ./cmd/news-processor -articles=100 and watch the projection update accordingly.

And this is with simulated latency. Real AI systems are often slower, for reasons that are worth understanding.

Why AI Pipelines Are Especially Vulnerable

Regular API calls are fast. A database query, a REST endpoint, a microservice — you're usually talking 5–50ms. At that speed, sequential processing is often completely fine.

LLM calls are different:

They're network-bound and slow. A single call to GPT-4 or Claude can take 1–5 seconds or more, depending on the prompt and output length.
Latency is unpredictable. The same request can take 800ms one moment and 3 seconds the next, based on model load, batching, and infrastructure decisions entirely outside your control.
Providers throttle you. Rate limits are real. Your pipeline might be fast in isolation and grind to a halt under production load when you hit tokens-per-minute limits.
Retries are part of the workflow. Timeouts, transient errors, and rate-limit responses mean a single "call" in your code might actually be two or three round trips to the API.
Streaming responses change the timing model. Modern LLM APIs stream tokens incrementally. Your article isn't "done" the moment you get a response — the full summary arrives token by token over several seconds.

All of this means that in an AI pipeline, your code spends the vast majority of its time waiting. Not computing. Not transforming data. Just waiting for network responses to come back.

Look at what your CPU is doing while our sequential pipeline runs:

Article 1 - Summarization: WAITING 1.2s
Article 1 - Sentiment:     WAITING 800ms
Article 1 - Keywords:      WAITING 600ms
Article 2 - Summarization: WAITING 950ms
...

The CPU is idle almost the entire time. That's the inefficiency we need to fix.

Sequential Processing Isn't Always Wrong

Here's an important nuance that a lot of tutorials skip: sequential processing is not inherently bad. In many cases, it's the right choice.

Sequential systems are:

Easier to reason about — the code does exactly what it says, in exactly the order it says it
Easier to debug — when something fails, you know precisely where and why
Deterministic — the same input always produces the same execution order
Safer to start with — you can build correctness first, then optimize for speed

The engineering discipline isn't "use concurrency everywhere." It's "measure your actual bottleneck, then apply the right tool." The measurements above confirm that sequential waiting is genuinely the bottleneck here: nearly all of the roughly 30 seconds is spent blocked on simulated network calls. With that confirmed, concurrency becomes the justified solution rather than premature optimization.

This distinction matters in production. Unnecessary concurrency adds complexity, creates new categories of bugs (race conditions, deadlocks, resource leaks), and makes systems harder to operate. You don't reach for it until you've proven you need it.

We've now proven we need it.

What We Learned From the Wrong Approach

Run the sequential version one more time and sit with the output. Watch article 1 finish before article 2 begins. Notice that the total time is roughly the sum of every individual task's latency — there's no parallelism, no overlap, no pipelining. Every second of waiting in one article is a second the rest of the articles spend doing nothing.

That's the problem. Here's the key insight: those waiting periods are an opportunity. While one article waits for a summarization response, we could be starting sentiment analysis on the next article. While one LLM call is in flight, we could have five more in flight simultaneously.

The CPU was idle. The network was idle. We were processing one thing at a time in a system that could handle many.

That's what goroutines fix.

What's Next

In Part 2, we'll introduce Go's concurrency model. We'll add goroutines to our pipeline and watch the total processing time collapse — not because individual calls get faster, but because we stop waiting for them one by one.

We'll also encounter our first concurrency bug, and it will be instructive. Concurrency is not free. It solves the latency problem while introducing a new class of problems that don't exist in sequential code.

Understanding both sides of that tradeoff is what separates engineers who use concurrency confidently from engineers who copy patterns they don't fully understand.

See you in Part 2.

This is Part 1 of the series "Production-Grade Concurrent AI Systems in Go." The full series covers Go concurrency fundamentals, production patterns, distributed systems, and cloud-native AI infrastructure.