The way engineers talk about interacting with AI has changed so completely in the last few years that it is easy to forget how recent most of it is. Prompt engineering, tool calling, context windows, agentic loops -- these were not terms in active use five years ago. What existed was something structurally different: systems that could respond to language inputs but through mechanisms so unlike today's that calling them the same category of thing is almost misleading.
Understanding where we are now requires understanding the full technical arc -- not the marketing narrative of AI getting smarter, but the actual mechanisms that each generation of systems used to process input and generate output, why those mechanisms had the limitations they did, and what specific technical breakthroughs enabled the transitions between eras.
This article traces that arc for engineers who build on top of these systems. The goal is a mental model: one that explains why certain interaction patterns work, why others fail, and what the near-term shifts in the agentic paradigm will mean for systems you are building today.
Era 1: Rule-Based Systems and Expert Systems (1960s to 1990s)
The Mechanism
The earliest AI interaction systems were deterministic rule engines: explicit if-then-else trees authored by human experts, encoding domain knowledge as logical rules evaluated against structured input.
ELIZA (1966) is the canonical early example. It was a pattern-matching system that applied transformation rules to input strings. The rules were written by Joseph Weizenbaum. The intelligence was entirely in the rule table.
Expert systems of the 1980s -- MYCIN for medical diagnosis, XCON for computer configuration, DENDRAL for chemical analysis -- used forward-chaining or backward-chaining inference engines over hand-coded production rules:
Rule example (MYCIN-style):
IF the infection is primary-bacteremia
AND the site of the culture is one of the sterile sites
AND the suspected portal is the gastrointestinal tract
THEN suggestive evidence (0.7) the organism is bacteroides
The inference engine chained rules together, propagating certainty factors, to reach conclusions. A system's capability was bounded by the coverage of its rule base: it could conclude nothing that no rule anticipated.
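As a rough illustration of how such an engine chains rules, here is a minimal forward-chaining sketch. The `Rule` class, the fact names, and the `forward_chain` function are hypothetical; the combination formula is the one commonly described for MYCIN-style positive certainty factors, and the real systems were considerably more elaborate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    premises: tuple[str, ...]   # all premises must be known for the rule to fire
    conclusion: str
    cf: float                   # certainty the rule lends to its conclusion


def combine(cf_old: float, cf_new: float) -> float:
    """Combine two positive certainty factors (MYCIN-style)."""
    return cf_old + cf_new * (1.0 - cf_old)


def forward_chain(facts: dict[str, float], rules: list[Rule]) -> dict[str, float]:
    """Fire every rule whose premises are known until nothing new fires."""
    known = dict(facts)
    fired: set[int] = set()
    changed = True
    while changed:
        changed = False
        for i, rule in enumerate(rules):
            if i in fired or not all(p in known for p in rule.premises):
                continue
            # A conclusion is only as certain as its weakest premise.
            support = min(known[p] for p in rule.premises) * rule.cf
            known[rule.conclusion] = combine(known.get(rule.conclusion, 0.0), support)
            fired.add(i)
            changed = True
    return known


# Illustrative fact names mirroring the rule quoted above.
rules = [Rule(("infection=primary-bacteremia", "site=sterile", "portal=gi-tract"),
              "organism=bacteroides", 0.7)]
facts = {"infection=primary-bacteremia": 1.0, "site=sterile": 1.0, "portal=gi-tract": 1.0}
result = forward_chain(facts, rules)
```

Note that an input with no matching rule simply yields no conclusion, which is the brittleness discussed below.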
The Interaction Model
Interaction with rule-based systems was necessarily structured. The surface was defined entirely by authored rules. Free-form natural language required preprocessing into a canonical form the rules could match. Multi-turn conversation was replaced by navigating a decision tree the engineer had designed. The work was knowledge engineering: eliciting rules from domain experts and encoding them correctly. There was no training, no data, no statistical inference.
Why It Broke Down
Rule-based systems hit a fundamental scaling wall. The number of rules required to handle real-world language diversity grows combinatorially with domain complexity. Any input not covered by a rule produces silence or a nonsensical response. Human language is not rule-following behaviour. It is statistical, contextual, ambiguous, and generative.
Era 2: Statistical NLP and Intent Classification (1990s to 2015)
The Mechanism
The shift from rules to statistics happened through the 1990s as probabilistic modelling replaced knowledge engineering. Instead of encoding what language means, statistical systems learned what language patterns correlate with what outputs from labelled data.
N-gram language models estimate the probability of the next word given the previous N-1 words, computed from corpus frequencies. A trigram model assigns P(word | prev, prev-prev) based on observed training frequencies. This is effective for short-range prediction, but it offers no mechanism for semantic meaning or long-range dependencies.
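A trigram model of this kind fits in a few lines. The sketch below uses unsmoothed maximum-likelihood counts over a made-up corpus; the function names are illustrative, and real systems of the era added smoothing and backoff for unseen contexts.

```python
from collections import Counter, defaultdict


def train_trigram(tokens: list[str]) -> dict[tuple[str, str], Counter]:
    """Count which word follows each pair of preceding words."""
    counts: dict[tuple[str, str], Counter] = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts


def prob(counts, prev2: str, prev1: str, word: str) -> float:
    """Maximum-likelihood P(word | prev2, prev1); 0.0 for unseen contexts."""
    context = counts.get((prev2, prev1))
    if not context:
        return 0.0
    return context[word] / sum(context.values())


# Toy corpus, made up for illustration.
corpus = "the cat sat on the mat and the cat sat on the sofa".split()
model = train_trigram(corpus)
p_mat = prob(model, "on", "the", "mat")      # "on the" is followed by mat once, sofa once
p_unseen = prob(model, "purple", "dog", "ran")
```

The zero probability for any unseen context is the short-range limitation in concrete form.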
Bag-of-words with TF-IDF represents documents as vectors of weighted word counts, discarding word order entirely. Cosine similarity between vectors measures lexical overlap. This powered most information retrieval systems of the era and introduced the vocabulary mismatch problem: query for "cardiac arrest," miss the document that says "heart attack" -- no lexical overlap, no retrieval.
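The vocabulary mismatch problem is easy to reproduce. The following sketch implements a simple TF-IDF weighting and cosine similarity in plain Python; the smoothed IDF variant and the three example texts are assumptions chosen for illustration, and the query is weighted together with the documents purely to keep the example short.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF so that terms present in every document keep some weight.
        vectors.append({t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


query, doc_a, doc_b = tfidf_vectors([
    "cardiac arrest",
    "patient suffered a heart attack",   # relevant, but shares no terms with the query
    "cardiac arrest protocol",
])
sim_a = cosine(query, doc_a)
sim_b = cosine(query, doc_b)
```

Here `sim_a` is exactly zero: with no shared terms, the relevant document is invisible to the retriever.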
SVMs and logistic regression on hand-engineered features handled classification tasks. A sentiment classifier might use features like presence of positive or negative lexicon words, negation markers, and punctuation patterns. The engineer designed the features; the model learned the weights.
The Interaction Model
Statistical NLP enabled intent classification and slot filling -- the technical foundation of the voice assistant generation:
Input: "Book me a flight to Tokyo next Tuesday"
Intent: BOOK_FLIGHT
Slots: { destination: "Tokyo", date: "next Tuesday" }
This powered Siri (2011), Google Now (2012), Alexa (2014), and Cortana (2014). Each was primarily an intent classifier routing to backend action handlers. The interaction model was command-and-response: a constrained natural language command in, a predefined action out. Multi-turn context was handled by explicit state machines the engineer designed, not by the model. The system's capability was bounded by its intent taxonomy -- anything outside it produced a classification failure.
For engineers, building on these systems meant defining an intent taxonomy, collecting and labelling training utterances per intent, training a classifier, implementing action handlers per intent, and managing slot filling as a separate pipeline. The model was one component in a hand-designed flow, not the flow itself.
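The pipeline described above can be sketched end to end with a tiny Naive Bayes intent classifier and a regex slot extractor. The training utterances, the `GET_WEATHER` intent, and the slot regex are all hypothetical, and production systems used far larger datasets and dedicated sequence taggers for slots.

```python
import math
import re
from collections import Counter, defaultdict

TRAINING = [  # hypothetical labelled utterances
    ("book me a flight to paris", "BOOK_FLIGHT"),
    ("i need a flight to berlin", "BOOK_FLIGHT"),
    ("what is the weather in rome", "GET_WEATHER"),
    ("will it rain in oslo tomorrow", "GET_WEATHER"),
]


def train(examples):
    """Collect per-intent word counts for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    intent_counts = Counter()
    for text, intent in examples:
        intent_counts[intent] += 1
        word_counts[intent].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, intent_counts, vocab


def classify(text, model):
    word_counts, intent_counts, vocab = model
    total = sum(intent_counts.values())
    scores = {}
    for intent, n in intent_counts.items():
        denom = sum(word_counts[intent].values()) + len(vocab)
        score = math.log(n / total)
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[intent][word] + 1) / denom)
        scores[intent] = score
    return max(scores, key=scores.get)


def fill_slots(text):
    # Slot filling as a separate, hand-written extractor (a regex here).
    match = re.search(r"\bto (\w+)(?: (.+))?$", text, flags=re.IGNORECASE)
    return {"destination": match.group(1), "date": match.group(2)} if match else {}


model = train(TRAINING)
utterance = "Book me a flight to Tokyo next Tuesday"
intent = classify(utterance, model)
slots = fill_slots(utterance)
```

The classifier can only ever return an intent from its taxonomy, which is why out-of-taxonomy requests fail regardless of how they are phrased.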
Era 3: Neural Sequence Models and Dense Embeddings (2013 to 2018)
The Mechanism
Word2Vec (Mikolov et al., 2013) and GloVe (2014) marked a qualitative shift in language representation. Instead of sparse bag-of-words vectors, words were represented as dense vectors in a continuous embedding space where geometry encodes semantic relationships.
Train a shallow neural network to predict a word from its context, and the learned weight vectors encode semantic similarity. Words appearing in similar contexts end up geometrically close. This follows from distributional semantics: words with similar usage patterns have similar embeddings. The most cited demonstration: vector("king") - vector("man") + vector("woman") approximates vector("queen").
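The analogy arithmetic can be shown with hand-made vectors. The three-dimensional embeddings below are invented for illustration so that the arithmetic works out cleanly; learned embeddings have hundreds of dimensions and only approximate this behaviour.

```python
import math

# Hand-made vectors, NOT learned. Axes: royalty, maleness, femaleness.
EMBEDDINGS = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = (w for w in EMBEDDINGS if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))


nearest = analogy("king", "man", "woman")
```

Excluding the input words from the candidates matters in practice: with real embeddings the nearest vector to the target is often one of the inputs.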
LSTMs added gating mechanisms to RNNs -- selective memory and forgetting -- addressing the vanishing gradient problem. The encoder-decoder architecture with attention (Bahdanau et al., 2015) was the key breakthrough for sequence-to-sequence tasks. Rather than compressing the entire source sequence into one fixed-size context vector, attention allowed the decoder to look back at all encoder hidden states, weighted by relevance to the current decode position. This was the direct architectural precursor to transformer self-attention.
The Interaction Model
Dense embeddings enabled semantic search: encode query and documents into the same embedding space, retrieve by vector similarity rather than lexical overlap. "Cardiac arrest" and "heart attack" now land close in embedding space. The vocabulary mismatch problem was substantially addressed.
This is the embedding pipeline that underlies every RAG system in production today. It was developed in this era as an information retrieval improvement and later became the retrieval backbone of LLM augmentation. For conversational systems, the interaction model remained command-and-response -- generation quality was not reliable enough for open-ended user-facing output.
Era 4: The Transformer and Pretraining at Scale (2017 to 2021)
The Mechanism
The transformer architecture (Vaswani et al., "Attention Is All You Need," 2017) replaced recurrence with self-attention as the core sequence processing mechanism.
Self-attention computes, for each position in the sequence, a weighted sum of the value vectors at all positions (including its own), where the weights come from comparing that position's query vector with every position's key vector:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Q, K, V : linear projections of the input embeddings
d_k : key dimension; dividing by sqrt(d_k) keeps large dot products from saturating the softmax
Output : for each position, a weighted blend of all value vectors
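The formula translates almost directly into code. This is a single-head sketch in plain Python over lists of row vectors, with identity projections assumed for brevity (so Q, K, and V are all the raw input); the optional `causal` flag is an addition that masks later positions, as decoder-only models do.

```python
import math


def softmax(xs):
    m = max(xs)                                  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Position i may not attend to positions after it.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out


X = [[1.0, 0.0], [0.0, 1.0]]                     # two positions, d_k = 2
full = attention(X, X, X)
masked = attention(X, X, X, causal=True)
```

Every output row depends on every input row through one pair of matrix products, which is why the whole sequence can be processed in parallel.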
Multi-head attention runs this H times in parallel with different learned projections -- simultaneously attending to syntactic structure, semantic relationships, and coreference in different heads. The outputs are concatenated and projected.
The critical engineering property: transformers process all positions in parallel during training. No sequential dependency means no vanishing gradient over long sequences and full GPU matrix utilisation. This is what made scaling to billions of parameters tractable.
GPT (2018) uses causal left-to-right self-attention -- each token attends only to previous tokens. Trained as an autoregressive language model: predict the next token given all previous tokens. Generative by construction. BERT (2018) uses bidirectional attention and masked language modelling -- rich contextual representations for understanding tasks, not natively generative.
GPT-3 (2020) demonstrated that scaling to 175 billion parameters produced emergent in-context learning: provide examples of a task in the prompt prefix, and the model generalises to new instances without weight updates:
Translate English to French:
English: The cat sat on the mat.
French: Le chat était assis sur le tapis.
English: The weather is nice today.
French: [model completes correctly -- no gradient update occurred]
Zero-shot and few-shot generalisation emerged as a consequence of scale, not explicit design.
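Mechanically, a few-shot prompt is just string assembly. The helper below is a hypothetical sketch that reproduces the layout of the translation example; the function name and parameters are illustrative.

```python
def build_few_shot_prompt(instruction, examples, query, source="English", target="French"):
    """Lay out worked examples, then the new input with the answer left open."""
    lines = [instruction]
    for src, tgt in examples:
        lines += [f"{source}: {src}", f"{target}: {tgt}"]
    lines += [f"{source}: {query}", f"{target}:"]   # the model continues from here
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("The cat sat on the mat.", "Le chat était assis sur le tapis.")],
    "The weather is nice today.",
)
```

Ending the prompt on the open `French:` label is what turns plain next-token prediction into task execution.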
The Interaction Model
Prompt-based interaction became a serious engineering paradigm. Task specification moved from model architecture -- a separate model per task requiring labelled data -- to input prefix: specify the task in natural language, the same model handles everything. Prompt engineering emerged as a discipline.
Context windows were still small (GPT-3: 2,048 tokens at launch, about 4K in later variants) and inference was expensive. Each call was stateless -- the model had no persistent memory between calls. Multi-turn conversation required the engineer to manage a growing prompt buffer, appending each turn and truncating at the context limit. The interaction model was single-pass completion: a prompt goes in, a completion comes out, no persistent state.
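A minimal sketch of that prompt buffer follows. The whitespace-based `count_tokens` is a crude stand-in for the model's real tokenizer, and dropping the oldest turns first is only one of several possible truncation policies.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in: real systems count tokens with the model's own tokenizer.
    return len(text.split())


def fit_history(system: str, turns: list[str], limit: int) -> list[str]:
    """Keep the system text plus the most recent turns that fit within `limit`."""
    budget = limit - count_tokens(system)
    kept: list[str] = []
    for turn in reversed(turns):            # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return [system] + kept[::-1]            # restore chronological order


history = [
    "User: hi there",
    "Assistant: hello, how can I help",
    "User: summarise our chat",
]
window = fit_history("You are a helpful assistant.", history, limit=14)
```

With a 14-token limit only the latest turn survives, so the request to summarise the chat arrives without the chat, which is the failure that naive truncation produces.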
Era 5: Instruction Tuning and RLHF (2021 to 2023)
The Mechanism
Base language models trained on next-token prediction are misaligned with user intent: they continue text in the style of their training corpus rather than following instructions helpfully. A base GPT-3 model asked "What is the capital of France?" is as likely to continue with more geography questions as to answer directly.
Instruction tuning fine-tuned base models on curated (instruction, ideal response) datasets, teaching the model that questions should be answered and commands followed, not continued as text.
RLHF (Reinforcement Learning from Human Feedback, InstructGPT, Ouyang et al., 2022) aligned model output with human preference through three phases:
Phase 1 -- Supervised Fine-Tuning (SFT)
Human labellers write ideal responses to diverse prompts.
Fine-tune base model on (prompt, ideal_response) pairs.
Phase 2 -- Reward Model Training
Labellers rank multiple model responses to the same prompt.
Train a reward model to predict a scalar preference score
reflecting labeller judgements of helpfulness, truthfulness, and harmlessness.
Phase 3 -- RL with Proximal Policy Optimisation (PPO)
Use reward model scores as environment reward signal.
Fine-tune SFT model to maximise expected reward.
KL divergence penalty against the SFT model limits drift and reward hacking.
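Two quantities from the phases above can be written down compactly: the pairwise loss used to train the reward model, and the KL-penalised reward optimised in the RL phase. This is a per-sample sketch with scalar inputs; the function names and the `beta` default are illustrative, and real pipelines compute both over batches of token-level log-probabilities.

```python
import math


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Reward-model loss for one ranked pair: -log(sigmoid(r_chosen - r_rejected))."""
    return math.log1p(math.exp(-(reward_chosen - reward_rejected)))


def penalised_reward(reward: float, logp_policy: float, logp_sft: float,
                     beta: float = 0.1) -> float:
    """RL-phase signal: reward minus beta times the log-ratio to the SFT model."""
    return reward - beta * (logp_policy - logp_sft)


loss_correct_order = pairwise_loss(2.0, 0.0)   # reward model agrees with the labeller
loss_wrong_order = pairwise_loss(0.0, 2.0)     # reward model disagrees
drifted = penalised_reward(1.0, logp_policy=-0.5, logp_sft=-1.0)
```

The penalty grows as the policy assigns its outputs more probability than the SFT model would, which discourages the policy from wandering into regions where the reward model is unreliable.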
DPO (Direct Preference Optimisation, Rafailov et al., 2023) reformulated the RLHF objective as a supervised learning problem, eliminating the separate reward model and RL loop entirely. It is more stable and less resource-intensive, and reports comparable alignment quality. Many production fine-tuning pipelines now use DPO or variants instead of full PPO-based RLHF.
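The DPO objective for a single preference pair is short enough to state directly. In this sketch the inputs are sequence log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model; the argument names and the `beta` default are illustrative.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one pair of responses, given their log-probabilities."""
    # Implicit reward of a response: beta * (log pi_policy - log pi_ref).
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))     # equals -log(sigmoid(margin))


at_reference = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy identical to the reference
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)       # policy now prefers the chosen response
```

The loss has the same pairwise form as reward-model training, except that the reward is expressed through the policy's own log-probabilities, so no separate reward model is needed.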
The Interaction Model: Structured Conversation
The three-role conversation format became the standard interface -- system prompt, user message, assistant response -- with the model maintaining coherent context across multiple turns within a single context window. ChatGPT (November 2022) made this paradigm mainstream.
The system prompt as a persistent instruction layer, user messages as sequential turns, assistant responses shaped by both the system prompt and the full conversation history -- this introduced genuine multi-turn reasoning. Not because the model had persistent memory outside the context window, but because the full conversation history was present in the input on every call.
Context windows expanded substantially: GPT-3.5-turbo moved to 16K tokens, Claude 2 to 100K tokens, later models to 200K and beyond. Larger context windows reduced the context management problem but did not eliminate it -- models exhibit attention degradation on content in the middle of very long contexts, a phenomenon documented as the lost-in-the-middle problem (Liu et al., 2023).
For engineers, the key challenges of this era were: context window management (fitting relevant history within the limit), system prompt engineering (encoding persona, constraints, and task framing), and cost management (every token in the context window costs inference compute on every call).
Era 6: Tool Use and Retrieval Augmentation (2023 to Present)
The Mechanism
Instruction-tuned LLMs with large context windows were powerful but had two fundamental limitations: knowledge was frozen at training time, and the model could not take actions in the world beyond generating text. Tool use and RAG addressed these directly.
Function calling / tool use (introduced in the OpenAI API in June 2023, rapidly adopted across providers) extended the model output format to include structured tool invocations alongside natural language. The model is given a list of available tools with their JSON schemas. When it determines a tool should be called, it outputs a structured invocation. The calling system executes the tool and returns the result as a new context entry. The model incorporates the result and continues.
// Tool definition provided to model
{
"name": "query_database",
"description": "Execute a read-only SQL query against the production replica",
"parameters": {
"type": "object",
"properties": {
"query": { "type": "string" }
},
"required": ["query"]
}
}
// Model output: structured tool invocation
{
"tool_calls": [{
"id": "call_abc123",
"function": {
"name": "query_database",
"arguments": "{\"query\": \"SELECT COUNT(*) FROM orders WHERE status = 'pending'\"}"
}
}]
}
// Tool result injected back into context
{
"role": "tool",
"tool_call_id": "call_abc123",
"content": "[{\"count\": 4821}]"
}
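On the application side, the calling system needs a small dispatcher between those two messages. The sketch below assumes the message shapes shown above; `query_database` is a hypothetical stub returning a canned row, and the `TOOLS` registry is an illustrative design, not any provider's SDK.

```python
import json


def query_database(query: str) -> list[dict]:
    # Hypothetical stand-in for a real read-only database call.
    return [{"count": 4821}]


TOOLS = {"query_database": query_database}


def run_tool_calls(model_output: dict) -> list[dict]:
    """Execute each requested tool and build the messages to send back."""
    messages = []
    for call in model_output.get("tool_calls", []):
        fn = call["function"]
        try:
            args = json.loads(fn["arguments"])     # arguments arrive as a JSON string
            content = json.dumps(TOOLS[fn["name"]](**args))
        except Exception as exc:
            # Report the failure to the model instead of crashing the loop.
            content = json.dumps({"error": str(exc)})
        messages.append({"role": "tool", "tool_call_id": call["id"], "content": content})
    return messages


model_output = {"tool_calls": [{"id": "call_abc123", "function": {
    "name": "query_database",
    "arguments": "{\"query\": \"SELECT COUNT(*) FROM orders WHERE status = 'pending'\"}",
}}]}
tool_messages = run_tool_calls(model_output)
```

Returning errors as tool results lets the model see what went wrong and choose a different action on the next call.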
Retrieval-Augmented Generation (RAG) addressed knowledge staleness by retrieving relevant documents at query time and injecting them into context before generation:
User query
|
v
Query Encoder --> Embedding
|
v
Vector Store search (cosine similarity)
|
v
Top-K retrieved documents
|
v
[System Prompt] + [Retrieved Docs] + [User Query] --> LLM --> Grounded Response
The embedding pipeline from Era 3 -- encode documents into dense vectors, index in a vector store, retrieve by similarity -- became the standard retrieval backbone. The LLM's role shifted from knowledge store to reasoning engine over retrieved context. RAG introduced new engineering problems: retrieval quality, context assembly, and faithfulness (is the model staying grounded in retrieved context or hallucinating beyond it?).
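The retrieval and context-assembly steps can be sketched in a few functions. The `embed` function here is a toy word-count vector standing in for a learned encoder, the in-memory list stands in for a vector store, and the documents and prompt layout are invented for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for a learned encoder: a word-count vector.
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(system: str, query: str, docs: list[str], k: int = 2) -> str:
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {query}"


DOCS = [
    "Orders are archived after 90 days.",
    "The replica lags the primary by up to five seconds.",
    "Refunds are processed within three business days.",
]
prompt = build_prompt("Answer only from the context.",
                      "How long until orders are archived?", DOCS, k=1)
```

Everything the model can ground its answer in is decided by `retrieve`, which is why retrieval quality tends to dominate end-to-end answer quality.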
The Interaction Model: Augmented Reasoning
The model became an augmented reasoner: able to consult external knowledge via retrieval and invoke external capabilities via tool use, mid-response. For engineers, this meant designing tool interfaces, managing retrieval pipelines, handling tool errors and retries, and reasoning about latency -- multi-step tool-augmented responses involve multiple sequential model calls and external API calls, each with their own latency distribution.
Era 7: Agentic Systems and Multi-Step Reasoning (2024 to Present)
The Mechanism
Tool use and RAG extended what a single model call could do. Agentic systems extended what a model could do across multiple sequential calls, with the model planning its own execution sequence based on intermediate results.
The defining architectural pattern is the reason-act loop (ReAct, Yao et al., 2022):
Observation (context, task, results accumulated so far)
|
v
Reasoning (chain-of-thought: what do I know, what do I need next)
|
v
Action (tool call, subquery, or final answer)
|
v
New Observation (tool result incorporated into context)
|
v
[loop until task complete or step limit reached]
Each iteration is a separate model call. The model accumulates context across iterations and uses that context to plan subsequent steps. The key difference from simple tool use: the model is not executing a predetermined sequence. It is planning -- deciding what to do next based on what it observed -- and revising its plan when results are unexpected.
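The loop itself is small; the complexity lives in the model. In the sketch below, `scripted_model` is a hypothetical stand-in for an LLM call that returns a thought, an action, and an input, and the step dictionary format is an assumption made for illustration.

```python
def react_loop(task, model, tools, max_steps=5):
    """Alternate model calls and tool calls until a final answer or the step limit."""
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        step = model(context)                          # one model call per iteration
        context.append(f"Thought: {step['thought']}")
        if step["action"] == "final_answer":
            return step["input"], context
        tool = tools.get(step["action"])
        observation = tool(step["input"]) if tool else f"unknown tool {step['action']}"
        context.append(f"Observation: {observation}")
    return None, context                               # step limit reached


def scripted_model(context):
    # Stand-in for an LLM: chooses the next action from what it has observed so far.
    observations = [c for c in context if c.startswith("Observation:")]
    if not observations:
        return {"thought": "I need the pending order count.",
                "action": "count_pending", "input": ""}
    return {"thought": "I have the count.", "action": "final_answer",
            "input": observations[-1].removeprefix("Observation: ")}


answer, trace = react_loop("How many orders are pending?", scripted_model,
                           {"count_pending": lambda _: "4821"})
```

The explicit `max_steps` bound is the simplest guard against a model that never decides it is finished.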
Chain-of-thought prompting (Wei et al., 2022) demonstrated that prompting the model to reason step by step before producing an answer substantially improved performance on multi-step tasks. Because generation is autoregressive, the reasoning trace becomes part of the context that conditions the final answer tokens.
Extended thinking / reasoning models (OpenAI o1, o3, Claude with extended thinking) internalised this loop: the model generates a long internal reasoning trace before producing a final response. This trades inference time for answer quality -- the model can explore multiple reasoning paths, backtrack, and self-correct before committing to an output.
Multi-Agent Architectures
Production agentic systems increasingly use multiple specialised agents rather than a single general one; each agent may be a distinct model or the same model with its own prompt and tools. An orchestrator breaks down a complex task and routes subtasks to specialised worker agents. Workers execute subtasks and return results. The orchestrator synthesises results and determines next steps.
User task: "Analyse our database performance and suggest optimisations"
Orchestrator
|
|---> SQL Analysis Agent
| Tools: query_slow_log, run_explain_analyze
| Returns: slow queries with execution plans
|
|---> Index Analysis Agent
| Tools: query_pg_stat_user_indexes, check_table_bloat
| Returns: unused indexes, bloat metrics
|
|---> Schema Analysis Agent
Tools: query_table_definitions, query_constraints
Returns: schema assessment
|
v
Orchestrator synthesises: combined analysis with prioritised recommendations
This architecture allows parallelisation of subtasks, specialisation per domain, and failure isolation -- a worker agent failure does not abort the entire task.
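Parallel fan-out with failure isolation might look like the following sketch. The three worker functions are hypothetical stubs (one deliberately fails), and a thread pool is only one of several reasonable ways to run workers concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def sql_agent(task: str) -> str:
    return "2 slow queries found"


def index_agent(task: str) -> str:
    raise RuntimeError("stats view unavailable")   # simulated worker failure


def schema_agent(task: str) -> str:
    return "schema looks normalised"


WORKERS = {"sql": sql_agent, "index": index_agent, "schema": schema_agent}


def orchestrate(task: str, workers: dict) -> tuple[dict, dict]:
    """Run all workers in parallel; collect results and failures separately."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = {name: pool.submit(fn, task) for name, fn in workers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=30)
            except Exception as exc:
                failures[name] = str(exc)           # one failure does not abort the rest
    return results, failures


results, failures = orchestrate("Analyse database performance", WORKERS)
```

Keeping failures as data lets the orchestrator decide whether to synthesise a partial answer, retry the failed worker, or escalate.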
The Interaction Model: Delegated Execution
The interaction model of the agentic era is fundamentally different from all previous eras. The user is no longer sending a query and waiting for a response. The user is delegating a task and monitoring progress. The system is not responding -- it is working.
This shift has profound engineering implications:
Observability becomes critical. A single model call either works or it does not. An agentic workflow with twenty steps, each making tool calls, can fail at any point for any reason. You need to observe every step: what the model reasoned, what tool it called, what the tool returned, why it made the decisions it did. This requires structured logging of the full agent trace, not just the final output.
Error handling becomes compositional. Tool failures, model refusals, unexpected outputs, and reasoning errors compose in an agentic workflow. The system needs graceful degradation: retry with backoff, fallback to a different approach, escalate to human review. These must be designed into the agent architecture, not handled after the fact.
Latency budgets change. A chat response should be fast -- seconds. An agentic workflow doing meaningful work may legitimately take minutes. The interaction model needs progress reporting, partial results, and cancellation -- concepts that did not exist in the single-call paradigm.
State management is a systems problem. Within a single context window, the model manages state through attention over accumulated context. Across context windows -- when a task exceeds the context limit or spans multiple sessions -- state must be persisted externally and injected back at the start of each call. This is a systems engineering problem, not a model problem.
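Two of these concerns, error handling and observability, can share one small wrapper. The sketch below retries a tool call with exponential backoff and records every attempt in a structured trace; `flaky_lookup` is a hypothetical tool that fails twice, and the trace fields are illustrative.

```python
import json
import time


def call_with_retry(tool, args, trace, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call a tool, retrying with exponential backoff and logging each attempt."""
    for attempt in range(1, retries + 1):
        try:
            result = tool(**args)
            trace.append({"tool": tool.__name__, "args": args,
                          "attempt": attempt, "ok": True})
            return result
        except Exception as exc:
            trace.append({"tool": tool.__name__, "args": args,
                          "attempt": attempt, "ok": False, "error": str(exc)})
            if attempt < retries:
                sleep(base_delay * 2 ** (attempt - 1))   # 0.5s, 1s, 2s, ...
    return None   # caller decides: fall back or escalate to human review


_calls = {"n": 0}


def flaky_lookup(key: str) -> str:
    # Hypothetical tool that times out on its first two calls.
    _calls["n"] += 1
    if _calls["n"] < 3:
        raise TimeoutError("upstream timeout")
    return f"value-for-{key}"


trace: list[dict] = []
value = call_with_retry(flaky_lookup, {"key": "orders"}, trace, sleep=lambda s: None)
trace_json = json.dumps(trace)   # ready to ship to a log pipeline
```

Injecting `sleep` keeps the wrapper testable, and because the trace is plain JSON it can be stored alongside the model's reasoning for each step.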
The Near-Term Trajectory (Next 12 to 24 Months)
Longer and More Reliable Context Windows
Context windows have grown from 2K (GPT-3 at launch) to 200K tokens (Claude 3) to 1M tokens (Gemini 1.5) within about four years. The technical challenge is not storing long contexts -- it is attending to them reliably. The lost-in-the-middle problem (Liu et al., 2023) demonstrated that current models have degraded recall for content in the middle of long contexts, preferring content near the beginning and end.
The near-term engineering effort is on making long-context attention reliable, not just extending the length. For engineers building on these models: do not assume that injecting a large document into context guarantees the model will use all of it correctly. Retrieval over long documents is still often more reliable than injection for content that is unlikely to be near the beginning or end.
Persistent Memory Beyond the Context Window
Current LLM interactions are stateless across sessions -- each session starts with a blank context, requiring the engineer to inject any relevant prior state. Near-term systems are developing explicit memory architectures: the model can write to and read from a persistent memory store that survives across sessions.
The engineering challenge is memory management: what to store, when to store it, how to retrieve the right memories for a given query, and how to handle conflicts between memories. This is a retrieval problem (semantic similarity over memory entries) combined with a relevance problem (not all memories are relevant to all queries). The same vector store and embedding pipeline used for RAG is the natural substrate for persistent memory.
For engineers: memory-augmented systems require new patterns for memory writing (what is worth remembering?), memory retrieval (how to surface relevant memories without polluting context with irrelevant ones?), and memory update (how to handle facts that change over time?).
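The three patterns (write, retrieve, update) can be sketched with a small in-memory store. Word overlap stands in for embedding similarity here, and the class name, keys, and last-write-wins update policy are all assumptions made for illustration.

```python
import time


class MemoryStore:
    """Toy persistent-memory interface: keyed facts with timestamps."""

    def __init__(self):
        self._entries = {}   # key -> (text, timestamp)

    def write(self, key, text, now=None):
        # Writing to an existing key replaces the older fact (last write wins).
        self._entries[key] = (text, time.time() if now is None else now)

    def retrieve(self, query, k=2, min_overlap=1):
        q = set(query.lower().split())
        scored = []
        for text, ts in self._entries.values():
            overlap = len(q & set(text.lower().split()))
            if overlap >= min_overlap:          # leave irrelevant memories out of context
                scored.append((overlap, ts, text))
        scored.sort(reverse=True)               # most overlap first, newest breaks ties
        return [text for _, _, text in scored[:k]]


memory = MemoryStore()
memory.write("user.timezone", "user timezone is UTC", now=1)
memory.write("user.timezone", "user timezone is CET", now=2)   # the fact changed
memory.write("user.editor", "user prefers vim", now=3)
hits = memory.retrieve("what timezone is the user in", min_overlap=2)
```

The relevance threshold is what keeps the unrelated editor preference out of the prompt; without it, every retrieval would pad the context with weak matches.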
Structured Outputs and JSON Schema Enforcement
Early LLM APIs returned free-form text the application had to parse. Current APIs support structured output modes where the model is constrained to produce JSON conforming to a provided schema. This is not prompt engineering -- it is constrained decoding at the inference level: at each step, tokens that would violate a grammar derived from the schema are typically masked out of the logits before sampling.
{
"model": "gpt-4o-2024-08-06",
"messages": [...],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "performance_analysis",
"schema": {
"type": "object",
"properties": {
"slow_queries": { "type": "array", "items": { "type": "string" } },
"recommendations": { "type": "array", "items": { "type": "string" } },
"severity": { "type": "string", "enum": ["low", "medium", "high", "critical"] }
},
"required": ["slow_queries", "recommendations", "severity"]
}
}
}
}
For engineers: structured outputs fundamentally change how you integrate LLMs into application pipelines. Instead of parsing free-form text with fragile regex or prompt-engineered JSON extraction, you get schema-conforming structured data that can be directly deserialised into application types. This removes an entire class of integration bugs and enables stronger contracts between the model and the rest of your system. Schema conformance guarantees shape, not correctness: the values still need validating against your domain rules.
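On the consuming side, deserialisation into application types might look like this sketch. The `PerformanceAnalysis` dataclass mirrors the schema above; the class names and the sample response string are hypothetical.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Severity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class PerformanceAnalysis:
    slow_queries: list[str]
    recommendations: list[str]
    severity: Severity


def parse_analysis(raw: str) -> PerformanceAnalysis:
    """Deserialise a schema-conforming response into an application type."""
    data = json.loads(raw)
    return PerformanceAnalysis(
        slow_queries=list(data["slow_queries"]),
        recommendations=list(data["recommendations"]),
        severity=Severity(data["severity"]),   # raises ValueError on unknown values
    )


# Made-up response body for illustration.
raw = ('{"slow_queries": ["SELECT * FROM orders"], '
       '"recommendations": ["add index on orders(status)"], "severity": "high"}')
analysis = parse_analysis(raw)
```

Keeping the enum check in application code is a cheap second line of defence for providers or modes that do not strictly enforce the schema.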
The Shift from Prompt Engineering to Agent Engineering
The dominant engineering skill of the instruction-tuned era was prompt engineering: crafting system prompts and message templates that reliably elicited the desired model behaviour. This skill is becoming less central as the interaction model shifts to agentic execution.
The emerging skill set is agent engineering: designing the tool interface contracts that agents call (what parameters, what return types, what error shapes), the orchestration logic that routes tasks to appropriate agents, the observability infrastructure that makes agent behaviour debuggable, and the state management systems that give agents memory across context windows.
The underlying model capability -- reasoning, planning, tool selection -- is increasingly taken as given. The engineering challenge is building reliable systems on top of that capability: systems that handle failures gracefully, are observable and debuggable, maintain consistent behaviour across diverse inputs, and can be evaluated systematically rather than manually.
Conclusion: The Unifying Thread
Looking across the full arc -- from ELIZA's rule table to a multi-agent system that plans, retrieves, executes tools, and maintains memory -- the pattern is consistent: each era expanded the space of what could be expressed as an interaction without requiring explicit engineering effort.
In the rule-based era, everything had to be explicitly programmed. In the statistical era, classification boundaries could be learned, but the taxonomy had to be designed. In the neural era, representations could be learned, but tasks still had to be defined and labelled. In the transformer era, tasks could be specified in natural language, but the model still needed to be called explicitly for each step. In the agentic era, the model plans its own steps -- the engineer specifies the goal and the available tools, not the execution sequence.
Each transition was enabled by a specific technical breakthrough: from rules to statistics by corpus availability and probabilistic modelling; from bag-of-words to dense embeddings by the distributional hypothesis and backpropagation at scale; from RNNs to transformers by self-attention and parallelism; from base models to aligned models by RLHF and instruction tuning; from single-call to multi-step by tool use and chain-of-thought.
The near-term transitions -- reliable long context, persistent memory, structured outputs, agent engineering as a discipline -- are continuations of the same pattern. Each one reduces the explicit engineering required to achieve a given level of capability. Each one raises the floor of what is possible by default.
The engineers who build the best systems on top of these capabilities are not the ones who treat the model as magic. They are the ones who understand the mechanism well enough to know where the current model fails, design their system architecture around those failure modes, and adapt as the model capabilities evolve beneath them.
That understanding is what this article was built to support.