Understanding How AI Systems Actually Work: A Layered View From Weights to Agents

A personal reference, written after spending a couple of days digging into local LLMs from first principles.

Most explanations of “how AI works” stop at the level of the model itself — a black box that takes text in and produces text out. But once you actually try to run one yourself, you quickly discover that “the model” is only one layer of a much taller stack. Every product you’ve used — ChatGPT, Claude, Cursor, Character.AI, GitHub Copilot — is built from the same handful of layers. Once you can name those layers, the entire ecosystem becomes navigable.

This post is my reference for those layers, plus the key technical concepts that show up inside them.


The Core Insight: AI Systems Are Layered

A working AI product like Claude Code or ChatGPT is not a single piece of software. It’s at least four distinct layers stacked on top of each other, each with its own job:

LayerRoleExample
ModelThe trained weights — billions of numbersLlama 3.1, Qwen 2.5, Mistral, Claude Opus
Inference engineLoads the model and runs the mathOllama, vLLM, llama.cpp, TensorRT-LLM
OrchestratorManages context, memory, personas, historyOpen WebUI, LangChain, LlamaIndex
AgentLets the model take actions in the real worldClaude Code, Goose, Cline, Aider

Each layer expands the system’s reach. The model knows things. The engine makes it run. The orchestrator gives it situational awareness. The agent gives it hands.

Here’s how the layers stack together in practice — what flows between them, and where the boundaries are:

┌───────────────────────────────────────────────────────────┐
│                          USER                             │
│              "read my file and summarize it"              │
└─────────────────────────────┬─────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│  AGENT             Claude Code · Goose · Cline · Aider    │
│  ───────────────────────────────────────────────────────  │
│  Defines tools and executes real-world side effects:      │
│    • read_file, write_file, run_shell, http_request       │
│  Parses tool calls from model output and runs them.       │
└─────────────────────────────┬─────────────────────────────┘
                              │  builds tool-aware prompt
                              ▼
┌───────────────────────────────────────────────────────────┐
│  ORCHESTRATOR      Open WebUI · LangChain · LlamaIndex    │
│  ───────────────────────────────────────────────────────  │
│  Constructs the full prompt for every turn:               │
│    • Chat history    • Persona / character card           │
│    • Summaries       • RAG (vector retrieval)             │
│    • World info      • Sampler presets                    │
└─────────────────────────────┬─────────────────────────────┘
                              │  HTTP request with prompt
                              ▼
┌───────────────────────────────────────────────────────────┐
│  INFERENCE ENGINE  Ollama · vLLM · llama.cpp · TensorRT   │
│  ───────────────────────────────────────────────────────  │
│  Loads weights into VRAM, runs the math:                  │
│    • Tokenize input          • Run forward pass           │
│    • Manage KV cache         • Sample next token          │
│    • Stream tokens back      • Batch concurrent requests  │
└─────────────────────────────┬─────────────────────────────┘
                              │  tensor math on GPU
                              ▼
┌───────────────────────────────────────────────────────────┐
│  MODEL             Llama 3.1 · Qwen 2.5 · Mistral · GPT   │
│  ───────────────────────────────────────────────────────  │
│  Static weight tensors on disk. Given input tokens,       │
│  emits a probability distribution over the next token.    │
│  Data, not code. Knows nothing about you or the world.    │
└───────────────────────────────────────────────────────────┘

Read it bottom-up to see capability growing: raw probabilities become a running computation, become a contextualized conversation, become an actor in the world. Read it top-down to see how your intent gets translated all the way down to tensor multiplies on a GPU.

Most confusion in the AI ecosystem comes from treating these as one thing. They’re not.


Layer 1: The Model

A model is a collection of weight tensors — typically billions of numbers organized across 30–80 transformer layers — plus a small tokenizer that converts text to and from integer token IDs.

It is a static file on disk until someone loads it. You can copy it to a USB drive, checksum it, delete it. It is data, not code. It can’t run any more than a JPEG can display itself.

Given a sequence of tokens as input, its only job is to compute a probability distribution over its entire vocabulary (usually 32,000–128,000 possible tokens) representing how likely each one is to come next.

That’s it. At each step, the model produces probabilities; the inference engine’s sampler picks the actual next token. “Reasoning” models (like OpenAI’s o-series or DeepSeek-R1) complicate this picture slightly — they’re trained to generate long internal chains of thought before answering — but the underlying mechanic is still the same loop: predict probabilities, sample a token, append, repeat.


Layer 2: The Inference Engine

The inference engine is the program that loads the model into memory and actually runs the math. Specifically, it:

  • Loads weights from disk into RAM or VRAM
  • Tokenizes input text into integer IDs
  • Runs the forward pass — billions of multiply-accumulates per token across every layer
  • Manages the KV cache (the running attention state that makes generation fast)
  • Samples a token from the probability distribution the model emits
  • Streams results back to the caller
  • Optionally exposes an HTTP API and batches multiple requests for throughput

The model is the recipe. The engine is what cooks.

The major engines you’ll encounter:

  • llama.cpp — the grandfather of consumer inference. Runs everywhere (CPU, NVIDIA, AMD, Apple Silicon). The foundation under Ollama and LM Studio.
  • Ollama — llama.cpp wrapped in a daemon with a CLI, model library, and HTTP API. The standard for casual local use.
  • vLLM — production-grade serving for open-source models. PagedAttention, continuous batching. What you use when you have many concurrent users.
  • TensorRT-LLM — NVIDIA’s hyper-optimized engine. Squeezes maximum throughput at the cost of setup complexity.
  • MLX — Apple’s native ML framework. Fastest on Apple Silicon thanks to unified memory.
  • TGI — Hugging Face’s serving stack. Powers HF Inference Endpoints.

The same model file can run on multiple engines. The same engine can run many models. They’re independent.


Layer 3: The Orchestrator

The orchestrator sits between the user and the inference engine. It’s responsible for managing the persistent context the model itself lacks.

A raw model has no memory. Every request starts from scratch. The orchestrator’s job is to construct a rich, layered prompt for every turn so the model can behave as if it has memory, personality, and situational awareness.

Things an orchestrator does:

  • Maintains chat history
  • Injects character cards or persona definitions as system prompts
  • Generates and updates automatic summaries of long conversations
  • Performs vector retrieval (RAG) over past messages for semantic memory
  • Pulls in world info or lorebook entries when keywords appear
  • Manages sampler presets (temperature, top-p, etc.)
  • Renders the UI

Examples: Open WebUI (a ChatGPT-like web UI for Ollama), LangChain and LlamaIndex (programmatic frameworks for building orchestrated pipelines), AnythingLLM (desktop app for chatting with your documents).

Crucially: an orchestrator does not take actions in the world. It can’t read your files, run shell commands, or hit external APIs. It just stage-manages the prompt.

This is also called context engineering — increasingly recognized as a distinct discipline from prompt engineering (writing better individual prompts) and from agent design (giving models tools).


Layer 4: The Agent

An agent is everything an orchestrator is, plus the ability for the model to take real-world actions through tools.

The mechanism, made concrete:

  1. The agent injects a system prompt describing available tools and their schemas
  2. The model emits a specially-formatted output recognized as a tool call — typically JSON like {"tool": "read_file", "args": {"path": "/tmp/x.txt"}}
  3. The agent parses this, recognizes it as a tool request rather than user-facing text, and executes the actual action (calls a function, runs a command, hits an API)
  4. The result of the action gets fed back into the conversation as the next message
  5. The model continues with the new information

Tools aren’t a magical ability the model has — they’re a convention between the agent and the model. The model just produces text; the agent recognizes some of that text as tool calls and acts on it.

MCP (Model Context Protocol) is one standardized way to define and serve tools to agents. It’s a protocol, not the only way function calling can work. It’s to tools what HTTP is to web servers — a standard, not a requirement.

Examples of agents: Claude Code, Goose (Block), Cline (VS Code extension), Aider (terminal coding agent), Continue.dev, OpenHands.

The defining property of an agent is that the model’s output can cause real-world side effects. That’s the line.

The agent and orchestrator layers together — the tool loop, the prompt assembly, the execution sandbox, the surrounding scaffolding that turns a raw model into a working system — are often called the harness. When people say “Claude Code is a harness around Claude,” this is what they mean: everything outside the model weights that makes the model usable as an actor. The model is the engine; the harness is the car around it.


The Hidden Components

The four-layer model is the minimum useful taxonomy. A few more components live inside or alongside those layers:

  • Tokenizer — converts text ↔ tokens. Bundled with the model. Different models have different tokenizers.
  • LoRA / fine-tune adapters — small extra weight matrices (10–100 MB) that modify a base model’s behavior without retraining the whole thing.
  • Quantization — a transformation of weights that compresses them from 16-bit floats to 4-bit, 5-bit, or 8-bit integers. Smaller and faster, with a small quality cost.
  • Embedding model — a separate, much smaller model used purely to convert text into vectors. Powers semantic memory in orchestrators and RAG systems.
  • Vector database — Pinecone, Qdrant, Chroma, pgvector, sqlite-vec. Stores embeddings and supports nearest-neighbor search.
  • Caching layer — prompt caching stores intermediate computation for repeated prompt prefixes. Reduces cost and latency.
  • Routing layer — LiteLLM, OpenRouter. Lets one app talk to many models through one interface.
  • Observability layer — Langfuse, Helicone. Logs every model call for debugging and cost tracking.
  • Application / UI layer — what the user actually interacts with. Claude Code is the CLI wrapping the agent. Cursor is an editor wrapping it. ChatGPT is a web UI wrapping its agent.

Here’s the expanded picture, with all of those components slotted into the four-layer stack — plus the hardware they ultimately run on:

┌─────────────────────────────────────────────────────┐
│  Application / UI (Claude Code CLI, Cursor, etc.)   │
├─────────────────────────────────────────────────────┤
│  Agent (tool execution loop)                        │
├─────────────────────────────────────────────────────┤
│  Orchestrator (context, memory, prompt assembly)    │
│    ├── Embedding model  ── ┐                        │
│    └── Vector database  ───┘                        │
├─────────────────────────────────────────────────────┤
│  Routing / caching / observability (production)     │
├─────────────────────────────────────────────────────┤
│  Inference engine (loads weights, runs math)        │
│    ├── Tokenizer                                    │
│    ├── KV cache management                          │
│    └── Sampling                                     │
├─────────────────────────────────────────────────────┤
│  Model (weight tensors + optional LoRA adapters)    │
└─────────────────────────────────────────────────────┘
              ↓ runs on
┌─────────────────────────────────────────────────────┐
│  Hardware (CPU + GPU + memory hierarchy)            │
└─────────────────────────────────────────────────────┘

Every AI product on the market is some subset of this stack. ChatGPT hides everything below the UI from you. Ollama exposes the inference engine and model. Open WebUI adds the orchestrator on top. Claude Code wraps an agent around the whole thing. Same building blocks, different combinations.


Key Concepts You Need to Know

These are the technical terms that appear constantly once you start reading deeper.

Tokens

Tokens are subword pieces, not words. “Unhappiness” might be three tokens: un, happy, ness. The model’s vocabulary contains ~32k–128k unique tokens. Every input and output is a sequence of token IDs. Context length, pricing, and the KV cache are all measured in tokens.

Tensors

A tensor is a multi-dimensional array of numbers — the generalization of a number.

  • A scalar is 0-dimensional: 7
  • A vector is 1-dimensional: [1, 2, 3]
  • A matrix is 2-dimensional: [[1,2],[3,4]]
  • A tensor is the general term for any number of dimensions

Model weights are tensors. Activations flowing through the network are tensors. The KV cache is a tensor. NVIDIA’s Tensor Cores are specialized GPU hardware for multiplying tensors in parallel — the reason inference runs on GPUs, not CPUs.

KV Cache

In a transformer, every token “attends to” every previous token. For each token, the model computes three things: Query (Q), Key (K), and Value (V).

When generating token 501, the model needs Q for token 501, plus K and V for tokens 1–500. The K and V values for previous tokens never change once computed — they’re properties of those tokens.

The KV cache stores those K and V values so they don’t have to be recomputed every step. This trades VRAM for compute: without it, generating each new token would re-do the work for every prior token (O(n) per step, O(n²) total to produce n tokens). With the cache, each new step only does O(1) of the attention work for prior tokens, bringing total generation cost down to O(n). Without it, local LLMs would be unusably slow.

The cache grows linearly with context length. For a 12B model at 32k context, the KV cache alone can occupy 5+ GB of VRAM. This is why doubling your context length roughly doubles VRAM usage even though the model itself doesn’t change.

Quantization

Original model weights are stored as 16-bit floating point numbers (FP16 or BF16). Quantization compresses them to fewer bits — typically 4-bit (Q4), 5-bit (Q5), or 8-bit (Q8) integers. A Q4 quantized model is roughly one quarter the size of the original, with a small but measurable quality loss.

Common quantization tags: Q4_K_M (4-bit, balanced quality), Q5_K_M (5-bit, slightly better), Q8_0 (8-bit, near-original quality). For most use cases, Q4_K_M or Q5_K_M is the sweet spot.

Embeddings

An embedding is a fixed-size vector of numbers (typically 384, 768, or 1024 dimensions) that represents the meaning of a piece of text. Sentences with similar meanings get vectors close together; sentences with different meanings get vectors far apart.

Embeddings are produced by small, specialized models (nomic-embed-text, mxbai-embed-large, OpenAI’s text-embedding-3-*). They power semantic search, RAG, and most “AI memory” features. Every modern AI system that “remembers” you uses embeddings under the hood.

GQA (Grouped Query Attention)

Modern models like Llama 3, Mistral Nemo, and Qwen 2.5 use Grouped Query Attention — many query heads share fewer K and V heads. This dramatically shrinks the KV cache (often 4× smaller) and is the reason mid-sized models can support long contexts on consumer GPUs.


The Mental Model in One Page

A way to remember the layers:

  • Model = “what to think” (the trained reasoning)
  • Engine = “how to think” (the computation that produces tokens)
  • Orchestrator = “what to think about” (context, memory, persona)
  • Agent = “what to do” (actions in the world)

A useful test for distinguishing them:

  • Has tools that affect the world? → agent
  • Has rich context management but no tools? → orchestrator
  • Just a UI on top of a model? → frontend
  • Just the API? → raw inference

Most “AI agents” in marketing materials are actually orchestrators. Most “AI assistants” are frontends. Real agents — the kind that can read your files, run commands, and modify the world — are the harder, rarer category, because giving a model the ability to act requires solving safety, reliability, and reasoning problems that orchestration sidesteps.


Why This Matters

Once you internalize the four-layer model, every AI product you encounter slots into it cleanly:

  • Claude Code = agent (Claude Code CLI) + orchestrator (built-in context management) + engine (Anthropic’s internal serving stack) + model (Claude)
  • ChatGPT = frontend (chatgpt.com) + orchestrator (memory, custom instructions) + engine (OpenAI’s stack) + model (GPT)
  • Cursor = frontend (editor UI) + agent (tool loop over your codebase) + orchestrator (context retrieval, file indexing) + engine (a mix of Anthropic, OpenAI, and self-hosted serving) + model (Claude, GPT, or your choice)
  • An Open WebUI + Ollama setup at home = exactly the same architecture as a hosted assistant, just at hobbyist scale and entirely on your own hardware

The architecture is identical across consumer products, hobbyist setups, and frontier labs. What varies is the scale, the polish, and which components are proprietary.

This is why local LLMs are worth learning even if you mostly use cloud models. The mental model transfers entirely. Once you’ve watched Ollama load a model, seen an orchestrator construct a prompt from scratch, and felt the difference an agent loop makes — every AI product becomes a familiar pattern of components you already understand.

That’s the whole prize: not knowing more about AI in the abstract, but being able to look at any AI product and immediately see what it’s made of.


Reference compiled from a week of hands-on experimentation with local inference engines, open-weight models, and a series of conversations debugging my own setup. The architecture described here is the same whether you’re running a 14B model on a gaming PC or serving frontier models to millions of users — only the implementation details change.

Leave a Reply

Your email address will not be published. Required fields are marked *