Writing Software in 2026
In software, as in all things, history rhymes

I’ve been writing software in one form or another for a few decades. I know what it feels like to wrestle with a compiler, to chase a race condition across a distributed system at 2am, to experience the specific despair of a segfault with no useful stack trace. I thought I knew the shape of this work.
But over the past few months, everything became strange in the best possible way.
The challenges of writing software in 2026 rhyme with everything I know, but nothing is quite the same. Let me try to map the territory.
Prompts are just a new programming language
We went from punch cards and machine code to assembly to low-level languages to high-level languages. Each step was an abstraction—a way of expressing intent without needing to think too much about the layer below. Prompts are the next step on that staircase. Nothing new here, except the abstraction is now close enough to natural language that everyone thinks it’s not programming.
Actually, it still is. Prompt engineering matters less than it did a year or two ago, because the models have matured, but several intense weeks of building AI applications have taught me that it isn't dead yet. It matters a great deal how I instruct my agents to accomplish things, for the same reason it matters how I brief a human teammate. This abstraction has syntax (structure and order matter). It has semantics (words mean specific things to specific models in specific contexts). It has edge cases and bugs.
But here’s the deeper reframe: a prompt isn’t the sort of code we’re used to. It’s a specification that a nondeterministic system tries to satisfy. A prompt is closer to a SQL query than a Python script: you declare what you want and let the engine figure out how to get there. The difference is that your database always returns the right rows, while your model returns its best guess.
The entire skill of “prompt engineering” is really specification engineering—being as minimal as possible, while also being precise enough that the probabilistic machine lands in the right zone. This is why long prompts full of caveats feel wrong: you’re writing a spec, not a script, and specs should specify tightly what matters, and leave unspecified what doesn’t. In other words: short, precise prompts for well-understood tasks (trust the model) and detailed, structured prompts for novel or high-stakes workflows (constrain the space).
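To make that concrete, here is a sketch in Python. Both prompts are invented for illustration: the same task specified two ways, plus a deliberately naive stand-in metric for how tightly each constrains the output space.

```python
# Two ways to specify the same (hypothetical) task.

# A tight spec for a well-understood task: state what matters, nothing else.
TIGHT_SPEC = "Summarize the following bug report in one sentence."

# A constrained spec for a high-stakes workflow: format, scope, and failure
# behavior are pinned down because the output feeds a downstream parser.
CONSTRAINED_SPEC = """\
Summarize the following bug report.
Output exactly one JSON object: {"summary": "...", "severity": "low|medium|high"}.
Do not include markdown. If the report is not actually a bug,
set severity to "low" and say so in the summary.
"""

def specification_weight(prompt: str) -> int:
    """Crude proxy for how constrained a prompt is: count non-empty lines,
    treating each as roughly one explicit constraint."""
    return sum(1 for line in prompt.splitlines() if line.strip())
```

The point is not the metric; it is that the second prompt spends its extra length pinning down only the things the task actually requires, and nothing else.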
LLMs are probabilistic approximators. You can’t enumerate the constraint space. Understanding this was a huge breakthrough for me. Successful prompt engineering looks less like writing a formal spec and more like shaping a distribution—which is a fundamentally different skill. Sometimes you shape it by adding constraints. Sometimes you shape it by removing them and letting the mode of the distribution do the work.
The real skill is knowing how much specification a given task needs—and that judgment is what separates good prompt engineers from great ones.
Context windows are the new 640KB
“640KB of memory ought to be enough for anybody.” We laugh at this now (Bill Gates denies ever saying it, but the laugh survives). I wonder how we will feel in ten years about the people arguing that 200K tokens is plenty for everyone, always.
But context windows aren’t just the new 640KB—they’re the new malloc/free. Working within a context window feels like embedded systems programming crossed with manual memory management. You’re intensely aware of what’s in scope. You curate aggressively. You compress. You build external memory systems—files, databases, embeddings, retrieval layers—because the working memory is precious and finite. We’ve had garbage collectors for decades. Now we’re back to hand-managing memory, except the “memory” is conversation history and the “leak” is when the agent forgets your architecture decisions from 40,000 tokens ago.
And like real memory, context has structure—and the structure matters enormously.
A context window isn’t a flat buffer. It’s segmented: system prompts, personality instructions, tool definitions, retrieved documents, conversation history, user input. Each segment has different volatility, different trust levels, different costs. This is strikingly similar to how computer memory evolved from flat address spaces into layered hierarchies—registers, L1/L2 cache, heap, stack, virtual memory, disk—each with different access patterns and performance characteristics.
It matters a great deal where you put things. Context isn’t all created equal. Decades of systems research went into figuring out where to put what. Cache-line alignment. Memory-mapped I/O. Virtual memory paging. The entire field of database buffer management. The lesson was always the same: how you structure memory matters as much as how much you have.
We’re relearning this now. System prompts are like firmware—loaded once, always resident, high-trust. Tool definitions are like a vtable—a callable structure the model can dispatch into. Retrieved context is like a page fault—pulled in on demand, maybe stale, maybe not what you needed. Conversation history is like a stack that only grows and never pops, until you hit the limit and start losing frames from the bottom.
Nobody who programs agents thinks about context as “just text.” It’s an architecture problem: what goes in the system prompt vs. what’s retrieved at runtime? What’s injected always vs. on-demand? What gets summarized, what gets preserved verbatim? These are the same questions systems programmers asked about memory layout for decades. We mostly forgot these questions because they’re mostly solved problems, but here we are asking the same questions all over again.
The job of the effective AI-era programmer is at least 40% context architecture. Someday this won’t be true, but for now, it is. Context is your most constrained resource. Treat it accordingly.
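One way to make “context architecture” concrete is to assemble the window segment by segment under an explicit token budget, keeping the essential tiers resident and dropping the droppable ones. A minimal sketch; the segment names, priorities, and the four-characters-per-token heuristic are my assumptions, not any particular framework’s API.

```python
# Assemble a context window from segments with different priorities.
# Essential segments (system prompt, tool definitions) stay resident;
# low-priority ones (retrieved docs) fill whatever budget remains.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def assemble_context(segments: list[tuple[int, str]], budget: int) -> str:
    """segments: (priority, text) pairs; lower number = more essential.
    Drops low-priority segments that would overflow the budget."""
    chosen, used = set(), 0
    # First pass: decide what fits, most essential first.
    for idx, (priority, text) in sorted(enumerate(segments), key=lambda s: s[1][0]):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            chosen.add(idx)
            used += cost
    # Second pass: emit in original order so the prompt reads coherently.
    return "\n\n".join(text for i, (_, text) in enumerate(segments) if i in chosen)

segments = [
    (0, "SYSTEM: You are a code-review agent."),       # firmware: always resident
    (1, "TOOLS: run_tests, read_file"),                # vtable: always resident
    (3, "RETRIEVED: " + "background docs " * 500),     # page-fault tier: droppable
    (2, "HISTORY: user asked for a review of auth.py"),# recent conversation
]
ctx = assemble_context(segments, budget=100)
```

With a 100-token budget, the resident tiers survive and the bulky retrieved tier is dropped, which is exactly the failure you want: graceful, and in the right place.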
Multiple agents are a distributed systems problem
Running one agent is hard enough. Running multiple agents that coordinate on shared goals, with shared state they might simultaneously modify, is a distributed systems problem with extra, nondeterministic chaos sprinkled on top. You get all the classics: race conditions, stale reads, conflicting writes, inconsistent views of state. Except instead of debugging with logs and traces, you’re reading conversation histories and trying to figure out why two agents reached opposite conclusions from the same facts.
The coordination patterns are actually familiar. A typical setup—separate workspaces per agent, a shared coordination channel—is literally the actor model: message-passing concurrency with isolated state. Erlang solved this in the 1980s. The patterns are the same; the substrate is new. Agents are threads. Workspaces are process isolation. The difference is that your “messages” carry much richer semantic content than Erlang tuples, and your “processes” reason rather than execute deterministic code.
The good news: the coordination primitives work. The bad news: agents are much worse at following coordination rules than computers are. They improvise, which is simultaneously their greatest strength and their greatest weakness.
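The actor-model shape can be sketched with plain threads and queues. The agent below is a deterministic stub standing in for an LLM call; what’s real is the skeleton: one private inbox per agent, and message passing instead of shared mutable state.

```python
import queue
import threading

# Actor-model skeleton: each agent owns a private inbox (its only shared
# surface) and never touches another agent's state directly.

def agent(name: str, inbox: queue.Queue, outbox: queue.Queue) -> None:
    while True:
        msg = inbox.get()
        if msg is None:          # shutdown sentinel
            break
        # A real agent would call a model here; this stub just annotates.
        outbox.put(f"{name} handled: {msg}")

coordinator_inbox: queue.Queue = queue.Queue()
researcher_inbox: queue.Queue = queue.Queue()

t = threading.Thread(target=agent, args=("researcher", researcher_inbox, coordinator_inbox))
t.start()

researcher_inbox.put("summarize the design doc")
researcher_inbox.put(None)  # ask the actor to stop
t.join()

result = coordinator_inbox.get()
```

Swap the stub for a model call and the skeleton doesn’t change, which is the whole point: the coordination layer stays deterministic even when the workers aren’t.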
Skills are libraries. The model is the CPU.
Every mature programming ecosystem has libraries—reusable, composable units of functionality you reach for instead of rebuilding from scratch. Skills are that, for agents. “import numpy” is now “read SKILL.md.” The packaging is different but the idea is identical. Good skills have clear interfaces, handle edge cases, and fail loudly when misused. Context stuffing—RAG, memory files, skill loading—is the new compilation step. You’re assembling relevant context into working memory before execution. That’s a build step. We just haven’t named it yet.
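Treating context stuffing as a build step suggests writing it like one. A sketch under assumed conventions (one SKILL.md per skill directory; real skill systems differ): resolve the skills a task needs, read each one, and link them into a single context blob, failing loudly on a missing skill the way a compiler fails on a missing import.

```python
import pathlib
import tempfile

# "Compile" an agent's working context: resolve skills, read each SKILL.md,
# and link them into one blob -- the same shape as resolving imports
# before execution.

def load_skills(skill_dir: pathlib.Path, wanted: list[str]) -> str:
    parts = []
    for name in wanted:
        path = skill_dir / name / "SKILL.md"
        if not path.exists():
            # Fail loudly, like a missing import, not silently at runtime.
            raise FileNotFoundError(f"unknown skill: {name}")
        parts.append(path.read_text())
    return "\n\n".join(parts)

# Demo with a throwaway skill directory.
root = pathlib.Path(tempfile.mkdtemp())
(root / "pdf").mkdir()
(root / "pdf" / "SKILL.md").write_text("# pdf\nHow to extract text from PDFs.")
ctx = load_skills(root, ["pdf"])
```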
And if skills are libraries, the model is the CPU. CPUs have instruction sets, clock speeds, thermal limits, and ISA compatibility problems. Models have capabilities, token throughput, context limits, and API compatibility problems. Switching from GPT-4 to Claude is the new porting from x86 to ARM—sometimes it just works, sometimes everything breaks in subtle ways. OpenAI is the new Intel—dominant, fast-moving, and being watched nervously by everyone building on top. We remember what happened when the Wintel monopoly calcified. And “OpenAI is the new Intel” implies someone is the new AMD—which is exactly what’s happening.
Config files are back
We’ve also reinvented /etc/. AGENTS.md, SOUL.md, TOOLS.md — these are declarative personality and behavior configuration files that shape a runtime you don’t directly control. The same lessons apply: version-control them, diff them carefully, don’t let them drift. The only difference is that your config files now describe a personality rather than a daemon. The runtime has opinions. (Daemons with personality?)
Determinism vs. nondeterminism — this is the big one
Here’s where I keep getting stuck, and I think it’s the most important thing to understand about this new kind of programming.
Nearly all modern software is deterministic. Given the same inputs, you get the same outputs. This is the foundation of debugging, testing, and reasoning about software behavior. It’s so fundamental that we barely even think about it.
AI is not this. Ask your agent to do precisely the same thing ten times and you’ll get ten different results. In this respect it’s closer to quantum computing than classical computing: you get probabilistic outputs and have to design around that at the systems level.
The natural response is to write prompts the way medieval scribes wrote legal documents: elaborate, exhaustive, hedge-everything-twice. You add caveats. You anticipate misinterpretations. You include positive and negative examples. You write “do NOT include markdown headers” because you learned this the hard way. You’re essentially writing defensive code, except the compiler has moods.
The deeper consequence is that we need new tools. Evals instead of unit tests—probabilistic, run-many-times, assess the distribution of outcomes rather than any single output. And hybrid architectures: nondeterministic agents that delegate to deterministic scripts for anything where correctness matters. I do this constantly. The agent orchestrates; a shell script does the actual thing. Strange loops everywhere.
This is the real architecture pattern of 2026: deterministic orchestration of nondeterministic components. A cron job (deterministic) triggers an LLM call (nondeterministic) that runs a script (deterministic) that hands its output to an LLM call (nondeterministic). It’s the same pattern as Monte Carlo tree search, or stochastic gradient descent wrapped in a training loop. The industry just hasn’t named it yet.
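A minimal sketch of that chain, with stubs standing in for the LLM calls (the stubs, their noise, and the guard are invented for illustration):

```python
import random

# The 2026 pattern: deterministic steps orchestrating nondeterministic ones.
# Stubs stand in for the LLM calls; the "script" step is plain Python.

random.seed(1)  # pin the demo so it runs the same way every time

def llm_draft(topic: str) -> str:
    """Nondeterministic: a stub model that sometimes adds noise."""
    noise = " (probably)" if random.random() < 0.5 else ""
    return f"notes on {topic}{noise}"

def script_normalize(text: str) -> str:
    """Deterministic: the 'shell script that does the actual thing'."""
    return text.replace(" (probably)", "").strip()

def llm_title(text: str) -> str:
    """Nondeterministic step consuming the deterministic step's output."""
    return text.title()

def nightly_job(topic: str) -> str:
    """Deterministic orchestrator: fixed order, checked handoffs."""
    draft = llm_draft(topic)           # nondeterministic
    clean = script_normalize(draft)    # deterministic
    assert "(probably)" not in clean   # deterministic guard on the handoff
    return llm_title(clean)            # nondeterministic

result = nightly_job("context windows")
```

The orchestrator never reasons; it sequences, checks, and hands off. All the judgment lives in the stubs, and all the guarantees live in the scaffolding around them.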
Hallucinations are the new undefined behavior
C gave you segfaults from dangling pointers. LLMs give you confident, plausible wrong answers. Same failure mode: the system doesn’t know it’s broken, and there’s no way within the system to discover this or address it. No stack trace. No line number. Good luck.
Same mitigation pattern, too. In C, you build sanitizers, assertions, and code review layers around the thing you can’t fully trust. With LLMs, you build evals, verification steps, and human review loops. The entire field of “AI evals” is just unit testing for nondeterministic systems—we just haven’t fully admitted that to ourselves yet.
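An eval in this sense is a small harness: run the nondeterministic component many times and assert on the pass rate, never on a single sample. A sketch with a stub model; the stub and the threshold are invented.

```python
import random

# Evals vs unit tests: run the nondeterministic component many times
# and assert on the distribution, not on any single output.

def stub_model() -> str:
    """Stand-in for an LLM call: right most of the time, not always."""
    return "4" if random.random() < 0.9 else "5"

def eval_pass_rate(runs: int = 200) -> float:
    random.seed(42)  # pin the harness itself so the eval is reproducible
    passes = sum(1 for _ in range(runs) if stub_model() == "4")
    return passes / runs

rate = eval_pass_rate()
# A unit test would demand rate == 1.0; an eval asks for "good enough."
assert 0.8 <= rate <= 1.0
```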
History doesn’t repeat. But it rhymes.
Here is the thing about working with AI in 2026: almost every problem we’re encountering, we’ve encountered before. Everything old is new again, and the present rhymes with the past, but by and large we haven’t realized this yet. We’re just encountering it in a new substrate, with new vocabulary. The field has a habit of forgetting its own lessons between paradigm shifts. Let’s try not to forget this time.
Caching hierarchies. CPUs have L1/L2/L3 cache, RAM, and disk—each layer bigger, slower, cheaper. Decades of hardware design teach one rule: design your hierarchy deliberately. Your context architecture should have the same shape: system prompt (L1, always hot), recent conversation (L2), documents retrieved on demand (L3), cold storage behind a search layer. Most people and most harnesses today dump everything into one tier and wonder why performance degrades. RAG isn’t a novel idea. It’s demand paging for language models.
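The demand-paging analogy maps almost line for line onto code. A sketch, with the corpus and eviction policy invented for illustration: a fixed resident set of documents in context, page faults out to the store, and least-recently-used eviction.

```python
from collections import OrderedDict

# RAG as demand paging: a fixed "resident set" of documents in context,
# paged in on demand, with least-recently-used eviction.

class PagedContext:
    def __init__(self, store: dict[str, str], resident_limit: int):
        self.store = store             # "disk": the full document corpus
        self.resident = OrderedDict()  # "RAM": what's actually in context
        self.limit = resident_limit
        self.faults = 0

    def read(self, doc_id: str) -> str:
        if doc_id in self.resident:
            self.resident.move_to_end(doc_id)  # mark recently used
            return self.resident[doc_id]
        self.faults += 1                       # page fault: go to the store
        if len(self.resident) >= self.limit:
            self.resident.popitem(last=False)  # evict least recently used
        self.resident[doc_id] = self.store[doc_id]
        return self.resident[doc_id]

docs = {"a": "arch notes", "b": "build guide", "c": "changelog"}
ctx = PagedContext(docs, resident_limit=2)
ctx.read("a"); ctx.read("b"); ctx.read("a"); ctx.read("c")  # evicts "b"
```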
Interface contracts. The single most durable lesson from software engineering: program to interfaces, not implementations. An agent that takes opaque inputs and returns unspecified outputs is a function with no type signature—it’ll break and you won’t know why. Define what each agent accepts and what it must return. Write it down. The codebases that survived were the ones with clean APIs. The agent pipelines that’ll survive will be the same.
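What a “type signature” for an agent step might look like, sketched with dataclasses (the review contract itself is invented): typed input, typed output, and a validator at the boundary that fails loudly.

```python
from dataclasses import dataclass

# An interface contract for an agent step: typed input, typed output,
# and a validator at the boundary.

@dataclass
class ReviewRequest:
    diff: str

@dataclass
class ReviewResult:
    verdict: str        # must be "approve" or "request_changes"
    comments: list[str]

def validate_result(raw: dict) -> ReviewResult:
    """Parse an agent's raw output against the contract; fail loudly."""
    if raw.get("verdict") not in ("approve", "request_changes"):
        raise ValueError(f"contract violation: {raw!r}")
    return ReviewResult(verdict=raw["verdict"],
                        comments=list(raw.get("comments", [])))

ok = validate_result({"verdict": "approve", "comments": ["nit: rename x"]})
```

The validator is the whole trick: the model can say anything, but nothing unvalidated crosses the boundary into the rest of the pipeline.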
Observability. We spent twenty years learning that reading logs doesn’t scale, then built structured logging, distributed tracing, and APM. Agent systems in 2026 are back at printf—reading conversation histories, guessing at causality, unable to answer “why did you do that?” without hallucinating. The lesson: invest in observability before you need it, not after. Traces across agent calls. Structured evals as monitoring. This is the next infrastructure wave and it maps directly to what we built for microservices.
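The printf-to-tracing jump can start small. A sketch of structured spans around agent calls (the span fields and the stub steps are my own choices): every call records what ran, with what inputs, for how long.

```python
import functools
import time

# Minimal structured tracing for agent calls: every call gets a span
# recording name, inputs, and duration -- answering "why did you do that?"
# from data instead of rereading transcripts.

TRACE: list[dict] = []

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE.append({
            "span": fn.__name__,
            "args": args,
            "ms": (time.perf_counter() - start) * 1000,
        })
        return result
    return wrapper

@traced
def plan(task: str) -> str:
    return f"plan for {task}"

@traced
def execute(plan_text: str) -> str:
    return f"did: {plan_text}"

out = execute(plan("fix the login bug"))
```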
Graceful degradation. Robust systems don’t crash on failure, they degrade. Circuit breakers, retries with backoff, fallback paths. An agent system that propagates hallucinated output is a system with no error handling. Design for failure as the common case: if the model returns garbage, catch it deterministically, fall back, and don’t let the garbage propagate downstream. The patterns are already named. We just have to use them.
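The same patterns, sketched: retry with backoff around a flaky step, a deterministic validity check, and a degraded-but-safe fallback instead of propagating garbage (the stub model and its failure mode are invented).

```python
import random
import time

# Graceful degradation: retry a flaky step with backoff, then fall back
# deterministically instead of propagating garbage downstream.

def flaky_model(task: str) -> str:
    """Stub for an LLM call that sometimes returns garbage."""
    return task.upper() if random.random() < 0.6 else "GARBAGE"

def degrade_gracefully(task: str, retries: int = 3) -> str:
    delay = 0.01
    for _ in range(retries):
        out = flaky_model(task)
        if out != "GARBAGE":       # deterministic validity check
            return out
        time.sleep(delay)          # backoff before retrying
        delay *= 2
    return f"[unprocessed] {task}"  # degraded but safe fallback

random.seed(3)  # pin the demo for reproducibility
ok = degrade_gracefully("summarize")
```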
Version control as institutional memory. Git solved “what changed and why” for code. We have no equivalent for agent behavior. When you tweak SOUL.md and your agent starts behaving differently, there’s no “git bisect” for personality drift. There’s no diff on a prompt that tells you which line caused the regression. Treat agent configuration with the same rigor as production code: version it, diff it, review changes, roll it back when something breaks. This isn’t optional hygiene. It’s the difference between a system you can reason about and one you can’t.
Worse is better. Unix beat Multics. Simple, composable, imperfect tools beat elegant monoliths—every time, at scale, over time. A pipeline of small specialized agents and agentic tools with limited context, a limited toolset, and clean handoffs will outperform one omniscient mega-agent. This is the Unix philosophy applied to AI. Narrow the scope. Define the interface. Compose. The teams that resist the temptation to build the one tool that does everything will build the systems that last.
Every lesson that made us better at managing complexity in software—caching, interfaces, observability, graceful degradation, composability—applies directly here. The vocabulary changes. The engineering doesn’t. The teams that recognize this fastest will build the most reliable AI systems of the next decade.
What this all teaches us
We’re shifting from writing programs to writing constraints on programs. The programmer is no longer the one doing the computation—the model is. The programmer is now the one who specifies what good computation looks like, catches it when it goes wrong, and builds the deterministic scaffolding around the nondeterministic core.
This is a role change, not a job loss. And it requires a different skill set than we’ve been rewarding. The best AI-era programmers I know are not the ones who are most impressed by agents. They’re the ones who understand memory, concurrency, distributed systems, and language—in other words, all the old ideas—and who can see where the new problems rhyme with the old ones.
The history of computing is a history of rediscovering the same truths at a higher level of abstraction. The AI era is just the next floor of that building.
The tools are new. The problems are ancient. It’s time to get good at both.

The “deterministic orchestration wrapping nondeterministic components” framing is exactly where I landed after a migration I just finished. I had built a custom task manager for my AI agent over two months, tightly coupled to a FastAPI layer I owned entirely. I replaced it with a self-hosted open-source kanban board and a 94-line dispatcher shim. The shim is the deterministic layer: it translates legacy script calls into kanban API calls with no agent-visible change.
The agents stayed nondeterministic; the orchestration stayed predictable. I didn’t plan it that way: the architecture emerged from the backward-compatibility constraint. The history-rhymes framing makes me think this is just systems design with AI-shaped inputs, and that the old lessons mostly still hold.