---
slug: why-ai-code-gets-worse-over-time
title: "Why Your AI Code Gets Worse: It's the Context, Not the Model"
excerpt: "Your AI coding tool seems to get worse the longer you work. It's not the model failing. It's context decay, and a one-page spec is what fixes it."
primaryKeyword: "AI code drift"
publishedAt: 2026-04-15
readingTimeMin: 7
author: "Robert Boylan"
tags:
  - ai-code-drift
  - context-window
  - ai-coding-tools
  - prd
  - indie-dev
---

It's prompt nine. You've been building for forty minutes. Cursor is confidently writing a "share to Twitter" button for an app that was supposed to be a private journal. Two prompts ago, your login page stopped using the theme colors. You haven't touched the login page in half an hour, and you're starting to wonder if Claude is just having a bad day.

It isn't. The model is fine.

The thing that's getting worse is the context. What you're hitting has a name. Call it **AI code drift**: the slow, predictable way an AI coding session falls apart the longer it runs. It isn't a bug. It isn't bad luck. And it doesn't get fixed by switching to a smarter model.

This post is for anyone who's felt their Cursor, Claude Code, Lovable, or v0 session quietly decay over the course of a weekend build. Also for the non-coders watching over a shoulder who want to know why the same tool that was amazing ten minutes ago suddenly seems clueless. Short version: the picture of the app in the AI's head has been slowly rotting, and nobody pointed at a canonical version to re-ground on. That's what a spec is for.

## The pattern: prompt four is when the drift starts

Anyone who's built more than two small apps with an AI coding tool knows the shape. The first two or three prompts go beautifully. You describe the app. The AI generates something close to the thing you pictured. You correct a small detail. It lands.

Then somewhere around prompt four or five, something breaks. A feature you introduced at prompt two is gone. A styling choice from prompt three has been quietly rewritten. A new feature has appeared that you never asked for. The next three prompts are spent rebuilding what used to exist. Your tone in the chat gets a little terser.

By prompt ten, you're mostly in repair mode. The feature you actually wanted to add next has been postponed while you babysit the app's memory of itself. This is [the prompt-four drift pattern almost every vibe-coder hits](/blog/vibe-coding-why-planning-matters), and it isn't random. It's the predictable result of how these tools actually work under the hood.

## It's not the model. It's the context you handed it.

Every AI coding tool runs on a **context window**: the amount of information it can hold in its head at once. (For the non-coders: think of it as the AI's short-term memory. A bigger window means it can see more at once. But "can see more" isn't the same as "can remember more.")

That context is a mix of:

- the conversation so far, prompt by prompt
- the files currently in your project
- system instructions from the tool itself
- whatever scratch notes you pasted in at the start

The model is reasoning over that whole soup on every turn. Model quality decides _how well_ it reasons. The context decides _what_ it reasons about. If the context is clean and structured, even a mid-sized model does well. If the context is a pile of half-described edits, contradictory requests, and forty-five minutes of chat, even a frontier model gets confused. It's still doing a careful job. It's just doing it over a bad map.
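To make "the whole soup" concrete, here's a deliberately simplified sketch of how a coding tool might assemble its context on each turn. None of this is any real tool's internals; the function names, the join order, and the 4-characters-per-token heuristic are all assumptions for illustration:

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4

def build_context(system_prompt, project_files, chat_history, budget_tokens=200_000):
    """Concatenate everything the model reasons over this turn."""
    chat_history = list(chat_history)  # don't mutate the caller's list
    parts = [system_prompt, *project_files, *chat_history]
    context = "\n\n".join(parts)

    # When the soup outgrows the window, tools typically drop or
    # summarize the OLDEST chat first -- which is exactly where your
    # prompt-two decisions live.
    while estimate_tokens(context) > budget_tokens and len(chat_history) > 1:
        chat_history.pop(0)
        context = "\n\n".join([system_prompt, *project_files, *chat_history])
    return context
```

The detail that matters is the eviction order: when the window fills up, the earliest messages go first. Your original description of the app is the part most likely to fall out.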

The frustrating part is that the tools make this invisible. You see a chat window. You don't see the map.

## Three kinds of decay that compound as you work

Three separate things rot the context, and they stack on top of each other.

**Scope creep.** You add features as you go. "Oh, wouldn't it be nice if…" Each one expands the picture. By prompt six, the app in your head is bigger than the app you described at prompt one, and the AI is working off some mix of both versions.

**Plan decay.** Ideas you mentioned once at prompt two get buried under thirty newer messages by prompt eight. The AI technically still "has" them in the context window. In practice, older content tends to get less weight than recent content. What you said at the start matters less the further you go.

**Silent contradictions.** You say "make it dark mode" at prompt three. You say "lighten the hero section" at prompt seven. To you, those are two small adjustments. To the AI, they're a conflict. When it resolves the conflict on prompt eight, it picks one and quietly abandons the other. You find out at prompt ten, when you notice the dark mode is gone.

None of these are the model's fault. All of them compound. Every new prompt adds noise, every new generation reasons over noisier input, and nobody ever cleared the record. The result feels like the AI is getting worse. What's actually happening is the shared picture of the app has been slowly disintegrating, and the AI is reasoning accurately about the degraded version.

## Why "just use a bigger context window" doesn't fix this

People on the Cursor subreddit hit this wall and conclude the answer is a model with more context. Claude with a million-token window. Gemini 1.5 Pro with two million. Bigger window, less forgetting, right?

Not really.

A bigger context window means the AI _can_ look at more. It doesn't decide _what matters_ in what it sees. A million tokens of conflicting, unstructured chat is still a million tokens of noise. The model has to guess which pieces of that noise describe your real intent, and guessing gets harder the noisier the input.

Worse: research on long-context models consistently shows that content in the _middle_ of the context gets weighted less reliably than content at the beginning or end. There's a well-cited paper on it called [Lost in the Middle](https://arxiv.org/abs/2307.03172), and the finding has held up across model generations. More tokens in the window doesn't guarantee better recall. Sometimes it guarantees worse.

A bigger window helps at the margins. It doesn't fix the underlying problem, which is that the AI has no single canonical description of your app it can re-ground on. That isn't a model problem. It's a _source of truth_ problem.

## The spec as anchor: the one source of truth you can both re-ground on

The fix is boring, and it works precisely because it's boring: put the description somewhere both of you can keep coming back to.

A [good spec (a one-page brief, not a thirty-page enterprise document)](/what-is-a-prd) does three specific things for context hygiene:

1.  **It outranks the chat.** When you start every new session by pasting the spec back in, or keeping it as a pinned file the tool always loads first, you're telling the AI "this is the canonical version. The chat is scratch work." That resets the map on every turn.
2.  **It names the non-goals.** The single biggest cause of silent scope creep is that nobody wrote down what you're _not_ building. A "not building: social sharing, multi-user, map API" bullet list takes thirty seconds to write and blocks an entire class of drift before it happens.
3.  **It records the opinions.** Dark mode, warm yellows, not corporate. Supabase, not Firebase. No backend, local storage is fine. The AI has opinions. You have opinions. If yours aren't written down, the AI's opinions win by default, and they win one small choice at a time.
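For a sense of scale, here's what all three can look like together. The app details below are invented for illustration, riffing on the private-journal example from the intro:

```markdown
## Journal app — spec (v1)

**Who it's for:** just me. Private, single-user.

**Top features (in order):**
1. Daily entry with autosave
2. Calendar view of past entries
3. Full-text search

**Not building:** social sharing, multi-user accounts, cloud sync.

**Opinions:** dark mode by default, warm accent colors, not corporate.
No backend — local storage is fine.
```

That's the whole thing. Ten lines, and it rules out the "share to Twitter" button before the AI ever thinks of it.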

The spec becomes the thing both you and the tool can point at when the picture drifts. "Is this still matching the spec?" is a question the AI can answer concretely. "Is this still matching what you meant forty-five minutes ago?" is not.

This is the gap Draftlytic is built around. You describe the idea, answer a few pointed questions (who it's for, top three features in order, what you're _not_ building, what you feel strongly about), and come out with a spec that pastes straight into [Cursor, Claude Code, Lovable, v0, or whichever coding tool you build in](/blog/ai-prd-generators-compared). The spec becomes the anchor. The chat becomes scratch work, not the source of truth.

## The takeaway

The next time your AI code feels like it's getting worse, it probably isn't. The model you started with is the model you still have. What's changed is the picture of the app it's reasoning from, and that picture has been quietly falling apart since roughly prompt four.

You can't fix AI code drift by upgrading the model. You fix it by writing down, once, what you're actually building and keeping that description next to the chat. Most people skip that step because the idea is already in their head. Holding the idea intact across a forty-minute session, against a model that weighs recent text more than old text, is a different job. That's the job Draftlytic is built to handle. You describe it once. It stays put.
