The Artifact Is Not True Until It Runs
A Field Note on AI-Augmented Engineering, Validation, and the Work Between Specification and Reality
An AI model can generate code. What recently felt like a novelty in software engineering is quickly becoming an accepted, even mundane, norm. The more interesting, and still overlooked, questions arise before and after the code appears.
A plausible artifact is not the same thing as a working artifact. A well-formatted README is not the same thing as operational knowledge. A second model agreeing with the first is not the same thing as proof. In the age of agents, the work of the engineer does not disappear. It moves. Less of the work may be typing the first draft of the code. More of the work becomes deciding what is true.
This is a field note from one small piece of technical work: a bounded containerization task that began with a specification, moved through AI-augmented design and review, and ended only after the resulting artifact was built, run, and tested outside the models' sandboxes.
The task itself was not large. Serve a small static web application from NGINX inside a Docker container. Use Ubuntu 24.04 as the base image. Serve the application over HTTPS. Provide the code and documentation needed to build and run it locally.
On paper, that is a simple assignment. In practice, even a small assignment opens a decision surface far larger than its specification suggests.
What is the actual application payload? What is historical deployment scaffolding? Where should TLS terminate in this constrained local scenario, and how would that differ in production? Is a self-signed certificate acceptable? Should certificate material be generated at build time or runtime? Should NGINX run as root? Which ports should it bind? What does “works” mean when the application embeds a third-party iframe in the browser? Furthermore, how does one validate the generated code?
Those questions became the work.
Specification Before Generation
The first useful step was not asking an AI model to write a Dockerfile. It was reading the specification carefully enough to know what the Dockerfile had to prove.
The stated constraints mattered. The task did not ask for Kubernetes, ECS, Terraform, CloudFront, an Application Load Balancer, or a production deployment pattern. It asked for a standalone Docker container that could be built and run locally. That boundary shaped the solution.
It would have been easy to reach for a production-shaped answer: terminate TLS at a load balancer, use a certificate from AWS Certificate Manager, forward to the container over an internal HTTP listener. In a real AWS production environment, that is the pattern I would usually prefer. But it was not the task. The specification asked the container itself to serve HTTPS.
That meant TLS had to terminate inside the container for the purposes of this task.
The first lesson was simple: AI can generate quickly, but the human still has to establish what “correct” means.
Payload, Cruft, and the Archaeology of Repos
The source application looked the way many inherited repositories look: a small core payload surrounded by historical sediment.
There was an HTML file, a JavaScript asset, and a CSS file. There was also a package manifest, a Heroku Procfile, and configuration intended for an older deployment path. Some of those files were useful context. Some were not part of the runtime required by the assignment.
This is normal. Real repositories are rarely clean teaching examples. They carry prior hosting models, abandoned assumptions, comments from earlier deployment systems, and files whose continued relevance is not obvious until someone checks.
A model can summarize the contents of a repo. It can suggest that a Node app should be built as a Node app. It can also over-preserve historical scaffolding because it lacks the organizational judgment to know which bones matter.
In this case, the application was static. NGINX could serve the required files directly. Node.js was unnecessary. Heroku-specific files were excluded from the served web root. The resulting container served the application, not the repository’s history.
That distinction matters. One of the engineer’s roles in AI-augmented work is to sift payload from sediment.
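A first pass at that sifting can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the task: the file names are hypothetical stand-ins for the kinds of files described above, and the suffix rule is an assumption about what counts as browser-facing payload.

```python
from pathlib import PurePosixPath

# Assumption: "payload" here means files a browser requests directly.
STATIC_SUFFIXES = {".html", ".js", ".css"}

def split_payload(paths):
    """Partition repository paths into (payload, sediment) by file suffix."""
    payload, sediment = [], []
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        (payload if suffix in STATIC_SUFFIXES else sediment).append(path)
    return sorted(payload), sorted(sediment)

# Hypothetical file names; the real repository's names are not given here.
repo_files = ["index.html", "app.js", "style.css", "package.json", "Procfile"]
payload, sediment = split_payload(repo_files)
print("serve:  ", payload)
print("exclude:", sediment)
```

A suffix rule cannot make the decision by itself. A server-side .js file, for example, would land in the payload list. The output is a candidate list for the engineer to review, which is the point of the paragraph above.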
AI as Reviewer, Not Oracle
The implementation was developed through structured conversation with multiple AI tools. The models helped generate options, explain tradeoffs, draft documentation, and review assumptions. That was useful. It was also not sufficient.
At several points, model-augmented review surfaced issues that deserved attention:
- The Dockerfile assumed a service user that did not match Ubuntu’s packaging convention.
- An NGINX directive was valid in some versions but not appropriate for the version shipped with Ubuntu 24.04.
- Documentation claims needed to be tightened so they matched what the container actually did.
- Certificate material should not be baked into a reusable image layer.
- Listening on unprivileged ports was necessary for non-root operation, but it did not by itself prove that the process ran as a non-root user.
Those were real catches. They improved the artifact.
But the models did not own the answer. A model can point at a possible issue. It cannot relieve the engineer of deciding whether the issue is real, whether the fix is appropriate, or whether the final behavior has been proven.
The useful pattern was not “ask an AI to verify the AI.” The useful pattern was “use AI to widen the review surface, then ground the result in evidence.”
That distinction is easy to lose. Two models can agree with each other and still be wrong. A polished explanation can still rest on a false assumption. A confident correction can still be irrelevant to the actual environment.
AI review is valuable. It is not ground truth.
The Three Validation Layers
The work became clearer when viewed as three layers of validation.
1. Internal Consistency
The first layer asks whether the artifact agrees with itself.
Does the README describe what the Dockerfile actually does? Do the documented ports match the NGINX config? Does the repository layout match the tree shown in the documentation? Are the verification commands testing the behavior the document claims to provide?
This is the layer where AI review is especially useful. Models are good at reading across text, spotting contradictions, and asking whether a claim has been implemented. They can notice when a README promises non-root execution but the Dockerfile does not enforce it. They can notice when a design note says certificates are generated at runtime but the Dockerfile bakes them into the image.
Internal consistency is necessary. It is not enough.
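One of those questions, whether the documented ports match the NGINX config, is mechanical enough to sketch. The following is an illustrative check under stated assumptions: the port numbers (8080 and 8443), the image name static-site, and the config and README snippets are all hypothetical, and the regular expressions cover only the simple `listen` and `-p host:container` forms shown.

```python
import re

def nginx_listen_ports(conf_text):
    """Return the set of ports named in NGINX `listen` directives."""
    ports = set()
    for match in re.finditer(r"^\s*listen\s+([^;]+);", conf_text, re.MULTILINE):
        # The first token is the address/port; later tokens are flags such as `ssl`.
        target = match.group(1).split()[0]
        port = target.rsplit(":", 1)[-1]
        if port.isdigit():
            ports.add(int(port))
    return ports

def readme_container_ports(readme_text):
    """Return container-side ports from `-p host:container` flags in a README."""
    return {int(m.group(1)) for m in re.finditer(r"-p\s+\d+:(\d+)", readme_text)}

# Hypothetical config and README text, for illustration only.
conf = """
server {
    listen 8080;
    listen [::]:8080;
    return 301 https://$host$request_uri;
}
server {
    listen 8443 ssl;
}
"""
readme = "docker run --rm -p 8080:8080 -p 8443:8443 static-site"

# Ports that appear on one side but not the other.
mismatch = nginx_listen_ports(conf) ^ readme_container_ports(readme)
print("inconsistent ports:", sorted(mismatch))
```

An empty result says only that the two documents agree with each other. It says nothing about whether either one is right, which is why this layer is not enough.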
2. Environmental Validity
The second layer asks whether the artifact works in the environment it claims to target.
Ubuntu 24.04 is not an abstraction. It has actual package versions, actual default users, actual filesystem paths, and actual behavior. NGINX inside that image is not generic NGINX; it is the version installed from that distribution’s package repository.
This is where plausible generated output often breaks. A Dockerfile can look correct and fail because a user does not exist. An NGINX config can look modern and fail because the installed package version does not support a directive. A certificate path can look reasonable and fail because the non-root process cannot read or write the expected files.
The only reliable answer is to build the image and run the configuration against the actual runtime.
In this case, the container was built and run in a Linux environment. NGINX validated its configuration with nginx -t. HTTPS returned 200 OK. HTTP redirected to HTTPS. Static JavaScript and CSS assets loaded. Process inspection showed that the NGINX master and worker processes ran as the non-root user expected by the implementation.
That moved the artifact from plausible to environmentally valid.
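The process inspection step can be turned into an explicit check rather than a glance at a terminal. The sketch below parses a two-column process listing (user, then command line) and reports who owns the NGINX processes. The sample text is illustrative and was not captured from the actual run. The user name www-data is an assumption, chosen because it is the account Ubuntu’s NGINX package conventionally uses; the implementation’s actual user is not named here.

```python
def nginx_process_owners(ps_output):
    """Return the set of users owning nginx processes in a `user command` listing."""
    owners = set()
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        user, _, command = line.strip().partition(" ")
        # NGINX titles its processes "nginx: master process ..." and
        # "nginx: worker process".
        if command.lstrip().startswith("nginx:"):
            owners.add(user)
    return owners

def runs_without_root(ps_output):
    """True only if nginx processes exist and none of them is owned by root."""
    owners = nginx_process_owners(ps_output)
    return bool(owners) and "root" not in owners

# Illustrative listing in the shape produced by `ps -eo user,args`.
sample = """\
USER     COMMAND
www-data nginx: master process nginx -g daemon off;
www-data nginx: worker process
www-data ps -eo user,args
"""
print(nginx_process_owners(sample), runs_without_root(sample))
```

The check deliberately fails when no NGINX process is found. An empty process list is not evidence of non-root execution; it is evidence that nothing was measured.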
3. Grounded Behavioral Validity
The third layer asks whether the artifact actually does the thing the user or system needs it to do.
This is the layer most resistant to simulated verification.
For a static site, curl returning 200 OK might be enough. For this application, it was not. The application embedded a third-party iframe. The real success criterion was not merely that index.html loaded. It was that the application rendered in a browser, that the third-party iframe loaded, that Content Security Policy did not block it, and that the UI behaved like an interactive application rather than a static page.
That required a browser test.
The container was run on an actual EC2 instance. The page was opened over HTTPS. The self-signed certificate warning was accepted. The form fields rendered. Client-side validation responded. The artifact behaved as intended outside the AI sandbox.
Only then was it reasonable to say the work was done.
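A scripted pre-check can narrow what to look for before the browser test, though it cannot replace it. The sketch below reads a Content-Security-Policy header value and asks whether the directive governing frames would permit a given origin. It is deliberately simplified: it handles only exact origin matches, the `*` wildcard, and the frame-src, child-src, default-src fallback order, and it ignores host wildcards, schemes, and paths. The origin https://widget.example.com is hypothetical.

```python
def parse_csp(header_value):
    """Split a Content-Security-Policy value into {directive: [sources]}."""
    policy = {}
    for part in header_value.split(";"):
        tokens = part.split()
        if tokens:
            # When a directive is repeated, the first occurrence is the one used.
            policy.setdefault(tokens[0].lower(), tokens[1:])
    return policy

def frame_sources(policy):
    """Return the source list that governs iframes, or None if none applies."""
    for name in ("frame-src", "child-src", "default-src"):
        if name in policy:
            return policy[name]
    return None

def may_frame(header_value, origin):
    """Simplified check: would this policy allow an iframe from `origin`?"""
    sources = frame_sources(parse_csp(header_value))
    if sources is None:
        return True  # no governing directive, so CSP does not restrict frames
    return "*" in sources or origin in sources

# Hypothetical third-party origin and policy, for illustration only.
widget = "https://widget.example.com"
print(may_frame("default-src 'self'; frame-src https://widget.example.com", widget))
print(may_frame("default-src 'self'", widget))
```

A passing result here is still a prediction about browser behavior, not an observation of it. The browser remains the authority on whether the frame actually renders.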
What the Human in the Loop Actually Does
“Human in the loop” is too vague to be useful by itself.
Sometimes it means a skilled operator supervising automation with context, authority, and clear intervention points. Sometimes it means a human rubber-stamping plausible output produced by a machine. Sometimes it means a person is present mainly so culpability... erm, accountability has somewhere to land when the system fails.
The phrase often imagines the human as safely positioned outside the machine: watching the system, evaluating its output, and intervening when necessary. In practice, the human is frequently implicated in the same consequences as everyone else. They are responsible for the outcome, exposed to its downstream effects, and asked to review work at a speed set by the machine.
That matters because the highest-value human work is not usually performed while the model is generating. It happens before and after: before, in the specification, scope, constraints, risks, and acceptance criteria; after, in validation against the real environment. Under workplace time pressure, those bookends are exactly what get compressed. The organization keeps the visible act of generation and cuts the quieter work that makes the result trustworthy.
In AI-augmented engineering, the human in the loop should not be ceremonial. The human’s job is not to watch an agent produce output and then admire it. The human’s job is to establish trust.
That means:
- understanding the specification;
- identifying assumptions;
- deciding what is in scope;
- distinguishing payload from historical scaffolding;
- recognizing when generated output is plausible but unproven;
- designing validation paths;
- running the artifact in a real environment;
- deciding whether the result is good enough;
- documenting the tradeoffs so the next reader knows what was intentional.
This is not glamorous work. It is also not optional.
Agentic engineering does not eliminate human responsibility. It relocates responsibility to the parts of the process most likely to be compressed.
The more capable the tools become, the more important this work becomes. AI can produce more artifacts faster than humans can manually inspect. That does not reduce the need for validation. It increases the cost of not having a validation discipline.
Specification Is Not Enough
Clear specifications matter. Agents build what they are asked to build, and ambiguous instructions invite machine assumptions. Better specifications produce better output.
But specifications are still descriptions of desired reality. They are not reality.
A specification can say “serve over HTTPS.” The artifact still has to establish TLS successfully. A README can say “runs as non-root.” The process table still has to prove it. A design note can say “the application works.” The browser still has to render the embedded third-party iframe.
This is where a purely agentic view of engineering can become too abstract. If the future of software work is treated as “specification goes in, working software comes out,” then the phrase “working software” has to carry a lot of weight. Working where? Under what runtime? Against which dependencies? Under what browser policy? With which credentials, certificates, network rules, and operational constraints?
The gap between the specification and the world is where engineering still happens.
Brownfield Is the Normal Case
This small task was greenfield only in the narrowest sense. The container solution was new. The source application was inherited. It carried prior deployment assumptions. It had to be interpreted before it could be packaged.
That is more representative of enterprise engineering than a clean-room example would be.
Most real systems are brownfield systems. They contain fragmented village knowledge, partial documentation, historical deployment models, fragile interfaces, and tests that cover only part of the behavior. Agentic tools can help navigate that terrain, but they cannot make the terrain disappear.
In brownfield work, one of the first jobs is specification recovery: discovering what the system actually does, what parts are still live, what assumptions are obsolete, and where the source of truth lives. AI can accelerate that investigation. It can summarize. It can compare. It can suggest likely patterns. But the recovered specification still has to be validated against the running system.
Otherwise, we are not recovering truth. We are generating a better-looking myth.
Documentation as an Operating Surface
The final README for this task was longer than the Dockerfile. That may seem disproportionate until one remembers what the documentation was doing.
It was not merely explaining how to run the container. It was recording the decision trail:
- why the app was treated as static;
- why Node.js was not included;
- why Heroku-specific files were excluded;
- why Ubuntu 24.04 was used instead of an official NGINX image;
- why TLS terminated inside the container for the work sample;
- why the self-signed certificate was generated at runtime;
- why NGINX listened on unprivileged ports;
- what was different about production deployment;
- what verification steps proved the result.
This is not decorative documentation. It is the operational surface of the artifact.
In AI-augmented work, this kind of documentation becomes more important, not less. Generated output can look obvious after the fact. The decision trail explains why the final artifact is shaped the way it is and which alternatives were considered but not chosen.
That paper trail is part of how teams build trust.
The Commit History Matters
The repository was delivered as a mirrored Git repo rather than a zip of files. That preserved the commit history.
This mattered because the history told a story:
- start with assumptions and design decisions;
- vendor the static application payload;
- add NGINX configuration;
- add an entrypoint for runtime certificate generation;
- add the Dockerfile;
- add ignore files and build hygiene;
- fix issues found in review;
- clean up documentation.
A useful commit history is not just a mechanical record of file changes. It is an account of intent. It lets a reviewer see how the work evolved and where assumptions were corrected.
When producing a polished final artifact becomes cheaper, the evidence of that process becomes even more valuable. The question is not only “what did you submit?” It is also “how did you get there, and did the process make the artifact more trustworthy?”
What This Suggests
AI-augmented engineering is not primarily a prompting problem. It is a validation problem.
Prompts matter. Specifications matter. Tooling matters. But the durable practice is the loop:
- Read the specification carefully.
- Establish scope.
- Inspect the inherited system.
- Identify assumptions.
- Use AI to generate options and challenge reasoning.
- Implement the smallest artifact that satisfies the actual requirements.
- Validate internal consistency.
- Validate against the target environment.
- Validate behavior in the real world.
- Document the tradeoffs.
That loop is portable. It can be used with a chat model, an agentic IDE, a human pair programmer, or no AI at all. The tools change the speed and shape of the work. They do not remove the need for the loop.
The more powerful the agent, the more necessary the loop becomes.
Closing
The artifact is not true because an AI generated it.
It is not true because a second AI reviewed it.
It is not true because the README is thorough.
It becomes trustworthy when its claims are tied to evidence: build output, runtime behavior, process state, network responses, browser behavior, and documentation that names the limits of what has been proven.
That is the work between specification and reality.
That is where the engineer still stands.