The Product Manager’s New Operating System

What a Year of Building With LLMs Taught Me About the Future

I have spent the last year working with frontier tools to understand what they change about how product managers, product leaders, and software teams operate. Not at the level of headlines or demos. At the level of daily work.

So I started building.

I built visual experiments, 3D animations, a Michigan logo in motion, a celebration cake, publishing websites, games, native app concepts, and strategy tools. I built utilities to analyze my own AI workflows, an evaluation tool for conversational discovery, a travel journal for a photography trip, and a promotion evaluation workflow for one of the most human, judgment-heavy processes leaders run.

The connective tissue was one question: what happens when the distance between intent and artifact collapses?

For years, product work maintained a predictable separation between thinking and making. A PM developed a point of view, wrote a strategy document, aligned stakeholders, waited for implementation, and reacted once the thing existed. That cycle was useful, but its slowness allowed weak thinking to hide. A vague idea could survive inside a polished document because nobody had confronted what it meant in software.

LLMs make it possible to test that thinking earlier. A sketch becomes an interface. A workflow problem becomes a tool. A strategy becomes something interactive. That does not make the PM less important. It makes the PM’s judgment more exposed.

Learning the new loop

My first visual projects helped me feel the loop directly. I could describe an experience, generate code, inspect it, reject it, redirect it, and keep moving. A 3D logo or celebration cake may sound lightweight next to a production platform, but the feedback is immediate. You can see whether the idea works, whether the model understood you, and where taste still matters.

The model can generate and propose. It can translate a rough idea into something visible. It does not know whether the result fits the audience, brand, user need, or strategic intent unless you bring that judgment to the work.

The first phase was learning to direct the machine. The next was learning to collaborate with it.

As the projects became more ambitious, the work looked less like prompting and more like product development. I was making decisions about architecture, state, onboarding, navigation, performance, packaging, and release quality. I was learning how to keep a system coherent while moving at a pace that would have been unrealistic before.

That pace changes the texture of product work. Instead of waiting for one formal prototype, a PM can explore several interaction models and reject weak directions while the idea is still cheap. But speed does not remove the need for coordination. Each change can introduce inconsistency: a screen that no longer matches the navigation, a feature that ignores an earlier constraint, or a technically complete flow that still feels wrong.

The work became a continuous act of maintaining intent across a growing artifact. That is product management in a more direct form, not product management disappearing.

A strong PM has always needed clarity. Now that clarity can be tested. If I cannot describe the product well enough for an agent to start building it, that tells me something. If the agent moves in the wrong direction, that tells me something. If I can produce a prototype but cannot explain the customer value, that tells me something too.

AI does not only reveal the model’s limitations. It reveals the human’s ambiguity.

The 3D Connect-K project made that concrete. A board game seems simple until you define rules, game states, win conditions, interactions, player modes, difficulty, feedback, configuration, persistence, and platform choices. Web versus native iOS changes the interaction model, distribution, performance expectations, and product feel.

Building it helped me understand how PMs can cross from product definition into product materialization. That does not mean replacing engineering. It means bringing more concrete thinking, better questions, and earlier evidence into the conversation.

A PM who has touched the implementation can ask different questions. Which state needs to persist? What happens when the user changes the board size after starting? Which rules belong in shared logic rather than the interface? Where does a native platform create expectations the web version does not? The purpose is not to become the only builder. It is to make product intent more specific before the team commits to it.
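To make "rules in shared logic" concrete, here is a minimal sketch of a win check that knows nothing about rendering or platform. It is illustrative only, not the project's code: the board representation (a dictionary keyed by `(x, y, z)`), the function name `has_win`, and its parameters are all assumptions.

```python
from itertools import product

# The 13 unique line directions in a 3D grid: the 26 neighbor offsets,
# halved so each line is checked once rather than once per direction.
DIRECTIONS = [d for d in product((-1, 0, 1), repeat=3) if d > (0, 0, 0)]


def has_win(board, last_move, k):
    """Return True if the piece at last_move sits in a run of k or more.

    board maps (x, y, z) -> player id; empty cells are simply absent,
    so walking off the edge of the board ends a run naturally.
    """
    player = board[last_move]
    for dx, dy, dz in DIRECTIONS:
        run = 1  # the piece just placed
        for sign in (1, -1):  # walk both ways along the line
            x, y, z = last_move
            while True:
                x, y, z = x + sign * dx, y + sign * dy, z + sign * dz
                if board.get((x, y, z)) != player:
                    break
                run += 1
        if run >= k:
            return True
    return False


# Hypothetical position: three "X" pieces along a space diagonal.
board = {(0, 0, 0): "X", (1, 1, 1): "X", (2, 2, 2): "X"}
print(has_win(board, (2, 2, 2), k=3))
```

Because the function takes only a board, a move, and `k`, the same rule could in principle sit behind a web interface and a native one. That is the kind of boundary the questions above are probing.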

But faster materialization creates a new risk: a growing pile of prototypes can be mistaken for progress, because producing artifacts is no longer the hard part.

If every idea can become a prototype, the bottleneck becomes judgment.

Human, AI, human

Documents exposed the same problem in a different form. It is easy to generate a document that sounds right. It is much harder to produce one that is right. That is the difference between fluency and thinking.

The useful pattern I kept returning to was human, AI, human.

The human sets direction: what problem are we solving, what are the constraints, what does good look like, and what would change our mind?

AI expands the middle. It drafts, compares, synthesizes, generates variants, critiques, builds, refactors, tests, and makes the work visible.

Then the human owns the finish. Is this true? Useful? Specific? Coherent? Defensible? Is it the thing we should actually do?

That final step is not editing. It is ownership.

Figure: The human-AI-human workflow loop across direction, expansion, and ownership. The PM sets direction, uses AI to expand the middle, and owns the final judgment.

Making intent inspectable

Codex Log Viewer grew from my need to understand how I was actually working with AI. Which projects consumed the largest sessions? Which prompts repeated? Was my effort going to implementation, research, planning, verification, or release?

The project became more important when I realized that natural-language development creates a new kind of product record. Code tells you what changed. The conversation tells you why the human thought the change should exist.

When software is built through natural language, prompt history becomes part of the product record.

If a feature behaves incorrectly, I want to know whether the human asked for the wrong thing, the model misunderstood, the requirement changed, verification failed, or implementation drifted. The conversation becomes evidence and an audit trail.

Project Focus extended that idea by classifying the prompts I was sending. Was I asking AI to think or only to execute? Was I asking for enough verification? Were approvals becoming casual? Was I creating a healthy loop of direction, generation, inspection, and correction?

The classification is useful because it can expose imbalance. A workflow can look productive because it contains a great deal of implementation while underinvesting in research, evaluation, and review. Faster execution does not automatically create a healthier operating model. It can simply make the existing bias toward output more efficient.
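A minimal sketch of that kind of classification follows. The five categories come from the text above; the keyword lists, the first-match rule, and the function names `classify_prompt` and `focus_share` are illustrative assumptions, not a description of how Project Focus actually works.

```python
import re
from collections import Counter

# Hypothetical keyword lists. Order matters: the first category with a
# matching keyword wins, so "Add a check" counts as verification.
CATEGORY_KEYWORDS = {
    "verification": ("test", "verify", "check", "review"),
    "research": ("compare", "investigate", "why", "explore"),
    "planning": ("plan", "outline", "roadmap", "scope"),
    "release": ("release", "deploy", "package", "publish"),
    "implementation": ("add", "build", "fix", "implement", "refactor"),
}


def classify_prompt(prompt):
    """Return the first category that has a keyword among the prompt's words."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & set(keywords):
            return category
    return "unclassified"


def focus_share(prompts):
    """Return each category's share of the prompts, as a fraction of 1."""
    counts = Counter(classify_prompt(p) for p in prompts)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Hypothetical session: three implementation prompts, one verification prompt.
session = [
    "Add a settings screen",
    "Fix the navigation bug",
    "Implement board resizing",
    "Verify the win check on a larger board",
]
print(focus_share(session))
```

Keyword matching is crude, and a real classifier would need something more robust. Even so, a simple share per category is enough to show when verification and research are being crowded out by implementation.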

That is a different measurement question from how quickly a document was drafted or how much code an agent generated. The deeper questions are whether assumptions are clearer, evidence is easier to inspect, failure modes are easier to find, and humans remain accountable for the right parts of the process.

Figure: A scorecard contrasting output metrics with operating-model measures such as questions, evidence, failure modes, and accountability. The deeper measurement question is whether the AI-assisted workflow is becoming more reviewable, evidence-based, and accountable.

Better surfaces for judgment

The promotion evaluation project tested the pattern in a high-stakes setting involving people’s careers, evidence, fairness, and leadership accountability.

The design principle was simple: AI should not decide who gets promoted.

The workflow used AI to improve the surface on which humans exercised judgment. Promotion packets are uneven. Managers write differently; evidence can be clear, buried, vague, or inferred. The system helped create consistent briefs, extract claims, organize evidence, surface risks, generate panel questions, and track notes.

That consistency matters because panels are otherwise comparing two things at once: the candidate’s evidence and the manager’s ability to present it. A standardized brief does not eliminate judgment, but it makes differences in source quality and support easier to see. It also gives reviewers a shared place to record disagreement instead of allowing a fluent narrative to settle the question prematurely.

The model was not judging a person. It was classifying the support behind a claim: directly verified, cross-referenced, inferred, or unsupported. That distinction matters because a confident summary can conceal very different levels of evidence.
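One way to represent that distinction is as a small data structure. The four support levels are the ones named above; the class names, the `support_profile` summary, and the rule for flagging panel questions are assumptions made for illustration. The sketch records labels that were assigned elsewhere and does not perform the classification itself.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Support(Enum):
    # The four levels named in the text.
    DIRECTLY_VERIFIED = "directly verified"
    CROSS_REFERENCED = "cross-referenced"
    INFERRED = "inferred"
    UNSUPPORTED = "unsupported"


@dataclass
class Claim:
    text: str
    support: Support


def support_profile(claims):
    """Count claims per support level, keeping empty levels visible as 0."""
    counts = Counter(claim.support for claim in claims)
    return {level.value: counts[level] for level in Support}


def needs_panel_question(claim):
    """Assumed rule: weakly supported claims are raised with the human panel."""
    return claim.support in (Support.INFERRED, Support.UNSUPPORTED)


# Placeholder claims, invented for this example.
claims = [
    Claim("Led the platform migration", Support.DIRECTLY_VERIFIED),
    Claim("Improved team onboarding", Support.INFERRED),
]
print(support_profile(claims))
print([c.text for c in claims if needs_panel_question(c)])
```

The profile describes the evidence, not the person. What to do about an inferred or unsupported claim remains the panel's call.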

Often the goal is not to remove the human. It is to give the human a better surface for judgment.

Conversational search evaluation brought the same lesson into a product domain closer to my daily work. An AI discovery system can fail while sounding fluent. It can violate a media-type constraint, miss a required title, invent support for a recommendation, or generate repetitive follow-ups. It can look excellent in a demo and still fail as a product.

The evaluation framework therefore separated correctness from quality. Correctness asked whether the response satisfied constraints and preserved evidence. Quality asked whether the results were relevant, well ordered, specific, grounded, diverse, and coherent.

That distinction prevented a common demo trap. A response can sound excellent while recommending the wrong media type or inventing a detail. It can also be factually valid and still fail because the best result is buried, the explanation is generic, or every follow-up narrows toward the same titles. Those failures should not share one score because they point to different parts of the product.
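The separation can be sketched as two result channels that are never folded into one number. Everything below is a hypothetical illustration: the `Result` fields, the `evaluate` signature, and the example media types are assumptions, and only a few of the checks described above are shown.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    title: str
    media_type: str
    grounded: bool  # True if the stated reason traces to catalog evidence


@dataclass
class Evaluation:
    correctness_failures: list = field(default_factory=list)
    quality_notes: list = field(default_factory=list)

    @property
    def correct(self):
        return not self.correctness_failures


def evaluate(results, required_media_type, required_title=None, best_title=None):
    """Check hard constraints first, then record quality observations apart."""
    ev = Evaluation()
    # Correctness: constraints and evidence, pass or fail.
    for r in results:
        if r.media_type != required_media_type:
            ev.correctness_failures.append(f"wrong media type: {r.title}")
        if not r.grounded:
            ev.correctness_failures.append(f"ungrounded support: {r.title}")
    titles = [r.title for r in results]
    if required_title and required_title not in titles:
        ev.correctness_failures.append(f"missing required title: {required_title}")
    # Quality: kept separate so it cannot mask a correctness failure.
    if best_title in titles and titles.index(best_title) > 0:
        rank = titles.index(best_title) + 1
        ev.quality_notes.append(f"best result buried at rank {rank}")
    return ev


# Hypothetical response to a request for movies.
response = [Result("A", "movie", True), Result("B", "book", True)]
ev = evaluate(response, required_media_type="movie", best_title="B")
print(ev.correct, ev.correctness_failures, ev.quality_notes)
```

Keeping two lists instead of one score means a fluent but wrong response shows up as a correctness failure, while a valid but poorly ordered one shows up as a quality note.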

It also had to evaluate paths rather than isolated answers. A seed query opens branches; each follow-up can continue for several turns. Repetition, drift, and narrowing may only appear over time.

A first response may look diverse, for example, while each suggested follow-up leads back to the same corner of the catalog. The user experiences a conversation; the team cannot evaluate only the opening turn.
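A path-level check for that kind of narrowing might look like the sketch below. The representation (a path as a list of turns, each turn a list of titles) and the function name `path_repetition` are assumptions for illustration. It measures only repetition, not drift or constraint violations.

```python
def path_repetition(path):
    """For each turn after the first, return the share of titles already shown.

    path is a list of turns; each turn is a list of recommended titles.
    A share near 1.0 means the follow-up mostly repeated earlier results.
    """
    seen = set(path[0]) if path else set()
    shares = []
    for turn in path[1:]:
        repeated = sum(1 for title in turn if title in seen)
        shares.append(repeated / len(turn) if turn else 0.0)
        seen.update(turn)
    return shares


# Hypothetical path: a diverse opening turn, then follow-ups that narrow.
path = [
    ["A", "B", "C", "D"],
    ["A", "B", "E", "F"],
    ["A", "B", "E", "F"],
]
print(path_repetition(path))
```

Scored on its own, the opening turn in this example looks fine. The problem appears only when the turns are read in sequence.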

AI-native discovery products have to be evaluated as paths, not only responses.

Figure: An evaluation board showing a seed query branching into follow-up paths, then correctness and quality gates. AI-native discovery products need evaluation across paths because constraints, grounding, drift, and trust compound over multiple turns.

This work requires PMs to think in systems. Retrieval, ranking, metadata, generated language, policy, latency, grounding, feedback, and trust all interact. The product is not only what appears on the screen. It is the behavior of the system.

That does not mean every PM becomes an engineer or data scientist. It means PMs need enough fluency to ask better questions, understand the evidence, and own the tradeoffs.

The travel journal project reinforced the point from a personal angle. It turned a photography trip (locations, light, timing, field context, and creative intent) into a structured product I could actually use. The value was not a generic itinerary. It was an artifact shaped by context, constraints, taste, and iteration.

That project had no enterprise workflow or ranking model behind it, but it still required product judgment. The sequence of days had to support the photographic objective. Guidance had to be useful in the field rather than merely comprehensive. The final experience had to make the right information available at the moment I needed it. AI helped expand the material; it did not decide what would make the trip meaningful.

Build better loops

Across these projects, one idea kept surviving contact with the work: AI is not just a productivity layer. It is becoming an operating layer.

Artifacts are cheaper now. Documents, prototypes, summaries, dashboards, code, tests, and evaluation rubrics can all be produced faster. The differentiator moves upstream and downstream. Upstream: did you frame the problem clearly? Downstream: can you determine whether the output is true, useful, grounded, and worth acting on?

This increases the premium on product taste and evaluation. When producing another version costs little, the hard decision is which version deserves to survive. When synthesis sounds polished by default, leaders need stronger standards for evidence. When a prototype appears in hours, teams need the discipline to ask whether it represents a customer problem worth solving.

The advantage comes from building better loops:

from intent to artifact
from artifact to evidence
from evidence to judgment
from judgment to product improvement

That is what I was trying to understand by building: not from the sidelines, but through the cleanup, misunderstandings, false starts, and verification work.

You have to feel how quickly an idea becomes real and how easily the model misunderstands it. You have to feel the distance between a plausible answer and a grounded one. You have to feel how much your own clarity matters.

The future will not belong to PMs who produce the most AI-generated artifacts. It will belong to PMs who build better loops from intent to evidence, judgment, and product improvement.

That is the operating system I have been building for myself—and the one I think product teams will need next.

This writing reflects my personal perspectives on product management, AI, and content discovery. It does not represent the official position of my employer or any affiliated organization.