June 6, 2026
The Product Manager’s New Operating System
AI-native product work rewards PMs who can turn intent into artifacts, inspect evidence, and own judgment across faster operating loops.
By Cristiano Pierry

What a Year of Building With LLMs Taught Me About the Future
I have spent the last year working with frontier tools to understand what they were really going to change about how product managers, product leaders, and software teams operate. Not at the level of headlines. Not at the level of demos. At the level of daily work.
So I started building.
I built visual experiments. I built 3D animations. I built a Michigan logo in motion. I built a celebration cake. I built publishing websites. I built strategy documents. I built games. I built prototypes. I built native app concepts. I built utilities to analyze my own AI workflows. I built evaluation tools for conversational discovery. I built a travel journal and itinerary system for a photography trip. I built a promotion evaluation workflow for one of the most human, judgment-heavy processes leaders have to run.
At first glance, that may sound like a disconnected set of projects. It was not. The connective tissue was the same question every time: what happens when the distance between intent and artifact collapses? That is the shift I think many product teams still underestimate.
For years, product work has had a fairly predictable separation between thinking and making. A PM would develop a point of view, write a strategy document, align stakeholders, work with design, work with engineering, wait for implementation, react to the thing once it existed, then iterate. The cycle was useful, but it was also slow. Because it was slow, a lot of weak thinking had time to hide. A vague idea could sit in a polished document. A questionable requirement could survive several meetings. A strategy could sound convincing before anyone had to confront what it actually meant in software.
LLMs make it possible to move from an idea to a working artifact much faster. A sketch can become an interface. A product thought can become a prototype. A workflow problem can become a tool. A strategy can become something interactive. A vague concern can become an evaluation framework. That does not make the PM less important. It makes the PM’s judgment more exposed.
When I started with visual projects, the value was not only that AI helped me create things I would not have created as quickly on my own. The value was that I could feel the new loop. I could describe an experience, generate code, inspect it, reject it, redirect it, improve it, and keep moving. A 3D logo, a celebration cake, or an interactive animation may seem lightweight compared to a production platform, but they are useful because the feedback is immediate. You can see whether the idea works. You can see whether the tool understood you. You can see where taste still matters.
The model can generate. It can propose. It can fill in gaps. It can create movement. It can translate a rough idea into something visible. But it does not know whether the thing is good in the way a product leader needs to know. It does not know whether it fits the moment, the audience, the brand, the user need, or the strategic intent unless you bring that judgment to the work.
The first phase of my experimentation was about learning how to direct the machine. The second phase was about learning how to collaborate with it.
That is where the projects became more ambitious. I moved from visual artifacts into full product surfaces: games, publishing flows, native app concepts, strategy tools, and prototypes. The work started to look less like prompting and more like product development. I was making tradeoffs about architecture, state management, onboarding, navigation, error handling, performance, packaging, and release quality. I was not just asking the model to “make something.” I was learning how to keep a system coherent while moving at a pace that would have been unrealistic before.
A strong PM has always needed to create clarity. But now clarity can be tested much earlier. If I cannot describe the product clearly enough for an AI agent to start building it, that tells me something. If the agent takes the work in the wrong direction, that tells me something. If I can generate a prototype quickly but cannot explain how it creates customer value, that tells me something too.
AI does not only reveal the model’s limitations. It reveals the human’s ambiguity. That is uncomfortable.
The 3D Connect-K project was a good example. It started as something visual and playful, but the deeper I went, the more it became a real product exercise. A board game is simple only at the surface. Once you start building it, you have to define rules, game states, win conditions, interactions, player modes, difficulty, visual feedback, configuration, persistence, and eventually platform choices. Web versus native iOS is not just a technical distinction. It changes the interaction model, distribution model, performance expectations, and product feel.
That project helped me understand that LLMs make it easier for PMs to cross the boundary from product definition into product materialization.
That does not mean the PM replaces engineering. It means the PM can bring much more concrete thinking into the conversation. A PM can show a working version of an idea. A PM can explore edge cases before asking for a roadmap commitment. A PM can understand implementation complexity by touching it. A PM can ask better questions because they have gone further into the system.
But as the projects became more serious, another issue became obvious. Building faster can make the wrong work more dangerous. If every idea can become a prototype, the bottleneck is no longer artifact production. The bottleneck becomes judgment.
That is when my work started moving from artifacts to workflows. I used LLMs to create strategy documents, product narratives, planning frameworks, and implementation plans. This work also exposed a trap. It is very easy to generate a document that sounds right. It is much harder to generate a document that is right. That’s the difference between fluency and thinking.
That difference shaped one of the core patterns I kept returning to: human, AI, human.
The human sets the direction. What problem are we solving? What are the constraints? What is the strategy? What does good look like? What would change our mind?
AI expands the middle. It drafts, compares, synthesizes, generates variants, critiques, builds, refactors, tests, and makes the work visible.
Then the human owns the finish. Is this true? Is this useful? Is this specific? Is this coherent? Is this defensible? Is this the thing we should actually do?
That final step is not editing. It is ownership. Once I saw that pattern clearly, I started building tools around the workflow itself.

Codex Log Viewer came from that need. At first, it was practical. I wanted to understand my own usage of Codex across projects. Which projects was I working on? How many messages was I sending? Which prompts were repeated? How much work was happening in each repository? Where were the large sessions? What patterns could I see in my own behavior?
But the project became more important than that. When software is built through natural language, the prompt history becomes part of the product record. The code tells you what changed. The conversation tells you why the human thought the change should exist. That is a profound shift. In an AI-assisted workflow, intent is no longer trapped in meetings, scattered notes, or private memory. It can become inspectable.
That has implications for product management, engineering leadership, and governance.
If a feature behaves incorrectly, I want to know whether the human asked for the wrong thing, whether the model misunderstood, whether the requirement changed, whether verification failed, or whether the implementation drifted. The prompt history helps answer that. It becomes evidence. It becomes an audit trail. It becomes a way to understand how work actually happened.
That led me to another layer: evaluating the workflow itself.
In Codex Log Viewer, I built Project Focus evaluation tooling to classify the kinds of prompts I was sending. Was I asking for feature design? Implementation? Bug fixes? Testing and verification? Research? Documentation? Planning? Data analysis? Code review? Release work? Git operations? Short approvals?
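A minimal sketch of what this kind of prompt classification could look like, assuming a simple keyword tagger. The keyword lists, the function names `classify_prompt` and `summarize`, and the sample prompts are all hypothetical and exist only for illustration; this is not the actual Project Focus implementation, which may well use a model rather than keywords.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical keyword rules. Order matters: the first matching category wins.
CATEGORY_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "bug_fix": ("fix", "bug", "broken", "crash"),
    "testing_verification": ("test", "tests", "verify"),
    "documentation": ("readme", "docs", "document"),
    "git_operations": ("commit", "push", "merge", "branch"),
    "planning": ("plan", "roadmap", "milestone"),
    "implementation": ("implement", "add", "build", "create"),
}

# Hypothetical set of short approvals worth tracking separately.
SHORT_APPROVALS = {"yes", "ok", "okay", "go ahead", "looks good", "approved"}


def classify_prompt(prompt: str) -> str:
    """Return a coarse category label for a single prompt."""
    text = prompt.strip().lower()
    if text.rstrip(".!") in SHORT_APPROVALS:
        return "short_approval"
    # Match whole words so that "add" does not match "address".
    words = set(re.findall(r"[a-z]+", text))
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & set(keywords):
            return category
    return "other"


def summarize(prompts: Iterable[str]) -> Counter:
    """Count how many prompts fall into each category."""
    return Counter(classify_prompt(p) for p in prompts)


if __name__ == "__main__":
    sample = [
        "Fix the broken win detection on diagonal lines",
        "Add tests for the draw condition",
        "Looks good.",
        "Implement persistence for saved games",
    ]
    for category, count in summarize(sample).most_common():
        print(category, count)
```

Even a crude tagger like this makes the mix visible: a log dominated by implementation prompts and short approvals, with few verification prompts, is a signal worth inspecting.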
If AI becomes embedded in the operating model of software teams, we need to know how we are using it. Are we using it to think or just to execute? Are we asking for enough verification? Are we overusing it for implementation and underusing it for evaluation? Are approvals becoming too casual? Are we creating a healthy loop of direction, generation, inspection, and correction, or are we just accelerating output?
This is where I think many organizations will need to mature quickly. Most companies are still thinking about AI measurement as output measurement:
- How much faster did we write the document?
- How much code did the agent generate?
- How many tickets did it close?
Those metrics are not irrelevant, but they are incomplete. The deeper question is whether the operating model is getting better.
- Is the team asking better questions?
- Is the work more reviewable?
- Are assumptions clearer?
- Is evidence easier to inspect?
- Are decisions more grounded?
- Are failure modes easier to find?
- Are humans still accountable for the right parts of the process?

That question became even more important in the promotion evaluation project.
Promotion review is very different from building a game or a prototype. It is high stakes. It involves people’s careers. It involves judgment, fairness, evidence, context, and leadership accountability. It is exactly the kind of domain where using AI carelessly would be irresponsible.
That is why the design principle mattered so much: AI should not decide who gets promoted.
The purpose of the workflow was not to outsource judgment. It was to improve the conditions under which humans exercise judgment.
Promotion packets are uneven by nature. Managers write differently. Some narratives are polished. Some are dense. Some are specific. Some are vague. Some candidates have evidence that is easy to see. Others have evidence that is real but buried. When a panel reviews those materials, it is not only comparing candidates. It is also comparing the quality and structure of the packets.
The promotion evaluation work used AI to standardize the review surface. It helped turn raw material into consistent one-page briefs. It extracted claims. It organized evidence. It surfaced risks. It created reasons to challenge the recommendation. It generated panel questions. It supported comparison. It helped track notes and decisions.
The model was not judging the person. It was judging whether claims were supported. Was a claim verified directly by the source material? Was it cross-referenced? Was it inferred? Was it an error? That taxonomy matters because a confident sentence in a summary can hide very different levels of support.
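The support taxonomy can be made concrete with a small data structure. The sketch below is an assumption about how such labels might be represented, not the workflow's actual schema: the `Support` enum, the `Claim` record, the helper names, and the sample claims are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Support(Enum):
    """Levels of support for a claim, following the four labels in the text."""
    VERIFIED = "verified"                  # stated directly in the source material
    CROSS_REFERENCED = "cross_referenced"  # supported by comparing sources (assumed meaning)
    INFERRED = "inferred"                  # plausible, but not stated in any source
    ERROR = "error"                        # contradicted by or absent from the sources


@dataclass(frozen=True)
class Claim:
    text: str
    support: Support
    source: Optional[str] = None  # hypothetical pointer to where the evidence lives


def needs_panel_attention(claims: List[Claim]) -> List[Claim]:
    """Return the claims a human reviewer should challenge first."""
    weak = {Support.INFERRED, Support.ERROR}
    return [c for c in claims if c.support in weak]


def support_summary(claims: List[Claim]) -> Counter:
    """Count claims per support level for a one-page brief."""
    return Counter(c.support for c in claims)


if __name__ == "__main__":
    # Invented sample claims, for illustration only.
    brief = [
        Claim("Led the migration project", Support.VERIFIED, "manager narrative"),
        Claim("Mentored two engineers", Support.CROSS_REFERENCED, "peer feedback"),
        Claim("Operates at the next level", Support.INFERRED),
    ]
    for claim in needs_panel_attention(brief):
        print(claim.support.value, "-", claim.text)
```

The point of the structure is that the label travels with the claim, so a fluent summary sentence cannot quietly lose its level of support.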
This is one of the most important lessons from the year.
Some of the best AI workflows will be the ones that make human decisions more evidence-based, more consistent, and more reviewable.
That distinction should be central to how product leaders think about AI. The goal is not always to remove the human.
Often the goal is to give the human a better surface for judgment.
The same pattern showed up again in conversational search evaluation. That project sits closer to the domain I work in every day: search, recommendations, discovery, metadata, ranking, personalization, and AI-assisted user experiences. The question there was whether a product team could tell if the experience was actually getting better.
That is harder than it sounds. A conversational discovery system can fail while sounding fluent. It can return plausible titles that violate a constraint. It can recommend a movie when the user asked for a series. It can miss required titles. It can generate a blurb that sounds helpful but includes unsupported claims. It can offer follow-up suggestions that are repetitive, narrow, or disconnected from the user’s intent. It can look good in a demo and still fail as a product.
So the evaluation framework had to separate correctness from quality.
- Correctness asks whether the response is valid enough to judge. Did the system respect media type? Did it satisfy cast, director, rating, award, or time-period constraints? Did it include required titles? Did it avoid unsafe or brand-risky language? Did it preserve the metadata evidence?
- Quality asks a different set of questions. Are the results relevant? Are the best matches near the top? Is the generated explanation useful? Are title-level blurbs specific and grounded? Are follow-up suggestions diverse and coherent? Does the conversation get better as it continues, or does it drift?
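The separation between the two can be sketched as a gate: a response is scored for quality only after it passes the correctness checks. The example below is a simplified illustration under stated assumptions. The `Result` and `Evaluation` types, the two checks shown (media type and required titles), and the rank-weighted relevance score are hypothetical stand-ins, not the framework's actual checks or metrics.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    title: str
    media_type: str   # e.g. "movie" or "series"
    relevance: float  # hypothetical 0..1 judgment from a rater or grader


@dataclass
class Evaluation:
    valid: bool
    violations: List[str]
    quality: Optional[float]  # None when the response fails correctness


def check_correctness(results: List[Result], media_type: str,
                      required_titles: List[str]) -> List[str]:
    """Return a list of constraint violations (empty means valid)."""
    violations = []
    for r in results:
        if r.media_type != media_type:
            violations.append(f"{r.title} is a {r.media_type}, expected {media_type}")
    returned = {r.title for r in results}
    for title in required_titles:
        if title not in returned:
            violations.append(f"missing required title: {title}")
    return violations


def quality_score(results: List[Result]) -> float:
    """Rank-weighted mean relevance: matches near the top count more."""
    if not results:
        return 0.0
    weights = [1.0 / rank for rank in range(1, len(results) + 1)]
    weighted = sum(w * r.relevance for w, r in zip(weights, results))
    return weighted / sum(weights)


def evaluate(results: List[Result], media_type: str,
             required_titles: List[str]) -> Evaluation:
    """Gate on correctness first; only valid responses get a quality score."""
    violations = check_correctness(results, media_type, required_titles)
    if violations:
        return Evaluation(valid=False, violations=violations, quality=None)
    return Evaluation(valid=True, violations=[], quality=quality_score(results))
```

Keeping the quality score empty for invalid responses is a deliberate choice in this sketch: a fluent answer that violates a constraint never gets averaged into a quality number that makes it look acceptable.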
Conversational products cannot be evaluated only as single responses. They have to be evaluated as paths. A seed query opens a tree. Follow-up suggestions create branches. Each branch can continue multiple turns. Repetition, drift, incoherence, and narrowing may not appear in the first answer. They emerge over time.
That is a very different product management problem from evaluating a static screen.
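Evaluating paths rather than single responses can be sketched as a tree walk. The example below is a minimal illustration, assuming each turn records the titles it returned and its follow-up branches. The `Turn` type, the `repetition_rate` metric, and the placeholder titles are hypothetical, and repetition is only one of the path-level signals mentioned above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Turn:
    query: str
    titles: List[str]
    follow_ups: List["Turn"] = field(default_factory=list)


def iter_paths(turn: Turn, prefix: Tuple[Turn, ...] = ()) -> Iterator[List[Turn]]:
    """Yield every root-to-leaf path through the conversation tree."""
    path = prefix + (turn,)
    if not turn.follow_ups:
        yield list(path)
        return
    for child in turn.follow_ups:
        yield from iter_paths(child, path)


def repetition_rate(path: List[Turn]) -> float:
    """Fraction of returned titles that were already shown in an earlier turn."""
    seen = set()
    total = 0
    repeated = 0
    for turn in path:
        for title in turn.titles:
            total += 1
            if title in seen:
                repeated += 1
        # Update after the turn so only cross-turn repeats are counted.
        seen.update(turn.titles)
    return repeated / total if total else 0.0


if __name__ == "__main__":
    # Placeholder tree: one seed query with two follow-up branches.
    seed = Turn("space series", ["Title A", "Title B"], [
        Turn("more like Title A", ["Title A", "Title C"]),
        Turn("something lighter", ["Title D", "Title E"]),
    ])
    for path in iter_paths(seed):
        print(" > ".join(t.query for t in path), repetition_rate(path))
```

A first-turn evaluation would score both branches identically, since they share the same seed response. Only the path-level view shows that one branch starts recycling titles.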

It requires PMs to think in systems. Retrieval, ranking, metadata quality, generated language, policy constraints, latency, grounding, feedback loops, and user trust all interact. The product is not just what appears on the screen. The product is the behavior of the system.
The future PM, especially in AI-native discovery, cannot operate only at the level of surface requirements. A PM working on search should understand query understanding, retrieval, ranking, result presentation, and evaluation. A PM working on recommendations should understand candidate generation, ranking objectives, diversity, novelty, cold start, feedback loops, and long-term satisfaction. A PM working on conversational AI should understand grounding, hallucination risk, source-of-truth constraints, and multi-turn evaluation.
That does not mean every PM becomes an engineer or data scientist. It means the PM has to become much more fluent in the system they are shaping.
LLMs help with that fluency. They can explain code, generate prototypes, compare approaches, produce test plans, inspect logs, draft evaluation rubrics, and help a PM move further into the mechanics of the product. But the PM still has to know what questions to ask. The PM still has to know what evidence matters. The PM still has to own the tradeoffs.
The travel journal project reinforced this from a different angle. That project was not about enterprise software or ranking systems. It was about planning a photographic trip: itinerary, locations, timing, guidance, field context, and a useful way to carry the experience. It showed how LLMs can help convert a personal goal into a structured, usable product. Not just “give me a travel plan,” but create an experience around the trip. Organize the context. Sequence the days. Support the creative objective. Make the output something I could actually use.
The best AI work was not the first generated answer. It was the shaped artifact. It required context, constraints, taste, and iteration. The value came from turning intent into a product-like experience, even when the “product” was for one person.
By the end of the year, the projects looked very different from one another, but they were all teaching the same lesson.
- The visual experiments taught me how fast intent can become artifact.
- The games and prototypes taught me how PMs can get closer to the material of software.
- The strategy and planning work taught me that fluency is not the same thing as thinking.
- Codex Log Viewer taught me that prompts are becoming part of the product record.
- Project Focus taught me that we need to evaluate how we are using AI, not just what AI produces.
- The promotion evaluation workflow taught me that AI can improve human judgment without replacing human accountability.
- The conversational search evaluation tool taught me that AI-native products require deeper, more structured evals than traditional product analytics.
- The travel journal taught me that these tools can turn a personal intent into a complete, usable experience when guided well.
AI is not just a productivity layer. It is becoming an operating layer.
It changes how ideas become artifacts. It changes how PMs explore ambiguity. It changes how strategy becomes testable. It changes how prototypes get built. It changes how workflows are inspected. It changes how evidence is organized. It changes how teams evaluate product quality. It changes how leaders prepare for judgment-heavy decisions.
AI increases the premium on judgment.
The artifacts are cheaper now. Documents, prototypes, summaries, dashboards, code, test plans, and evaluation rubrics can all be produced faster. That means the differentiator moves upstream and downstream. Upstream, did you frame the problem clearly? Downstream, can you evaluate whether the output is true, useful, grounded, and worth acting on?
A PM who only coordinates status will be under pressure. A PM who only writes requirements will be under pressure. A PM who cannot reason about systems, evidence, tradeoffs, and evaluation will struggle in AI-native environments.
But a PM who can create clarity, direct AI-assisted work, build prototypes, understand system behavior, design evaluation loops, and make evidence-based decisions will have more leverage than ever.
The role is moving from artifact production to operating model design.
The PM of the future will need to be able to move across levels. They will need to start with a customer problem, turn it into a clear product hypothesis, create a working artifact, inspect the system behavior, define the evaluation method, interpret the evidence, and decide what should happen next. That is not prompt engineering. That is product leadership.
The mistake is thinking that AI makes the work easier in a simple way. It makes some parts easier. It makes other parts more demanding. It removes friction from making, but that means teams can produce more wrong things faster. It makes synthesis cheaper, but that means leaders need better standards for truth. It makes prototypes easier, but that means PMs need stronger taste. It makes analysis faster, but that means evaluation design becomes more important.
The real advantage will come from using AI to build better loops.
- Better loops from intent to artifact.
- Better loops from artifact to evidence.
- Better loops from evidence to judgment.
- Better loops from judgment to product improvement.
That is the through-line across everything I built this year. I was not just trying to make more things. I was trying to understand the new loop.
And the more I built, the more convinced I became that product managers need to get hands-on with these tools now, because you cannot understand this shift from the sidelines.
You have to feel it.
- You have to feel how quickly an idea can become real.
- You have to feel how easily the model can misunderstand you.
- You have to feel the difference between a plausible answer and a grounded one.
- You have to feel the cleanup cost.
- You have to feel the power of a good evaluation loop.
- You have to feel how much your own clarity matters.
The future will not belong to PMs who produce the most AI-generated artifacts. It will belong to PMs who can use AI to think more clearly, build more concretely, evaluate more rigorously, and make better decisions.
That is the operating system I have been trying to build for myself.
And I think it is the operating system product teams will need next.
This writing reflects my personal perspectives on product management, AI, and content discovery. It does not represent the official position of my employer or any affiliated organization.