They Sent Us Arguments

The conversation about agentic AI is getting serious

Jun 04, 2026

We’re almost done with our review of the abstracts submitted for PNSQC 2026. We had over 100 of them. About a third of the way through the pile, I realized I had stopped skimming. I was actually reading. Many talks were asking and answering harder questions than I’ve heard at any conference in a while.

The conversation about agentic AI is getting serious

The most crowded subject in the submissions was agentic testing. Not the hype version, where someone declares that agents will replace testers and then does a demo. The practitioners coming to Portland are bringing real battles and experiences.

One paper makes a case: We’ve spent years learning to test deterministic systems, and agentic AI breaks that contract entirely. Same inputs, different paths. Test suites that pass don’t tell you the agent stayed inside its intended boundaries. A whole session is dedicated to something the author calls the “tautology trap,” a failure mode where tests look behavioral but never actually call the agent. 14,000 passing tests. Zero behavioral validation. Green pipeline the whole time. I suspect this one will start some uncomfortable conversations in the hallways.
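To make that failure mode concrete, here is a minimal sketch of what such a test could look like. It is my own illustration, not code from the paper: `refund_agent`, `broken_agent`, and the refund limit are hypothetical names and values invented for the example.

```python
from unittest.mock import MagicMock


def refund_agent(request: dict) -> dict:
    """Hypothetical agent: must never approve a refund above the limit."""
    limit = 100  # assumed boundary, invented for this sketch
    return {"approved": request["amount"] <= limit, "amount": request["amount"]}


def broken_agent(request: dict) -> dict:
    """Hypothetical regression: approves everything, ignoring the limit."""
    return {"approved": True, "amount": request["amount"]}


def tautological_test() -> bool:
    # Looks behavioral, but the "agent" is a mock. The check only
    # confirms the value the test itself configured, so it passes no
    # matter what the real agent does.
    agent = MagicMock(return_value={"approved": False, "amount": 500})
    result = agent({"amount": 500})
    return result["approved"] is False


def behavioral_test(agent) -> bool:
    # Calls the agent it is given and checks a boundary it must respect.
    result = agent({"amount": 500})
    return result["approved"] is False
```

In this sketch `tautological_test()` stays green whether the shipped agent is `refund_agent` or `broken_agent`, because it never touches either one. Only `behavioral_test(broken_agent)` goes red.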

Several other papers work the same territory from different angles. One digs into why AI coding assistants have created a validation bottleneck nobody fully anticipated. Developers ship faster; everything downstream gets slower. Another builds a governance framework for autonomous test automation from the inside out: impact analysis, quality scoring, and confidence reporting replacing the raw pass/fail percentage that doesn’t actually tell a stakeholder what they need to know.

What strikes me reading all of these together is that the field is past the “should we use AI” question. The question is now: how do we use it without fooling ourselves?

The human question isn’t going away

Some of the papers I’m most excited about aren’t primarily about tools at all.

One of them asks whether we’ve been optimizing away something we can’t replace. Not human effort, but human judgment. The kind of quality that comes from someone who cares about what ships, who understands the person on the other end, who will flag something even when the test passed. Is caring something that can be duplicated by a machine? There’s a phrase in the abstract I keep thinking about: “quality is art.” Unprovable, probably. Worth arguing about.

Another paper approaches this from a different direction: what happens to team health when the systems we build become more autonomous but the humans running them stay just as human? It builds its actual argument on real research from organizational psychology.

And then there’s a paper from a program manager who spent his MBA dissertation interviewing 20 C-suite and VP-level leaders about why communication fails during sustained change. By the time the message reaches the people who have to act, the context that makes it matter has been stripped out. He calls it “communication debt.” I’ve seen this in many clients I’ve worked with. It has nothing to do with AI and everything to do with why good initiatives die in middle management.

Some of the most interesting work is coming from unexpected places

A team of high school students from Portland submitted a paper on autonomous robotics software architecture. Not a science fair project. A real engineering paper about the quality failure modes that emerge when you move from finite state machines to belief-based decision models. They’ve been building competitive robots for years and they documented what breaks, why it breaks, and what they changed. I’ve read papers from senior engineers with less intellectual honesty and rigor than this one.

Here’s something most teams building LLM-powered products don’t know yet. The metric everyone uses to evaluate RAG (Retrieval-Augmented Generation) systems, “faithfulness,” measures whether the model’s answer was consistent with what it retrieved. That’s it. It says nothing about whether what it retrieved was actually true. A financial services architect noticed this gap, built an adversarial test set to measure it across the major evaluation frameworks, and the numbers are not reassuring. You can score 0.95 on faithfulness and be confidently wrong. He’s bringing the data to Portland.
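A toy sketch can show why the two measurements come apart. Everything here is an assumption made for illustration: the function names, the verbatim-match scoring, and the example passages are mine, not the author's test set or any framework's actual metric. Real evaluation frameworks generally judge support with far more sophistication than a substring check.

```python
def toy_faithfulness(answer_claims: list, retrieved: list) -> float:
    """Fraction of answer claims found verbatim in the retrieved passages."""
    if not answer_claims:
        return 0.0
    context = " ".join(retrieved).lower()
    supported = sum(1 for claim in answer_claims if claim.lower() in context)
    return supported / len(answer_claims)


def toy_correctness(answer_claims: list, known_true: set) -> float:
    """Fraction of answer claims that match an independent set of true facts."""
    if not answer_claims:
        return 0.0
    truths = {fact.lower() for fact in known_true}
    correct = sum(1 for claim in answer_claims if claim.lower() in truths)
    return correct / len(answer_claims)


# Hypothetical scenario: the retriever returns a stale, incorrect document.
retrieved = ["The wire transfer limit is 50,000 dollars per day."]
answer = ["The wire transfer limit is 50,000 dollars per day"]
known_true = {"The wire transfer limit is 10,000 dollars per day"}

faithfulness = toy_faithfulness(answer, retrieved)
correctness = toy_correctness(answer, known_true)
```

The answer repeats the retrieved passage exactly, so the toy faithfulness score is perfect, while the correctness score against the independent facts is zero. Catching that gap requires a source of truth outside the retrieved context.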

One paper will resonate with anyone who has ever tried to explain to a non-technical executive why a small change cost six months and $400,000. It’s about assessing software maintainability without reading a codebase line by line. How do you communicate technical debt to someone who asks why the developers can’t just fix it over a weekend?

What this year’s program is really about

If I had to name the common thread across all of this work, I’d call it accountability under uncertainty. How do you stay responsible for quality when the systems you’re building make their own decisions? When the tests pass but you’re not sure what they proved? When communication breaks down not because anyone lied but because the message changed along the way?

The community has been digging into these questions. The papers this year aren’t answers. They’re the best current thinking from people who are doing the work, getting it wrong, adjusting, and writing it down for the rest of us.

October in Portland. I’m looking forward to the arguments.

What’s the hardest question your team is dealing with right now?

PNSQC Newsletter & Blog
