Our mobile app needed a UX audit. It had been designed desktop-first and adapted for phones, and the list of things to check kept growing while the team kept shipping features. So I pointed an AI agent at it and told it to be a first-year student. It navigated the app on an iPhone Simulator and surfaced 30 usability issues. I shortlisted 10 — then the agent wrote code fixes, created Jira tickets, and submitted merge requests, all through the same pipeline our engineers use. Two of those fixes have been reviewed and merged into production. No developer wrote a line.
What I Actually Did
Syntea is an AI tutoring platform at IU International University of Applied Sciences. Thousands of students use it on their phones every day: talking to an AI tutor, taking practice exams, listening to generated podcasts. The mobile experience is feature-rich, and a design review last year improved it, but without ongoing systematic testing those improvements erode. The app isn’t mobile-first yet, and nobody was regularly checking whether it still works well on a phone.
I didn’t give the agent a list of known bugs. I didn’t point it at specific screens. I told it to explore the app as a student would, test every feature it could find, and tell me what was wrong. It did the rest.
How the Pipeline Works
Claude Code connects to an iPhone Simulator via Xcode’s toolchain and navigates the app as a first-year student — programmatically, in the background, so I can keep working while it runs. It tested 15 features, screenshotted every screen, measured every tap target against iOS accessibility standards. Then three independent analyses ran on the findings: one reviewed only screenshots with zero context, one benchmarked against Duolingo, Coursera, and other learning apps, one explored the app hands-on from a UX perspective. They shared no conclusions.
What It Found
Thirty findings total. Four were high-severity, including a microapp that triggered a chat action invisible to the user until they navigated away. The team was already aware of that one, but the agent flagged it independently. All three analyses converged on the same top issues in the same severity order. To be clear, these were three analytical frames, not three models: the same underlying model looking through different lenses. Even so, the convergence across fresh eyes, benchmarking, and hands-on exploration made the findings hard to dismiss.
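For readers who want to see what "converged in the same severity order" means mechanically, here is a minimal sketch. It is not the code the pipeline ran. The function name, the analysis names, and the issue IDs are all hypothetical, and the ordering shown is made up for illustration.

```python
from typing import Dict, List


def converged_top_issues(rankings: Dict[str, List[str]], top_n: int = 3) -> List[str]:
    """Return the shared top-N issues if every analysis ranks them identically.

    `rankings` maps an analysis name to its issue IDs, ordered from most to
    least severe. An empty list means the analyses disagree on the top N.
    """
    tops = [ranked[:top_n] for ranked in rankings.values()]
    if not tops:
        return []
    first = tops[0]
    # Convergence here means the same issues in the same order, not just overlap.
    return first if all(top == first for top in tops) else []


# Hypothetical issue IDs and ordering, for illustration only.
rankings = {
    "screenshots_only": ["hidden-chat-action", "desktop-wording", "small-close-button", "missing-label"],
    "benchmark": ["hidden-chat-action", "desktop-wording", "small-close-button", "carousel-dots"],
    "hands_on": ["hidden-chat-action", "desktop-wording", "small-close-button", "sizing-drift"],
}

print(converged_top_issues(rankings))
```

The check is deliberately strict: two analyses that agree on the issues but rank them differently count as not converged.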
Here’s what those findings looked like. The AI tutor was telling mobile students “I’ve opened it on the right side”: desktop language on a phone. Close buttons were 29 pixels wide, where Apple’s Human Interface Guidelines recommend a hit target of at least 44 by 44 points. An accessibility label missing here, a sizing inconsistency there.
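The desktop-language finding is the kind of thing a simple text check could catch on later runs. The sketch below is an assumption about how such a check might look, not what the agent actually did. The phrase list and the function name are hypothetical; only the "on the right side" example comes from the audit.

```python
import re
from typing import List

# Hypothetical phrase list: wording that assumes a wide, side-by-side layout.
DESKTOP_PHRASES = [
    r"on the (right|left) side",
    r"\bsidebar\b",
    r"\bhover\b",
    r"\bright-click\b",
]
_PATTERN = re.compile("|".join(DESKTOP_PHRASES), re.IGNORECASE)


def desktop_phrases_in(reply: str) -> List[str]:
    """Return each desktop-oriented phrase found in a tutor reply."""
    return [match.group(0) for match in _PATTERN.finditer(reply)]


print(desktop_phrases_in("I've opened it on the right side"))
```

A list like this will never be complete, so it works best as a regression guard for wording that has already been flagged once.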
From Findings to Fixes
The findings landed on a triage page I built — an interactive review where I clicked Fix, Discuss, or Skip on each issue. From those thirty, I shortlisted ten for immediate action and marked five as Fix, four as Discuss, one as Skip. The Skip was interesting — the agent correctly identified nine feature cards hidden behind a carousel with no pagination dots. Technically right. But the product decision was to reduce the number of cards, not add dots. The AI was technically correct each time. What to do about it was mine to decide.
For each Fix, the agent read the codebase, found the existing Design System component, wrote the change, verified it with Playwright at mobile viewport, created a Jira ticket, and submitted a GitLab MR.
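To make the verification step concrete, here is a rough sketch of a tap-target check at a mobile viewport using Playwright's Python API. It is an illustration under assumptions, not the agent's actual script. The URL and selector you pass in are placeholders, "iPhone 13" is just one of Playwright's built-in device profiles, and treating CSS pixels in the emulated viewport as a stand-in for iOS points is a simplification.

```python
from typing import Optional, TypedDict

MIN_TAP_TARGET = 44  # minimum width and height the check accepts


class Box(TypedDict):
    x: float
    y: float
    width: float
    height: float


def meets_tap_target(box: Optional[Box], minimum: float = MIN_TAP_TARGET) -> bool:
    """True if the element is rendered and at least `minimum` wide and tall."""
    if box is None:  # bounding_box() returns None when the element is not visible
        return False
    return box["width"] >= minimum and box["height"] >= minimum


def check_in_mobile_viewport(url: str, selector: str) -> bool:
    """Load `url` in an emulated iPhone and size-check one element."""
    # Imported lazily so the pure check above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.webkit.launch()
        context = browser.new_context(**p.devices["iPhone 13"])
        page = context.new_page()
        page.goto(url)
        box = page.locator(selector).bounding_box()
        browser.close()
    return meets_tap_target(box)
```

Keeping the size rule in its own function means the same threshold can be applied to measurements from any source, whether a browser viewport or a simulator screenshot.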
Same Jira tickets. Same GitLab MRs. Same CI pipeline. Same code review. The agent didn’t get a shortcut. It used the same process as every engineer on the team.
Two humans closed the loop: a designer checked Design System compliance, a developer reviewed the code.
What It Was Actually Like
Claude Code asks for human approval before many of its actions; in my setup that covered file reads, browser taps, and git commits. Early on I approved nearly everything manually, clicking “accept” in a terminal while an AI navigated an iPhone on my screen. Not glamorous. But with each run you learn what to trust, widen the permissions, and the loop gets tighter. There were still moments where I could see the agent about to make a wrong move and had to wait for it to fail before it tried something else.
The whole thing took about a day, including pipeline setup. Maybe two hours of that were me — approvals, reviewing findings, challenging the agent when it went sideways. A senior engineer familiar with the codebase could have found and fixed these specific issues faster. But they never would have. That’s not a knock on the team — this kind of work is the first thing that drops when there’s a feature roadmap to deliver.
I should be clear: I’m a PM, not an engineer. This wasn’t a mobile redesign — it was the polish pass before a wider rollout. The pipeline works end-to-end, but it’s a prototype — built to prove a point, not to scale. The audit surfaced bigger issues — I started small to prove the pipeline. I’m not going to pretend this is AI building our product. It’s AI doing the work that matters but never survives prioritization.
And yes — an AI helped me draft this post about an AI writing code. The recursion isn’t lost on me.
Where This Goes
The setup is done — the second run is just pointing it at the app again. This first run tested 15 features for tap targets and language consistency. The next one may include responsive layout checks — testing whether components actually collapse and reflow correctly at phone viewports. The one after that tests onboarding flows.
Meanwhile, the engineering team is independently building proper automated testing practices — not from my work, but toward the same destination. The scrappy version proved the concept was worth pursuing. The professional version is already underway.
Each run also builds context. The findings, the code patterns, the decisions I made during triage — all of that carries forward. The next run starts with everything this one learned, and that makes each pass more targeted.
What makes this work is the combination — an agent smart enough to reason across the full stack, and a codebase structured enough to let it. The agent had full context about Syntea — architecture, component APIs, how the AI tutor’s prompts work. Clean architecture helped — but so did the documentation that let the agent understand what it was looking at.
What surprised me wasn’t any single step — AI can find bugs, AI can write code, AI can create tickets. We’ve heard those stories. What surprised me was the thread running through all of them. Test to commit. One pipeline, one session, zero handoffs between tools or teams. The fixes were small. The pipeline isn’t.
