AussieBytes Blog ◖ᵔᴥᵔ◗

Where does your software engineering team sit on the AI maturity curve?

Most engineering teams think they're AI-native. The data says otherwise. In fact, there could be a growing gap between frontier engineers and the rest of the pack. Here's how to figure out where you really are — and what it takes to level up.


The Uncomfortable Truth About AI Adoption

Here's a stat that should make every engineering leader pause: a small 2025 randomised controlled trial by METR found that experienced developers completed tasks 19% slower when using AI tools than when working without them. Stranger still, those same developers believed they were 24% faster. They were wrong about both the direction and the magnitude.

This isn't an argument against AI. It's an argument against how most teams are adopting it, and how far their perception of success can diverge from reality.

Why did experienced developers get slower? The study doesn't isolate a single cause, but several factors likely contributed.

When does it flip positive? The METR study doesn't answer this directly, but the pattern across early adopters, together with our own experience internally and with clients, suggests it happens when teams redesign workflows around AI rather than bolting it onto existing ones: narrower task scoping, richer context files, tight feedback loops, and proper evaluation systems. The study doesn't say AI is slow. It says unredesigned workflows are slow.

The organisations seeing real results report 25-30%+ productivity gains. These aren't the teams that bolted GitHub Copilot onto their existing workflows and called it a day. They're the ones that fundamentally redesigned how their teams work. And the gap between those two groups is widening fast.

The rest of this post is about how to close that gap and leverage the recent gains in AI models for software engineering.


The 5 Levels of AI Maturity in Engineering Teams

Dan Shapiro, CEO of Glowforge, developed a framework that maps where teams actually sit on the AI adoption curve. It's become one of the clearest lenses for understanding the maturity gap — and it forces honesty from teams who work with it.

📋 Quick Self-Assessment: Where Does Your Team Sit?

Use this scorecard to place your team honestly. For each level, ask the diagnostic question — if you can't answer "yes" with evidence, you're probably not there yet.

[Chart: the levels of AI-assisted software development, from "Spicy Autocomplete" to "Dark Factory", with columns for Level & Title, AI responsibilities, Human roles, Key artifacts, and Diagnostic question.]

Level 0: Spicy Autocomplete

You type code. AI suggests the next line. Accept or reject. This is GitHub Copilot in its original form, or Cursor's tab complete. The human writes software; AI reduces keystrokes. This is where most teams start.

Who's here: This is still the default experience for most developers using GitHub Copilot's original inline suggestions or Cursor's tab-complete. It's faster typing, but it's still typing.

Level 1: The Coding Intern

You hand AI discrete, well-scoped tasks: write this function, build this component. The human still handles architecture, judgment, and integration. AI is a capable assistant — but the engineer is doing the thinking.

Who's here: Teams here are starting to get a handle on the growing ecosystem of agent context files (e.g., CLAUDE.md, AGENTS.md). They add project conventions and boundaries to their repos so AI can handle scoped tasks more reliably. Good resources: Builder.io has published a practical guide to writing effective AGENTS.md files, and the community-curated awesome-claude-code repo collects templates, skills, and examples.
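To make this concrete, here's a sketch of what such a context file can look like. The conventions below are invented for this example rather than taken from any particular guide; the structure — conventions, build commands, boundaries — is the part that matters.

```markdown
# AGENTS.md — example skeleton (contents are illustrative)

## Project conventions
- TypeScript strict mode; no `any` without a comment explaining why.
- Every new endpoint needs an integration test under `tests/api/`.

## Build & verify
- Install: `npm install`
- Test: `npm test` — run before proposing any change.

## Boundaries
- Never modify files under `migrations/` or `secrets/`.
- Ask before adding a new dependency.
```

Teams typically keep this file short and ruthlessly current; a stale boundary is worse than no boundary.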

Level 2: The Junior Developer

AI handles multi-file changes, navigates the codebase, understands dependencies. The human still reviews all the code.

This is where most developers who describe themselves as "AI-native" are operating.

This is where major gains occur, but it's also the comfort zone. It feels exceptionally productive. The pull requests are flowing. But the fundamental workflow hasn't really changed.

Who's here: This is where most teams using Cursor, Cline (58k+ GitHub stars), or ChatGPT-assisted coding appear to sit today. Developers are pairing with AI across files and getting into flow states — but still reviewing every diff. Devin customers often start here too, using AI agents for repetitive migration tasks like Java upgrades and dependency bumps, with engineers reviewing all output before merge.

Level 3: Developer as Manager

This is where the workflow flips. The engineer is no longer writing code with AI help. Instead, the workflow involves directing AI and reviewing at the PR level. The model does the implementation and submits pull requests. The engineer manages the output.

Some developers top out here. The psychological difficulty of letting go of code is real: many people love this part of the job, and for some it's akin to an artisanal craft.

Who's here: OpenAI's Codex operates exceptionally well at this level and the next, Level 4. Engineers on the Codex team have stated they can run four to eight parallel agents simultaneously — one implementing a feature, another doing code review, a third running a security audit, a fourth summarising codebase changes. Context engineering is the key unlock: Codex reads AGENTS.md files for project conventions, and the team has built 100+ reusable "Skills". Devin's enterprise customers report similar patterns, with 6-12x efficiency gains when running multiple agents on isolated tasks in parallel.

Level 4: Developer as Product Manager

Write a specification. Walk away. Come back hours later and check if the tests pass. We are not reading code anymore; we are evaluating outcomes. The code itself is a black box. What matters now is the system: how we set context, specs, ways of working, and evaluations, and how we review output and improve the way we and the AI work together.

Who's here: The frontier lab coding teams appear to be here. Boris Cherny, creator of Claude Code, hasn't personally written code in over two months. Internal teams at Anthropic report 70-90% AI-written code, with 90% of Claude Code's own codebase written by Claude Code itself. Shapiro himself says he operates here.

Level 5: The Dark Factory

Specs in, software out. No human writes code. No human reviews code. The factory runs autonomously with the lights off.

Almost no teams genuinely operate here today. You can count the publicly documented examples on one hand — and the gap between the marketing language and the operating reality across our industry is enormous.

Who's here: StrongDM's Software Factory remains the most thoroughly documented example. Three engineers have been shipping production software — 16,000 lines of Rust, 9,500 lines of Go, 700 lines of TypeScript — with no human code writing or review since July 2024. They've open-sourced key components: Attractor, Leash, and CXDB. Simon Willison's detailed write-up is a great overview. Cursor's FastRender experiment also offers a glimpse of what's coming: hundreds of autonomous agents coordinated to build a somewhat-working web browser from scratch in under a week.

Adding to all of this: the biggest org-wide gains come not from individual teams levelling up in isolation, but from sharing the artefacts across teams. Context file conventions, spec templates, eval harnesses, and policy frameworks are all expensive to build and cheap to reuse. An organisation with six teams independently at Level 3 — each with its own spec format and no shared eval infrastructure — is leaving compounding returns on the table. The exact best practices for this new way of working, the ones that fit your particular organisation, are yet to be discovered, so encouraging sharing and peer review of what's working (and what isn't) is key to success.


What Changes When the Human Stops Reading Code?

As teams move up the maturity curve, there's a critical inflection point between Level 3 and Level 5 that deserves its own framing. The shift isn't just about productivity; it's about control: how do we build an engineering control system that replaces human code review with something more rigorous, not less?

When a human stops reading diffs, the guarantees that code review used to provide have to come from somewhere else.

Another way to think about this: going up the maturity curve isn't about sticking with the same controls. It's about replacing informal, human-dependent controls (code review, manual testing, tribal knowledge) with formal, machine-enforceable controls (policy engines, scenario suites, satisfaction gating, automated rollback). We are building the factory's assembly line from the ground up. The factory with the 'lights off' should have more safety infrastructure than the one with humans watching — not less.
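As a rough illustration of what "machine-enforceable" can mean in practice, here's a minimal Python sketch of a merge gate that blocks an agent's change on red flags or a low scenario pass rate. The names, thresholds, and flag labels are all hypothetical, not from any named product.

```python
from dataclasses import dataclass


@dataclass
class ScenarioRun:
    """Result of one end-to-end scenario run against the agent's change."""
    name: str
    passed: bool
    red_flags: list[str]  # e.g. ["touches-auth", "data-loss-risk"]


def merge_gate(runs: list[ScenarioRun], min_pass_rate: float = 0.98) -> tuple[bool, str]:
    """A machine-enforceable stand-in for human review: refuse to merge
    unless every red-flag check is clean and the pass rate clears the bar."""
    flagged = [r.name for r in runs if r.red_flags]
    if flagged:
        return False, f"blocked: red flags in {flagged}"
    pass_rate = sum(r.passed for r in runs) / len(runs)
    if pass_rate < min_pass_rate:
        return False, f"blocked: pass rate {pass_rate:.0%} below {min_pass_rate:.0%}"
    return True, "clear to merge"


runs = [ScenarioRun("login-flow", True, []), ScenarioRun("export-csv", True, [])]
ok, reason = merge_gate(runs)
print(ok, reason)  # True clear to merge
```

The point of the sketch is the shape, not the numbers: the gate is explicit, versioned, and testable, where a human reviewer's judgment was implicit.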


Why Most Teams Are Stuck (and Don't Know It)

There's a pattern that plays out in almost every software engineering team adopting AI, and it follows what researchers call the J-Curve of AI adoption.

When you bolt AI onto existing workflows, productivity dips before it improves. The initial disruption — learning new tools, adjusting review processes, dealing with AI-generated code that's functional but architecturally naive — creates a temporary slowdown.

Some organisations are stuck at the bottom of this curve. And they're interpreting the dip as evidence that AI doesn't work.

The METR study captured this perfectly. Experienced developers on their own repos got slower because they were spending time prompting, reviewing, and correcting AI output within a workflow designed for human-only development. The tools weren't the problem. The workflow was.

Meanwhile, the teams pushing through the J-curve are seeing compounding returns. Look at the economics of AI-native companies.

Even allowing for measurement differences, the top AI-native startups are running at roughly 5-6x the efficiency of traditional software companies. That's not marginal: it's a different operating model entirely.


Practical Steps to Level Up

So how do we move our teams up the maturity curve? Based on patterns emerging from early adopters including StrongDM's pioneering Dark Factory program, here's what actually works at each transition.

Moving from Level 0 → Level 1: Give AI Real Tasks

The shift: Stop using AI as autocomplete. Start treating it as a capable intern.

Moving from Level 1 → Level 2: Expand the Scope

The shift: Let AI handle multi-file changes and navigate your codebase.

Moving from Level 2 → Level 3: Change the Relationship

The shift: This is the hardest transition. You stop writing code with AI help and start managing AI that writes code.

Moving from Level 3 → Level 4: Evaluate Outcomes, Not Code

The shift: You can no longer read every diff. You need a quality system that doesn't depend on it.

StrongDM's factory uses end-to-end scenarios — closer to a holdout set than a unit test — stored outside the codebase so the agent can't overfit to them[1]. Create a scenarios/ repo where each scenario has: setup, action, expected outcome (with tolerances), and red flags (security, data loss).
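A minimal sketch of what one such scenario could look like in code, assuming a Python harness. The `Scenario` fields mirror the setup/action/expected-outcome/red-flags shape described above; the invoicing example is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Scenario:
    """One end-to-end scenario, kept outside the main codebase so the
    agent can't overfit to it (closer to a holdout set than a unit test)."""
    name: str
    setup: Callable[[], Any]        # build the world the scenario runs in
    action: Callable[[Any], float]  # drive the system, return the observed outcome
    expected: float                 # expected outcome...
    tolerance: float                # ...with an explicit tolerance
    red_flags: list[str] = field(default_factory=list)  # e.g. security, data loss


def run_scenario(s: Scenario) -> bool:
    """Pass iff the observed outcome lands within tolerance of the expectation."""
    world = s.setup()
    observed = s.action(world)
    return abs(observed - s.expected) <= s.tolerance


# Hypothetical scenario: an invoicing flow should total to within a cent.
invoice_total = Scenario(
    name="invoice-totals",
    setup=lambda: [19.99, 5.00, 0.01],
    action=lambda items: round(sum(items), 2),
    expected=25.00,
    tolerance=0.01,
    red_flags=["money-handling"],
)
```

Because the expected outcome carries an explicit tolerance, the scenario evaluates behaviour rather than implementation, which is exactly what lets it survive an agent rewriting the code underneath it.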

[Flowchart: scenarios vs. traditional tests, comparing human-written scenarios with developer-written tests across failure modes, probabilistic outcomes, agent behaviour, and evaluation methods.]

This is the bridge where engineering becomes closer to product QA and safety engineering — making it hard for the agent to ship something wrong, without relying on a human reading the diff.

Moving from Level 4 → Level 5: Going Dark

The shift: Remove human code review without removing control. This is less about trusting the model and more about building a factory where the model can't do anything unsafe.


The Junior Pipeline Problem

There's a downstream consequence of this shift that every engineering leader needs to grapple with. Junior developer job postings in the US have dropped 67%. UK graduate tech roles fell 46% in 2024, with projections of 53% by 2026.

The traditional apprenticeship model — where juniors learn by doing grunt work that AI now handles better and faster — is collapsing. The junior of 2026 needs systems design understanding that was expected of a mid-level engineer in 2020.

This isn't just a hiring problem. It's a pipeline problem. If we're not careful, we'll wake up in five years with a generation gap in our engineering organisations — plenty of senior engineers who grew up writing code, and no mid-levels who understand the craft deeply enough to direct AI effectively.

The teams investing in specification writing, systems thinking, and AI-native workflows as core training for new engineers are the ones building a sustainable talent pipeline.

How to Train Juniors in an AI-Native Org

The apprenticeship model isn't dead; it needs redesigning. I suggest replacing "write boilerplate and fix bugs" as the learning path for new joiners with specification writing, systems thinking, and supervised AI-native workflows.

Give these juniors sandboxes to play with and ensure they deploy something real to production on their first day in the company. For these new joiners, we are teaching them the systems engineering methods of the future. The teams that figure this out first will have a talent advantage that compounds over years.


Where to Start

If you take one thing from this blog: don't assess your AI tools. Assess your AI workflows.

  1. Be honest about where you are. Walk your team through the 5 Levels. Most will say Level 3 or 4. But unless the evidence backs that up, they'll be wrong. That gap is the starting point.
  2. Set the foundation, then raise the bar every sprint. Add an AGENTS.md to your top repo today — project conventions, build commands, boundaries. It's the single highest-leverage starting point. Then make AI workflow improvement a standing agenda item: each sprint, pick one task the team still does manually that AI could own, run it through the agent, and record what worked, what broke, and what the spec was missing. Over weeks, your context files get sharper, your specs get tighter, and the level your team operates at quietly climbs. And build these as platform capabilities, not team-by-team efforts. A shared context file convention, a common spec template, and a reusable eval harness mean every team benefits from what any one team learns.
  3. Measure, compare, compound. Ship one real change through an agent end-to-end — not a demo, not a toy. Measure time, cost, and quality against your manual estimate. Then keep a simple scorecard: AI-assisted vs. human-only cycle time, rework rate, and cost per change. Review it monthly. The teams that move up the maturity curve fastest aren't the ones that make one big leap — they're the ones that run small experiments every week, feed the results back into better specs and context files, and let the gains compound. Continuous improvement isn't a mindset shift. It's a habit built one iteration at a time.
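As an illustration of point 3, here's a minimal Python sketch of such a scorecard. The fields and example numbers are hypothetical; the point is that the comparison is recorded per change, not recalled from memory.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """One shipped change, tagged for the monthly scorecard."""
    ai_assisted: bool
    cycle_hours: float  # idea-to-merge time
    reworked: bool      # needed a follow-up fix after merge
    cost: float         # tool spend plus engineer time, in dollars


def scorecard(changes: list[Change]) -> dict:
    """Compare AI-assisted vs. human-only changes on cycle time,
    rework rate, and cost per change."""
    def summarise(group: list[Change]) -> dict:
        n = len(group)
        if n == 0:
            return {"n": 0}  # nothing to average over this period
        return {
            "n": n,
            "avg_cycle_hours": sum(c.cycle_hours for c in group) / n,
            "rework_rate": sum(c.reworked for c in group) / n,
            "cost_per_change": sum(c.cost for c in group) / n,
        }

    return {
        "ai_assisted": summarise([c for c in changes if c.ai_assisted]),
        "human_only": summarise([c for c in changes if not c.ai_assisted]),
    }
```

Review the output monthly: if the AI-assisted column isn't pulling ahead on cycle time without the rework rate creeping up, that's a signal the workflow, not the tool, needs the next iteration.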

That's it. No major investment. No reorganisation. Undertake an honest self-assessment and implement a workflow that compounds learning and productivity every week.


The Bottom Line

The dark factory doesn't need more engineers. But it desperately needs better ones.

The teams that will thrive aren't the ones with the best AI tools. They're the ones who've redesigned their workflows, retrained their people, and are honest about separating how AI feels from hard facts around productivity.

The maturity curve is real. The J-curve dip is real. And the gap between teams that push through and teams that plateau is only going to widen.

The big question facing engineering teams, I'd suggest, isn't whether your team will adopt AI. It's whether you'll redesign your workflow around it.


#AI #SWE #DarkFactory #SoftwareEngineering