
How I vibe-coded an LLM Decision Council that beats GPT-5.2-pro for executive work

Nine months ago I tried building an LLM council with ChatGPT‑4.1 and 4o to help me produce executive deliverables and make decisions in my role.

I got part of the way there, but the project stalled.

When Andrej Karpathy shared his LLM Council build, it was the nudge I needed to dust off the old designs and see how far I could push the new coding-specialist models - OpenAI Codex Max and Claude Opus 4.5 - on that original idea. To my surprise, these models built the platform to a working end-state with only light nudging from me.

This post is the story of that build, how it compares to Karpathy’s original LLM Council, why the result sometimes beats ChatGPT‑5.2‑pro on the work I actually care about, and what that says about how far AI coding agents have come in 12 months.

[Image: Andrej Karpathy's LLM Council tweet, Nov '25]


TL;DR


Context: Why I Wanted My Own Decision Council

Karpathy’s LLM Council landed at exactly the right moment.

It made a simple point: instead of trusting a single model, you can treat models like a council of experts. You send your prompt to all of them, they answer independently, they review each other’s work, and then a “Chair” model synthesises a final answer.
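To make that concrete, the whole pattern fits in a few lines of Python. This is a deliberately minimal sketch, not Karpathy's actual code: the model names are placeholders and ask() stands in for a real chat/completions call.

```python
import asyncio

COUNCIL = ["model-a", "model-b", "model-c"]   # placeholder model names
CHAIR = "model-chair"                         # placeholder Chair model

async def ask(model: str, prompt: str) -> str:
    """Placeholder for a real chat/completions call to `model`."""
    raise NotImplementedError

async def council_round(question: str) -> str:
    # Stage 1: every council model answers independently
    answers = await asyncio.gather(*(ask(m, question) for m in COUNCIL))
    # Stage 2: every council model reviews the full set of answers
    review_prompt = question + "\n\nPeer answers:\n" + "\n---\n".join(answers)
    reviews = await asyncio.gather(*(ask(m, review_prompt) for m in COUNCIL))
    # Stage 3: the Chair synthesises answers and reviews into one response
    chair_prompt = review_prompt + "\n\nPeer reviews:\n" + "\n---\n".join(reviews)
    return await ask(CHAIR, chair_prompt)
```

Everything interesting in a real build lives inside ask() and the prompts; the council structure itself is tiny.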

That caught my eye because it lined up with three things I already cared about:

  1. Model diversity: I don’t want to bet everything on one vendor or one model family.
  2. Evaluation in the loop: Real‑world work doesn’t look like synthetic benchmarks; I want per‑prompt evaluation where disagreement is visible.
  3. Enterprise reality: Our clients sit under different regulated entities and care about vendor risk, model risk, and audit trails.

Over time I’ve started to think of this less as “multi‑agent chat” and more as a decision council: a small group of deliberately different roles (strategy, risk, architecture, decision science) that happen to be implemented with LLMs.

Karpathy’s repo demonstrated an interesting pattern. I wanted something more opinionated and closer to my own use cases, but still small enough to build as a ‘weekend project’.

So I set myself two constraints:


The First Attempt: When ChatGPT‑4.1/4o Ran Out of Steam

I actually tried this nine months earlier.

The spec was almost identical:

ChatGPT‑4.1 and 4o were good enough to scaffold pieces – a FastAPI or Node backend here, a React shell there – but a few things consistently broke the build:

I could have pushed through manually, but that defeated the point. The whole experiment was about seeing how far the models themselves could go.

So the repo sat idle.


The Rematch: OpenAI Codex Max + Opus 4.5

Fast‑forward nine months. OpenAI Codex Max arrives – highly controllable and precise with good context, but slow. Claude Opus 4.5 lands soon after – faster, strong on reasoning and code, but still prone to overconfidence and the occasional hallucination.

I dusted off the original idea and ran a slightly updated approach:

  1. Start from a blank repo. Empty project with a clear goal.
  2. Invest as much time as possible in context engineering. Pull in Karpathy’s repo and a few others, plus specs for the OpenAI and OpenRouter SDKs, FastAPI, key libraries and some sociology and psychology texts on decision making.
  3. Define the system before coding. Draft AGENTS.md, architecture.md, prd_llm_council.md and business-process-flow.md, all cross‑linked so the models can "walk" the design.
  4. Write a concrete plan.md. Break the work into ordered steps from backend skeleton → council pipeline → UI → config → persistence → tests.
  5. Then let the models drive. With docs, plan and .env keys in place, hand implementation over to OpenAI Codex Max and Opus 4.5.

A note on the business process flow and council design: every run starts with triage (is this question important enough for a council?), moves into parallel analysis from different seats, then a targeted peer‑review round, and finally a synthesis pass.

I’ve structured it this way because research on group decision‑making, Delphi‑style iterative surveys, and social psychology around dissent all point to the same pattern: you get better decisions when you deliberately surface conflicting perspectives, keep them independent for as long as possible, and only then force a reconciliation with a clear rationale.

This time, both the process and the results from the build were markedly different.

My human contributions were mostly:

The end result is a working LLM council that, in practical use, sometimes beats ChatGPT‑5.2‑pro on the tasks I actually care about.


Inside the LLM Decision Council

At a high level, my council still shares the same three‑stage shape as Karpathy’s – answers, critique, synthesis – but under the hood the pipeline is closer to a Delphi‑style decision process.

[Infographic: the LLM Decision Council at a glance: inputs and triggers, the decision-making process (parallel execution and peer reviews), council composition and seat roles, and the tech stack and engineering features]

1. How the Decision Council Runs

Behind the scenes, an orchestrator runs a Delphi‑style process (sketched in code after the list below):

  1. Triage and task selection. Check whether the question is council‑worthy and pick the right task preset (diagnose_problem, design_solution, review_solution, client_ready_brief, pre_mortem, etc.).
  2. Round 1 – Parallel seat analysis. A subset of the 13 seats (Strategy, Product, Tech Architecture, Data & AI, Risk/Compliance, Legal, Finance, Operations, Customer/UX, Org Change, Industry SME, Red Team, Decision Science) runs in parallel on the same brief, each with its own persona and structured output template.
  3. Round 2+ – Informed deliberation (optional). Seats can revise their view with awareness of others, à la Delphi.
  4. Pairwise adversarial peer review. Certain seats are wired to challenge others (for example, Red Team → Strategy/Tech; Risk/Compliance → Data & AI/Operations; Decision Science → Strategy/Chair).
  5. Chair synthesis. A dedicated Chair seat pulls everything together with progressive disclosure (one‑liner → exec brief → full analysis) and confidence synthesis.
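For the technically curious, here is a heavily simplified sketch of the shape of that orchestration loop. Treat it as illustrative rather than the production code: run_seat() is a placeholder for a persona-prompted model call, and only the review pairs listed above come from the design.

```python
import asyncio

# Adversarial peer-review wiring: reviewer -> seats it must challenge
REVIEW_PAIRS = {
    "Red Team": ["Strategy", "Tech Architecture"],
    "Risk/Compliance": ["Data & AI", "Operations"],
    "Decision Science": ["Strategy", "Chair"],
}

async def run_seat(seat: str, brief: str, context: str = "") -> dict:
    """Placeholder: call the seat's model with its persona prompt and
    return a structured result (analysis, confidence, etc.)."""
    raise NotImplementedError

async def run_council(brief: str, seats: list[str], rounds: int = 1) -> dict:
    # 1. Triage: is this question council-worthy, and which task preset applies?
    preset = await run_seat("Triage", brief)
    if not preset.get("council_worthy"):
        return await run_seat("Solo Expert", brief)

    # 2. Round 1: all selected seats analyse the same brief in parallel
    analyses = dict(zip(seats, await asyncio.gather(
        *(run_seat(s, brief) for s in seats))))

    # 3. Optional Delphi rounds: seats revise with awareness of the others
    for _ in range(rounds - 1):
        shared = "\n".join(f"{s}: {a['analysis']}" for s, a in analyses.items())
        analyses = dict(zip(seats, await asyncio.gather(
            *(run_seat(s, brief, context=shared) for s in seats))))

    # 4. Pairwise adversarial peer review for the wired seat pairs
    reviews = []
    for reviewer, targets in REVIEW_PAIRS.items():
        for target in targets:
            if target in analyses:
                reviews.append(await run_seat(
                    reviewer, brief,
                    context=f"Challenge {target}: {analyses[target]['analysis']}"))

    # 5. Chair synthesis over all analyses and reviews
    chair_context = f"Analyses: {analyses}\nReviews: {reviews}"
    return await run_seat("Chair", brief, context=chair_context)
```

The real pipeline wraps each of these calls with the structured output templates, cost tracking and persistence described elsewhere in this post.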

[Screenshot: the starting screen, where you compose the brief and scenario details]

Conceptually, this still collapses down to three stages – answers, critique, and synthesis – so it’s easy to explain. In code, it’s an async FastAPI service with a decision orchestration layer on top of OpenAI’s Responses API rather than a single “call all models once” loop.
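A single seat call then bottoms out in something like the following. This is a minimal sketch assuming the official openai Python SDK and FastAPI; the route, request schema and model name are illustrative, not lifted from the repo.

```python
from fastapi import FastAPI
from openai import AsyncOpenAI
from pydantic import BaseModel

app = FastAPI()
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

class SeatRequest(BaseModel):
    seat: str      # e.g. "Risk/Compliance"
    persona: str   # the seat's system-style instructions
    brief: str     # the question or decision brief

@app.post("/seat/run")
async def run_seat(req: SeatRequest) -> dict:
    # One seat = one Responses API call, with the seat persona as instructions
    response = await client.responses.create(
        model="gpt-5.2",          # illustrative model choice
        instructions=req.persona,
        input=req.brief,
    )
    return {"seat": req.seat, "analysis": response.output_text}
```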

On top of that, the system adds:

2. UI Design

I have gone for function over form, leaving the aesthetic polish for a later pass. Rather than hide everything behind a single chat box, the UI design tries to make the process obvious:

[Screenshot: sample Chair output (redacted). The full report is much longer and more detailed than this screenshot, including tables and charts]

That matters, because a big part of the value is seeing where the models disagree. I don’t always accept the Chair’s verdict; sometimes the most interesting signal is that one model is stubbornly dissenting. I also lean on the Chair’s progressive disclosure – a one‑liner, a short exec summary, and a full synthesis – so I can skim or paste the right layer straight into a deck or email.

[Screenshot: peer review example (redacted), where a seat provides feedback for action]

[Screenshot: council dynamics summary]

[Screenshot: seat run summary]

3. Tuning for My Workflows

I don’t run every question through a council. That would be expensive and slow. Instead, I’ve tuned it for a few specific workflows:

[Screenshot: sample seat report (redacted)]

In each of these, the council is used sparingly, for high‑leverage prompts where disagreement matters. One of the most useful “seats” isn’t domain‑specific at all: a Decision‑Science seat that classifies the decision (one‑way vs two‑way door), calls out obvious biases, and forces a quick pre‑mortem (“it’s two years from now and this failed – what went wrong?”). In practice I run a light council (a handful of seats, single round) for scoping, and only switch to a full council (more seats, multi‑round deliberation and peer review) for genuinely high‑stakes questions, often with real documents and company context loaded through the RAG layer.
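To give a flavour of how a seat is framed, here is a cut-down, illustrative version of what a Decision-Science-style persona prompt can look like. The real seat prompt is longer and paired with a structured output template.

```python
# Illustrative persona prompt for the Decision Science seat (not the production prompt)
DECISION_SCIENCE_PERSONA = """
You are the Decision Science seat on an executive decision council.
For the brief you are given:
1. Classify the decision: one-way door (hard to reverse) or two-way door (cheap to reverse).
2. Name the most likely cognitive biases at play (e.g. sunk cost, anchoring, optimism bias).
3. Run a short pre-mortem: it is two years from now and this decision has failed badly;
   list the three most plausible reasons why.
4. End with a confidence score from 1-10 and one thing that would change your mind.
"""
```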


When does the council beat ChatGPT‑5.2‑pro?

Short answer: sometimes, on the margins that matter to me.

I haven't set up a formal benchmark (yet), but across a few dozen prompts in my real workflows, there is a pattern.

The cases where the council clearly "wins" look like this:

  1. Ambiguous, multi‑stakeholder questions. E.g. "How should ABC Bank Client structure an AI risk committee considering CPS 230 and CPS 234, and other obligations?" The council is better at surfacing different governance patterns, conflicting constraints, and explicit trade‑offs.

  2. Architecture or Strategy / Operating Model choices with non‑obvious trade‑offs. E.g. “How to trade off the stated constraints on time, budget, feature breadth, and architectural patterns to deliver the outcome by X date and to the Y benefits case?” Different models tend to favour different patterns; the Chair is forced to reconcile them.

  3. Long‑horizon reasoning. When I ask "what happens in three years if we choose X vs Y?", the diversity of council answers seems to reduce the risk of a single‑model “hallucinated future”.

On simple factual questions, ChatGPT‑5.2‑pro alone is perfectly fine (and cheaper). The council helps when the shape of the disagreement is itself useful signal. Because every seat and the Chair emit 1–10 confidence scores – plus a short note on what would change their mind – I also end up with a crude confidence range and the limiting factors behind each recommendation, which is often more valuable than a single confident paragraph.
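As a rough illustration of that structured output (the field names here are mine for the example, not a spec), the per-seat and Chair results can be modelled with something like Pydantic:

```python
from pydantic import BaseModel, Field

class SeatVerdict(BaseModel):
    seat: str
    recommendation: str
    key_risks: list[str]
    confidence: int = Field(ge=1, le=10)   # 1-10 confidence score
    would_change_my_mind: str              # what evidence would flip this view

class ChairSynthesis(BaseModel):
    one_liner: str                         # progressive disclosure: headline
    exec_brief: str                        # short executive summary
    full_analysis: str                     # full synthesis with trade-offs
    confidence: int = Field(ge=1, le=10)
    dissenting_seats: list[str]            # seats that still disagree with the verdict
```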


What’s better than Karpathy’s original (for my use case)

Karpathy’s LLM Council is intentionally a weekend vibe‑code. For my use, I needed a bit more structure.

Four things appear to be clear upgrades for my world:

  1. Configurable councils and tasks as first‑class objects. I can define multiple councils and tasks – "Architecture", "Risk", "Research", "Client Brief" – each with their own seat mix, prompts, and Chair, rather than hard‑coding a single loop (see the config sketch after this list).

  2. Persistence, replay and cost awareness. Every run is stored with its brief, seat outputs, and a cost breakdown, which makes it easy to replay the same problem with a different model mix and to have a sane conversation about spend.

  3. More opinionated prompts and outputs. For example, risk‑focused councils are forced to output options, risks, mitigations, and an explicit recommendation, not just prose.

  4. Role‑based seats and conflict by design. Seats are prompted as different stakeholders – strategy, product, tech, risk, finance, decision‑science, red‑team – and some are there specifically to challenge others. Combined with confidence scoring on every seat and the Chair, the goal isn’t polite consensus; it’s structured disagreement the council has to resolve.
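As promised above, here is a rough sketch of what “councils and tasks as first-class objects” can look like in code. The dataclasses and the example Risk council are illustrative; the real config carries more than this.

```python
from dataclasses import dataclass, field

@dataclass
class Seat:
    name: str            # e.g. "Risk/Compliance"
    model: str           # which underlying model backs this seat
    persona: str         # the seat's system-style prompt
    challenges: list[str] = field(default_factory=list)  # seats it must peer-review

@dataclass
class Council:
    name: str            # e.g. "Architecture", "Risk", "Client Brief"
    task: str            # e.g. "design_solution", "pre_mortem"
    seats: list[Seat]
    chair: Seat
    rounds: int = 1      # Delphi-style deliberation rounds

# Illustrative council definition; model names are placeholders
RISK_COUNCIL = Council(
    name="Risk",
    task="review_solution",
    seats=[
        Seat("Risk/Compliance", "model-a",
             "You are the risk and compliance seat...",
             challenges=["Data & AI", "Operations"]),
        Seat("Red Team", "model-b",
             "You exist to attack the proposal...",
             challenges=["Strategy", "Tech Architecture"]),
    ],
    chair=Seat("Chair", "model-c", "Synthesise the council's views..."),
)
```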

None of this is rocket science, but it shifts the council from a "cool demo" to something I actually use with clients and internal work.


What’s still rough

It’s not all upside. A few things are clearly missing if this were ever to be more than a power‑user tool.

1. Evaluation Harness

Right now, there are a few things still missing:

Without that, this is still anecdotal – useful for me, but not something I’d present as a benchmark.

2. Observability and Cost Controls

I already track tokens, USD/AUD costs, and run traces through the FastAPI layer, which is plenty for a single‑developer setup. A production‑grade council in a bank, though, would need:

That’s especially important if this were to be run in a regulated setting.

3. Governance Hooks

Today, the governance story is still light: the system is multi‑tenant‑aware and there’s a decision journal and company context store, but I’m the one reviewing prompts, sanity‑checking outputs, and deciding when to use the council.

If I were deploying this into a large enterprise, I’d want:

All solvable – but not yet added to this build.


Australian Perspective: Councils as a Governance Tool

From an Australian lens, LLM councils are interesting for another reason: they map quite neatly to the themes regulators are already pushing.

I don’t think “council‑as‑a‑service” is a regulatory silver bullet. But as a pattern, it’s much easier to explain in a boardroom than a generic "AI and human report".


Implications and Take‑Aways

For founders and product teams:


What’s Next

From here, my priorities are simple:

  1. Build a tiny eval harness with a dozen of my real prompts and a basic scoring loop, using the Solo Expert baseline seat as the comparison.
  2. Tighten observability so I can see cost, latency, and failure modes more clearly and hook into the rest of our monitoring stack.
  3. Use the decision journal more deliberately – tracking recommendations and outcomes for a handful of client and internal decisions – so future councils can learn from past ones.
  4. Write a follow‑up with a formal set of benchmarks and a cross‑model comparison.
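For the first of those, the harness really can be tiny. Here is a sketch of the shape I have in mind; the prompt file format and the helpers for running the council, the Solo Expert baseline and the scoring are all placeholders.

```python
import json

def run_council(prompt: str) -> str:
    """Placeholder: run the full council and return the Chair synthesis."""
    raise NotImplementedError

def run_solo_expert(prompt: str) -> str:
    """Placeholder: run the single Solo Expert baseline seat."""
    raise NotImplementedError

def score(answer: str, rubric: str) -> float:
    """Placeholder: score the answer 0-1 against the rubric (judge model or manual)."""
    raise NotImplementedError

def run_eval(prompts_path: str = "eval_prompts.jsonl") -> None:
    # One JSON object per line: {"prompt": ..., "rubric": ...}, drawn from real workflows
    with open(prompts_path) as f:
        cases = [json.loads(line) for line in f]
    for case in cases:
        council = score(run_council(case["prompt"]), case["rubric"])
        baseline = score(run_solo_expert(case["prompt"]), case["rubric"])
        print(f"{case['prompt'][:60]!r:62} council={council:.2f} baseline={baseline:.2f}")
```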

If you’re a business or technical leader playing around with LLM councils like this and want to compare notes, reach out – I’m curious what councils look like in your world.