Reasoning About AI Reasoning: Letting the LLM Cook

From Step-by-Step Prompts to Skill-Driven Workflows

Hypothesis: Skill-driven LLMs outperform hand-authored declarative flows for procedural reasoning at scale.

The more I work with declarative AI harnesses and Claude skills side-by-side, the more I see a particular pattern showing up: the best results often happen when I give an AI agent the tools, context, and requirements, then let it reason on its own. “Letting the LLM cook” as I’ve started to call it.

For example, I’ve been using a workflow like this to build and update sections of my site:

1. Take screenshots of website section designs I like for context.
2. Reason through the UI content fields needed to model a similar section in a CMS.
3. Create the CMS content model.
4. Pull down the content schema and auto-generate types for the codebase.
5. Give the LLM architectural context through Cursor or Claude Code.
6. Have the LLM update the data query to fetch the content for the desired section or page.
7. Let the LLM create the first pass of the UI component code from the screenshots.
8. Let the LLM wire the content data to the component properties.
9. Test and modify as necessary.

When attempting to carry this out by declaring each individual step and holding the LLM’s hand topic by topic, the results were fragile at best.

But when building the skills, tools, and then letting the LLM cook? The results were shockingly good.

The takeaway requires a bit of humility. LLMs are becoming very capable at procedural reasoning: knowing which steps to take, which order to take them in, and which tools to call in order to carry out user intent. In many cases, they may be better at that runtime sequencing than humans trying to declaratively define all possible flows in advance.

The Runtime Advantage of LLM Reasoning

It's an interesting observation, but the next obvious question is why? What first principles help explain it?

One key insight is this: LLM reasoning happens inside the context of the user’s actual question or intent at runtime. It does not have the disadvantage of having to guess in advance which user flows it may have to support. That ability to reason within the exact context of the request is a remarkable advantage.

Engineers have spent decades becoming remarkably good at domain modeling, logic flows, and declaring how programs should behave at runtime. In the abstract, that has always been a large part of software engineering.

But declarative human orchestration patterns are still, at some level, a bet that a human can predict the meaningful paths a user might take before the user takes them.

Even the smartest humans, or machines for that matter, struggle to compete with runtime problem-solving while carrying the compile-time burden of predicting the millions of questions a user might ask. And even if every potential query were known beforehand, no declarative logic map could fully compete with the ability to draw on the context leading up to the question, the user’s exact intent, and access to fresh data in real time.

Being able to reason in the moment is the fundamental advantage.

If that is true, it calls for some humility, especially since human reasoning cannot be inserted directly into the moment of AI inference. It is tempting to micromanage every instruction, to “code” the AI. But if the goal is true scalability, “if this, then do that” starts to look like an anti-pattern.

Of course, mechanically separate QA and validation become crucial in architectures that delegate procedural reasoning. The human value moves further up the abstraction chain into designing agent loops, evaluation layers, guardrails, and success criteria — a topic for another day.

Nonetheless, the point holds: the advantage comes from building the right tools and context, then giving the LLM enough room to reason.

Extending the Pattern Up the Abstraction Chain

I’m sure the hypothesis will be debated for some time across the AI field.

But for a moment, let’s assume it is directionally correct: runtime reasoning paired with exact context is what makes AI tool selection and procedural reasoning so powerful.

If true, why stop at delegating MCP tool selection to LLM reasoning? Couldn’t an agent reasoning within the full runtime context also pick the right skill for the intent? And why stop there? Couldn’t it delegate to the right sub-agent, using the most cost-conscious model capable of satisfying the request? Wouldn’t it stand to reason that the LLM with the exact context and real-time model pricing would make a better call?

At some point, the same question keeps reappearing at each layer of abstraction: if the LLM has the context, the available options, the constraints, and the success criteria, why assume a pre-declared orchestration flow will make the better decision?

There are plenty of implementation details to work through, but I do not see an obvious reason the “let the LLM cook” hypothesis would only apply at the tool-calling layer.

Many teams have already spent time orchestrating automation flows around their best guesses of user intent, only to later move to AI skills once runtime reasoning proved more flexible than declarative flows.

Will much of today’s declarative harness and orchestration work meet the same fate once LLM runtime reasoning improves at the next level of abstraction?

The Cost Advantage: Right-Sizing Agent Intelligence

Given the current moment in the AI era, with token spend under scrutiny and ROI becoming harder to hand-wave, the possibility of letting AI agents right-size delegated work per intent is especially intriguing.

Some user intents may need an expensive frontier model. Others may be perfectly handled by an open-source model six to twelve months behind the frontier. Just as some tasks require PhD-level reasoning while others need something closer to a competent middle-school tutor.

It would be impossible for humans to declaratively route every possible intent to the perfectly sized level of model intelligence. But it is not theoretically impossible for an AI agent to do so, especially if that agent is reasoning within the runtime context of the request.

An AI agent could evaluate the goal, the available tools, the model options, and the current pricing landscape, then reason about the best outcome per unit of cost. It could even loop with the user, or the agent calling it, to negotiate tradeoffs between quality, latency, and price.

At scale, that matters. High-volume delegation of lower-complexity work to cheaper models could save organizations millions. Over time, agentic marketplaces could emerge once there is enough liquidity around specific outcomes, capabilities, and price points.
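To make "best outcome per unit of cost" concrete, here is a minimal sketch of a selection helper an agent could call once it has judged how much capability a request needs. The catalog, the model names, the 1 to 10 capability scale, and the prices are all hypothetical placeholders, not real quotes or benchmarks.

```typescript
// Hypothetical catalog: names, capability scores, and prices are illustrative only.
type ModelOption = {
  name: string;
  capability: number; // assumed 1-10 score for reasoning ability
  usdPerMillionTokens: number;
};

const catalog: ModelOption[] = [
  { name: "frontier-large", capability: 9, usdPerMillionTokens: 15 },
  { name: "open-weights-medium", capability: 6, usdPerMillionTokens: 0.8 },
  { name: "open-weights-small", capability: 3, usdPerMillionTokens: 0.1 },
];

// Return the cheapest model whose capability meets the required level,
// or undefined when nothing in the catalog qualifies.
function pickModel(
  requiredCapability: number,
  options: ModelOption[],
): ModelOption | undefined {
  return options
    .filter((m) => m.capability >= requiredCapability) // filter() copies, so the catalog is not mutated
    .sort((a, b) => a.usdPerMillionTokens - b.usdPerMillionTokens)[0];
}
```

The arithmetic is deliberately trivial. The interesting part, deciding the value of requiredCapability for a given intent, is the judgment the agent would make at runtime.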

In my mind, the next question becomes: what interfaces could support right-sized functionality delegation across abstraction layers with enough network effect to realistically unlock this kind of ecosystem?

MCP as an Interface for AI Delegation

One possibly elegant solution would be to re-use the interface Anthropic, OpenAI, Cursor, Hugging Face, and the rest of the AI ecosystem already use: Model Context Protocol, or MCP.

As a quick refresher, MCP is an adapter pattern that allows LLMs to interface with third-party functionality and data through a shared protocol. But if MCP can expose a database lookup, a filesystem operation, or a content retrieval tool, could it also expose higher-level functionality that is selected and loaded at runtime?

From the protocol’s perspective, a function that performs a database lookup and a function that runs a recursive agentic loop are not fundamentally different. Both are externally exposed capabilities with inputs, outputs, and a contract.
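A small sketch of that idea, assuming a simplified tool contract: the Tool type below is a stand-in I made up for illustration, not the MCP SDK's actual type, and both tools (db_lookup, refine_draft) are hypothetical. Everything is kept synchronous for brevity, whereas real MCP calls are asynchronous.

```typescript
// A simplified stand-in for a tool contract; not the MCP SDK's real type.
type Tool = {
  name: string;
  description: string;
  run: (input: Record<string, unknown>) => string;
};

// A plain lookup against an in-memory table (hypothetical data).
const lookupTool: Tool = {
  name: "db_lookup",
  description: "Return the title stored for a page id.",
  run: (input) => {
    const pages: Record<string, string> = { home: "Welcome", about: "About us" };
    return pages[String(input["id"])] ?? "not found";
  },
};

// A bounded loop that trims a draft word by word; stands in for an agentic loop.
const refineTool: Tool = {
  name: "refine_draft",
  description: "Trim a draft repeatedly until it fits a length budget.",
  run: (input) => {
    let draft = String(input["draft"]);
    const maxLength = Number(input["maxLength"]);
    for (let step = 0; step < 10 && draft.length > maxLength; step++) {
      draft = draft.split(" ").slice(0, -1).join(" "); // drop the last word
    }
    return draft;
  },
};

// The caller treats both the same way: a name and arguments in, text out.
function callTool(tools: Tool[], name: string, input: Record<string, unknown>): string {
  const tool = tools.find((t) => t.name === name);
  if (!tool) throw new Error(`Unknown tool: ${name}`);
  return tool.run(input);
}
```

From callTool's point of view, nothing distinguishes the one-shot lookup from the loop. That is the property the rest of this post leans on.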

That raises the interesting question: could the same runtime tool-calling pattern that already works for MCP also work for skills, dynamic functionality, and recursive agent orchestration?

I ran this experiment (GitHub link) at the skill layer of abstraction, and the early results pointed to what seemed to be a yes.

Dynamically Fetching Skills at Runtime via MCP

Critics of MCP might quickly point out one of its practical limitations: loading too many MCP tools into the context window becomes prohibitively expensive and unwieldy rather quickly.

For that reason, I created a development repository to test whether MCP tools and skills could be discovered and pulled in dynamically at runtime.

The build followed a few steps.

1. Established a Monorepo with a Gateway MCP and Skill-Fetcher MCP

Inside a pnpm monorepo, I split the experiment into three self-contained packages: a gateway MCP server, a skill-fetcher MCP server, and an agent test harness. This way, each piece could be developed, inspected, and tested independently.

Additionally, Streamable HTTP was used to communicate between MCP servers on different local ports in order to more closely simulate a production environment.

The agent only ever knew about the Gateway MCP server endpoint, connecting to it like any other MCP tool.

2. Built a Gateway MCP Server Fronting a Registry of Child MCPs

Using the official MCP TypeScript SDK, I built the front door gateway MCP server exposing two tools: list_available_mcps for discovery, and invoke_mcp_tool for routing.

Behind this, I registered a few stubbed MCP tools as well as the skill-fetcher MCP server. Importantly, the agent discovered the available child MCP servers at runtime. One could imagine a shared-responsibility model where federated teams publish their own MCP endpoints into a discovery mechanism, enabling new functionality to be deployed without manually rewiring the consumer’s MCP list.

Just as importantly, no task-specific functionality was wired directly into the Gateway MCP server. Its job was routing and discovery rather than execution.
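The discovery-plus-routing shape can be sketched without any SDK at all. The version below is an assumption-laden simplification rather than the code from my repository: child servers are plain in-memory objects instead of MCP client connections, calls are synchronous, and registerChild and the echo stub are hypothetical names introduced for illustration.

```typescript
// In-memory stand-ins for child MCP servers. A real gateway would hold MCP client
// connections; plain functions keep this sketch self-contained and synchronous.
type ChildTool = (args: Record<string, unknown>) => string;
type ChildServer = { description: string; tools: Record<string, ChildTool> };

const registry = new Map<string, ChildServer>();

// Teams could publish endpoints here without the agent's own config changing.
function registerChild(name: string, server: ChildServer): void {
  registry.set(name, server);
}

// Gateway tool 1: discovery. Returns names and descriptions only.
function listAvailableMcps(): { name: string; description: string; tools: string[] }[] {
  return Array.from(registry.entries()).map(([name, server]) => ({
    name,
    description: server.description,
    tools: Object.keys(server.tools),
  }));
}

// Gateway tool 2: routing. The gateway forwards the call and holds no task logic.
function invokeMcpTool(server: string, tool: string, args: Record<string, unknown>): string {
  const child = registry.get(server);
  if (!child) throw new Error(`Unknown MCP server: ${server}`);
  const handler = child.tools[tool];
  if (!handler) throw new Error(`Unknown tool "${tool}" on server "${server}"`);
  return handler(args);
}

// A stubbed child server, registered at startup.
registerChild("echo", {
  description: "Stub server that echoes its input.",
  tools: { echo: (args) => `echo: ${String(args["text"])}` },
});
```

Because the agent only ever sees these two functions, adding a new child server changes what discovery returns but not what the agent has to load up front.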

3. Built a Skill-Fetcher MCP Server to Dynamically Expose Skills

Using the same tooling and approach, I then built the skill-fetcher MCP server, which dynamically serves procedural skills through two tools, list_skills and fetch_skill. The underlying stubbed skills included content-qa, seo-audit, and tone-rewriter, each seeded with markdown.

Of course, in production, the skill registry could be dynamic and accommodate more complex functionality.

When the agent calls invoke_mcp_tool targeting skill-fetcher, the gateway opens a genuine MCP client connection to this server and relays the result. This is the "third-party functionality" I wanted to test: shipping capabilities as dynamic, fetch-able knowledge.
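As a rough illustration of the skill-fetcher's two tools, here is a sketch that serves markdown skills from an in-memory table. The skill names match the stubs mentioned above, but the markdown bodies, the front matter fields (name, description), and the tiny parser are assumptions made for this example.

```typescript
// Placeholder skill files. Bodies and front matter fields are illustrative only.
const skillFiles: Record<string, string> = {
  "content-qa":
    "---\nname: content-qa\ndescription: Review page copy for errors and gaps.\n---\n# Content QA\nList each issue with a suggested fix.",
  "seo-audit":
    "---\nname: seo-audit\ndescription: Check titles, headings, and meta tags.\n---\n# SEO Audit\nReport missing or duplicate tags.",
  "tone-rewriter":
    "---\nname: tone-rewriter\ndescription: Rewrite text in a requested tone.\n---\n# Tone Rewriter\nKeep meaning, change voice.",
};

// Read simple "key: value" pairs from the front matter block at the top of a file.
function parseFrontMatter(markdown: string): Record<string, string> {
  const match = /^---\n([\s\S]*?)\n---/.exec(markdown);
  const body = match?.[1] ?? "";
  const meta: Record<string, string> = {};
  for (const line of body.split("\n")) {
    const separator = line.indexOf(":");
    if (separator > 0) {
      meta[line.slice(0, separator).trim()] = line.slice(separator + 1).trim();
    }
  }
  return meta;
}

// list_skills: cheap metadata only, so the agent can choose without loading bodies.
function listSkills(): { name: string; description: string }[] {
  return Object.keys(skillFiles).map((slug) => ({
    name: slug,
    description: parseFrontMatter(skillFiles[slug])["description"] ?? "",
  }));
}

// fetch_skill: return the full markdown for the one skill the agent selected.
function fetchSkill(name: string): string {
  const markdown = skillFiles[name];
  if (markdown === undefined) throw new Error(`Unknown skill: ${name}`);
  return markdown;
}
```

Splitting metadata from bodies is the design choice that matters here: only the selected skill's markdown ever enters the context window.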

4. Wrote a Provider-Agnostic Agent Harness

Given the ultimate goal of dynamically spawning right-sized agent functionality, I wanted to be able to test the Gateway MCP hypothesis with any LLM. I also wanted to test with an open-source model without burning tokens or money. The harness therefore works with Anthropic models or any OpenAI-compatible endpoint, such as Ollama, Qwen, Groq, or OpenRouter.

After politely turning down Claude Code’s inclination to rebuild the Vercel AI SDK from scratch within my codebase, I simply used the SDK itself as the most straightforward way to interact with my choice of model. I decided on qwen3.6:35b-mlx for testing.
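To show what "provider-agnostic" can boil down to, here is a sketch that does not use the Vercel AI SDK at all. It only builds (and never sends) an OpenAI-style chat completions request from a provider table. The Ollama base URL is its local OpenAI-compatible endpoint; the "hosted-example" entry, its URL, its model name, and its key are hypothetical placeholders.

```typescript
// Provider table. "hosted-example" is a made-up placeholder entry.
type Provider = { baseURL: string; model: string; apiKey?: string };

const providers: Record<string, Provider> = {
  ollama: { baseURL: "http://localhost:11434/v1", model: "qwen3.6:35b-mlx" },
  "hosted-example": {
    baseURL: "https://api.example.com/v1",
    model: "example-model",
    apiKey: "YOUR_KEY",
  },
};

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Build, but do not send, an OpenAI-style chat completions request.
function buildChatRequest(providerName: string, messages: ChatMessage[]) {
  const provider = providers[providerName];
  if (!provider) throw new Error(`Unknown provider: ${providerName}`);
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (provider.apiKey) headers["Authorization"] = `Bearer ${provider.apiKey}`;
  return {
    url: `${provider.baseURL}/chat/completions`,
    headers,
    body: JSON.stringify({ model: provider.model, messages }),
  };
}
```

Swapping models then becomes a one-word change in the call, for example buildChatRequest("ollama", messages) versus buildChatRequest("hosted-example", messages), which is the property the harness needs.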

5. Ran Tests Forbidding Built-in Knowledge or Hints About Tooling

I constrained the agent with a system prompt stating that it had no built-in tooling and that it should discover and execute functionality exclusively through the Gateway MCP.

In a real-world scenario with an intelligent enough front door agent, this would largely be unnecessary, as the model would reason its way to invoking the gateway MCP rather than internal tools. With the open-source model, however, it was more efficient to point it straight at the gateway, which does not affect the dynamic loading hypothesis.

Success depended on whether the agent could find the MCP and skills functionality dynamically at runtime.

6. Confirmed the Dynamic Discovery Sequence

I then ran the agent harness with qwen3.6:35b-mlx and passed it a sample query designed to trigger dynamic selection of the right skill via the front matter metadata in the registry.

I watched the agent self-direct rather than take its word for it: unprompted, it called list_available_mcps, identified the Skill Fetcher as the right child, called invoke_mcp_tool → fetch_skill with content-qa, read the fetched markdown, and produced the QA report by following it.

The capability was never built into the agent; it was fetched just in time through MCP. I also wired up MCP Inspector to call each server's tools by hand and inspected the raw JSON to confirm that the gateway and skill-fetcher each behaved correctly in isolation.
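One way to check a run like this mechanically, rather than by reading logs, is to assert on the recorded tool calls. The sketch below assumes a hypothetical trace format; the argument names (server, tool, args, name) are my guesses at a reasonable shape, not the exact payloads from the experiment.

```typescript
// A recorded tool call. The shape is assumed for illustration.
type ToolCall = { tool: string; args: Record<string, unknown> };

// Hypothetical trace mirroring the sequence described above.
const observedTrace: ToolCall[] = [
  { tool: "list_available_mcps", args: {} },
  {
    tool: "invoke_mcp_tool",
    args: { server: "skill-fetcher", tool: "fetch_skill", args: { name: "content-qa" } },
  },
];

// True only if discovery happened first and the expected skill was fetched afterwards.
function followedDiscoverySequence(calls: ToolCall[], expectedSkill: string): boolean {
  const discoveryIndex = calls.findIndex((c) => c.tool === "list_available_mcps");
  const fetchIndex = calls.findIndex(
    (c) =>
      c.tool === "invoke_mcp_tool" &&
      c.args["tool"] === "fetch_skill" &&
      (c.args["args"] as { name?: unknown } | undefined)?.name === expectedSkill,
  );
  return discoveryIndex >= 0 && fetchIndex > discoveryIndex;
}
```

A check like followedDiscoverySequence(observedTrace, "content-qa") could run after every harness invocation, which turns "I watched it happen" into a repeatable assertion.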

In short, the experiment suggests that MCP can let LLMs “cook” at a higher level of abstraction: not only selecting tools, but dynamically discovering and loading procedural capabilities at runtime.

Guardrails for Runtime Reasoning

Of course, in practice, AI reasoning and federated functionality require guardrails. Letting the LLM cook does not mean letting it operate without constraints. The point is not to abdicate judgment, but to move judgment into the right layers of the system.

A few guardrails seem especially important.

Independent QA for agent reasoning: If AI is doing the reasoning, especially when the agent is mutating or accessing sensitive systems, the flow should include mechanically separate evaluations and QA checks.
Timeouts: Any MCP server dynamically calling third-party functionality, particularly expensive agents or long-running workflows, should have associated timeouts.
Loop termination: Any MCP server or agent loop calling recursive functionality should include explicit termination rules as a matter of security, reliability, and cost control.
Observability across federated systems: If functionality is federated across tools, skills, or agents, there should still be a consistent observability architecture tying together the full flow, the user, token spend, latency, tool calls, and other pertinent concerns.
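The timeout and loop termination guardrails are simple enough to sketch directly. This is a minimal illustration, not a hardened implementation: withTimeout and runBounded are hypothetical helper names, and the timeout only stops waiting for the result; it does not cancel the underlying work.

```typescript
// Reject if the wrapped call does not settle within `ms` milliseconds.
// Note: this stops waiting; it does not cancel the underlying work.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

type BoundedResult<S> = { state: S; steps: number; terminated: "done" | "budget" };

// Run a step function until it reports completion or the step budget runs out.
function runBounded<S>(
  initial: S,
  step: (state: S) => { state: S; done: boolean },
  maxSteps: number,
): BoundedResult<S> {
  let state = initial;
  for (let steps = 1; steps <= maxSteps; steps++) {
    const result = step(state);
    state = result.state;
    if (result.done) return { state, steps, terminated: "done" };
  }
  // The loop never decided it was finished, so the budget ends it.
  return { state, steps: maxSteps, terminated: "budget" };
}
```

Returning the termination reason alongside the result is a small choice that pays off for observability: a loop that finished and a loop that was cut off by its budget look different in the logs.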

The potentially immense value of runtime LLM reasoning comes with a corresponding responsibility: the more reasoning we delegate, the more seriously we need to design the evaluation, observability, and control layers around it.

Letting the LLM Cook

In conclusion, there appear to be practical ways to let LLMs reason at higher levels of abstraction, opening up a range of interesting possibilities.

The skills pattern already moved the line by demonstrating the value of reasoning within the exact context of a user intent at runtime. Extending that pattern to dynamic skills, delegated functionality, and even sub-agents follows the same thread. Dynamic discovery via MCP provides one possible mechanism for federated runtime functionality.

It does make one wonder if it’s wise to build harnesses and orchestration systems with pre-declared “orchestrator” agents and “specialist” agents. If LLMs can outperform declarative flows at the tool-selection layer, it feels like a possible anti-pattern to handle higher-level orchestration declaratively by default.

Finally, it opens up an even larger question: if runtime agent reasoning can outperform declarative flows for dynamically choosing tools, fetching skills, or delegating to sub-agents, what stops agents from dynamically assembling better versions of themselves as more context becomes available?

But that’s a part two post for another day.