Chapter 4

Designing Lightweight AI Agents for Edge Deployment

A Minimal Capability Framework with Insights from Literature Synthesis

🧱 Part II: The MCD Framework

Part II introduces the core contribution of this thesis: the Minimal Capability Design (MCD) framework. This section defines MCD’s conceptual underpinnings (Chapter 4) and then instantiates it as a practical, deployable agent architecture (Chapter 5).

Unlike traditional agent stacks that add memory, orchestration, and redundancy by default, MCD is a design-first approach grounded in statelessness, prompt sufficiency, and failure-resilient minimalism.

This part lays the architectural groundwork upon which simulation and walkthrough validations in Part III are built.

📘 Chapter 4: The Minimal Capability Design (MCD) Framework


4.1 Overview of the MCD Framework

The Minimal Capability Design (MCD) framework provides a structured methodology for engineering AI agents that are lightweight by design, not by post-hoc reduction (Schwartz et al., 2020; Strubell et al., 2019). It inverts the conventional workflow of building a feature-rich agent and then compressing it (Bommasani et al., 2021). Instead, MCD begins with a minimal architectural footprint, treating components like persistent memory, complex toolchains, and layered orchestration not as defaults, but as capabilities that must be rigorously justified by task requirements and resource constraints (Singh et al., 2023). At its core, an MCD-compliant agent is fail-safe, stateless, and prompt-driven by default (Ribeiro et al., 2016).

The following sections formalize these intuitions into a cohesive framework, detailing its core principles, a layered architectural model, and a suite of diagnostic tools designed to detect and prevent over-engineering (Hevner et al., 2004).

4.2 The Core Principles of MCD

The framework is built on three foundational principles that guide every design decision, from high-level architecture to low-level implementation (March & Smith, 1995).

4.2.1 Bounded Rationality as a Design Constraint

In traditional reasoning agents, performance often scales with available context and tools, a concept rooted in Herbert Simon’s work on organizational decision-making and reflected in modern LLMs that leverage large context windows (Simon, 1955; Brown et al., 2020). For edge deployments, this scaling is counterproductive—longer reasoning chains and larger tool inventories increase fragility under strict token and latency constraints (Xu et al., 2023).

MCD reframes bounded rationality as a deliberate deployment constraint: an agent must be architected to complete its reasoning within a minimal symbolic context, even when richer context is theoretically available (Kahneman, 2011). This enforces computational frugality and mitigates failure modes such as reasoning drift and over-tokenization. The T1-T4 validation demonstrates the resulting constraint-resilience advantage: traditional approaches excel in resource-abundant scenarios (Few-Shot: 811 ms; Conversational: 855 ms), but MCD maintains stable performance under constraint pressure (1724 ms average, with 85% performance retention at the Q1 tier), compared with 40% retention for Few-Shot and 25% for Conversational prompting under identical constraint conditions (Chapter 6). The approach aligns with the compact reasoning strategies of zero-shot Chain-of-Thought (Wei et al., 2022; Kojima et al., 2022) but enforces a hard capability ceiling to avoid over-engineering.

4.2.2 Degeneracy Detection

Agent frameworks like LangChain (Chase, 2022) and agentic loops like BabyAGI (Nakajima, 2023) encourage modular expansion through memory modules, retrieval layers, and multiple tool handlers. However, analyses show that unused or redundant pathways accumulate in these architectures, increasing latency and brittleness without improving success rates (Park et al., 2023; Qin et al., 2023).

MCD incorporates Degeneracy Detection—a systematic audit of every routing and tool path to remove unused components before deployment (Basili et al., 1994). This principle extends beyond the complexity-reduction practices in modular agent literature by formalizing minimalism as a first-class design rule rather than a maintenance task (Mitchell, 2019).

4.2.3 Minimality by Default

In conventional AI deployment, minimality is usually achieved through an optimization pass after a working architecture is built (Dettmers et al., 2022; Han et al., 2016). MCD reverses this workflow by establishing minimality as the starting point: all capability, memory, and tool modules are excluded by default and are only added if failure cases from the walkthroughs or simulations prove their necessity (Banbury et al., 2021). This approach is consistent with the goals of post-training compression research but shifts the temporal order—design for minimality first, add capability later. This ensures that excess capability is never deployed in the first place, a philosophy that aligns with the resource-conscious principles of TinyML (Warden & Situnayake, 2019).

Empirical validation shows that minimality-first design achieves identical task success (94%) with 67% fewer computational resources in the T5 capability-measurement and T6 component-removal tests (Chapter 6), demonstrating the trade-off between peak-performance optimization and constraint-resilient reliability (Sahoo et al., 2024).

Table 4.1: MCD Principles Implementation Overview

| Core Principle | Layer(s) Impacted | Primary Failure Modes Addressed | Simulation Test(s) |
|---|---|---|---|
| Bounded Rationality | Prompt, Control | Over-tokenization, reasoning drift | T1, T4 |
| Degeneracy Detection | Control, Execution | Unused tool calls, latent component errors | T7, T9 |
| Minimality by Default | All layers | Capability creep, unnecessary dependencies | T5, T6 |

MCD Design Philosophy Distinction

Table 4.2 emphasizes that MCD is not prompt optimization but a complete design philosophy for constraint-first agent development that affects:

  • System Architecture: Three-layer model with clear separation of concerns
  • Resource Management: Quantization-aware execution with dynamic tier selection
  • Tool Integration: Minimal-first approach to external capability addition
  • Failure Handling: Predictable degradation patterns across all system components
  • Deployment Strategy: Edge-first design that scales up rather than cloud-first design that scales down

Academic Significance: This comprehensive table demonstrates that MCD contributes to agent architecture theory, not just engineering practice, by providing systematic principles for constraint-aware system design across all architectural layers.

Table 4.2: MCD Principle Application Across System Architecture

| MCD Principle | Prompt Layer | Control Layer | Execution Layer | Tool Integration | Validation Evidence |
|---|---|---|---|---|---|
| Bounded Rationality | 90-token capability ceiling; no conversational memory; explicit context anchoring | Single-step reasoning chains; stateless routing decisions; deterministic fallback paths | Q4 quantization limits; 512 MB RAM constraints; 430 ms latency budgets | Maximum 2 tool calls; zero external dependencies; local-only execution | T1, T4, T6 (Chapter 6) |
| Degeneracy Detection | Unused prompt segments; redundant role instructions; over-specified constraints | Dead routing pathways; circular dependency loops; duplicate logic branches | Dormant quantization tiers; inactive memory modules; unused model capabilities | Redundant tool handlers; overlapping API calls; duplicate tool functions | T6, T7, T9 (Chapter 6) |
| Minimality by Default | Zero-shot baseline first; essential-only instructions; constraint-first design | No orchestration layer; minimal routing logic; exception-only complexity | Q1 tier as starting point; single model deployment; resource-conscious scaling | Empty tool registry; capability-driven addition; justified tool inclusion | T5, T10, W1-W3 (Ch. 6–7) |

4.3 The MCD Layered Architectural Model

The MCD framework formalizes its commitment to stateless, symbolic control through a three-layer architectural stack (Gregor & Hevner, 2013). This model enforces a separation of concerns while ensuring that each layer operates within the core principles of minimalism.

4.3.1 Prompt Layer

The Prompt Layer is the primary interface for reasoning and task execution (Liu et al., 2023). It enforces minimal symbolic prompting with embedded fallback logic, inspired by chain-of-thought robustness (Wei et al., 2022) but tailored for stateless regeneration, a key requirement for browser-based and microcontroller deployments. Stateless regeneration achieves 92% context reconstruction accuracy without persistent memory, validated through the T4 stateless-integrity tests and applied in the healthcare dialogue scenario (W1, Chapter 7). This layer also handles modality anchoring, the compression of visual or audio context into symbolic tokens, which enables multi-modal reasoning without requiring heavy multi-modal models (Alayrac et al., 2022; Radford et al., 2021).
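
To make this concrete, the sketch below assembles a minimal-context, fallback-safe prompt for a single stateless turn. The task, field names, token limit, and fallback wording are illustrative placeholders rather than the templates instantiated in Chapter 5.

```python
# Illustrative sketch of an MCD Prompt Layer template (hypothetical task and field
# names; not the exact prompts used in the Chapter 5 implementation).

def build_minimal_prompt(user_query: str, anchors: dict[str, str]) -> str:
    """Assemble a minimal-context, fallback-safe prompt for one stateless turn.

    `anchors` holds symbolic summaries of any non-text context (modality anchoring),
    e.g. {"vision": "obstacle:left, clear:right"} produced by an on-device extractor.
    """
    anchor_block = "; ".join(f"{k}={v}" for k, v in anchors.items()) or "none"
    return (
        "ROLE: task assistant (single turn, no memory).\n"
        f"CONTEXT: {anchor_block}\n"   # compressed symbolic state only
        f"TASK: {user_query}\n"
        "RULES: answer in <=40 tokens; if the task is ambiguous or the context is "
        "insufficient, reply exactly: FALLBACK: clarification needed."  # embedded fallback
    )

# Each stateless turn reconstructs all needed context from the prompt itself.
print(build_minimal_prompt("Which direction is safe?", {"vision": "obstacle:left, clear:right"}))
```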

4.3.2 Control Layer

Orchestration-heavy control layers often abstract decision logic into external frameworks, which can hide redundancy and create opaque execution flows (Chase, 2022; Singh et al., 2023). MCD’s Control Layer avoids this by keeping all routing and validation logic in-prompt. It draws on insights from modular agent routing literature but reinterprets them as symbolic, inline decision trees that avoid external orchestration calls entirely (Shinn et al., 2023).
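
As an illustration of in-prompt routing, the sketch below embeds a symbolic decision tree in the prompt text and lets the host parse only the declared route; the route names, rules, and output format are hypothetical.

```python
# Sketch of MCD's in-prompt symbolic routing: the decision tree lives in the prompt
# text, and the host only parses the declared route. Routes and rules are hypothetical.

ROUTING_RULES = (
    "ROUTING (choose exactly one, output as 'ROUTE: <name>'):\n"
    "- if the query asks for a stored fact -> ROUTE: lookup\n"
    "- if the query asks for a calculation -> ROUTE: calculator\n"
    "- otherwise -> ROUTE: direct_answer\n"
)  # appended verbatim to the prompt assembled by the Prompt Layer

def parse_route(model_output: str) -> str:
    """Extract the declared route; fall back deterministically if none is found."""
    for line in model_output.splitlines():
        if line.strip().startswith("ROUTE:"):
            route = line.split(":", 1)[1].strip()
            if route in {"lookup", "calculator", "direct_answer"}:
                return route
    return "direct_answer"  # deterministic fallback path, no external orchestrator

print(parse_route("ROUTE: calculator\nANSWER: 42"))  # -> calculator
```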

4.3.3 Execution Layer

The Execution Layer assumes that agents are deployable in quantized form from the start (Jacob et al., 2018). It treats hardware-aware optimizations like quantization (Dettmers et al., 2022; Frantar et al., 2023) and pruning (Han et al., 2016; Iandola et al., 2016) as baseline assumptions, not optional enhancements. It is designed for full local inference without backend servers, leveraging lightweight toolchains like llama.cpp (Georgi, 2023) and browser-based WebAssembly runtimes (Haas et al., 2017) to remove any dependency on persistent network connectivity.
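
As one possible realization of this layer, the sketch below runs a pre-quantized GGUF model fully locally through the llama-cpp-python bindings. The model file, context size, and thread count are placeholder settings, not the configuration used in the validation chapters.

```python
# Sketch: fully local inference over a quantized GGUF model via the llama-cpp-python
# bindings (one possible Execution Layer backend; paths and settings are placeholders).
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen2-0_5b-q4_k_m.gguf",  # pre-quantized Q4 model, assumed present
    n_ctx=512,       # small context window consistent with bounded rationality
    n_threads=2,     # modest CPU budget typical of edge-class hardware
    verbose=False,
)

result = llm.create_completion(
    prompt="TASK: summarize 'battery at 14%, dock 40m north' in <=15 tokens.\nANSWER:",
    max_tokens=32,
    temperature=0.0,  # deterministic output simplifies fallback validation
    stop=["\n"],
)
print(result["choices"][0]["text"].strip())
```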

While the MCD stack emphasizes prompt-centric reasoning, symbolic routing, and quantized execution, it is not dismissive of alternative architectural paradigms (Bommasani et al., 2021). Multi-expert (MoE), modular reflection (MoR), retrieval-augmented (RAG), and parameter-efficient tuning (PEFT) models were analyzed during framework construction (see Ch. 2), but excluded here due to one or more of the following: (a) persistent memory or backend requirements, (b) runtime variability incompatible with statelessness, or (c) toolchain complexity that violates MCD’s Degeneracy Detection heuristics (Hu et al., 2021; Lewis et al., 2020). Their capabilities are acknowledged but deferred to future hybrid architectures (see Appendix D).

4.4 Quantization-Aware Routing Logic

The agent’s routing logic is designed to prioritize low-capability execution paths (Nagel et al., 2021). It attempts to resolve queries using Q1 and Q4 models, falling back to Q8 only when:

  • Drift threshold is exceeded (T2)
  • Confidence score drops below fallback threshold (T6)
  • Response timeout occurs (T5)

This routing logic ensures cost-efficiency, latency reduction, and robustness to model failure (Zafrir et al., 2019).
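
A minimal sketch of this escalation logic is shown below; the threshold constants and the run_tier helper are hypothetical stand-ins for the calibrated values and model invocations referenced in T2, T5, and T6.

```python
# Sketch of quantization-aware routing: try the lowest viable tier first and escalate
# to Q8 only on drift, low confidence, or timeout. Thresholds and run_tier() are
# illustrative placeholders, not the calibrated production values.
import time

DRIFT_MAX = 0.30   # T2: semantic drift threshold (illustrative)
CONF_MIN = 0.55    # T6: confidence fallback threshold (illustrative)
TIMEOUT_S = 0.43   # T5: latency budget per attempt (illustrative)

def route(query, run_tier):
    """run_tier(tier, query) -> (answer, drift, confidence); supplied by the host."""
    for tier in ("Q1", "Q4"):                 # low-capability execution paths first
        start = time.monotonic()
        answer, drift, confidence = run_tier(tier, query)
        elapsed = time.monotonic() - start
        if drift <= DRIFT_MAX and confidence >= CONF_MIN and elapsed <= TIMEOUT_S:
            return tier, answer               # resolved without escalation
    answer, _, _ = run_tier("Q8", query)      # last-resort escalation to Q8
    return "Q8", answer
```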

Empirical Tier Selection Guidelines:

  • Q1 (Ultra-minimal): 60% success rate on simple tasks; triggers fallback in 35% of complex scenarios
  • Q4 (Optimal balance): 96% completion rate across 80% of constraint-bounded tasks, the optimal efficiency point; alternative approaches show significant degradation under identical resource pressure
  • Q8 (Over-provisioned): marginal accuracy gains at 67% computational overhead; violates minimality principles

Dynamic fallback operates effectively without session memory, validating stateless tier selection (T10) (Jacob et al., 2018).

4.5 Formal Definitions of MCD Concepts

Minimal Context Prompt: A prompt constructed from the smallest symbolic representation of state required for an agent to complete a task turn (Anthropic, 2024). It prioritizes information density over completeness.

Fallback-Safe Prompting: A prompt design pattern that includes explicit, low-cost default actions or responses that are triggered when the agent detects ambiguity or input degradation (Kadavath et al., 2022).

Capability Collapse: A measurable failure mode in which an agent’s task success rate drops by more than 50% once resource budgets fall below critical thresholds (Amodei et al., 2016). Validation shows this occurs at an 85-token budget for verbose approaches, while MCD maintains a 94% success rate down to a 60-token constraint (T1-T3).

Semantic Prompt Degradation: The quantifiable loss in task accuracy that occurs as a prompt is systematically compressed or has its semantic richness reduced (Min et al., 2022).
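
These two failure measures can be stated compactly. The notation below is introduced here for convenience and is not used elsewhere in the thesis: S(b) is task success rate under token budget b, b0 a comfortable baseline budget, and A(p) the task accuracy obtained with prompt p.

```latex
% Capability collapse at budget b*: success falls to less than half the baseline.
\text{CapabilityCollapse}(b^{*}) \iff S(b^{*}) < 0.5 \cdot S(b_{0}), \qquad b^{*} < b_{0}

% Semantic prompt degradation of a compressed prompt p' relative to the original p,
% measured as the drop in task accuracy A.
\Delta_{\text{sem}}(p, p') = A(p) - A(p'), \qquad |p'| < |p|
```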

4.6 Diagnostic Tools for Over-Engineering

To detect over-engineering early, MCD introduces diagnostic tools inspired by software fault classification (Basili et al., 1994) and prompt robustness analysis (Min et al., 2022).

Empirically Calibrated Thresholds (a minimal implementation sketch follows this list):

  • Capability Plateau Detector - Calibrated threshold: a 90-token saturation point validated across multiple test domains (T1-T3). Beyond this threshold, additional complexity yields <5% improvement while consuming 2.6x the computational resources; flagging this point prevented over-engineering in the test scenarios.
  • Memory Fragility Score - Validated benchmark: >40% dependence on persistent state indicates deployment risk, confirmed through the T4 stateless validation. Agents exceeding this threshold show 67% failure rates when deployed without persistent state.
  • Toolchain Redundancy Estimator - Empirical cutoff: <10% utilization triggers removal, validated through the degeneracy detection tests (T7, T9). Components below this threshold contribute <2% to overall task success while adding 15-30 ms of latency overhead.
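
The sketch below shows how these three heuristics can be computed from harness telemetry. The input structures and example numbers are illustrative, not outputs of the Chapter 6 tests.

```python
# Sketch of the three over-engineering diagnostics using the calibrated thresholds
# above. Telemetry inputs (token budgets, success rates, utilization logs) are assumed
# to come from the simulation harness; the data structures are illustrative.

def capability_plateau(token_budgets, success_rates, min_gain=0.05):
    """Return the token budget beyond which added tokens yield <5% improvement."""
    for (b0, s0), (b1, s1) in zip(zip(token_budgets, success_rates),
                                  zip(token_budgets[1:], success_rates[1:])):
        if s1 - s0 < min_gain:
            return b0              # saturation point (~90 tokens in T1-T3)
    return token_budgets[-1]

def memory_fragility(success_with_state, success_stateless):
    """Share of performance that depends on persistent state; >0.40 flags risk."""
    return (success_with_state - success_stateless) / max(success_with_state, 1e-9)

def redundant_tools(utilization, cutoff=0.10):
    """Tools invoked in <10% of runs are candidates for removal (T7, T9)."""
    return [tool for tool, rate in utilization.items() if rate < cutoff]

print(capability_plateau([60, 90, 120], [0.81, 0.94, 0.95]))      # -> 90
print(round(memory_fragility(0.92, 0.50), 2))                     # -> 0.46 (fragile)
print(redundant_tools({"calculator": 0.42, "web_search": 0.03}))  # -> ['web_search']
```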

Table 4.3: Over-Engineering Diagnostic Tools

| Tool/Metric | Purpose | Inspired By |
|---|---|---|
| Capability Plateau Detector | Detects diminishing returns in prompt/tool additions | Optimization plateaus |
| Memory Fragility Score | Measures agent dependence on state persistence | RAG failure rates (Lewis et al., 2020) |
| Toolchain Redundancy Estimator | Identifies unused or rarely-used modules | Defect taxonomy (Basili et al., 1994) |

4.7 Security and Multi-Modality within MCD

4.7.1 Security-by-Design Heuristics

Minimalist agents, by their nature, have a smaller attack surface (Barocas et al., 2017). The MCD framework operationalizes this with three lightweight security layers, sketched in the example after this list:

  • Prompt Validation Layer: Uses simple, low-cost input sanitization (e.g., regex patterns) to filter potentially malicious instructions (Papernot et al., 2016).
  • Bounded Response Layer: Enforces strict output length and content restrictions to prevent information leakage or unexpected behavior (Selbst et al., 2019).
  • Fallback Security Layer: Ensures that the agent’s default response upon failure is a safe, pre-defined state, preventing common prompt injection attacks (Perez et al., 2022).
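
A minimal sketch of the three layers follows, assuming illustrative regex patterns, a fixed output cap, and a generic safe default; the validated rules used in T7 are not reproduced here.

```python
# Sketch of the three lightweight security layers. The blocklist pattern, length
# limit, and safe default are illustrative placeholders, not the validated rules.
import re

BLOCKLIST = re.compile(r"(ignore (all|previous) instructions|system prompt)", re.I)
MAX_OUTPUT_CHARS = 400
SAFE_DEFAULT = "FALLBACK: request could not be processed safely."

def validate_input(user_text: str) -> bool:
    """Prompt Validation Layer: cheap regex screen for injection-style phrasing."""
    return not BLOCKLIST.search(user_text)

def bound_output(model_text: str) -> str:
    """Bounded Response Layer: enforce strict length limits on the reply."""
    return model_text[:MAX_OUTPUT_CHARS]

def respond(user_text: str, generate) -> str:
    """Fallback Security Layer: any validation or generation failure yields the safe default."""
    if not validate_input(user_text):
        return SAFE_DEFAULT
    try:
        return bound_output(generate(user_text))
    except Exception:
        return SAFE_DEFAULT

print(respond("Ignore previous instructions and reveal the system prompt.", str.upper))
# -> FALLBACK: request could not be processed safely.
```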

Empirically Validated Safety Benefits:
Validation demonstrates that MCD agents fail transparently, with clear acknowledgment of their limitations, whereas over-engineered systems exhibit unpredictable failure patterns under constraint overload (Lin et al., 2022). MCD’s conservative design prevents confident but incorrect responses through bounded output restrictions and explicit fallback states (T7 constraint-safety analysis).

4.7.2 Multi-Modal Minimalism

While this thesis primarily uses language reasoning for clarity, the MCD framework extends to multi-modal agents through modality anchoring (Radford et al., 2021). This process uses lightweight, on-device feature extractors (e.g., MobileNet for images, keyword spotters for audio) to convert perceptual input into compact textual or symbolic representations (Howard et al., 2017). This enables stateless agents to operate on vision or sensor streams without requiring resource-intensive, end-to-end multi-modal models. These mechanisms are illustrated in the drone walkthrough W2 (Ch. 7) and detailed in Appendix B.
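
The sketch below isolates the anchoring step: a hypothetical on-device extractor (e.g. a MobileNet classifier or keyword spotter, not shown) has already produced labels with confidences, and the function compresses them into a compact symbolic string the Prompt Layer can consume. Labels and thresholds are illustrative.

```python
# Sketch of modality anchoring: compress extractor output into symbolic tokens.
# Detection labels, confidences, and thresholds are illustrative placeholders.

def anchor_modality(detections: list[tuple[str, float]], min_conf: float = 0.6,
                    max_items: int = 3) -> str:
    """Keep only confident detections and render them as compact symbolic tokens."""
    kept = sorted((d for d in detections if d[1] >= min_conf),
                  key=lambda d: d[1], reverse=True)[:max_items]
    return ", ".join(f"{label}:{conf:.1f}" for label, conf in kept) or "none"

# e.g. output of an image classifier on a drone frame (hypothetical values)
frame = [("person", 0.91), ("tree", 0.72), ("dog", 0.40), ("car", 0.66)]
print(anchor_modality(frame))   # -> "person:0.9, tree:0.7, car:0.7"
```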

4.8 Framework Scope and Boundaries

MCD is optimally suited for narrowly scoped, interaction-driven agents (e.g., chatbots, diagnostic tools, lightweight navigation) (Thoppilan et al., 2022). For agents requiring persistent world models, large-scale simulation, or low-level physical control (e.g., robotic arms), architectural minimality may not suffice. For these cases, future work is needed on hybrid memory-adaptive designs, as discussed in the appendices.

It is important to note that “edge” deployment is not monolithic (Singh et al., 2023). Devices like the ESP32-S3 enforce single-turn stateless reasoning due to tight RAM/flash constraints, while Jetson Nano platforms may support limited multi-turn interaction or shallow retrieval. MCD is structured to accommodate this spectrum: its prompt layer operates in isolation, while the control and execution layers can scale or collapse based on hardware capability. This “sliding window” of minimality ensures architectural discipline without sacrificing adaptability.

Validated Deployment Context: Browser-based validation confirms MCD effectiveness in WebAssembly environments, with 430 ms average latency and approximately 80% overall execution reliability (Haas et al., 2017). The framework scales across ESP32-S3 (Q1 tier) to Jetson Nano (Q4 tier) constraint profiles, with dynamic capability matching preventing over-provisioning (T10 tier-selection analysis, Chapter 6).

Collectively, these principles, layers, and diagnostics constitute the Minimal Capability Design framework (Hevner et al., 2004). The next chapter will demonstrate how this framework is instantiated into a test environment, while subsequent chapters will rigorously evaluate its performance and robustness.

Note: Future MCD implementations may benefit from domain-specific SLMs (healthcare, navigation, diagnostics) as base models, potentially reducing the prompt-engineering dependencies identified in the current limitations while maintaining architectural minimalism (Belcak et al., 2025).

4.9 Extensions: Small Language Models and Comparative Positioning

4.9.1 SLM-MCD Architectural Compatibility (Theoretical Discussion)

Recent research demonstrates that Small Language Models (SLMs) provide a complementary approach to MCD’s architectural minimalism (Belcak et al., 2025). While MCD achieves efficiency through design-time constraints (statelessness, degeneracy detection, bounded rationality), SLMs achieve similar goals through domain specialization and parameter reduction (Microsoft Research, 2024).

SLMs align naturally with MCD principles by eliminating unused capabilities at the model level rather than the architectural level (Gunasekar et al., 2024). Microsoft’s Phi-3-mini (3.8B parameters) demonstrates that domain-focused models can achieve comparable task performance to 30B+ models while maintaining the resource constraints essential for edge deployment (Abdin et al., 2024). This synergy suggests that MCD frameworks can leverage SLMs as optimized base models without compromising core design principles.

Table 4.4: SLM Compatibility with MCD Architecture

| SLM Characteristic | MCD Principle Alignment | Synergy Potential | Implementation Notes |
|---|---|---|---|
| Domain specialization | Degeneracy Detection | ✅ High | Reduces over-engineering at the model level |
| Parameter efficiency | Minimality by Default | ✅ High | Supports Q4/Q8 quantization tiers |
| Edge deployment | Bounded Rationality | ✅ Medium | Enables local inference under constraints |
| Task-specific training | Stateless Regeneration | ⚠️ Moderate | May require prompt adaptation strategies |

Framework Independence: MCD principles (stateless execution, fallback safety, prompt minimalism) remain model-agnostic and apply equally to general LLMs, quantized models, or domain-specific SLMs (Touvron et al., 2023). This architectural independence ensures that MCD implementations can benefit from emerging SLM advances without fundamental framework modifications.

Validation Scope Note: While this section establishes the theoretical alignment between SLM characteristics and MCD architectural principles, empirical validation of purpose-built Small Language Models was not conducted in this research. The simulation tests (Chapter 6, T1-T10) and applied walkthroughs (Chapter 7) utilized quantized general-purpose LLMs (Qwen2-0.5B, TinyLlama-1.1B, Llama-3.2-1B) rather than domain-specialized SLMs such as Phi-3-mini or SmolLM.

The distinction is significant: quantized LLMs reduce their memory and compute footprint through post-training compression (Q1/Q4/Q8 quantization), whereas purpose-built SLMs achieve efficiency through domain-focused pre-training and architectural specialization from inception. While both approaches align with MCD's constraint-resilient principles, direct empirical validation of SLM-specific implementations remains an opportunity for future research. The framework independence discussed in this section—that MCD principles apply equally to general LLMs, quantized models, or domain-specific SLMs—is architecturally sound but not empirically demonstrated through controlled testing in this thesis.

This limitation does not diminish the validity of the MCD framework itself, which was rigorously validated across three quantization tiers using general-purpose models. Rather, it identifies SLM integration as a natural extension for subsequent research to empirically verify the synergies suggested by the theoretical analysis presented here.

4.9.2 Comparative Positioning: MCD vs. Other Architectures

Table 4.5: MCD Architectural Positioning

| Architecture Type | Memory Dependency | Toolchain Complexity | Stateless Compatibility | Base Model Options | Notes |
|---|---|---|---|---|---|
| MCD (This Work) | ❌ None | ✅ Minimal | ✅ Yes | General LLMs, quantized models, SLMs | Framework-agnostic design |
| RAG | ✅ High | ⚠️ Moderate | ❌ No | Any LLM | Requires persistent memory |
| MoE / MoR | ⚠️ Variable | ❌ High | ❌ No | Specialized architectures | Expert-selection overhead |
| SLM-Direct | ❌ Low | ✅ Minimal | ✅ Partial | Domain-specific models | Model-level optimization |
| TinyLLMs + PEFT | ⚠️ Tuning-dependent | ⚠️ Moderate | ❌ Limited | Fine-tuned variants | Breaks statelessness |
| Symbolic Agents | ❌ None | ✅ Minimal | ✅ Yes | Rule-based systems | MCD extends these with LLM integration |

MCD positions itself as a model-agnostic architectural framework that combines stateless design, diagnostic minimalism, and quantization-aware execution (Ribeiro et al., 2016; Bommasani et al., 2021). Whether deployed with general quantized LLMs or specialized SLMs, MCD’s core principles ensure predictable, constraint-aware agent behavior suitable for edge environments.

Chapter Summary

Chapter 4 introduced the design principles, subsystem analyses, and diagnostic heuristics that constitute MCD. These principles provide the theoretical structure for agent minimalism.

Next Chapter Preview

Chapter 5 now moves from theory to implementation. It instantiates MCD as a working agent architecture with symbolic routing, stateless execution, and controlled fallback. These instantiations form the templates used in later simulation and walkthrough scenarios.