🧩 Part III: Validation, Extension, and Conclusion

🔭 Chapter 9: Future Work and Extensions

This chapter outlines directions for extending the Minimal Capability Design (MCD) framework beyond the scope of this thesis (Gregor & Hevner, 2013). These proposals are informed by the observed failure modes in the simulations (Chapter 6), the practical design trade-offs identified in the walkthroughs (Chapter 7), and the framework limitations analyzed during the evaluation (Chapter 8) (Miles et al., 2013). The goal is to move from the proof-of-concept of stateless minimalism toward hybrid, self-optimizing, and empirically validated agents that retain MCD’s efficiency principles while broadening their operational range (Xu et al., 2023).

9.1 Empirical Benchmarking on Edge Hardware

While this thesis employed a browser-based WebAssembly simulation environment to eliminate hardware-dependent noise, future work must include deployment-level empirical benchmarking on low-power devices to measure real-world efficiency and robustness (Banbury et al., 2021; Singh et al., 2023).

9.1.1 Proposed Hardware Testbeds

The proposed testbeds would include a selection of representative ARM-based edge devices (Howard et al., 2017):
- Raspberry Pi 5
- NVIDIA Jetson Nano
- Google Coral Dev Board

These platforms would allow for the direct measurement of CPU/GPU utilization during the inference of quantized LLMs (e.g., Q4/Q8 models) (Jacob et al., 2018; Dettmers et al., 2022).

⚙️ Note on Quantization Tiering:

Initial benchmarking will focus on Q1/Q4/Q8 quantized models, reflecting MCD’s design logic (Nagel et al., 2021). These tiers were selected because:
- Q1 enables ultra-low-resource deployments (e.g., in-browser WASM).
- Q4 balances inference speed and precision on platforms like the Jetson Nano.
- Q8 serves as a high-precision fallback in sustained-load scenarios.

Future testing may include partially quantized or mixed-precision architectures as hybrid agents are explored (Frantar et al., 2023).
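
To make the tiering logic concrete, the following TypeScript sketch shows one way a provisioning script might map coarse device constraints onto these tiers. The DeviceProfile fields, the memory cut-offs, and the function name are illustrative assumptions introduced for this example, not values validated in this thesis.

```typescript
// Hypothetical sketch: selecting a quantization tier from coarse device constraints.
// Tier boundaries below are illustrative, not measured thresholds from this thesis.

type QuantTier = "Q1" | "Q4" | "Q8";

interface DeviceProfile {
  availableMemoryMB: number; // memory budget for model weights
  hasAccelerator: boolean;   // GPU/TPU available (e.g., Jetson Nano, Coral)
  sustainedLoad: boolean;    // long-running, precision-sensitive workload
}

function selectQuantTier(device: DeviceProfile): QuantTier {
  // Ultra-low-resource targets (e.g., in-browser WASM, microcontrollers) -> Q1.
  if (device.availableMemoryMB < 512 && !device.hasAccelerator) return "Q1";
  // High-precision fallback for sustained, accuracy-critical load -> Q8.
  if (device.sustainedLoad && device.availableMemoryMB >= 4096) return "Q8";
  // Default balance of inference speed and precision -> Q4.
  return "Q4";
}

// Example: a Jetson Nano-class profile resolves to Q4.
console.log(selectQuantTier({ availableMemoryMB: 2048, hasAccelerator: true, sustainedLoad: false }));
```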

9.1.2 Hardware-Coupled Metrics and Benchmarking

Future validation of the MCD framework will include hardware-coupled metrics collected in these environments. Diagnostics from the simulations (e.g., T8, T9) will be correlated directly with on-device measurements to test the predictive robustness of the framework’s fallback and redundancy heuristics (Field, 2013).

Table 9.1: Proposed Metrics for Hardware-Coupled Benchmarking

| Metric | Measurement Method | Purpose |
|---|---|---|
| End-to-End Latency | Time from query submission to final response (ms). | Quantify how simulation-based sufficiency thresholds translate to real-world edge hardware. |
| Energy Consumption | Power draw in watt-hours per complete task cycle. | Evaluate the Green AI alignment of MCD principles under operational load. |
| Semantic Drift Incidence | Rate of logical or factual errors under noisy, real-world user inputs. | Identify whether failure points (e.g., the 52% semantic drift beyond 3-step reasoning chains observed in the T5 validation) shift under actual deployment conditions. |
| Throughput Efficiency | Number of queries processed per watt-hour. | Provide a holistic measure of the agent’s sustainable performance. |
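
As one illustration of how the Table 9.1 metrics could be aggregated from raw benchmark logs, the sketch below computes mean latency, energy per task cycle, semantic drift incidence, and queries per watt-hour. The TaskRun record shape and its field names are assumptions made for this example rather than part of the validated instrumentation.

```typescript
// Illustrative aggregation of the Table 9.1 metrics from per-run benchmark records.
// Assumes each run logs latency, energy, and a drift flag from the evaluator.

interface TaskRun {
  latencyMs: number;      // end-to-end latency for one query
  energyWh: number;       // measured energy per complete task cycle
  driftDetected: boolean; // logical or factual error flagged for this run
}

interface BenchmarkSummary {
  meanLatencyMs: number;
  energyWhPerTask: number;
  driftIncidence: number;     // fraction of runs exhibiting semantic drift
  queriesPerWattHour: number; // throughput efficiency
}

function summarize(runs: TaskRun[]): BenchmarkSummary {
  const n = runs.length;
  const totalEnergyWh = runs.reduce((sum, r) => sum + r.energyWh, 0);
  return {
    meanLatencyMs: runs.reduce((sum, r) => sum + r.latencyMs, 0) / n,
    energyWhPerTask: totalEnergyWh / n,
    driftIncidence: runs.filter(r => r.driftDetected).length / n,
    queriesPerWattHour: n / totalEnergyWh,
  };
}
```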

Validation-Grounded Metrics:
Browser-based validation established baseline thresholds that can guide hardware benchmarking (Strubell et al., 2019):
- 90-token capability plateau (T6) → Hardware energy consumption measurement at semantic saturation
- 2.1:1 reliability advantage under constraint conditions (T1-T10) → Real-world efficiency validation under ARM constraints
- ≈80% Q4 completion (W1/W2/W3) → Quantization tier validation on Jetson Nano vs ESP32-S3
- 0% vs 87% failure modes (T7) → Safety validation under hardware thermal constraints

These figures demonstrate consistent categorical patterns across n=5 runs per domain, with large effect sizes (η²=0.14-0.16) providing robust qualitative evidence.

Validation Continuity Framework:
The browser-based WebAssembly simulation (430ms average latency) provides a baseline for ARM device comparison:
- Raspberry Pi 5 → Expected 15-25% latency improvement over browser constraints
- Jetson Nano → Q4 tier validation with GPU acceleration for complex reasoning
- Coral Dev Board → Q1-Q4 fallback mechanism validation under edge TPU constraints

9.2 Hybrid Architectures: Extending MCD Beyond Pure Statelessness

A key limitation of the current MCD agents, identified in Section 8.4, is their strict statelessness and tool-free design. While advantageous for simplicity, this can be relaxed in a controlled, minimal-impact manner to extend the agent’s task scope without undermining MCD’s core principles.

9.2.1 Potential Hybrid Enhancements

  • Adaptive Memory Agents: Employ ephemeral memory that exists only within the current task session and is reset upon completion to prevent persistent state bloat (Anthropic, 2024).
  • Selective Memory Primitives: Store only critical symbolic anchors (e.g., the last two spatial coordinates in the navigation walkthrough) rather than the full conversation history (Thrun et al., 2005); a minimal sketch of such a session-scoped store follows this list.
  • On-Demand Tool Selection: Integrate external tools (e.g., a lightweight retrieval API) that are invoked only when the agent’s internal diagnostic heuristics detect a high risk of capability collapse (Qin et al., 2023).
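
The sketch below combines the first two enhancements, assuming the navigation use case from Chapter 7: a session-scoped store that retains only the two most recent symbolic anchors and is discarded at task completion, so no state persists across sessions. The class, method names, and capacity value are illustrative, not part of the validated MCD artifacts.

```typescript
// Hedged sketch of an ephemeral, session-scoped memory primitive for a hybrid MCD agent.
// It keeps only the two most recent symbolic anchors and resets when the task ends.

type Anchor = { label: string; value: string };

class EphemeralSessionMemory {
  private anchors: Anchor[] = [];
  private readonly capacity = 2; // retain only the two most recent symbolic anchors

  remember(anchor: Anchor): void {
    this.anchors.push(anchor);
    if (this.anchors.length > this.capacity) this.anchors.shift(); // drop the oldest anchor
  }

  recall(): Anchor[] {
    return [...this.anchors];
  }

  // Called on task completion: the agent returns to a fully stateless posture.
  reset(): void {
    this.anchors = [];
  }
}

// Usage: a navigation agent keeps only the last two coordinates within a session.
const memory = new EphemeralSessionMemory();
memory.remember({ label: "waypoint", value: "(3, 4)" });
memory.remember({ label: "waypoint", value: "(5, 4)" });
memory.remember({ label: "waypoint", value: "(5, 7)" });
console.log(memory.recall()); // [(5, 4), (5, 7)]
memory.reset();               // session ends, no persistent state remains
```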

🛠️ Reintroducing Optimization Trade-Offs:
While this thesis prioritized quantization due to its zero-training and stateless compatibility, future hybrid MCD agents may also explore (Hinton et al., 2015):
- Distilled TinyLLMs (e.g., TinyLlama) for cases with access to pre-compiled small models.
- PEFT techniques such as LoRA or prefix-tuning for agents that support task-specific fine-tuning during provisioning (Hu et al., 2021).
- Sparse and pruned models for structured symbolic reasoning agents (Han et al., 2016).

These approaches require session-state support or training pipelines, but they may be viable in bounded hybrid agents that retain a minimalist inference core.

9.2.2 SLM-MCD Integration Strategies

Recent research demonstrates that domain-specific Small Language Models (SLMs) provide complementary optimization to MCD’s architectural minimalism (Belcak et al., 2025). Unlike general quantized models, SLMs achieve efficiency through domain specialization while maintaining compatibility with MCD’s constraint-first principles (Magnini et al., 2025).

Domain-Specific MCD Agents:
Future implementations could leverage specialized SLMs as base models within MCD frameworks:
- Healthcare MCD Agents: Utilizing medical SLMs (e.g., BioMistral, mhGPT) for appointment booking and clinical terminology handling while preserving MCD’s stateless execution and fallback safety (Singhal et al., 2025)
- Navigation MCD Agents: Employing robotics-specific SLMs trained on spatial reasoning datasets (Song et al., 2024) to reduce semantic drift in multi-step navigation tasks
- Code Diagnostics MCD Agents: Integrating code-specific SLMs like Microsoft’s CodeBERT family for enhanced prompt debugging while maintaining MCD’s transparent boundary acknowledgment

Multi-SLM Orchestration Under MCD Logic:
Hybrid architectures could combine multiple domain-specific SLMs under MCD’s stateless routing logic (Agrawal & Nargund, 2025):

User Query → Intent Classification → Domain SLM Selection → MCD Execution Layer
           ↓
    Healthcare SLM (Q4) → Appointment Logic → Stateless Confirmation
    Navigation SLM (Q1/Q4) → Spatial Reasoning → Coordinate Output 
    Diagnostics SLM (Q8) → Pattern Recognition → Error Classification
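
The sketch below illustrates one possible stateless realization of this routing layer. The intent classifier, model identifiers, and tier assignments are placeholders; in a real deployment, the classifier and per-domain SLM runtimes would be supplied by the provisioning pipeline.

```typescript
// Minimal sketch of stateless multi-SLM routing under MCD logic.
// Model names, tiers, and the keyword-based classifier are illustrative placeholders.

type Domain = "healthcare" | "navigation" | "diagnostics";

interface SlmRoute {
  model: string;             // identifier of the domain-specific SLM
  tier: "Q1" | "Q4" | "Q8";  // quantization tier used for this domain
}

const routes: Record<Domain, SlmRoute> = {
  healthcare: { model: "medical-slm", tier: "Q4" },
  navigation: { model: "spatial-slm", tier: "Q4" },
  diagnostics: { model: "code-slm", tier: "Q8" },
};

// Placeholder intent classifier; a real agent would use a lightweight model here.
function classifyIntent(query: string): Domain {
  if (/appointment|symptom/i.test(query)) return "healthcare";
  if (/navigate|waypoint|coordinate/i.test(query)) return "navigation";
  return "diagnostics";
}

// Each call is self-contained: no conversation state survives between queries,
// preserving MCD's stateless execution layer beneath the routing logic.
function routeQuery(query: string): { route: SlmRoute; prompt: string } {
  const domain = classifyIntent(query);
  return { route: routes[domain], prompt: `[${domain}] ${query}` };
}

console.log(routeQuery("Book an appointment for Tuesday"));
```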

SLM-Quantization Synergy:
Domain-specific models trained on specialized datasets may achieve better performance at lower quantization tiers than general models (Pham et al., 2024). For example:
- Medical terminology SLMs might maintain clinical accuracy at Q4 precision where general LLMs require Q8
- Spatial reasoning SLMs could enable Q1-tier navigation tasks that general models cannot handle
- Code-specific SLMs may preserve debugging capability under aggressive compression

Table 9.2: SLM-MCD Integration Compatibility Matrix

| SLM Domain | MCD Principle Alignment | Quantization Tier | Stateless Compatible | Implementation Complexity |
|---|---|---|---|---|
| Healthcare | High - reduces medical jargon over-engineering | Q4/Q8 | ✅ Yes | Low - direct replacement |
| Navigation | Medium - requires spatial state handling | Q1/Q4 | ⚠️ Partial | Medium - coordinate persistence |
| Code Diagnostics | High - eliminates unused syntax handling | Q8 | ✅ Yes | Low - structured output |
| Multi-Domain | Variable - depends on orchestration | Q4/Q8 | ⚠️ Complex | High - routing logic required |

Framework Independence Preservation:
MCD architectural principles (stateless execution, fallback safety, degeneracy detection) remain model-agnostic and apply equally to general LLMs, quantized models, or domain-specific SLMs (Touvron et al., 2023). This ensures that SLM integration enhances rather than replaces MCD’s core design philosophy.

9.3 Auto-Minimal Agents: Toward Self-Optimizing Systems

An emerging research direction is the development of self-optimizing agents that continuously enforce MCD constraints on themselves without external tuning (Mitchell, 2019; Russell, 2019).

9.3.1 Core Concepts for Self-Optimization

  • Self-Reducing Prompt Chains: Agents would be designed to dynamically shorten multi-step reasoning prompts when the Redundancy Index (Section 8.3) indicates that no measurable accuracy gain is being achieved (Basili et al., 1994).
  • Entropy-Based Prompt Pruning: This approach would use token-level entropy scoring to detect high-perplexity or low-information branches in a prompt’s decision tree. The agent could then prune branches where the KL-divergence from a task-aligned distribution exceeds a set threshold, thereby maintaining prompt efficiency (a minimal sketch follows this list).
  • Domain-Aware Self-Optimization: Future auto-minimal agents could leverage SLM domain expertise for enhanced self-optimization:
    - Domain Drift Detection: SLMs trained on specific vocabularies could better detect when task context shifts beyond their expertise domain, triggering MCD fallback mechanisms.
    - Specialized Entropy Scoring: Domain-specific models provide more accurate entropy measurements for their specialized tasks, enabling precise self-pruning without capability loss.
    - Adaptive SLM Selection: Self-optimizing agents could dynamically select the most appropriate domain-specific SLM based on input analysis while maintaining MCD’s stateless execution.
  • Quantization-Aware Pruning Synergy: As agents begin self-optimizing, future directions may include quantization-aware pruning strategies that (Iandola et al., 2016):
    - Dynamically remove low-weight branches in decision trees,
    - Ensure pruning does not conflict with existing quantization tiers,
    - Preserve compatibility with Q4/Q8 fallback layers.
  • Self-Pruning via Capability Scoring: Agents could maintain a minimal execution graph by scoring each decision step for its relevance to the task and automatically dropping low-impact branches, thus avoiding the persistent growth of prompt chains over time.
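
As the sketch referenced above, the TypeScript below scores candidate prompt branches by Shannon entropy and by KL-divergence from a task-aligned reference distribution, pruning branches that exceed either threshold. The branch representation, the availability of per-token probabilities, and both threshold values are assumptions made for illustration; calibration would follow the empirical thresholds discussed next.

```typescript
// Hedged sketch of entropy-based prompt pruning. Token probabilities are assumed to be
// exposed by the model runtime; threshold values are illustrative only.

// Shannon entropy of a token probability distribution (in bits).
function entropy(p: number[]): number {
  return -p.reduce((sum, pi) => (pi > 0 ? sum + pi * Math.log2(pi) : sum), 0);
}

// KL divergence D(p || q) of an observed branch distribution p from a
// task-aligned reference distribution q (in bits).
function klDivergence(p: number[], q: number[]): number {
  return p.reduce((sum, pi, i) => (pi > 0 && q[i] > 0 ? sum + pi * Math.log2(pi / q[i]) : sum), 0);
}

interface PromptBranch {
  text: string;
  tokenProbs: number[]; // branch-level token distribution, assumed available from the runtime
}

// Keep a branch only if it is informative (low entropy) and stays close to the
// task-aligned distribution (low KL-divergence).
function pruneBranches(
  branches: PromptBranch[],
  taskDistribution: number[],
  entropyThresholdBits = 4.0, // illustrative: flags high-perplexity branches
  klThreshold = 1.0,          // illustrative: flags drift from the task-aligned distribution
): PromptBranch[] {
  return branches.filter(
    b =>
      entropy(b.tokenProbs) <= entropyThresholdBits &&
      klDivergence(b.tokenProbs, taskDistribution) <= klThreshold,
  );
}
```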

Empirically Calibrated Self-Optimization:
Validation provides specific thresholds for auto-minimal agent design (encoded as a configuration sketch after this list):
- Redundancy Index > 0.5 triggers automatic prompt compression (T6 validation)
- Token efficiency < 2.6:1 activates degeneracy detection pruning (T1-T3 efficiency metrics)
- Semantic drift > 10% initiates fallback tier selection (T5, T10 drift thresholds)
- 90-token plateau detection prevents unnecessary complexity expansion (universal pattern)
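
These validation-derived triggers can be encoded directly as a small configuration object, as in the sketch below. The structure, type names, and action labels are illustrative; only the numeric thresholds are taken from the validation results listed above.

```typescript
// Illustrative encoding of the validation-derived self-optimization triggers.
// Field and action names are assumptions; numeric values come from the thresholds above.

interface SelfOptimizationThresholds {
  redundancyIndexMax: number; // above this, compress the prompt (T6)
  tokenEfficiencyMin: number; // below this ratio, activate degeneracy pruning (T1-T3)
  semanticDriftMax: number;   // above this rate, select a fallback tier (T5, T10)
  tokenPlateau: number;       // stop expanding prompts past this budget (universal pattern)
}

const mcdThresholds: SelfOptimizationThresholds = {
  redundancyIndexMax: 0.5,
  tokenEfficiencyMin: 2.6,
  semanticDriftMax: 0.10,
  tokenPlateau: 90,
};

type Action = "compress_prompt" | "prune_degenerate_branches" | "select_fallback_tier" | "cap_prompt_growth";

function triggeredActions(
  m: { redundancyIndex: number; tokenEfficiency: number; semanticDrift: number; promptTokens: number },
  t: SelfOptimizationThresholds = mcdThresholds,
): Action[] {
  const actions: Action[] = [];
  if (m.redundancyIndex > t.redundancyIndexMax) actions.push("compress_prompt");
  if (m.tokenEfficiency < t.tokenEfficiencyMin) actions.push("prune_degenerate_branches");
  if (m.semanticDrift > t.semanticDriftMax) actions.push("select_fallback_tier");
  if (m.promptTokens > t.tokenPlateau) actions.push("cap_prompt_growth");
  return actions;
}
```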

9.3.2 Anticipated Benefits

  • Maintain token-budget discipline automatically.
  • Reduce reliance on human prompt engineers.
  • Allow agents to evolve toward their minimal viable design during deployment.

9.4 Chapter Summary and Thesis Outlook

The proposals in this chapter extend MCD from a static design philosophy into a dynamic and empirically grounded research program (Lessard et al., 2012).

The future trajectory for this work is fourfold:
- Measured: Validating the framework with real-world hardware performance data to ground its principles in empirical evidence (Patton, 2014).
- Flexible: Evolving into hybrid agents that carefully add selective state or tools to broaden their operational range without sacrificing architectural minimalism (Bommasani et al., 2021).
- Self-Governing: Creating agents that can detect and prevent their own over-engineering, making them more robust and adaptable (Russell, 2019).
- Domain-Optimized: Integrating specialized SLMs as base models within MCD frameworks to achieve both architectural and model-level efficiency without compromising constraint-first design principles (Belcak et al., 2025).

These extensions preserve MCD’s lightweight, deployment-aligned core while enabling greater robustness and domain reach—setting the stage for applied deployments in IoT, mobile robotics, embedded assistive devices, and offline-first AI systems (Warden & Situnayake, 2019).

Future work may also revisit hybrid optimization techniques such as quantization-aware pruning, adaptive distillation, and entropy-driven PEFT, provided they maintain alignment with MCD’s stateless, low-complexity ethos and complement the domain-specific SLM integration strategies described above.

Next Chapter Preview

With future directions outlined, we now conclude by reflecting on the overall contribution of this thesis. Chapter 10 synthesizes the findings, reaffirms the motivation for MCD, and summarizes the framework’s relevance to lightweight, robust agent design for edge scenarios.