Designing Lightweight AI Agents for Edge Deployment
A Minimal Capability Framework with Insights from Literature Synthesis
This chapter surveys the literature across four core dimensions of lightweight agent design: architectural minimality, prompt-based reasoning, memory constraints, and architectural degeneracy (Singh et al., 2023). For each domain, we analyze current strategies, identify limitations under edge deployment conditions, and motivate corresponding principles in the Minimal Capability Design (MCD) framework. Our focus lies not on post-hoc optimizations, but on design-time constraints that support reliability and interpretability under resource scarcity (Strubell et al., 2019).
This review analyzes over 70 peer-reviewed papers and technical reports sourced from ACL, NeurIPS, ICML, and arXiv (2020-2025) (Rogers et al., 2020). Search terms included “minimal capability AI,” “edge agent deployment,” “lightweight LLM optimization,” and “prompt engineering” (Qin et al., 2023). Papers were selected for inclusion if they (1) demonstrated agent deployment on real or simulated edge hardware, (2) discussed prompt or memory design explicitly, and (3) provided empirical latency or memory data (Chen et al., 2023). Insights were coded into three architectural layers—Prompt, Memory, and Execution—to identify recurring patterns and gaps, which directly inform the MCD framework proposed in this thesis (Braschler et al., 2020). These insights are later validated through browser-based simulation as an effective proxy for edge deployment constraints, providing controlled resource limitations without the variability of physical hardware (Li et al., 2024).
Recent literature shows that lightweight AI design is a mature area, particularly in embedded systems and TinyML research (Banbury et al., 2021; Warden & Situnayake, 2019). Approaches in TinyML heavily leverage post-hoc model optimization techniques such as quantization (Dettmers et al., 2022; Jacob et al., 2018) and knowledge distillation (Hinton et al., 2015; Gou et al., 2021) to reduce resource consumption on microcontroller-class devices. Similarly, work on on-device inference for mobile platforms (Howard et al., 2017; Han et al., 2016; Iandola et al., 2016) focuses on compression and pruning to fit models within tight resource budgets. For edge and offline deployment patterns, systems like EdgeTPU pipelines (Google Coral, 2020) and Jetson Nano deployments (NVIDIA, 2020) illustrate that hardware can execute LLM-adjacent models, but only with aggressive resource management (Xu et al., 2023). These approaches presuppose a neural-centric design, whereas MCD allows for symbolic or hybrid agents whose structure is deliberately constrained even prior to model selection (Mitchell, 2019).
Limitation:
These works focus almost exclusively on model-level efficiency, treating minimality as a post-hoc optimization rather than a foundational design principle (Schwartz et al., 2020). They do not explicitly address when to omit architectural components such as memory layers, toolchains, or orchestration, decisions that have major implications for interpretability and reliability in constrained environments (Ribeiro et al., 2016).
Pivot:
This gap motivates the Minimal Capability Design (MCD) framework’s principle of Minimality by Default (detailed in Ch. 4), where the architecture is constrained from the outset (Bommasani et al., 2021). This principle is operationalized and evaluated in the T6 component-removal analysis (Ch. 6), where removing unused components demonstrably improves clarity without loss of correctness.
While this body of work offers valuable optimization strategies, most require access to training data, fine-tuning infrastructure, or persistent session scaffolding (Hu et al., 2021). In contrast, quantization alone enables tiered deployment across constrained hardware without retraining (Nagel et al., 2021). Subsequent validation (Ch. 6) shows the Q4 tier to be optimal for 80% of constraint-bounded reasoning tasks and stable under progressive resource degradation, where alternative optimization techniques exhibit significant failure rates. This makes quantization the only optimization technique surveyed here that is directly aligned with runtime, agent-level MCD goals: statelessness, minimalism, and fallbacks (Zafrir et al., 2019).
The present work thus treats quantization (1-bit, 4-bit, 8-bit) as a primary enabler for deployment-layer optimization, while treating other techniques (distillation, PEFT, pruning) as architecturally relevant but operationally excluded from runtime implementation (Frantar et al., 2023).
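To make the tiered-deployment idea concrete, the sketch below selects among pre-quantized checkpoints of a single model by device memory budget. The tier names follow the Q1/Q4/Q8 labels used in this thesis, but the thresholds and selection rule are illustrative assumptions, not values validated here; the pattern reflects how quantized variants are commonly chosen at deployment time without any retraining.

```python
# Sketch: tier selection for quantized deployment (hypothetical thresholds).
# Assumes pre-quantized checkpoints already exist for each tier.
from dataclasses import dataclass

@dataclass
class QuantTier:
    name: str          # e.g. "Q4"
    bits: int          # bits per weight
    min_ram_mb: int    # assumed minimum device memory for this tier

# Hypothetical tiers, ordered from highest to lowest fidelity.
TIERS = [
    QuantTier("Q8", bits=8, min_ram_mb=4096),
    QuantTier("Q4", bits=4, min_ram_mb=2048),
    QuantTier("Q1", bits=1, min_ram_mb=512),
]

def select_tier(available_ram_mb: int) -> QuantTier:
    """Pick the highest-fidelity tier that fits the device budget."""
    for tier in TIERS:
        if available_ram_mb >= tier.min_ram_mb:
            return tier
    raise RuntimeError("Device below minimum supported tier")

if __name__ == "__main__":
    tier = select_tier(available_ram_mb=3000)
    print(f"Selected {tier.name} ({tier.bits}-bit) for a 3000 MB budget")  # -> Q4
```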
Table 2.1: Synthesis of Literature on Model-Level Optimization
Challenge | Key Papers | Insight Taken | MCD Extension |
---|---|---|---|
Model compression | Dettmers et al. (2022), Frantar et al. (2023) | Smaller models can run on constrained devices. | Treat compression as a baseline assumption, not an optional optimization. |
Knowledge distillation | Hinton et al. (2015) | Transfer knowledge to a smaller model. | Combine with minimal prompt logic to avoid over-training for unnecessary capabilities. |
TinyML deployment | Banbury et al. (2021) | Inference is possible on MCUs. | Apply minimality at the architecture level: drop orchestration and memory by default. |
On-device inference | Howard et al. (2017) | Pruning improves speed and latency. | Embed minimality into the agent’s interaction logic, not just its model parameters. |
Table 2.2: Optimization Technique Comparison
Optimization Technique | Training Dependency | Runtime Overhead | Edge Suitability | Stateless-Friendly | MCD Inclusion | Validated Performance |
---|---|---|---|---|---|---|
Quantization | ❌ None | ✅ Minimal | ✅ Strong | ✅ Yes | ✅ Yes | ✅ 2.1:1 reliability advantage under constraint conditions |
Distillation | ✅ Yes | ⚠️ Medium | ⚠️ Conditional | ❌ No | ❌ No | ❌ Training-dependent, excluded from validation |
PEFT (LoRA, etc.) | ✅ Yes | ❌ High | ❌ Weak | ❌ No | ❌ No | ❌ High overhead, validation-excluded |
Pruning | ✅ Yes | ⚠️ Medium | ⚠️ Unstable | ⚠️ Partial | ❌ No | ❌ Training-dependent, validation-excluded |
Adaptive Computation | ⚠️ Sometimes | ❌ Complex | ❌ Low | ⚠️ Unreliable | ❌ No | ❌ Complex overhead, validation-excluded |
Note: Techniques marked “excluded” are still referenced architecturally in Chapter 3, but not implemented or tested in this work due to MCD alignment mismatch.
Recent literature demonstrates the power of prompting to elicit complex behaviors (Brown et al., 2020; Liu et al., 2023). Zero-shot prompting enables task generalization without fine-tuning (Kojima et al., 2022), while chain-of-thought (CoT) improves reasoning transparency (Wei et al., 2022; Zhang et al., 2022). Few-shot in-context learning can anchor classification and reasoning tasks, reducing ambiguity (Dong et al., 2022; Min et al., 2022). More advanced techniques like ReAct combine reasoning with acting in minimal loops (Yao et al., 2022; Shinn et al., 2023), and Self-Ask allows agents to clarify questions under constraints (Press et al., 2022).
Limitation:
These works assume an ample context budget and often rely on intermediate reasoning chains that grow in token length, making them unsuitable for token-constrained, stateless agents (Tay et al., 2022). Prompting alone remains vulnerable to semantic drift under reformulation (Min et al., 2022; Perez et al., 2021) and over-tokenization when context windows are limited (Rogers et al., 2020).
These vulnerabilities manifest particularly in stateless environments where conversational approaches exhibit systematic drift into speculative territory, while structured fallback prompts maintain focus and clarity—a distinction critical for edge deployment scenarios (Kadavath et al., 2022).
Empirical validation demonstrates that under Q1 quantization pressure, structured prompts maintain 75% effectiveness while conversational approaches degrade to 40% reliability, confirming the constraint-resilience advantage of minimal prompting strategies (Sahoo et al., 2024).
Pivot:
This motivates MCD’s Minimal Capability Prompting (detailed in Ch. 4), where reasoning remains compact and recoverable under degraded context (Zhou et al., 2022). The approach is validated in the T1–T3 prompt-comparison and T4 stateless-integrity tests (Ch. 6), and its regeneration heuristics are further exercised under prompt degradation scenarios in Chapter 6 (T4, T8) and applied in realistic failure contexts in Chapter 7 (Wang et al., 2024).
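As a minimal sketch of this prompting style, the code below pairs a token-bounded structured prompt with a single fallback regeneration step. The templates, the crude length-based degradation check, and the `generate` stand-in are hypothetical simplifications of the heuristics detailed in Chapters 4 and 6, not the MCD reference implementation.

```python
# Sketch: minimal capability prompting with a fallback regeneration heuristic.
# `generate` stands in for any completion call (e.g., a local quantized model).

STRUCTURED_TEMPLATE = (
    "Task: {task}\n"
    "Constraints: answer in at most {max_tokens} tokens; do not speculate.\n"
    "If the task cannot be answered from the input, reply exactly: INSUFFICIENT_CONTEXT\n"
    "Answer:"
)

FALLBACK_TEMPLATE = (
    "Task: {task}\n"
    "Reply in one short sentence, or INSUFFICIENT_CONTEXT:"
)

def answer(task: str, generate, max_tokens: int = 64) -> str:
    """One structured attempt, then a single compact fallback regeneration."""
    first = generate(STRUCTURED_TEMPLATE.format(task=task, max_tokens=max_tokens))
    # Crude degradation check: empty output or runaway length triggers fallback.
    if first.strip() and len(first.split()) <= max_tokens:
        return first.strip()
    # Regenerate under a tighter prompt instead of extending a drifting chain.
    return generate(FALLBACK_TEMPLATE.format(task=task)).strip()

if __name__ == "__main__":
    stub = lambda prompt: "Paris."  # stand-in for a real model call
    print(answer("Name the capital of France.", stub))
```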
Table 2.3: Synthesis of Literature on Prompt-Based Reasoning
Challenge | Key Papers | Insight Taken | MCD Extension |
---|---|---|---|
Zero-shot generalization | Brown et al. (2020) | Tasks can be solved from natural language. | Limit to minimal, often symbolic prompts to conserve tokens. |
Reasoning transparency | Wei et al. (2022) | CoT improves interpretability. | Keep CoT strictly token-bound and use early exits. |
Few-shot anchoring | Dong et al. (2022) | Few-shot examples improve reliability. | Use compressed exemplars or symbolic representations. |
Prompt fragility | Min et al. (2022) | Prompts fail under semantic drift. | Add fallback-safe regeneration heuristics as a design requirement. |
Approaches to context management vary widely (Lewis et al., 2020; Karpukhin et al., 2020). Retrieval-augmented generation (RAG) improves factuality by querying external memory stores (Lewis et al., 2020; Izacard & Grave, 2021), while long-context models allow for thousands of tokens in session memory (Tay et al., 2022; Beltagy et al., 2020). Ephemeral scratchpads can support structured reasoning without requiring long-term storage (Griffith et al., 2022; Nye et al., 2021). However, these methods rely on persistent session state, assume non-degraded connectivity, and face challenges with episodic memory limits in dialogue (Shuster et al., 2022; Dinan et al., 2020). The Model Context Protocol (MCP), a lightweight specification for agent-tool communication, builds on minimalist prompt design principles but formalizes them as deployment constraints, prioritizing predictable resource use over the “more context is better” paradigm of RAG (Anthropic, 2024).
Limitation:
Memory-based designs inherently fail in offline, stateless contexts, where session history must be carried entirely within the prompt or discarded (Thoppilan et al., 2022).
Pivot:
This gap motivates MCD’s Stateless Regeneration approach (detailed in Ch. 4), where agents emulate continuity by statelessly reconstructing essential context at each turn (Ouyang et al., 2022). This strategy is validated in T4 stateless regeneration and T8 token constraint tests (Ch. 6), and applied in diagnostic contexts in Walkthrough W3 (Ch. 7).
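The sketch below illustrates the stateless regeneration pattern under simplifying assumptions: no session store exists, so each turn compresses the visible dialogue into a bounded set of symbolic facts and re-injects them into the prompt. The toy key=value extraction rule and the fact budget are hypothetical stand-ins for the reconstruction protocol defined in Chapter 4.

```python
# Sketch: stateless regeneration — continuity is emulated by compressing the
# visible dialogue into symbolic facts and rebuilding the prompt each turn.
from typing import Dict, List, Tuple

def compress(turns: List[Tuple[str, str]], budget: int = 5) -> Dict[str, str]:
    """Reduce prior (user, agent) turns to a bounded set of symbolic facts."""
    facts: Dict[str, str] = {}
    for user, _agent in turns:
        if "=" in user:                      # toy extraction rule
            key, value = user.split("=", 1)
            facts[key.strip()] = value.strip()
    # Keep only the most recent `budget` facts to bound prompt growth.
    return dict(list(facts.items())[-budget:])

def build_prompt(turns: List[Tuple[str, str]], new_input: str) -> str:
    """Rebuild the full working context in-prompt, with no external state."""
    facts = compress(turns)
    context = "; ".join(f"{k}={v}" for k, v in facts.items()) or "none"
    return f"Known facts: {context}\nUser: {new_input}\nAgent:"

if __name__ == "__main__":
    history = [("device=pump-7", "Noted."), ("error=E42", "Logged.")]
    print(build_prompt(history, "What should I check first?"))
```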
Table 2.4: Synthesis of Literature on Memory and Context
Challenge | Key Papers | Insight Taken | MCD Extension |
---|---|---|---|
Factual accuracy (RAG) | Lewis et al. (2020) | External memory improves factuality. | Replace with compact, in-prompt context to avoid external dependencies. |
Long-term context | Tay et al. (2022) | More history aids complex reasoning. | Use symbolic compression of history instead of storing full text. |
Structured reasoning | Griffith et al. (2022) | Scratchpads organize thought processes. | Keep scratchpads non-persistent and strictly per-turn. |
Stateful design | Khandelwal et al. (2021) | Statefulness helps long tasks. | Emulate continuity via stateless reconstruction protocols. |
Full-stack agent frameworks such as those discussed by Richards et al. (2023) and Singh et al. (2023) often integrate orchestration, toolchains, and memory by default. Popular libraries like LangChain (Chase, 2022) and agentic loops like BabyAGI (Nakajima, 2023) showcase modularity but can suffer from unused scaffolds and over-provisioned components (Park et al., 2023). This leads to complexity creep (Shinn et al., 2023) and high tool invocation costs (Schick et al., 2023; Toolformer Team, 2023). Such architectures introduce latent components (e.g., unused tool selectors, memory calls that are never populated) that create failure points without improving outcome quality (Mialon et al., 2023). For example, a latent `memory.get("user_intent")` call may return None and crash downstream logic even if the memory module is otherwise unused, a failure induced purely by scaffold overreach.
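A minimal reproduction of this failure mode, using hypothetical component names: the framework provisions a memory store that is never populated, yet its lookup sits on the critical path of the router.

```python
# Sketch: scaffold overreach — an unused memory module still sits on the
# critical path and turns a missing key into a downstream crash.

memory = {}  # provisioned by the framework but never populated

def route(task: str) -> str:
    intent = memory.get("user_intent")   # latent call: returns None
    return intent.upper()                # AttributeError on None

def route_minimal(task: str) -> str:
    # MCD-style alternative: the memory layer is removed at design time,
    # so intent is derived directly from the task input.
    return task.split()[0].upper()

if __name__ == "__main__":
    print(route_minimal("diagnose pump-7 error E42"))   # works: DIAGNOSE
    try:
        route("diagnose pump-7 error E42")
    except AttributeError as exc:
        print(f"latent-scaffold failure: {exc}")
```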
Beyond efficiency concerns, architectural complexity introduces safety risks where over-engineered systems fail by generating confident but incorrect responses, while minimal architectures can be designed for safe degradation patterns that acknowledge limitations rather than fabricate solutions (Amodei et al., 2016).
Validation confirms this safety advantage: structured minimal approaches demonstrate 0% dangerous failure modes under constraint overload, compared to 87% confident hallucination rates in over-engineered systems when resource pressure intensifies beyond design thresholds (Lin et al., 2022).
Limitation:
These architectures add fragility, increase latency, and hide design complexity behind abstractions that do not improve task success rates in constrained use cases (Qin et al., 2023).
Pivot:
This motivates MCD’s Degeneracy Detection principle (detailed in Ch. 4), where unused or redundant architectural pathways are systematically identified and removed during the design phase (Bommasani et al., 2021).
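One way to read this principle as a design-time audit is sketched below: the agent’s declared components are compared against those actually reachable in representative task traces, and unreachable ones are flagged for removal. The component names and trace format are hypothetical, not part of the MCD specification.

```python
# Sketch: design-time degeneracy detection — flag declared components that
# never appear in the agent's recorded execution traces.
from typing import List, Set

DECLARED: Set[str] = {"prompt", "router", "memory", "tool_selector", "executor"}

def used_components(traces: List[List[str]]) -> Set[str]:
    """Union of components invoked across representative task traces."""
    return {component for trace in traces for component in trace}

def degenerate(declared: Set[str], traces: List[List[str]]) -> Set[str]:
    """Components that are provisioned but unreachable in practice."""
    return declared - used_components(traces)

if __name__ == "__main__":
    traces = [["prompt", "router", "executor"],
              ["prompt", "executor"]]
    # Flags 'memory' and 'tool_selector' for removal at design time.
    print("remove at design time:", degenerate(DECLARED, traces))
```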
Table 2.5: Synthesis of Literature on Agent Frameworks and Complexity
Challenge | Key Papers | Insight Taken | MCD Extension |
---|---|---|---|
Over-provisioning | Chase (2022) | A rich toolset supports flexibility. | Remove unused tools entirely at design time. |
Abstraction cost | Richards et al. (2023) | Modular design can increase maintainability. | Focus on a minimal routing layer instead of complex abstractions. |
Latency creep | Nakajima (2023) | Orchestration slows down response time. | Enforce a direct prompt-to-execution mapping where possible. |
Hidden complexity | Singh et al. (2023) | Layers can obscure core logic. | Mandate a transparent architecture with auditable components. |
Recent developments in Small Language Models (SLMs) demonstrate parallel efficiency optimization through domain-specialized pre-training rather than post-deployment compression (Belcak et al., 2025; Gunasekar et al., 2024). While MCD primarily leverages quantization for deployment flexibility, emerging SLM architectures (Phi-3, Gemma, SmolLM) achieve similar resource profiles through parameter reduction from inception.
While quantization and SLMs represent parallel paths to efficiency optimization, this thesis focuses exclusively on quantization-based MCD validation to maintain methodological coherence. SLM-MCD architectural compatibility is discussed theoretically in Section 4.9.1, but empirical SLM validation is beyond the current research scope—representing an important direction for future work (Section 9.2.1). This design choice prioritizes framework universality: by demonstrating constraint-resilience through quantization of general-purpose models, MCD principles remain applicable whether practitioners deploy quantized LLMs or native SLMs.
This review reveals a consistent pattern: the literature on lightweight AI is dominated by model-centric, post-hoc optimizations, while the literature on agentic frameworks assumes resource abundance. MCD is formulated to address this gap by treating minimality not as an afterthought, but as a foundational architectural constraint. It focuses on interaction sufficiency, fallback robustness, and symbolic reasoning—not just computational lightness. Unlike runtime-oriented frameworks such as LangChain, MCD does not prescribe implementation libraries. Instead, it defines a design logic that assumes constraints and failure by default, making it compatible with a wide range of runtime choices. Critically, MCD does not compete with traditional frameworks in resource-abundant scenarios—instead, it provides reliable baseline performance precisely when resource constraints cause alternative approaches to degrade unpredictably or fail entirely. The MCD framework is task-agnostic and may be applied to any agent modality, as demonstrated in the Chapter 7 walkthroughs.
The emergence of Small Language Models provides additional validation for constraint-first design principles. Where traditional approaches optimize large models post-deployment, both MCD and SLMs demonstrate that design-time constraints, whether architectural or parametric, yield more efficient, deployable solutions (Belcak et al., 2025). This convergence suggests that future lightweight agents will benefit from combining MCD’s architectural minimalism with SLMs’ domain-specific efficiency, creating a dual-layer optimization strategy aligned with edge deployment requirements.
Additionally, while various model-level optimizations such as distillation and parameter-efficient fine-tuning offer theoretical benefits, their integration often demands persistent session state, retraining access, or complex runtime adaptations. For agents operating in cold-start or browser-based settings, these strategies introduce fragility, thereby strengthening the case for quantization as the most practical and robust deployment-aligned optimization in MCD.
Table 2.6: Summary of Domain Gaps and MCD Responses
Domain | Prior Work Focus | Limitation in Edge Context | MCD Response | Validation Evidence |
---|---|---|---|---|
Model Compression | Dettmers (2022), Frantar (2023) | Post-hoc minimality only. | Minimality by Default (Architectural) | T6 component removal maintains function, T10 shows Q4 optimal tier |
Prompt Reasoning | Brown (2020), Wei (2022) | Token-heavy reasoning chains. | Minimal Capability Prompting | T1-T3 demonstrate structured advantage under constraint |
Memory | Lewis (2020), Tay (2022) | Assumes persistent state and connectivity. | Stateless Regeneration | T4 stateless regeneration, T8 token constraint tests |
Agent Stacks | Chase (2022), Nakajima (2023) | Over-provisioned scaffolds and hidden complexity. | Degeneracy Detection | W1-W3 complexity detection walkthroughs |
In sum, this literature review consolidates model-centric minimality, prompt vulnerability, and architectural overreach under resource pressure into a coherent argument: that lightweight agents require not just efficient models, but constraint-first architectural design. The Minimal Capability Design framework presented in the following chapters answers this need.
The literature review highlighted a structural gap: while many solutions optimize models post hoc, few constrain design up front. The MCD framework emerges in response to this—built not by pruning complex agents, but by designing with minimality from the outset.
Chapter 3 now details the methodology by which MCD was formalized: a constructive, design-led approach validated through simulation, walkthroughs, and diagnostic heuristics. This provides the bridge between theoretical motivation and the framework definition introduced in Part II.