Appendix E: MCD Heuristics and Diagnostic Table

Contents Overview

This appendix serves as a consolidated reference for MCD diagnostic heuristics introduced in Chapter 4, including methods for Capability Plateau detection, Redundancy Index calculation, Semantic Drift monitoring, and Prompt Collapse diagnostics. All thresholds are empirically validated through T1-T10 simulations and W1-W3 domain walkthroughs.

Purpose Statement

This appendix provides practitioners with a ready-to-apply toolkit for validating minimal agent designs, detecting over-engineering before deployment, and ensuring constraint-compliant architectures through quantified diagnostic metrics.

E.1 Comprehensive MCD Diagnostic Reference

Table E.1: Complete MCD Heuristics and Diagnostic Tools

| Diagnostic Tool | Purpose | Calibrated Threshold | Measurement Method | Failure Indicator | Chapter Reference | Validation Tests |
| --- | --- | --- | --- | --- | --- | --- |
| Capability Plateau Detector | Detects diminishing returns in prompt/tool additions | 90-130 token saturation range | Token efficiency analysis: semantic value per token | Additional complexity yields <5% improvement while consuming 2.6x resources | Section 6.3.6, Section 8.3 | T1, T3, T6 |
| Memory Fragility Score | Measures agent dependence on state persistence | 40% dependence threshold | Stateless reconstruction accuracy testing | >40% dependence indicates high fragility risk; T4 validates 5/5 stateless success | Section 4.2, Section 6.3.4 | T4, T5 |
| Toolchain Redundancy Estimator | Identifies unused or rarely-used modules | <10% utilization triggers removal | Component usage tracking during execution | Components below 10% utilization add latency overhead with minimal task contribution | Section 4.2, Section 6.3.7 | T7, T9 |
| Semantic Drift Monitor | Tracks reasoning quality degradation across quantization tiers | >10% semantic drift threshold | Cosine similarity comparison of Q1 vs Q4 outputs | Drift >10% triggers automatic tier escalation (Q1→Q4→Q8) | Section 6.3.10 | T2, T10 |
| Prompt Collapse Diagnostic | Identifies critical prompt compression limits | 60-token minimum threshold | Task success rate under progressive token reduction | MCD maintains 94% success at 60 tokens; failure below indicates insufficient minimality | Section 6.3.6 | T1, T2, T3, T6 |
| Context Reconstruction Validator | Tests stateless context recovery capability | ≥90% accuracy requirement | Multi-turn interaction without persistent memory | <90% accuracy indicates architectural dependency on session state | Section 4.2, Section 6.3.4 | T4 |
| Fallback Loop Complexity Meter | Prevents runaway recovery sequences | ≤2 loops maximum | Recovery sequence depth and token consumption | >2 loops leads to semantic drift (T5: 2/4 drift beyond 3 steps) | Section 5.4, Section 6.3.5 | T3, T5, T9 |
| Quantization Tier Optimizer | Selects minimum viable capability tier | Q4 optimal balance point | Performance vs resource consumption analysis | Q1: 85% retention, Q4: 95% success, Q8: equivalent with overhead | Section 6.3.10 | T10 |

E.2 Detailed Heuristic Implementation Guidelines

E.2.1 Capability Plateau Detector

Implementation Protocol:

def detect_capability_plateau(prompt_tokens, semantic_score, resource_cost):
    """
    Detects when additional prompt complexity yields diminishing returns.

    Args:
        prompt_tokens: current prompt length in tokens.
        semantic_score: marginal semantic value gained by the latest addition.
        resource_cost: marginal resource cost (latency/memory) of that addition.

    Threshold Calibration: 90-130 token saturation range (T6 validation)
    - Conservative lower bound: 90 tokens (design-time warning)
    - Empirical upper bound: 130 tokens (hard saturation)
    """
    if resource_cost <= 0:
        return "ERROR", "Resource cost must be positive"
    if prompt_tokens > 90:
        efficiency_ratio = semantic_score / resource_cost
        if efficiency_ratio < 0.05:  # <5% marginal improvement threshold
            return "PLATEAU_DETECTED", "Consider removing complexity beyond 90-token boundary"
    return "WITHIN_BOUNDS", "Prompt complexity acceptable"

Practical Application:

  • Monitor during design: Track token additions vs task completion improvements
  • Deployment threshold: Stop adding complexity beyond 90-token boundary (conservative) or 130-token ceiling (validated saturation)
  • Resource calculation: Measure latency/memory cost per token added
  • Validation evidence: T1-T3, T6 demonstrate plateau effects across multiple domains

Threshold Calibration Methodology:

The 90-130 token capability plateau range was empirically derived through systematic ablation testing (T1, T6) rather than imposed as a prescriptive universal constraint:

Empirical Evidence:

  • T1 variants: Optimal performance-to-resource ratio at 60-85 tokens
  • T6 variants: Capability saturation observed at 94-131 tokens across comparisons
  • Cross-test convergence: 90-130 token range validated through independent trials
  • Section 8.3 analysis: Confirmed plateau effect with 2.6x resource cost for <5% improvement

Threshold Selection Rationale: 90 tokens represents the conservative lower bound where marginal improvements typically fall below 5% while computational costs increase 2.6×. This provides a design-time warning signal rather than a strict enforcement boundary. 130 tokens represents the empirical saturation ceiling validated across T1-T6.

Task-Dependent Calibration:

  • Simple slot-filling: 60-80 tokens optimal (W1 Healthcare booking)
  • Spatial navigation: 70-90 tokens sufficient (W2 Indoor navigation with deterministic logic)
  • Complex diagnostics: 90-130 tokens required (W3 System diagnostics with heuristic classification)

Practitioner Guidance: Treat 90 tokens as an optimization starting point rather than an absolute constraint, adjusting for domain-specific complexity validated through T1-style testing; a design-time lookup sketch follows.
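
These task-dependent budgets can be captured in a small design-time lookup. The category keys and helper name below are illustrative, not part of the validated framework; the ranges mirror the W1-W3 calibration above.

# Illustrative design-time token budgets derived from the W1-W3 calibration.
TOKEN_BUDGETS = {
    "slot_filling": (60, 80),          # W1 healthcare booking
    "spatial_navigation": (70, 90),    # W2 indoor navigation
    "complex_diagnostics": (90, 130),  # W3 system diagnostics
}

def plateau_bounds(task_type, default=(90, 130)):
    """Return (warning, ceiling) token bounds for a task category."""
    return TOKEN_BUDGETS.get(task_type, default)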


E.2.2 Memory Fragility Score

Calculation Method:

def calculate_memory_fragility(stateless_accuracy, stateful_accuracy):
    """
    Measures dependence on persistent state vs stateless reconstruction.

    Validation: T4 shows 5/5 stateless success with explicit slot reinjection
    Threshold: >40% dependence indicates high fragility risk
    """
    if stateful_accuracy == 0:
        return "ERROR", "Insufficient stateful baseline"

    dependence_ratio = (stateful_accuracy - stateless_accuracy) / stateful_accuracy

    if dependence_ratio > 0.40:  # 40% dependence threshold
        return "HIGH_FRAGILITY_RISK", f"State dependence: {dependence_ratio:.2%}"
    elif dependence_ratio > 0.20:
        return "MODERATE_FRAGILITY", "Consider stateless optimization"
    else:
        return "STATELESS_READY", "Architecture validated for stateless deployment"

Practical Application:

  • Test protocol: Run identical tasks with explicit context reinjection (stateless) vs implicit session memory (stateful); a harness sketch follows this list
  • Risk assessment: >40% dependence indicates deployment vulnerability under resource constraints
  • Validation method: T4 confirms 5/5 stateless reconstruction success for MCD with explicit slot passing
  • Design implication: High fragility scores require architecture revision per Section 4.2 principles
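
The test protocol above can be scripted roughly as follows. This is a sketch rather than the T4 procedure itself: run_task and the task list stand in for a project-specific harness.

def memory_fragility_test(agent, tasks, run_task):
    """
    Sketch of the stateless-vs-stateful protocol: run each task once with
    explicit slot reinjection (stateless) and once with implicit session
    memory (stateful), then score dependence with calculate_memory_fragility.

    run_task(agent, task, stateless) is a hypothetical harness hook that
    returns 1.0 for task success and 0.0 for failure.
    """
    stateless_accuracy = sum(run_task(agent, t, stateless=True) for t in tasks) / len(tasks)
    stateful_accuracy = sum(run_task(agent, t, stateless=False) for t in tasks) / len(tasks)
    return calculate_memory_fragility(stateless_accuracy, stateful_accuracy)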

E.2.3 Toolchain Redundancy Estimator

Usage Tracking Implementation:

def analyze_toolchain_redundancy(component_usage_log, total_executions):
    """
    Identifies underutilized components for removal.

    Expects component records exposing `name`, `usage_count`, and
    `avg_latency` (milliseconds) attributes.

    Threshold: <10% utilization triggers removal (T7 validation)
    Benefit: 15-30ms latency savings per removed component
    """
    if total_executions <= 0:
        return "ERROR", []

    redundant_components = []
    for component in component_usage_log:
        utilization_rate = component.usage_count / total_executions

        if utilization_rate < 0.10:  # <10% utilization threshold
            redundant_components.append({
                "name": component.name,
                "utilization": f"{utilization_rate:.1%}",
                "latency_savings": f"{component.avg_latency}ms",
                "recommendation": "REMOVE",
            })

    if redundant_components:
        return "REDUNDANCY_DETECTED", redundant_components
    return "TOOLCHAIN_OPTIMIZED", redundant_components

Practical Application:

  • Monitoring period: Track component usage over representative task cycles (minimum n=100 interactions)
  • Removal threshold: Components with <10% utilization should be removed (T7 validation)
  • Performance impact: T7/T9 show 15-30ms latency savings from redundancy removal
  • Implementation: Perform a systematic audit during development and pre-deployment validation; a usage sketch follows this list
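
For reference, analyze_toolchain_redundancy expects component records exposing name, usage_count, and avg_latency fields. The record type, component names, and figures in this sketch are illustrative.

from dataclasses import dataclass

@dataclass
class ComponentUsage:
    name: str
    usage_count: int
    avg_latency: float  # milliseconds

usage_log = [
    ComponentUsage("slot_parser", usage_count=97, avg_latency=12.0),
    ComponentUsage("web_search_tool", usage_count=4, avg_latency=22.0),  # rarely used
]

status, findings = analyze_toolchain_redundancy(usage_log, total_executions=100)
# status == "REDUNDANCY_DETECTED"; findings flags web_search_tool at 4.0% utilization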

E.2.4 Semantic Drift Monitor

Real-time Detection:

from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Load the embedding model once at module scope rather than on every call.
_embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

def monitor_semantic_drift(q1_output, q4_output, similarity_threshold=0.90):
    """
    Monitors quality degradation across quantization tiers.

    Threshold: >10% drift (similarity <90%) triggers escalation
    Validation: T10 dynamic tier routing
    """
    # Calculate semantic similarity between the lower- and higher-tier outputs
    q1_embedding = _embedding_model.encode([q1_output])
    q4_embedding = _embedding_model.encode([q4_output])

    semantic_similarity = cosine_similarity(q1_embedding, q4_embedding)[0][0]
    drift_percentage = (1 - semantic_similarity) * 100

    if semantic_similarity < similarity_threshold:  # >10% drift
        return "ESCALATE_TO_Q4", f"Drift detected: {drift_percentage:.1f}%"
    return "MAINTAIN_Q1", f"Stable performance: {drift_percentage:.1f}% drift"

Practical Application:

  • Continuous monitoring: Compare outputs across quantization tiers in production
  • Automatic escalation: >10% drift triggers Q1→Q4→Q8 progression (T10 validation); an escalation sketch follows this list
  • Performance validation: T10 demonstrates effective tier selection with drift-based routing
  • Edge deployment: Critical for maintaining quality under resource constraints without manual intervention
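
Automatic escalation can be layered on top of monitor_semantic_drift. The sketch below assumes a hypothetical generate(prompt, tier) hook and follows the Q1→Q4→Q8 progression described above; it is not the exact T10 routing logic.

def generate_with_tier_escalation(prompt, generate):
    """
    Escalate Q1 -> Q4 -> Q8 when semantic drift exceeds 10%.

    generate(prompt, tier) is a hypothetical hook that runs the model at the
    given quantization tier; each candidate is compared against the next tier
    up, mirroring the pairwise comparison used by monitor_semantic_drift.
    """
    tiers = ["Q1", "Q4", "Q8"]
    for lower, higher in zip(tiers, tiers[1:]):
        candidate = generate(prompt, lower)
        reference = generate(prompt, higher)
        decision, detail = monitor_semantic_drift(candidate, reference)
        if decision.startswith("MAINTAIN"):
            return lower, candidate, detail
    return "Q8", generate(prompt, "Q8"), "Highest tier reached"

In production, a router would typically compare against a cached or periodically sampled higher-tier reference rather than generating both outputs on every turn.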

E.3 Diagnostic Application Workflow

Table E.2: MCD Validation Workflow Sequence

| Phase | Diagnostic Tools Applied | Success Criteria | Failure Actions |
| --- | --- | --- | --- |
| Design Phase | Capability Plateau Detector, Prompt Collapse Diagnostic | <90 tokens (conservative) or <130 tokens (ceiling); ≥94% task success at 60-token minimum | Redesign prompt structure; apply symbolic compression (Section 5.2.1) |
| Implementation Phase | Memory Fragility Score, Context Reconstruction Validator | <40% state dependence; ≥90% stateless accuracy (T4: 5/5 success) | Implement explicit context regeneration protocols (Section 4.2) |
| Pre-deployment Phase | Toolchain Redundancy Estimator, Fallback Loop Complexity Meter | All retained components ≥10% utilization; ≤2 fallback loops maximum | Remove redundant modules; simplify recovery sequences |
| Runtime Phase | Semantic Drift Monitor, Quantization Tier Optimizer | <10% semantic drift; Q4 optimal balance (T10: 80% of tasks) | Dynamic tier escalation Q1→Q4→Q8; performance rebalancing |
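
The phase sequence above can be wired into a simple gate that evaluates each phase's criteria in order and stops at the first failure. The metric keys below are placeholders for project-specific measurements, and the thresholds restate Table E.2.

def run_mcd_validation(metrics):
    """
    Phase-ordered gate over the Table E.2 criteria. metrics is a dict of
    pre-computed measurements (placeholder keys shown below); returns the
    first failing phase and its failure action, or None if all phases pass.
    """
    phases = [
        ("Design", metrics["prompt_tokens"] < 130 and metrics["success_at_60_tokens"] >= 0.94,
         "Redesign prompt structure; apply symbolic compression (Section 5.2.1)"),
        ("Implementation", metrics["state_dependence"] < 0.40 and metrics["stateless_accuracy"] >= 0.90,
         "Implement explicit context regeneration protocols (Section 4.2)"),
        ("Pre-deployment", metrics["min_component_utilization"] >= 0.10 and metrics["max_fallback_loops"] <= 2,
         "Remove redundant modules; simplify recovery sequences"),
        ("Runtime", metrics["semantic_drift"] < 0.10,
         "Escalate tier Q1 -> Q4 -> Q8; rebalance performance"),
    ]
    for phase_name, passed, failure_action in phases:
        if not passed:
            return phase_name, failure_action
    return None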

E.4 Empirical Calibration Evidence

Table E.3: Validation Evidence for Diagnostic Thresholds

| Heuristic | Calibration Source | Sample Size | Statistical Validation | Practical Validation |
| --- | --- | --- | --- | --- |
| 90-130 token plateau | T1 prompting analysis, T6 over-engineering detection, Section 8.3 | n=5 per variant across 10 test configurations (T1-T6) | Categorical consistency across tests; 95% CI: [0.44, 0.98] for 80% completion | Consistent across healthcare (W1), spatial (W2), diagnostic (W3) domains |
| 40% fragility threshold | T4 stateless integrity testing | n=5 per variant across Q1/Q4/Q8 tiers | Cross-tier validation; T4: 5/5 stateless vs 2/5 implicit success | Healthcare appointment scenarios (W1), slot-filling validation |
| 10% redundancy cutoff | T7 bounded adaptation, T9 fallback complexity | Component tracking across representative task cycles | Degeneracy detection validated through repeated measurements | Navigation (W2), diagnostics (W3), 15-30ms latency improvements |
| 10% semantic drift | T10 quantization tier matching | n=5 per tier (Q1/Q4/Q8) comparison | Dynamic tier selection validated through categorical differences | Real-time capability matching, 85% Q1 retention, 95% Q4 success |
| 60-token minimum | T1, T2, T3, T6 progressive compression | n=5 per variant across multiple token budgets | 94% success rate maintained at 60-token floor | Universal across all three walkthroughs (W1/W2/W3) |
| ≤2 loop maximum | T3 fallback validation, T5 semantic drift analysis | Multiple recovery sequence tests | T5: 2/4 semantic drift beyond 3 steps validates ≤2 threshold | Bounded clarification prevents runaway loops (W1/W3) |

E.5 Practitioner Implementation Checklist

Pre-deployment Diagnostic Checklist (a consolidated threshold sketch follows the list):

  • Capability Plateau: Prompt complexity stays within 90-130 token efficiency range
  • Memory Independence: Agent achieves ≥90% accuracy without persistent state (T4 validation)
  • Component Utilization: All tools/modules show ≥10% usage or are removed (T7 degeneracy detection)
  • Semantic Stability: <10% drift between quantization tiers under normal operation (T10 monitoring)
  • Prompt Resilience: Maintains ≥94% success rate down to 60-token compression (T1/T6 floor)
  • Fallback Bounds: Recovery sequences terminate within ≤2 loops maximum (T5 drift prevention)
  • Context Regeneration: Stateless reconstruction maintains ≥90% accuracy (T4: 5/5 explicit slot passing)
  • Tier Optimization: Q4 selected as default with automatic Q1→Q4→Q8 escalation protocols (T10 validation)
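
For convenience, these checklist thresholds can be kept in a single configuration block; the key names below are illustrative, and the values restate the thresholds listed above.

# Consolidated MCD thresholds from the checklist above (key names illustrative).
MCD_THRESHOLDS = {
    "plateau_token_range": (90, 130),    # Capability Plateau (T1, T6)
    "max_state_dependence": 0.40,        # Memory Fragility (T4)
    "min_stateless_accuracy": 0.90,      # Context Regeneration (T4)
    "min_component_utilization": 0.10,   # Toolchain Redundancy (T7)
    "max_semantic_drift": 0.10,          # Semantic Drift Monitor (T10)
    "min_success_at_60_tokens": 0.94,    # Prompt Resilience (T1/T6)
    "max_fallback_loops": 2,             # Fallback Bounds (T5)
    "default_quantization_tier": "Q4",   # Tier Optimization (T10)
}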

E.6 Integration with Simulation and Walkthrough Testing

Cross-Reference Map:

  • Chapter 4: Theoretical foundations for MCD principles and diagnostic development
  • Chapter 5: Implementation integration across architectural layers (Sections 5.1-5.7)
  • Chapter 6: Empirical validation of all diagnostic thresholds (Tests T1-T10)
  • Chapter 7: Real-world application validation in healthcare (W1), navigation (W2), diagnostics (W3)
  • Chapter 8: Comparative analysis against full-stack frameworks using these heuristics

Table E.4: Validation Cross-Reference Matrix

| Test/Walkthrough | Primary Heuristics Validated | Secondary Heuristics | Domain Application |
| --- | --- | --- | --- |
| T1-T3 | Capability Plateau Detector, Prompt Collapse Diagnostic | Semantic Drift Monitor | Token efficiency analysis, progressive compression |
| T4-T5 | Memory Fragility Score, Context Reconstruction Validator | Fallback Loop Complexity | Stateless operation validation, semantic drift detection |
| T6-T9 | Toolchain Redundancy Estimator, Capability Plateau Detector | Fallback Loop Complexity | Component optimization, over-engineering detection |
| T10 | Quantization Tier Optimizer, Semantic Drift Monitor | All heuristics integrated | Dynamic capability matching, tier-based routing |
| W1 Healthcare | Memory Fragility Score, Context Reconstruction | Semantic Drift Monitor, Capability Plateau | Appointment booking, dynamic slot-filling (Section 5.2.1) |
| W2 Navigation | Semantic Drift Monitor, Quantization Tier Optimizer | Toolchain Redundancy | Robotic pathfinding, semi-static deterministic logic |
| W3 Diagnostics | Capability Plateau Detector, Toolchain Redundancy | All heuristics | Edge monitoring systems, heuristic classification |

This diagnostic framework ensures reproducible, statistically valid results while maintaining the ecological validity of real-world deployment constraints. All thresholds were optimized for the browser-based WebAssembly execution environments typical of the edge AI deployment scenarios validated in T8.

End of Appendix E