Appendix B: Configuration

This appendix documents the configuration environment and experimental setup, including hardware specifications, model pools, memory and token budget parameters, validation frameworks, and the reproducibility protocols that underpin the reliability of the study.
This configuration framework ensures reproducible, statistically valid results while maintaining the ecological validity of real-world deployment constraints. All parameters were optimized for browser-based execution environments typical of edge AI deployment scenarios.

B.1 Test Environment Specifications

B.1.1 Hardware Configuration

The MCD framework validation was conducted using the following standardized hardware configuration to ensure reproducibility and constraint-representative testing conditions.

Primary Testing Platform:

Component           | Specification
Platform            | Windows 11 (NT 10.0, Win64 x64)
Memory              | 8GB RAM
CPU Cores           | 8 cores
GPU Support         | WebGPU available
Browser             | Chrome 140.0.0.0 (also tested on Edge 140.0.0.0)
Runtime Environment | WebAssembly (WASM) with local browser execution

Browser Engine Details:

  • User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36 Edg/140.0.0.0
  • JavaScript Engine: V8
  • WebGPU: Supported and Available
  • WebAssembly: Full WASM support enabled
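For reference, a minimal TypeScript sketch of a capability probe for this environment is shown below. navigator.gpu, the WebAssembly global, and navigator.hardwareConcurrency are standard browser APIs; the surrounding function and type names are illustrative assumptions rather than part of the test harness.

// Minimal capability probe for the browser test environment (illustrative).
// navigator.gpu (WebGPU), the WebAssembly global, and hardwareConcurrency are
// standard browser APIs; the surrounding names are assumptions.
interface RuntimeCapabilities {
  webgpu: boolean;
  wasm: boolean;
  cores: number;
  userAgent: string;
}

async function probeCapabilities(): Promise<RuntimeCapabilities> {
  const adapter = "gpu" in navigator
    ? await (navigator as any).gpu.requestAdapter()
    : null;
  return {
    webgpu: adapter !== null,                  // WebGPU supported and available
    wasm: typeof WebAssembly === "object",     // full WASM support enabled
    cores: navigator.hardwareConcurrency ?? 1, // 8 on the reference platform
    userAgent: navigator.userAgent,
  };
}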

B.1.2 Model Configuration and Quantization Tiers

Available Model Pool:

The testing framework included access to 135+ quantized models across different parameter scales and optimization levels, enabling comprehensive validation coverage across diverse architectures.

Primary Test Models by Quantization Tier:

Q1 Tier (Ultra-Minimal)

  • Primary Model: Qwen2-0.5B-Instruct-q4f16_1-MLC
  • Backup Model: SmolLM2-360M-Instruct-q4f16_1-MLC
  • Memory Target: <300MB RAM
  • Validated Performance: 85% retention under Q1 constraints (T10)
  • Use Case: Ultra-constrained environments, proof-of-concept validation, simple FAQ/classification tasks

Q4 Tier (Optimal Balance)

  • Primary Model: TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC
  • Secondary Model: Qwen2.5-0.5B-Instruct-q4f16_1-MLC
  • Memory Target: 500-700MB (typically stable around 560MB)
  • Validated Performance: Optimal for 80% of tasks, 430ms average latency (T8/T10)
  • Use Case: Production deployment, optimal efficiency-quality balance

Q8 Tier (Strategic Fallback)

  • Primary Model: Llama-3.2-1B-Instruct-q4f32_1-MLC
  • Secondary Model: Llama-3.1-8B-Instruct-q4f16_1-MLC-1k
  • Memory Target: 600-1200MB (800MB typical for 1B models)
  • Validated Performance: Engaged when Q4 drift exceeds 10% or performance falls below the 80% threshold
  • Use Case: Complex reasoning, multi-step diagnostics, Q4 escalation fallback

Extended Model Pool (Validation Coverage):

  • Llama Family: 3.2-1B, 3.2-3B, 3.1-8B variants
  • Qwen Family: 2.5 series (0.5B-7B), 3.0 series (0.6B-8B)
  • Specialized Models: DeepSeek-R1-Distill, Hermes-3, Phi-3.5, SmolLM2
  • Domain-Specific: WizardMath-7B, Qwen2.5-Coder series
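The primary tier selections above can be consolidated into a single configuration record, sketched below in TypeScript. The model identifiers and memory targets come from the lists above; the type and field names are assumptions made for readability.

// Illustrative consolidation of the primary tier selections (Q1/Q4/Q8).
// Model identifiers and memory targets are from the lists above; the type
// and field names are assumptions.
type Tier = "Q1" | "Q4" | "Q8";

interface TierConfig {
  primaryModel: string;
  backupModel: string;
  memoryTargetMB: [number, number]; // [lower, upper] bound in MB
}

const TIER_CONFIG: Record<Tier, TierConfig> = {
  Q1: {
    primaryModel: "Qwen2-0.5B-Instruct-q4f16_1-MLC",
    backupModel: "SmolLM2-360M-Instruct-q4f16_1-MLC",
    memoryTargetMB: [0, 300],
  },
  Q4: {
    primaryModel: "TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC",
    backupModel: "Qwen2.5-0.5B-Instruct-q4f16_1-MLC",
    memoryTargetMB: [500, 700],
  },
  Q8: {
    primaryModel: "Llama-3.2-1B-Instruct-q4f32_1-MLC",
    backupModel: "Llama-3.1-8B-Instruct-q4f16_1-MLC-1k",
    memoryTargetMB: [600, 1200],
  },
};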

B.2 Execution Parameters

B.2.1 Token Budget Configuration

Tier-Specific Token Limits:

Tier | Max Tokens | Temperature | Top-P   | Frequency Penalty | Presence Penalty
Q1   | 60-90      | 0.0         | 0.85    | 0.3               | 0.1
Q4   | 90-130     | 0.1         | 0.8     | 0.5               | 0.3
Q8   | 130-200    | 0.2-0.3     | 0.8-0.9 | 0.1-0.5           | 0.05-0.3

Rationale: Token ranges reflect the capability plateau findings of Section 8.3, which identified 90-130 tokens as the optimal efficiency zone before diminishing returns.
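As a sketch, the tier-specific limits above map onto standard sampling parameters roughly as follows. Token limits use the upper bound of each range and Q8's ranged values use midpoints; the field names follow common chat-completion conventions and are assumptions, not a binding to a specific runtime API.

// Illustrative mapping of the tier-specific token budgets to standard
// sampling parameters. Token limits use the upper bound of each range;
// Q8's ranged values use midpoints. Field names are assumptions.
interface GenerationParams {
  maxTokens: number;
  temperature: number;
  topP: number;
  frequencyPenalty: number;
  presencePenalty: number;
}

const GENERATION_PARAMS: Record<"Q1" | "Q4" | "Q8", GenerationParams> = {
  Q1: { maxTokens: 90,  temperature: 0.0,  topP: 0.85, frequencyPenalty: 0.3, presencePenalty: 0.1 },
  Q4: { maxTokens: 130, temperature: 0.1,  topP: 0.8,  frequencyPenalty: 0.5, presencePenalty: 0.3 },
  Q8: { maxTokens: 200, temperature: 0.25, topP: 0.85, frequencyPenalty: 0.3, presencePenalty: 0.175 },
};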

Prompt Engineering Parameters:

  • System Prompt: Null (stateless by design, Section 4.2)
  • Dynamic Prompting: Enabled for all tiers (adaptation to task complexity)
  • Template Protection: Added to prevent placeholder/formal letter contamination
  • Context Window: Optimized per model (1k-4k tokens depending on architecture)

B.2.2 Memory Management Configuration

Memory Monitoring Protocol:

  • Pre-execution Memory: Baseline measurement before each test iteration
  • Post-execution Memory: Memory usage after completion
  • Memory Delta: Tracked for resource efficiency scoring
  • Stability Threshold: ±50MB considered stable deployment
  • Memory Budget: <512MB target (T8 validation), 1GB absolute maximum
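A minimal sketch of the memory-delta measurement is given below. It assumes Chrome's non-standard performance.memory API (consistent with the Chrome/Edge test platform); the wrapper names and the stability check are illustrative.

// Minimal sketch of the pre/post memory-delta check. performance.memory is a
// non-standard, Chrome-only API (usedJSHeapSize in bytes); the wrapper names
// are assumptions.
function usedHeapMB(): number {
  const mem = (performance as any).memory;
  return mem ? mem.usedJSHeapSize / (1024 * 1024) : NaN;
}

async function measureMemoryDelta(run: () => Promise<void>): Promise<number> {
  const before = usedHeapMB(); // pre-execution baseline
  await run();                 // single test iteration
  const after = usedHeapMB();  // post-execution reading
  return after - before;       // delta tracked for resource efficiency scoring
}

// ±50MB delta is treated as a stable deployment per the protocol above.
const isStable = (deltaMB: number): boolean => Math.abs(deltaMB) <= 50;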

Resource Limits:

  • Latency Budget: <500ms average (T8 threshold), 2000ms maximum per query
  • CPU Usage: Monitored but not limited (informational metric)
  • Browser Stability: Crash detection and recovery enabled
  • Batch Processing: Disabled to ensure test isolation and independent measurements

B.3 Test Suite Configuration

B.3.1 Validation Settings

Statistical Configuration:

  • Repeated Trials Design: n=5 independent measurements per variant
  • Statistical Analysis:
    • Categorical outcomes: Fisher's Exact Test for binary completion rates
    • Continuous metrics: Descriptive statistics (mean, median, range)
  • Confidence Intervals: 95% CI (Wilson score method) calculated for completion rates
  • Sample Acknowledgment: Limited statistical power (n=5 per variant); validation relies on extreme effect sizes and cross-tier replication (Q1/Q4/Q8)
  • Random Seed: Fixed for reproducibility across test iterations
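For completeness, a sketch of the 95% Wilson score interval applied to completion rates is shown below; z = 1.96 corresponds to the 95% level, and the function name is illustrative.

// Wilson score interval for a binomial completion rate (95% CI, z = 1.96).
// successes = completed trials, n = total trials (n = 5 per variant here).
function wilsonInterval(successes: number, n: number, z = 1.96): [number, number] {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Example: 5/5 completions with n = 5 yields approximately [0.57, 1.0].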

Measurement Tools:

  • Primary: performance.now() API for high-resolution timing measurements
  • Secondary: Browser DevTools integration for resource monitoring
  • Validation: Cross-platform compatibility testing (Chrome, Firefox, Edge)
  • Error Handling: Comprehensive failure classification and logging
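A sketch of the high-resolution latency measurement around performance.now() follows; only performance.now() itself is part of the documented toolchain, and the wrapper is an illustrative assumption.

// Illustrative latency measurement using the standard performance.now() API.
// The wrapper and its return shape are assumptions.
async function timeQuery<T>(run: () => Promise<T>): Promise<{ result: T; latencyMs: number }> {
  const start = performance.now();
  const result = await run();
  const latencyMs = performance.now() - start; // compared against the <500ms budget
  return { result, latencyMs };
}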

B.3.2 Domain-Specific Parameters

W1: Healthcare Appointment Booking Domain (Chapter 7.2)

  • Slot Requirements: Doctor type, Date, Time, Patient Name, Reason for visit
  • Validation Rules: Date format validation, time slot availability, doctor specialization matching
  • Success Criteria: ≥4/5 slots correctly extracted
  • Fallback Depth: Maximum 2 clarification loops (bounded rationality, T5 validation)
  • Adaptation Pattern: Dynamic slot-filling (Section 5.2.1, Table 5.1)
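As an illustration of the W1 success criterion, a minimal sketch follows. The five slot names come from the requirements above; the extraction record shape and function name are assumptions, and correctness of individual slot values is validated separately by the domain rules listed above.

// Illustrative W1 success check: at least 4 of the 5 booking slots filled.
// Slot names are from the requirements above; the record shape is an assumption.
const W1_SLOTS = ["doctorType", "date", "time", "patientName", "reason"] as const;
type W1Slot = (typeof W1_SLOTS)[number];

function w1Success(extracted: Partial<Record<W1Slot, string>>): boolean {
  const filled = W1_SLOTS.filter((slot) => (extracted[slot] ?? "").trim().length > 0);
  return filled.length >= 4; // success criterion: ≥4/5 slots correctly extracted
}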

W2: Spatial Navigation Domain (Chapter 7.3)

  • Safety Classification: Critical path validation required for hazard communication
  • Hazard Types: Wet floors, construction zones, restricted areas, accessibility obstacles
  • Route Validation: Point-to-point pathfinding accuracy with coordinate calculations
  • Memory Constraints: Stateless route recalculation required (T4: 5/5 stateless success)
  • Adaptation Pattern: Semi-static deterministic logic (Section 5.2.1, Table 5.1)

W3: System Diagnostics Domain (Chapter 7.4)

  • Error Categories: Server, Database, User Access, Performance, Communication failures
  • Response Structure: Component identification + priority classification (P1/P2/P3) + structured troubleshooting steps
  • Technical Depth: Scaled from Q1 (basic identification) to Q8 (detailed root cause analysis)
  • Template Protection: Anti-contamination filters for formal language patterns
  • Adaptation Pattern: Dynamic heuristic classification (Section 5.2.1, Table 5.1)
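The W3 response structure can be sketched as a typed record; the error categories and priority labels come from the list above, while the field names and record shape are assumptions.

// Illustrative W3 response structure. Categories and priority labels are from
// the list above; the field names and record shape are assumptions.
type ErrorCategory = "Server" | "Database" | "User Access" | "Performance" | "Communication";
type Priority = "P1" | "P2" | "P3";

interface DiagnosticResponse {
  component: string;              // component identification
  category: ErrorCategory;
  priority: Priority;             // priority classification
  troubleshootingSteps: string[]; // structured steps, depth scaled from Q1 to Q8
}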

B.4 Validation Framework Configuration

B.4.1 MCD Compliance Scoring

Alignment Metrics (Section 4.2 Principles):

  • Minimality Score: Token efficiency relative to semantic value delivered
  • Boundedness Score: Adherence to reasoning depth limits (≤3 steps, Section 4.2)
  • Degeneracy Score: Component utilization rates (≥10% threshold, T7 validation)
  • Stateless Score: Context reconstruction success without persistent memory (T4: 5/5 vs 2/5)

Classification Thresholds:

Category        | Score Range | Interpretation
MCD-Compliant   | ≥0.7        | Full adherence to MCD principles
MCD-Compatible  | 0.4-0.69    | Partial alignment, acceptable with documentation
Non-MCD         | <0.4        | Violates core principles
Over-Engineered | RI >10      | Redundancy Index exceeds efficiency threshold (T6)
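A sketch of the threshold logic is given below; the cut-offs are those in the table, while the function name, parameter shape, and the precedence given to the Redundancy Index check are assumptions.

// Illustrative classification using the thresholds above. The function name,
// parameter shape, and the RI-first ordering are assumptions; the cut-offs
// come from the table.
type ComplianceCategory = "MCD-Compliant" | "MCD-Compatible" | "Non-MCD" | "Over-Engineered";

function classifyCompliance(score: number, redundancyIndex: number): ComplianceCategory {
  if (redundancyIndex > 10) return "Over-Engineered"; // RI check ordered first (assumption)
  if (score >= 0.7) return "MCD-Compliant";
  if (score >= 0.4) return "MCD-Compatible";
  return "Non-MCD";
}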

B.4.2 Performance Classification

Tier Performance Categories:

Category   | Completion Rate | Resource Usage
Excellent  | ≥90%            | Optimal efficiency, within all constraints
Good       | 75-89%          | Acceptable efficiency, minor deviations
Acceptable | 60-74%          | Within memory bounds, performance adequate
Poor       | <60%            | Excessive resource consumption or low success

Edge Deployment Classification:

Category           | Latency | Memory | Success Rate
Edge-Superior      | <400ms  | <300MB | 100%
Edge-Optimized     | <500ms  | <500MB | ≥90%
Edge-Compatible    | <750ms  | <700MB | ≥75%
Edge-Risky         | <1000ms | <1GB   | ≥60%
Deployment-Hostile | Exceeds any constraint threshold
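A sketch of how a run maps onto these bands follows. It assumes the tightest band whose limits are all satisfied applies, which is an interpretation of the table rather than something it states; the function name is illustrative.

// Illustrative mapping of a run onto the edge-deployment bands above.
// Assumes the tightest band whose limits are all met applies (interpretation,
// not specified by the table).
type EdgeClass = "Edge-Superior" | "Edge-Optimized" | "Edge-Compatible" | "Edge-Risky" | "Deployment-Hostile";

function classifyEdge(latencyMs: number, memoryMB: number, successRate: number): EdgeClass {
  if (latencyMs < 400 && memoryMB < 300 && successRate >= 1.0) return "Edge-Superior";
  if (latencyMs < 500 && memoryMB < 500 && successRate >= 0.9) return "Edge-Optimized";
  if (latencyMs < 750 && memoryMB < 700 && successRate >= 0.75) return "Edge-Compatible";
  if (latencyMs < 1000 && memoryMB < 1024 && successRate >= 0.6) return "Edge-Risky"; // 1024MB ≈ 1GB cap
  return "Deployment-Hostile";
}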

B.5 Data Collection and Storage

B.5.1 Experimental Data Format

Primary Data Structure:

{
  "exportType": "Unified Comprehensive Analysis T1-T10",
  "timestamp": "ISO-8601 format (YYYY-MM-DDTHH:mm:ss.sssZ)",
  "testBedInfo": {
    "environment": "browser",
    "platform": "Win32",
    "memory": "8GB",
    "cores": 8,
    "webgpu": "Supported"
  },
  "selectedModels": {
    "Q1": "Qwen2-0.5B-Instruct-q4f16_1-MLC",
    "Q4": "TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC",
    "Q8": "Llama-3.2-1B-Instruct-q4f32_1-MLC"
  },
  "systemSpecs": "Hardware configuration details",
  "performanceMetrics": "Aggregated results per test variant"
}
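For readers consuming the export programmatically, an equivalent TypeScript shape is sketched below. Field names mirror the JSON sample above; the interface name and the types of the free-form fields ("systemSpecs", "performanceMetrics") are assumptions.

// Illustrative TypeScript shape of the export record above. Field names mirror
// the JSON sample; the interface name and the free-form field types are assumptions.
interface UnifiedAnalysisExport {
  exportType: string;    // e.g. "Unified Comprehensive Analysis T1-T10"
  timestamp: string;     // ISO-8601 (YYYY-MM-DDTHH:mm:ss.sssZ)
  testBedInfo: {
    environment: "browser";
    platform: string;    // navigator.platform, e.g. "Win32"
    memory: string;      // e.g. "8GB"
    cores: number;
    webgpu: string;      // e.g. "Supported"
  };
  selectedModels: Record<"Q1" | "Q4" | "Q8", string>;
  systemSpecs: unknown;        // hardware configuration details
  performanceMetrics: unknown; // aggregated results per test variant
}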

Data Integrity Measures:

  • Contamination Detection: Template and placeholder pattern recognition (regex-based filtering)
  • Backend Readiness: Model loading and availability verification before test execution
  • Tier Optimization: Quantization-specific parameter validation
  • Storage Integrity: Complete data capture confirmation with checksum validation
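A minimal sketch of the regex-based contamination filter is shown below. The two named patterns ("[Your Name]", "Dear Sir/Madam") are the documented examples from the failure categories in B.5.2; the generic bracketed-placeholder pattern is an assumed extension.

// Minimal sketch of regex-based template-contamination detection. The first
// two patterns are the documented examples (see B.5.2); the third is an
// assumed generalization.
const CONTAMINATION_PATTERNS: RegExp[] = [
  /\[your name\]/i,      // placeholder text
  /dear\s+sir\/?madam/i, // formal letter opening
  /\[[a-z ]+\]/i,        // generic bracketed placeholders (assumed extension)
];

function isContaminated(output: string): boolean {
  return CONTAMINATION_PATTERNS.some((re) => re.test(output));
}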

B.5.2 Result Classification Schema

Success Determination Criteria:

  • Technical Success: Task completion within resource constraints (<512MB RAM, <500ms latency)
  • Semantic Success: Meaningful and contextually appropriate responses (human-evaluated)
  • MCD Alignment: Adherence to framework principles (≥0.7 compliance score)
  • Edge Viability: Deployment compatibility in constrained environments

Failure Categories:

  • Technical Failure: Crashes, timeouts, resource exhaustion
  • Semantic Failure: Hallucination, irrelevant responses, safety violations
  • Framework Violation: Non-compliance with MCD principles (e.g., unbounded loops, >3 reasoning steps)
  • Template Contamination: Use of placeholder text or formal letter patterns (e.g., "[Your Name]", "Dear Sir/Madam")

B.6 Reproducibility Parameters

B.6.1 Environment Standardization

Browser Configuration:

  • Cache Management: Cleared before each test session
  • Extension Isolation: Clean browser profiles used (no extensions enabled)
  • Network Conditions: Local execution only, no external API calls
  • Resource Monitoring: Real-time memory and CPU tracking via DevTools

Model Loading Protocol:

  1. Pre-load Phase: All three tiers (Q1/Q4/Q8) loaded before testing begins
  2. Warm-up Period: Initial inference run to stabilize performance baseline
  3. Baseline Measurement: Resource usage recorded before first test iteration
  4. Isolation Protocol: Memory reset between test variants to ensure independence
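The protocol can be read as the following sketch. The engine object and its loadModel/warmUp methods are hypothetical placeholders rather than a specific runtime's API; only the four protocol steps come from the list above.

// Sketch of the four-step loading protocol above. loadModel and warmUp are
// hypothetical placeholders for engine-level operations, not a specific
// runtime's API.
async function prepareSession(engine: {
  loadModel: (tier: "Q1" | "Q4" | "Q8") => Promise<void>;
  warmUp: (tier: "Q1" | "Q4" | "Q8") => Promise<void>;
}): Promise<number> {
  for (const tier of ["Q1", "Q4", "Q8"] as const) {
    await engine.loadModel(tier); // 1. pre-load all three tiers before testing
    await engine.warmUp(tier);    // 2. warm-up inference to stabilize the baseline
  }
  // 3. baseline resource reading before the first test iteration (Chrome-only heap API)
  return ((performance as any).memory?.usedJSHeapSize ?? 0) / (1024 * 1024);
}
// 4. isolation: memory is reset between test variants (engine-specific; not shown here).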

B.6.2 Statistical Validity Assurance

Randomization Controls:

  • Test Order: Randomized variant presentation to control order effects
  • Model Selection: Systematic tier progression (Q1→Q4→Q8) for escalation validation
  • Cross-Validation: Stratified sampling across approach types (MCD, Few-Shot, CoT, etc.)
  • Temporal Controls: Time-of-day effects minimized through session distribution
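The fixed-seed randomization of test order can be sketched as a seeded Fisher-Yates shuffle. The mulberry32 generator below is a well-known small PRNG chosen for illustration; the study's actual generator is not specified.

// Illustrative seeded shuffle for randomized variant order with a fixed seed.
// mulberry32 is a standard small PRNG used here only as an example.
function mulberry32(seed: number): () => number {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function shuffledOrder<T>(variants: T[], seed: number): T[] {
  const rand = mulberry32(seed);
  const order = [...variants];
  for (let i = order.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1)); // Fisher-Yates swap
    [order[i], order[j]] = [order[j], order[i]];
  }
  return order;
}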

Quality Assurance:

  • Inter-Rater Reliability: Automated scoring validation with manual spot-checking (10% sample)
  • Test-Retest Stability: Repeated measures for key findings (n=5 per variant)
  • External Validation: Cross-platform compatibility verification (Chrome, Firefox, Edge)
  • Data Auditing: Complete experimental trace logging for reproducibility

References

Chapter 4, Section 4.6: MCD Subsystem Definitions
Chapter 5: Instantiated Agent Design Patterns
Chapter 6, Tests T1-T10: Empirical validation of layer interactions
Chapter 7, Walkthroughs W1-W3: Applied layer architecture in domain scenarios