Designing Lightweight AI Agents for Edge Deployment
A Minimal Capability Framework with Insights from Literature Synthesis
This appendix documents the configuration environment and experimental setup, including hardware specifications, model pools, memory and token budget parameters, validation frameworks, and the reproducibility protocols that underpin the reliability of the study. The configuration framework is designed to produce reproducible, statistically valid results while preserving the ecological validity of real-world deployment constraints. All parameters were tuned for the browser-based execution environments typical of edge AI deployment scenarios.
B.1.1 Hardware Configuration
The MCD framework validation was conducted using the following standardized hardware configuration to ensure reproducibility and constraint-representative testing conditions.
Primary Testing Platform:
Component | Specification |
---|---|
Platform | Windows 11 (NT 10.0, Win64 x64) |
Memory | 8GB RAM |
CPU Cores | 8 cores |
GPU Support | WebGPU Available |
Browser | Chrome 140.0.0.0 (also tested on Edge 140.0.0.0) |
Runtime Environment | WebAssembly (WASM) with local browser execution |
Browser Engine Details:
- User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36 Edg/140.0.0.0
- JavaScript Engine: V8
- WebGPU: Supported and Available
- WebAssembly: Full WASM support enabled
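As an illustrative sketch (not part of the study's harness), the capabilities listed above can be probed at runtime through standard browser APIs; `navigator.deviceMemory` and `navigator.gpu` are Chromium-oriented and therefore treated as optional here.

```typescript
// Sketch: runtime check of the environment capabilities listed above.
// navigator.gpu and navigator.deviceMemory are not available in every
// browser, so both are treated as optional.
interface EnvReport {
  webgpu: boolean;
  wasm: boolean;
  cores: number;
  approxMemoryGB?: number; // coarse hint, Chromium-only
  userAgent: string;
}

async function probeEnvironment(): Promise<EnvReport> {
  const webgpu =
    "gpu" in navigator &&
    (await (navigator as any).gpu.requestAdapter()) !== null;

  return {
    webgpu,
    wasm: typeof WebAssembly === "object",
    cores: navigator.hardwareConcurrency ?? 1,
    approxMemoryGB: (navigator as any).deviceMemory,
    userAgent: navigator.userAgent,
  };
}
```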
B.1.2 Model Configuration and Quantization Tiers
Available Model Pool:
The testing framework had access to 135+ quantized models spanning different parameter scales and optimization levels, enabling comprehensive validation coverage across diverse architectures.
Primary Test Models by Quantization Tier:
Q1 Tier (Ultra-Minimal)
- Primary Model: Qwen2-0.5B-Instruct-q4f16_1-MLC
- Backup Model: SmolLM2-360M-Instruct-q4f16_1-MLC
- Memory Target: <300MB RAM
- Validated Performance: 85% retention under Q1 constraints (T10)
- Use Case: Ultra-constrained environments, proof-of-concept validation, simple FAQ/classification tasks
Q4 Tier (Optimal Balance)
- Primary Model: TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC
- Secondary Model: Qwen2.5-0.5B-Instruct-q4f16_1-MLC
- Memory Target: 500-700MB (stable at 560MB typical)
- Validated Performance: Optimal for 80% of tasks, 430ms average latency (T8/T10)
- Use Case: Production deployment, optimal efficiency-quality balance
Q8 Tier (Strategic Fallback)
- Primary Model: Llama-3.2-1B-Instruct-q4f32_1-MLC
- Secondary Model: Llama-3.1-8B-Instruct-q4f16_1-MLC-1k
- Memory Target: 600-1200MB (800MB typical for 1B models)
- Escalation Trigger: Engaged when Q4 drift exceeds 10% or performance falls below the 80% threshold
- Use Case: Complex reasoning, multi-step diagnostics, Q4 escalation fallback
Extended Model Pool (Validation Coverage):
- Llama Family: 3.2-1B, 3.2-3B, 3.1-8B variants
- Qwen Family: 2.5 series (0.5B-7B), 3.0 series (0.6B-8B)
- Specialized Models: DeepSeek-R1-Distill, Hermes-3, Phi-3.5, SmolLM2
- Domain-Specific: WizardMath-7B, Qwen2.5-Coder series
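For reference, the tier-to-model assignments and memory budgets above can be expressed as a single configuration object. The TypeScript sketch below simply restates the values already listed; the type and field names are illustrative rather than taken from the study's codebase.

```typescript
// Sketch: tier configuration mirroring the model pool above.
// Model identifiers are the MLC model IDs listed in this section;
// memoryTargetMB reflects the tier budgets, not a measured guarantee.
type Tier = "Q1" | "Q4" | "Q8";

interface TierConfig {
  primaryModel: string;
  backupModel: string;
  memoryTargetMB: [number, number]; // [lower, upper] budget
}

const TIERS: Record<Tier, TierConfig> = {
  Q1: {
    primaryModel: "Qwen2-0.5B-Instruct-q4f16_1-MLC",
    backupModel: "SmolLM2-360M-Instruct-q4f16_1-MLC",
    memoryTargetMB: [0, 300],
  },
  Q4: {
    primaryModel: "TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC",
    backupModel: "Qwen2.5-0.5B-Instruct-q4f16_1-MLC",
    memoryTargetMB: [500, 700],
  },
  Q8: {
    primaryModel: "Llama-3.2-1B-Instruct-q4f32_1-MLC",
    backupModel: "Llama-3.1-8B-Instruct-q4f16_1-MLC-1k",
    memoryTargetMB: [600, 1200],
  },
};
```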
B.2.1 Token Budget Configuration
Tier-Specific Token Limits:
Tier | Max Tokens | Temperature | Top-P | Frequency Penalty | Presence Penalty |
---|---|---|---|---|---|
Q1 | 60-90 | 0.0 | 0.85 | 0.3 | 0.1 |
Q4 | 90-130 | 0.1 | 0.8 | 0.5 | 0.3 |
Q8 | 130-200 | 0.2-0.3 | 0.8-0.9 | 0.1-0.5 | 0.05-0.3 |
Rationale: Token ranges reflect the capability plateau findings in Section 8.3, which identify 90-130 tokens as the optimal efficiency zone before diminishing returns.
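For concreteness, the table's sampling settings can be bundled into per-tier generation options. The sketch below assumes OpenAI-style parameter names (max_tokens, top_p, and so on); where the table gives a range, the token value uses the upper bound and the other values use the midpoint.

```typescript
// Sketch: per-tier sampling parameters from the table above, expressed as
// OpenAI-style generation options. Field names are an assumption about the
// inference API; ranged values are collapsed as described in the lead-in.
interface GenerationParams {
  max_tokens: number;
  temperature: number;
  top_p: number;
  frequency_penalty: number;
  presence_penalty: number;
}

const GENERATION: Record<"Q1" | "Q4" | "Q8", GenerationParams> = {
  Q1: { max_tokens: 90,  temperature: 0.0,  top_p: 0.85, frequency_penalty: 0.3, presence_penalty: 0.1 },
  Q4: { max_tokens: 130, temperature: 0.1,  top_p: 0.8,  frequency_penalty: 0.5, presence_penalty: 0.3 },
  Q8: { max_tokens: 200, temperature: 0.25, top_p: 0.85, frequency_penalty: 0.3, presence_penalty: 0.175 },
};
```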
Prompt Engineering Parameters:
- System Prompt: Null (stateless by design, Section 4.2)
- Dynamic Prompting: Enabled for all tiers (adaptation to task complexity)
- Template Protection: Added to prevent placeholder/formal letter contamination
- Context Window: Optimized per model (1k-4k tokens depending on architecture)
B.2.2 Memory Management Configuration
Memory Monitoring Protocol:
- Pre-execution Memory: Baseline measurement before each test iteration
- Post-execution Memory: Memory usage after completion
- Memory Delta: Tracked for resource efficiency scoring
- Stability Threshold: Memory drift within ±50MB considered stable for deployment
- Memory Budget: <512MB target (T8 validation), 1GB absolute maximum
Resource Limits:
- Latency Budget: <500ms average (T8 threshold), 2000ms maximum per query
- CPU Usage: Monitored but not limited (informational metric)
- Browser Stability: Crash detection and recovery enabled
- Batch Processing: Disabled to ensure test isolation and independent measurements
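A minimal sketch of the pre/post measurement loop described above is shown below. It assumes the non-standard, Chromium-only `performance.memory` API as a JS-heap proxy for model memory; the helper names are illustrative.

```typescript
// Sketch: per-trial memory delta and latency measurement.
// performance.memory is non-standard (Chromium only) and reports the JS heap,
// so it is a proxy for, not an exact measure of, model memory.
const STABILITY_THRESHOLD_MB = 50; // ±50MB treated as stable

function heapMB(): number {
  const mem = (performance as any).memory;
  return mem ? mem.usedJSHeapSize / (1024 * 1024) : NaN;
}

async function measureTrial<T>(run: () => Promise<T>) {
  const memBefore = heapMB();            // pre-execution baseline
  const t0 = performance.now();
  const result = await run();
  const latencyMs = performance.now() - t0;  // compared against the <500ms / 2000ms budgets
  const memDeltaMB = heapMB() - memBefore;   // post-execution delta

  return {
    result,
    latencyMs,
    memDeltaMB,
    stable: Math.abs(memDeltaMB) <= STABILITY_THRESHOLD_MB,
  };
}
```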
B.3.1 Validation Settings
Statistical Configuration:
- Repeated Trials Design: n=5 independent measurements per variant
- Statistical Analysis:
  - Categorical outcomes: Fisher's Exact Test for binary completion rates
  - Continuous metrics: Descriptive statistics (mean, median, range)
- Confidence Intervals: 95% CI (Wilson score method) calculated for completion rates
- Sample Acknowledgment: Limited statistical power (n=5 per variant); validation relies on extreme effect sizes and cross-tier replication (Q1/Q4/Q8)
- Random Seed: Fixed for reproducibility across test iterations
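The Wilson interval referenced above can be computed directly from the number of successes and trials; the sketch below uses the standard formula with z = 1.96 for 95% coverage.

```typescript
// Sketch: 95% Wilson score interval for a completion rate of k successes in
// n trials (z = 1.96). Shown only to make the CI computation above concrete.
function wilsonCI(k: number, n: number, z = 1.96): [number, number] {
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Example: wilsonCI(5, 5) ≈ [0.57, 1.00] — even a perfect 5/5 run yields a
// wide interval at n = 5, which is why validation leans on extreme effect
// sizes and cross-tier replication rather than statistical power alone.
```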
Measurement Tools:
- Primary: `performance.now()` API for high-resolution timing measurements
- Secondary: Browser DevTools integration for resource monitoring
- Validation: Cross-platform compatibility testing (Chrome, Firefox, Edge)
- Error Handling: Comprehensive failure classification and logging
B.3.2 Domain-Specific Parameters
W1: Healthcare Appointment Booking Domain (Chapter 7.2)
- Slot Requirements: Doctor type, Date, Time, Patient Name, Reason for visit
- Validation Rules: Date format validation, time slot availability, doctor specialization matching
- Success Criteria: ≥4/5 slots correctly extracted
- Fallback Depth: Maximum 2 clarification loops (bounded rationality, T5 validation)
- Adaptation Pattern: Dynamic slot-filling (Section 5.2.1, Table 5.1)
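A minimal sketch of the W1 success check follows; the slot names mirror the list above, while the extraction and format-validation steps themselves are out of scope here.

```typescript
// Sketch: W1 success check — at least 4 of the 5 booking slots extracted.
// Slot names mirror the list above; date/time format validation and doctor
// specialization matching are applied separately before this check.
const REQUIRED_SLOTS = ["doctorType", "date", "time", "patientName", "reason"] as const;
type Slot = (typeof REQUIRED_SLOTS)[number];

function w1Success(extracted: Partial<Record<Slot, string>>): boolean {
  const filled = REQUIRED_SLOTS.filter(
    (s) => (extracted[s] ?? "").trim().length > 0
  ).length;
  return filled >= 4; // success criterion: ≥4/5 slots correctly extracted
}
```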
W2: Spatial Navigation Domain (Chapter 7.3)
- Safety Classification: Critical path validation required for hazard communication
- Hazard Types: Wet floors, construction zones, restricted areas, accessibility obstacles
- Route Validation: Point-to-point pathfinding accuracy with coordinate calculations
- Memory Constraints: Stateless route recalculation required (T4: 5/5 stateless success)
- Adaptation Pattern: Semi-static deterministic logic (Section 5.2.1, Table 5.1)
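A minimal sketch of the W2 route check follows, assuming a simple 2D coordinate representation (the actual map encoding is domain-specific): total path length plus a hazard-clearance test.

```typescript
// Sketch: W2 route validation — compute point-to-point path length and verify
// that every waypoint keeps a clearance radius from known hazards.
// Coordinates and hazard shapes are illustrative.
interface Point { x: number; y: number; }

function routeLength(route: Point[]): number {
  let total = 0;
  for (let i = 1; i < route.length; i++) {
    total += Math.hypot(route[i].x - route[i - 1].x, route[i].y - route[i - 1].y);
  }
  return total;
}

function avoidsHazards(route: Point[], hazards: Point[], radius: number): boolean {
  return route.every((p) =>
    hazards.every((h) => Math.hypot(p.x - h.x, p.y - h.y) > radius)
  );
}
```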
W3: System Diagnostics Domain (Chapter 7.4)
- Error Categories: Server, Database, User Access, Performance, Communication failures
- Response Structure: Component identification + priority classification (P1/P2/P3) + structured troubleshooting steps
- Technical Depth: Appropriate for Q1 (basic identification) to Q8 (detailed root cause analysis) tiers
- Template Protection: Anti-contamination filters for formal language patterns
- Adaptation Pattern: Dynamic heuristic classification (Section 5.2.1, Table 5.1)
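A minimal sketch of the W3 response structure follows; the error categories mirror the list above, while the field names and the validation rule are illustrative.

```typescript
// Sketch: W3 structured diagnostic response. Error categories mirror the list
// above; the field names and minimal validation rule are illustrative only.
type ErrorCategory = "server" | "database" | "userAccess" | "performance" | "communication";
type Priority = "P1" | "P2" | "P3";

interface DiagnosticResponse {
  component: string;   // component identification
  category: ErrorCategory;
  priority: Priority;  // P1/P2/P3 classification
  steps: string[];     // structured troubleshooting steps
}

function validateDiagnostic(r: DiagnosticResponse): boolean {
  return r.component.length > 0 && r.steps.length > 0;
}
```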
B.4.1 MCD Compliance Scoring
Alignment Metrics (Section 4.2 Principles):
- Minimality Score: Token efficiency relative to semantic value delivered
- Boundedness Score: Adherence to reasoning depth limits (≤3 steps, Section 4.2)
- Degeneracy Score: Component utilization rates (≥10% threshold, T7 validation)
- Stateless Score: Context reconstruction success without persistent memory (T4: 5/5 vs 2/5)
Classification Thresholds:
Category | Score Range | Interpretation |
---|---|---|
MCD-Compliant | ≥0.7 | Full adherence to MCD principles |
MCD-Compatible | 0.4-0.69 | Partial alignment, acceptable with documentation |
Non-MCD | <0.4 | Violates core principles |
Over-Engineered | RI >10 | Redundancy Index exceeds efficiency threshold (T6) |
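A sketch of how the sub-scores and thresholds combine into a label is shown below; equal weighting of the four alignment scores is an assumption, since this section defines the thresholds but not the aggregation rule.

```typescript
// Sketch: combining the four alignment sub-scores into a compliance label.
// Equal weighting is an assumption, not specified in the section.
interface McdScores {
  minimality: number;      // 0–1
  boundedness: number;     // 0–1
  degeneracy: number;      // 0–1
  stateless: number;       // 0–1
  redundancyIndex: number; // RI, over-engineering check (T6)
}

function classifyMcd(s: McdScores): string {
  if (s.redundancyIndex > 10) return "Over-Engineered";
  const score = (s.minimality + s.boundedness + s.degeneracy + s.stateless) / 4;
  if (score >= 0.7) return "MCD-Compliant";
  if (score >= 0.4) return "MCD-Compatible";
  return "Non-MCD";
}
```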
B.4.2 Performance Classification
Tier Performance Categories:
Category | Completion Rate | Resource Usage |
---|---|---|
Excellent | ≥90% | Optimal efficiency, within all constraints |
Good | 75-89% | Acceptable efficiency, minor deviations |
Acceptable | 60-74% | Within memory bounds, performance adequate |
Poor | <60% | Excessive resource consumption or low success |
Edge Deployment Classification:
Category | Latency | Memory | Success Rate |
---|---|---|---|
Edge-Superior | <400ms | <300MB | 100% |
Edge-Optimized | <500ms | <500MB | ≥90% |
Edge-Compatible | <750ms | <700MB | ≥75% |
Edge-Risky | <1000ms | <1GB | ≥60% |
Deployment-Hostile | Exceeds any constraint threshold above | | |
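The thresholds above translate directly into a classifier; the sketch below checks categories from strictest to most permissive and falls through to Deployment-Hostile.

```typescript
// Sketch: edge deployment classification from the thresholds in the table
// above. First matching category wins; anything beyond Edge-Risky is
// Deployment-Hostile.
function classifyEdge(latencyMs: number, memoryMB: number, successRate: number): string {
  if (latencyMs < 400  && memoryMB < 300  && successRate >= 1.0)  return "Edge-Superior";
  if (latencyMs < 500  && memoryMB < 500  && successRate >= 0.9)  return "Edge-Optimized";
  if (latencyMs < 750  && memoryMB < 700  && successRate >= 0.75) return "Edge-Compatible";
  if (latencyMs < 1000 && memoryMB < 1024 && successRate >= 0.6)  return "Edge-Risky";
  return "Deployment-Hostile";
}
```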
B.5.1 Experimental Data Format
Primary Data Structure:
```json
{
  "exportType": "Unified Comprehensive Analysis T1-T10",
  "timestamp": "ISO-8601 format (YYYY-MM-DDTHH:mm:ss.sssZ)",
  "testBedInfo": {
    "environment": "browser",
    "platform": "Win32",
    "memory": "8GB",
    "cores": 8,
    "webgpu": "Supported"
  },
  "selectedModels": {
    "Q1": "Qwen2-0.5B-Instruct-q4f16_1-MLC",
    "Q4": "TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC",
    "Q8": "Llama-3.2-1B-Instruct-q4f32_1-MLC"
  },
  "systemSpecs": "Hardware configuration details",
  "performanceMetrics": "Aggregated results per test variant"
}
```
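For type-safe handling in a TypeScript harness, the same record can be described by an interface; the sketch below mirrors the fields shown above, and the type name is illustrative.

```typescript
// Sketch: TypeScript shape of the export record shown above.
interface UnifiedExport {
  exportType: string;          // e.g. "Unified Comprehensive Analysis T1-T10"
  timestamp: string;           // ISO-8601
  testBedInfo: {
    environment: "browser";
    platform: string;          // navigator.platform, e.g. "Win32"
    memory: string;            // e.g. "8GB"
    cores: number;
    webgpu: string;            // e.g. "Supported"
  };
  selectedModels: Record<"Q1" | "Q4" | "Q8", string>;
  systemSpecs: string;
  performanceMetrics: unknown; // aggregated results per test variant
}
```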
Data Integrity Measures:
- Contamination Detection: Template and placeholder pattern recognition (regex-based filtering)
- Backend Readiness: Model loading and availability verification before test execution
- Tier Optimization: Quantization-specific parameter validation
- Storage Integrity: Complete data capture confirmation with checksum validation
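A minimal sketch of the regex-based contamination check follows; the two pattern families (placeholder brackets and formal-letter salutations) are illustrative examples rather than the study's full filter set.

```typescript
// Sketch: regex-based template/placeholder contamination check.
// The two pattern families below are illustrative examples of the filters
// described above, not an exhaustive list.
const CONTAMINATION_PATTERNS: RegExp[] = [
  /\[(?:Your|Patient|Doctor)?\s*Name\]/i,  // placeholder brackets, e.g. "[Your Name]"
  /\bDear\s+(?:Sir|Madam|Sir\/Madam)\b/i,  // formal-letter salutations
];

function isContaminated(output: string): boolean {
  return CONTAMINATION_PATTERNS.some((re) => re.test(output));
}
```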
B.5.2 Result Classification Schema
Success Determination Criteria:
- Technical Success: Task completion within resource constraints (<512MB RAM, <500ms latency)
- Semantic Success: Meaningful and contextually appropriate responses (human-evaluated)
- MCD Alignment: Adherence to framework principles (≥0.7 compliance score)
- Edge Viability: Deployment compatibility in constrained environments
Failure Categories:
- Technical Failure: Crashes, timeouts, resource exhaustion
- Semantic Failure: Hallucination, irrelevant responses, safety violations
- Framework Violation: Non-compliance with MCD principles (e.g., unbounded loops, >3 reasoning steps)
- Template Contamination: Use of placeholder text or formal letter patterns (e.g., "[Your Name]", "Dear Sir/Madam")
B.6.1 Environment Standardization
Browser Configuration:
- Cache Management: Cleared before each test session
- Extension Isolation: Clean browser profiles used (no extensions enabled)
- Network Conditions: Local execution only, no external API calls
- Resource Monitoring: Real-time memory and CPU tracking via DevTools
Model Loading Protocol:
- Pre-load Phase: All three tiers (Q1/Q4/Q8) loaded before testing begins
- Warm-up Period: Initial inference run to stabilize performance baseline
- Baseline Measurement: Resource usage recorded before first test iteration
- Isolation Protocol: Memory reset between test variants to ensure independence
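A sketch of the loading protocol's ordering is shown below; `loadEngine` and `Engine.generate` are hypothetical stand-ins for whichever in-browser inference runtime is used, so only the sequencing reflects the protocol above.

```typescript
// Sketch: pre-load / warm-up / baseline ordering from the protocol above.
// loadEngine and Engine.generate are hypothetical stand-ins for the actual
// in-browser inference runtime.
interface Engine { generate(prompt: string): Promise<string>; }
declare function loadEngine(modelId: string): Promise<Engine>;

async function prepareTiers(models: Record<"Q1" | "Q4" | "Q8", string>) {
  const engines: Record<string, Engine> = {};
  // Pre-load phase: all three tiers loaded before testing begins.
  for (const [tier, id] of Object.entries(models)) {
    engines[tier] = await loadEngine(id);
    // Warm-up period: one inference to stabilize the performance baseline.
    await engines[tier].generate("warm-up");
  }
  // Baseline measurement: resource usage recorded before the first test run.
  const baselineHeapMB =
    ((performance as any).memory?.usedJSHeapSize ?? NaN) / (1024 * 1024);
  return { engines, baselineHeapMB };
}
```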
B.6.2 Statistical Validity Assurance
Randomization Controls:
- Test Order: Randomized variant presentation to control order effects
- Model Selection: Systematic tier progression (Q1→Q4→Q8) for escalation validation
- Cross-Validation: Stratified sampling across approach types (MCD, Few-Shot, CoT, etc.)
- Temporal Controls: Time-of-day effects minimized through session distribution
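Test-order randomization with a fixed seed can be reproduced with any small seeded PRNG; the sketch below pairs mulberry32 with a Fisher-Yates shuffle, which is an illustrative choice rather than the study's exact generator.

```typescript
// Sketch: reproducible test-order randomization. mulberry32 is one common
// small seeded PRNG (an illustrative choice); Fisher-Yates gives an unbiased
// ordering of the test variants.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function shuffledOrder<T>(variants: T[], seed: number): T[] {
  const rng = mulberry32(seed);
  const out = [...variants];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```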
Quality Assurance:
- Inter-Rater Reliability: Automated scoring validation with manual spot-checking (10% sample)
- Test-Retest Stability: Repeated measures for key findings (n=5 per variant)
- External Validation: Cross-platform compatibility verification (Chrome, Firefox, Edge)
- Data Auditing: Complete experimental trace logging for reproducibility