Designing Lightweight AI Agents for Edge Deployment
A Minimal Capability Framework with Insights from Literature Synthesis
This appendix documents the configuration environment and experimental setup, including hardware specifications, model pools, memory and token budget parameters, validation frameworks, and the reproducibility protocols that underpin the reliability of the study. The configuration framework is designed to produce reproducible, statistically valid results while preserving the ecological validity of real-world deployment constraints. All parameters were tuned for the browser-based execution environments typical of edge AI deployment scenarios.
B.1.1 Hardware Configuration
The MCD framework validation was conducted using the following standardized hardware configuration to ensure reproducibility and constraint-representative testing conditions.
Primary Testing Platform:
Component | Specification |
---|---|
Platform | Windows 11 (NT 10.0, Win64 x64) |
Memory | 8GB RAM |
CPU Cores | 8 cores |
GPU Support | WebGPU Available |
Browser | Chrome 140.0.0.0 (also tested on Edge 140.0.0.0) |
Runtime Environment | WebAssembly (WASM) with local browser execution |
Browser Engine Details:
- User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36 Edg/140.0.0.0
- JavaScript Engine: V8
- WebGPU: Supported and Available
- WebAssembly: Full WASM support enabled
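As an illustrative sketch (not part of the study's harness), the capabilities listed above can be probed at runtime through standard browser APIs; `navigator.deviceMemory` and `navigator.gpu` are Chromium-oriented and therefore treated as optional here.

```typescript
// Sketch: runtime check of the environment capabilities listed above.
// navigator.gpu and navigator.deviceMemory are not available in every
// browser, so both are treated as optional.
interface EnvReport {
  webgpu: boolean;
  wasm: boolean;
  cores: number;
  approxMemoryGB?: number; // coarse hint, Chromium-only
  userAgent: string;
}

async function probeEnvironment(): Promise<EnvReport> {
  const webgpu =
    "gpu" in navigator &&
    (await (navigator as any).gpu.requestAdapter()) !== null;

  return {
    webgpu,
    wasm: typeof WebAssembly === "object",
    cores: navigator.hardwareConcurrency ?? 1,
    approxMemoryGB: (navigator as any).deviceMemory,
    userAgent: navigator.userAgent,
  };
}
```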
B.1.2 Model Configuration and Quantization Tiers
Available Model Pool:
The testing framework had access to 135+ quantized models spanning different parameter scales and optimization levels, enabling comprehensive validation coverage across diverse architectures.
Primary Test Models by Quantization Tier:
Q1 Tier (Ultra-Minimal)
- Primary Model: Qwen2-0.5B-Instruct-q4f16_1-MLC
- Backup Model: SmolLM2-360M-Instruct-q4f16_1-MLC
- Memory Target: <300MB RAM
- Validated Performance: 85% retention under Q1 constraints (T10)
- Use Case: Ultra-constrained environments, proof-of-concept validation, simple FAQ/classification tasks
Q4 Tier (Optimal Balance)
- Primary Model: TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC
- Secondary Model: Qwen2.5-0.5B-Instruct-q4f16_1-MLC
- Memory Target: 500-700MB (stable at 560MB typical)
- Validated Performance: Optimal for 80% of tasks, 430ms average latency (T8/T10)
- Use Case: Production deployment, optimal efficiency-quality balance
Q8 Tier (Strategic Fallback)
- Primary Model: Llama-3.2-1B-Instruct-q4f32_1-MLC
- Secondary Model: Llama-3.1-8B-Instruct-q4f16_1-MLC-1k
- Memory Target: 600-1200MB (800MB typical for 1B models)
- Escalation Trigger: Engaged when Q4 drift exceeds 10% or performance falls below the 80% threshold
- Use Case: Complex reasoning, multi-step diagnostics, Q4 escalation fallback
Extended Model Pool (Validation Coverage):
- Llama Family: 3.2-1B, 3.2-3B, 3.1-8B variants
- Qwen Family: 2.5 series (0.5B-7B), 3.0 series (0.6B-8B)
- Specialized Models: DeepSeek-R1-Distill, Hermes-3, Phi-3.5, SmolLM2
- Domain-Specific: WizardMath-7B, Qwen2.5-Coder series
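For reference, the tier-to-model assignments and memory budgets above can be expressed as a single configuration object. The TypeScript sketch below simply restates the values already listed; the type and field names are illustrative rather than taken from the study's codebase.

```typescript
// Sketch: tier configuration mirroring the model pool above.
// Model identifiers are the MLC model IDs listed in this section;
// memoryTargetMB reflects the tier budgets, not a measured guarantee.
type Tier = "Q1" | "Q4" | "Q8";

interface TierConfig {
  primaryModel: string;
  backupModel: string;
  memoryTargetMB: [number, number]; // [lower, upper] budget
}

const TIERS: Record<Tier, TierConfig> = {
  Q1: {
    primaryModel: "Qwen2-0.5B-Instruct-q4f16_1-MLC",
    backupModel: "SmolLM2-360M-Instruct-q4f16_1-MLC",
    memoryTargetMB: [0, 300],
  },
  Q4: {
    primaryModel: "TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC",
    backupModel: "Qwen2.5-0.5B-Instruct-q4f16_1-MLC",
    memoryTargetMB: [500, 700],
  },
  Q8: {
    primaryModel: "Llama-3.2-1B-Instruct-q4f32_1-MLC",
    backupModel: "Llama-3.1-8B-Instruct-q4f16_1-MLC-1k",
    memoryTargetMB: [600, 1200],
  },
};
```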
B.2.1 Token Budget Configuration
Tier-Specific Token Limits:
Tier | Max Tokens | Temperature | Top-P | Frequency Penalty | Presence Penalty |
---|---|---|---|---|---|
Q1 | 60-90 | 0.0 | 0.85 | 0.3 | 0.1 |
Q4 | 90-130 | 0.1 | 0.8 | 0.5 | 0.3 |
Q8 | 130-200 | 0.2-0.3 | 0.8-0.9 | 0.1-0.5 | 0.05-0.3 |
Rationale: Token ranges reflect the capability plateau findings in Section 8.3, which identify 90-130 tokens as the optimal efficiency zone before diminishing returns.
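For concreteness, the table's sampling settings can be bundled into per-tier generation options. The sketch below assumes OpenAI-style parameter names (max_tokens, top_p, and so on); where the table gives a range, the token value uses the upper bound and the other values use the midpoint.

```typescript
// Sketch: per-tier sampling parameters from the table above, expressed as
// OpenAI-style generation options. Field names are an assumption about the
// inference API; ranged values are collapsed as described in the lead-in.
interface GenerationParams {
  max_tokens: number;
  temperature: number;
  top_p: number;
  frequency_penalty: number;
  presence_penalty: number;
}

const GENERATION: Record<"Q1" | "Q4" | "Q8", GenerationParams> = {
  Q1: { max_tokens: 90,  temperature: 0.0,  top_p: 0.85, frequency_penalty: 0.3, presence_penalty: 0.1 },
  Q4: { max_tokens: 130, temperature: 0.1,  top_p: 0.8,  frequency_penalty: 0.5, presence_penalty: 0.3 },
  Q8: { max_tokens: 200, temperature: 0.25, top_p: 0.85, frequency_penalty: 0.3, presence_penalty: 0.175 },
};
```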
Prompt Engineering Parameters:
- System Prompt: Null (stateless by design, Section 4.2)
- Dynamic Prompting: Enabled for all tiers (adaptation to task complexity)
- Template Protection: Added to prevent placeholder/formal letter contamination
- Context Window: Optimized per model (1k-4k tokens depending on architecture)
B.2.2 Memory Management Configuration
Memory Monitoring Protocol:
- Pre-execution Memory: Baseline measurement before each test iteration
- Post-execution Memory: Memory usage after completion
- Memory Delta: Tracked for resource efficiency scoring
- Stability Threshold: Memory drift within ±50MB considered stable for deployment
- Memory Budget: <512MB target (T8 validation), 1GB absolute maximum
Resource Limits:
- Latency Budget: <500ms average (T8 threshold), 2000ms maximum per query
- CPU Usage: Monitored but not limited (informational metric)
- Browser Stability: Crash detection and recovery enabled
- Batch Processing: Disabled to ensure test isolation and independent measurements
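A minimal sketch of the pre/post measurement loop described above is shown below. It assumes the non-standard, Chromium-only `performance.memory` API as a JS-heap proxy for model memory; the helper names are illustrative.

```typescript
// Sketch: per-trial memory delta and latency measurement.
// performance.memory is non-standard (Chromium only) and reports the JS heap,
// so it is a proxy for, not an exact measure of, model memory.
const STABILITY_THRESHOLD_MB = 50; // ±50MB treated as stable

function heapMB(): number {
  const mem = (performance as any).memory;
  return mem ? mem.usedJSHeapSize / (1024 * 1024) : NaN;
}

async function measureTrial<T>(run: () => Promise<T>) {
  const memBefore = heapMB();            // pre-execution baseline
  const t0 = performance.now();
  const result = await run();
  const latencyMs = performance.now() - t0;  // compared against the <500ms / 2000ms budgets
  const memDeltaMB = heapMB() - memBefore;   // post-execution delta

  return {
    result,
    latencyMs,
    memDeltaMB,
    stable: Math.abs(memDeltaMB) <= STABILITY_THRESHOLD_MB,
  };
}
```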
B.3.1 Validation Settings
Statistical Configuration:
- Repeated Trials Design: n=5 independent measurements per variant
- Statistical Analysis:
  - Categorical outcomes: Fisher's Exact Test for binary completion rates
  - Continuous metrics: Descriptive statistics (mean, median, range)
- Confidence Intervals: 95% CI (Wilson score method) calculated for completion rates
- Sample Acknowledgment: Limited statistical power (n=5 per variant); validation relies on extreme effect sizes and cross-tier replication (Q1/Q4/Q8)
- Random Seed: Fixed for reproducibility across test iterations
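The Wilson interval referenced above can be computed directly from the number of successes and trials; the sketch below uses the standard formula with z = 1.96 for 95% coverage.

```typescript
// Sketch: 95% Wilson score interval for a completion rate of k successes in
// n trials (z = 1.96). Shown only to make the CI computation above concrete.
function wilsonCI(k: number, n: number, z = 1.96): [number, number] {
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Example: wilsonCI(5, 5) ≈ [0.57, 1.00] — even a perfect 5/5 run yields a
// wide interval at n = 5, which is why validation leans on extreme effect
// sizes and cross-tier replication rather than statistical power alone.
```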
Measurement Tools:
- Primary: `performance.now()` API for high-resolution timing measurements
- Secondary: Browser DevTools integration for resource monitoring
- Validation: Cross-platform compatibility testing (Chrome, Firefox, Edge)
- Error Handling: Comprehensive failure classification and logging
B.3.2 Domain-Specific Parameters
W1: Healthcare Appointment Booking Domain (Chapter 7.2)
- Slot Requirements: Doctor type, Date, Time, Patient Name, Reason for visit
- Validation Rules: Date format validation, time slot availability, doctor specialization matching
- Success Criteria: ≥4/5 slots correctly extracted
- Fallback Depth: Maximum 2 clarification loops (bounded rationality, T5 validation)
- Adaptation Pattern: Dynamic slot-filling (Section 5.2.1, Table 5.1)
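A minimal sketch of the W1 success check follows; the slot names mirror the list above, while the extraction and format-validation steps themselves are out of scope here.

```typescript
// Sketch: W1 success check — at least 4 of the 5 booking slots extracted.
// Slot names mirror the list above; date/time format validation and doctor
// specialization matching are applied separately before this check.
const REQUIRED_SLOTS = ["doctorType", "date", "time", "patientName", "reason"] as const;
type Slot = (typeof REQUIRED_SLOTS)[number];

function w1Success(extracted: Partial<Record<Slot, string>>): boolean {
  const filled = REQUIRED_SLOTS.filter(
    (s) => (extracted[s] ?? "").trim().length > 0
  ).length;
  return filled >= 4; // success criterion: ≥4/5 slots correctly extracted
}
```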
W2: Spatial Navigation Domain (Chapter 7.3)
- Safety Classification: Critical path validation required for hazard communication
- Hazard Types: Wet floors, construction zones, restricted areas, accessibility obstacles
- Route Validation: Point-to-point pathfinding accuracy with coordinate calculations
- Memory Constraints: Stateless route recalculation required (T4: 5/5 stateless success)
- Adaptation Pattern: Semi-static deterministic logic (Section 5.2.1, Table 5.1)
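A minimal sketch of the W2 route check follows, assuming a simple 2D coordinate representation (the actual map encoding is domain-specific): total path length plus a hazard-clearance test.

```typescript
// Sketch: W2 route validation — compute point-to-point path length and verify
// that every waypoint keeps a clearance radius from known hazards.
// Coordinates and hazard shapes are illustrative.
interface Point { x: number; y: number; }

function routeLength(route: Point[]): number {
  let total = 0;
  for (let i = 1; i < route.length; i++) {
    total += Math.hypot(route[i].x - route[i - 1].x, route[i].y - route[i - 1].y);
  }
  return total;
}

function avoidsHazards(route: Point[], hazards: Point[], radius: number): boolean {
  return route.every((p) =>
    hazards.every((h) => Math.hypot(p.x - h.x, p.y - h.y) > radius)
  );
}
```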
W3: System Diagnostics Domain (Chapter 7.4)
- Error Categories: Server, Database, User Access, Performance, Communication failures
- Response Structure: Component identification + priority classification (P1/P2/P3) + structured troubleshooting steps
- Technical Depth: Appropriate for Q1 (basic identification) to Q8 (detailed root cause analysis) tiers
- Template Protection: Anti-contamination filters for formal language patterns
- Adaptation Pattern: Dynamic heuristic classification (Section 5.2.1, Table 5.1)
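A minimal sketch of the W3 response structure follows; the error categories mirror the list above, while the field names and the validation rule are illustrative.

```typescript
// Sketch: W3 structured diagnostic response. Error categories mirror the list
// above; the field names and minimal validation rule are illustrative only.
type ErrorCategory = "server" | "database" | "userAccess" | "performance" | "communication";
type Priority = "P1" | "P2" | "P3";

interface DiagnosticResponse {
  component: string;   // component identification
  category: ErrorCategory;
  priority: Priority;  // P1/P2/P3 classification
  steps: string[];     // structured troubleshooting steps
}

function validateDiagnostic(r: DiagnosticResponse): boolean {
  return r.component.length > 0 && r.steps.length > 0;
}
```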
B.4.1 MCD Compliance Scoring
Alignment Metrics (Section 4.2 Principles):
- Minimality Score: Token efficiency relative to semantic value delivered
- Boundedness Score: Adherence to reasoning depth limits (≤3 steps, Section 4.2)
- Degeneracy Score: Component utilization rates (≥10% threshold, T7 validation)
- Stateless Score: Context reconstruction success without persistent memory (T4: 5/5 vs 2/5)
Classification Thresholds:
Category | Score Range | Interpretation |
---|---|---|
MCD-Compliant | ≥0.7 | Full adherence to MCD principles |
MCD-Compatible | 0.4-0.69 | Partial alignment, acceptable with documentation |
Non-MCD | <0.4 | Violates core principles |
Over-Engineered | RI >10 | Redundancy Index exceeds efficiency threshold (T6) |
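A sketch of how the sub-scores and thresholds combine into a label is shown below; equal weighting of the four alignment scores is an assumption, since this section defines the thresholds but not the aggregation rule.

```typescript
// Sketch: combining the four alignment sub-scores into a compliance label.
// Equal weighting is an assumption, not specified in the section.
interface McdScores {
  minimality: number;      // 0–1
  boundedness: number;     // 0–1
  degeneracy: number;      // 0–1
  stateless: number;       // 0–1
  redundancyIndex: number; // RI, over-engineering check (T6)
}

function classifyMcd(s: McdScores): string {
  if (s.redundancyIndex > 10) return "Over-Engineered";
  const score = (s.minimality + s.boundedness + s.degeneracy + s.stateless) / 4;
  if (score >= 0.7) return "MCD-Compliant";
  if (score >= 0.4) return "MCD-Compatible";
  return "Non-MCD";
}
```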
B.4.2 Performance Classification
Tier Performance Categories:
Category | Completion Rate | Resource Usage |
---|---|---|
Excellent | ≥90% | Optimal efficiency, within all constraints |
Good | 75-89% | Acceptable efficiency, minor deviations |
Acceptable | 60-74% | Within memory bounds, performance adequate |
Poor | <60% | Excessive resource consumption or low success |
Edge Deployment Classification:
Category | Latency | Memory | Success Rate |
---|---|---|---|
Edge-Superior | <400ms | <300MB | 100% |
Edge-Optimized | <500ms | <500MB | ≥90% |
Edge-Compatible | <750ms | <700MB | ≥75% |
Edge-Risky | <1000ms | <1GB | ≥60% |
Deployment-Hostile | Exceeds any constraint threshold above | | |
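The thresholds above translate directly into a classifier; the sketch below checks categories from strictest to most permissive and falls through to Deployment-Hostile.

```typescript
// Sketch: edge deployment classification from the thresholds in the table
// above. First matching category wins; anything beyond Edge-Risky is
// Deployment-Hostile.
function classifyEdge(latencyMs: number, memoryMB: number, successRate: number): string {
  if (latencyMs < 400  && memoryMB < 300  && successRate >= 1.0)  return "Edge-Superior";
  if (latencyMs < 500  && memoryMB < 500  && successRate >= 0.9)  return "Edge-Optimized";
  if (latencyMs < 750  && memoryMB < 700  && successRate >= 0.75) return "Edge-Compatible";
  if (latencyMs < 1000 && memoryMB < 1024 && successRate >= 0.6)  return "Edge-Risky";
  return "Deployment-Hostile";
}
```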
B.5.1 Experimental Data Format
Primary Data Structure:
```json
{
  "exportType": "Unified Comprehensive Analysis T1-T10",
  "timestamp": "ISO-8601 format (YYYY-MM-DDTHH:mm:ss.sssZ)",
  "testBedInfo": {
    "environment": "browser",
    "platform": "Win32",
    "memory": "8GB",
    "cores": 8,
    "webgpu": "Supported"
  },
  "selectedModels": {
    "Q1": "Qwen2-0.5B-Instruct-q4f16_1-MLC",
    "Q4": "TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC",
    "Q8": "Llama-3.2-1B-Instruct-q4f32_1-MLC"
  },
  "systemSpecs": "Hardware configuration details",
  "performanceMetrics": "Aggregated results per test variant"
}
```
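For type-safe handling in a TypeScript harness, the same record can be described by an interface; the sketch below mirrors the fields shown above, and the type name is illustrative.

```typescript
// Sketch: TypeScript shape of the export record shown above.
interface UnifiedExport {
  exportType: string;          // e.g. "Unified Comprehensive Analysis T1-T10"
  timestamp: string;           // ISO-8601
  testBedInfo: {
    environment: "browser";
    platform: string;          // navigator.platform, e.g. "Win32"
    memory: string;            // e.g. "8GB"
    cores: number;
    webgpu: string;            // e.g. "Supported"
  };
  selectedModels: Record<"Q1" | "Q4" | "Q8", string>;
  systemSpecs: string;
  performanceMetrics: unknown; // aggregated results per test variant
}
```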
Data Integrity Measures:
- Contamination Detection: Template and placeholder pattern recognition (regex-based filtering)
- Backend Readiness: Model loading and availability verification before test execution
- Tier Optimization: Quantization-specific parameter validation
- Storage Integrity: Complete data capture confirmation with checksum validation
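A minimal sketch of the regex-based contamination check follows; the two pattern families (placeholder brackets and formal-letter salutations) are illustrative examples rather than the study's full filter set.

```typescript
// Sketch: regex-based template/placeholder contamination check.
// The two pattern families below are illustrative examples of the filters
// described above, not an exhaustive list.
const CONTAMINATION_PATTERNS: RegExp[] = [
  /\[(?:Your|Patient|Doctor)?\s*Name\]/i,  // placeholder brackets, e.g. "[Your Name]"
  /\bDear\s+(?:Sir|Madam|Sir\/Madam)\b/i,  // formal-letter salutations
];

function isContaminated(output: string): boolean {
  return CONTAMINATION_PATTERNS.some((re) => re.test(output));
}
```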
B.5.2 Result Classification Schema
Success Determination Criteria:
- Technical Success: Task completion within resource constraints (<512MB RAM, <500ms latency)
- Semantic Success: Meaningful and contextually appropriate responses (human-evaluated)
- MCD Alignment: Adherence to framework principles (≥0.7 compliance score)
- Edge Viability: Deployment compatibility in constrained environments
Failure Categories:
- Technical Failure: Crashes, timeouts, resource exhaustion
- Semantic Failure: Hallucination, irrelevant responses, safety violations
- Framework Violation: Non-compliance with MCD principles (e.g., unbounded loops, >3 reasoning steps)
- Template Contamination: Use of placeholder text or formal letter patterns (e.g., "[Your Name]", "Dear Sir/Madam")
B.6.1 Environment Standardization
Browser Configuration:
- Cache Management: Cleared before each test session
- Extension Isolation: Clean browser profiles used (no extensions enabled)
- Network Conditions: Local execution only, no external API calls
- Resource Monitoring: Real-time memory and CPU tracking via DevTools
Model Loading Protocol:
- Pre-load Phase: All three tiers (Q1/Q4/Q8) loaded before testing begins
- Warm-up Period: Initial inference run to stabilize performance baseline
- Baseline Measurement: Resource usage recorded before first test iteration
- Isolation Protocol: Memory reset between test variants to ensure independence
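A sketch of the loading protocol's ordering is shown below; `loadEngine` and `Engine.generate` are hypothetical stand-ins for whichever in-browser inference runtime is used, so only the sequencing reflects the protocol above.

```typescript
// Sketch: pre-load / warm-up / baseline ordering from the protocol above.
// loadEngine and Engine.generate are hypothetical stand-ins for the actual
// in-browser inference runtime.
interface Engine { generate(prompt: string): Promise<string>; }
declare function loadEngine(modelId: string): Promise<Engine>;

async function prepareTiers(models: Record<"Q1" | "Q4" | "Q8", string>) {
  const engines: Record<string, Engine> = {};
  // Pre-load phase: all three tiers loaded before testing begins.
  for (const [tier, id] of Object.entries(models)) {
    engines[tier] = await loadEngine(id);
    // Warm-up period: one inference to stabilize the performance baseline.
    await engines[tier].generate("warm-up");
  }
  // Baseline measurement: resource usage recorded before the first test run.
  const baselineHeapMB =
    ((performance as any).memory?.usedJSHeapSize ?? NaN) / (1024 * 1024);
  return { engines, baselineHeapMB };
}
```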
B.6.2 Statistical Validity Assurance
Randomization Controls:
- Test Order: Randomized variant presentation to control order effects
- Model Selection: Systematic tier progression (Q1→Q4→Q8) for escalation validation
- Cross-Validation: Stratified sampling across approach types (MCD, Few-Shot, CoT, etc.)
- Temporal Controls: Time-of-day effects minimized through session distribution
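Test-order randomization with a fixed seed can be reproduced with any small seeded PRNG; the sketch below pairs mulberry32 with a Fisher-Yates shuffle, which is an illustrative choice rather than the study's exact generator.

```typescript
// Sketch: reproducible test-order randomization. mulberry32 is one common
// small seeded PRNG (an illustrative choice); Fisher-Yates gives an unbiased
// ordering of the test variants.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function shuffledOrder<T>(variants: T[], seed: number): T[] {
  const rng = mulberry32(seed);
  const out = [...variants];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```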
Quality Assurance:
- Inter-Rater Reliability: Automated scoring validation with manual spot-checking (10% sample)
- Test-Retest Stability: Repeated measures for key findings (n=5 per variant)
- External Validation: Cross-platform compatibility verification (Chrome, Firefox, Edge)
- Data Auditing: Complete experimental trace logging for reproducibility