Appendix F

Designing Lightweight AI Agents for Edge Deployment

A Minimal Capability Framework with Insights from Literature Synthesis

Appendix F: Statistical Calculations And Effect Size Analysis

This appendix provides the detailed calculations behind the effect size claims made throughout the thesis. Because samples are small (n=5 per variant), the analysis emphasizes practical significance over inferential statistics.

F.1 Cohen's d for Completion Rate Comparisons

Formula:

$$d = \frac{M_1 - M_2}{\sigma_{\text{pooled}}}$$

where

$$\sigma_{\text{pooled}} = \sqrt{p_{\text{pool}} \times (1 - p_{\text{pool}})}$$

Example: W3 MCD Structured (80%) vs Few-Shot (40%)

  • Mean difference: 0.40
  • Pooled SD: 0.490
  • Cohen's d = 0.82 (Large effect, d > 0.8)

Additional Comparisons:

| Comparison | Cohen's d | Interpretation |
|---|---|---|
| T1: MCD vs Ultra-Minimal (100% vs 0%) | 2.00 | Extreme effect |
| W1: Hybrid vs System Role (100% vs 60%) | 1.00 | Large effect |
| W2: MCD vs Few-Shot (60% vs 40%) | 0.40 | Small-to-medium effect |

Interpretation: Large effects (d > 0.8) dominate key MCD comparisons, providing practical significance despite small sample sizes.
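The F.1 values can be reproduced with a short calculation. The sketch below is illustrative only: the function name `cohens_d_proportions` is hypothetical, and it assumes that p_pool is the simple average of the two completion rates, which holds only when both groups have the same size (n=5 here). Under that assumption it returns the pooled SD of 0.490 and d = 0.82 for the W3 example.

```python
import math


def cohens_d_proportions(p1: float, p2: float) -> float:
    """Cohen's d for two completion rates, using the pooled-proportion SD.

    Assumption: equal group sizes, so p_pool is the mean of the two rates.
    """
    p_pool = (p1 + p2) / 2
    sd_pooled = math.sqrt(p_pool * (1 - p_pool))
    if sd_pooled == 0:
        # Both rates are 0 or both are 1: no variability, d is undefined.
        raise ValueError("pooled SD is zero; d is undefined")
    return (p1 - p2) / sd_pooled


# Completion rates taken from the comparisons listed in F.1.
comparisons = {
    "T1: MCD vs Ultra-Minimal": (1.00, 0.00),
    "W1: Hybrid vs System Role": (1.00, 0.60),
    "W2: MCD vs Few-Shot": (0.60, 0.40),
    "W3: MCD Structured vs Few-Shot": (0.80, 0.40),
}

for label, (p1, p2) in comparisons.items():
    print(f"{label}: d = {cohens_d_proportions(p1, p2):.2f}")
```

Note that with this pooled form a 100% vs 0% comparison yields d = 2.00, the largest value the formula can produce for equal group sizes.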

F.2 Eta-Squared (η²) for Token Efficiency Variance

Formula:

$$\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}$$

T1 Token Efficiency Analysis:

  • Approaches: MCD (0.297), Verbose (0.114), Baseline (0.125), CoT (0.159), Few-Shot (0.297)
  • Grand mean: 0.198
  • η² = 0.14-0.16 (Large effect by conventional standards, η² > 0.14)

Interpretation: Token efficiency variance across approaches represents large practical effects, validating architectural differentiation.
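As a rough sketch of the η² calculation, the code below computes the grand mean from the five approach means listed above and defines a generic `eta_squared` helper (a hypothetical name). The reported η² range depends on per-trial values that are not listed in this appendix, so it cannot be reproduced from the means alone; the `toy_groups` data are invented purely to show the mechanics and are not thesis results.

```python
def eta_squared(groups):
    """Return SS_between / SS_total for a list of groups of observations."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    return ss_between / ss_total


# Approach means as reported in F.2 (T1 token efficiency).
approach_means = {
    "MCD": 0.297,
    "Verbose": 0.114,
    "Baseline": 0.125,
    "CoT": 0.159,
    "Few-Shot": 0.297,
}
grand_mean = sum(approach_means.values()) / len(approach_means)
print(f"grand mean = {grand_mean:.3f}")

# Hypothetical toy data (NOT from the thesis), only to exercise the formula.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
print(f"toy eta squared = {eta_squared(toy_groups):.2f}")
```

With the per-trial token efficiency values for each approach substituted for `toy_groups`, the same helper would give the η² for T1.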

F.3 Fisher's Exact Test for Categorical Differences

Extreme Case: MCD (5/5) vs Ultra-Minimal (0/5)

| Approach | Success | Failure |
|---|---|---|
| MCD Structured | 5 | 0 |
| Ultra-Minimal | 0 | 5 |

  • Odds ratio: Infinite (complete separation)
  • p-value = 0.0079 (p < 0.05, statistically significant)

Moderate Case: MCD (4/5) vs Few-Shot (2/5)

| Approach | Success | Failure |
|---|---|---|
| MCD Structured | 4 | 1 |
| Few-Shot | 2 | 3 |

  • Odds ratio: 6.00
  • p-value = 0.524 (not statistically significant, n=5 insufficient)

Interpretation: Extreme binary outcomes (5/5 vs 0/5) achieve statistical significance despite small n. Moderate differences (4/5 vs 2/5) lack power but show large effect sizes.
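Both p-values can be checked by enumerating the hypergeometric distribution directly. The following is a minimal standard-library sketch, not the thesis's own analysis code; the function names are hypothetical. It assumes the common two-sided convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    denom = comb(n, row1)

    def prob(k: int) -> float:
        # Probability that the top-left cell equals k with all margins fixed.
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    k_min = max(0, row1 - (n - col1))
    k_max = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(k)
        for k in range(k_min, k_max + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio (a*d)/(b*c); infinite under complete separation."""
    return float("inf") if b * c == 0 else (a * d) / (b * c)


# Rows are (success, failure) counts from the two tables in F.3.
extreme = (5, 0, 0, 5)   # MCD Structured 5/5 vs Ultra-Minimal 0/5
moderate = (4, 1, 2, 3)  # MCD Structured 4/5 vs Few-Shot 2/5

for name, table in (("extreme", extreme), ("moderate", moderate)):
    print(f"{name}: OR = {odds_ratio(*table)}, p = {fisher_exact_two_sided(*table):.4f}")
```

For the extreme case only two of the 252 possible tables are as unlikely as the observed one, giving p = 2/252; for the moderate case the sum is 132/252.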

F.4 Confidence Intervals (Wilson Score Method)

95% Confidence Intervals for Completion Rates (n=5):

| Scenario | Point Estimate | 95% CI |
|---|---|---|
| MCD Structured (5/5) | 1.00 | [0.57, 1.00] |
| MCD Structured (4/5) | 0.80 | [0.38, 0.96] |
| Few-Shot (3/5) | 0.60 | [0.23, 0.88] |
| Few-Shot (2/5) | 0.40 | [0.12, 0.77] |
| Ultra-Minimal (0/5) | 0.00 | [0.00, 0.43] |

Interpretation: Wide confidence intervals reflect the estimation uncertainty inherent in n=5, emphasizing the need for effect size analysis and cross-tier replication rather than reliance on point estimates.
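The intervals in the table follow from the standard Wilson score formula. The sketch below is one way to compute them; the function name `wilson_interval` and the use of z ≈ 1.96 for a 95% level are assumptions made for illustration, and no continuity correction is applied.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Success counts out of n=5, matching the scenarios in F.4.
for successes in (5, 4, 3, 2, 0):
    low, high = wilson_interval(successes, 5)
    print(f"{successes}/5: [{low:.2f}, {high:.2f}]")
```

Unlike the simpler Wald interval, the Wilson interval stays within [0, 1] and has non-zero width even at 0/5 and 5/5, which is why it suits these small samples.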

F.5 Cross-Tier Reliability Ratio

MCD Cross-Tier Performance:

  • Q1: 0.80, Q4: 0.80, Q8: 0.80
  • Mean: 0.80, SD = 0.00 (perfect consistency)

Few-Shot Cross-Tier Performance:

  • Q1: 0.40, Q4: 0.30, Q8: 0.20
  • Mean: 0.30, SD = 0.10 (high variance)

Reliability Ratio: MCD shows zero variance across tiers, while Few-Shot completion falls by 50% in relative terms from Q1 to Q8 (0.40 → 0.20), supporting the constraint-resilience claim.
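The cross-tier figures can be recomputed with Python's `statistics` module. This is an illustrative sketch with hypothetical variable and function names; it assumes the reported SD is the sample standard deviation (n-1 denominator), which matches the 0.10 value for Few-Shot, and that degradation is measured relative to the Q1 rate.

```python
from statistics import mean, stdev

# Completion rates per tier (Q1, Q4, Q8), as listed in F.5.
mcd = [0.80, 0.80, 0.80]
few_shot = [0.40, 0.30, 0.20]


def summarize(rates):
    """Return (mean, sample SD, relative drop from first to last tier)."""
    relative_drop = (rates[0] - rates[-1]) / rates[0]
    return mean(rates), stdev(rates), relative_drop


for name, rates in (("MCD", mcd), ("Few-Shot", few_shot)):
    m, sd, drop = summarize(rates)
    print(f"{name}: mean = {m:.2f}, SD = {sd:.2f}, Q1->Q8 drop = {drop:.0%}")
```

With only three tiers per approach, these summaries are descriptive rather than inferential.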

F.6 Effect Size Summary

| Comparison | Metric | Value | Interpretation | Sample |
|---|---|---|---|---|
| MCD vs Ultra-Minimal (T1) | Cohen's d | 2.00 (5/5 vs 0/5) | Extreme effect | n=5/group |
| MCD vs Few-Shot (W3) | Cohen's d | 0.82 | Large effect | n=5/group |
| Hybrid vs System Role (W1) | Cohen's d | 1.00 | Large effect | n=5/group |
| Token Efficiency (T1) | η² | 0.14-0.16 | Large practical effect | n=5 groups |
| Cross-Tier Consistency | σ ratio | MCD: 0.00 vs FS: 0.10 | Perfect vs variable | n=3 tiers |

F.7 Statistical Interpretation Guidelines

Sample Size Limitations: Small sample sizes (n=5 per variant) limit statistical power and generalizability. Traditional parametric assumptions (normality, homogeneity of variance) cannot be reliably assessed.

Effect Size Emphasis: Analysis prioritizes practical significance (effect sizes) over statistical significance (p-values):

  • Cohen's d > 0.8 = large effect (practically meaningful)
  • η² > 0.14 = large effect (substantial variance explained)
  • Wide CIs reflect uncertainty but extreme point estimates (1.00 vs 0.00) provide categorical evidence

Validation Strategy: Strength of claims derives from:

  1. Extreme effect sizes (d = 2.0, η² = 0.14-0.16)
  2. Cross-tier replication (Q1/Q4/Q8 consistent patterns)
  3. Cross-domain validation (W1/W2/W3 convergent evidence)
  4. Categorical outcomes (100% vs 0% completion where applicable)

Appropriate Use Cases:

  • ✅ Fisher's Exact Test for extreme binary outcomes (5/5 vs 0/5)
  • ✅ Effect size calculations for practical significance
  • ✅ Wide CIs to reflect estimation uncertainty
  • ❌ Parametric tests (t-tests, ANOVA) underpowered with n=5
  • ❌ Point estimates without confidence intervals