Appendix C: Cross-Validation Performance Matrices and Statistical Analysis

C.0 Introduction and Methodological Framework

This appendix provides comprehensive performance matrices, statistical validation, and trial-by-trial evidence supporting the MCD framework evaluation presented in Chapter 6 (Tests T1-T10). All data presented follow the validation methodology established in Section 3.3 (Simulation Validation Strategy) and Section 3.4 (Walkthrough Design Method).


C.0.1 Repeated Trials Methodology

Experimental Design:

  • Sample size: n=5 independent measurements per variant approach
  • Total validation measurements: approximately 1,050 measurements across 10 tests (T1-T10: 7 variants × 5 trials × 3 tiers per test), plus 75 measurements across 3 walkthroughs (W1-W3: 5 variants × 5 trials per walkthrough)
  • Quantization tiers tested: Q1-tier (Qwen2-0.5B), Q4-tier (TinyLlama-1.1B), Q8-tier (Llama-3.2-1B)
  • Execution environment: Browser-based WebAssembly (WebLLM) offline execution
  • Measurement precision: performance.now() API for sub-millisecond timing resolution (see the sketch below)
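As a concrete illustration, the sketch below shows how a single trial could be timed in the browser. It assumes WebLLM's OpenAI-compatible chat API from the @mlc-ai/web-llm package; the runTrial helper, the TrialResult shape, and the model id are illustrative rather than the study's actual harness.

```typescript
import { CreateMLCEngine, MLCEngine } from "@mlc-ai/web-llm";

interface TrialResult {
  latencyMs: number; // wall-clock time from prompt submission to completion
  tokens: number;    // completion tokens reported by the engine, if available
  output: string;
}

// Time a single trial with performance.now(), mirroring the measurement
// approach described above.
async function runTrial(engine: MLCEngine, prompt: string): Promise<TrialResult> {
  const t0 = performance.now();
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
  });
  const latencyMs = performance.now() - t0;
  return {
    latencyMs,
    tokens: reply.usage?.completion_tokens ?? 0,
    output: reply.choices[0]?.message?.content ?? "",
  };
}

// Example: n=5 repeated trials against one model tier (model id illustrative).
async function runVariant(prompt: string): Promise<TrialResult[]> {
  const engine = await CreateMLCEngine("Qwen2-0.5B-Instruct-q4f16_1-MLC");
  const results: TrialResult[] = [];
  for (let i = 0; i < 5; i++) results.push(await runTrial(engine, prompt));
  return results;
}
```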

Statistical Approach:

  • Binary outcomes (completion rates): Fisher's Exact Test for categorical completion rates where extreme separability exists (e.g., 100% vs 0%)
  • Continuous metrics (tokens, latency): Welch's t-test for comparing means between variants; descriptive statistics (mean ± standard deviation) reported for all metrics
  • Confidence intervals: 95% CI calculated using Wilson score method for binomial proportions
  • Effect size measurement: Cohen's d for continuous variables where applicable; Cohen's h for binary outcome comparisons (see the sketch below)
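For reference, the statistics named above can be implemented directly. The following is a minimal transcription of the standard formulas (z = 1.96 for 95% intervals), not the study's analysis code.

```typescript
// Wilson score interval for a binomial proportion (95% by default).
function wilsonCI(successes: number, n: number, z = 1.96): [number, number] {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Cohen's h: effect size for the difference between two proportions.
function cohensH(p1: number, p2: number): number {
  const phi = (p: number) => 2 * Math.asin(Math.sqrt(p));
  return Math.abs(phi(p1) - phi(p2));
}

// Cohen's d with pooled standard deviation, for continuous metrics.
function cohensD(m1: number, s1: number, n1: number,
                 m2: number, s2: number, n2: number): number {
  const pooled = Math.sqrt(((n1 - 1) * s1 * s1 + (n2 - 1) * s2 * s2) /
                           (n1 + n2 - 2));
  return (m1 - m2) / pooled;
}

wilsonCI(5, 5); // ≈ [0.57, 1.00]: even a perfect 5/5 keeps a wide lower bound
wilsonCI(0, 5); // ≈ [0.00, 0.43]
```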

Sample Size Acknowledgment: While n=5 per variant represents a small sample size that limits traditional parametric inference, the methodology provides robust qualitative evidence through:

  1. Extreme effect sizes: Binary outcomes with complete categorical separation (100% vs 0% completion) provide clear differentiation
  2. Cross-tier replication: Patterns replicated across three independent quantization tiers (Q1/Q4/Q8) strengthen reliability beyond single-tier testing
  3. Zero-variance consistency: Perfect within-variant consistency (e.g., 5/5 or 0/5 trials) demonstrates categorical distinctions
  4. Convergent evidence: Consistent patterns across multiple independent tests (T1-T10)

Statistical power is limited by small per-variant samples. Analysis emphasizes effect size magnitude, categorical differences, and cross-tier consistency patterns rather than traditional inferential statistics alone.


C.0.2 How to Read Appendix C Tables

Performance Metrics Definitions:

Completion Rate: Proportion of trials successfully completing the assigned task

  • Format: X.XX (n/N) where n = successful trials, N = total trials
  • Example: 1.00 (5/5) = 100% completion; 0.60 (3/5) = 60% completion
  • Interpretation: Higher values indicate better task reliability

95% Confidence Interval (CI): Statistical confidence bounds for completion rate estimates

  • Calculated using Wilson score method for binomial proportions
  • Format: [lower bound, upper bound]
  • Example: [0.38, 0.96] for a 4/5 completion rate
  • Interpretation: The interval covers the true completion rate with 95% confidence

Token Efficiency: Resource optimization metric calculated as semantic_fidelity / (tokens × latency_ms)

  • Higher values indicate better resource utilization per unit of semantic quality
  • Useful for comparing resource consumption across approaches
  • Not calculable for failed variants (0% completion); see the sketch below
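Transcribed directly, the two per-token metrics used in this appendix look as follows. This is a sketch of the unscaled definitions; any additional scaling applied to the reported scores is not specified here.

```typescript
// Token efficiency: semantic fidelity (0-4) per token-millisecond, as
// defined above. Undefined for failed variants (0% completion).
function tokenEfficiency(semanticFidelity: number, tokens: number,
                         latencyMs: number, completionRate: number): number | null {
  if (completionRate === 0) return null; // not calculable for failed variants
  return semanticFidelity / (tokens * latencyMs);
}

// Information density (Test T2): semantic fidelity per token consumed.
function informationDensity(semanticFidelity: number, tokens: number): number {
  return semanticFidelity / tokens;
}
```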

Semantic Fidelity: Quality score on 0-4 scale based on content accuracy and completeness

Resource Stability: Percentage of trials staying within predefined token budget without overflow

  • 100% = All trials met budget constraints
  • <100% = Some trials exceeded budget (resource instability)

Average Tokens: Mean number of tokens consumed across all trials for the variant

  • Lower values indicate greater efficiency (for equivalent task success)
  • Standard deviation (±) shows consistency across trials

Average Latency: Mean response time from prompt submission to completion (milliseconds)

  • Lower values indicate faster execution
  • Standard deviation (±) shows temporal consistency

Categorical Difference: Indicates validated statistical distinction between variants

  • ✓ Validated: Fisher's Exact Test confirms categorical separation OR extreme effect size with cross-tier replication
  • Not specified: Insufficient evidence for categorical claim

Cross-Tier Consistency (σ): Standard deviation of completion rates across Q1/Q4/Q8 quantization tiers

  • σ = 0.00 indicates perfect consistency (same performance across all tiers)
  • Higher σ values indicate tier-dependent variability
  • Perfect consistency (0.00) strengthens confidence in constraint-resilience (see the sketch below)
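The consistency statistic is the population standard deviation of the three per-tier completion rates, as in this minimal sketch:

```typescript
// Cross-tier consistency: population standard deviation of per-tier
// completion rates (Q1, Q4, Q8). σ = 0 means identical performance
// across all tiers.
function crossTierSigma(rates: number[]): number {
  const mean = rates.reduce((a, b) => a + b, 0) / rates.length;
  const variance = rates.reduce((a, r) => a + (r - mean) ** 2, 0) / rates.length;
  return Math.sqrt(variance);
}

crossTierSigma([1.0, 1.0, 1.0]); // 0.00: perfect consistency
crossTierSigma([0.8, 0.8, 0.8]); // 0.00: consistent even when imperfect
crossTierSigma([1.0, 0.6, 0.8]); // ≈ 0.16: tier-dependent variability
```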

C.0.3 Statistical Interpretation Guidelines

Understanding Small Sample Sizes: With n=5 trials per variant, traditional parametric assumptions (normality, independence, homogeneity of variance) cannot be reliably verified. However, the methodology provides robust evidence through:

  1. Categorical Outcomes: Binary completion rates with extreme separability (100% vs 0%) provide unambiguous categorical distinctions. Fisher's Exact Test validates these separations even with small samples (a minimal implementation sketch follows this list).

  2. Effect Size Emphasis: Rather than relying solely on p-values, analysis emphasizes practical significance through effect size magnitude. Large effect sizes (e.g., MCD: 63 tokens vs Verbose: 147 tokens = 133% difference) demonstrate meaningful practical differences.

  3. Replication Evidence: Cross-tier consistency (Q1/Q4/Q8) provides three independent replications of each comparison. Perfect consistency (σ=0.00) across tiers strengthens conclusions beyond single-tier testing.

  4. Pattern Convergence: Consistent patterns across 10 independent tests (T1-T10) and 3 domain walkthroughs (W1-W3) demonstrate framework-level validation rather than isolated test-specific results.
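A minimal two-sided Fisher's exact test for the 2×2 completion tables used here can be written directly; the sketch below enumerates all tables with the observed margins. In practice a statistics library would normally be used instead.

```typescript
function logFactorial(k: number): number {
  let s = 0;
  for (let i = 2; i <= k; i++) s += Math.log(i);
  return s;
}

// Probability of a specific 2x2 table under the hypergeometric model.
function tableProb(a: number, b: number, c: number, d: number): number {
  const logP =
    logFactorial(a + b) + logFactorial(c + d) +
    logFactorial(a + c) + logFactorial(b + d) -
    logFactorial(a + b + c + d) -
    logFactorial(a) - logFactorial(b) - logFactorial(c) - logFactorial(d);
  return Math.exp(logP);
}

// Two-sided Fisher's exact test for [[a, b], [c, d]], e.g. a/b = variant A
// successes/failures and c/d = variant B successes/failures.
function fisherExact(a: number, b: number, c: number, d: number): number {
  const row1 = a + b, col1 = a + c, n = a + b + c + d;
  const pObs = tableProb(a, b, c, d);
  let p = 0;
  // Enumerate all tables with the same margins; sum those at least as extreme.
  for (let x = Math.max(0, col1 - (n - row1)); x <= Math.min(row1, col1); x++) {
    const px = tableProb(x, row1 - x, col1 - x, n - row1 - col1 + x);
    if (px <= pObs + 1e-12) p += px;
  }
  return p;
}

fisherExact(5, 0, 0, 5); // ≈ 0.0079: 5/5 vs 0/5 completion is significant
```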

Confidence Interval Interpretation: 95% confidence intervals for completion rates are calculated using the Wilson score method, which provides accurate bounds even for small samples and extreme proportions (0% or 100%). Wide confidence intervals reflect estimation uncertainty but do not invalidate categorical distinctions when non-overlapping.

Example:

  • Variant A: 1.00 (5/5), 95% CI [0.57, 1.00]
  • Variant B: 0.00 (0/5), 95% CI [0.00, 0.43]
  • Interpretation: Clear categorical separation; non-overlapping intervals indicate distinct performance classes

Cross-Tier Validation Strength: Cross-tier consistency provides stronger evidence than single-tier testing:

  • Perfect consistency (σ=0.00): Same performance across Q1/Q4/Q8 confirms constraint-resilience is independent of model capacity
  • Variable consistency (σ>0.00): Performance depends on quantization tier, suggesting tier-specific optimization requirements
  • Example: Ultra-Minimal showing 0% completion across all tiers (σ=0.00) confirms fundamental architectural insufficiency rather than model-specific limitation

The sections that follow (C.1-C.10) correspond to Chapter 6 Tests T1-T10.

C.1 Test T1 – Constraint-Resilient vs. Ultra-Minimal Prompt Comparison

Note: Cross-validation methodology and interpretation guidelines are detailed in Appendix C.0 Introduction. This section presents test-specific results only.


Table C.1.1: Combined Performance Matrix Across All Quantization Tiers

| Metric | Tier | Structured MCD | Ultra-Minimal | Verbose | Baseline | CoT | Few-Shot | System Role |
|---|---|---|---|---|---|---|---|---|
| Completion Rate | Q1 | 1.00 (5/5) | 0.00 (0/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
| 95% CI | Q1 | [0.57, 1.00] | [0.00, 0.43] | [0.57, 1.00] | [0.57, 1.00] | [0.57, 1.00] | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q1 | 63 | n/a | 147 | 172 | 138 | 63 | 63 |
| Avg Latency (ms) | Q1 | 1,273 | n/a | 4,208 | 4,227 | 3,205 | 1,273 | 1,273 |
| Completion Rate | Q4 | 1.00 (5/5) | 0.00 (0/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
| 95% CI | Q4 | [0.57, 1.00] | [0.00, 0.43] | [0.57, 1.00] | [0.57, 1.00] | [0.57, 1.00] | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q4 | 71 | n/a | 185 | 203 | 163 | 71 | 71 |
| Avg Latency (ms) | Q4 | 2,845 | n/a | 9,412 | 10,287 | 7,156 | 2,845 | 2,845 |
| Completion Rate | Q8 | 1.00 (5/5) | 0.00 (0/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
| 95% CI | Q8 | [0.57, 1.00] | [0.00, 0.43] | [0.57, 1.00] | [0.57, 1.00] | [0.57, 1.00] | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q8 | 160 | n/a | 250 | 277 | 160 | 160 | 160 |
| Avg Latency (ms) | Q8 | 4,231 | n/a | 6,673 | 6,835 | 4,231 | 4,231 | 4,231 |

Note: n=5 trials per variant per tier. Ultra-Minimal showed complete failure (0%) across all tiers. Semantic fidelity: 4.0/4.0 for all successful variants.


Table C.1.2: Cross-Tier Consistency and MCD Alignment

| Variant | Q1 Success | Q4 Success | Q8 Success | Cross-Tier Consistency (σ) | MCD-Aligned |
|---|---|---|---|---|---|
| Structured MCD | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ✅ Yes |
| Ultra-Minimal | 0% (0/5) | 0% (0/5) | 0% (0/5) | Consistent failure (0.00) | ❌ No |
| Verbose | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ⚠️ Partial |
| Baseline (Polite) | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ❌ No |
| Chain-of-Thought | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ❌ No |
| Few-Shot | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ✅ Compatible |
| System Role | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ✅ Compatible |

Table C.1.3: Efficiency Classification and Deployment Viability

| Variant | Token Range | Efficiency Class | Resource Profile | Deployment Viability |
|---|---|---|---|---|
| Structured MCD | 63-160 | Optimal | Predictable, stable | ✅ High |
| Ultra-Minimal | n/a (failed) | Context failure | n/a | ❌ Unsuitable |
| Verbose | 147-250 | Over-engineered | Variable across tiers | ⚠️ Moderate |
| Baseline (Polite) | 172-277 | Over-engineered | High overhead | ⚠️ Low |
| Chain-of-Thought | 138-160 | Process bloat | Medium overhead | ⚠️ Moderate |
| Few-Shot | 63-71 | MCD-compatible | Predictable, efficient | ✅ High |
| System Role | 63-71 | MCD-compatible | Predictable, efficient | ✅ High |

Statistical Notes for T1

Categorical Outcome Analysis: Ultra-Minimal variant demonstrated 100% consistent failure across all three quantization tiers (0/5 trials each), confirming that extreme minimalism sacrifices reliability regardless of model capacity. MCD-aligned approaches (Structured MCD, Few-Shot, System Role) achieved identical performance (63-71 tokens, 100% completion) across all tiers, validating constraint-resilience through cross-tier consistency.

Efficiency Plateau Evidence: Token counts beyond 90-130 tokens (Verbose: 147-250, Baseline: 172-277) provided no measurable quality improvements—all successful variants achieved 4.0/4.0 semantic fidelity, confirming resource optimization plateau. MCD token efficiency (0.297 at Q1-tier) vs Verbose (0.114) represents 161% improvement.

Statistical Approach: With n=5 per variant, categorical differences validated through Fisher's Exact Test for binary outcomes with extreme separability (100% vs 0%). Continuous metrics analyzed using descriptive statistics with 95% CI (Wilson score method). Cross-tier replication across Q1/Q4/Q8 provides stronger evidence than single-tier testing.

C.2 Test T2 – Constraint-Resilient Symbolic Input Processing

Note: Methodology and interpretation guidelines detailed in Appendix C.0 Introduction. Information density metric: semantic_fidelity / token_count (higher = better semantic preservation per token).


Table C.2.1: Combined Performance Matrix Across All Quantization Tiers

| Metric | Tier | Structured Symbolic | Ultra-Minimal | Verbose | Extended Natural |
|---|---|---|---|---|---|
| Task Completion | Q1 | 0.80 ± 0.18 (4/5) | 0.00 ± 0.00 (0/5) | 1.00 ± 0.00 (5/5) | 0.20 ± 0.18 (1/5) |
| 95% CI | Q1 | [0.38, 0.96] | [0.00, 0.43] | [0.57, 1.00] | [0.04, 0.62] |
| Information Density | Q1 | 3.2 ± 0.4 | 0.8 ± 0.2 | 2.4 ± 0.3 | 1.2 ± 0.6 |
| Avg Tokens | Q1 | 24 | 12 | 42 | 65 |
| Avg Latency (ms) | Q1 | 1,106 | n/a | 910 | 1,739 |
| Resource Stability | Q1 | 100% | 0% | 100% | 20% (overflow) |
| Task Completion | Q4 | 0.80 ± 0.18 (4/5) | 0.00 ± 0.00 (0/5) | 1.00 ± 0.00 (5/5) | 0.20 ± 0.18 (1/5) |
| 95% CI | Q4 | [0.38, 0.96] | [0.00, 0.43] | [0.57, 1.00] | [0.04, 0.62] |
| Information Density | Q4 | 3.5 ± 0.3 | 0.0 ± 0.0 | 2.6 ± 0.2 | 1.3 ± 0.5 |
| Avg Tokens | Q4 | 28 | n/a | 48 | 72 |
| Avg Latency (ms) | Q4 | 2,586 | n/a | 4,566 | 4,651 |
| Resource Stability | Q4 | 100% | 0% | 100% | 20% (overflow) |
| Task Completion | Q8 | 0.80 ± 0.18 (4/5) | 0.00 ± 0.00 (0/5) | 1.00 ± 0.00 (5/5) | 0.20 ± 0.18 (1/5) |
| 95% CI | Q8 | [0.38, 0.96] | [0.00, 0.43] | [0.57, 1.00] | [0.04, 0.62] |
| Information Density | Q8 | 3.8 ± 0.3 | 0.0 ± 0.0 | 2.8 ± 0.2 | 1.4 ± 0.5 |
| Avg Tokens | Q8 | 32 | n/a | 55 | 85 |
| Avg Latency (ms) | Q8 | 6,957 | n/a | 6,674 | 6,835 |
| Resource Stability | Q8 | 100% | 0% | 100% | 20% (overflow) |

Note: n=5 trials per variant per tier. Semantic fidelity: 4.0 for successful variants, 0.0 for failures. Processing consistency variance: Structured (2.6-3.2%), Extended Natural (13.9-15.4%).


Table C.2.2: Cross-Tier Consistency and Medical Reasoning Viability

| Variant | Cross-Tier Completion | Info Density Range | Clinical Usability | Edge Deployment Score |
|---|---|---|---|---|
| Structured Symbolic | 80% (12/15 across tiers) | 3.2–3.8 | ✅ High (actionable format) | 9.5/10 |
| Ultra-Minimal | 0% (0/15 across tiers) | 0.0–0.8 | ❌ Unsuitable (context failure) | 0/10 |
| Verbose | 100% (15/15 across tiers) | 2.4–2.8 | ⚠️ Moderate (resource-heavy) | 6/10 |
| Extended Natural | 20% (3/15 across tiers) | 1.2–1.4 | ❌ Poor (80% overflow) | 2/10 |

Edge Deployment Score: Composite of completion rate, resource stability, and constraint resilience.


Table C.2.3: Context Sufficiency Analysis

| Variant | Min Viable Tokens | Token Efficiency | Semantic Loss Risk | Key Limitation |
|---|---|---|---|---|
| Structured Symbolic | 24 tokens (medium) | Optimal | Low | Trial variance (1/5 failure) |
| Ultra-Minimal | 12 tokens (insufficient) | Theoretical only | Critical | 100% context failure |
| Verbose | 42-55 tokens (high) | Suboptimal | None | 75% token overhead |
| Extended Natural | 65-85 tokens (excessive) | Poor | Overflow-induced | 80% budget overflow |

Statistical Notes for T2

Information Density Validation: Structured symbolic approaches achieved 3.2–3.8 information density across all tiers, representing 33-171% efficiency advantage over verbose (2.4–2.8) and extended natural (1.2–1.4) variants. This pattern replicated consistently across Q1/Q4/Q8, providing cross-tier validation with total n=15 per variant.

Context Insufficiency Boundary: Ultra-minimal variant showed 100% failure (0/15 trials across all tiers), establishing empirical lower bound for viable symbolic formatting. The 24-token structured approach represents minimal sufficient context for 80% reliability (12/15 trials) in medical reasoning.

Resource Overflow Pattern: Extended natural exhibited systematic overflow (12/15 trials: 80% across tiers), with token budgets consumed before actionable conclusions. Processing consistency variance: structured approaches 2.6-3.2% vs extended natural 13.9-15.4% (4-5× more stable).

Medical Domain Application: In clinical decision support, structured symbolic maintained 80% diagnostic accuracy (12/15) while ensuring actionable format. Extended natural achieved only 20% actionable output (3/15) despite consuming 170-270% more tokens, demonstrating practical efficiency-effectiveness trade-offs.

Effect Size Interpretation: Structured information density (3.2-3.8) is 2.3-3.2× that of extended natural variants (1.2-1.4). The 100% token overhead of structured over ultra-minimal formatting (24 vs 12 tokens) represents the minimum investment for an 80-percentage-point reliability gain in medical diagnostic scenarios, confirmed through cross-tier replication.

C.3 Test T3 – Constraint-Resilient Prompt Recovery

Note: Methodology detailed in Appendix C.0. Test context: Degraded input recovery ("IDK symptoms. Plz help??!!"). Both approaches achieved 100% recovery success across all tiers.


Table C.3.1: Combined Performance Matrix Across All Quantization Tiers

| Metric | Tier | Structured Fallback (MCD) | Conversational Fallback |
|---|---|---|---|
| Recovery Success | Q1 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
| 95% CI | Q1 | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q1 | 66 | 71 |
| Token Efficiency | Q1 | 1.515 | 1.408 |
| Avg Latency (ms) | Q1 | 1,300 | 1,072 |
| Information Gathering | Q1 | Explicit fields | Open-ended |
| Recovery Success | Q4 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
| 95% CI | Q4 | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q4 | 202 | 208 |
| Token Efficiency | Q4 | 0.495 | 0.481 |
| Avg Latency (ms) | Q4 | 4,691 | 4,412 |
| Recovery Success | Q8 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
| 95% CI | Q8 | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q8 | 136 | 208 |
| Token Efficiency | Q8 | 0.735 | 0.481 |
| Avg Latency (ms) | Q8 | 3,405 | 4,412 |

Note: n=5 trials per approach per tier. Token efficiency = recovery success (%) / avg tokens (e.g., Q1 structured: 100/66 = 1.515). Both approaches achieved 100% resource stability (zero overflow).


Table C.3.2: Cross-Tier Consistency and Resource Trade-offs

| Characteristic | Structured (MCD) | Conversational | Trade-off Analysis |
|---|---|---|---|
| Cross-Tier Success | 100% (15/15 trials) | 100% (15/15 trials) | Equivalent functional outcome |
| Token Range | 66–202 | 71–208 | 3-35% structured advantage |
| Latency Range | 1,300–4,691 ms | 1,072–4,412 ms | 18% conversational advantage (Q1) |
| Information Structure | Explicit fields (location, duration, severity) | Open-ended invitation | Systematic vs empathetic |
| User Experience | Directive, clinical | Supportive, empathetic | Context-dependent preference |
| Edge Viability | ✅ High (optimal tokens) | ⚠️ Moderate (UX priority) | Resource vs engagement trade-off |
| Stateless Operation | Excellent (zero memory dependency) | Excellent (zero memory dependency) | Both MCD-compatible |

Table C.3.3: Fallback Strategy Deployment Recommendations

| Deployment Context | Recommended Approach | Justification | Expected Outcome |
|---|---|---|---|
| Resource-constrained edge | Structured (MCD) | 3-35% token efficiency gain | Optimal computational utilization |
| User experience priority | Conversational | 18% faster processing, empathetic tone | Enhanced engagement quality |
| Medical/clinical systems | Structured (MCD) | Systematic field collection | Actionable diagnostic data |
| General assistance | Either approach | Equivalent 100% recovery success | Context-dependent selection |
| Stateless deployment | Either approach | Both achieve zero memory dependency | Framework flexibility |

Statistical Notes for T3

Equivalent Recovery Success: Both approaches achieved 100% recovery across all three quantization tiers (15/15 trials each), validating that fallback effectiveness depends on prompt design rather than specific architectural philosophy. Zero-variance consistency (σ=0 for token counts at Q1-tier) demonstrates exceptional execution stability.

Token Efficiency Trade-off: Structured fallback reduced token usage by 3-35% across tiers (Q1: 66 vs 71 tokens, 7%; Q4: 202 vs 208 tokens, 3%; Q8: 136 vs 208 tokens, 35%), confirming that explicit field-based clarification provides resource advantages while maintaining equivalent functional outcomes. The Q8-tier reduction (35%) represents a large practical effect size.

Latency Counterintuitive Finding: Conversational fallback processed faster (1,072ms vs 1,300ms on Q1-tier: 18% reduction), contrary to theoretical assumptions about structured prompt efficiency. This demonstrates the importance of empirical testing over theoretical predictions.

Stateless Validation: T3 uniquely confirms that recovery in stateless systems depends entirely on prompt design without conversational memory. Both approaches successfully elicited clarification without dialogue history access, validating robust fallback mechanisms in memory-constrained deployments.

Deployment Context Guidance: The choice between structured and conversational fallback depends on optimization priorities: resource-constrained environments benefit from structured fallback's token efficiency (3-35% reduction), while user-experience prioritization may favor conversational fallback's empathetic engagement and faster processing. Both achieve equivalent functional outcomes (100% recovery) in stateless operation.

C.4 Test T4 – Constraint-Resilient Stateless Context Management

Note: Methodology detailed in Appendix C.0. Test context: Multi-turn appointment scheduling without memory. Turn 1: "I'd like to schedule a physiotherapy appointment for knee pain." Turn 2A (Implicit): "Make it next Monday morning." Turn 2B (Structured): "Schedule a physiotherapy appointment for knee pain on Monday morning."


Table C.4.1: Combined Performance Matrix Across All Quantization Tiers

| Metric | Tier | Structured Reinjection (MCD) | Implicit Reference |
|---|---|---|---|
| Task Success | Q1 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
| 95% CI | Q1 | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q1 | 120 | 112 |
| Token Overhead | Q1 | +7.1% | Baseline |
| Avg Latency (ms) | Q1 | 3,798 | 3,512 |
| Context Completeness | Q1 | Explicit (model-independent) | Inference-dependent |
| Task Success | Q4 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
| 95% CI | Q4 | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q4 | 193 | 190 |
| Token Overhead | Q4 | +1.6% | Baseline |
| Avg Latency (ms) | Q4 | 5,059 | 4,341 |
| Task Success | Q8 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
| 95% CI | Q8 | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q8 | 236 | 227 |
| Token Overhead | Q8 | +3.9% | Baseline |
| Avg Latency (ms) | Q8 | 11,166 | 10,462 |

Note: n=5 trials per approach per tier. Both achieved 100% resource stability. Token variance σ=0 (perfect consistency) across all trials.


Table C.4.2: Cross-Tier Reliability Analysis and Trade-offs

| Characteristic | Structured Reinjection (MCD) | Implicit Reference | Key Distinction |
|---|---|---|---|
| Cross-Tier Success | 100% (15/15 trials) | 100% (15/15 trials) | Equivalent functional outcome |
| Token Overhead Range | +1.6% to +7.1% | Baseline | Reliability insurance premium |
| Context Approach | Explicit slot-carryover (appointment type, condition, timing) | Implicit pronoun reference ("it", "next Monday") | Systematic vs inference-based |
| Reliability Model | Model-independent (each turn self-contained) | Model-dependent (requires inference capability) | Deployment guarantee difference |
| Turn Interpretability | Each turn fully interpretable standalone | Turn 2 requires Turn 1 context | Self-containment vs reference |
| Edge Deployment Viability | ✅ High (guaranteed preservation) | ⚠️ Variable (depends on model capability) | Predictability vs resource efficiency |
| Stateless Operation | ✓ Confirmed (explicit carryover) | ✓ Confirmed (inference-based) | Both truly stateless |

Table C.4.3: Deployment Context Recommendations

| Deployment Scenario | Recommended Approach | Rationale | Token Cost Trade-off |
|---|---|---|---|
| Variable model capacity | Structured (MCD) | Model-independent reliability | +1.6-7.1% overhead acceptable |
| Resource-abundant context | Implicit Reference | Lower token cost (baseline) | Leverage inference capabilities |
| Safety-critical systems | Structured (MCD) | Guaranteed context preservation | Eliminate inference uncertainty |
| Multi-tier deployment | Structured (MCD) | Consistent behavior across Q1/Q4/Q8 | Predictable overhead (1.6-7.1%) |
| Known robust models | Either approach | Both achieve 100% success | Context-dependent selection |

Statistical Notes for T4

Equivalent Task Success: Both approaches achieved 100% success across all tiers (15/15 trials each), validating that stateless multi-turn context management succeeds through either explicit reinjection or model inference when capabilities permit. Zero token variance (σ=0) at all tiers indicates highly deterministic, predictable behavior.

Reliability Insurance Premium: Structured reinjection required modest token overhead: +7.1% (Q1), +1.6% (Q4), +3.9% (Q8). This quantifies the cost of deployment-independent reliability—eliminating inference uncertainty and ensuring each turn is self-contained. The variable overhead (1.6-7.1%) suggests context preservation costs scale differently across model capacities.

Deployment Reliability Classification: Structured reinjection achieves model-independent reliability by making each turn fully interpretable without prior turn reference. Implicit reference creates model-dependent reliability, where success relies on the model's pronoun resolution and temporal reference inference capabilities.

Stateless Operation Validation: Both mechanisms are truly stateless but differ fundamentally: (1) Explicit slot-carryover (structured) guarantees preservation through systematic reinjection; (2) Implicit reference requires model inference to resolve "it" and "next Monday morning" connections to Turn 1 content. T4 confirms stateless systems can manage multi-turn interactions through both pathways, with reliability trade-offs quantified at 1.6-7.1% token overhead for guaranteed preservation.

Architectural Design Choice: Stateless context management presents a fundamental trade-off: Explicit reinjection (+1.6% to +7.1% tokens) provides model-independent reliability and guaranteed preservation, while implicit reference (baseline tokens) offers lower resource cost but model-dependent reliability. Selection depends on deployment constraints, model variance expectations, and whether predictability outweighs resource optimization.

C.5 Test T5 – Constraint-Resilient Semantic Precision

Note: Methodology detailed in Appendix C.0. Test context: Spatial navigation comparing systematic anchoring (metric + cardinal) vs contextual inference (relational positioning). Both achieved 100% task success.


Table C.5.1: Combined Performance Matrix Across All Quantization Tiers

| Metric | Tier | Structured Specification (MCD) | Naturalistic Spatial |
|---|---|---|---|
| Task Success | Q1 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
| 95% CI | Q1 | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q1 | 80 | 53 |
| Token Efficiency | Q1 | 0.625 | 0.943 |
| Avg Latency (ms) | Q1 | 1,952 | 1,111 |
| Spatial Specification | Q1 | Metric (2m) + Cardinal (north) | Relational (shadow, past it) |
| Task Success | Q4 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
| 95% CI | Q4 | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q4 | 90 | 191 |
| Token Efficiency | Q4 | 0.556 | 0.262 |
| Avg Latency (ms) | Q4 | 1,466 | 4,691 |
| Task Success | Q8 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
| 95% CI | Q8 | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q8 | 136 | 93 |
| Token Efficiency | Q8 | 0.368 | 0.538 |
| Avg Latency (ms) | Q8 | 3,182 | 2,298 |

Note: n=5 trials per approach per tier. Both approaches achieved 100% resource stability. Token variance within tiers: σ=0 (perfect consistency).


Table C.5.2: Cross-Tier Resource Variability and Execution Predictability

| Metric | Structured (MCD) | Naturalistic | Key Distinction |
|---|---|---|---|
| Cross-Tier Success | 100% (15/15 trials) | 100% (15/15 trials) | Equivalent functional outcome |
| Token Pattern | Q1: 80 → Q4: 90 → Q8: 136 | Q1: 53 → Q4: 191 → Q8: 93 | Predictable vs unpredictable scaling |
| Q1 Token Overhead | +51% (80 vs 53) | Baseline | Structured pays efficiency cost |
| Q4 Token Overhead | Baseline | +112% (191 vs 90) | Reversed pattern |
| Q8 Token Overhead | +46% (136 vs 93) | Baseline | Pattern returns to Q1 direction |
| Execution Pattern | Systematic anchoring | Contextual inference | Model-independent vs model-dependent |
| Deployment Reliability | Predictable (metric + cardinal) | Variable (relational metaphors) | Safety-critical suitability difference |

Table C.5.3: Deployment Context Recommendations

| Application Domain | Recommended Approach | Critical Requirement | Justification |
|---|---|---|---|
| Safety-critical robotics | Structured (mandatory) | Unambiguous spatial coordinates | Eliminates interpretation ambiguity |
| Autonomous navigation | Structured (mandatory) | Deterministic action sequences | Metric + cardinal eliminates drift |
| Medical procedures | Structured (mandatory) | Precise spatial positioning | Safety requires quantifiable measurements |
| Resource-predictable edge | Structured (recommended) | Consistent resource patterns | Tier-independent execution stability |
| General-purpose contexts | Either approach | Tolerance for spatial precision variation | Both achieve 100% success with capable models |
| Cross-model portability | Structured (recommended) | Model-independent execution | No reliance on inference capabilities |

Statistical Notes for T5

Equivalent Task Success: Both approaches achieved 100% task success across all three quantization tiers (15/15 trials each), validating that spatial reasoning can succeed through either systematic anchoring or contextual inference when models possess adequate capabilities.

Tier-Dependent Token Variability: Token overhead showed unpredictable cross-tier patterns demonstrating deployment reliability differences:

  • Q1-tier: Structured +51% overhead (80 vs 53 tokens)
  • Q4-tier: Naturalistic +112% overhead (191 vs 90 tokens) — reversed pattern
  • Q8-tier: Structured +46% overhead (136 vs 93 tokens)

This non-monotonic scaling for naturalistic approaches (53→191→93) demonstrates unpredictable resource requirements across model capacities, while structured approaches show predictable scaling (80→90→136), validating MCD's constraint-resilience principle.

Execution Predictability: Structured specification achieved deployment-independent predictability through systematic spatial anchoring (metric distance, cardinal direction, explicit sequencing), eliminating reliance on model-specific spatial inference capabilities. Naturalistic approaches created model-dependent execution where success relies on contextual inference to resolve relational metaphors ("shadow") and implied sequencing ("continue past").

Safety-Critical Implications: For applications requiring precise spatial behavior (robotics, medical, autonomous systems), structured specification provides unambiguous spatial coordinates through quantifiable measurements. The Q4-tier reversal (naturalistic consuming 112% more tokens despite Q1/Q8 efficiency) confirms that relational spatial reasoning creates unpredictable resource patterns unsuitable for deployment-critical contexts.

Key Trade-off: The tier-specific variability validates that execution predictability (structured: consistent cross-tier patterns) outweighs token minimization (naturalistic: variable efficiency) when deployment reliability is prioritized over resource optimization in individual tiers.

C.6 Test T6 – Constraint-Resilient Resource Optimization Analysis

Note: Methodology detailed in Appendix C.0. Task: "Summarize causes of Type 2 diabetes." All variants achieved 100% task completion across all tiers (15/15 trials each). Primary differentiator: computational efficiency. Resource waste = (tokens_used - hybrid_baseline) / hybrid_baseline × 100%.
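The waste figures in Table C.6.1 follow directly from this definition, as the sketch below shows (the function name is illustrative):

```typescript
// Resource waste relative to the hybrid baseline (Test T6): extra tokens
// consumed beyond the most efficient variant, as a percentage of the baseline.
function resourceWaste(tokensUsed: number, hybridBaseline: number): number {
  return ((tokensUsed - hybridBaseline) / hybridBaseline) * 100;
}

resourceWaste(131, 94); // ≈ 39%: Structured MCD at Q1
resourceWaste(173, 94); // ≈ 84%: Verbose at Q1
```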


Table C.6.1: Combined Performance Matrix Across All Quantization Tiers

| Metric | Tier | Structured MCD | Verbose | CoT | Few-Shot | Hybrid |
|---|---|---|---|---|---|---|
| Task Completion | Q1 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
| Avg Tokens | Q1 | 131 | 173 | 171 | 114 | 94 |
| Resource Efficiency | Q1 | 0.76 ± 0.04 | 0.58 ± 0.08 | 0.58 ± 0.08 | 0.88 ± 0.05 | 1.06 ± 0.03 |
| Resource Waste | Q1 | 39% | 84% | 82% | 21% | 0% (baseline) |
| Avg Latency (ms) | Q1 | 4,285 | 4,213 | 4,216 | 1,901 | 1,965 |
| Task Completion | Q4 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
| Avg Tokens | Q4 | 196 | 241 | 239 | 117 | 104 |
| Resource Efficiency | Q4 | 0.51 ± 0.03 | 0.41 ± 0.05 | 0.42 ± 0.06 | 0.85 ± 0.04 | 0.96 ± 0.02 |
| Resource Waste | Q4 | 88% | 132% | 130% | 13% | 0% (baseline) |
| Avg Latency (ms) | Q4 | 4,837 | 4,502 | 5,634 | 860 | 1,514 |
| Task Completion | Q8 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
| Avg Tokens | Q8 | 245 | 289 | 287 | 129 | 107 |
| Resource Efficiency | Q8 | 0.41 ± 0.03 | 0.35 ± 0.05 | 0.35 ± 0.06 | 0.77 ± 0.04 | 0.93 ± 0.02 |
| Resource Waste | Q8 | 127% | 169% | 167% | 20% | 0% (baseline) |
| Avg Latency (ms) | Q8 | 6,850 | 7,245 | 7,198 | 2,980 | 2,545 |

Note: n=5 trials per variant per tier. All variants achieved 3.5/4.0 semantic fidelity. Resource efficiency = task completion (%) / token count (e.g., Hybrid Q1: 100/94 = 1.06). Effect sizes: Hybrid vs CoT/Verbose (Cohen's d > 2.0, very large).


Table C.6.2: Cross-Tier Efficiency Classification and Waste Scaling Patterns

| Variant | Efficiency Category | Q1 Waste | Q4 Waste | Q8 Waste | Waste Trend | Cross-Tier Consistency |
|---|---|---|---|---|---|---|
| Hybrid | Superior Optimization | 0% | 0% | 0% | Flat (0%) | 100% stable |
| Few-Shot | MCD-Compatible | 21% | 13% | 20% | Flat (18% avg) | 100% stable |
| Structured MCD | Moderate Bloat | 39% | 88% | 127% | Increasing (3.3×) | 100% stable |
| Chain-of-Thought | Process Bloat | 82% | 130% | 167% | Increasing (2.0×) | 100% stable |
| Verbose | Over-Engineered | 84% | 132% | 169% | Increasing (2.0×) | 100% stable |

Key Pattern: MCD-compatible approaches (Hybrid, Few-Shot) maintain ≤21% waste regardless of tier. Non-MCD approaches (CoT, Verbose, Structured MCD) show 2.0-3.3× waste increase Q1→Q8, demonstrating computational debt compounding with model capacity. Perfect ranking consistency across all tiers (100%) validates categorical efficiency differences.


Table C.6.3: Resource Optimization Plateau Evidence

| Finding | Evidence | Implication |
|---|---|---|
| Universal Task Success | 100% completion across all 5 variants × 3 tiers (75/75 trials) | Success ≠ efficiency under constraints |
| Capability Plateau | All variants achieved 3.5/4.0 semantic fidelity regardless of token count (94-289 tokens) | Additional tokens beyond 90-130 provide no quality benefit |
| Structural vs Process Distinction | Few-Shot (structural): 18% avg waste; CoT (process): 126% avg waste; effect size d=2.4 | Structural guidance scales efficiently; process guidance creates overhead |
| Hybrid Superiority | Consistent optimal performance: Q1 (1.06), Q4 (0.96), Q8 (0.93); 28-39% efficiency gain | Combining constraints + examples achieves optimal resource utilization |
| Waste Compounding | CoT/Verbose waste increases 2.0× from Q1→Q8 while Few-Shot remains stable | Process approaches scale poorly with model capacity |

Statistical Notes for T6

Universal Task Success with Variable Efficiency: All five strategies achieved 100% completion (75/75 trials total), demonstrating that success does not equal efficiency. The key differentiator was computational resource utilization (0-169% waste range), validating the focus on efficiency metrics as the primary outcome.

Resource Optimization Plateau: Consistent plateau around 90-130 tokens across approaches validated independently in all three tiers. Beyond this threshold, additional tokens provided no semantic quality improvements (all variants: 3.5 fidelity), confirming resource optimization ceiling existence.

Structural vs Process Guidance Distinction: Few-shot examples (structural guidance) achieved 18% average waste (21%→13%→20% across tiers) while Chain-of-Thought (process guidance) demonstrated 126% average waste (82%→130%→167%), representing very large effect size (Cohen's d = 2.4). This validates fundamental distinction between constraint-compatible structural templates and resource-intensive process reasoning.

Cross-Tier Validation Strength: Perfect consistency of efficiency rankings across three independent quantization tiers (Q1/Q4/Q8) provides robust evidence for categorical efficiency differences. No variant changed its efficiency category across tiers, demonstrating 100% classification stability and strengthening findings beyond per-tier sample limitations (n=5 per tier, n=15 total per variant).

Design Implication: Resource-constrained deployments should prioritize structural guidance (few-shot examples, hybrid approaches) over process guidance (chain-of-thought reasoning) when efficiency is critical, as structural approaches maintain ≤21% resource waste across varying model capacities while process approaches demonstrate 2.0-3.3× waste compounding.

C.7 Test T7 – Constraint-Resilient Bounded Adaptation vs. Structured Planning

Note: Methodology detailed in Appendix C.0. Navigation task with escalating constraint complexity: Baseline → Simple (+ wet floors) → Complex (+ detours, red corridors). All variants achieved 100% completion; resource efficiency is the critical differentiator.


Table C.7.1: Combined Performance Matrix Across All Quantization Tiers

| Variant | Tier | Baseline Tokens | Simple Tokens | Complex Tokens | Completion Rate | Avg Latency (ms) | Resource Efficiency |
|---|---|---|---|---|---|---|---|
| MCD Baseline | Q1 | 87 | 67 | 70 | 5/5 (100%) | 1,400 | 1.149–1.493 |
| MCD Baseline | Q4 | 118 | 121 | 130 | 5/5 (100%) | 2,613 | 0.769–0.847 |
| MCD Baseline | Q8 | 123 | 133 | 140 | 5/5 (100%) | 3,416 | 0.714–0.813 |
| CoT Planning | Q1 | 152 | 152 | 152 | 5/5 (100%) | 3,422 | 0.658 |
| CoT Planning | Q4 | 188 | 188 | 188 | 5/5 (100%) | 2,624 | 0.381 |
| CoT Planning | Q8 | 233 | 233 | 233 | 5/5 (100%) | 4,495 | 0.343 |
| Few-Shot | Q1 | 143 | 143 | 143 | 5/5 (100%) | 2,663 | 0.699 |
| Few-Shot | Q4 | 188 | 188 | 188 | 5/5 (100%) | 2,624 | 0.381 |
| Few-Shot | Q8 | 128 | 128 | 128 | 5/5 (100%) | 1,620 | 1.062 |
| System Role | Q1 | 70 | 70 | 70 | 5/5 (100%) | 687 | 1.429 |
| System Role | Q4 | 157 | 157 | 157 | 5/5 (100%) | 2,638 | 0.610 |
| System Role | Q8 | 162 | 162 | 162 | 5/5 (100%) | 3,422 | 0.617 |
| Verbose | Q1 | 135 | 135 | 135 | 5/5 (100%) | 3,205 | 0.741 |
| Verbose | Q4 | 173 | 173 | 173 | 5/5 (100%) | 4,213 | 0.487 |
| Verbose | Q8 | 219 | 219 | 219 | 5/5 (100%) | 5,666 | 0.386 |

Note: n=5 trials per variant per complexity level per tier (45 total observations per variant). Resource efficiency = completion rate (%) / avg tokens (e.g., MCD Q1: 100/87 = 1.149 at baseline complexity, 100/67 = 1.493 at simple complexity).


Table C.7.2: Cross-Tier Consistency and Resource Overhead Analysis

| Variant | Token Scaling Pattern | Cross-Tier Success | Avg Resource Cost Ratio | Deployment Viability |
|---|---|---|---|---|
| MCD Baseline | Adaptive (67→87 tokens) | 100% (45/45 trials) | 1.0× (baseline) | ✅ High (optimal scaling) |
| CoT Planning | Constant (152–233 tokens) | 100% (45/45 trials) | 2.2× overhead | ❌ Low (invariant cost) |
| Few-Shot | Consistent (128–188 tokens) | 100% (45/45 trials) | 1.3× | ✅ Moderate (stable) |
| System Role | Minimal (70–162 tokens) | 100% (45/45 trials) | 0.9× | ✅ High (efficient) |
| Verbose | High baseline (135–219 tokens) | 100% (45/45 trials) | 1.5× | ⚠️ Moderate (over-engineered) |

Resource Cost Ratio: Calculated relative to the MCD baseline across all tiers and complexity levels; the 2.2× figure for CoT is the weighted average across conditions. Combining CoT's token ratio (1.75×) with its latency ratio (1.38×) yields a combined resource cost of ≈2.41×.
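A sketch of that combination rule (an illustrative helper, not the study's code):

```typescript
// Combined resource cost ratio (Table C.7.2): token ratio multiplied by
// latency ratio, both relative to the MCD baseline.
function combinedCostRatio(tokens: number, latencyMs: number,
                           baseTokens: number, baseLatencyMs: number): number {
  return (tokens / baseTokens) * (latencyMs / baseLatencyMs);
}

combinedCostRatio(152, 3422, 87, 2480); // token ratio 1.75 × latency ratio 1.38 ≈ 2.41
```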


Table C.7.3: Constraint Scaling Behavior and Edge Deployment Recommendations

| Scaling Pattern | Token Range | Efficiency Class | Key Characteristic | Recommended For |
|---|---|---|---|---|
| Adaptive (MCD) | 67–140 | Optimal | Scales with complexity (67→70→87) | Edge devices, mobile platforms |
| Constant (CoT) | 152–233 | Poor | Invariant overhead regardless of task | ❌ Not constraint-suitable |
| Consistent (Few-Shot) | 128–188 | High | Stable structure-guided approach | General-purpose deployment |
| Minimal (System Role) | 70–162 | Optimal | Low baseline with moderate scaling | Resource-critical applications |
| High Baseline (Verbose) | 135–219 | Poor | Excessive initial cost | ❌ Avoid for edge deployment |

Statistical Notes for T7

Equivalent Task Success with Divergent Resource Costs: All five variants achieved 100% completion (45/45 trials each: 5 trials × 3 tiers × 3 complexity levels), validating that task success is independent of prompting approach. Resource efficiency becomes the sole differentiator, with dramatic variation (0.343 to 1.493 efficiency scores).

CoT Resource Overhead Quantification: Chain-of-thought consumed 1.75-2.4× more tokens across tiers with weighted average 2.2× computational cost for identical outcomes. Combined resource cost (tokens × latency): CoT vs MCD baseline = 2.41× overhead, representing exceptionally large effect size (Cohen's d > 2.0).

Constraint Scaling Validation: MCD demonstrated adaptive scaling (baseline 87 → simple 67 → complex 70 tokens) while CoT maintained constant 152-233 token overhead regardless of task complexity. This invariance demonstrates fundamental architectural mismatch with constraint-first design principles.

Multi-Dimensional Validation: Perfect reliability across 45 observations per variant (completion rate σ=0.00). Resource efficiency patterns remained consistent across all conditions with MCD variants achieving 1.5-2.5× superior efficiency. Cross-tier and cross-complexity replication strengthens confidence despite small per-condition samples.

Deployment Implications: CoT's widespread adoption reflects optimization for unconstrained environments. T7 demonstrates that resource-bounded contexts require fundamentally different strategies. The constant 152-233 token CoT overhead vs MCD's adaptive 67-140 token range represents design paradigm mismatch for edge deployment, with 2.2-2.4× efficiency penalty translating to tangible costs (battery life, latency, throughput).

C.8 Test T8 – Constraint-Resilient Offline Execution with Different Prompt Types

Note: Methodology detailed in Appendix C.0. Test context: WebAssembly (WebLLM) offline execution, "Summarize solar power benefits in ≤50 tokens." All variants achieved 100% completion (15/15 trials per variant across tiers); the focus is therefore resource efficiency differentiation.


Table C.8.1: Combined Performance Matrix Across All Quantization Tiers

| Metric | Tier | Structured | Verbose | CoT | Few-Shot | System Role | Hybrid |
|---|---|---|---|---|---|---|---|
| Completion | Q1 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
| Avg Tokens | Q1 | 131 | 156 | 170 | 97 | 144 | 68 |
| Avg Latency (ms) | Q1 | 4,273 | 4,383 | 4,345 | 1,757 | 4,184 | 1,242 |
| Memory Δ (MB) | Q1 | +18 | +6 | -2 | -9 | -4 | 0 |
| Completion | Q4 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
| Avg Tokens | Q4 | 191 | 221 | 233 | 221 | 209 | 205 |
| Avg Latency (ms) | Q4 | 4,477 | 4,548 | 4,495 | 5,030 | 4,587 | 4,346 |
| Memory Δ (MB) | Q4 | +6 | 0 | -2 | -1 | -2 | +8 |
| Completion | Q8 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
| Avg Tokens | Q8 | 201 | 211 | 240 | 211 | 208 | 116 |
| Avg Latency (ms) | Q8 | 5,043 | 4,940 | 5,293 | 5,093 | 4,980 | 2,445 |
| Memory Δ (MB) | Q8 | +2 | -6 | +5 | +2 | -1 | +10 |

Note: n=5 trials per variant per tier. 95% CI (Wilson): [0.57, 1.00] for all completion rates. Memory stability: all variants remained within ±20MB (WebAssembly stable range).


Table C.8.2: Cross-Tier Resource Efficiency and Deployment Classification

| Variant | Token Range (Q1/Q4/Q8) | Latency Profile | Deployment Class | Edge Viability | Resource Efficiency Score |
|---|---|---|---|---|---|
| Hybrid | 68 / 205 / 116 | Low (1,242–4,346ms) | Edge-superior | ✅ Optimal | 9.5/10 |
| Few-Shot | 97 / 221 / 211 | Moderate (1,757–5,093ms) | Edge-compatible | ✅ High | 9.0/10 |
| Structured | 131 / 191 / 201 | Moderate (4,273–5,043ms) | Edge-optimized | ✅ High | 8.5/10 |
| System Role | 144 / 209 / 208 | Moderate (4,184–4,980ms) | Edge-compatible | ✅ High | 8.0/10 |
| Verbose | 156 / 221 / 211 | High (4,383–4,940ms) | Edge-challenging | ⚠️ Moderate | 6.0/10 |
| CoT | 170 / 233 / 240 | High (4,345–5,293ms) | Resource-intensive | ❌ Avoid | 2.5/10 |

Resource Efficiency Score: Composite of token efficiency (40%), latency (30%), memory stability (20%), browser compatibility (10%). Scale: 0-10.
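A sketch of the composite, under the assumption that each component has already been normalized to a 0-10 subscore (the normalization itself is not specified in this appendix):

```typescript
// Composite resource efficiency score (Table C.8.2), weighted as stated
// above: tokens 40%, latency 30%, memory stability 20%, browser
// compatibility 10%. Each field is an assumed 0-10 subscore.
interface EfficiencyComponents {
  tokenEfficiency: number;      // 0-10
  latency: number;              // 0-10
  memoryStability: number;      // 0-10
  browserCompatibility: number; // 0-10
}

function resourceEfficiencyScore(c: EfficiencyComponents): number {
  return 0.4 * c.tokenEfficiency +
         0.3 * c.latency +
         0.2 * c.memoryStability +
         0.1 * c.browserCompatibility;
}
```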


Table C.8.3: Resource Efficiency Trade-off Analysis

| Comparison | Token Overhead | Latency Impact | Deployment Recommendation |
|---|---|---|---|
| Hybrid vs CoT (Q1) | 2.5× fewer tokens (68 vs 170) | 3.5× faster (1,242ms vs 4,345ms) | ✅ Hybrid optimal for edge |
| Few-Shot vs CoT (Q1) | 1.8× fewer tokens (97 vs 170) | 2.5× faster (1,757ms vs 4,345ms) | ✅ Few-Shot edge-compatible |
| Hybrid vs CoT (Q8) | 2.1× fewer tokens (116 vs 240) | 2.2× faster (2,445ms vs 5,293ms) | ✅ Hybrid maintains advantage |
| Structured vs Verbose (Q1) | 1.2× fewer tokens (131 vs 156) | Equivalent latency | ⚠️ Marginal efficiency gain |
| Cross-Tier Consistency | All variants: 100% completion | Zero failures (15/15 per approach) | ✅ Functional equivalence validated |

Statistical Notes for T8

Universal Task Success: All six approaches achieved 100% completion (15/15 trials per approach across Q1/Q4/Q8), validating functional equivalence. Focus therefore shifts to deployment resource efficiency rather than capability differences.

Token Efficiency Range: Dramatic resource variations despite identical outcomes: Q1-tier: 68 tokens (Hybrid) to 170 tokens (CoT) = 2.5× difference; Q8-tier: 116 tokens (Hybrid) to 240 tokens (CoT) = 2.1× difference. This confirms Chain-of-Thought creates substantial deployment overhead without functional benefits.

Latency Performance: Hybrid (1,242ms) and Few-Shot (1,757ms) demonstrated 2.5-3.5× faster execution vs CoT (4,345ms) at Q1-tier, validating that structured guidance optimizes browser execution while maintaining equivalent outcomes.

Memory Stability: All variants maintained stable profiles (±20MB range), confirming WebAssembly memory management handled all approaches without crashes or browser instability. Zero failures across 90 total trials (6 variants × 3 tiers × 5 trials).

Deployment Resource Screening: Results validate that constraint-resilient frameworks must distinguish edge-efficient enhancements (few-shot patterns, role-based framing) from resource-intensive techniques (process-heavy reasoning) during design phase. The 2.5× token cost and 3.5× latency differences represent large practical effect sizes for deployment efficiency.

Cross-Tier Replication: Efficiency patterns held consistent across all quantization levels, with Hybrid maintaining optimal performance (Q1: 68 tokens, Q4: 205 tokens, Q8: 116 tokens) compared to CoT resource intensity (Q1: 170, Q4: 233, Q8: 240 tokens).

C.9 Test T9 – Constraint-Resilient Fallback Loop Optimization

Note: Methodology detailed in Appendix C.0. Test context: Underspecified input recovery ("Schedule a cardiology checkup."). Both approaches achieved 100% recovery success; analysis focuses on resource efficiency.


Table C.9.1: Combined Performance Matrix Across All Quantization Tiers

| Metric | Tier | Constraint-Resilient Loop | Resource-Intensive Chain |
|---|---|---|---|
| Recovery Success | Q1 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
| 95% CI | Q1 | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q1 | 73 | 129 |
| Token Efficiency | Q1 | 1.370 | 0.775 |
| Avg Latency (ms) | Q1 | 1,929 | 4,071 |
| Token Variance | Q1 | σ = 0 (0%) | σ = 12% |
| Fallback Depth | Q1 | 2 steps (bounded) | 3+ steps (recursive) |
| Recovery Success | Q4 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
| 95% CI | Q4 | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q4 | 106 | 188 |
| Token Efficiency | Q4 | 0.943 | 0.532 |
| Avg Latency (ms) | Q4 | 5,148† | 4,371 |
| Token Variance | Q4 | σ = 0 (0%) | σ = 9% |
| Recovery Success | Q8 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
| 95% CI | Q8 | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | Q8 | 149 | 230 |
| Token Efficiency | Q8 | 0.671 | 0.435 |
| Avg Latency (ms) | Q8 | 4,443 | 6,885 |
| Token Variance | Q8 | σ = 0 (0%) | σ = 8% |

Note: n=5 trials per approach per tier. †Q4-tier latency anomaly (one outlier at 45 s) for the constraint-resilient approach. Token efficiency = recovery success (%) / avg tokens (e.g., Q1: 100/73 = 1.370).


Table C.9.2: Cross-Tier Consistency and Resource Optimization

| Characteristic | Constraint-Resilient Loop | Resource-Intensive Chain | Efficiency Advantage |
|---|---|---|---|
| Cross-Tier Recovery | 100% (15/15 trials) | 100% (15/15 trials) | Equivalent functional outcome |
| Token Range | 73–149 | 129–230 | 35-44% reduction |
| Clarification Strategy | Slot-specific targeting (date, time) | Open-ended recursive ("What else?") | Explicit vs exploratory |
| Recovery Depth | Bounded at 2 steps (deterministic) | Recursive 3+ steps (variable) | Predictable resource ceiling |
| Token Consistency | Zero variance (σ=0 at Q1) | 8-12% variance across tiers | 100% vs 88-92% predictability |
| Edge Deployment | ✅ High (predictable budget) | ⚠️ Moderate (variable demand) | Resource planning advantage |
| Recovery Distribution | 60% Step 2, 40% Step 1 (Q1-tier) | 100% full recursive chain | Faster convergence |

Table C.9.3: Fallback Design Comparison and Deployment Guidance

| Design Element | Constraint-Resilient | Resource-Intensive | Deployment Recommendation |
|---|---|---|---|
| Clarification Example | "Please provide date and time for cardiology appointment" | "What else do I need to know? Be specific." | Explicit > open-ended for efficiency |
| Information Targeting | Explicit slots (date, time, type) | Open-ended broad questioning | Slot-specific converges 35-44% faster |
| Recovery Predictability | Deterministic 2-step maximum | Variable 3+ step recursion | Bounded depth for resource planning |
| Resource Efficiency | 43% fewer tokens (Q1), 44% (Q4), 35% (Q8) | Baseline comparison | Large practical effect size |
| Token Consistency | Zero variance (σ=0) | High variance (8-12%) | Predictable vs unpredictable cost |
| Best Use Case | Resource-constrained edge deployment | Exploratory conversational systems | Context-dependent selection |

Statistical Notes for T9

Equivalent Recovery with Substantial Efficiency Gap: Both approaches achieved 100% recovery success across all three tiers (15/15 trials each), validating equivalent functional outcomes. Token efficiency differed substantially: 43% reduction on Q1 (73 vs 129 tokens), 44% on Q4 (106 vs 188), and 35% on Q8 (149 vs 230). This consistent cross-tier advantage represents large practical effect size (Cohen's d > 1.5).

Bounded Depth Advantage: Constraint-resilient loops bounded fallback at 2 steps maximum with 60% Q1-tier recovery by Step 2 and 40% by Step 1, while resource-intensive chains required 3+ recursive steps in all trials. This deterministic depth ceiling provides predictable resource budgets essential for edge deployment planning.
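A minimal sketch of such a bounded, slot-specific loop follows. The slot names, prompt strings, and helper signatures are illustrative, since the appendix specifies only the 2-step bound and the slot-targeting strategy.

```typescript
// Bounded fallback loop with slot-specific clarification (Test T9 design).
interface SlotState { date?: string; time?: string; }

const MAX_FALLBACK_STEPS = 2; // deterministic depth ceiling

function clarificationPrompt(slots: SlotState): string | null {
  const missing = (["date", "time"] as const).filter((k) => !slots[k]);
  if (missing.length === 0) return null; // all slots filled: no fallback needed
  // Name the missing fields explicitly instead of asking open-ended questions.
  return `Please provide ${missing.join(" and ")} for the cardiology appointment.`;
}

function recover(slots: SlotState,
                 askUser: (q: string) => Partial<SlotState>): SlotState {
  for (let step = 0; step < MAX_FALLBACK_STEPS; step++) {
    const question = clarificationPrompt(slots);
    if (question === null) break;            // recovered
    Object.assign(slots, askUser(question)); // merge newly supplied slots
  }
  return slots; // after 2 steps, proceed with the best available information
}
```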

Zero Token Variance: Constraint-resilient loops showed zero token variance (σ=0) across all Q1-tier trials and maintained ≤1% variance on Q4/Q8, demonstrating highly consistent slot-specific clarification behavior. Resource-intensive chains showed 8-12% variance due to variable recursive questioning depth, creating unpredictable resource demands unsuitable for constraint-bounded environments.

Slot-Specific Convergence: Explicit slot targeting ("Please provide date and time") proved consistently more efficient than open-ended questioning ("What else do I need to know?"). Slot-specific approaches converge faster by explicitly naming missing fields, eliminating iterative discovery processes inherent in recursive clarification chains.

Design Principle Validation: Bounding recovery depth at 2 steps with slot-specific clarification provides optimal balance between recovery reliability (100%) and computational efficiency (35-44% reduction). Open-ended recursive chains waste tokens on repeated broad requests without improving recovery success, creating unnecessary overhead in resource-constrained scenarios. Cross-tier consistency validates this design principle scales effectively across model capacity variations.

C.10 Test T10 – Constraint-Resilient Quantization Tier Optimization

Note: Methodology detailed in Appendix C.0. Task: "Summarize pancreas functions in ≤60 tokens." All tiers achieved 100% completion; test validates optimal resource sufficiency principle.


Table C.10.1: Comprehensive Quantization Tier Performance Matrix

| Metric | Q1 (1-bit) | Q4 (4-bit) | Q8 (8-bit) |
|---|---|---|---|
| Task Completion | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
| 95% CI | [0.57, 1.00] | [0.57, 1.00] | [0.57, 1.00] |
| Avg Tokens | 131 | 114 (13% ↓) | 94 (28% ↓) |
| Avg Latency (ms) | 4,285 | 1,901 (56% faster) | 1,965 (54% faster) |
| Computational Overhead | Minimal (1-bit ops) | Low (4-bit ops) | High (8-bit ops, 8× per operation) |
| Resource Optimization | ✅ Optimal | ✅ High (balanced) | ❌ Over-provisioned |
| Constraint Compliant | ✅ Yes | ✅ Yes | ⚠️ No (unnecessary overhead) |
| Adaptive Optimization | Q1→Q4 (1/5 trials) | None | None |
| Edge Deployment | ✅ Maximum efficiency | ✅ High viability | ⚠️ Suboptimal (precision waste) |

Note: n=5 trials per tier. Zero variance in token counts (σ=0) indicates deterministic generation. Latency variance <20ms across all tiers.


Table C.10.2: Resource Efficiency Analysis and Deployment Verdict

| Tier | Token Efficiency | Computational Overhead | Holistic Assessment | Deployment Verdict |
|---|---|---|---|---|
| Q1 (1-bit) | Lowest (131 tokens) | Minimal (1-bit precision per operation) | Optimal resource sufficiency | Recommended (maximum edge efficiency) |
| Q4 (4-bit) | Medium (114 tokens, 13% reduction) | Low (4× overhead vs Q1) | Balanced efficiency-performance | Recommended (optimal for 80% of tasks) |
| Q8 (8-bit) | Highest (94 tokens, 28% reduction) | High (8× overhead vs Q1) | Over-provisioned computational cost | Not recommended (token gains negated by 8× computational overhead) |

Critical Finding: Q8's 28% token reduction represents resource over-provisioning when Q1 achieves identical 100% task success. The 8× computational overhead per operation exceeds efficiency benefits of lower token count, violating minimal viable resource allocation principle.


Table C.10.3: Adaptive Optimization Logic and Cross-Tier Patterns

| Optimization Pattern | Frequency | Trigger Condition | Constraint-Resilient Logic |
|---|---|---|---|
| Q1 maintained | 4/5 trials (80%) | Optimal baseline sufficiency | Default tier for edge deployment |
| Q1→Q4 upgrade | 1/5 trials (20%) | Computational efficiency enhancement detected | Justified by 13% token reduction without violating overhead threshold |
| Q1→Q8 upgrade | 0/5 trials (0%) | Never triggered | Prohibited: 8× computational overhead violates constraint-resilient principles despite 28% token gain |
| Q4 maintained | 5/5 trials (100%) | Balanced efficiency achieved | Optimal for most constraint-bounded tasks |

Adaptive Philosophy: Tier upgrades justified only when computational efficiency enhancements occur without violating constraint-resilient principles. Q8's superior token count (94 vs 131) is counterproductive when 8× computational overhead per operation is considered.
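The upgrade rule can be sketched as a net-cost comparison. The threshold value below is an illustrative assumption, chosen so that the Q1→Q4 upgrade passes while Q1→Q8 fails, matching Table C.10.3; the per-operation overhead multipliers follow the appendix.

```typescript
// Adaptive tier-upgrade rule (sketch): upgrade only when the token
// reduction outweighs the added per-operation cost.
type Tier = "Q1" | "Q4" | "Q8";

const OVERHEAD: Record<Tier, number> = { Q1: 1, Q4: 4, Q8: 8 };

function shouldUpgrade(fromTier: Tier, toTier: Tier,
                       fromTokens: number, toTokens: number,
                       threshold = 4.0 /* illustrative assumption */): boolean {
  const tokenRatio = toTokens / fromTokens;                   // < 1 is a gain
  const overheadRatio = OVERHEAD[toTier] / OVERHEAD[fromTier]; // per-op cost
  return tokenRatio * overheadRatio < threshold;
}

shouldUpgrade("Q1", "Q4", 131, 114); // 0.87 × 4 ≈ 3.48 -> true: upgrade justified
shouldUpgrade("Q1", "Q8", 131, 94);  // 0.72 × 8 ≈ 5.74 -> false: overhead dominates
```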


Statistical Notes for T10

Equivalent Task Success: All three tiers achieved 100% completion (15/15 total trials), providing categorical evidence that quantization tier selection does not compromise functional effectiveness. This validates ultra-low-bit quantization (Q1) maintains task capability without sacrificing reliability.

Counterintuitive Token Efficiency Paradox: Q8 achieved lowest token usage (94 tokens, 28% reduction from Q1) but represents resource over-provisioning because 8-bit precision operations consume 8× computational resources per operation compared to 1-bit. This demonstrates that token count alone is insufficient for resource efficiency assessment—computational overhead per operation must be evaluated.

Computational Overhead Analysis: Q1 (1-bit) requires minimal computational resources per operation; Q4 (4-bit) requires 4× computational resources vs Q1; Q8 (8-bit) requires 8× computational resources vs Q1. Despite Q8's 28% token advantage, the 8× overhead results in net over-provisioning when Q1 achieves identical task success.

Adaptive Optimization Validation: Q1→Q4 triggered in 1/5 trials (20%) when efficiency enhancement justified tier upgrade. Critically, Q1→Q8 never triggered (0/5 trials), validating that constraint-resilient logic prohibits unnecessary precision increases when lower tiers achieve equivalent outcomes.

Latency Patterns: Q4 achieved fastest processing (1,901ms) despite mid-tier precision, representing optimal balance between quantization compression and computational efficiency. Q8's slightly slower latency vs Q4 (1,965ms vs 1,901ms, 3% slower) may indicate memory bandwidth saturation with larger parameters.

Cross-Tier Consistency: Perfect token consistency (σ=0) and minimal latency variance (<20ms) demonstrate deterministic performance suitable for production deployment. The combination of 100% task completion across 15 trials and zero-variance token generation provides robust evidence despite small per-tier sample sizes.