Designing Lightweight AI Agents for Edge Deployment
A Minimal Capability Framework with Insights from Literature Synthesis
Appendix C: Cross-Validation Performance Matrices and Statistical Analysis
This appendix provides comprehensive performance matrices, statistical validation, and trial-by-trial evidence supporting the MCD framework evaluation presented in Chapter 6 (Tests T1-T10). All data presented follow the validation methodology established in Section 3.3 (Simulation Validation Strategy) and Section 3.4 (Walkthrough Design Method).
C.0.1 Repeated Trials Methodology
Experimental Design:
- Sample size: n=5 independent measurements per variant approach
- Total validation measurements: Approximately 1,050 measurements across 10 tests (T1-T10: 7 variants × 5 trials × 3 tiers per test), plus 75 measurements across 3 walkthroughs (W1-W3: 5 variants × 5 trials per walkthrough)
- Quantization tiers tested: Q1-tier (Qwen2-0.5B), Q4-tier (TinyLlama-1.1B), Q8-tier (Llama-3.2-1B)
- Execution environment: Browser-based WebAssembly (WebLLM) offline execution
- Measurement precision: performance.now() API for sub-millisecond timing precision
Statistical Approach:
- Binary outcomes (completion rates): Fisher's Exact Test for categorical completion rates where extreme separability exists (e.g., 100% vs 0%)
- Continuous metrics (tokens, latency): Welch's t-test for comparing means between variants; descriptive statistics (mean ± standard deviation) reported for all metrics
- Confidence intervals: 95% CI calculated using Wilson score method for binomial proportions
- Effect size measurement: Cohen's d for continuous variables where applicable; Cohen's h for binary outcome comparisons (see the computational sketch after this list)
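For reference, the sketch below shows how the Wilson score interval and Cohen's h can be computed as standardly defined; the function names are illustrative and are not the thesis's original analysis code.

```python
# Illustrative sketch (not the original analysis code): Wilson score interval
# and Cohen's h for the binary completion-rate comparisons described above.
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (z = 1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for two proportions via the arcsine transformation."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Example: the extreme 5/5 vs 0/5 completion-rate split.
print(wilson_ci(5, 5), wilson_ci(0, 5))
print(cohens_h(1.0, 0.0))  # ~3.14, the maximal separation between proportions
```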
Sample Size Acknowledgment: While n=5 per variant represents a small sample size that limits traditional parametric inference, the methodology provides robust qualitative evidence through:
- Extreme effect sizes: Binary outcomes with complete categorical separation (100% vs 0% completion) provide clear differentiation
- Cross-tier replication: Patterns replicated across three independent quantization tiers (Q1/Q4/Q8) strengthen reliability beyond single-tier testing
- Zero-variance consistency: Perfect within-variant consistency (e.g., 5/5 or 0/5 trials) demonstrates categorical distinctions
- Convergent evidence: Consistent patterns across multiple independent tests (T1-T10)
Statistical power is limited by small per-variant samples. Analysis emphasizes effect size magnitude, categorical differences, and cross-tier consistency patterns rather than traditional inferential statistics alone.
C.0.2 How to Read Appendix C Tables
Performance Metrics Definitions (a computational sketch follows these definitions):
Completion Rate: Proportion of trials successfully completing the assigned task
- Format: X.XX (n/N) where n = successful trials, N = total trials
- Example: 1.00 (5/5) = 100% completion; 0.60 (3/5) = 60% completion
- Interpretation: Higher values indicate better task reliability
95% Confidence Interval (CI): Statistical confidence bounds for completion rate estimates
- Calculated using Wilson score method for binomial proportions
- Format: [lower bound, upper bound]
- Example: [0.38, 0.96] for a 4/5 completion rate
- Interpretation: True completion rate likely falls within this range with 95% confidence
Token Efficiency: Resource optimization metric calculated as semantic_fidelity / (tokens × latency_ms)
- Higher values indicate better resource utilization per unit of semantic quality
- Useful for comparing resource consumption across approaches
- Not calculable for failed variants (0% completion)
Semantic Fidelity: Quality score on 0-4 scale based on content accuracy and completeness
Resource Stability: Percentage of trials staying within predefined token budget without overflow
- 100% = All trials met budget constraints
- <100% = Some trials exceeded budget (resource instability)
Average Tokens: Mean number of tokens consumed across all trials for the variant
- Lower values indicate greater efficiency (for equivalent task success)
- Standard deviation (±) shows consistency across trials
Average Latency: Mean response time from prompt submission to completion (milliseconds)
- Lower values indicate faster execution
- Standard deviation (±) shows temporal consistency
Categorical Difference: Indicates validated statistical distinction between variants
- ✓ Validated: Fisher's Exact Test confirms categorical separation OR extreme effect size with cross-tier replication
- Not specified: Insufficient evidence for categorical claim
Cross-Tier Consistency (σ): Standard deviation of completion rates across Q1/Q4/Q8 quantization tiers
- σ = 0.00 indicates perfect consistency (same performance across all tiers)
- Higher σ values indicate tier-dependent variability
- Perfect consistency (0.00) strengthens confidence in constraint-resilience
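The metric definitions above can be made concrete with a short sketch; the helper names are assumptions for illustration rather than the evaluation harness itself.

```python
# Illustrative helpers for the metric definitions above: completion-rate
# formatting, token efficiency, and cross-tier consistency (population
# standard deviation of completion rates across Q1/Q4/Q8).
import statistics

def completion_rate(successes: int, total: int) -> str:
    """Format as 'X.XX (n/N)', e.g. '1.00 (5/5)'."""
    return f"{successes / total:.2f} ({successes}/{total})"

def token_efficiency(semantic_fidelity: float, tokens: float, latency_ms: float) -> float:
    """semantic_fidelity / (tokens * latency_ms), as defined above."""
    return semantic_fidelity / (tokens * latency_ms)

def cross_tier_sigma(rates: list[float]) -> float:
    """Sigma of completion rates across tiers; 0.00 = perfect consistency."""
    return statistics.pstdev(rates)

print(completion_rate(4, 5))               # 0.80 (4/5)
print(cross_tier_sigma([1.0, 1.0, 1.0]))   # 0.0 -> perfectly tier-consistent
```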
C.0.3 Statistical Interpretation Guidelines
Understanding Small Sample Sizes: With n=5 trials per variant, traditional parametric assumptions (normality, independence, homogeneity of variance) cannot be reliably verified. However, the methodology provides robust evidence through:
Categorical Outcomes: Binary completion rates with extreme separability (100% vs 0%) provide unambiguous categorical distinctions. Fisher's Exact Test validates these separations even with small samples.
Effect Size Emphasis: Rather than relying solely on p-values, analysis emphasizes practical significance through effect size magnitude. Large effect sizes (e.g., Verbose consumed 133% more tokens than MCD: 147 vs 63) demonstrate meaningful practical differences.
Replication Evidence: Cross-tier consistency (Q1/Q4/Q8) provides three independent replications of each comparison. Perfect consistency (σ=0.00) across tiers strengthens conclusions beyond single-tier testing.
Pattern Convergence: Consistent patterns across 10 independent tests (T1-T10) and 3 domain walkthroughs (W1-W3) demonstrate framework-level validation rather than isolated test-specific results.
Confidence Interval Interpretation: 95% confidence intervals for completion rates are calculated using the Wilson score method, which provides accurate bounds even for small samples and extreme proportions (0% or 100%). Wide confidence intervals reflect estimation uncertainty but do not invalidate categorical distinctions when non-overlapping.
Example:
- Variant A: 1.00 (5/5), 95% CI [1.00, 1.00]
- Variant B: 0.00 (0/5), 95% CI [0.00, 0.00]
- Interpretation: Clear categorical separation; no overlap indicates distinct performance classes
Cross-Tier Validation Strength: Cross-tier consistency provides stronger evidence than single-tier testing:
- Perfect consistency (σ=0.00): Same performance across Q1/Q4/Q8 confirms constraint-resilience is independent of model capacity
- Variable consistency (σ>0.00): Performance depends on quantization tier, suggesting tier-specific optimization requirements
- Example: Ultra-Minimal showing 0% completion across all tiers (σ=0.00) confirms fundamental architectural insufficiency rather than model-specific limitation
Note: Cross-validation methodology and interpretation guidelines are detailed in Appendix C.0 Introduction. This section presents test-specific results only.
Table C.1.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Structured MCD | Ultra-Minimal | Verbose | Baseline | CoT | Few-Shot | System Role |
---|---|---|---|---|---|---|---|---|
Completion Rate | Q1 | 1.00 (5/5) | 0.00 (0/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
95% CI | Q1 | [1.00, 1.00] | [0.00, 0.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q1 | 63 | — | 147 | 172 | 138 | 63 | 63 |
Avg Latency (ms) | Q1 | 1,273 | — | 4,208 | 4,227 | 3,205 | 1,273 | 1,273 |
Completion Rate | Q4 | 1.00 (5/5) | 0.00 (0/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
95% CI | Q4 | [1.00, 1.00] | [0.00, 0.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q4 | 71 | — | 185 | 203 | 163 | 71 | 71 |
Avg Latency (ms) | Q4 | 2,845 | — | 9,412 | 10,287 | 7,156 | 2,845 | 2,845 |
Completion Rate | Q8 | 1.00 (5/5) | 0.00 (0/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
95% CI | Q8 | [1.00, 1.00] | [0.00, 0.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q8 | 160 | — | 250 | 277 | 160 | 160 | 160 |
Avg Latency (ms) | Q8 | 4,231 | — | 6,673 | 6,835 | 4,231 | 4,231 | 4,231 |
Note: n=5 trials per variant per tier. Ultra-Minimal showed complete failure (0%) across all tiers. Semantic fidelity: 4.0/4.0 for all successful variants.
Table C.1.2: Cross-Tier Consistency and MCD Alignment
Variant | Q1 Success | Q4 Success | Q8 Success | Cross-Tier Consistency (σ) | MCD-Aligned |
---|---|---|---|---|---|
Structured MCD | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ✅ Yes |
Ultra-Minimal | 0% (0/5) | 0% (0/5) | 0% (0/5) | Perfect failure (0.00) | ❌ No |
Verbose | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ⚠️ Partial |
Baseline (Polite) | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ❌ No |
Chain-of-Thought | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ❌ No |
Few-Shot | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ✅ Compatible |
System Role | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ✅ Compatible |
Table C.1.3: Efficiency Classification and Deployment Viability
Variant | Token Range | Efficiency Class | Resource Profile | Deployment Viability |
---|---|---|---|---|
Structured MCD | 63-160 | Optimal | Predictable, stable | ✅ High |
Ultra-Minimal | — | Failed | Context failure | ❌ Unsuitable |
Verbose | 147-250 | Over-engineered | Variable across tiers | ⚠️ Moderate |
Baseline (Polite) | 172-277 | Over-engineered | High overhead | ⚠️ Low |
Chain-of-Thought | 138-160 | Process bloat | Medium overhead | ⚠️ Moderate |
Few-Shot | 63-71 | MCD-compatible | Predictable, efficient | ✅ High |
System Role | 63-71 | MCD-compatible | Predictable, efficient | ✅ High |
Statistical Notes for T1
Categorical Outcome Analysis: Ultra-Minimal variant demonstrated 100% consistent failure across all three quantization tiers (0/5 trials each), confirming that extreme minimalism sacrifices reliability regardless of model capacity. MCD-aligned approaches (Structured MCD, Few-Shot, System Role) achieved identical performance (63-71 tokens, 100% completion) across all tiers, validating constraint-resilience through cross-tier consistency.
Efficiency Plateau Evidence: Token counts beyond 90-130 tokens (Verbose: 147-250, Baseline: 172-277) provided no measurable quality improvements—all successful variants achieved 4.0/4.0 semantic fidelity, confirming resource optimization plateau. MCD token efficiency (0.297 at Q1-tier) vs Verbose (0.114) represents 161% improvement.
Statistical Approach: With n=5 per variant, categorical differences validated through Fisher's Exact Test for binary outcomes with extreme separability (100% vs 0%). Continuous metrics analyzed using descriptive statistics with 95% CI (Wilson score method). Cross-tier replication across Q1/Q4/Q8 provides stronger evidence than single-tier testing.
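As a concrete illustration, the sketch below applies Fisher's Exact Test to T1's extreme 5/5 vs 0/5 split, assuming SciPy is available; for this split the two-sided p-value is 2/C(10,5) ≈ 0.0079.

```python
# Illustrative application of Fisher's Exact Test to the T1 categorical split:
# Structured MCD 5/5 successes vs Ultra-Minimal 0/5.
from scipy.stats import fisher_exact

table = [[5, 0],   # Structured MCD: successes, failures
         [0, 5]]   # Ultra-Minimal:  successes, failures
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"p = {p_value:.4f}")  # p ~ 0.0079 for a complete 5/5 vs 0/5 separation
```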
Note: Methodology and interpretation guidelines detailed in Appendix C.0 Introduction. Information density metric: semantic_fidelity / token_count (higher = better semantic preservation per token).
Table C.2.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Structured Symbolic | Ultra-Minimal | Verbose | Extended Natural |
---|---|---|---|---|---|
Task Completion | Q1 | 0.80 ± 0.18 (4/5) | 0.00 ± 0.00 (0/5) | 1.00 ± 0.00 (5/5) | 0.20 ± 0.18 (1/5) |
95% CI | Q1 | [0.62, 0.98] | [0.00, 0.00] | [1.00, 1.00] | [0.02, 0.38] |
Information Density | Q1 | 3.2 ± 0.4 | 0.8 ± 0.2 | 2.4 ± 0.3 | 1.2 ± 0.6 |
Avg Tokens | Q1 | 24 | 12 | 42 | 65 |
Avg Latency (ms) | Q1 | 1,106 | — | 910 | 1,739 |
Resource Stability | Q1 | 100% | 0% | 100% | 20% (overflow) |
Task Completion | Q4 | 0.80 ± 0.18 (4/5) | 0.00 ± 0.00 (0/5) | 1.00 ± 0.00 (5/5) | 0.20 ± 0.18 (1/5) |
95% CI | Q4 | [0.62, 0.98] | [0.00, 0.00] | [1.00, 1.00] | [0.02, 0.38] |
Information Density | Q4 | 3.5 ± 0.3 | 0.0 ± 0.0 | 2.6 ± 0.2 | 1.3 ± 0.5 |
Avg Tokens | Q4 | 28 | — | 48 | 72 |
Avg Latency (ms) | Q4 | 2,586 | — | 4,566 | 4,651 |
Resource Stability | Q4 | 100% | 0% | 100% | 20% (overflow) |
Task Completion | Q8 | 0.80 ± 0.18 (4/5) | 0.00 ± 0.00 (0/5) | 1.00 ± 0.00 (5/5) | 0.20 ± 0.18 (1/5) |
95% CI | Q8 | [0.62, 0.98] | [0.00, 0.00] | [1.00, 1.00] | [0.02, 0.38] |
Information Density | Q8 | 3.8 ± 0.3 | 0.0 ± 0.0 | 2.8 ± 0.2 | 1.4 ± 0.5 |
Avg Tokens | Q8 | 32 | — | 55 | 85 |
Avg Latency (ms) | Q8 | 6,957 | — | 6,674 | 6,835 |
Resource Stability | Q8 | 100% | 0% | 100% | 20% (overflow) |
Note: n=5 trials per variant per tier. Semantic fidelity: 4.0 for successful variants, 0.0 for failures. Processing consistency variance: Structured (2.6-3.2%), Extended Natural (13.9-15.4%).
Table C.2.2: Cross-Tier Consistency and Medical Reasoning Viability
Variant | Cross-Tier Completion | Info Density Range | Clinical Usability | Edge Deployment Score |
---|---|---|---|---|
Structured Symbolic | 80% (12/15 across tiers) | 3.2–3.8 | ✅ High (actionable format) | 9.5/10 |
Ultra-Minimal | 0% (0/15 across tiers) | 0.0–0.8 | ❌ Unsuitable (context failure) | 0/10 |
Verbose | 100% (15/15 across tiers) | 2.4–2.8 | ⚠️ Moderate (resource-heavy) | 6/10 |
Extended Natural | 20% (3/15 across tiers) | 1.2–1.4 | ❌ Poor (80% overflow) | 2/10 |
Edge Deployment Score: Composite of completion rate, resource stability, and constraint resilience.
Table C.2.3: Context Sufficiency Analysis
Variant | Min Viable Tokens | Token Efficiency | Semantic Loss Risk | Key Limitation |
---|---|---|---|---|
Structured Symbolic | 24 tokens (medium) | Optimal | Low | Trial variance (1/5 failure) |
Ultra-Minimal | 12 tokens (insufficient) | Theoretical only | Critical | 100% context failure |
Verbose | 42-55 tokens (high) | Suboptimal | None | 75% token overhead |
Extended Natural | 65-85 tokens (excessive) | Poor | Overflow-induced | 80% budget overflow |
Statistical Notes for T2
Information Density Validation: Structured symbolic approaches achieved 3.2–3.8 information density across all tiers, representing 33-171% efficiency advantage over verbose (2.4–2.8) and extended natural (1.2–1.4) variants. This pattern replicated consistently across Q1/Q4/Q8, providing cross-tier validation with total n=15 per variant.
Context Insufficiency Boundary: Ultra-minimal variant showed 100% failure (0/15 trials across all tiers), establishing empirical lower bound for viable symbolic formatting. The 24-token structured approach represents minimal sufficient context for 80% reliability (12/15 trials) in medical reasoning.
Resource Overflow Pattern: Extended natural exhibited systematic overflow (12/15 trials: 80% across tiers), with token budgets consumed before actionable conclusions. Processing consistency variance: structured approaches 2.6-3.2% vs extended natural 13.9-15.4% (4-5× more stable).
Medical Domain Application: In clinical decision support, structured symbolic maintained 80% diagnostic accuracy (12/15) while ensuring actionable format. Extended natural achieved only 20% actionable output (3/15) despite consuming 157-171% more tokens, demonstrating practical efficiency-effectiveness trade-offs.
Effect Size Interpretation: Information density improvements (3.2-3.8 vs 1.2-1.4) represent tier-by-tier gains of 167-171%. The 100% token overhead (24 vs 12 tokens) represents the minimum investment for the 80-percentage-point reliability improvement in medical diagnostic scenarios, confirmed through cross-tier replication.
Note: Methodology detailed in Appendix C.0. Test context: Degraded input recovery ("IDK symptoms. Plz help??!!"). Both approaches achieved 100% recovery success across all tiers.
Table C.3.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Structured Fallback (MCD) | Conversational Fallback |
---|---|---|---|
Recovery Success | Q1 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q1 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q1 | 66 | 71 |
Token Efficiency | Q1 | 1.515 | 1.408 |
Avg Latency (ms) | Q1 | 1,300 | 1,072 |
Information Gathering | Q1 | Explicit fields | Open-ended |
Recovery Success | Q4 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q4 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q4 | 202 | 208 |
Token Efficiency | Q4 | 0.495 | 0.481 |
Avg Latency (ms) | Q4 | 4,691 | 4,412 |
Recovery Success | Q8 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q8 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q8 | 136 | 208 |
Token Efficiency | Q8 | 0.735 | 0.481 |
Avg Latency (ms) | Q8 | 3,405 | 4,412 |
Note: n=5 trials per approach per tier. Token efficiency = recovery success (%) / avg tokens. Both approaches achieved 100% resource stability (zero overflow).
Table C.3.2: Cross-Tier Consistency and Resource Trade-offs
Characteristic | Structured (MCD) | Conversational | Trade-off Analysis |
---|---|---|---|
Cross-Tier Success | 100% (15/15 trials) | 100% (15/15 trials) | Equivalent functional outcome |
Token Range | 66–202 | 71–208 | 7-35% structured advantage |
Latency Range | 1,300–4,691 ms | 1,072–4,412 ms | 18% conversational advantage (Q1) |
Information Structure | Explicit fields (location, duration, severity) | Open-ended invitation | Systematic vs empathetic |
User Experience | Directive, clinical | Supportive, empathetic | Context-dependent preference |
Edge Viability | ✅ High (optimal tokens) | ⚠️ Moderate (UX priority) | Resource vs engagement trade-off |
Stateless Operation | Excellent (zero memory dependency) | Excellent (zero memory dependency) | Both MCD-compatible |
Table C.3.3: Fallback Strategy Deployment Recommendations
Deployment Context | Recommended Approach | Justification | Expected Outcome |
---|---|---|---|
Resource-constrained edge | Structured (MCD) | 7-35% token efficiency gain | Optimal computational utilization |
User experience priority | Conversational | 18% faster processing, empathetic tone | Enhanced engagement quality |
Medical/clinical systems | Structured (MCD) | Systematic field collection | Actionable diagnostic data |
General assistance | Either approach | Equivalent 100% recovery success | Context-dependent selection |
Stateless deployment | Either approach | Both achieve zero memory dependency | Framework flexibility |
Statistical Notes for T3
Equivalent Recovery Success: Both approaches achieved 100% recovery across all three quantization tiers (15/15 trials each), validating that fallback effectiveness depends on prompt design rather than specific architectural philosophy. Zero-variance consistency (σ=0 for token counts at Q1-tier) demonstrates exceptional execution stability.
Token Efficiency Trade-off: Structured fallback achieved 7-35% token reduction across tiers (Q1: 66 vs 71 tokens, Q4: 202 vs 208 tokens, Q8: 136 vs 208 tokens), confirming explicit field-based clarification provides resource advantages while maintaining equivalent functional outcomes. Q8-tier represents large practical effect size (35% reduction).
Latency Counterintuitive Finding: Conversational fallback processed faster (1,072ms vs 1,300ms on Q1-tier: 18% reduction), contrary to theoretical assumptions about structured prompt efficiency. This demonstrates the importance of empirical testing over theoretical predictions.
Stateless Validation: T3 uniquely confirms that recovery in stateless systems depends entirely on prompt design without conversational memory. Both approaches successfully elicited clarification without dialogue history access, validating robust fallback mechanisms in memory-constrained deployments.
Deployment Context Guidance: The choice between structured and conversational fallback depends on optimization priorities: resource-constrained environments benefit from structured fallback's token efficiency (7-35% reduction), while user experience prioritization may favor conversational fallback's empathetic engagement and faster processing. Both achieve equivalent functional outcomes (100% recovery) in stateless operation.
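To make the T3 contrast concrete, the sketch below paraphrases the two fallback styles; the template wording is an assumption for illustration, not the exact prompts used in the trials.

```python
# Illustrative prompt templates (paraphrased, not the exact T3 prompts)
# contrasting the two fallback styles described above.
STRUCTURED_FALLBACK = (
    "Your message is unclear. Please provide: "
    "1) symptom location, 2) duration, 3) severity (mild/moderate/severe)."
)
CONVERSATIONAL_FALLBACK = (
    "I'm sorry you're not feeling well. Could you tell me a bit more about "
    "what's bothering you so I can help?"
)

def choose_fallback(resource_constrained: bool) -> str:
    """Per Table C.3.3: structured for tight token budgets, conversational
    when user-experience quality is the priority."""
    return STRUCTURED_FALLBACK if resource_constrained else CONVERSATIONAL_FALLBACK
```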
Note: Methodology detailed in Appendix C.0. Test context: Multi-turn appointment scheduling without memory. Turn 1: "I'd like to schedule a physiotherapy appointment for knee pain." Turn 2A (Implicit): "Make it next Monday morning." Turn 2B (Structured): "Schedule a physiotherapy appointment for knee pain on Monday morning."
Table C.4.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Structured Reinjection (MCD) | Implicit Reference |
---|---|---|---|
Task Success | Q1 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q1 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q1 | 120 | 112 |
Token Overhead | Q1 | +7.1% | Baseline |
Avg Latency (ms) | Q1 | 3,798 | 3,512 |
Context Completeness | Q1 | Explicit (model-independent) | Inference-dependent |
Task Success | Q4 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q4 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q4 | 193 | 190 |
Token Overhead | Q4 | +1.6% | Baseline |
Avg Latency (ms) | Q4 | 5,059 | 4,341 |
Task Success | Q8 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q8 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q8 | 236 | 227 |
Token Overhead | Q8 | +3.9% | Baseline |
Avg Latency (ms) | Q8 | 11,166 | 10,462 |
Note: n=5 trials per approach per tier. Both achieved 100% resource stability. Token variance σ=0 (perfect consistency) across all trials.
Table C.4.2: Cross-Tier Reliability Analysis and Trade-offs
Characteristic | Structured Reinjection (MCD) | Implicit Reference | Key Distinction |
---|---|---|---|
Cross-Tier Success | 100% (15/15 trials) | 100% (15/15 trials) | Equivalent functional outcome |
Token Overhead Range | +1.6% to +7.1% | Baseline | Reliability insurance premium |
Context Approach | Explicit slot-carryover (appointment type, condition, timing) | Implicit pronoun reference ("it", "next Monday") | Systematic vs inference-based |
Reliability Model | Model-independent (each turn self-contained) | Model-dependent (requires inference capability) | Deployment guarantee difference |
Turn Interpretability | Each turn fully interpretable standalone | Turn 2 requires Turn 1 context | Self-containment vs reference |
Edge Deployment Viability | ✅ High (guaranteed preservation) | ⚠️ Variable (depends on model capability) | Predictability vs resource efficiency |
Stateless Operation | ✓ Confirmed (explicit carryover) | ✓ Confirmed (inference-based) | Both truly stateless |
Table C.4.3: Deployment Context Recommendations
Deployment Scenario | Recommended Approach | Rationale | Token Cost Trade-off |
---|---|---|---|
Variable model capacity | Structured (MCD) | Model-independent reliability | +1.6-7.1% overhead acceptable |
Resource-abundant context | Implicit Reference | Lower token cost (baseline) | Leverage inference capabilities |
Safety-critical systems | Structured (MCD) | Guaranteed context preservation | Eliminate inference uncertainty |
Multi-tier deployment | Structured (MCD) | Consistent behavior across Q1/Q4/Q8 | Predictable overhead (1.6-7.1%) |
Known robust models | Either approach | Both achieve 100% success | Context-dependent selection |
Statistical Notes for T4
Equivalent Task Success: Both approaches achieved 100% success across all tiers (15/15 trials each), validating that stateless multi-turn context management succeeds through either explicit reinjection or model inference when capabilities permit. Zero token variance (σ=0) at all tiers indicates highly deterministic, predictable behavior.
Reliability Insurance Premium: Structured reinjection required modest token overhead: +7.1% (Q1), +1.6% (Q4), +3.9% (Q8). This quantifies the cost of deployment-independent reliability—eliminating inference uncertainty and ensuring each turn is self-contained. The variable overhead (1.6-7.1%) suggests context preservation costs scale differently across model capacities.
Deployment Reliability Classification: Structured reinjection achieves model-independent reliability by making each turn fully interpretable without prior turn reference. Implicit reference creates model-dependent reliability, where success relies on the model's pronoun resolution and temporal reference inference capabilities.
Stateless Operation Validation: Both mechanisms are truly stateless but differ fundamentally: (1) Explicit slot-carryover (structured) guarantees preservation through systematic reinjection; (2) Implicit reference requires model inference to resolve "it" and "next Monday morning" connections to Turn 1 content. T4 confirms stateless systems can manage multi-turn interactions through both pathways, with reliability trade-offs quantified at 1.6-7.1% token overhead for guaranteed preservation.
Architectural Design Choice: Stateless context management presents a fundamental trade-off: Explicit reinjection (+1.6% to +7.1% tokens) provides model-independent reliability and guaranteed preservation, while implicit reference (baseline tokens) offers lower resource cost but model-dependent reliability. Selection depends on deployment constraints, model variance expectations, and whether predictability outweighs resource optimization.
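A minimal sketch of the two stateless strategies follows, mirroring the Turn 2A/2B pattern quoted in the note above; the slot dictionary and helper are hypothetical.

```python
# Illustrative sketch of the two stateless context-management strategies in T4.
def structured_reinjection(slots: dict[str, str], timing: str) -> str:
    """Explicit slot-carryover: restate appointment type and condition each
    turn so the request is self-contained (the Turn 2B pattern)."""
    return (f"Schedule a {slots['appointment']} appointment "
            f"for {slots['condition']} {timing}.")

slots = {"appointment": "physiotherapy", "condition": "knee pain"}
print(structured_reinjection(slots, "on Monday morning"))
# -> "Schedule a physiotherapy appointment for knee pain on Monday morning."
# Implicit alternative (Turn 2A): "Make it next Monday morning." Fewer tokens,
# but success depends on the model resolving what "it" refers to.
```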
Note: Methodology detailed in Appendix C.0. Test context: Spatial navigation comparing systematic anchoring (metric + cardinal) vs contextual inference (relational positioning). Both achieved 100% task success.
Table C.5.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Structured Specification (MCD) | Naturalistic Spatial |
---|---|---|---|
Task Success | Q1 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q1 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q1 | 80 | 53 |
Token Efficiency | Q1 | 0.625 | 0.943 |
Avg Latency (ms) | Q1 | 1,952 | 1,111 |
Spatial Specification | Q1 | Metric (2m) + Cardinal (north) | Relational (shadow, past it) |
Task Success | Q4 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q4 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q4 | 90 | 191 |
Token Efficiency | Q4 | 0.556 | 0.262 |
Avg Latency (ms) | Q4 | 1,466 | 4,691 |
Task Success | Q8 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q8 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q8 | 136 | 93 |
Token Efficiency | Q8 | 0.368 | 0.538 |
Avg Latency (ms) | Q8 | 3,182 | 2,298 |
Note: n=5 trials per approach per tier. Both approaches achieved 100% resource stability. Token variance within tiers: σ=0 (perfect consistency).
Table C.5.2: Cross-Tier Resource Variability and Execution Predictability
Metric | Structured (MCD) | Naturalistic | Key Distinction |
---|---|---|---|
Cross-Tier Success | 100% (15/15 trials) | 100% (15/15 trials) | Equivalent functional outcome |
Token Pattern | Q1: 80 → Q4: 90 → Q8: 136 | Q1: 53 → Q4: 191 → Q8: 93 | Predictable vs unpredictable scaling |
Q1 Token Overhead | +51% (80 vs 53) | Baseline | Structured pays efficiency cost |
Q4 Token Overhead | Baseline | +112% (191 vs 90) | Reversed pattern |
Q8 Token Overhead | +46% (136 vs 93) | Baseline | Pattern returns to Q1 direction |
Execution Pattern | Systematic anchoring | Contextual inference | Model-independent vs model-dependent |
Deployment Reliability | Predictable (metric + cardinal) | Variable (relational metaphors) | Safety-critical suitability difference |
Table C.5.3: Deployment Context Recommendations
Application Domain | Recommended Approach | Critical Requirement | Justification |
---|---|---|---|
Safety-critical robotics | Structured (mandatory) | Unambiguous spatial coordinates | Eliminates interpretation ambiguity |
Autonomous navigation | Structured (mandatory) | Deterministic action sequences | Metric + cardinal eliminates drift |
Medical procedures | Structured (mandatory) | Precise spatial positioning | Safety requires quantifiable measurements |
Resource-predictable edge | Structured (recommended) | Consistent resource patterns | Tier-independent execution stability |
General-purpose contexts | Either approach acceptable | Tolerance for spatial imprecision | 100% success for both with capable models
Cross-model portability | Structured (recommended) | Model-independent execution | No reliance on inference capabilities |
Statistical Notes for T5
Equivalent Task Success: Both approaches achieved 100% task success across all three quantization tiers (15/15 trials each), validating that spatial reasoning can succeed through either systematic anchoring or contextual inference when models possess adequate capabilities.
Tier-Dependent Token Variability: Token overhead showed unpredictable cross-tier patterns demonstrating deployment reliability differences:
- Q1-tier: Structured +51% overhead (80 vs 53 tokens)
- Q4-tier: Naturalistic +112% overhead (191 vs 90 tokens) — reversed pattern
- Q8-tier: Structured +46% overhead (136 vs 93 tokens)
This non-monotonic scaling for naturalistic approaches (53→191→93) demonstrates unpredictable resource requirements across model capacities, while structured approaches show predictable scaling (80→90→136), validating MCD's constraint-resilience principle.
Execution Predictability: Structured specification achieved deployment-independent predictability through systematic spatial anchoring (metric distance, cardinal direction, explicit sequencing), eliminating reliance on model-specific spatial inference capabilities. Naturalistic approaches created model-dependent execution where success relies on contextual inference to resolve relational metaphors ("shadow") and implied sequencing ("continue past").
Safety-Critical Implications: For applications requiring precise spatial behavior (robotics, medical, autonomous systems), structured specification provides unambiguous spatial coordinates through quantifiable measurements. The Q4-tier reversal (naturalistic consuming 112% more tokens despite Q1/Q8 efficiency) confirms that relational spatial reasoning creates unpredictable resource patterns unsuitable for deployment-critical contexts.
Key Trade-off: The tier-specific variability validates that execution predictability (structured: consistent cross-tier patterns) outweighs token minimization (naturalistic: variable efficiency) when deployment reliability is prioritized over resource optimization in individual tiers.
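To illustrate the distinction, a minimal sketch of the structured style follows, with the naturalistic counterpart shown as a comment; the exact phrasings are assumptions, not the T5 prompts.

```python
# Illustrative only: structured spatial specification vs naturalistic phrasing.
def structured_spatial(distance_m: float, heading: str, target: str) -> str:
    """Systematic anchoring: metric distance + cardinal direction, so the
    command is interpretable without model-side spatial inference."""
    return f"Move {distance_m}m {heading}, then stop at the {target}."

print(structured_spatial(2, "north", "charging station"))
# A naturalistic equivalent relies on relational inference instead, e.g.:
# "Head toward the shadow and continue past it to the charger."
```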
Note: Methodology detailed in Appendix C.0. Task: "Summarize causes of Type 2 diabetes." All variants achieved 100% task completion across all tiers (15/15 trials each). Primary differentiator: computational efficiency. Resource waste = (tokens_used − hybrid_baseline) / hybrid_baseline × 100%.
Table C.6.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Structured MCD | Verbose | CoT | Few-Shot | Hybrid |
---|---|---|---|---|---|---|
Task Completion | Q1 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
Avg Tokens | Q1 | 131 | 173 | 171 | 114 | 94 |
Resource Efficiency | Q1 | 0.76 ± 0.04 | 0.58 ± 0.08 | 0.58 ± 0.08 | 0.88 ± 0.05 | 1.06 ± 0.03 |
Resource Waste | Q1 | 39% | 84% | 82% | 21% | 0% (baseline) |
Avg Latency (ms) | Q1 | 4,285 | 4,213 | 4,216 | 1,901 | 1,965 |
Task Completion | Q4 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
Avg Tokens | Q4 | 196 | 241 | 239 | 117 | 104 |
Resource Efficiency | Q4 | 0.51 ± 0.03 | 0.41 ± 0.05 | 0.42 ± 0.06 | 0.85 ± 0.04 | 0.96 ± 0.02 |
Resource Waste | Q4 | 88% | 132% | 130% | 13% | 0% (baseline) |
Avg Latency (ms) | Q4 | 4,837 | 4,502 | 5,634 | 860 | 1,514 |
Task Completion | Q8 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
Avg Tokens | Q8 | 245 | 289 | 287 | 129 | 107 |
Resource Efficiency | Q8 | 0.41 ± 0.03 | 0.35 ± 0.05 | 0.35 ± 0.06 | 0.77 ± 0.04 | 0.93 ± 0.02 |
Resource Waste | Q8 | 127% | 169% | 167% | 20% | 0% (baseline) |
Avg Latency (ms) | Q8 | 6,850 | 7,245 | 7,198 | 2,980 | 2,545 |
Note: n=5 trials per variant per tier. All variants achieved 3.5/4.0 semantic fidelity. Resource efficiency = task completion (%) / token count. Effect sizes: Hybrid vs CoT/Verbose (Cohen's d > 2.0, very large).
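Both quantities can be reproduced directly from the table values; a short sketch, assuming the formulas as stated in the notes:

```python
# Sketch of the two T6 efficiency quantities, using the hybrid variant's
# token count as the waste baseline (per the note before Table C.6.1).
def resource_waste(tokens_used: int, hybrid_baseline: int) -> float:
    """Percent token overhead relative to the hybrid baseline."""
    return (tokens_used - hybrid_baseline) / hybrid_baseline * 100

def resource_efficiency(completion_pct: float, tokens: int) -> float:
    """Task completion (%) per token, matching the table's scaling."""
    return completion_pct / tokens

print(round(resource_waste(173, 94)))           # Verbose Q1: ~84%
print(round(resource_efficiency(100, 94), 2))   # Hybrid Q1: ~1.06
```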
Table C.6.2: Cross-Tier Efficiency Classification and Waste Scaling Patterns
Variant | Efficiency Category | Q1 Waste | Q4 Waste | Q8 Waste | Waste Trend | Cross-Tier Consistency |
---|---|---|---|---|---|---|
Hybrid | Superior Optimization | 0% | 0% | 0% | Flat (0%) | 100% stable |
Few-Shot | MCD-Compatible | 21% | 13% | 20% | Flat (18% avg) | 100% stable |
Structured MCD | Moderate Bloat | 39% | 88% | 127% | Increasing (3.3×) | 100% stable |
Chain-of-Thought | Process Bloat | 82% | 130% | 167% | Increasing (2.0×) | 100% stable |
Verbose | Over-Engineered | 84% | 132% | 169% | Increasing (2.0×) | 100% stable |
Key Pattern: MCD-compatible approaches (Hybrid, Few-Shot) maintain ≤21% waste regardless of tier. Non-MCD approaches (CoT, Verbose, Structured MCD) show 2.0-3.3× waste increase Q1→Q8, demonstrating computational debt compounding with model capacity. Perfect ranking consistency across all tiers (100%) validates categorical efficiency differences.
Table C.6.3: Resource Optimization Plateau Evidence
Finding | Evidence | Implication |
---|---|---|
Universal Task Success | 100% completion across all 5 variants × 3 tiers × 5 trials = 75/75 trials | Success ≠ efficiency under constraints
Capability Plateau | All variants achieved 3.5/4.0 semantic fidelity regardless of token count (94-289 tokens) | Additional tokens beyond 90-130 provide no quality benefit |
Structural vs Process Distinction | Few-Shot (structural): 18% avg waste; CoT (process): 126% avg waste; Effect size d=2.4 | Structural guidance scales efficiently; process guidance creates overhead |
Hybrid Superiority | Consistent optimal performance: Q1 (1.06), Q4 (0.96), Q8 (0.93); 28-39% efficiency gain | Combining constraints + examples achieves optimal resource utilization |
Waste Compounding | CoT/Verbose waste increases 2.0× from Q1→Q8 while Few-Shot remains stable | Process approaches scale poorly with model capacity |
Statistical Notes for T6
Universal Task Success with Variable Efficiency: All five strategies achieved 100% completion (75/75 trials total), demonstrating that success does not equal efficiency. The key differentiator was computational resource utilization (0-169% waste range), validating focus on efficiency metrics as primary outcome.
Resource Optimization Plateau: Consistent plateau around 90-130 tokens across approaches validated independently in all three tiers. Beyond this threshold, additional tokens provided no semantic quality improvements (all variants: 3.5 fidelity), confirming resource optimization ceiling existence.
Structural vs Process Guidance Distinction: Few-shot examples (structural guidance) achieved 18% average waste (21%→13%→20% across tiers) while Chain-of-Thought (process guidance) demonstrated 126% average waste (82%→130%→167%), representing very large effect size (Cohen's d = 2.4). This validates fundamental distinction between constraint-compatible structural templates and resource-intensive process reasoning.
Cross-Tier Validation Strength: Perfect consistency of efficiency rankings across three independent quantization tiers (Q1/Q4/Q8) provides robust evidence for categorical efficiency differences. No variant changed its efficiency category across tiers, demonstrating 100% classification stability and strengthening findings beyond per-tier sample limitations (n=5 per tier, n=15 total per variant).
Design Implication: Resource-constrained deployments should prioritize structural guidance (few-shot examples, hybrid approaches) over process guidance (chain-of-thought reasoning) when efficiency is critical, as structural approaches maintain ≤21% resource waste across varying model capacities while process approaches demonstrate 2.0-3.3× waste compounding.
Note: Methodology detailed in Appendix C.0. Navigation task with escalating constraint complexity: Baseline → Simple (+ wet floors) → Complex (+ detours, red corridors). All variants achieved 100% completion; resource efficiency is the critical differentiator.
Table C.7.1: Combined Performance Matrix Across All Quantization Tiers
Variant | Tier | Baseline Tokens | Simple Tokens | Complex Tokens | Completion Rate | Avg Latency (ms) | Resource Efficiency |
---|---|---|---|---|---|---|---|
MCD Baseline | Q1 | 87 | 67 | 70 | 5/5 (100%) | 1,400 | 1.149–1.493 |
MCD Baseline | Q4 | 118 | 121 | 130 | 5/5 (100%) | 2,613 | 0.769–0.847 |
MCD Baseline | Q8 | 123 | 133 | 140 | 5/5 (100%) | 3,416 | 0.714–0.813 |
CoT Planning | Q1 | 152 | 152 | 152 | 5/5 (100%) | 3,422 | 0.658 |
CoT Planning | Q4 | 188 | 188 | 188 | 5/5 (100%) | 2,624 | 0.381 |
CoT Planning | Q8 | 233 | 233 | 233 | 5/5 (100%) | 4,495 | 0.343 |
Few-Shot | Q1 | 143 | 143 | 143 | 5/5 (100%) | 2,663 | 0.699 |
Few-Shot | Q4 | 188 | 188 | 188 | 5/5 (100%) | 2,624 | 0.381 |
Few-Shot | Q8 | 128 | 128 | 128 | 5/5 (100%) | 1,620 | 1.062 |
System Role | Q1 | 70 | 70 | 70 | 5/5 (100%) | 687 | 1.429 |
System Role | Q4 | 157 | 157 | 157 | 5/5 (100%) | 2,638 | 0.610 |
System Role | Q8 | 162 | 162 | 162 | 5/5 (100%) | 3,422 | 0.617 |
Verbose | Q1 | 135 | 135 | 135 | 5/5 (100%) | 3,205 | 0.741 |
Verbose | Q4 | 173 | 173 | 173 | 5/5 (100%) | 4,213 | 0.487 |
Verbose | Q8 | 219 | 219 | 219 | 5/5 (100%) | 5,666 | 0.386 |
Note: n=5 trials per variant per complexity level per tier (45 total observations per variant). Resource efficiency = 1/(tokens × latency/1000).
Table C.7.2: Cross-Tier Consistency and Resource Overhead Analysis
Variant | Token Scaling Pattern | Cross-Tier Success | Avg Resource Cost Ratio | Deployment Viability |
---|---|---|---|---|
MCD Baseline | Adaptive (67→87 tokens) | 100% (45/45 trials) | 1.0× (baseline) | ✅ High (optimal scaling) |
CoT Planning | Constant (152–233 tokens) | 100% (45/45 trials) | 2.2× overhead | ❌ Low (invariant cost) |
Few-Shot | Consistent (128–188 tokens) | 100% (45/45 trials) | 1.3× | ✅ Moderate (stable) |
System Role | Minimal (70–162 tokens) | 100% (45/45 trials) | 0.9× | ✅ High (efficient) |
Verbose | High baseline (135–219 tokens) | 100% (45/45 trials) | 1.5× | ⚠️ Moderate (over-engineered) |
Resource Cost Ratio: Calculated relative to the MCD baseline across all tiers and complexity levels. CoT's 2.2× is the weighted-average token overhead; combining the token ratio (1.75×) with the latency ratio (1.38×) yields a 2.41× combined resource cost.
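The composition rule can be checked with a one-line computation, assuming the two ratios multiply as the note states:

```python
# Worked example of the combined resource cost for CoT vs the MCD baseline.
token_ratio = 1.75     # CoT tokens / MCD tokens (weighted across tiers)
latency_ratio = 1.38   # CoT latency / MCD latency (weighted across tiers)
combined_cost = token_ratio * latency_ratio
print(f"{combined_cost:.2f}x")  # ~2.41x combined resource cost
```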
Table C.7.3: Constraint Scaling Behavior and Edge Deployment Recommendations
Scaling Pattern | Token Range | Efficiency Class | Key Characteristic | Recommended For |
---|---|---|---|---|
Adaptive (MCD) | 67–140 | Optimal | Adapts to task demands (Q1: baseline 87 → simple 67 → complex 70) | Edge devices, mobile platforms
Constant (CoT) | 152–233 | Poor | Invariant overhead regardless of task | ❌ Not constraint-suitable |
Consistent (Few-Shot) | 128–188 | High | Stable structure-guided approach | General-purpose deployment |
Minimal (System Role) | 70–162 | Optimal | Low baseline with moderate scaling | Resource-critical applications |
High Baseline (Verbose) | 135–219 | Poor | Excessive initial cost | ❌ Avoid for edge deployment |
Statistical Notes for T7
Equivalent Task Success with Divergent Resource Costs: All five variants achieved 100% completion (45/45 trials per variant: 5 trials × 3 tiers × 3 complexity levels), validating that task success is independent of prompting approach. Resource efficiency becomes the sole differentiator, with dramatic variations (0.343 to 1.493 efficiency scores).
CoT Resource Overhead Quantification: Chain-of-thought consumed 1.75-2.4× more tokens across tiers with weighted average 2.2× computational cost for identical outcomes. Combined resource cost (tokens × latency): CoT vs MCD baseline = 2.41× overhead, representing exceptionally large effect size (Cohen's d > 2.0).
Constraint Scaling Validation: MCD demonstrated adaptive scaling (baseline 87 → simple 67 → complex 70 tokens) while CoT maintained constant 152-233 token overhead regardless of task complexity. This invariance demonstrates fundamental architectural mismatch with constraint-first design principles.
Multi-Dimensional Validation: Perfect reliability across 45 observations per variant (completion rate σ=0.00). Resource efficiency patterns remained consistent across all conditions with MCD variants achieving 1.5-2.5× superior efficiency. Cross-tier and cross-complexity replication strengthens confidence despite small per-condition samples.
Deployment Implications: CoT's widespread adoption reflects optimization for unconstrained environments. T7 demonstrates that resource-bounded contexts require fundamentally different strategies. The constant 152-233 token CoT overhead vs MCD's adaptive 67-140 token range represents design paradigm mismatch for edge deployment, with 2.2-2.4× efficiency penalty translating to tangible costs (battery life, latency, throughput).
Note: Methodology detailed in Appendix C.0. Test context: WebAssembly (WebLLM) offline execution, "Summarize solar power benefits in ≤50 tokens." All variants achieved 100% completion (30/30 trials across tiers)—focus on resource efficiency differentiation.
Table C.8.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Structured | Verbose | CoT | Few-Shot | System Role | Hybrid |
---|---|---|---|---|---|---|---|
Completion | Q1 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
Avg Tokens | Q1 | 131 | 156 | 170 | 97 | 144 | 68 |
Avg Latency (ms) | Q1 | 4,273 | 4,383 | 4,345 | 1,757 | 4,184 | 1,242 |
Memory Δ (MB) | Q1 | +18 | +6 | -2 | -9 | -4 | 0 |
Completion | Q4 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
Avg Tokens | Q4 | 191 | 221 | 233 | 221 | 209 | 205 |
Avg Latency (ms) | Q4 | 4,477 | 4,548 | 4,495 | 5,030 | 4,587 | 4,346 |
Memory Δ (MB) | Q4 | +6 | 0 | -2 | -1 | -2 | +8 |
Completion | Q8 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
Avg Tokens | Q8 | 201 | 211 | 240 | 211 | 208 | 116 |
Avg Latency (ms) | Q8 | 5,043 | 4,940 | 5,293 | 5,093 | 4,980 | 2,445 |
Memory Δ (MB) | Q8 | +2 | -6 | +5 | +2 | -1 | +10 |
Note: n=5 trials per variant per tier. 95% CI: [1.00, 1.00] for all completion rates. Memory stability: All variants remained within ±20MB (WebAssembly stable range).
Table C.8.2: Cross-Tier Resource Efficiency and Deployment Classification
Variant | Token Range (Q1/Q4/Q8) | Latency Profile | Deployment Class | Edge Viability | Resource Efficiency Score |
---|---|---|---|---|---|
Hybrid | 68 / 205 / 116 | Low (1,242–4,346ms) | Edge-superior | ✅ Optimal | 9.5/10 |
Few-Shot | 97 / 221 / 211 | Moderate (1,757–5,093ms) | Edge-compatible | ✅ High | 9.0/10 |
Structured | 131 / 191 / 201 | Moderate (4,273–5,043ms) | Edge-optimized | ✅ High | 8.5/10 |
System Role | 144 / 209 / 208 | Moderate (4,184–4,980ms) | Edge-compatible | ✅ High | 8.0/10 |
Verbose | 156 / 221 / 211 | High (4,383–4,940ms) | Edge-challenging | ⚠️ Moderate | 6.0/10 |
CoT | 170 / 233 / 240 | High (4,345–5,293ms) | Resource-intensive | ❌ Avoid | 2.5/10 |
Resource Efficiency Score: Composite of token efficiency (40%), latency (30%), memory stability (20%), browser compatibility (10%). Scale: 0-10.
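A sketch of the weighted composite using the stated weights follows; the component sub-scores passed in are hypothetical, chosen only to show the mechanics.

```python
# Illustrative weighted composite (weights from the note above; the component
# sub-scores are hypothetical demonstration values on a 0-10 scale).
WEIGHTS = {"token_efficiency": 0.40, "latency": 0.30,
           "memory_stability": 0.20, "browser_compat": 0.10}

def resource_efficiency_score(components: dict[str, float]) -> float:
    """Weighted sum of 0-10 component scores -> 0-10 composite."""
    return sum(WEIGHTS[k] * v for k, v in components.items())

# Hypothetical component scores for a Hybrid-like variant:
print(resource_efficiency_score({"token_efficiency": 10, "latency": 9.5,
                                 "memory_stability": 9, "browser_compat": 9}))
# -> 9.55
```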
Table C.8.3: Resource Efficiency Trade-off Analysis
Comparison | Token Overhead | Latency Impact | Deployment Recommendation |
---|---|---|---|
Hybrid vs CoT (Q1) | 2.5× fewer tokens (68 vs 170) | 3.5× faster (1,242ms vs 4,345ms) | ✅ Hybrid optimal for edge |
Few-Shot vs CoT (Q1) | 1.8× fewer tokens (97 vs 170) | 2.5× faster (1,757ms vs 4,345ms) | ✅ Few-Shot edge-compatible |
Hybrid vs CoT (Q8) | 2.1× fewer tokens (116 vs 240) | 2.2× faster (2,445ms vs 5,293ms) | ✅ Hybrid maintains advantage |
Structured vs Verbose (Q1) | 1.2× fewer tokens (131 vs 156) | Equivalent latency | ⚠️ Marginal efficiency gain |
Cross-Tier Consistency | All variants: 100% completion | Zero failures (30/30 per approach) | ✅ Functional equivalence validated |
Statistical Notes for T8
Universal Task Success: All six approaches achieved 100% completion (30/30 trials across Q1/Q4/Q8), validating functional equivalence. Focus shifts to deployment resource efficiency rather than capability differences.
Token Efficiency Range: Dramatic resource variations despite identical outcomes: Q1-tier: 68 tokens (Hybrid) to 170 tokens (CoT) = 2.5× difference; Q8-tier: 116 tokens (Hybrid) to 240 tokens (CoT) = 2.1× difference. This confirms Chain-of-Thought creates substantial deployment overhead without functional benefits.
Latency Performance: Hybrid (1,242ms) and Few-Shot (1,757ms) demonstrated 2.5-3.5× faster execution vs CoT (4,345ms) at Q1-tier, validating that structured guidance optimizes browser execution while maintaining equivalent outcomes.
Memory Stability: All variants maintained stable profiles (±20MB range), confirming WebAssembly memory management handled all approaches without crashes or browser instability. Zero failures across 180 total trials (6 variants × 3 tiers × 10 measurements).
Deployment Resource Screening: Results validate that constraint-resilient frameworks must distinguish edge-efficient enhancements (few-shot patterns, role-based framing) from resource-intensive techniques (process-heavy reasoning) during design phase. The 2.5× token cost and 3.5× latency differences represent large practical effect sizes for deployment efficiency.
Cross-Tier Replication: Efficiency patterns held consistent across all quantization levels, with Hybrid maintaining optimal performance (Q1: 68 tokens, Q4: 205 tokens, Q8: 116 tokens) compared to CoT resource intensity (Q1: 170, Q4: 233, Q8: 240 tokens).
Note: Methodology detailed in Appendix C.0. Test context: Underspecified input recovery ("Schedule a cardiology checkup."). Both approaches achieved 100% recovery success; analysis focuses on resource efficiency.
Table C.9.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Constraint-Resilient Loop | Resource-Intensive Chain |
---|---|---|---|
Recovery Success | Q1 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q1 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q1 | 73 | 129 |
Token Efficiency | Q1 | 1.370 | 0.775 |
Avg Latency (ms) | Q1 | 1,929 | 4,071 |
Token Variance | Q1 | σ = 0 (0%) | σ = 12% |
Fallback Depth | Q1 | 2 steps (bounded) | 3+ steps (recursive) |
Recovery Success | Q4 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q4 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q4 | 106 | 188 |
Token Efficiency | Q4 | 0.943 | 0.532 |
Avg Latency (ms) | Q4 | 5,148† | 4,371 |
Token Variance | Q4 | σ = 0 (0%) | σ = 9% |
Recovery Success | Q8 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q8 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q8 | 149 | 230 |
Token Efficiency | Q8 | 0.671 | 0.435 |
Avg Latency (ms) | Q8 | 4,443 | 6,885 |
Token Variance | Q8 | σ = 0 (0%) | σ = 8% |
Note: n=5 trials per approach per tier. †Q4-tier latency anomaly (one outlier at 45s) for constraint-resilient approach. Token efficiency = recovery_success / avg_tokens.
Table C.9.2: Cross-Tier Consistency and Resource Optimization
Characteristic | Constraint-Resilient Loop | Resource-Intensive Chain | Efficiency Advantage |
---|---|---|---|
Cross-Tier Recovery | 100% (15/15 trials) | 100% (15/15 trials) | Equivalent functional outcome |
Token Range | 73–149 | 129–230 | 35-44% reduction |
Clarification Strategy | Slot-specific targeting (date, time) | Open-ended recursive ("What else?") | Explicit vs exploratory |
Recovery Depth | Bounded at 2 steps (deterministic) | Recursive 3+ steps (variable) | Predictable resource ceiling |
Token Consistency | Zero variance (σ=0 at Q1) | 8-12% variance across tiers | 100% vs 88-92% predictability |
Edge Deployment | ✅ High (predictable budget) | ⚠️ Moderate (variable demand) | Resource planning advantage |
Recovery Distribution | 60% Step 2, 40% Step 1 (Q1-tier) | 100% full recursive chain | Faster convergence |
Table C.9.3: Fallback Design Comparison and Deployment Guidance
Design Element | Constraint-Resilient | Resource-Intensive | Deployment Recommendation |
---|---|---|---|
Clarification Example | "Please provide date and time for cardiology appointment" | "What else do I need to know? Be specific." | Explicit > open-ended for efficiency |
Information Targeting | Explicit slots (date, time, type) | Open-ended broad questioning | Slot-specific converges 35-44% faster |
Recovery Predictability | Deterministic 2-step maximum | Variable 3+ step recursion | Bounded depth for resource planning |
Resource Efficiency | 43% fewer tokens (Q1), 44% (Q4), 35% (Q8) | Baseline comparison | Large practical effect size |
Token Consistency | Zero variance (σ=0) | High variance (8-12%) | Predictable vs unpredictable cost |
Best Use Case | Resource-constrained edge deployment | Exploratory conversational systems | Context-dependent selection |
Statistical Notes for T9
Equivalent Recovery with Substantial Efficiency Gap: Both approaches achieved 100% recovery success across all three tiers (15/15 trials each), validating equivalent functional outcomes. Token efficiency differed substantially: 43% reduction on Q1 (73 vs 129 tokens), 44% on Q4 (106 vs 188), and 35% on Q8 (149 vs 230). This consistent cross-tier advantage represents large practical effect size (Cohen's d > 1.5).
Bounded Depth Advantage: Constraint-resilient loops bounded fallback at 2 steps maximum with 60% Q1-tier recovery by Step 2 and 40% by Step 1, while resource-intensive chains required 3+ recursive steps in all trials. This deterministic depth ceiling provides predictable resource budgets essential for edge deployment planning.
Zero Token Variance: Constraint-resilient loops showed zero token variance (σ=0) across all Q1-tier trials and maintained ≤1% variance on Q4/Q8, demonstrating highly consistent slot-specific clarification behavior. Resource-intensive chains showed 8-12% variance due to variable recursive questioning depth, creating unpredictable resource demands unsuitable for constraint-bounded environments.
Slot-Specific Convergence: Explicit slot targeting ("Please provide date and time") proved consistently more efficient than open-ended questioning ("What else do I need to know?"). Slot-specific approaches converge faster by explicitly naming missing fields, eliminating iterative discovery processes inherent in recursive clarification chains.
Design Principle Validation: Bounding recovery depth at 2 steps with slot-specific clarification provides optimal balance between recovery reliability (100%) and computational efficiency (35-44% reduction). Open-ended recursive chains waste tokens on repeated broad requests without improving recovery success, creating unnecessary overhead in resource-constrained scenarios. Cross-tier consistency validates this design principle scales effectively across model capacity variations.
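A minimal sketch of the bounded, slot-specific recovery loop described above, assuming a two-slot appointment task; the control flow and helper names are illustrative, not the T9 harness.

```python
# Sketch (assumed control flow) of the bounded, slot-specific clarification
# loop that T9 contrasts with open-ended recursive questioning.
MAX_FALLBACK_STEPS = 2  # deterministic depth ceiling from Table C.9.2
REQUIRED_SLOTS = ("date", "time")

def recover(filled: dict[str, str], ask) -> dict[str, str] | None:
    """Ask only for the missing slots; stop at the bounded depth."""
    for _ in range(MAX_FALLBACK_STEPS):
        missing = [s for s in REQUIRED_SLOTS if s not in filled]
        if not missing:
            return filled  # recovery complete
        # Slot-specific clarification, e.g. "Please provide date and time ..."
        reply = ask(f"Please provide {' and '.join(missing)} "
                    f"for the cardiology appointment.")
        filled.update(reply)
    return None  # bounded: never recurses beyond 2 steps

# Usage with a stubbed asker that supplies both slots on the first prompt:
print(recover({}, lambda _q: {"date": "2024-05-06", "time": "09:00"}))
```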
Note: Methodology detailed in Appendix C.0. Task: "Summarize pancreas functions in ≤60 tokens." All tiers achieved 100% completion; test validates optimal resource sufficiency principle.
Table C.10.1: Comprehensive Quantization Tier Performance Matrix
Metric | Q1 (1-bit) | Q4 (4-bit) | Q8 (8-bit) |
---|---|---|---|
Task Completion | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | 131 | 114 (13% ↓) | 94 (28% ↓) |
Avg Latency (ms) | 4,285 | 1,901 (56% faster) | 1,965 (54% faster) |
Computational Overhead | Minimal (1-bit ops) | Low (4-bit ops) | High (8-bit ops, 8× per operation) |
Resource Optimization | ✅ Optimal | ✅ High (balanced) | ❌ Over-provisioned |
Constraint Compliant | ✅ Yes | ✅ Yes | ⚠️ No (unnecessary overhead) |
Adaptive Optimization | Q1→Q4 (1/5 trials) | None | None |
Edge Deployment | ✅ Maximum efficiency | ✅ High viability | ⚠️ Suboptimal (precision waste) |
Note: n=5 trials per tier. Zero variance in token counts (σ=0) indicates deterministic generation. Latency variance <20ms across all tiers.
Table C.10.2: Resource Efficiency Analysis and Deployment Verdict
Tier | Token Efficiency | Computational Overhead | Holistic Assessment | Deployment Verdict |
---|---|---|---|---|
Q1 (1-bit) | Lowest token efficiency (131 tokens) | Minimal (1-bit precision per operation) | Optimal resource sufficiency | ✅ Recommended (maximum edge efficiency) |
Q4 (4-bit) | Medium token efficiency (114 tokens, 13% reduction) | Low (4× overhead vs Q1) | Balanced efficiency-performance | ✅ Recommended (optimal for 80% tasks) |
Q8 (8-bit) | Highest token efficiency (94 tokens, 28% reduction) | High (8× overhead vs Q1) | Over-provisioned computational cost | ❌ Not recommended (token gains negated by 8× computational overhead) |
Critical Finding: Q8's 28% token reduction represents resource over-provisioning when Q1 achieves identical 100% task success. The 8× computational overhead per operation exceeds efficiency benefits of lower token count, violating minimal viable resource allocation principle.
Table C.10.3: Adaptive Optimization Logic and Cross-Tier Patterns
Optimization Pattern | Frequency | Trigger Condition | Constraint-Resilient Logic |
---|---|---|---|
Q1 maintained | 4/5 trials (80%) | Optimal baseline sufficiency | Default tier for edge deployment |
Q1→Q4 upgrade | 1/5 trials (20%) | Computational efficiency enhancement detected | Justified by 13% token reduction without violating overhead threshold |
Q1→Q8 upgrade | 0/5 trials (0%) | Never triggered | Prohibited: 8× computational overhead violates constraint-resilient principles despite 28% token gain |
Q4 maintained | 5/5 trials (100%) | Balanced efficiency achieved | Optimal for most constraint-bounded tasks |
Adaptive Philosophy: Tier upgrades justified only when computational efficiency enhancements occur without violating constraint-resilient principles. Q8's superior token count (94 vs 131) is counterproductive when 8× computational overhead per operation is considered.
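A sketch of the tier-upgrade rule this table implies, assuming compute overhead proportional to bit width and an illustrative 4× overhead threshold; both assumptions are drawn from the table's logic, not measured cost functions.

```python
# Sketch of the adaptive tier-upgrade rule described above. The overhead model
# (compute cost proportional to bit width) and the 4x threshold are assumptions
# inferred from the table's logic, not a measured cost function.
TIERS = {"Q1": 1, "Q4": 4, "Q8": 8}  # relative compute cost per operation

def should_upgrade(current: str, candidate: str,
                   token_saving_pct: float, max_overhead_ratio: float = 4.0) -> bool:
    """Upgrade only if the compute-overhead multiplier stays within the
    threshold; Q1->Q8 (8x) is always rejected under this rule."""
    overhead = TIERS[candidate] / TIERS[current]
    return token_saving_pct > 0 and overhead <= max_overhead_ratio

print(should_upgrade("Q1", "Q4", token_saving_pct=13))  # True  (the 1/5 case)
print(should_upgrade("Q1", "Q8", token_saving_pct=28))  # False (8x overhead)
```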
Statistical Notes for T10
Equivalent Task Success: All three tiers achieved 100% completion (15/15 total trials), providing categorical evidence that quantization tier selection does not compromise functional effectiveness. This validates ultra-low-bit quantization (Q1) maintains task capability without sacrificing reliability.
Counterintuitive Token Efficiency Paradox: Q8 achieved lowest token usage (94 tokens, 28% reduction from Q1) but represents resource over-provisioning because 8-bit precision operations consume 8× computational resources per operation compared to 1-bit. This demonstrates that token count alone is insufficient for resource efficiency assessment—computational overhead per operation must be evaluated.
Computational Overhead Analysis: Q1 (1-bit) requires minimal computational resources per operation; Q4 (4-bit) requires 4× computational resources vs Q1; Q8 (8-bit) requires 8× computational resources vs Q1. Despite Q8's 28% token advantage, the 8× overhead results in net over-provisioning when Q1 achieves identical task success.
Adaptive Optimization Validation: Q1→Q4 triggered in 1/5 trials (20%) when efficiency enhancement justified tier upgrade. Critically, Q1→Q8 never triggered (0/5 trials), validating that constraint-resilient logic prohibits unnecessary precision increases when lower tiers achieve equivalent outcomes.
Latency Patterns: Q4 achieved fastest processing (1,901ms) despite mid-tier precision, representing optimal balance between quantization compression and computational efficiency. Q8's slightly slower latency vs Q4 (1,965ms vs 1,901ms, 3% slower) may indicate memory bandwidth saturation with larger parameters.
Cross-Tier Consistency: Perfect token consistency (σ=0) and minimal latency variance (<20ms) demonstrate deterministic performance suitable for production deployment. The combination of 100% task completion across 15 trials and zero-variance token generation provides robust evidence despite small per-tier sample sizes.