Designing Lightweight AI Agents for Edge Deployment
A Minimal Capability Framework with Insights from Literature Synthesis
Appendix C: Cross-Validation Performance Matrices and Statistical Analysis
This appendix provides comprehensive performance matrices, statistical validation, and trial-by-trial evidence supporting the MCD framework evaluation presented in Chapter 6 (Tests T1-T10). All data presented follow the validation methodology established in Section 3.3 (Simulation Validation Strategy) and Section 3.4 (Walkthrough Design Method).
C.0.1 Repeated Trials Methodology
Experimental Design:
- Sample size: n=5 independent measurements per variant approach
- Total validation measurements: Approximately 1,050 measurements across 10 tests (T1-T10: 7 variants × 5 trials × 3 tiers per test), plus 75 measurements across 3 walkthroughs (W1-W3: 5 variants × 5 trials per walkthrough)
- Quantization tiers tested: Q1-tier (Qwen2-0.5B), Q4-tier (TinyLlama-1.1B), Q8-tier (Llama-3.2-1B)
- Execution environment: Browser-based WebAssembly (WebLLM) offline execution
- Measurement precision: performance.now() API for sub-millisecond timing precision
Statistical Approach:
- Binary outcomes (completion rates): Fisher's Exact Test for categorical completion rates where extreme separability exists (e.g., 100% vs 0%)
- Continuous metrics (tokens, latency): Welch's t-test for comparing means between variants; descriptive statistics (mean ± standard deviation) reported for all metrics
- Confidence intervals: 95% CI calculated using Wilson score method for binomial proportions
- Effect size measurement: Cohen's d for continuous variables where applicable; Cohen's h for binary outcome comparisons (see the computational sketch after this list)
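For reference, the sketch below shows how the Wilson score interval and Cohen's h can be computed as standardly defined; the function names are illustrative and are not the thesis's original analysis code.

```python
# Illustrative sketch (not the original analysis code): Wilson score interval
# and Cohen's h for the binary completion-rate comparisons described above.
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (z = 1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for two proportions via the arcsine transformation."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Example: the extreme 5/5 vs 0/5 completion-rate split.
print(wilson_ci(5, 5), wilson_ci(0, 5))
print(cohens_h(1.0, 0.0))  # ~3.14, the maximal separation between proportions
```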
Sample Size Acknowledgment: While n=5 per variant represents a small sample size that limits traditional parametric inference, the methodology provides robust qualitative evidence through:
- Extreme effect sizes: Binary outcomes with complete categorical separation (100% vs 0% completion) provide clear differentiation
- Cross-tier replication: Patterns replicated across three independent quantization tiers (Q1/Q4/Q8) strengthen reliability beyond single-tier testing
- Zero-variance consistency: Perfect within-variant consistency (e.g., 5/5 or 0/5 trials) demonstrates categorical distinctions
- Convergent evidence: Consistent patterns across multiple independent tests (T1-T10)
Statistical power is limited by small per-variant samples. Analysis emphasizes effect size magnitude, categorical differences, and cross-tier consistency patterns rather than traditional inferential statistics alone.
C.0.2 How to Read Appendix C Tables
Performance Metrics Definitions (a computational sketch follows these definitions):
Completion Rate: Proportion of trials successfully completing the assigned task
- Format: X.XX (n/N) where n = successful trials, N = total trials
- Example: 1.00 (5/5) = 100% completion; 0.60 (3/5) = 60% completion
- Interpretation: Higher values indicate better task reliability
95% Confidence Interval (CI): Statistical confidence bounds for completion rate estimates
- Calculated using Wilson score method for binomial proportions
- Format: [lower bound, upper bound]
- Example: [0.38, 0.96] for a 4/5 completion rate
- Interpretation: True completion rate likely falls within this range with 95% confidence
Token Efficiency: Resource optimization metric calculated as semantic_fidelity / (tokens × latency_ms)
- Higher values indicate better resource utilization per unit of semantic quality
- Useful for comparing resource consumption across approaches
- Not calculable for failed variants (0% completion)
Semantic Fidelity: Quality score on 0-4 scale based on content accuracy and completeness
Resource Stability: Percentage of trials staying within predefined token budget without overflow
- 100% = All trials met budget constraints
- <100% = Some trials exceeded budget (resource instability)
Average Tokens: Mean number of tokens consumed across all trials for the variant
- Lower values indicate greater efficiency (for equivalent task success)
- Standard deviation (±) shows consistency across trials
Average Latency: Mean response time from prompt submission to completion (milliseconds)
- Lower values indicate faster execution
- Standard deviation (±) shows temporal consistency
Categorical Difference: Indicates validated statistical distinction between variants
- ✓ Validated: Fisher's Exact Test confirms categorical separation OR extreme effect size with cross-tier replication
- Not specified: Insufficient evidence for categorical claim
Cross-Tier Consistency (σ): Standard deviation of completion rates across Q1/Q4/Q8 quantization tiers
- σ = 0.00 indicates perfect consistency (same performance across all tiers)
- Higher σ values indicate tier-dependent variability
- Perfect consistency (0.00) strengthens confidence in constraint-resilience
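The metric definitions above can be made concrete with a short sketch; the helper names are assumptions for illustration rather than the evaluation harness itself.

```python
# Illustrative helpers for the metric definitions above: completion-rate
# formatting, token efficiency, and cross-tier consistency (population
# standard deviation of completion rates across Q1/Q4/Q8).
import statistics

def completion_rate(successes: int, total: int) -> str:
    """Format as 'X.XX (n/N)', e.g. '1.00 (5/5)'."""
    return f"{successes / total:.2f} ({successes}/{total})"

def token_efficiency(semantic_fidelity: float, tokens: float, latency_ms: float) -> float:
    """semantic_fidelity / (tokens * latency_ms), as defined above."""
    return semantic_fidelity / (tokens * latency_ms)

def cross_tier_sigma(rates: list[float]) -> float:
    """Sigma of completion rates across tiers; 0.00 = perfect consistency."""
    return statistics.pstdev(rates)

print(completion_rate(4, 5))               # 0.80 (4/5)
print(cross_tier_sigma([1.0, 1.0, 1.0]))   # 0.0 -> perfectly tier-consistent
```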
C.0.3 Statistical Interpretation Guidelines
Understanding Small Sample Sizes: With n=5 trials per variant, traditional parametric assumptions (normality, independence, homogeneity of variance) cannot be reliably verified. However, the methodology provides robust evidence through:
Categorical Outcomes: Binary completion rates with extreme separability (100% vs 0%) provide unambiguous categorical distinctions. Fisher's Exact Test validates these separations even with small samples.
Effect Size Emphasis: Rather than relying solely on p-values, analysis emphasizes practical significance through effect size magnitude. Large effect sizes (e.g., Verbose consumed 133% more tokens than MCD: 147 vs 63) demonstrate meaningful practical differences.
Replication Evidence: Cross-tier consistency (Q1/Q4/Q8) provides three independent replications of each comparison. Perfect consistency (σ=0.00) across tiers strengthens conclusions beyond single-tier testing.
Pattern Convergence: Consistent patterns across 10 independent tests (T1-T10) and 3 domain walkthroughs (W1-W3) demonstrate framework-level validation rather than isolated test-specific results.
Confidence Interval Interpretation: 95% confidence intervals for completion rates are calculated using the Wilson score method, which provides accurate bounds even for small samples and extreme proportions (0% or 100%). Wide confidence intervals reflect estimation uncertainty but do not invalidate categorical distinctions when non-overlapping.
Example:
- Variant A: 1.00 (5/5), 95% CI [1.00, 1.00]
- Variant B: 0.00 (0/5), 95% CI [0.00, 0.00]
- Interpretation: Clear categorical separation; no overlap indicates distinct performance classes
Cross-Tier Validation Strength: Cross-tier consistency provides stronger evidence than single-tier testing:
- Perfect consistency (σ=0.00): Same performance across Q1/Q4/Q8 confirms constraint-resilience is independent of model capacity
- Variable consistency (σ>0.00): Performance depends on quantization tier, suggesting tier-specific optimization requirements
- Example: Ultra-Minimal showing 0% completion across all tiers (σ=0.00) confirms fundamental architectural insufficiency rather than model-specific limitation
Note: Cross-validation methodology and interpretation guidelines are detailed in Appendix C.0 Introduction. This section presents test-specific results only.
Table C.1.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Structured MCD | Ultra-Minimal | Verbose | Baseline | CoT | Few-Shot | System Role |
---|---|---|---|---|---|---|---|---|
Completion Rate | Q1 | 1.00 (5/5) | 0.00 (0/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
95% CI | Q1 | [1.00, 1.00] | [0.00, 0.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q1 | 63 | — | 147 | 172 | 138 | 63 | 63 |
Avg Latency (ms) | Q1 | 1,273 | — | 4,208 | 4,227 | 3,205 | 1,273 | 1,273 |
Completion Rate | Q4 | 1.00 (5/5) | 0.00 (0/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
95% CI | Q4 | [1.00, 1.00] | [0.00, 0.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q4 | 71 | — | 185 | 203 | 163 | 71 | 71 |
Avg Latency (ms) | Q4 | 2,845 | — | 9,412 | 10,287 | 7,156 | 2,845 | 2,845 |
Completion Rate | Q8 | 1.00 (5/5) | 0.00 (0/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
95% CI | Q8 | [1.00, 1.00] | [0.00, 0.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q8 | 160 | — | 250 | 277 | 160 | 160 | 160 |
Avg Latency (ms) | Q8 | 4,231 | — | 6,673 | 6,835 | 4,231 | 4,231 | 4,231 |
Note: n=5 trials per variant per tier. Ultra-Minimal showed complete failure (0%) across all tiers. Semantic fidelity: 4.0/4.0 for all successful variants.
Table C.1.2: Cross-Tier Consistency and MCD Alignment
Variant | Q1 Success | Q4 Success | Q8 Success | Cross-Tier Consistency (σ) | MCD-Aligned |
---|---|---|---|---|---|
Structured MCD | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ✅ Yes |
Ultra-Minimal | 0% (0/5) | 0% (0/5) | 0% (0/5) | Perfect failure (0.00) | ❌ No |
Verbose | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ⚠️ Partial |
Baseline (Polite) | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ❌ No |
Chain-of-Thought | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ❌ No |
Few-Shot | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ✅ Compatible |
System Role | 100% (5/5) | 100% (5/5) | 100% (5/5) | Perfect (0.00) | ✅ Compatible |
Table C.1.3: Efficiency Classification and Deployment Viability
Variant | Token Range | Efficiency Class | Resource Profile | Deployment Viability |
---|---|---|---|---|
Structured MCD | 63-160 | Optimal | Predictable, stable | ✅ High |
Ultra-Minimal | — | Failed | Context failure | ❌ Unsuitable |
Verbose | 147-250 | Over-engineered | Variable across tiers | ⚠️ Moderate |
Baseline (Polite) | 172-277 | Over-engineered | High overhead | ⚠️ Low |
Chain-of-Thought | 138-160 | Process bloat | Medium overhead | ⚠️ Moderate |
Few-Shot | 63-71 | MCD-compatible | Predictable, efficient | ✅ High |
System Role | 63-71 | MCD-compatible | Predictable, efficient | ✅ High |
Statistical Notes for T1
Categorical Outcome Analysis: Ultra-Minimal variant demonstrated 100% consistent failure across all three quantization tiers (0/5 trials each), confirming that extreme minimalism sacrifices reliability regardless of model capacity. MCD-aligned approaches (Structured MCD, Few-Shot, System Role) achieved identical performance (63-71 tokens, 100% completion) across all tiers, validating constraint-resilience through cross-tier consistency.
Efficiency Plateau Evidence: Token counts beyond 90-130 tokens (Verbose: 147-250, Baseline: 172-277) provided no measurable quality improvements—all successful variants achieved 4.0/4.0 semantic fidelity, confirming resource optimization plateau. MCD token efficiency (0.297 at Q1-tier) vs Verbose (0.114) represents 161% improvement.
Statistical Approach: With n=5 per variant, categorical differences validated through Fisher's Exact Test for binary outcomes with extreme separability (100% vs 0%). Continuous metrics analyzed using descriptive statistics with 95% CI (Wilson score method). Cross-tier replication across Q1/Q4/Q8 provides stronger evidence than single-tier testing.
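As a concrete illustration, the sketch below applies Fisher's Exact Test to T1's extreme 5/5 vs 0/5 split, assuming SciPy is available; for this split the two-sided p-value is 2/C(10,5) ≈ 0.0079.

```python
# Illustrative application of Fisher's Exact Test to the T1 categorical split:
# Structured MCD 5/5 successes vs Ultra-Minimal 0/5.
from scipy.stats import fisher_exact

table = [[5, 0],   # Structured MCD: successes, failures
         [0, 5]]   # Ultra-Minimal:  successes, failures
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"p = {p_value:.4f}")  # p ~ 0.0079 for a complete 5/5 vs 0/5 separation
```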
Note: Methodology and interpretation guidelines detailed in Appendix C.0 Introduction. Information density metric: semantic_fidelity / token_count (higher = better semantic preservation per token).
Table C.2.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Structured Symbolic | Ultra-Minimal | Verbose | Extended Natural |
---|---|---|---|---|---|
Task Completion | Q1 | 0.80 ± 0.18 (4/5) | 0.00 ± 0.00 (0/5) | 1.00 ± 0.00 (5/5) | 0.20 ± 0.18 (1/5) |
95% CI | Q1 | [0.62, 0.98] | [0.00, 0.00] | [1.00, 1.00] | [0.02, 0.38] |
Information Density | Q1 | 3.2 ± 0.4 | 0.8 ± 0.2 | 2.4 ± 0.3 | 1.2 ± 0.6 |
Avg Tokens | Q1 | 24 | 12 | 42 | 65 |
Avg Latency (ms) | Q1 | 1,106 | — | 910 | 1,739 |
Resource Stability | Q1 | 100% | 0% | 100% | 20% (overflow) |
Task Completion | Q4 | 0.80 ± 0.18 (4/5) | 0.00 ± 0.00 (0/5) | 1.00 ± 0.00 (5/5) | 0.20 ± 0.18 (1/5) |
95% CI | Q4 | [0.62, 0.98] | [0.00, 0.00] | [1.00, 1.00] | [0.02, 0.38] |
Information Density | Q4 | 3.5 ± 0.3 | 0.0 ± 0.0 | 2.6 ± 0.2 | 1.3 ± 0.5 |
Avg Tokens | Q4 | 28 | — | 48 | 72 |
Avg Latency (ms) | Q4 | 2,586 | — | 4,566 | 4,651 |
Resource Stability | Q4 | 100% | 0% | 100% | 20% (overflow) |
Task Completion | Q8 | 0.80 ± 0.18 (4/5) | 0.00 ± 0.00 (0/5) | 1.00 ± 0.00 (5/5) | 0.20 ± 0.18 (1/5) |
95% CI | Q8 | [0.62, 0.98] | [0.00, 0.00] | [1.00, 1.00] | [0.02, 0.38] |
Information Density | Q8 | 3.8 ± 0.3 | 0.0 ± 0.0 | 2.8 ± 0.2 | 1.4 ± 0.5 |
Avg Tokens | Q8 | 32 | — | 55 | 85 |
Avg Latency (ms) | Q8 | 6,957 | — | 6,674 | 6,835 |
Resource Stability | Q8 | 100% | 0% | 100% | 20% (overflow) |
Note: n=5 trials per variant per tier. Semantic fidelity: 4.0 for successful variants, 0.0 for failures. Processing consistency variance: Structured (2.6-3.2%), Extended Natural (13.9-15.4%).
Table C.2.2: Cross-Tier Consistency and Medical Reasoning Viability
Variant | Cross-Tier Completion | Info Density Range | Clinical Usability | Edge Deployment Score |
---|---|---|---|---|
Structured Symbolic | 80% (12/15 across tiers) | 3.2–3.8 | ✅ High (actionable format) | 9.5/10 |
Ultra-Minimal | 0% (0/15 across tiers) | 0.0–0.8 | ❌ Unsuitable (context failure) | 0/10 |
Verbose | 100% (15/15 across tiers) | 2.4–2.8 | ⚠️ Moderate (resource-heavy) | 6/10 |
Extended Natural | 20% (3/15 across tiers) | 1.2–1.4 | ❌ Poor (80% overflow) | 2/10 |
Edge Deployment Score: Composite of completion rate, resource stability, and constraint resilience.
Table C.2.3: Context Sufficiency Analysis
Variant | Min Viable Tokens | Token Efficiency | Semantic Loss Risk | Key Limitation |
---|---|---|---|---|
Structured Symbolic | 24 tokens (medium) | Optimal | Low | Trial variance (1/5 failure) |
Ultra-Minimal | 12 tokens (insufficient) | Theoretical only | Critical | 100% context failure |
Verbose | 42-55 tokens (high) | Suboptimal | None | 75% token overhead |
Extended Natural | 65-85 tokens (excessive) | Poor | Overflow-induced | 80% budget overflow |
Statistical Notes for T2
Information Density Validation: Structured symbolic approaches achieved 3.2–3.8 information density across all tiers, representing 33-171% efficiency advantage over verbose (2.4–2.8) and extended natural (1.2–1.4) variants. This pattern replicated consistently across Q1/Q4/Q8, providing cross-tier validation with total n=15 per variant.
Context Insufficiency Boundary: Ultra-minimal variant showed 100% failure (0/15 trials across all tiers), establishing empirical lower bound for viable symbolic formatting. The 24-token structured approach represents minimal sufficient context for 80% reliability (12/15 trials) in medical reasoning.
Resource Overflow Pattern: Extended natural exhibited systematic overflow (12/15 trials: 80% across tiers), with token budgets consumed before actionable conclusions. Processing consistency variance: structured approaches 2.6-3.2% vs extended natural 13.9-15.4% (4-5× more stable).
Medical Domain Application: In clinical decision support, structured symbolic maintained 80% diagnostic accuracy (12/15) while ensuring actionable format. Extended natural achieved only 20% actionable output (3/15) despite consuming 157-171% more tokens, demonstrating practical efficiency-effectiveness trade-offs.
Effect Size Interpretation: Information density improvements (3.2-3.8 vs 1.2-1.4) represent tier-by-tier gains of 167-171%. The 100% token overhead (24 vs 12 tokens) represents the minimum investment for the 80-percentage-point reliability improvement in medical diagnostic scenarios, confirmed through cross-tier replication.
Note: Methodology detailed in Appendix C.0. Test context: Degraded input recovery ("IDK symptoms. Plz help??!!"). Both approaches achieved 100% recovery success across all tiers.
Table C.3.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Structured Fallback (MCD) | Conversational Fallback |
---|---|---|---|
Recovery Success | Q1 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q1 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q1 | 66 | 71 |
Token Efficiency | Q1 | 1.515 | 1.408 |
Avg Latency (ms) | Q1 | 1,300 | 1,072 |
Information Gathering | Q1 | Explicit fields | Open-ended |
Recovery Success | Q4 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q4 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q4 | 202 | 208 |
Token Efficiency | Q4 | 0.495 | 0.481 |
Avg Latency (ms) | Q4 | 4,691 | 4,412 |
Recovery Success | Q8 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q8 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q8 | 136 | 208 |
Token Efficiency | Q8 | 0.735 | 0.481 |
Avg Latency (ms) | Q8 | 3,405 | 4,412 |
Note: n=5 trials per approach per tier. Token efficiency = recovery success (%) / avg tokens. Both approaches achieved 100% resource stability (zero overflow).
Table C.3.2: Cross-Tier Consistency and Resource Trade-offs
Characteristic | Structured (MCD) | Conversational | Trade-off Analysis |
---|---|---|---|
Cross-Tier Success | 100% (15/15 trials) | 100% (15/15 trials) | Equivalent functional outcome |
Token Range | 66–202 | 71–208 | 7-35% structured advantage |
Latency Range | 1,300–4,691 ms | 1,072–4,412 ms | 18% conversational advantage (Q1) |
Information Structure | Explicit fields (location, duration, severity) | Open-ended invitation | Systematic vs empathetic |
User Experience | Directive, clinical | Supportive, empathetic | Context-dependent preference |
Edge Viability | ✅ High (optimal tokens) | ⚠️ Moderate (UX priority) | Resource vs engagement trade-off |
Stateless Operation | Excellent (zero memory dependency) | Excellent (zero memory dependency) | Both MCD-compatible |
Table C.3.3: Fallback Strategy Deployment Recommendations
Deployment Context | Recommended Approach | Justification | Expected Outcome |
---|---|---|---|
Resource-constrained edge | Structured (MCD) | 7-35% token efficiency gain | Optimal computational utilization |
User experience priority | Conversational | 18% faster processing, empathetic tone | Enhanced engagement quality |
Medical/clinical systems | Structured (MCD) | Systematic field collection | Actionable diagnostic data |
General assistance | Either approach | Equivalent 100% recovery success | Context-dependent selection |
Stateless deployment | Either approach | Both achieve zero memory dependency | Framework flexibility |
Statistical Notes for T3
Equivalent Recovery Success: Both approaches achieved 100% recovery across all three quantization tiers (15/15 trials each), validating that fallback effectiveness depends on prompt design rather than specific architectural philosophy. Zero-variance consistency (σ=0 for token counts at Q1-tier) demonstrates exceptional execution stability.
Token Efficiency Trade-off: Structured fallback achieved 7-35% token reduction across tiers (Q1: 66 vs 71 tokens, Q4: 202 vs 208 tokens, Q8: 136 vs 208 tokens), confirming explicit field-based clarification provides resource advantages while maintaining equivalent functional outcomes. Q8-tier represents large practical effect size (35% reduction).
Latency Counterintuitive Finding: Conversational fallback processed faster (1,072ms vs 1,300ms on Q1-tier: 18% reduction), contrary to theoretical assumptions about structured prompt efficiency. This demonstrates the importance of empirical testing over theoretical predictions.
Stateless Validation: T3 uniquely confirms that recovery in stateless systems depends entirely on prompt design without conversational memory. Both approaches successfully elicited clarification without dialogue history access, validating robust fallback mechanisms in memory-constrained deployments.
Deployment Context Guidance: The choice between structured and conversational fallback depends on optimization priorities: resource-constrained environments benefit from structured fallback's token efficiency (7-35% reduction), while user experience prioritization may favor conversational fallback's empathetic engagement and faster processing. Both achieve equivalent functional outcomes (100% recovery) in stateless operation.
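To make the T3 contrast concrete, the sketch below paraphrases the two fallback styles; the template wording is an assumption for illustration, not the exact prompts used in the trials.

```python
# Illustrative prompt templates (paraphrased, not the exact T3 prompts)
# contrasting the two fallback styles described above.
STRUCTURED_FALLBACK = (
    "Your message is unclear. Please provide: "
    "1) symptom location, 2) duration, 3) severity (mild/moderate/severe)."
)
CONVERSATIONAL_FALLBACK = (
    "I'm sorry you're not feeling well. Could you tell me a bit more about "
    "what's bothering you so I can help?"
)

def choose_fallback(resource_constrained: bool) -> str:
    """Per Table C.3.3: structured for tight token budgets, conversational
    when user-experience quality is the priority."""
    return STRUCTURED_FALLBACK if resource_constrained else CONVERSATIONAL_FALLBACK
```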
Note: Methodology detailed in Appendix C.0. Test context: Multi-turn appointment scheduling without memory. Turn 1: "I'd like to schedule a physiotherapy appointment for knee pain." Turn 2A (Implicit): "Make it next Monday morning." Turn 2B (Structured): "Schedule a physiotherapy appointment for knee pain on Monday morning."
Table C.4.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Structured Reinjection (MCD) | Implicit Reference |
---|---|---|---|
Task Success | Q1 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q1 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q1 | 120 | 112 |
Token Overhead | Q1 | +7.1% | Baseline |
Avg Latency (ms) | Q1 | 3,798 | 3,512 |
Context Completeness | Q1 | Explicit (model-independent) | Inference-dependent |
Task Success | Q4 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q4 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q4 | 193 | 190 |
Token Overhead | Q4 | +1.6% | Baseline |
Avg Latency (ms) | Q4 | 5,059 | 4,341 |
Task Success | Q8 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q8 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q8 | 236 | 227 |
Token Overhead | Q8 | +3.9% | Baseline |
Avg Latency (ms) | Q8 | 11,166 | 10,462 |
Note: n=5 trials per approach per tier. Both achieved 100% resource stability. Token variance σ=0 (perfect consistency) across all trials.
Table C.4.2: Cross-Tier Reliability Analysis and Trade-offs
Characteristic | Structured Reinjection (MCD) | Implicit Reference | Key Distinction |
---|---|---|---|
Cross-Tier Success | 100% (15/15 trials) | 100% (15/15 trials) | Equivalent functional outcome |
Token Overhead Range | +1.6% to +7.1% | Baseline | Reliability insurance premium |
Context Approach | Explicit slot-carryover (appointment type, condition, timing) | Implicit pronoun reference ("it", "next Monday") | Systematic vs inference-based |
Reliability Model | Model-independent (each turn self-contained) | Model-dependent (requires inference capability) | Deployment guarantee difference |
Turn Interpretability | Each turn fully interpretable standalone | Turn 2 requires Turn 1 context | Self-containment vs reference |
Edge Deployment Viability | ✅ High (guaranteed preservation) | ⚠️ Variable (depends on model capability) | Predictability vs resource efficiency |
Stateless Operation | ✓ Confirmed (explicit carryover) | ✓ Confirmed (inference-based) | Both truly stateless |
Table C.4.3: Deployment Context Recommendations
Deployment Scenario | Recommended Approach | Rationale | Token Cost Trade-off |
---|---|---|---|
Variable model capacity | Structured (MCD) | Model-independent reliability | +1.6-7.1% overhead acceptable |
Resource-abundant context | Implicit Reference | Lower token cost (baseline) | Leverage inference capabilities |
Safety-critical systems | Structured (MCD) | Guaranteed context preservation | Eliminate inference uncertainty |
Multi-tier deployment | Structured (MCD) | Consistent behavior across Q1/Q4/Q8 | Predictable overhead (1.6-7.1%) |
Known robust models | Either approach | Both achieve 100% success | Context-dependent selection |
Statistical Notes for T4
Equivalent Task Success: Both approaches achieved 100% success across all tiers (15/15 trials each), validating that stateless multi-turn context management succeeds through either explicit reinjection or model inference when capabilities permit. Zero token variance (σ=0) at all tiers indicates highly deterministic, predictable behavior.
Reliability Insurance Premium: Structured reinjection required modest token overhead: +7.1% (Q1), +1.6% (Q4), +3.9% (Q8). This quantifies the cost of deployment-independent reliability—eliminating inference uncertainty and ensuring each turn is self-contained. The variable overhead (1.6-7.1%) suggests context preservation costs scale differently across model capacities.
Deployment Reliability Classification: Structured reinjection achieves model-independent reliability by making each turn fully interpretable without prior turn reference. Implicit reference creates model-dependent reliability, where success relies on the model's pronoun resolution and temporal reference inference capabilities.
Stateless Operation Validation: Both mechanisms are truly stateless but differ fundamentally: (1) Explicit slot-carryover (structured) guarantees preservation through systematic reinjection; (2) Implicit reference requires model inference to resolve "it" and "next Monday morning" connections to Turn 1 content. T4 confirms stateless systems can manage multi-turn interactions through both pathways, with reliability trade-offs quantified at 1.6-7.1% token overhead for guaranteed preservation.
Architectural Design Choice: Stateless context management presents a fundamental trade-off: Explicit reinjection (+1.6% to +7.1% tokens) provides model-independent reliability and guaranteed preservation, while implicit reference (baseline tokens) offers lower resource cost but model-dependent reliability. Selection depends on deployment constraints, model variance expectations, and whether predictability outweighs resource optimization.
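A minimal sketch of the two stateless strategies follows, mirroring the Turn 2A/2B pattern quoted in the note above; the slot dictionary and helper are hypothetical.

```python
# Illustrative sketch of the two stateless context-management strategies in T4.
def structured_reinjection(slots: dict[str, str], timing: str) -> str:
    """Explicit slot-carryover: restate appointment type and condition each
    turn so the request is self-contained (the Turn 2B pattern)."""
    return (f"Schedule a {slots['appointment']} appointment "
            f"for {slots['condition']} {timing}.")

slots = {"appointment": "physiotherapy", "condition": "knee pain"}
print(structured_reinjection(slots, "on Monday morning"))
# -> "Schedule a physiotherapy appointment for knee pain on Monday morning."
# Implicit alternative (Turn 2A): "Make it next Monday morning." Fewer tokens,
# but success depends on the model resolving what "it" refers to.
```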
Note: Methodology detailed in Appendix C.0. Test context: Spatial navigation comparing systematic anchoring (metric + cardinal) vs contextual inference (relational positioning). Both achieved 100% task success.
Table C.5.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Structured Specification (MCD) | Naturalistic Spatial |
---|---|---|---|
Task Success | Q1 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q1 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q1 | 80 | 53 |
Token Efficiency | Q1 | 0.625 | 0.943 |
Avg Latency (ms) | Q1 | 1,952 | 1,111 |
Spatial Specification | Q1 | Metric (2m) + Cardinal (north) | Relational (shadow, past it) |
Task Success | Q4 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q4 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q4 | 90 | 191 |
Token Efficiency | Q4 | 0.556 | 0.262 |
Avg Latency (ms) | Q4 | 1,466 | 4,691 |
Task Success | Q8 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q8 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q8 | 136 | 93 |
Token Efficiency | Q8 | 0.368 | 0.538 |
Avg Latency (ms) | Q8 | 3,182 | 2,298 |
Note: n=5 trials per approach per tier. Both approaches achieved 100% resource stability. Token variance within tiers: σ=0 (perfect consistency).
Table C.5.2: Cross-Tier Resource Variability and Execution Predictability
Metric | Structured (MCD) | Naturalistic | Key Distinction |
---|---|---|---|
Cross-Tier Success | 100% (15/15 trials) | 100% (15/15 trials) | Equivalent functional outcome |
Token Pattern | Q1: 80 → Q4: 90 → Q8: 136 | Q1: 53 → Q4: 191 → Q8: 93 | Predictable vs unpredictable scaling |
Q1 Token Overhead | +51% (80 vs 53) | Baseline | Structured pays efficiency cost |
Q4 Token Overhead | Baseline | +112% (191 vs 90) | Reversed pattern |
Q8 Token Overhead | +46% (136 vs 93) | Baseline | Pattern returns to Q1 direction |
Execution Pattern | Systematic anchoring | Contextual inference | Model-independent vs model-dependent |
Deployment Reliability | Predictable (metric + cardinal) | Variable (relational metaphors) | Safety-critical suitability difference |
Table C.5.3: Deployment Context Recommendations
Application Domain | Recommended Approach | Critical Requirement | Justification |
---|---|---|---|
Safety-critical robotics | Structured (mandatory) | Unambiguous spatial coordinates | Eliminates interpretation ambiguity |
Autonomous navigation | Structured (mandatory) | Deterministic action sequences | Metric + cardinal eliminates drift |
Medical procedures | Structured (mandatory) | Precise spatial positioning | Safety requires quantifiable measurements |
Resource-predictable edge | Structured (recommended) | Consistent resource patterns | Tier-independent execution stability |
General-purpose contexts | Either approach acceptable | Tolerance for spatial imprecision | 100% success for both with capable models
Cross-model portability | Structured (recommended) | Model-independent execution | No reliance on inference capabilities |
Statistical Notes for T5
Equivalent Task Success: Both approaches achieved 100% task success across all three quantization tiers (15/15 trials each), validating that spatial reasoning can succeed through either systematic anchoring or contextual inference when models possess adequate capabilities.
Tier-Dependent Token Variability: Token overhead showed unpredictable cross-tier patterns demonstrating deployment reliability differences:
- Q1-tier: Structured +51% overhead (80 vs 53 tokens)
- Q4-tier: Naturalistic +112% overhead (191 vs 90 tokens) — reversed pattern
- Q8-tier: Structured +46% overhead (136 vs 93 tokens)
This non-monotonic scaling for naturalistic approaches (53→191→93) demonstrates unpredictable resource requirements across model capacities, while structured approaches show predictable scaling (80→90→136), validating MCD's constraint-resilience principle.
Execution Predictability: Structured specification achieved deployment-independent predictability through systematic spatial anchoring (metric distance, cardinal direction, explicit sequencing), eliminating reliance on model-specific spatial inference capabilities. Naturalistic approaches created model-dependent execution where success relies on contextual inference to resolve relational metaphors ("shadow") and implied sequencing ("continue past").
Safety-Critical Implications: For applications requiring precise spatial behavior (robotics, medical, autonomous systems), structured specification provides unambiguous spatial coordinates through quantifiable measurements. The Q4-tier reversal (naturalistic consuming 112% more tokens despite Q1/Q8 efficiency) confirms that relational spatial reasoning creates unpredictable resource patterns unsuitable for deployment-critical contexts.
Key Trade-off: The tier-specific variability validates that execution predictability (structured: consistent cross-tier patterns) outweighs token minimization (naturalistic: variable efficiency) when deployment reliability is prioritized over resource optimization in individual tiers.
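To illustrate the distinction, a minimal sketch of the structured style follows, with the naturalistic counterpart shown as a comment; the exact phrasings are assumptions, not the T5 prompts.

```python
# Illustrative only: structured spatial specification vs naturalistic phrasing.
def structured_spatial(distance_m: float, heading: str, target: str) -> str:
    """Systematic anchoring: metric distance + cardinal direction, so the
    command is interpretable without model-side spatial inference."""
    return f"Move {distance_m}m {heading}, then stop at the {target}."

print(structured_spatial(2, "north", "charging station"))
# A naturalistic equivalent relies on relational inference instead, e.g.:
# "Head toward the shadow and continue past it to the charger."
```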
Note: Methodology detailed in Appendix C.0. Task: "Summarize causes of Type 2 diabetes." All variants achieved 100% task completion across all tiers (15/15 trials each). Primary differentiator: computational efficiency. Resource waste = (tokens_used − hybrid_baseline) / hybrid_baseline × 100%.
Table C.6.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Structured MCD | Verbose | CoT | Few-Shot | Hybrid |
---|---|---|---|---|---|---|
Task Completion | Q1 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
Avg Tokens | Q1 | 131 | 173 | 171 | 114 | 94 |
Resource Efficiency | Q1 | 0.76 ± 0.04 | 0.58 ± 0.08 | 0.58 ± 0.08 | 0.88 ± 0.05 | 1.06 ± 0.03 |
Resource Waste | Q1 | 39% | 84% | 82% | 21% | 0% (baseline) |
Avg Latency (ms) | Q1 | 4,285 | 4,213 | 4,216 | 1,901 | 1,965 |
Task Completion | Q4 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
Avg Tokens | Q4 | 196 | 241 | 239 | 117 | 104 |
Resource Efficiency | Q4 | 0.51 ± 0.03 | 0.41 ± 0.05 | 0.42 ± 0.06 | 0.85 ± 0.04 | 0.96 ± 0.02 |
Resource Waste | Q4 | 88% | 132% | 130% | 13% | 0% (baseline) |
Avg Latency (ms) | Q4 | 4,837 | 4,502 | 5,634 | 860 | 1,514 |
Task Completion | Q8 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
Avg Tokens | Q8 | 245 | 289 | 287 | 129 | 107 |
Resource Efficiency | Q8 | 0.41 ± 0.03 | 0.35 ± 0.05 | 0.35 ± 0.06 | 0.77 ± 0.04 | 0.93 ± 0.02 |
Resource Waste | Q8 | 127% | 169% | 167% | 20% | 0% (baseline) |
Avg Latency (ms) | Q8 | 6,850 | 7,245 | 7,198 | 2,980 | 2,545 |
Note: n=5 trials per variant per tier. All variants achieved 3.5/4.0 semantic fidelity. Resource efficiency = task completion (%) / token count. Effect sizes: Hybrid vs CoT/Verbose (Cohen's d > 2.0, very large).
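Both quantities can be reproduced directly from the table values; a short sketch, assuming the formulas as stated in the notes:

```python
# Sketch of the two T6 efficiency quantities, using the hybrid variant's
# token count as the waste baseline (per the note before Table C.6.1).
def resource_waste(tokens_used: int, hybrid_baseline: int) -> float:
    """Percent token overhead relative to the hybrid baseline."""
    return (tokens_used - hybrid_baseline) / hybrid_baseline * 100

def resource_efficiency(completion_pct: float, tokens: int) -> float:
    """Task completion (%) per token, matching the table's scaling."""
    return completion_pct / tokens

print(round(resource_waste(173, 94)))           # Verbose Q1: ~84%
print(round(resource_efficiency(100, 94), 2))   # Hybrid Q1: ~1.06
```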
Table C.6.2: Cross-Tier Efficiency Classification and Waste Scaling Patterns
Variant | Efficiency Category | Q1 Waste | Q4 Waste | Q8 Waste | Waste Trend | Cross-Tier Consistency |
---|---|---|---|---|---|---|
Hybrid | Superior Optimization | 0% | 0% | 0% | Flat (0%) | 100% stable |
Few-Shot | MCD-Compatible | 21% | 13% | 20% | Flat (18% avg) | 100% stable |
Structured MCD | Moderate Bloat | 39% | 88% | 127% | Increasing (3.3×) | 100% stable |
Chain-of-Thought | Process Bloat | 82% | 130% | 167% | Increasing (2.0×) | 100% stable |
Verbose | Over-Engineered | 84% | 132% | 169% | Increasing (2.0×) | 100% stable |
Key Pattern: MCD-compatible approaches (Hybrid, Few-Shot) maintain ≤21% waste regardless of tier. Non-MCD approaches (CoT, Verbose, Structured MCD) show 2.0-3.3× waste increase Q1→Q8, demonstrating computational debt compounding with model capacity. Perfect ranking consistency across all tiers (100%) validates categorical efficiency differences.
Table C.6.3: Resource Optimization Plateau Evidence
Finding | Evidence | Implication |
---|---|---|
Universal Task Success | 100% completion across all 5 variants × 3 tiers × 5 trials = 75/75 trials | Success ≠ efficiency under constraints
Capability Plateau | All variants achieved 3.5/4.0 semantic fidelity regardless of token count (94-289 tokens) | Additional tokens beyond 90-130 provide no quality benefit |
Structural vs Process Distinction | Few-Shot (structural): 18% avg waste; CoT (process): 126% avg waste; Effect size d=2.4 | Structural guidance scales efficiently; process guidance creates overhead |
Hybrid Superiority | Consistent optimal performance: Q1 (1.06), Q4 (0.96), Q8 (0.93); 28-39% efficiency gain | Combining constraints + examples achieves optimal resource utilization |
Waste Compounding | CoT/Verbose waste increases 2.0× from Q1→Q8 while Few-Shot remains stable | Process approaches scale poorly with model capacity |
Statistical Notes for T6
Universal Task Success with Variable Efficiency: All five strategies achieved 100% completion (75/75 trials total), demonstrating that success does not equal efficiency. The key differentiator was computational resource utilization (0-169% waste range), validating focus on efficiency metrics as primary outcome.
Resource Optimization Plateau: Consistent plateau around 90-130 tokens across approaches validated independently in all three tiers. Beyond this threshold, additional tokens provided no semantic quality improvements (all variants: 3.5 fidelity), confirming resource optimization ceiling existence.
Structural vs Process Guidance Distinction: Few-shot examples (structural guidance) achieved 18% average waste (21%→13%→20% across tiers) while Chain-of-Thought (process guidance) demonstrated 126% average waste (82%→130%→167%), representing very large effect size (Cohen's d = 2.4). This validates fundamental distinction between constraint-compatible structural templates and resource-intensive process reasoning.
Cross-Tier Validation Strength: Perfect consistency of efficiency rankings across three independent quantization tiers (Q1/Q4/Q8) provides robust evidence for categorical efficiency differences. No variant changed its efficiency category across tiers, demonstrating 100% classification stability and strengthening findings beyond per-tier sample limitations (n=5 per tier, n=15 total per variant).
Design Implication: Resource-constrained deployments should prioritize structural guidance (few-shot examples, hybrid approaches) over process guidance (chain-of-thought reasoning) when efficiency is critical, as structural approaches maintain ≤21% resource waste across varying model capacities while process approaches demonstrate 2.0-3.3× waste compounding.
Note: Methodology detailed in Appendix C.0. Navigation task with escalating constraint complexity: Baseline → Simple (+ wet floors) → Complex (+ detours, red corridors). All variants achieved 100% completion; resource efficiency is the critical differentiator.
Table C.7.1: Combined Performance Matrix Across All Quantization Tiers
Variant | Tier | Baseline Tokens | Simple Tokens | Complex Tokens | Completion Rate | Avg Latency (ms) | Resource Efficiency |
---|---|---|---|---|---|---|---|
MCD Baseline | Q1 | 87 | 67 | 70 | 5/5 (100%) | 1,400 | 1.149–1.493 |
MCD Baseline | Q4 | 118 | 121 | 130 | 5/5 (100%) | 2,613 | 0.769–0.847 |
MCD Baseline | Q8 | 123 | 133 | 140 | 5/5 (100%) | 3,416 | 0.714–0.813 |
CoT Planning | Q1 | 152 | 152 | 152 | 5/5 (100%) | 3,422 | 0.658 |
CoT Planning | Q4 | 188 | 188 | 188 | 5/5 (100%) | 2,624 | 0.381 |
CoT Planning | Q8 | 233 | 233 | 233 | 5/5 (100%) | 4,495 | 0.343 |
Few-Shot | Q1 | 143 | 143 | 143 | 5/5 (100%) | 2,663 | 0.699 |
Few-Shot | Q4 | 188 | 188 | 188 | 5/5 (100%) | 2,624 | 0.381 |
Few-Shot | Q8 | 128 | 128 | 128 | 5/5 (100%) | 1,620 | 1.062 |
System Role | Q1 | 70 | 70 | 70 | 5/5 (100%) | 687 | 1.429 |
System Role | Q4 | 157 | 157 | 157 | 5/5 (100%) | 2,638 | 0.610 |
System Role | Q8 | 162 | 162 | 162 | 5/5 (100%) | 3,422 | 0.617 |
Verbose | Q1 | 135 | 135 | 135 | 5/5 (100%) | 3,205 | 0.741 |
Verbose | Q4 | 173 | 173 | 173 | 5/5 (100%) | 4,213 | 0.487 |
Verbose | Q8 | 219 | 219 | 219 | 5/5 (100%) | 5,666 | 0.386 |
Note: n=5 trials per variant per complexity level per tier (45 total observations per variant). Resource efficiency = 1/(tokens × latency/1000).
Table C.7.2: Cross-Tier Consistency and Resource Overhead Analysis
Variant | Token Scaling Pattern | Cross-Tier Success | Avg Resource Cost Ratio | Deployment Viability |
---|---|---|---|---|
MCD Baseline | Adaptive (67→87 tokens) | 100% (45/45 trials) | 1.0× (baseline) | ✅ High (optimal scaling) |
CoT Planning | Constant (152–233 tokens) | 100% (45/45 trials) | 2.2× overhead | ❌ Low (invariant cost) |
Few-Shot | Consistent (128–188 tokens) | 100% (45/45 trials) | 1.3× | ✅ Moderate (stable) |
System Role | Minimal (70–162 tokens) | 100% (45/45 trials) | 0.9× | ✅ High (efficient) |
Verbose | High baseline (135–219 tokens) | 100% (45/45 trials) | 1.5× | ⚠️ Moderate (over-engineered) |
Resource Cost Ratio: Calculated relative to the MCD baseline across all tiers and complexity levels. CoT's 2.2× is the weighted-average token overhead; combining the token ratio (1.75×) with the latency ratio (1.38×) yields a 2.41× combined resource cost.
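The composition rule can be checked with a one-line computation, assuming the two ratios multiply as the note states:

```python
# Worked example of the combined resource cost for CoT vs the MCD baseline.
token_ratio = 1.75     # CoT tokens / MCD tokens (weighted across tiers)
latency_ratio = 1.38   # CoT latency / MCD latency (weighted across tiers)
combined_cost = token_ratio * latency_ratio
print(f"{combined_cost:.2f}x")  # ~2.41x combined resource cost
```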
Table C.7.3: Constraint Scaling Behavior and Edge Deployment Recommendations
Scaling Pattern | Token Range | Efficiency Class | Key Characteristic | Recommended For |
---|---|---|---|---|
Adaptive (MCD) | 67–140 | Optimal | Adapts to task demands (Q1: baseline 87 → simple 67 → complex 70) | Edge devices, mobile platforms
Constant (CoT) | 152–233 | Poor | Invariant overhead regardless of task | ❌ Not constraint-suitable |
Consistent (Few-Shot) | 128–188 | High | Stable structure-guided approach | General-purpose deployment |
Minimal (System Role) | 70–162 | Optimal | Low baseline with moderate scaling | Resource-critical applications |
High Baseline (Verbose) | 135–219 | Poor | Excessive initial cost | ❌ Avoid for edge deployment |
Statistical Notes for T7
Equivalent Task Success with Divergent Resource Costs: All five variants achieved 100% completion (45/45 trials per variant: 5 trials × 3 tiers × 3 complexity levels), validating that task success is independent of prompting approach. Resource efficiency becomes the sole differentiator, with dramatic variations (0.343 to 1.493 efficiency scores).
CoT Resource Overhead Quantification: Chain-of-thought consumed 1.75-2.4× more tokens across tiers with weighted average 2.2× computational cost for identical outcomes. Combined resource cost (tokens × latency): CoT vs MCD baseline = 2.41× overhead, representing exceptionally large effect size (Cohen's d > 2.0).
Constraint Scaling Validation: MCD demonstrated adaptive scaling (baseline 87 → simple 67 → complex 70 tokens) while CoT maintained constant 152-233 token overhead regardless of task complexity. This invariance demonstrates fundamental architectural mismatch with constraint-first design principles.
Multi-Dimensional Validation: Perfect reliability across 45 observations per variant (completion rate σ=0.00). Resource efficiency patterns remained consistent across all conditions with MCD variants achieving 1.5-2.5× superior efficiency. Cross-tier and cross-complexity replication strengthens confidence despite small per-condition samples.
Deployment Implications: CoT's widespread adoption reflects optimization for unconstrained environments. T7 demonstrates that resource-bounded contexts require fundamentally different strategies. The constant 152-233 token CoT overhead vs MCD's adaptive 67-140 token range represents design paradigm mismatch for edge deployment, with 2.2-2.4× efficiency penalty translating to tangible costs (battery life, latency, throughput).
Note: Methodology detailed in Appendix C.0. Test context: WebAssembly (WebLLM) offline execution, "Summarize solar power benefits in ≤50 tokens." All variants achieved 100% completion (30/30 trials across tiers)—focus on resource efficiency differentiation.
Table C.8.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Structured | Verbose | CoT | Few-Shot | System Role | Hybrid |
---|---|---|---|---|---|---|---|
Completion | Q1 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
Avg Tokens | Q1 | 131 | 156 | 170 | 97 | 144 | 68 |
Avg Latency (ms) | Q1 | 4,273 | 4,383 | 4,345 | 1,757 | 4,184 | 1,242 |
Memory Δ (MB) | Q1 | +18 | +6 | -2 | -9 | -4 | 0 |
Completion | Q4 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
Avg Tokens | Q4 | 191 | 221 | 233 | 221 | 209 | 205 |
Avg Latency (ms) | Q4 | 4,477 | 4,548 | 4,495 | 5,030 | 4,587 | 4,346 |
Memory Δ (MB) | Q4 | +6 | 0 | -2 | -1 | -2 | +8 |
Completion | Q8 | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) | 1.00 (5/5) |
Avg Tokens | Q8 | 201 | 211 | 240 | 211 | 208 | 116 |
Avg Latency (ms) | Q8 | 5,043 | 4,940 | 5,293 | 5,093 | 4,980 | 2,445 |
Memory Δ (MB) | Q8 | +2 | -6 | +5 | +2 | -1 | +10 |
Note: n=5 trials per variant per tier. 95% CI: [1.00, 1.00] for all completion rates. Memory stability: All variants remained within ±20MB (WebAssembly stable range).
Table C.8.2: Cross-Tier Resource Efficiency and Deployment Classification
Variant | Token Range (Q1/Q4/Q8) | Latency Profile | Deployment Class | Edge Viability | Resource Efficiency Score |
---|---|---|---|---|---|
Hybrid | 68 / 205 / 116 | Low (1,242–4,346ms) | Edge-superior | ✅ Optimal | 9.5/10 |
Few-Shot | 97 / 221 / 211 | Moderate (1,757–5,093ms) | Edge-compatible | ✅ High | 9.0/10 |
Structured | 131 / 191 / 201 | Moderate (4,273–5,043ms) | Edge-optimized | ✅ High | 8.5/10 |
System Role | 144 / 209 / 208 | Moderate (4,184–4,980ms) | Edge-compatible | ✅ High | 8.0/10 |
Verbose | 156 / 221 / 211 | High (4,383–4,940ms) | Edge-challenging | ⚠️ Moderate | 6.0/10 |
CoT | 170 / 233 / 240 | High (4,345–5,293ms) | Resource-intensive | ❌ Avoid | 2.5/10 |
Resource Efficiency Score: Composite of token efficiency (40%), latency (30%), memory stability (20%), browser compatibility (10%). Scale: 0-10.
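A sketch of the weighted composite using the stated weights follows; the component sub-scores passed in are hypothetical, chosen only to show the mechanics.

```python
# Illustrative weighted composite (weights from the note above; the component
# sub-scores are hypothetical demonstration values on a 0-10 scale).
WEIGHTS = {"token_efficiency": 0.40, "latency": 0.30,
           "memory_stability": 0.20, "browser_compat": 0.10}

def resource_efficiency_score(components: dict[str, float]) -> float:
    """Weighted sum of 0-10 component scores -> 0-10 composite."""
    return sum(WEIGHTS[k] * v for k, v in components.items())

# Hypothetical component scores for a Hybrid-like variant:
print(resource_efficiency_score({"token_efficiency": 10, "latency": 9.5,
                                 "memory_stability": 9, "browser_compat": 9}))
# -> 9.55
```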
Table C.8.3: Resource Efficiency Trade-off Analysis
Comparison | Token Overhead | Latency Impact | Deployment Recommendation |
---|---|---|---|
Hybrid vs CoT (Q1) | 2.5× fewer tokens (68 vs 170) | 3.5× faster (1,242ms vs 4,345ms) | ✅ Hybrid optimal for edge |
Few-Shot vs CoT (Q1) | 1.8× fewer tokens (97 vs 170) | 2.5× faster (1,757ms vs 4,345ms) | ✅ Few-Shot edge-compatible |
Hybrid vs CoT (Q8) | 2.1× fewer tokens (116 vs 240) | 2.2× faster (2,445ms vs 5,293ms) | ✅ Hybrid maintains advantage |
Structured vs Verbose (Q1) | 1.2× fewer tokens (131 vs 156) | Equivalent latency | ⚠️ Marginal efficiency gain |
Cross-Tier Consistency | All variants: 100% completion | Zero failures (30/30 per approach) | ✅ Functional equivalence validated |
Statistical Notes for T8
Universal Task Success: All six approaches achieved 100% completion (30/30 trials across Q1/Q4/Q8), validating functional equivalence. Focus shifts to deployment resource efficiency rather than capability differences.
Token Efficiency Range: Dramatic resource variations despite identical outcomes: Q1-tier: 68 tokens (Hybrid) to 170 tokens (CoT) = 2.5× difference; Q8-tier: 116 tokens (Hybrid) to 240 tokens (CoT) = 2.1× difference. This confirms Chain-of-Thought creates substantial deployment overhead without functional benefits.
Latency Performance: Hybrid (1,242ms) and Few-Shot (1,757ms) demonstrated 2.5-3.5× faster execution vs CoT (4,345ms) at Q1-tier, validating that structured guidance optimizes browser execution while maintaining equivalent outcomes.
Memory Stability: All variants maintained stable profiles (±20MB range), confirming WebAssembly memory management handled all approaches without crashes or browser instability. Zero failures across 180 total trials (6 variants × 3 tiers × 10 measurements).
Deployment Resource Screening: Results validate that constraint-resilient frameworks must distinguish edge-efficient enhancements (few-shot patterns, role-based framing) from resource-intensive techniques (process-heavy reasoning) during design phase. The 2.5× token cost and 3.5× latency differences represent large practical effect sizes for deployment efficiency.
Cross-Tier Replication: Efficiency patterns held consistent across all quantization levels, with Hybrid maintaining optimal performance (Q1: 68 tokens, Q4: 205 tokens, Q8: 116 tokens) compared to CoT resource intensity (Q1: 170, Q4: 233, Q8: 240 tokens).
Note: Methodology detailed in Appendix C.0. Test context: Underspecified input recovery ("Schedule a cardiology checkup."). Both approaches achieved 100% recovery success; analysis focuses on resource efficiency.
Table C.9.1: Combined Performance Matrix Across All Quantization Tiers
Metric | Tier | Constraint-Resilient Loop | Resource-Intensive Chain |
---|---|---|---|
Recovery Success | Q1 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q1 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q1 | 73 | 129 |
Token Efficiency | Q1 | 1.370 | 0.775 |
Avg Latency (ms) | Q1 | 1,929 | 4,071 |
Token Variance | Q1 | σ = 0 (0%) | σ = 12% |
Fallback Depth | Q1 | 2 steps (bounded) | 3+ steps (recursive) |
Recovery Success | Q4 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q4 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q4 | 106 | 188 |
Token Efficiency | Q4 | 0.943 | 0.532 |
Avg Latency (ms) | Q4 | 5,148† | 4,371 |
Token Variance | Q4 | σ = 0 (0%) | σ = 9% |
Recovery Success | Q8 | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | Q8 | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | Q8 | 149 | 230 |
Token Efficiency | Q8 | 0.671 | 0.435 |
Avg Latency (ms) | Q8 | 4,443 | 6,885 |
Token Variance | Q8 | σ = 0 (0%) | σ = 8% |
Note: n=5 trials per approach per tier. †Q4-tier latency anomaly (one outlier at 45s) for constraint-resilient approach. Token efficiency = recovery_success / avg_tokens.
Table C.9.2: Cross-Tier Consistency and Resource Optimization
Characteristic | Constraint-Resilient Loop | Resource-Intensive Chain | Efficiency Advantage |
---|---|---|---|
Cross-Tier Recovery | 100% (15/15 trials) | 100% (15/15 trials) | Equivalent functional outcome |
Token Range | 73–149 | 129–230 | 35-44% reduction |
Clarification Strategy | Slot-specific targeting (date, time) | Open-ended recursive ("What else?") | Explicit vs exploratory |
Recovery Depth | Bounded at 2 steps (deterministic) | Recursive 3+ steps (variable) | Predictable resource ceiling |
Token Consistency | Zero variance (σ=0 at Q1) | 8-12% variance across tiers | 100% vs 88-92% predictability |
Edge Deployment | ✅ High (predictable budget) | ⚠️ Moderate (variable demand) | Resource planning advantage |
Recovery Distribution | 60% Step 2, 40% Step 1 (Q1-tier) | 100% full recursive chain | Faster convergence |
Table C.9.3: Fallback Design Comparison and Deployment Guidance
Design Element | Constraint-Resilient | Resource-Intensive | Deployment Recommendation |
---|---|---|---|
Clarification Example | "Please provide date and time for cardiology appointment" | "What else do I need to know? Be specific." | Explicit > open-ended for efficiency |
Information Targeting | Explicit slots (date, time, type) | Open-ended broad questioning | Slot-specific converges 35-44% faster |
Recovery Predictability | Deterministic 2-step maximum | Variable 3+ step recursion | Bounded depth for resource planning |
Resource Efficiency | 43% fewer tokens (Q1), 44% (Q4), 35% (Q8) | Baseline comparison | Large practical effect size |
Token Consistency | Zero variance (σ=0) | High variance (8-12%) | Predictable vs unpredictable cost |
Best Use Case | Resource-constrained edge deployment | Exploratory conversational systems | Context-dependent selection |
Statistical Notes for T9
Equivalent Recovery with Substantial Efficiency Gap: Both approaches achieved 100% recovery success across all three tiers (15/15 trials each), validating equivalent functional outcomes. Token efficiency differed substantially: 43% reduction on Q1 (73 vs 129 tokens), 44% on Q4 (106 vs 188), and 35% on Q8 (149 vs 230). This consistent cross-tier advantage represents large practical effect size (Cohen's d > 1.5).
Bounded Depth Advantage: Constraint-resilient loops bounded fallback at 2 steps maximum with 60% Q1-tier recovery by Step 2 and 40% by Step 1, while resource-intensive chains required 3+ recursive steps in all trials. This deterministic depth ceiling provides predictable resource budgets essential for edge deployment planning.
Zero Token Variance: Constraint-resilient loops showed zero token variance (σ=0) across all Q1-tier trials and maintained ≤1% variance on Q4/Q8, demonstrating highly consistent slot-specific clarification behavior. Resource-intensive chains showed 8-12% variance due to variable recursive questioning depth, creating unpredictable resource demands unsuitable for constraint-bounded environments.
Slot-Specific Convergence: Explicit slot targeting ("Please provide date and time") proved consistently more efficient than open-ended questioning ("What else do I need to know?"). Slot-specific approaches converge faster by explicitly naming missing fields, eliminating iterative discovery processes inherent in recursive clarification chains.
Design Principle Validation: Bounding recovery depth at 2 steps with slot-specific clarification provides optimal balance between recovery reliability (100%) and computational efficiency (35-44% reduction). Open-ended recursive chains waste tokens on repeated broad requests without improving recovery success, creating unnecessary overhead in resource-constrained scenarios. Cross-tier consistency validates this design principle scales effectively across model capacity variations.
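A minimal sketch of the bounded, slot-specific recovery loop described above, assuming a two-slot appointment task; the control flow and helper names are illustrative, not the T9 harness.

```python
# Sketch (assumed control flow) of the bounded, slot-specific clarification
# loop that T9 contrasts with open-ended recursive questioning.
MAX_FALLBACK_STEPS = 2  # deterministic depth ceiling from Table C.9.2
REQUIRED_SLOTS = ("date", "time")

def recover(filled: dict[str, str], ask) -> dict[str, str] | None:
    """Ask only for the missing slots; stop at the bounded depth."""
    for _ in range(MAX_FALLBACK_STEPS):
        missing = [s for s in REQUIRED_SLOTS if s not in filled]
        if not missing:
            return filled  # recovery complete
        # Slot-specific clarification, e.g. "Please provide date and time ..."
        reply = ask(f"Please provide {' and '.join(missing)} "
                    f"for the cardiology appointment.")
        filled.update(reply)
    return None  # bounded: never recurses beyond 2 steps

# Usage with a stubbed asker that supplies both slots on the first prompt:
print(recover({}, lambda _q: {"date": "2024-05-06", "time": "09:00"}))
```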
Note: Methodology detailed in Appendix C.0. Task: "Summarize pancreas functions in ≤60 tokens." All tiers achieved 100% completion; test validates optimal resource sufficiency principle.
Table C.10.1: Comprehensive Quantization Tier Performance Matrix
Metric | Q1 (1-bit) | Q4 (4-bit) | Q8 (8-bit) |
---|---|---|---|
Task Completion | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) | 1.00 ± 0.00 (5/5) |
95% CI | [1.00, 1.00] | [1.00, 1.00] | [1.00, 1.00] |
Avg Tokens | 131 | 114 (13% ↓) | 94 (28% ↓) |
Avg Latency (ms) | 4,285 | 1,901 (56% faster) | 1,965 (54% faster) |
Computational Overhead | Minimal (1-bit ops) | Low (4-bit ops) | High (8-bit ops, 8× per operation) |
Resource Optimization | ✅ Optimal | ✅ High (balanced) | ❌ Over-provisioned |
Constraint Compliant | ✅ Yes | ✅ Yes | ⚠️ No (unnecessary overhead) |
Adaptive Optimization | Q1→Q4 (1/5 trials) | None | None |
Edge Deployment | ✅ Maximum efficiency | ✅ High viability | ⚠️ Suboptimal (precision waste) |
Note: n=5 trials per tier. Zero variance in token counts (σ=0) indicates deterministic generation. Latency variance <20ms across all tiers.
Table C.10.2: Resource Efficiency Analysis and Deployment Verdict
Tier | Token Efficiency | Computational Overhead | Holistic Assessment | Deployment Verdict |
---|---|---|---|---|
Q1 (1-bit) | Lowest token efficiency (131 tokens) | Minimal (1-bit precision per operation) | Optimal resource sufficiency | ✅ Recommended (maximum edge efficiency) |
Q4 (4-bit) | Medium token efficiency (114 tokens, 13% reduction) | Low (4× overhead vs Q1) | Balanced efficiency-performance | ✅ Recommended (optimal for 80% tasks) |
Q8 (8-bit) | Highest token efficiency (94 tokens, 28% reduction) | High (8× overhead vs Q1) | Over-provisioned computational cost | ❌ Not recommended (token gains negated by 8× computational overhead) |
Critical Finding: Q8's 28% token reduction represents resource over-provisioning when Q1 achieves identical 100% task success. The 8× computational overhead per operation exceeds efficiency benefits of lower token count, violating minimal viable resource allocation principle.
Table C.10.3: Adaptive Optimization Logic and Cross-Tier Patterns
Optimization Pattern | Frequency | Trigger Condition | Constraint-Resilient Logic |
---|---|---|---|
Q1 maintained | 4/5 trials (80%) | Optimal baseline sufficiency | Default tier for edge deployment |
Q1→Q4 upgrade | 1/5 trials (20%) | Computational efficiency enhancement detected | Justified by 13% token reduction without violating overhead threshold |
Q1→Q8 upgrade | 0/5 trials (0%) | Never triggered | Prohibited: 8× computational overhead violates constraint-resilient principles despite 28% token gain |
Q4 maintained | 5/5 trials (100%) | Balanced efficiency achieved | Optimal for most constraint-bounded tasks |
Adaptive Philosophy: Tier upgrades justified only when computational efficiency enhancements occur without violating constraint-resilient principles. Q8's superior token count (94 vs 131) is counterproductive when 8× computational overhead per operation is considered.
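A sketch of the tier-upgrade rule this table implies, assuming compute overhead proportional to bit width and an illustrative 4× overhead threshold; both assumptions are drawn from the table's logic, not measured cost functions.

```python
# Sketch of the adaptive tier-upgrade rule described above. The overhead model
# (compute cost proportional to bit width) and the 4x threshold are assumptions
# inferred from the table's logic, not a measured cost function.
TIERS = {"Q1": 1, "Q4": 4, "Q8": 8}  # relative compute cost per operation

def should_upgrade(current: str, candidate: str,
                   token_saving_pct: float, max_overhead_ratio: float = 4.0) -> bool:
    """Upgrade only if the compute-overhead multiplier stays within the
    threshold; Q1->Q8 (8x) is always rejected under this rule."""
    overhead = TIERS[candidate] / TIERS[current]
    return token_saving_pct > 0 and overhead <= max_overhead_ratio

print(should_upgrade("Q1", "Q4", token_saving_pct=13))  # True  (the 1/5 case)
print(should_upgrade("Q1", "Q8", token_saving_pct=28))  # False (8x overhead)
```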
Statistical Notes for T10
Equivalent Task Success: All three tiers achieved 100% completion (15/15 total trials), providing categorical evidence that quantization tier selection does not compromise functional effectiveness. This validates ultra-low-bit quantization (Q1) maintains task capability without sacrificing reliability.
Counterintuitive Token Efficiency Paradox: Q8 achieved lowest token usage (94 tokens, 28% reduction from Q1) but represents resource over-provisioning because 8-bit precision operations consume 8× computational resources per operation compared to 1-bit. This demonstrates that token count alone is insufficient for resource efficiency assessment—computational overhead per operation must be evaluated.
Computational Overhead Analysis: Q1 (1-bit) requires minimal computational resources per operation; Q4 (4-bit) requires 4× computational resources vs Q1; Q8 (8-bit) requires 8× computational resources vs Q1. Despite Q8's 28% token advantage, the 8× overhead results in net over-provisioning when Q1 achieves identical task success.
Adaptive Optimization Validation: Q1→Q4 triggered in 1/5 trials (20%) when efficiency enhancement justified tier upgrade. Critically, Q1→Q8 never triggered (0/5 trials), validating that constraint-resilient logic prohibits unnecessary precision increases when lower tiers achieve equivalent outcomes.
Latency Patterns: Q4 achieved fastest processing (1,901ms) despite mid-tier precision, representing optimal balance between quantization compression and computational efficiency. Q8's slightly slower latency vs Q4 (1,965ms vs 1,901ms, 3% slower) may indicate memory bandwidth saturation with larger parameters.
Cross-Tier Consistency: Perfect token consistency (σ=0) and minimal latency variance (<20ms) demonstrate deterministic performance suitable for production deployment. The combination of 100% task completion across 15 trials and zero-variance token generation provides robust evidence despite small per-tier sample sizes.