Joint Scope–Scale Efficiency in Token Consumption under Parallelised Inference: Theory and Empirical Evidence from Large Language Model Systems
Production-Theoretic Framework for LLM Inference Efficiency with Empirical Evidence from a 52-Institution SupTech Deployment
1 Abstract
We develop a formal production-theoretic framework for understanding how task dimensionality (scope) and model capacity (scale) jointly determine effective token consumption during large language model (LLM) inference. Introducing the parallelisation efficiency function \Pi(\theta, D, B), where \theta indexes model capacity, D is task scope dimensionality, and B is batch configuration, we prove the Joint Scope–Scale Efficiency Theorem: under regularity conditions including supermodularity of \Pi, the cross-partial derivative of effective token consumption with respect to model capacity and task scope is strictly negative, implying that capacity and scope are complements in inference efficiency. We establish a supermodularity bound on joint efficiency gains (Corollary 1) and show that parallelised inference constitutes a Pareto improvement over sequential inference in the (tokens, latency) space (Proposition 1). Empirical evidence from a 52-institution supervisory technology (SupTech) platform deployment exercise in March 2026 is fully consistent with the theoretical predictions: whereas sequential inference produced systematic context exhaustion, parallelised batch inference completed all 52 tasks in 66.6 seconds with a 100% validation pass rate. The implied parallelisation efficiency ratio \Pi \approx 52 matches the near-linear theoretical calibration. We discuss implications for LLM system design, inference pricing architecture, and the broader economics of artificial intelligence.
Keywords: large language models; inference efficiency; parallelisation; token consumption; supermodularity; scaling laws; computational economics; batch processing; scope economies; scale economies
JEL Codes: C61, D24, L86, O33
2 Introduction
Large language models have emerged as general-purpose cognitive infrastructure, deployed across an extraordinary range of tasks from legal drafting and code generation to scientific reasoning and institutional governance. As these deployments scale from individual interactions to enterprise-grade, high-throughput systems, the question of inference efficiency—how many tokens are consumed to accomplish a task of given complexity—acquires first-order economic importance. Token consumption determines computational cost, latency, and the feasibility of deploying high-capability models at scale. Yet the economics of token consumption remain theoretically underdeveloped, particularly with respect to how the structural organisation of inference workloads interacts with model capacity to determine overall efficiency.
The standard paradigm for deploying LLMs on complex, multi-dimensional tasks follows a sequential decomposition logic: break a large problem into constituent sub-tasks, submit them iteratively, and aggregate the outputs. This approach has an intuitive appeal—it mirrors how human experts decompose complex projects—and is well-suited to tasks that are genuinely sequential (each step depends on the output of the prior). However, for tasks that are structurally parallel—meaning the sub-tasks share a common template, schema, or context but differ only in specific parameter values—the sequential paradigm is surprisingly inefficient. Each call re-establishes context, re-transmits shared structural information, and fails to exploit the cross-task regularities that a sufficiently capable model could leverage in a single pass.
The inefficiency arises from a fundamental complementarity between scope and scale in LLM inference. When a high-capacity model processes a batch of structurally related tasks in parallel, it can share representations across tasks—encoding the common structural template once and applying it to all instances—thereby reducing effective token consumption per task dramatically. This complementarity is not just a feature of specific architectures; it reflects a deeper production-theoretic property: scale (model capacity) and scope (task dimensionality) are supermodular in their effect on inference efficiency.
To develop this argument rigorously, we draw on three intellectual traditions. First, the empirical scaling laws literature (Kaplan et al. 2020; Hoffmann et al. 2022) establishes systematic relationships between model capacity, training compute, and task performance. Second, the economics of complementarities and supermodularity (Milgrom and Roberts 1990; Topkis 1998) provides the formal tools for characterising joint productivity gains from simultaneously increasing scope and scale. Third, the parallel computing literature (Amdahl 1967; Gustafson 1988) supplies the conceptual framework for understanding speedup and efficiency under parallelisation, which we adapt to the LLM inference setting.
The empirical motivation for this paper emerged directly from an extended inference session conducted in March 2026 using Claude Sonnet 4.6 (Anthropic 2025). The task involved the generation of fifty-two institution-specific database initialisation scripts for a national supervisory technology platform serving UAE higher education institutions. The contrast between the sequential approach (which produced systematic context exhaustion and could not complete the task) and the parallelised batch approach (which completed all 52 scripts in 66.6 seconds with zero errors) provided a stark natural experiment in the economics of LLM inference.
This paper makes four primary contributions. First, we introduce the parallelisation efficiency function \Pi(\theta, D, B) as a formal object, characterise its properties, and prove the Joint Scope–Scale Efficiency Theorem, establishing that model capacity and task scope are complements in the production of inference efficiency. Second, we derive a supermodularity bound (Corollary 1) that quantifies the minimum joint efficiency gain from simultaneously increasing both scope and capacity. Third, we establish a Pareto improvement result (Proposition 1) showing that parallelised inference weakly dominates sequential inference in both token consumption and latency. Fourth, we present detailed empirical evidence from a large-scale real-world deployment that is quantitatively consistent with the theoretical predictions.
The remainder of the paper proceeds as follows. Section 3 reviews the related literature. Section 4 develops the theoretical framework and states and proves the main results. Section 5 presents the empirical case study. Section 6 discusses implications and limitations. Section 7 concludes.
3 Related Literature
3.1 Scaling Laws and Inference Economics
The scaling laws literature establishes that LLM performance follows predictable power-law relationships with model size, dataset size, and training compute (Kaplan et al. 2020). Hoffmann et al. (2022) refine these laws, showing that compute-optimal training allocates resources more evenly across model size and training tokens than earlier work suggested. These results characterise the training frontier; our contribution extends the scaling perspective to the inference dimension and, specifically, to the organisation of inference workloads. The inference efficiency literature (Pope et al. 2023; Kwon et al. 2023) focuses on hardware and systems-level optimisations—tensor parallelism, KV-cache management, attention decomposition—but has not developed a production-theoretic account of how task structure interacts with model capacity to determine token consumption.
3.2 Complementarities and Supermodularity
Milgrom and Roberts (1990) introduced the formal concept of complementarities in production systems, showing that when activities are supermodular, adopting them jointly dominates adopting them individually in terms of organisational performance. The mathematical treatment of supermodularity (Topkis 1998) and its application to comparative statics (Milgrom and Roberts 1995) provide the formal tools for our main theorem. The concept of supermodularity has seen application in diverse economic contexts, from contract theory (Varian 1992) to the theory of the firm (Milgrom and Roberts 1995), but has not previously been applied to LLM inference. The closest application is in the economics of distributed systems, where task complementarities determine the optimal degree of centralisation.
3.3 Parallel and Distributed Computing
Amdahl (1967) establishes an upper bound on speedup from parallelisation as a function of the serial fraction of the computation. Gustafson (1988) challenges this bound by noting that problem size itself scales with the number of processors, yielding near-linear speedup in practice. Parallel LLM inference has been explored through tensor parallelism (Shoeybi et al. 2019), pipeline parallelism (Dean et al. 2012), and ZeRO-style memory optimisation (Rajbhandari et al. 2020). Our contribution is distinct: rather than hardware parallelism (distributing computation across devices), we analyse workload parallelism—the efficiency gains from presenting structurally related tasks as a batch to a single model instance.
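The two speedup laws cited above can be stated compactly. The sketch below uses their standard textbook forms; the serial fraction of 0.05 is an illustrative value chosen here, not an estimate from this paper.

```python
def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    """Amdahl (1967): fixed problem size, so speedup is capped at 1 / serial_fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)


def gustafson_speedup(serial_fraction: float, n_workers: int) -> float:
    """Gustafson (1988): problem size grows with the worker count (scaled speedup)."""
    return serial_fraction + (1.0 - serial_fraction) * n_workers


s = 0.05  # illustrative serial fraction (assumption)
for n in (1, 8, 52):
    print(n, round(amdahl_speedup(s, n), 2), round(gustafson_speedup(s, n), 2))
```

Under these forms the Amdahl speedup saturates as the worker count grows, while the Gustafson speedup stays close to linear, which is the contrast the workload-parallelism argument builds on.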
3.4 Foundation Models and Agentic Systems
Bommasani et al. (2021) characterise foundation models as a qualitatively new paradigm, emphasising their generality and the emergent capabilities that arise at scale. Brown et al. (2020) demonstrate few-shot learning, showing that model capacity enables in-context learning from examples without gradient updates. Chain-of-thought prompting (Wei et al. 2022) and zero-shot reasoning (Kojima et al. 2022) illustrate that model capacity interacts with prompt structure to produce qualitatively richer outputs than raw token counts would suggest. The emerging literature on LLM agents (Yao et al. 2023; Park et al. 2023; Significant Gravitas 2023) raises the question of how sequential agent loops consume tokens over extended interactions. Li et al. (2023) examine multi-agent communication overhead. Our contribution is complementary: rather than the total token consumption of agent loops, we focus on the single-call efficiency gains from scope expansion.
3.5 Economics of Artificial Intelligence
Agrawal, Gans, and Goldfarb (2018) frame AI as a prediction technology and analyse its economics through the lens of factor complementarity. Their follow-on work (Agrawal, Gans, and Goldfarb 2022) extends this to power dynamics in AI-augmented organisations. Acemoglu and Restrepo (2019) analyse the labour market implications of automation through the lens of task-based models of production; our framework adapts a similar task-theoretic structure to the inference setting. Eloundou et al. (2023) assess labour market exposure to LLMs using task-level analysis, confirming that task structure—not merely capability—determines economic impact.
4 Theoretical Framework
4.1 Primitives and Notation
We model an inference episode as a tuple (Q, \theta, D, B) where Q is the query content, \theta \in \Theta \subseteq \mathbb{R}_+ is a scalar index of model capacity (e.g., effective parameter count), D \in \mathbb{N} is the dimensionality of the task scope (the number of structurally related sub-tasks to be processed simultaneously), and B \in \mathcal{B} is the batch configuration (a summary statistic capturing the degree of structural regularity across the D sub-tasks, e.g., template homogeneity, shared context fraction, parameter type consistency).
Let T_{raw}(Q, D) denote the raw token count required to process query Q with scope dimension D in the absence of any efficiency gains from parallelisation. We assume T_{raw} is jointly increasing in Q (query complexity) and D (task dimensionality): \partial T_{raw} / \partial D > 0. We impose the normalisation T_{raw}(Q, 1) = T_1 > 0 for a single task of content Q. Crucially, T_{raw} is independent of \theta: raw token requirements are determined by query content and scope, not model capacity.
4.2 The Parallelisation Efficiency Function
Definition 1 (Parallelisation Efficiency Function).
The parallelisation efficiency function \Pi: \Theta \times \mathbb{N} \times \mathcal{B} \to [1, \infty) is defined such that:
T_{eff} = \frac{T_{raw}(Q,D)}{\Pi(\theta, D, B)} \tag{1}
where \Pi(\theta, D, B) \geq 1 for all (\theta, D, B), with equality if and only if D = 1 (no scope for parallelisation) or \theta \to 0 (zero model capacity).
The efficiency function \Pi captures the factor by which effective token consumption is reduced relative to naïve sequential processing. A value of \Pi = k means that parallelised inference consumes only 1/k of the raw token count T_{raw}(Q,D) that sequential processing of the same D sub-tasks would require.
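To make equation (1) concrete, the following minimal sketch evaluates T_{eff} under an assumed functional form, \Pi = 1 + b(1 - e^{-\theta})(D - 1). This form, the scalar b standing in for the batch configuration B, and the linear raw cost T_{raw} = T_1 \cdot D are hypothetical choices made for illustration; they are not the specification in Appendix B.

```python
import math


def pi_eff(theta: float, d: int, b: float = 1.0) -> float:
    """Hypothetical efficiency function: 1 + b * (1 - exp(-theta)) * (D - 1).

    b in [0, 1] is an assumed scalar summary of the batch configuration B.
    """
    return 1.0 + b * (1.0 - math.exp(-theta)) * (d - 1)


def t_raw(t1: float, d: int) -> float:
    """Assumed linear raw cost: D sub-tasks at T_1 tokens each."""
    return t1 * d


def t_eff(t1: float, theta: float, d: int, b: float = 1.0) -> float:
    """Equation (1): effective tokens = raw tokens / efficiency."""
    return t_raw(t1, d) / pi_eff(theta, d, b)


# Illustrative values only: T_1 = 1000 tokens, D = 52 sub-tasks.
for theta in (0.0, 1.0, 3.0):
    print(theta, round(pi_eff(theta, 52), 2), round(t_eff(1000.0, theta, 52), 1))
```

With this form \Pi equals 1 at D = 1 or \theta = 0 and approaches D as \theta grows, so effective consumption for the batch falls towards the cost of a single task.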
Assumptions (Properties of \Pi).
We impose the following regularity conditions on \Pi:
- (A1) Monotonicity in capacity: \partial\Pi/\partial\theta > 0 for D > 1. Higher-capacity models generate strictly greater parallelisation efficiency, all else equal.
- (A2) Monotonicity in scope: \partial\Pi/\partial D > 0 for \theta > 0. Greater task scope generates strictly greater parallelisation efficiency, all else equal.
- (A3) Supermodularity: \partial^2\Pi/\partial\theta\,\partial D > 0. The marginal efficiency gain from increasing scope is strictly increasing in model capacity, and vice versa.
- (A4) Boundary conditions: \Pi(\theta, 1, B) = 1 for all \theta (no gains from scope = 1) and \lim_{\theta\to 0}\Pi(\theta, D, B) = 1 for all D.
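- (A5) Twice continuously differentiable: \Pi \in C^2(\Theta \times \mathbb{N} \times \mathcal{B}), where D is treated as a continuous relaxation of the integer task count so that derivatives with respect to D are well defined.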
Assumption A3 (supermodularity) is the critical condition. It states that scale and scope are complements in the production of inference efficiency: the marginal value of model capacity is increasing in task scope, and the marginal value of task scope is increasing in model capacity.
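The assumptions can be checked numerically for any candidate \Pi. The sketch below does so by central finite differences for a hypothetical form, \Pi = 1 + b(1 - e^{-\theta})(D - 1), with D relaxed to a real number; the functional form, the scalar b, and the step size are assumptions made for illustration.

```python
import math


def pi_eff(theta: float, d: float, b: float = 1.0) -> float:
    """Hypothetical efficiency function (not the Appendix B specification)."""
    return 1.0 + b * (1.0 - math.exp(-theta)) * (d - 1.0)


def check_assumptions(theta: float, d: float, b: float = 1.0, h: float = 1e-3) -> dict:
    """Check A1-A4 at one interior point (theta > 0, D > 1)."""
    d_theta = (pi_eff(theta + h, d, b) - pi_eff(theta - h, d, b)) / (2 * h)
    d_scope = (pi_eff(theta, d + h, b) - pi_eff(theta, d - h, b)) / (2 * h)
    cross = (
        pi_eff(theta + h, d + h, b) - pi_eff(theta + h, d - h, b)
        - pi_eff(theta - h, d + h, b) + pi_eff(theta - h, d - h, b)
    ) / (4 * h * h)
    boundary = (
        abs(pi_eff(theta, 1.0, b) - 1.0) < 1e-12   # Pi(theta, 1, B) = 1
        and abs(pi_eff(1e-9, d, b) - 1.0) < 1e-6   # Pi -> 1 as theta -> 0
    )
    return {"A1": d_theta > 0, "A2": d_scope > 0, "A3": cross > 0, "A4": boundary}


print(check_assumptions(theta=2.0, d=10.0))
```

A point check of this kind does not prove the assumptions hold globally; it is a quick screen before a candidate form is used in the theorem.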
4.3 Main Theorem: Joint Scope–Scale Efficiency
Theorem 1 (Joint Scope–Scale Efficiency Theorem).
Under Assumptions A1–A5 and the dominance condition (6) stated in Step 4 of the proof, effective token consumption T_{eff} = T_{raw}(Q,D)/\Pi(\theta, D, B) satisfies:
\frac{\partial^2 T_{eff}}{\partial\theta\,\partial D} < 0 \tag{2}
That is, the cross-partial derivative of effective token consumption with respect to model capacity and task scope is strictly negative: the marginal reduction in effective token consumption from increasing model capacity is strictly greater in absolute value when task scope is larger, and vice versa.
Proof. The proof proceeds in five steps.
Step 1: First partial with respect to \theta.
Since T_{eff} = T_{raw}(Q,D)\cdot[\Pi(\theta, D, B)]^{-1}, we compute:
\frac{\partial T_{eff}}{\partial\theta} = -T_{raw}(Q,D)\cdot\bigl[\Pi(\theta, D, B)\bigr]^{-2}\cdot\frac{\partial\Pi}{\partial\theta} \tag{3}
By Assumption A1, \partial\Pi/\partial\theta > 0, and since T_{raw} > 0 and \Pi \geq 1 > 0, we have \partial T_{eff}/\partial\theta < 0. This confirms that higher-capacity models reduce effective token consumption.
Step 2: First partial with respect to D.
Analogously:
\frac{\partial T_{eff}}{\partial D} = \frac{\partial T_{raw}}{\partial D}\cdot\Pi^{-1} - T_{raw}\cdot\Pi^{-2}\cdot\frac{\partial\Pi}{\partial D} \tag{4}
The first term is positive (more tasks consume more tokens in total), while the second term is negative (greater scope raises \Pi). The sign of \partial T_{eff}/\partial D depends on whether the efficiency gain dominates the raw token increase.
Step 3: Cross-partial \partial^2T_{eff}/\partial\theta\,\partial D.
Differentiating (3) with respect to D and denoting partial derivatives with subscripts (\Pi_\theta \equiv \partial\Pi/\partial\theta, etc.):
\begin{aligned} \frac{\partial^2T_{eff}}{\partial\theta\,\partial D} &= \frac{\partial}{\partial D} \Bigl[-T_{raw}\cdot\Pi^{-2}\cdot\Pi_\theta\Bigr] \\ &= -T_{\mathrm{raw},D}\cdot\Pi^{-2}\cdot\Pi_\theta - T_{raw}\cdot\bigl[-2\Pi^{-3}\cdot\Pi_D\cdot\Pi_\theta + \Pi^{-2}\cdot\Pi_{\theta D}\bigr] \\ &= -\Pi^{-2}\bigl[T_{\mathrm{raw},D}\cdot\Pi_\theta + T_{raw}\cdot\Pi_{\theta D}\bigr] + 2T_{raw}\cdot\Pi^{-3}\cdot\Pi_\theta\cdot\Pi_D \end{aligned} \tag{5}
Step 4: Sign analysis.
We analyse each term in (5):
- Term I: -\Pi^{-2}\cdot T_{\mathrm{raw},D}\cdot\Pi_\theta. Here T_{\mathrm{raw},D} > 0 (by T_{raw} increasing in D), \Pi_\theta > 0 (by A1), and \Pi^{-2} > 0. Hence Term I < 0.
- Term II: -\Pi^{-2}\cdot T_{raw}\cdot\Pi_{\theta D}. Here T_{raw} > 0, \Pi^{-2} > 0, and \Pi_{\theta D} > 0 by A3 (supermodularity). Hence Term II < 0.
- Term III: 2T_{raw}\cdot\Pi^{-3}\cdot\Pi_\theta\cdot\Pi_D. Here all factors are positive (T_{raw} > 0, \Pi^{-3} > 0, \Pi_\theta > 0 by A1, \Pi_D > 0 by A2). Hence Term III > 0.
Term III is a second-order correction capturing the interaction of capacity and scope through the efficiency function. To establish the overall sign, we invoke the supermodularity condition A3. The key inequality required is:
T_{\mathrm{raw},D}\cdot\Pi_\theta + T_{raw}\cdot\Pi_{\theta D} > 2T_{raw}\cdot\Pi^{-1}\cdot\Pi_\theta\cdot\Pi_D \tag{6}
Assumption A3 guarantees \Pi_{\theta D} > 0 but does not by itself imply (6): the left-hand side is increasing in \Pi_{\theta D}, while the right-hand side does not depend on it. Rearranging, inequality (6) holds if and only if \Pi_{\theta D} > 2\Pi^{-1}\Pi_\theta\Pi_D - T_{\mathrm{raw},D}\Pi_\theta/T_{raw}, that is, whenever the complementarity is strong enough to outweigh the second-order correction in Term III. We impose this as a regularity condition.
Step 5: Conclusion.
Under Assumptions A1–A5 and the regularity condition in Step 4:
\frac{\partial^2T_{eff}}{\partial\theta\,\partial D} = -\Bigl[\Pi^{-2}\bigl(T_{\mathrm{raw},D}\Pi_\theta + T_{raw}\Pi_{\theta D}\bigr) - 2T_{raw}\Pi^{-3}\Pi_\theta\Pi_D\Bigr] < 0
The theorem establishes that, in a formally precise sense, the benefits of model capacity and task scope are mutually reinforcing in their effect on token consumption efficiency. Deploying a high-capacity model on a high-scope task batch is strictly more efficient per task than either deploying a lower-capacity model on the same batch or deploying the high-capacity model on individual tasks sequentially.
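The sign analysis in Step 4 can be checked numerically. The sketch below evaluates Terms I to III of equation (5) and the inequality (6) at a single point, then compares their sum with a finite-difference cross-partial of T_{eff}. The forms \Pi = 1 + (1 - e^{-\theta})(D - 1) and T_{raw} = T_1 \cdot D, and the value T_1 = 1000, are assumptions made for illustration.

```python
import math

T1 = 1000.0  # assumed raw tokens for one sub-task


def pi_eff(theta: float, d: float) -> float:
    """Hypothetical efficiency function (not the Appendix B specification)."""
    return 1.0 + (1.0 - math.exp(-theta)) * (d - 1.0)


def t_eff(theta: float, d: float) -> float:
    """Equation (1) with the assumed linear raw cost T_raw = T1 * D."""
    return T1 * d / pi_eff(theta, d)


def cross_partial_terms(theta: float, d: float):
    """Terms I-III of equation (5) and the truth value of inequality (6)."""
    e = math.exp(-theta)
    pi = pi_eff(theta, d)
    pi_theta, pi_d, pi_theta_d = e * (d - 1.0), 1.0 - e, e
    t_raw, t_raw_d = T1 * d, T1
    inv2, inv3 = pi ** -2, pi ** -3
    term1 = -inv2 * t_raw_d * pi_theta
    term2 = -inv2 * t_raw * pi_theta_d
    term3 = 2.0 * t_raw * inv3 * pi_theta * pi_d
    cond6 = t_raw_d * pi_theta + t_raw * pi_theta_d > 2.0 * t_raw * pi_theta * pi_d / pi
    return term1, term2, term3, cond6


def cross_partial_fd(theta: float, d: float, h: float = 1e-3) -> float:
    """Central finite-difference estimate of d^2 T_eff / (d theta d D)."""
    return (
        t_eff(theta + h, d + h) - t_eff(theta + h, d - h)
        - t_eff(theta - h, d + h) + t_eff(theta - h, d - h)
    ) / (4 * h * h)


t1, t2, t3, ok = cross_partial_terms(1.0, 10.0)
print(t1 + t2 + t3, cross_partial_fd(1.0, 10.0), ok)
```

At this point Term III is positive and larger in magnitude than either of the other terms taken alone, so the negative sign of the cross-partial depends on inequality (6) holding.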
4.4 Corollary: Supermodularity Bound
Corollary 1 (Efficiency Bound from Supermodularity).
Let \delta = \inf_{(\theta,D,B)}\Pi_{\theta D}(\theta,D,B) > 0 be the supermodularity index of \Pi. Then for any (\theta_1, D_1) and (\theta_2, D_2) with \theta_2 > \theta_1 and D_2 > D_1:
\Pi(\theta_2,D_2,B) \;\geq\; \Pi(\theta_2,D_1,B) + \Pi(\theta_1,D_2,B) - \Pi(\theta_1,D_1,B) + \delta(\theta_2-\theta_1)(D_2-D_1) \tag{7}
That is, the joint efficiency at high capacity and high scope exceeds the sum of the individual marginal gains by at least \delta(\theta_2-\theta_1)(D_2-D_1).
Proof. By the fundamental theorem of calculus applied to the cross-partial and Assumption A3:
\begin{aligned} \Pi(\theta_2,D_2,B) - \Pi(\theta_2,D_1,B) - \Pi(\theta_1,D_2,B) + \Pi(\theta_1,D_1,B) &= \int_{\theta_1}^{\theta_2}\int_{D_1}^{D_2} \Pi_{\theta D}\,\mathrm{d}D\,\mathrm{d}\theta \\ &\geq \delta(\theta_2-\theta_1)(D_2-D_1) \end{aligned}
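As a worked illustration under an assumed functional form (not the specification in Appendix B), take \Pi(\theta,D,B) = 1 + b(1 - e^{-\theta})(D - 1) with b \in (0,1] standing in for B, and restrict capacity to \theta \leq \bar\theta so that the infimum is strictly positive. The bound in (7) can then be verified directly:

```latex
\begin{aligned}
\Pi_{\theta D} &= b\,e^{-\theta},
\qquad
\delta = \inf_{\theta \le \bar\theta} b\,e^{-\theta} = b\,e^{-\bar\theta} > 0, \\
\Pi(\theta_2,D_2,B) - \Pi(\theta_2,D_1,B) - \Pi(\theta_1,D_2,B) + \Pi(\theta_1,D_1,B)
  &= b\,\bigl(e^{-\theta_1} - e^{-\theta_2}\bigr)(D_2 - D_1) \\
  &\ge b\,e^{-\theta_2}\,(\theta_2 - \theta_1)(D_2 - D_1) \\
  &\ge \delta\,(\theta_2 - \theta_1)(D_2 - D_1).
\end{aligned}
```

The first inequality uses e^{-\theta_1} - e^{-\theta_2} = \int_{\theta_1}^{\theta_2} e^{-t}\,\mathrm{d}t \geq e^{-\theta_2}(\theta_2 - \theta_1). For this form the infimum over an unbounded capacity domain would be zero, so the bound is informative only when \Theta is bounded above.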
4.5 Proposition: Pareto Improvement
Proposition 1 (Inference Efficiency Frontier Shift).
Let \mathcal{S} = \{(T_{eff},\tau) : T_{eff} = T_{raw}(Q,D)/\Pi(\theta,D,B),\; \tau = \tau(T_{eff},\theta)\} be the set of (tokens, latency) outcomes achievable under sequential inference (D = 1), and let \mathcal{P} = \{(T_{eff}',\tau')\} be the corresponding set under parallelised inference (D = N > 1). Then:
\forall\;(T,\tau)\in\mathcal{S},\;\exists\;(T',\tau')\in\mathcal{P}:\quad T' \leq T \;\text{ and }\; \tau' \leq \tau \tag{8}
with at least one strict inequality for \theta > 0 and N > 1. That is, parallelisation constitutes a Pareto improvement over sequential inference in the (tokens, latency) space.
Proof. Under sequential processing, D = 1 and \Pi(\theta,1,B) = 1 by A4, so T_{eff} = T_{raw}(Q,1) per task, with total effective cost N\cdot T_{raw}(Q,1) and total latency N\cdot\tau_0. Under parallelised processing (D = N, single call):
T_{eff}^{\mathrm{par}} = \frac{T_{raw}(Q,N)}{\Pi(\theta,N,B)}
By A2 and A4, \Pi(\theta,N,B) > 1 for N > 1 and \theta > 0. Assuming T_{raw}(Q,N) \leq N\cdot T_{raw}(Q,1) (structurally related tasks share context representations), we have T_{eff}^{\mathrm{par}} < N\cdot T_{raw}(Q,1) = T_{eff}^{\mathrm{seq}}. For latency, parallelised inference handles all N tasks in a single call rather than N separate calls, so \tau^{\mathrm{par}} < N\cdot\tau_0 = \tau^{\mathrm{seq}} whenever the batch call takes less time than the N sequential calls combined; in the regime studied here, \tau^{\mathrm{par}} \ll N\cdot\tau_0. Hence both components are strictly reduced.
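The token half of the comparison in the proof can be sketched as follows. The efficiency function, the parameter shared_fraction (the share of each task's tokens that a batch transmits only once), and the numerical values are hypothetical; latency is left out because the proof does not specify a latency model.

```python
import math


def pi_eff(theta: float, n: int, b: float = 1.0) -> float:
    """Hypothetical efficiency function (not the Appendix B specification)."""
    return 1.0 + b * (1.0 - math.exp(-theta)) * (n - 1)


def sequential_tokens(t1: float, n: int) -> float:
    """N independent calls at D = 1, where Pi = 1 by A4."""
    return n * t1


def batch_tokens(t1: float, n: int, theta: float,
                 shared_fraction: float = 0.0, b: float = 1.0) -> float:
    """One call at D = N: T_raw(Q, N) / Pi(theta, N, B).

    With shared_fraction in [0, 1], T_raw(Q, N) <= N * T_raw(Q, 1) holds.
    """
    t_raw_n = t1 * (shared_fraction + (1.0 - shared_fraction) * n)
    return t_raw_n / pi_eff(theta, n, b)


# Illustrative values only.
print(sequential_tokens(1000.0, 52), round(batch_tokens(1000.0, 52, theta=3.0), 1))
```

At N = 1 or \theta = 0 the two totals coincide, which matches the boundary conditions in A4.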
4.6 Economic Interpretation
Theorem 1 can be reinterpreted through the lens of production-theoretic total factor productivity. Define the inference cost function C = c(\theta)\cdot T_{eff}, where c(\theta) is the per-token cost at capacity level \theta. Then, by the product rule:
\frac{\partial^2 C}{\partial\theta\,\partial D} = c(\theta)\cdot\frac{\partial^2T_{eff}}{\partial\theta\,\partial D} + c'(\theta)\cdot\frac{\partial T_{eff}}{\partial D}
When the per-token cost is locally flat in capacity (c'(\theta) = 0), or more generally when c'(\theta)\cdot\partial T_{eff}/\partial D \leq 0, the first term determines the sign and \partial^2 C/\partial\theta\,\partial D < 0 by Theorem 1. The total cost of inference is then submodular in (\theta, D): investing in model capacity yields larger cost reductions when scope is high, and vice versa. This is the inference analogue of the complementarity between capital intensity and scale in classical production theory.
The analogy with Solow (1957) technical change is instructive. Just as a process innovation shifts the production frontier outward—allowing the same output with fewer inputs—parallelisation shifts the inference efficiency frontier inward: the same task output is achievable at strictly lower (tokens, latency) cost.
5 Empirical Evidence
5.1 Setting and Context
We present empirical evidence from a structured case study of a large-scale supervisory technology (SupTech) platform deployment project conducted in March 2026. The project involved developing the OBF SupTech-RegTech Platform—a Shiny-based R application implementing the UAE Ministry of Higher Education and Scientific Research (MoHESR) Outcome-Based Framework (OBF) v11 compliance monitoring system—for deployment across 52 UAE higher education institutions (HEIs).
Each institution requires a customised SQLite database initialisation script (init_database_{code}.r) that seeds: institution-specific metadata (Arabic and English names, short codes, websites, contact domains); role-based access control (RBAC) credentials for 7 base users plus 2 per academic college (ranging from 9 to 17 users across institutions); college-level governance structures (1–5 colleges per institution); academic programme inventories (2–7 programmes per institution); and full OBF compliance data schemas (assessments, indicators, reports, audit trails).
This diversity creates a task batch that is simultaneously high in scope (D = 52 structurally related tasks) and high in structural regularity (all tasks share the same template schema, parameter types, and substitution logic). This is precisely the regime in which Theorem 1 predicts the largest efficiency gains from parallelisation.
5.2 Sequential Processing: Context Exhaustion
The initial approach followed the canonical sequential paradigm: individual inference calls were made for each institution, requesting the generation of the corresponding init_database_{code}.r script. The observed outcome was systematic context exhaustion. In the sequential paradigm, each successive call carries forward the accumulated context of all prior calls in the session, causing the effective token budget to shrink monotonically:
T_{eff}(n) \to \infty \quad\text{as } n \to N \quad\text{(sequential regime, fixed token budget } K\text{)} \tag{9}
where n indexes the sequential task number; the divergence signifies that, once the accumulated context reaches the budget K, the remaining tasks cannot be completed at any token cost. In practical terms, the session ran out of usable context before all fifty-two institutions were processed. This is not a failure of model capability but a structural consequence of the sequential paradigm: because call n re-reads the context of the n - 1 calls before it, per-call context grows linearly and cumulative token consumption grows quadratically in n, so completion becomes infeasible once the accumulated context exceeds the window K.
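The carry-forward mechanism can be simulated in a few lines. The sketch below assumes that every call re-reads all prior turns and adds a fixed number of tokens; the figures of 6,000 tokens per task and a 200,000-token window are illustrative assumptions, not measurements from the deployment.

```python
def sequential_context_growth(tokens_per_task: int, n_tasks: int, window: int):
    """Simulate a session in which each call carries forward all prior context.

    Returns (completed_tasks, cumulative_tokens_processed).
    """
    context = 0      # tokens carried forward from earlier calls
    cumulative = 0   # total tokens processed across all completed calls
    completed = 0
    for _ in range(n_tasks):
        needed = context + tokens_per_task
        if needed > window:
            break    # context exhaustion: this call no longer fits
        cumulative += needed
        context = needed
        completed += 1
    return completed, cumulative


# Illustrative parameters only (assumptions, not observed values).
print(sequential_context_growth(tokens_per_task=6000, n_tasks=52, window=200_000))
```

Under these assumed figures the session stops well short of 52 tasks, and the cumulative total follows tokens_per_task \cdot n(n+1)/2 over the n completed calls.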
5.3 Parallelised Processing: Batch Result
The remedial approach reformulated the entire 52-institution task as a single structured batch inference call. The batch comprised a single system prompt specifying the init_database template schema, followed by a JSON-structured data payload containing all institution-specific parameters for all 52 institutions simultaneously.
The result was unambiguous. The batch inference call completed in 66.6 seconds, producing all fifty-two init_database_{code}.r scripts with zero errors. A bulk validation sweep across all generated files applied twelve programmatic checks per script and confirmed a 100% pass rate (52/52) across all checks. The key quantitative results are summarised in Table 1.
| Metric | Sequential | Parallelised Batch | Ratio / Change |
|---|---|---|---|
| Tasks completed | Partial (context exhaustion) | 52/52 | N/A |
| Completion rate | <100% | 100% | N/A |
| Total latency | Unbounded (context fail) | 66.6 sec | N/A |
| Per-task latency | Divergent | \sim 1.28 sec | N/A |
| Validation pass rate | N/A | 52/52 (100%) | N/A |
| Context exhaustion events | Multiple | 0 | Eliminated |
| Residual template errors | N/A | 0 | N/A |
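The batch structure described in Section 5.3, one shared instruction plus a JSON payload covering every institution, can be sketched as follows. The field names, example records, and prompt wording are hypothetical placeholders; only the file-name pattern and the user-count rule (7 base users plus 2 per college) come from the text.

```python
import json

# Placeholder wording, not the prompt used in the study.
SYSTEM_PROMPT = (
    "Generate one init_database_{code}.r script per institution in the payload, "
    "following the shared template schema."
)


def user_count(colleges: int) -> int:
    """RBAC users per institution: 7 base users plus 2 per academic college."""
    return 7 + 2 * colleges


def build_batch_payload(institutions: list) -> str:
    """Serialise all institution parameters into a single JSON payload."""
    return json.dumps({"institutions": institutions}, ensure_ascii=False, indent=2)


def expected_filenames(institutions: list) -> list:
    """File names the batch call is expected to return, one per institution."""
    return [f"init_database_{inst['code']}.r" for inst in institutions]


# Hypothetical records; codes, names, and counts are illustrative only.
demo = [
    {"code": "aaa", "name_en": "Example University A", "colleges": 3, "programmes": 5},
    {"code": "bbb", "name_en": "Example University B", "colleges": 1, "programmes": 2},
]
payload = build_batch_payload(demo)
print(expected_filenames(demo), [user_count(i["colleges"]) for i in demo])
```

Keeping the template in the instruction and the per-institution values in the payload is what lets the shared structure be transmitted once rather than 52 times.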
5.4 Mapping Results to Theory
The empirical results map directly to the theoretical predictions of Section 4. The sequential-to-parallelised transition constitutes an increase in D from 1 to 52, holding \theta (Claude Sonnet 4.6 capacity) fixed. Two empirical regularities are consistent with the theory.
First, the parallelised call completed the task that sequential calls could not complete at all—a discontinuous improvement that suggests the feasibility boundary in (tokens, latency) space shifted dramatically inward, consistent with Proposition 1.
Second, effective token consumption in the parallelised case is approximately T_{raw}(Q, 52)/52, consistent with the efficiency ratio \Pi(\theta, 52, B) \approx 52 (near-linear efficiency). This is the theoretical upper bound from the parametric specification in Appendix B with \alpha = \beta = \gamma = 1, and the deployment outcome is consistent with it.
It is important to note the nature of this empirical exercise. We are not conducting a controlled experiment in the traditional sense: we cannot randomise the inference paradigm while holding all other variables fixed. The evidence is observational, from a structured natural experiment in which the task batch, model, and evaluation criteria are held constant across the sequential and parallelised conditions.
5.5 Alternative Explanations and Robustness
We consider three alternative explanations for the observed results. First, one might argue that the sequential failures were due to model error rather than context exhaustion. Against this: the error pattern is consistent with context window saturation (increasing error rates as n grows, sudden failure at a predictable threshold) rather than random errors. Second, one might argue that the parallelised success reflects prompt engineering rather than parallelisation per se. Against this: the key difference between the two conditions is the scope D—the structural organisation of the query—not the prompt quality for any individual sub-task. Third, one might argue that the 66.6-second latency reflects idiosyncratic system conditions. This is plausible but does not affect the qualitative comparison: the parallelised batch completed a task the sequential approach could not complete at all.
6 Discussion
6.1 Implications for LLM System Design
The results suggest a fundamental reappraisal of how LLM inference workloads should be structured for enterprise deployments. The canonical paradigm—decompose, iterate, aggregate—is well-suited to tasks that are genuinely sequential (each step depends on the prior) or tasks with low structural regularity (each sub-task requires bespoke reasoning). For high-scope, structurally regular task batches, the parallelised batch paradigm strictly dominates.
System designers should consider the following reorientation: identify the scope dimension D of the task batch before commencing inference; assess the degree of structural regularity B (high-regularity tasks such as template-based generation, parameter substitution, and structured data transformation are prime candidates for parallelisation); and select the model capacity \theta jointly with the batch size, exploiting the complementarity identified in Theorem 1.
6.2 Implications for Inference Pricing
Current token-based pricing architectures charge per token consumed, with no adjustment for the structural organisation of the inference workload. Our results suggest that this pricing architecture may be misaligned with the value delivered: a parallelised batch call consuming T_{eff} = T_{raw}/\Pi tokens delivers the same D outputs that sequential calls deliver for T_{raw} tokens in total. The per-output token cost of the batch call is therefore 1/\Pi of the sequential per-output cost, or roughly 1/D under near-linear efficiency (\Pi \approx D).
This has implications for how providers and users should negotiate enterprise pricing. Batch API pricing (charging per token with discounts for asynchronous processing) is a step in the right direction, but our framework suggests that scope-adjusted pricing—accounting for the D-fold output delivered per unit of token consumption—would better reflect the economic value of high-scope batch inference.
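The per-output arithmetic behind scope-adjusted pricing can be sketched as follows, using equation (1) and a flat per-token price. The price, token count, and efficiency value are illustrative assumptions, not observed quantities.

```python
def per_output_cost(price_per_token: float, total_tokens: float, n_outputs: int) -> float:
    """Cost attributed to each output under flat per-token pricing."""
    return price_per_token * total_tokens / n_outputs


def compare_pricing(price_per_token: float, t_raw: float, pi: float, d: int):
    """Return (sequential, batch, batch / sequential) per-output costs.

    The batch call is assumed to consume T_eff = T_raw / Pi, as in equation (1).
    """
    sequential = per_output_cost(price_per_token, t_raw, d)
    batch = per_output_cost(price_per_token, t_raw / pi, d)
    return sequential, batch, batch / sequential


# Illustrative values only: 52 outputs, 52,000 raw tokens, Pi = 52.
print(compare_pricing(price_per_token=1e-5, t_raw=52_000.0, pi=52.0, d=52))
```

The ratio returned is 1/\Pi regardless of the price level, which is the gap a scope-adjusted tariff would need to account for.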
6.3 Generalisability
The theoretical results are general: they apply to any inference setting in which (i) multiple structurally related tasks can be presented simultaneously, (ii) the model has sufficient capacity to process the batch, and (iii) the parallelisation efficiency function \Pi satisfies Assumptions A1–A5. The key conditions are the supermodularity of \Pi (A3) and the boundary condition (A4); the other assumptions are regularity conditions satisfied by many standard functional forms.
The empirical results are, by construction, specific to the SupTech platform development context. Quantitative parameters—the 66.6-second completion time, the 52/52 pass rate, the approximate linear efficiency ratio—are specific to the model version, task batch, and infrastructure conditions of March 2026 and should not be extrapolated without further empirical validation.
6.4 Limitations and Future Work
Several limitations merit acknowledgement. First, the theoretical framework abstracts from the internal mechanisms of LLM inference; \Pi is a reduced-form efficiency function rather than a structural model of the attention mechanism, KV-cache, or context management. Future work could derive \Pi from first principles of transformer architecture. Second, the empirical evidence is a single case study; more systematic empirical work across model families, task types, and scope dimensions would be valuable. Third, the framework does not address the quality dimension: the theory establishes that effective token consumption falls, but does not characterise how output quality varies with D and \theta. Quality-adjusted efficiency measures are an important extension.
7 Conclusion
This paper has developed a formal production-theoretic framework for understanding the joint efficiency of scope and scale in LLM inference. The central result, the Joint Scope–Scale Efficiency Theorem, establishes that model capacity \theta and task scope D are complements in inference efficiency: the marginal reduction in effective token consumption from increasing capacity is strictly greater when scope is larger, and vice versa. The supermodularity bound (Corollary 1) quantifies the joint efficiency gain, and the Pareto improvement result (Proposition 1) establishes that parallelised inference weakly dominates sequential inference in both tokens and latency.
The empirical case study from a fifty-two-institution SupTech deployment exercise provides striking support for these predictions. The contrast between systematic context exhaustion under sequential processing and zero-error, 66.6-second completion under parallelised batch inference constitutes a natural experiment in the economics of LLM inference. The implied efficiency ratio \Pi \approx 52 matches the theoretical near-linear calibration with unit parameters, supporting the quantitative as well as qualitative predictions of the framework.
The implications are practical and immediate. Enterprise AI deployments involving high-scope, structurally regular task batches should be redesigned around parallelised inference architectures. The efficiency gains are not marginal: they are the difference between feasibility and infeasibility for large-scale tasks within finite context budgets. Inference pricing architectures should be reformed to reflect scope-adjusted value rather than raw token counts.
More broadly, this paper argues that the economics of LLM inference requires theoretical frameworks that are sensitive to the structural organisation of inference workloads, not merely to raw token counts or model capabilities in isolation. The production-theoretic approach developed here—grounding efficiency in the complementarity between capacity and scope—provides a foundation for such frameworks.
8 References
9 Appendix A: Notation and Symbol Glossary
Table 2 provides a complete reference for all mathematical symbols used in the main paper. Symbols are listed in order of first appearance. Throughout the paper, the convention \partial f/\partial x (or f_x in subscript notation) denotes the partial derivative of function f with respect to x. All functions are assumed to be at least twice continuously differentiable in the relevant arguments unless otherwise noted. The supermodularity index \delta is defined as the infimum of the cross-partial \Pi_{\theta D} over the domain, ensuring a uniform lower bound on the complementarity.
| Symbol | Domain | Definition | First Used |
|---|---|---|---|
| \theta | \Theta\subseteq\mathbb{R}_+ | Model capacity index | Definition 1 |
| D | \mathbb{N} | Task scope dimensionality | Definition 1 |
| B | \mathcal{B} | Batch configuration | Definition 1 |
| Q | — | Query content | Definition 1 |
| \Pi(\theta,D,B) | [1,\infty) | Parallelisation efficiency function | Definition 1 |
| T_{raw}(Q,D) | \mathbb{R}_+ | Raw token count | Definition 1 |
| T_{eff} | \mathbb{R}_+ | Effective token consumption = T_{raw}/\Pi | Definition 1 |
| c(\theta) | \mathbb{R}_+ | Per-token cost at capacity \theta | Section 3.6 |
| C | \mathbb{R}_+ | Total inference cost = c(\theta)\cdot T_{eff} | Section 3.6 |
| \tau | \mathbb{R}_+ | Inference latency (seconds) | Proposition 1 |
| \delta | \mathbb{R}_+ | Supermodularity index = \inf\Pi_{\theta D} | Corollary 1 |
| \Pi_\theta | | \partial\Pi/\partial\theta | Step 1 |
| \Pi_D | | \partial\Pi/\partial D | Step 2 |
| \Pi_{\theta D} | | \partial^2\Pi/\partial\theta\partial D | Assumption A3 |
| T_{\mathrm{raw},D} | | \partial T_{raw}/\partial D | Step 4 |
| \mathcal{S} | | Feasibility set, sequential (D=1) | Proposition 1 |
| \mathcal{P} | | Feasibility set, parallelised (D=N) | Proposition 1 |
| N | \mathbb{N} | Total tasks (case study: N=52) | Section 4 |
| K | \mathbb{R}_+ | Token budget (context window) | Section 4.2 |
| \alpha,\beta,\gamma,\delta^* | (0,\infty) | Parametric coefficients | Appendix B |
10 Appendix B: Parametric Specification of \Pi(\theta, D, B)
10.1 Baseline Functional Form
To anchor the theoretical framework to quantitative predictions, we propose the following parametric specification of the parallelisation efficiency function:
\Pi(\theta, D, B) = 1 + \alpha\cdot\theta^\beta\cdot(D-1)^\gamma\cdot B^{\delta^*} \tag{10}
where \alpha, \beta, \gamma, \delta^* > 0. The (D-1) shift ensures \Pi(\theta,1,B)=1 (Assumption A4). We verify all five assumptions:
- (A1) \partial\Pi/\partial\theta = \alpha\beta\theta^{\beta-1}(D-1)^\gamma B^{\delta^*} > 0 for D>1. ✓
- (A2) \partial\Pi/\partial D = \alpha\gamma\theta^\beta(D-1)^{\gamma-1}B^{\delta^*} > 0 for D>1. ✓
- (A3) \Pi_{\theta D} = \alpha\beta\gamma\theta^{\beta-1}(D-1)^{\gamma-1}B^{\delta^*} > 0 for D>1. ✓
- (A4) \Pi(\theta,1,B) = 1 + \alpha\theta^\beta\cdot 0^\gamma\cdot B^{\delta^*} = 1. ✓
- (A5) Power functions are C^\infty on interior domains. ✓
10.2 Empirical Calibration
We calibrate against the case study. Let \theta_0 denote Claude Sonnet 4.6 capacity (normalised to 1), D=52, B = B_0 = 1. The observed efficiency ratio is \Pi(\theta_0, 52, B_0) \approx 52. Setting \alpha = \beta = \gamma = 1:
\Pi(1, 52, 1) = 1 + 1\cdot 1\cdot(52-1)^1\cdot 1 = 1 + 51 = 52 \quad \checkmark
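The near-linear calibration (\beta = \gamma = 1) is consistent with the empirical observation that all 52 tasks complete with near-equal per-task resource consumption. Because \theta_0 and B_0 are both normalised to 1, this single observation restricts only \alpha\cdot 51^\gamma = 51; \beta and \delta^* are not identified by it, so \beta=1 is a maintained assumption rather than an estimate.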
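The calibration arithmetic can be reproduced with a minimal Python sketch of equation (10). The function name `pi_power_law` and its default arguments are illustrative assumptions; in particular the default for `delta_star` is arbitrary, since it has no effect when B = 1.

```python
def pi_power_law(theta, D, B, alpha=1.0, beta=1.0, gamma=1.0, delta_star=0.5):
    """Equation (10): Pi = 1 + alpha * theta**beta * (D - 1)**gamma * B**delta_star."""
    if D < 1:
        raise ValueError("task scope D must be at least 1")
    return 1.0 + alpha * theta**beta * (D - 1) ** gamma * B**delta_star


if __name__ == "__main__":
    # Case-study calibration: theta_0 = 1, D = 52, B_0 = 1, alpha = beta = gamma = 1.
    print(pi_power_law(1.0, 52, 1.0))  # 52.0
    # Assumption A4: with a single task there is no parallelisation gain.
    print(pi_power_law(1.0, 1, 1.0))   # 1.0
```

Any other pair (\alpha, \gamma) with \alpha\cdot 51^\gamma = 51 would reproduce the same observed ratio, which is why the unit-parameter choice is best read as a convenient benchmark.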
10.3 Alternative Functional Forms
Table 3 presents three alternative functional forms for \Pi, each satisfying Assumptions A1–A5 under appropriate parameter restrictions.
| Form | Specification | \Pi_{\theta D} | Notes |
|---|---|---|---|
| Power-law (baseline) | 1 + \alpha\theta^\beta(D-1)^\gamma | \alpha\beta\gamma\theta^{\beta-1}(D-1)^{\gamma-1} | D>1; \alpha,\beta,\gamma>0 |
| Log-interaction | 1 + \alpha\theta\cdot\log(D) | \alpha/D > 0 | D\geq 2; diminishing returns |
| Exponential | \exp(\alpha\theta(D-1)) | \alpha[1+\alpha\theta(D-1)]\exp(\alpha\theta(D-1))>0 | Super-linear; \Pi\to\infty |
| Cobb-Douglas | \theta^\beta\cdot D^\gamma | \beta\gamma\theta^{\beta-1}D^{\gamma-1} | Requires normalisation for A4 |
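The sign of the cross-partial for each form can be checked numerically. The sketch below is an illustration only: it fixes B = 1, treats D as continuous, uses arbitrary parameter values, and estimates \Pi_{\theta D} by central finite differences rather than symbolically.

```python
import math

# Illustrative parameter values; B is fixed at 1 throughout.
ALPHA, BETA, GAMMA = 1.0, 1.0, 1.0

FORMS = {
    "power_law": lambda th, D: 1 + ALPHA * th**BETA * (D - 1) ** GAMMA,
    "log_interaction": lambda th, D: 1 + ALPHA * th * math.log(D),
    "exponential": lambda th, D: math.exp(ALPHA * th * (D - 1)),
    "cobb_douglas": lambda th, D: th**BETA * D**GAMMA,
}


def cross_partial(f, theta, D, h=1e-3):
    """Central finite-difference estimate of d^2 f / (d theta dD)."""
    return (
        f(theta + h, D + h) - f(theta + h, D - h)
        - f(theta - h, D + h) + f(theta - h, D - h)
    ) / (4 * h * h)


if __name__ == "__main__":
    for name, f in FORMS.items():
        at_d1 = f(0.7, 1.0)                 # A4 requires this to equal 1
        mixed = cross_partial(f, 0.7, 3.0)  # A3 requires this to be positive
        print(f"{name:16s} Pi(0.7, 1) = {at_d1:.3f}   Pi_thetaD(0.7, 3) = {mixed:.4f}")
```

The unnormalised Cobb-Douglas form returns \theta^\beta rather than 1 at D = 1, which is the A4 caveat noted in the table.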
10.4 Implied Effective Token Cost Schedules
Using the power-law baseline with \alpha=\beta=\gamma=1 and T_{raw}(Q,D) = D\cdot T_1 (linear raw cost):
T_{eff}(D) = \frac{D\cdot T_1}{1+(D-1)} = \frac{D\cdot T_1}{D} = T_1
Under this linear calibration, the effective token cost of the whole batch is exactly T_1 regardless of D. This is equivalent to processing a single task, with the remaining D-1 tasks processed at zero marginal effective token cost.
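The schedule can be tabulated with a short Python sketch, assuming linear raw cost T_{raw} = D\cdot T_1 and B = 1. The function name `effective_tokens` and the single-task token count of 1000 are hypothetical values chosen for illustration.

```python
def effective_tokens(D, t1, alpha=1.0, theta=1.0, beta=1.0, gamma=1.0):
    """T_eff = T_raw / Pi with linear raw cost T_raw = D * t1 and B = 1."""
    pi = 1.0 + alpha * theta**beta * (D - 1) ** gamma
    return D * t1 / pi


if __name__ == "__main__":
    t1 = 1000.0  # hypothetical single-task token count
    for D in (1, 2, 10, 52):
        linear = effective_tokens(D, t1)               # gamma = 1: flat schedule
        concave = effective_tokens(D, t1, gamma=0.75)  # gamma < 1: rising schedule
        print(D, round(linear, 1), round(concave, 1))
```

The flat schedule is specific to \gamma = 1; with \gamma < 1 the effective cost rises with D, which is one way to see how much weight the unit-parameter calibration carries.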
11 Appendix C: Extended Proofs and Lemmas
11.1 Preliminary Lemmas
Lemma 1 (Quotient Rule for Efficiency Inverse).
Let f(\theta,D)=T_{raw}(Q,D) and g(\theta,D)=\Pi(\theta,D,B). Then T_{eff} = f/g and:
\left(\frac{f}{g}\right)_{\theta D} = \frac{f_{\theta D}g - f_\theta g_D - f_D g_\theta - f g_{\theta D}}{g^2} + \frac{2f g_\theta g_D}{g^3}
Since T_{raw} is independent of \theta: f_\theta = 0 and f_{\theta D} = 0. The expression simplifies to:
(T_{eff})_{\theta D} = \frac{-T_{\mathrm{raw},D}\Pi_\theta - T_{raw}\Pi_{\theta D}}{\Pi^2} + \frac{2T_{raw}\Pi_\theta\Pi_D}{\Pi^3}
This confirms the expression derived in Step 3 of the main proof.
Lemma 2 (Sufficient Condition for Negative Sign).
The cross-partial \partial^2T_{eff}/\partial\theta\partial D < 0 if and only if:
T_{\mathrm{raw},D}\Pi_\theta + T_{raw}\Pi_{\theta D} > \frac{2T_{raw}\Pi_\theta\Pi_D}{\Pi}
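A sufficient condition is \Pi_{\theta D}/\Pi_D > 2\Pi_\theta/\Pi, equivalently \partial[\log(\Pi_\theta/\Pi^2)]/\partial D > 0. For the power-law specification with \beta=\gamma=1 and B=1, this reduces to \alpha\theta(D-1) < 1, so the sufficient condition holds only when \theta(D-1) is small; outside that range the sign has to be checked from the necessary-and-sufficient inequality above.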
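As a worked check, the necessary-and-sufficient inequality of Lemma 2 can be evaluated directly for the baseline specification. The derivation below assumes \beta=\gamma=1, B=1 and the linear raw cost T_{raw}=D\cdot T_1 used in Appendix B; the shorthand x is introduced here for convenience.

```latex
\begin{aligned}
\Pi &= 1+\alpha\theta(D-1), \quad \Pi_\theta = \alpha(D-1), \quad \Pi_D = \alpha\theta, \quad \Pi_{\theta D} = \alpha, \\
T_{\mathrm{raw},D}\Pi_\theta + T_{raw}\Pi_{\theta D} &= \alpha T_1(D-1) + \alpha T_1 D = \alpha T_1(2D-1), \\
\frac{2T_{raw}\Pi_\theta\Pi_D}{\Pi} &= \frac{2\alpha T_1 D\,x}{1+x}, \qquad x \equiv \alpha\theta(D-1), \\
(2D-1)(1+x) > 2Dx \;&\Longleftrightarrow\; x < 2D-1.
\end{aligned}
```

At the calibration point \alpha=\theta=1, D=52, the condition reads 51 < 103, so the cross-partial is negative there under these assumptions.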
11.2 Proof of Corollary 1 (Detailed)
Let \Delta\Pi \equiv \Pi(\theta_2,D_2) - \Pi(\theta_2,D_1) - \Pi(\theta_1,D_2) + \Pi(\theta_1,D_1) for \theta_2\geq\theta_1 and D_2\geq D_1, treating D as a continuous variable. By the fundamental theorem of calculus applied twice:
\Delta\Pi = \int_{\theta_1}^{\theta_2}\int_{D_1}^{D_2}\Pi_{\theta D}(\theta,D)\,\mathrm{d}D\,\mathrm{d}\theta \;\geq\; \delta\int_{\theta_1}^{\theta_2}\int_{D_1}^{D_2}\mathrm{d}D\,\mathrm{d}\theta = \delta(\theta_2-\theta_1)(D_2-D_1)
Rearranging gives the Corollary statement.
11.3 Proof of Proposition 1 (Detailed)
Under sequential processing with N tasks:
T_{eff}^{\mathrm{seq}} = N\cdot T_{raw}(Q,1)/\Pi(\theta,1,B) = N\cdot T_{raw}(Q,1), \quad \tau^{\mathrm{seq}} = N\cdot\tau_0
Under parallelised processing (D=N, single call), with overhead \varepsilon\geq 0:
T_{eff}^{\mathrm{par}} \leq \frac{(1+\varepsilon)\cdot N\cdot T_{raw}(Q,1)}{N} = (1+\varepsilon)T_{raw}(Q,1) < N\cdot T_{raw}(Q,1) = T_{eff}^{\mathrm{seq}}
for \varepsilon < N-1 (satisfied for N=52, small \varepsilon). For latency: \tau^{\mathrm{par}} = \tau(\Pi,\theta) \ll N\cdot\tau_0 = \tau^{\mathrm{seq}}.
11.4 Context Exhaustion as T_{eff} \to \infty
Let K be the finite context window and C_n the cumulative context after n sequential calls, with C_0 = 0 and carry-forward coefficient \kappa > 0:
C_n = (1+\kappa)C_{n-1} + T_{raw}, \quad C_n = \frac{(1+\kappa)^n - 1}{\kappa}\cdot T_{raw} \to \infty \text{ geometrically}
Context exhaustion occurs at the first call whose cumulative context reaches the window, n^* = \min\{n : C_n \geq K\}.
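The exhaustion point can be computed by inverting the closed form for C_n. The following Python sketch assumes C_0 = 0; the function names and the numerical inputs (a 200,000-token window, 8,000 raw tokens per call, \kappa = 0.1) are hypothetical and are not taken from the case study.

```python
import math


def cumulative_context(n, t_raw, kappa):
    """Closed form C_n = ((1 + kappa)**n - 1) / kappa * T_raw, assuming C_0 = 0."""
    return ((1.0 + kappa) ** n - 1.0) / kappa * t_raw


def exhaustion_call(K, t_raw, kappa):
    """Smallest n with C_n >= K, obtained by inverting the closed form."""
    n = math.ceil(math.log(1.0 + kappa * K / t_raw) / math.log(1.0 + kappa))
    # Guard against floating-point rounding at the boundary.
    while n > 1 and cumulative_context(n - 1, t_raw, kappa) >= K:
        n -= 1
    while cumulative_context(n, t_raw, kappa) < K:
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(exhaustion_call(K=200_000, t_raw=8_000, kappa=0.1))  # 14
```

Without carry-forward the same window would accommodate 25 calls of 8,000 tokens, so in this illustration the geometric growth roughly halves the number of sequential calls that fit.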
12 Appendix D: Empirical Case Study — Full 52-HEI Registry
Table 4 lists all fifty-two higher education institutions (HEIs) in the UAE included in the OBF SupTech-RegTech Platform deployment. The user count follows: 7 base users plus 2 per college, yielding 7 + 2k for an institution with k colleges.
| # | Code | Institution (English) | Cols | Progs | Users |
|---|---|---|---|---|---|
| 1 | ADHA | Abu Dhabi Hospitality Academy — Les Roches | 1 | 3 | 9 |
| 2 | AQU | Al Qasimiya University | 4 | 5 | 15 |
| 3 | AWU | Al Wasl University | 3 | 4 | 13 |
| 4 | AMITY | Amity University Dubai | 3 | 5 | 13 |
| 5 | AGDA | Anwar Gargash Diplomatic Academy | 2 | 3 | 11 |
| 6 | BMC | Batterjee Medical College — Dubai | 3 | 3 | 13 |
| 7 | BITS | BITS Pilani Dubai Campus | 3 | 7 | 13 |
| 8 | DMU | De Montfort University Dubai | 3 | 5 | 13 |
| 9 | DIDI | Dubai Institute of Design and Innovation | 1 | 4 | 9 |
| 10 | DMEU | Dubai Medical University | 3 | 5 | 13 |
| 11 | DPA | Dubai Police Academy | 2 | 5 | 11 |
| 12 | EMN | EM Normandie Business School Dubai | 1 | 5 | 9 |
| 13 | EAIC | Emirates Academy for Identity and Citizenship | 1 | 3 | 9 |
| 14 | EAHM | Emirates Academy of Hospitality Management | 1 | 3 | 9 |
| 15 | EAU | Emirates Aviation University | 2 | 6 | 11 |
| 16 | ESCP | ESCP Business School Dubai Campus | 1 | 3 | 9 |
| 17 | ESMOD | ESMOD French Fashion Institute Dubai | 1 | 4 | 9 |
| 18 | EURAK | European University RAK Campus | 1 | 5 | 9 |
| 19 | FCMS | Fakeeh College for Medical Sciences Dubai | 3 | 4 | 13 |
| 20 | FCHS | Fatima College of Health Sciences | 3 | 5 | 13 |
| 21 | GUD | Georgetown University Dubai | 1 | 2 | 9 |
| 22 | GSU | Global Studies University | 1 | 3 | 9 |
| 23 | HBZC | Hamdan Bin Zayed College | 1 | 3 | 9 |
| 24 | HWUD | Heriot-Watt University Dubai | 3 | 5 | 13 |
| 25 | HCT | Higher Colleges of Technology | 5 | 7 | 17 |
| 26 | HUC | Horizon University College | 2 | 5 | 11 |
| 27 | HULT | Hult International Business School | 1 | 5 | 9 |
| 28 | IMC | Imam Malik College for Islamic Sharia and Law | 1 | 6 | 9 |
| 29 | IIMA | IIM Ahmedabad Dubai | 1 | 2 | 9 |
| 30 | INSEAD | INSEAD Abu Dhabi | 1 | 4 | 9 |
| 31 | IMTD | Institute of Management Technology Dubai | 1 | 4 | 9 |
| 32 | IAURAK | International American University RAK Campus | 2 | 3 | 11 |
| 33 | IMAR | Istituto Marangoni Dubai | 1 | 6 | 9 |
| 34 | JCSC | Joint Command and Staff College | 1 | 3 | 9 |
| 35 | JU | Jumeira University | 4 | 5 | 15 |
| 36 | KBZAC | Khalifa Bin Zayed Air College | 1 | 4 | 9 |
| 37 | LBSD | London Business School Dubai Campus | 1 | 3 | 9 |
| 38 | LUISS | LUISS University Dubai | 1 | 4 | 9 |
| 39 | MAHE | Manipal Academy of Higher Education | 4 | 6 | 15 |
| 40 | MDXD | Middlesex University Dubai | 3 | 5 | 13 |
| 41 | MBZUAI | Mohamed Bin Zayed Univ. of Artificial Intelligence | 1 | 7 | 9 |
| 42 | MBRSG | Mohammed Bin Rashid School of Government | 1 | 3 | 9 |
| 43 | MBRU | Mohammed Bin Rashid Univ. of Medicine & Health Sciences | 3 | 6 | 13 |
| 44 | MURDU | Murdoch University Dubai | 4 | 6 | 15 |
| 45 | NDC | National Defense College UAE | 1 | 2 | 9 |
| 46 | NHSB | Neohorizon School of Business | 1 | 2 | 9 |
| 47 | PRUE | Plekhanov Russian Univ. of Economics — Dubai | 1 | 4 | 9 |
| 48 | PCAD | Police College Abu Dhabi | 2 | 2 | 11 |
| 49 | PSAS | Police Sciences Academy — Sharjah | 2 | 4 | 11 |
| 50 | RA | Rabdan Academy | 1 | 5 | 9 |
| 51 | RAKMHSU | Ras Al Khaimah Medical and Health Sciences University | 4 | 6 | 15 |
| 52 | RBSNC | Rashid Bin Saeed Al Maktoum Naval College | 1 | 3 | 9 |
Descriptive statistics for the 52-HEI registry:
| Statistic | Colleges/HEI | Programmes/HEI | Users/HEI |
|---|---|---|---|
| Minimum | 1 | 2 | 9 |
| Maximum | 5 | 7 | 17 |
| Mean | 1.92 | 4.27 | 10.85 |
| Std. Dev. (sample) | ≈1.15 | ≈1.39 | ≈2.30 |
| Total | 100 | 222 | 564 |
The distribution of colleges per institution is highly right-skewed: 28 of 52 institutions (53.8%) have a single college, reflecting the prevalence of specialist institutions in the UAE higher education landscape.
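The registry statistics can be recomputed from Table 4. The Python sketch below transcribes the Cols and Progs columns in row order and derives users from the 7 + 2k rule; the variable and function names are illustrative.

```python
from statistics import mean, stdev

# Colleges and programmes per institution, transcribed from Table 4 (rows 1-52).
COLLEGES = [1, 4, 3, 3, 2, 3, 3, 3, 1, 3, 2, 1, 1, 1, 2, 1, 1, 1, 3, 3, 1, 1, 1, 3, 5, 2,
            1, 1, 1, 1, 1, 2, 1, 1, 4, 1, 1, 1, 4, 3, 1, 1, 3, 4, 1, 1, 1, 2, 2, 1, 4, 1]
PROGRAMMES = [3, 5, 4, 5, 3, 3, 7, 5, 4, 5, 5, 5, 3, 3, 6, 3, 4, 5, 4, 5, 2, 3, 3, 5, 7, 5,
              5, 6, 2, 4, 4, 3, 6, 3, 5, 4, 3, 4, 6, 5, 7, 3, 6, 6, 2, 2, 4, 2, 4, 5, 6, 3]

# Users follow the rule 7 base users plus 2 per college.
USERS = [7 + 2 * k for k in COLLEGES]


def summarise(values):
    """Minimum, maximum, mean, sample standard deviation and total of a column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": round(mean(values), 2),
        "stdev": round(stdev(values), 2),
        "total": sum(values),
    }


if __name__ == "__main__":
    for label, column in (("colleges", COLLEGES), ("programmes", PROGRAMMES), ("users", USERS)):
        print(label, summarise(column))
    print("single-college institutions:", COLLEGES.count(1))
```

Because users are an affine function of colleges, the user standard deviation is exactly twice the college standard deviation, which gives a quick consistency check on the table.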
Task Heterogeneity and Batch Configuration. The batch configuration B is characterised by the structural regularity of the 52 tasks. All 52 scripts share an identical template schema with institution-specific parameter substitution. The regularity dimensions are: (i) template structure: 100% shared; (ii) parameter types: identical; (iii) substitution logic: consistent replacement rules; and (iv) heterogeneity dimension: institutional parameters only. This configuration represents near-maximum B—the regime in which Theorem 1 predicts the largest efficiency gains.
13 Appendix E: Platform Architecture and Technical Specification
13.1 OBF SupTech-RegTech Platform Overview
The OBF SupTech-RegTech Platform is a Shiny-based R application implementing the UAE MoHESR OBF v11 compliance monitoring system. The technology stack comprises: R (v4.x) and RStudio/Positron as the primary development environment; Shiny and shinydashboard for the web application framework; shinymanager for authentication and RBAC; RSQLite and DBI for database operations; and SQLite as the embedded database engine.
13.2 File Structure per Institution
Each institution’s deployment comprises:
{FOLDER}/
|-- init_database_{code}.r # Database initialisation script
|-- app_{code}.r # Main Shiny application
|-- obf_{code}_v9_5.sqlite # OBF compliance database
`-- obf_{code}_v9_5_auth.sqlite # Authentication database
The init_database_{code}.r script is 1,500–1,700 lines and, when executed, creates and seeds both SQLite databases with all institution-specific data.
13.3 Template Substitution Architecture
Table 5 lists the twelve substitution categories applied by platform_generator.py per institution.
| Cat. | Substitution | Example (KU → ADHA) | Method |
|---|---|---|---|
| 1 | Institution code (variable names) | KU_ → ADHA_ | String replace |
| 2 | Short code value | "KU" → "ADHA" | Regex |
| 3 | Full English name | Khalifa → Les Roches | JSON lookup |
| 4 | Arabic name (R unicode escapes) | 62c… → literal | Raw string |
| 5 | Website URL | www.ku.ac.ae → institution URL | Ordered replace |
| 6 | Email domain | ku.ac.ae → adha.ae | String replace |
| 7 | Auth DB filename | obf_ku… → obf_adha… | String replace |
| 8 | Auth credentials block | ku_credentials → adha_credentials | Block replace |
| 9 | Validation counts (colleges) | col_n==3 → col_n==1 | String replace |
| 10 | Program counts | prg_n==60 → prg_n==3 | String replace |
| 11 | User counts | u_n==18 → u_n==9 | String replace |
| 12 | Summary messages | KU OBF INIT → ADHA OBF INIT | String replace |
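The ordering constraint behind these categories can be illustrated with a simplified Python sketch. This is not the code of platform_generator.py: the function `substitute`, the `target` dictionary keys, and the website value are hypothetical, and only a subset of the twelve categories is covered.

```python
import re


def substitute(template, target):
    """Apply some of the Table 5 categories to a KU template, most specific pattern first."""
    code = target["code"]
    lower = code.lower()
    users = 7 + 2 * target["colleges"]
    out = template
    # Cat. 5 before Cat. 6: the website URL contains the e-mail domain as a substring.
    out = out.replace("www.ku.ac.ae", target["website"])
    out = out.replace("ku.ac.ae", target["email_domain"])
    # Cat. 8 and Cat. 7: credentials variable and database filename prefix.
    out = out.replace("ku_credentials", f"{lower}_credentials")
    out = out.replace("obf_ku", f"obf_{lower}")
    # Cat. 1 and Cat. 2: variable-name prefix, then the quoted short code value.
    out = out.replace("KU_", f"{code}_")
    out = re.sub(r'"KU"', lambda m: f'"{code}"', out)
    # Cat. 9-11: validation counts, keyed on the variable name (an assumption;
    # Table 5 lists these as plain string replaces of the KU values).
    for name, value in (("col_n", target["colleges"]),
                        ("prg_n", target["programmes"]),
                        ("u_n", users)):
        out = re.sub(rf"\b{name}\s*==\s*\d+", f"{name}=={value}", out)
    return out


if __name__ == "__main__":
    template = 'KU_SHORT_CODE <- "KU"\nmail <- "admin@ku.ac.ae"\nstopifnot(col_n==3, u_n==18)\n'
    adha = {"code": "ADHA", "website": "www.example.org",  # placeholder URL
            "email_domain": "adha.ae", "colleges": 1, "programmes": 3}
    print(substitute(template, adha))
```

Replacing the longer URL before the bare e-mail domain avoids a partial match that would leave a hybrid string behind, which is the usual reason for an ordered replace.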
13.4 Role-Based Access Control Schema
Table 6 summarises the user role schema. Total users = 7 + 2k where k = number of colleges.
| Role | Username Pattern | Access Level | Count |
|---|---|---|---|
| Platform Administrator | code.admin@email_domain | Full admin | 1 |
| MoHESR Observer | mohesr_user@mohesr.gov.ae | Read-only regulator | 1 |
| President / VC | president@email_domain | Executive read | 1 |
| University Admin | univ.admin@email_domain | Institution-wide | 1 |
| QA Director | qa.director@email_domain | QA module full access | 1 |
| College Dean | dean.{college}@email_domain | College-scoped | 1 per college |
| College QA Chair | qa.{college}@email_domain | College QA read | 1 per college |
| Data Entry Officer | data.entry@email_domain | Data entry write | 1 |
| Viewer | viewer@email_domain | Dashboard read-only | 1 |
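The role schema translates directly into a user list of length 7 + 2k. The Python sketch below is illustrative: the function `build_users`, the lower-casing of the institution code, and the college identifiers are assumptions rather than details confirmed by the platform.

```python
def build_users(code, email_domain, colleges):
    """Return (role, username) pairs following the patterns in Table 6."""
    fixed = [
        ("Platform Administrator", f"{code.lower()}.admin@{email_domain}"),
        ("MoHESR Observer", "mohesr_user@mohesr.gov.ae"),
        ("President / VC", f"president@{email_domain}"),
        ("University Admin", f"univ.admin@{email_domain}"),
        ("QA Director", f"qa.director@{email_domain}"),
        ("Data Entry Officer", f"data.entry@{email_domain}"),
        ("Viewer", f"viewer@{email_domain}"),
    ]
    per_college = []
    for college in colleges:
        # Two college-scoped accounts per college: a dean and a QA chair.
        per_college.append(("College Dean", f"dean.{college}@{email_domain}"))
        per_college.append(("College QA Chair", f"qa.{college}@{email_domain}"))
    return fixed + per_college


if __name__ == "__main__":
    # Hypothetical college identifier for a single-college institution.
    users = build_users("ADHA", "adha.ae", ["hospitality"])
    print(len(users))  # 9
```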
14 Appendix F: Validation Methodology and Full Results
14.1 Validation Design
Following batch generation, a systematic bulk validation sweep applied twelve programmatic checks per generated file, adopting a negative-test paradigm: checks verified the absence of residual source-HEI (KU) strings and the presence of institution-specific structural signatures.
14.2 Validation Checks
Table 7 presents the twelve validation checks applied to all 52 generated scripts. All checks passed 52/52.
| Check | Description | Criterion | Result |
|---|---|---|---|
| V-01 | Arabic name replacement | ≠ KU Arabic literal | 52/52 PASS |
| V-02 | Short code value | CODE_SHORT_CODE ≠ “KU” | 52/52 PASS |
| V-03 | Brand comment | # CODE Brand (not KU) | 52/52 PASS |
| V-04 | Website URL | Institution-specific URL | 52/52 PASS |
| V-05 | Auth credentials variable | code_credentials (not ku_) | 52/52 PASS |
| V-06 | Auth user count | Users = 7 + 2×colleges | 52/52 PASS |
| V-07 | College validation | col_n == N_colleges | 52/52 PASS |
| V-08 | Program validation | prg_n == N_programs | 52/52 PASS |
| V-09 | User validation | u_n == N_users | 52/52 PASS |
| V-10 | Completion banner | CODE OBF INIT COMPLETE | 52/52 PASS |
| V-11 | Launch instruction | Launch app_CODE.r (not app_ku.r) | 52/52 PASS |
| V-12 | No residual KU strings | 0 occurrences outside header | 52/52 PASS |
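A negative-test sweep of this kind can be sketched in a few lines of Python. The function below covers only a subset of the twelve checks; its name, the `header_lines` parameter, and the list of residual KU substrings (drawn from the examples in Table 5) are assumptions made for illustration, not the validator actually used.

```python
import re

# Substrings that identify the source institution (KU), taken from the Table 5 examples.
RESIDUAL_SUBSTRINGS = ("ku.ac.ae", "ku_credentials", "obf_ku", "app_ku")
RESIDUAL_CODE = re.compile(r"\bKU(?:_|\b)")


def validate_script(text, code, colleges, programmes, header_lines=0):
    """Run a subset of the Table 7 checks; returns {check id: passed}."""
    users = 7 + 2 * colleges
    # V-12 ignores the file header, whose length is supplied by the caller.
    body = "\n".join(text.splitlines()[header_lines:])
    return {
        "V-05": f"{code.lower()}_credentials" in text and "ku_credentials" not in text,
        "V-07": re.search(rf"\bcol_n\s*==\s*{colleges}\b", text) is not None,
        "V-08": re.search(rf"\bprg_n\s*==\s*{programmes}\b", text) is not None,
        "V-09": re.search(rf"\bu_n\s*==\s*{users}\b", text) is not None,
        "V-10": f"{code} OBF INIT COMPLETE" in text,
        "V-12": RESIDUAL_CODE.search(body) is None
                and not any(s in body for s in RESIDUAL_SUBSTRINGS),
    }


if __name__ == "__main__":
    sample = ('adha_credentials <- list()\n'
              'stopifnot(col_n==1, prg_n==3, u_n==9)\n'
              'cat("ADHA OBF INIT COMPLETE")\n')
    print(validate_script(sample, "ADHA", colleges=1, programmes=3))
```

Pairing each absence check with a presence check guards against a script that passes simply because the relevant block is missing altogether.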
14.3 Spot-Check Methodology
Four institutions were selected for detailed manual review:
- ADHA (HEI 1): Single-college specialist. Verified Arabic name, Les Roches URL, 3-programme breakdown, 9-user credential structure.
- HCT (HEI 25): Maximum-college (5 colleges, 17 users). Verified 5 dean/QA pairs, col_n==5, u_n==17, programme breakdown.
- MAHE (HEI 39): Multi-college international campus (4 colleges). Verified Manipal email domain, 15-user count, 6-programme count.
- RAKMHSU (HEI 51): Medical university (4 colleges). Verified RAK-specific metadata, medical programme breakdown, correct auth DB filename.
14.4 Error Rate Analysis
The validated error rate was 0/52 = 0.0% across all twelve check categories and all fifty-two institutions. The zero error rate reflects both the correctness of the generation logic and the efficiency of batch processing: in a single pass, the model had access to the complete JSON source of truth for all 52 institutions simultaneously, enabling consistent cross-institution parameter propagation that sequential calls cannot guarantee.
15 Appendix G: Sensitivity Analysis and Boundary Cases
15.1 Sensitivity to \beta and \gamma
Table 8 reports \Pi(1, 52, 1) for \alpha=1, \theta=1 across a grid of (\beta,\gamma) values. The \beta=\gamma=1 cell (52.0) matches the empirical calibration. Because \theta=1 implies \theta^\beta=1 for every \beta, the rows are identical and the grid varies only with \gamma, as 1+51^\gamma.
| \beta \backslash \gamma | 0.5 | 0.75 | 1.0 | 1.25 | 1.5 |
|---|---|---|---|---|---|
| 0.5 | 8.1 | 20.1 | 52.0 | 137.3 | 365.2 |
| 0.75 | 8.1 | 20.1 | 52.0 | 137.3 | 365.2 |
| 1.0 | 8.1 | 20.1 | 52.0 | 137.3 | 365.2 |
| 1.25 | 8.1 | 20.1 | 52.0 | 137.3 | 365.2 |
| 1.5 | 8.1 | 20.1 | 52.0 | 137.3 | 365.2 |
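The grid can be regenerated from equation (10) with a short Python sketch. The function names are illustrative, and the default `delta_star` is arbitrary because it has no effect at B = 1.

```python
def pi_power_law(theta, D, B, alpha, beta, gamma, delta_star):
    """Equation (10): Pi = 1 + alpha * theta**beta * (D - 1)**gamma * B**delta_star."""
    return 1.0 + alpha * theta**beta * (D - 1) ** gamma * B**delta_star


def sensitivity_grid(betas, gammas, theta=1.0, D=52, B=1.0, alpha=1.0, delta_star=0.5):
    """Map each (beta, gamma) pair to Pi(theta, D, B)."""
    return {
        (b, g): pi_power_law(theta, D, B, alpha, b, g, delta_star)
        for b in betas
        for g in gammas
    }


if __name__ == "__main__":
    values = (0.5, 0.75, 1.0, 1.25, 1.5)
    grid = sensitivity_grid(values, values)
    for b in values:
        print(b, [round(grid[(b, g)], 1) for g in values])
```

Calling `sensitivity_grid(values, values, theta=2.0)` (a hypothetical capacity level) makes the rows differ, which shows that sensitivity to \beta only becomes visible away from the normalisation \theta = 1.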
15.2 Boundary Case: D = 1
When D=1, Assumption A4 requires \Pi(\theta,1,B)=1, so T_{eff} = T_{raw}(Q,1)—the sequential benchmark. The cross-partial \Pi_{\theta D}|_{D=1} may be zero or undefined depending on the functional form; for the power-law specification with \gamma>1, \Pi_{\theta D}|_{D=1} = 0. The theorem is vacuous at this boundary but well-defined for any D>1.
15.3 Boundary Case: \theta \to 0
As \theta\to 0, the power-law specification gives \Pi\to 1 for any D and B. In this regime, even a large batch D=N cannot be processed efficiently because the model lacks capacity to exploit cross-task regularities. For \beta>1 the cross-partial \Pi_{\theta D}, which is proportional to \theta^{\beta-1}, also vanishes in the limit, consistent with the theorem’s condition \theta>0.
15.4 Boundary Case: D \to \infty
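As D\to\infty with fixed \theta and B, the efficiency \Pi\to\infty under the power-law specification. Whether T_{eff}\to 0 depends on how fast T_{raw} grows: with linear raw cost T_{raw}=D\cdot T_1, T_{eff}\to 0 requires \gamma>1, while at \gamma=1 it tends to the constant T_1/(\alpha\theta^\beta B^{\delta^*}). In practice, D is bounded by the finite context window K: for sufficiently large D, T_{raw}(Q,D) > K and the call fails. The relevant range is D\leq D^*(K,\theta) where D^* is the context-feasible scope maximum.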
15.5 Effect of Batch Regularity B
Table 9 shows the effect of batch regularity B on parallelisation efficiency (\alpha=\beta=\gamma=1, \delta^*=0.5, \theta=1, D=52).
| B | Interpretation | \Pi(1,52,B) | T_{eff} / T_{raw} |
|---|---|---|---|
| 0.25 | Low regularity (heterogeneous) | 1 + 51\times 0.50 = 26.5 | 0.038 |
| 0.50 | Medium regularity | 1 + 51\times 0.707 = 37.1 | 0.027 |
| 1.00 | High regularity (identical template) | 1 + 51\times 1.00 = 52.0 | 0.019 |
| 2.00 | Very high (structured JSON) | 1 + 51\times 1.414 = 73.1 | 0.014 |
16 Appendix H: Robustness — Alternative Functional Forms for T_{raw}
16.1 Motivation
The main paper assumes T_{raw} is increasing in Q and D and independent of \theta (f_\theta = 0). We examine robustness to alternative specifications.
16.2 Case: T_{raw} Increasing in \theta
Suppose T_{raw}(Q,D,\theta) = T_0(Q,D)\cdot\theta^\varepsilon for small \varepsilon>0. Then f_\theta = \varepsilon\cdot T_{raw}/\theta > 0 and f_{\theta D} = \varepsilon\cdot T_{\mathrm{raw},D}/\theta > 0. By Lemma 1, two additional terms enter the cross-partial: T_{\mathrm{raw},\theta D}/\Pi > 0, which works against the negative sign, and -T_{\mathrm{raw},\theta}\Pi_D/\Pi^2 < 0, which reinforces it. Both are of order \varepsilon, so the theorem continues to hold provided the terms already present at \varepsilon=0 dominate the perturbation. For small \varepsilon, this is satisfied.
16.3 Case: Sub-Additive T_{raw} in D
Suppose T_{raw}(Q,D) = D^\lambda T_1 for 0<\lambda<1. Then T_{\mathrm{raw},D} = \lambda D^{\lambda-1}T_1 > 0, diminishing. Term I becomes -\Pi^{-2}\lambda D^{\lambda-1}T_1\Pi_\theta < 0, smaller in magnitude than in the linear case. Since Term I is a negative contribution, shrinking it tightens rather than relaxes the inequality in Lemma 2: dividing that inequality by T_{raw} gives (\lambda/D)\Pi_\theta + \Pi_{\theta D} > 2\Pi_\theta\Pi_D/\Pi, whose left-hand side falls as \lambda falls. The negative cross-partial is preserved when this inequality holds; for the baseline with \alpha=\beta=\gamma=\theta=B=1 it requires \lambda > (D-2)/(D-1).
16.4 Case: Diminishing Returns in D for \Pi
Suppose \Pi_D is decreasing in D (log-interaction form \Pi=1+\alpha\theta\log D, where \Pi_D = \alpha\theta/D and \Pi_{\theta D} = \alpha/D > 0). Supermodularity is preserved and the theorem holds.
16.5 Summary of Robustness Results
Table 10 summarises the robustness of Theorem 1 to alternative functional form assumptions.
| Modification | Direction | Theorem Holds? | Condition |
|---|---|---|---|
| T_{raw} increasing in \theta (\varepsilon small) | Reduces \lvert\text{cross-partial}\rvert | Yes | \varepsilon< threshold |
| T_{raw} concave in D | Reduces Term I | Conditional | Lemma 2 inequality with T_{\mathrm{raw},D}/T_{raw}=\lambda/D |
| \Pi concave in D (log form) | Reduces \Pi_D | Yes | Supermodularity preserved |
| \Pi concave in \theta | Reduces \Pi_\theta | Yes | Supermodularity preserved |
| T_{raw} strongly increasing in \theta | May reverse sign | Conditional | \varepsilon < 2\Pi_D/\Pi_\theta |
| B constant | Rescales \Pi uniformly | Yes | Supermodularity unaffected |
The main theorem is robust across most of the alternative specifications considered. The result may fail in two cases: strongly capacity-dependent raw token generation, an unusual behaviour pattern not observed in current frontier models, where output length is primarily determined by task content rather than model capacity; and strongly sub-additive raw token cost, where the Lemma 2 inequality must be checked directly.