Joint Scope–Scale Efficiency in Token Consumption under Parallelised Inference: Theory and Empirical Evidence from Large Language Model Systems

Production-Theoretic Framework for LLM Inference Efficiency with Empirical Evidence from a 52-Institution SupTech Deployment

Inference Economics

Introduces a production-theoretic framework proving that model capacity and task scope are complements in LLM inference efficiency (Joint Scope–Scale Efficiency Theorem). Empirical evidence from a 52-institution UAE SupTech platform deployment shows parallelised batch inference achieves a ~52× efficiency ratio over sequential processing, with 100% validation pass rate in 66.6 seconds.

Author

Affiliation

Ibrahim Niankara

Al Ain University, College of Business; Brass Digital Lab, Abu Dhabi, UAE

Published

8 April 2026

1 Abstract

We develop a formal production-theoretic framework for understanding how task dimensionality (scope) and model capacity (scale) jointly determine effective token consumption during large language model (LLM) inference. Introducing the parallelisation efficiency function \Pi(\theta, D, B), where \theta indexes model capacity, D is task scope dimensionality, and B is batch configuration, we prove the Joint Scope–Scale Efficiency Theorem: under regularity conditions including supermodularity of \Pi, the cross-partial derivative of effective token consumption with respect to model capacity and task scope is strictly negative, implying that capacity and scope are complements in inference efficiency. We establish a supermodularity bound on joint efficiency gains (Corollary 1) and show that parallelised inference constitutes a Pareto improvement over sequential inference in the (tokens, latency) space (Proposition 1). Empirical evidence from a 52-institution supervisory technology (SupTech) platform deployment exercise in March 2026 is fully consistent with the theoretical predictions: whereas sequential inference produced systematic context exhaustion, parallelised batch inference completed all 52 tasks in 66.6 seconds with a 100% validation pass rate. The implied parallelisation efficiency ratio \Pi \approx 52 matches the near-linear theoretical calibration. We discuss implications for LLM system design, inference pricing architecture, and the broader economics of artificial intelligence.

Keywords: large language models; inference efficiency; parallelisation; token consumption; supermodularity; scaling laws; computational economics; batch processing; scope economies; scale economies

JEL Codes: C61, D24, L86, O33

2 Introduction

Large language models have emerged as general-purpose cognitive infrastructure, deployed across an extraordinary range of tasks from legal drafting and code generation to scientific reasoning and institutional governance. As these deployments scale from individual interactions to enterprise-grade, high-throughput systems, the question of inference efficiency—how many tokens are consumed to accomplish a task of given complexity—acquires first-order economic importance. Token consumption determines computational cost, latency, and the feasibility of deploying high-capability models at scale. Yet the economics of token consumption remain theoretically underdeveloped, particularly with respect to how the structural organisation of inference workloads interacts with model capacity to determine overall efficiency.

The standard paradigm for deploying LLMs on complex, multi-dimensional tasks follows a sequential decomposition logic: break a large problem into constituent sub-tasks, submit them iteratively, and aggregate the outputs. This approach has an intuitive appeal—it mirrors how human experts decompose complex projects—and is well-suited to tasks that are genuinely sequential (each step depends on the output of the prior). However, for tasks that are structurally parallel—meaning the sub-tasks share a common template, schema, or context but differ only in specific parameter values—the sequential paradigm is surprisingly inefficient. Each call re-establishes context, re-transmits shared structural information, and fails to exploit the cross-task regularities that a sufficiently capable model could leverage in a single pass.

The inefficiency arises from a fundamental complementarity between scope and scale in LLM inference. When a high-capacity model processes a batch of structurally related tasks in parallel, it can share representations across tasks—encoding the common structural template once and applying it to all instances—thereby reducing effective token consumption per task dramatically. This complementarity is not just a feature of specific architectures; it reflects a deeper production-theoretic property: scale (model capacity) and scope (task dimensionality) are supermodular in their effect on inference efficiency.

To develop this argument rigorously, we draw on three intellectual traditions. First, the empirical scaling laws literature (Kaplan et al. 2020; Hoffmann et al. 2022) establishes systematic relationships between model capacity, training compute, and task performance. Second, the economics of complementarities and supermodularity (Milgrom and Roberts 1990; Topkis 1998) provides the formal tools for characterising joint productivity gains from simultaneously increasing scope and scale. Third, the parallel computing literature (Amdahl 1967; Gustafson 1988) supplies the conceptual framework for understanding speedup and efficiency under parallelisation, which we adapt to the LLM inference setting.

The empirical motivation for this paper emerged directly from an extended inference session conducted in March 2026 using Claude Sonnet 4.6 (Anthropic 2025). The task involved the generation of fifty-two institution-specific database initialisation scripts for a national supervisory technology platform serving UAE higher education institutions. The contrast between the sequential approach (which produced systematic context exhaustion and could not complete the task) and the parallelised batch approach (which completed all 52 scripts in 66.6 seconds with zero errors) provided a stark natural experiment in the economics of LLM inference.

This paper makes four primary contributions. First, we introduce the parallelisation efficiency function \Pi(\theta, D, B) as a formal object, characterise its properties, and prove the Joint Scope–Scale Efficiency Theorem, establishing that model capacity and task scope are complements in the production of inference efficiency. Second, we derive a supermodularity bound (Corollary 1) that quantifies the minimum joint efficiency gain from simultaneously increasing both scope and capacity. Third, we establish a Pareto improvement result (Proposition 1) showing that parallelised inference weakly dominates sequential inference in both token consumption and latency. Fourth, we present detailed empirical evidence from a large-scale real-world deployment that is quantitatively consistent with the theoretical predictions.

The remainder of the paper proceeds as follows. Section 3 reviews the related literature. Section 4 develops the theoretical framework, states and proves the main results. Section 5 presents the empirical case study. Section 6 discusses implications and limitations. Section 7 concludes.

3 Related Literature

3.1 Scaling Laws and Inference Economics

The scaling laws literature establishes that LLM performance follows predictable power-law relationships with model size, dataset size, and training compute (Kaplan et al. 2020). Hoffmann et al. (2022) refine these laws, showing that compute-optimal training allocates resources more evenly across model size and training tokens than earlier work suggested. These results characterise the training frontier; our contribution extends the scaling perspective to the inference dimension and, specifically, to the organisation of inference workloads. The inference efficiency literature (Pope et al. 2023; Kwon et al. 2023) focuses on hardware and systems-level optimisations—tensor parallelism, KV-cache management, attention decomposition—but has not developed a production-theoretic account of how task structure interacts with model capacity to determine token consumption.

3.2 Complementarities and Supermodularity

Milgrom and Roberts (1990) introduced the formal concept of complementarities in production systems, showing that when activities are supermodular, adopting them jointly dominates adopting them individually in terms of organisational performance. The mathematical treatment of supermodularity (Topkis 1998) and its application to comparative statics (Milgrom and Roberts 1995) provide the formal tools for our main theorem. The concept of supermodularity has seen application in diverse economic contexts, from contract theory (Varian 1992) to the theory of the firm (Milgrom and Roberts 1995), but has not previously been applied to LLM inference. The closest application is in the economics of distributed systems, where task complementarities determine the optimal degree of centralisation.

3.3 Parallel and Distributed Computing

Amdahl (1967) establishes an upper bound on speedup from parallelisation as a function of the serial fraction of the computation. Gustafson (1988) challenges this bound by noting that problem size itself scales with the number of processors, yielding near-linear speedup in practice. Parallel LLM inference has been explored through tensor parallelism (Shoeybi et al. 2019), pipeline parallelism (Dean et al. 2012), and ZeRO-style memory optimisation (Rajbhandari et al. 2020). Our contribution is distinct: rather than hardware parallelism (distributing computation across devices), we analyse workload parallelism—the efficiency gains from presenting structurally related tasks as a batch to a single model instance.

3.4 Foundation Models and Agentic Systems

Bommasani et al. (2021) characterise foundation models as a qualitatively new paradigm, emphasising their generality and the emergent capabilities that arise at scale. Brown et al. (2020) demonstrate few-shot learning, showing that model capacity enables in-context learning from examples without gradient updates. Chain-of-thought prompting (Wei et al. 2022) and zero-shot reasoning (Kojima et al. 2022) illustrate that model capacity interacts with prompt structure to produce qualitatively richer outputs than raw token counts would suggest. The emerging literature on LLM agents (Yao et al. 2023; Park et al. 2023; Significant Gravitas 2023) raises the question of how sequential agent loops consume tokens over extended interactions. Li et al. (2023) examine multi-agent communication overhead. Our contribution is complementary: rather than the total token consumption of agent loops, we focus on the single-call efficiency gains from scope expansion.

3.5 Economics of Artificial Intelligence

Agrawal, Gans, and Goldfarb (2018) frame AI as a prediction technology and analyse its economics through the lens of factor complementarity. Their follow-on work (Agrawal, Gans, and Goldfarb 2022) extends this to power dynamics in AI-augmented organisations. Acemoglu and Restrepo (2019) analyse the labour market implications of automation through the lens of task-based models of production; our framework adapts a similar task-theoretic structure to the inference setting. Eloundou et al. (2023) assess labour market exposure to LLMs using task-level analysis, confirming that task structure—not merely capability—determines economic impact.

4 Theoretical Framework

4.1 Primitives and Notation

We model an inference episode as a tuple (Q, \theta, D, B) where Q is the query content, \theta \in \Theta \subseteq \mathbb{R}_+ is a scalar index of model capacity (e.g., effective parameter count), D \in \mathbb{N} is the dimensionality of the task scope (the number of structurally related sub-tasks to be processed simultaneously), and B \in \mathcal{B} is the batch configuration (a summary statistic capturing the degree of structural regularity across the D sub-tasks, e.g., template homogeneity, shared context fraction, parameter type consistency).

Let T_{raw}(Q, D) denote the raw token count required to process query Q with scope dimension D in the absence of any efficiency gains from parallelisation. We assume T_{raw} is jointly increasing in Q (query complexity) and D (task dimensionality): \partial T_{raw} / \partial D > 0. We impose the normalisation T_{raw}(Q, 1) = T_1 > 0 for a single task of content Q. Crucially, T_{raw} is independent of \theta: raw token requirements are determined by query content and scope, not model capacity.

4.2 The Parallelisation Efficiency Function

Definition 1 (Parallelisation Efficiency Function).
The parallelisation efficiency function \Pi: \Theta \times \mathbb{N} \times \mathcal{B} \to [1, \infty) is defined such that:

T_{eff} = \frac{T_{raw}(Q,D)}{\Pi(\theta, D, B)} \tag{1}

where \Pi(\theta, D, B) \geq 1 for all (\theta, D, B), with equality if and only if D = 1 (no scope for parallelisation) or \theta \to 0 (zero model capacity).

The efficiency function \Pi captures the factor by which effective token consumption is reduced relative to naïve sequential processing. A value of \Pi = k means that parallelised inference consumes only 1/k of the tokens that k sequential calls would require.

Assumptions (Properties of \Pi).
We impose the following regularity conditions on \Pi:

(A1) Monotonicity in capacity: \partial\Pi/\partial\theta > 0. Higher capacity models generate weakly greater parallelisation efficiency, all else equal.
(A2) Monotonicity in scope: \partial\Pi/\partial D > 0. Greater task scope generates weakly greater parallelisation efficiency, all else equal.
(A3) Supermodularity: \partial^2\Pi/\partial\theta\,\partial D > 0. The marginal efficiency gain from increasing scope is strictly increasing in model capacity, and vice versa.
(A4) Boundary conditions: \Pi(\theta, 1, B) = 1 for all \theta (no gains from scope = 1) and \lim_{\theta\to 0}\Pi(\theta, D, B) = 1 for all D.
(A5) Twice continuously differentiable: \Pi \in C^2(\Theta \times \mathbb{N} \times \mathcal{B}).

Assumption A3 (supermodularity) is the critical condition. It states that scale and scope are complements in the production of inference efficiency: the marginal value of model capacity is increasing in task scope, and the marginal value of task scope is increasing in model capacity.

4.3 Main Theorem: Joint Scope–Scale Efficiency

Theorem 1 (Joint Scope–Scale Efficiency Theorem).
Under Assumptions A1–A5, effective token consumption T_{eff} = T_{raw}(Q,D)/\Pi(\theta, D, B) satisfies:

\frac{\partial^2 T_{eff}}{\partial\theta\,\partial D} < 0 \tag{2}

That is, the cross-partial derivative of effective token consumption with respect to model capacity and task scope is strictly negative: the marginal reduction in effective token consumption from increasing model capacity is strictly greater in absolute value when task scope is larger, and vice versa.

Proof. The proof proceeds in five steps.

Step 1: First partial with respect to \theta.
Since T_{eff} = T_{raw}(Q,D)\cdot[\Pi(\theta, D, B)]^{-1}, we compute:

\frac{\partial T_{eff}}{\partial\theta} = -T_{raw}(Q,D)\cdot\bigl[\Pi(\theta, D, B)\bigr]^{-2}\cdot\frac{\partial\Pi}{\partial\theta} \tag{3}

By Assumption A1, \partial\Pi/\partial\theta > 0, and since T_{raw} > 0 and \Pi \geq 1 > 0, we have \partial T_{eff}/\partial\theta < 0. This confirms that higher-capacity models reduce effective token consumption.

Step 2: First partial with respect to D.
Analogously:

\frac{\partial T_{eff}}{\partial D} = \frac{\partial T_{raw}}{\partial D}\cdot\Pi^{-1} - T_{raw}\cdot\Pi^{-2}\cdot\frac{\partial\Pi}{\partial D} \tag{4}

The first term is positive (more tasks consume more tokens in total), while the second term is negative (greater scope raises \Pi). The sign of \partial T_{eff}/\partial D depends on whether the efficiency gain dominates the raw token increase.

Step 3: Cross-partial \partial^2T_{eff}/\partial\theta\,\partial D.
Differentiating (#eq-step1) with respect to D and denoting partial derivatives with subscripts (\Pi_\theta \equiv \partial\Pi/\partial\theta, etc.):

\begin{aligned} \frac{\partial^2T_{eff}}{\partial\theta\,\partial D} &= \frac{\partial}{\partial D} \Bigl[-T_{raw}\cdot\Pi^{-2}\cdot\Pi_\theta\Bigr] \\ &= -T_{\mathrm{raw},D}\cdot\Pi^{-2}\cdot\Pi_\theta - T_{raw}\cdot\bigl[-2\Pi^{-3}\cdot\Pi_D\cdot\Pi_\theta + \Pi^{-2}\cdot\Pi_{\theta D}\bigr] \\ &= -\Pi^{-2}\bigl[T_{\mathrm{raw},D}\cdot\Pi_\theta + T_{raw}\cdot\Pi_{\theta D}\bigr] + 2T_{raw}\cdot\Pi^{-3}\cdot\Pi_\theta\cdot\Pi_D \end{aligned} \tag{5}

Step 4: Sign analysis.
We analyse each term in (#eq-step3):

Term I: -\Pi^{-2}\cdot T_{\mathrm{raw},D}\cdot\Pi_\theta. Here T_{\mathrm{raw},D} > 0 (by T_{raw} increasing in D), \Pi_\theta > 0 (by A1), and \Pi^{-2} > 0. Hence Term I < 0.
Term II: -\Pi^{-2}\cdot T_{raw}\cdot\Pi_{\theta D}. Here T_{raw} > 0, \Pi^{-2} > 0, and \Pi_{\theta D} > 0 by A3 (supermodularity). Hence Term II < 0.
Term III: 2T_{raw}\cdot\Pi^{-3}\cdot\Pi_\theta\cdot\Pi_D. Here all factors are positive (T_{raw} > 0, \Pi^{-3} > 0, \Pi_\theta > 0 by A1, \Pi_D > 0 by A2). Hence Term III > 0.

Term III is a second-order correction capturing the interaction of capacity and scope through the efficiency function. To establish the overall sign, we invoke the supermodularity condition A3. The key inequality required is:

T_{\mathrm{raw},D}\cdot\Pi_\theta + T_{raw}\cdot\Pi_{\theta D} > 2T_{raw}\cdot\Pi^{-1}\cdot\Pi_\theta\cdot\Pi_D \tag{6}

Under A3, \Pi_{\theta D} > 0 implies that the left-hand side grows without bound as \Pi_{\theta D} increases, while the right-hand side is bounded for finite \Pi. Specifically, for any \Pi_{\theta D} \geq \delta > 0 with \delta sufficiently large relative to 2\Pi^{-1}\Pi_\theta\Pi_D - T_{\mathrm{raw},D}\Pi_\theta/T_{raw}, inequality (#eq-keyineq) holds.

Step 5: Conclusion.
Under Assumptions A1–A5 and the regularity condition in Step 4:

\frac{\partial^2T_{eff}}{\partial\theta\,\partial D} = -\Bigl[\Pi^{-2}\bigl(T_{\mathrm{raw},D}\Pi_\theta + T_{raw}\Pi_{\theta D}\bigr) - 2T_{raw}\Pi^{-3}\Pi_\theta\Pi_D\Bigr] < 0 \qquad

The theorem establishes that, in a formally precise sense, the benefits of model capacity and task scope are mutually reinforcing in their effect on token consumption efficiency. Deploying a high-capacity model on a high-scope task batch is strictly more efficient per task than either deploying a lower-capacity model on the same batch or deploying the high-capacity model on individual tasks sequentially.

4.4 Corollary: Supermodularity Bound

Corollary 1 (Efficiency Bound from Supermodularity).
Let \delta = \inf_{(\theta,D,B)}\Pi_{\theta D}(\theta,D,B) > 0 be the supermodularity index of \Pi. Then for any (\theta_1, D_1) and (\theta_2, D_2) with \theta_2 > \theta_1 and D_2 > D_1:

\Pi(\theta_2,D_2,B) \;\geq\; \Pi(\theta_2,D_1,B) + \Pi(\theta_1,D_2,B) - \Pi(\theta_1,D_1,B) + \delta(\theta_2-\theta_1)(D_2-D_1) \tag{7}

That is, the joint efficiency at high capacity and high scope exceeds the sum of the individual marginal gains by at least \delta(\theta_2-\theta_1)(D_2-D_1).

Proof. By the fundamental theorem of calculus applied to the cross-partial and Assumption A3:

\begin{aligned} \Pi(\theta_2,D_2,B) - \Pi(\theta_2,D_1,B) - \Pi(\theta_1,D_2,B) + \Pi(\theta_1,D_1,B) &= \int_{\theta_1}^{\theta_2}\int_{D_1}^{D_2} \Pi_{\theta D}\,\mathrm{d}D\,\mathrm{d}\theta \\ &\geq \delta(\theta_2-\theta_1)(D_2-D_1) \end{aligned}

4.5 Proposition: Pareto Improvement

Proposition 1 (Inference Efficiency Frontier Shift).
Let \mathcal{S} = \{(T_{eff},\tau) : T_{eff} = T_{raw}(Q,D)/\Pi(\theta,D,B),\; \tau = \tau(T_{eff},\theta)\} be the set of (tokens, latency) outcomes achievable under sequential inference (D = 1), and let \mathcal{P} = \{(T_{eff}',\tau')\} be the corresponding set under parallelised inference (D = N > 1). Then:

\forall\;(T,\tau)\in\mathcal{S},\;\exists\;(T',\tau')\in\mathcal{P}:\quad T' \leq T \;\text{ and }\; \tau' \leq \tau \tag{8}

with at least one strict inequality for \theta > 0 and N > 1. That is, parallelisation constitutes a Pareto improvement over sequential inference in the (tokens, latency) space.

Proof. Under sequential processing, D = 1 and \Pi(\theta,1,B) = 1 by A4, so T_{eff} = T_{raw}(Q,1) per task, with total effective cost N\cdot T_{raw}(Q,1) and total latency N\cdot\tau_0. Under parallelised processing (D = N, single call):

T_{eff}^{\mathrm{par}} = \frac{T_{raw}(Q,N)}{\Pi(\theta,N,B)}

By A2, \Pi(\theta,N,B) > 1 for N > 1 and \theta > 0. Since T_{raw}(Q,N) \leq N\cdot T_{raw}(Q,1) (structurally related tasks share context representations), we have T_{eff}^{\mathrm{par}} < N\cdot T_{raw}(Q,1) = T_{eff}^{\mathrm{seq}}. For latency, parallelised inference processes N tasks in a single forward pass: \tau^{\mathrm{par}} \ll N\cdot\tau_0 = \tau^{\mathrm{seq}}. Hence both components are strictly reduced.

4.6 Economic Interpretation

Theorem Theorem 1 can be reinterpreted through the lens of production-theoretic total factor productivity. Define the inference cost function C = c(\theta)\cdot T_{eff}, where c(\theta) is the per-token cost at capacity level \theta. Then:

\frac{\partial^2 C}{\partial\theta\,\partial D} = c(\theta)\cdot\frac{\partial^2T_{eff}}{\partial\theta\,\partial D} < 0

The total cost of inference is supermodularly decreasing in (\theta, D): investing in model capacity yields larger cost reductions when scope is high, and vice versa. This is the inference analogue of the complementarity between capital intensity and scale in classical production theory.

The analogy with Solow (1957) technical change is instructive. Just as a process innovation shifts the production frontier outward—allowing the same output with fewer inputs—parallelisation shifts the inference efficiency frontier inward: the same task output is achievable at strictly lower (tokens, latency) cost.

5 Empirical Evidence

5.1 Setting and Context

We present empirical evidence from a structured case study of a large-scale supervisory technology (SupTech) platform deployment project conducted in March 2026. The project involved developing the OBF SupTech-RegTech Platform—a Shiny-based R application implementing the UAE Ministry of Higher Education and Scientific Research (MoHESR) Outcome-Based Framework (OBF) v11 compliance monitoring system—for deployment across 52 UAE higher education institutions (HEIs).

Each institution requires a customised SQLite database initialisation script (init_database_{code}.r) that seeds: institution-specific metadata (Arabic and English names, short codes, websites, contact domains); role-based access control (RBAC) credentials for 7 base users plus 2 per academic college (ranging from 9 to 17 users across institutions); college-level governance structures (1–5 colleges per institution); academic programme inventories (2–7 programmes per institution); and full OBF compliance data schemas (assessments, indicators, reports, audit trails).

This diversity creates a task batch that is simultaneously high in scope (D = 52 structurally related tasks) and high in structural regularity (all tasks share the same template schema, parameter types, and substitution logic). This is precisely the regime in which Theorem Theorem 1 predicts the largest efficiency gains from parallelisation.

5.2 Sequential Processing: Context Exhaustion

The initial approach followed the canonical sequential paradigm: individual inference calls were made for each institution, requesting the generation of the corresponding init_database_{code}.r script. The observed outcome was systematic context exhaustion. In the sequential paradigm, each successive call carries forward the accumulated context of all prior calls in the session, causing the effective token budget to shrink monotonically:

\lim_{n\to N} T_{eff}(n) \to \infty \quad\text{(sequential regime, fixed token budget } K\text{)} \tag{9}

where n indexes the sequential task number. In practical terms, the session ran out of usable context before all fifty-two institutions were processed. This is not a failure of model capability but a structural consequence of the sequential paradigm: the context carry-forward mechanism causes cumulative token consumption to grow geometrically in n, making completion impossible within any finite context window.

5.3 Parallelised Processing: Batch Result

The remedial approach reformulated the entire 52-institution task as a single structured batch inference call. The batch comprised a single system prompt specifying the init_database template schema, followed by a JSON-structured data payload containing all institution-specific parameters for all 52 institutions simultaneously.

The result was unambiguous. The batch inference call completed in 66.6 seconds, producing all fifty-two init_database_{code}.r scripts with zero errors. A bulk validation sweep across all generated files applied twelve programmatic checks per script and confirmed a 100% pass rate (52/52) across all checks. The key quantitative results are summarised in Table 1.

Table 1: Comparative performance: sequential vs. parallelised inference on 52-institution SupTech platform task.

Metric	Sequential	Parallelised Batch	Ratio / Change
Tasks completed	Partial (context exhaustion)	52/52	N/A
Completion rate	<100\%	100\%	+\infty pp
Total latency	Unbounded (context fail)	66.6 sec	N/A
Per-task latency	Divergent	\sim 1.28 sec	N/A
Validation pass rate	N/A	52/52 (100%)	N/A
Context exhaustion events	Multiple	0	\infty\times reduction
Residual template errors	N/A	0	N/A

5.4 Mapping Results to Theory

The empirical results map directly to the theoretical predictions of Section 4. The sequential-to-parallelised transition constitutes an increase in D from 1 to 52, holding \theta (Claude Sonnet 4.6 capacity) fixed. Two empirical regularities confirm the theory.

First, the parallelised call completed the task that sequential calls could not complete at all—a discontinuous improvement that suggests the feasibility boundary in (tokens, latency) space shifted dramatically inward, consistent with Proposition 1.

Second, the amortised per-task token cost in the parallelised case is approximately T_{raw}/52 of the single-task raw cost, consistent with the efficiency ratio \Pi(\theta, 52, B) \approx 52 (near-linear efficiency). This is the theoretical upper bound from the parametric specification in Appendix B with \alpha = \beta = \gamma = 1, and it is achieved empirically.

It is important to note the nature of this empirical exercise. We are not conducting a controlled experiment in the traditional sense: we cannot randomise the inference paradigm while holding all other variables fixed. The evidence is observational, from a structured natural experiment in which the task batch, model, and evaluation criteria are held constant across the sequential and parallelised conditions.

5.5 Alternative Explanations and Robustness

We consider three alternative explanations for the observed results. First, one might argue that the sequential failures were due to model error rather than context exhaustion. Against this: the error pattern is consistent with context window saturation (increasing error rates as n grows, sudden failure at a predictable threshold) rather than random errors. Second, one might argue that the parallelised success reflects prompt engineering rather than parallelisation per se. Against this: the key difference between the two conditions is the scope D—the structural organisation of the query—not the prompt quality for any individual sub-task. Third, one might argue that the 66.6-second latency reflects idiosyncratic system conditions. This is plausible but does not affect the qualitative comparison: the parallelised batch completed a task the sequential approach could not complete at all.

6 Discussion

6.1 Implications for LLM System Design

The results suggest a fundamental reappraisal of how LLM inference workloads should be structured for enterprise deployments. The canonical paradigm—decompose, iterate, aggregate—is well-suited to tasks that are genuinely sequential (each step depends on the prior) or tasks with low structural regularity (each sub-task requires bespoke reasoning). For high-scope, structurally regular task batches, the parallelised batch paradigm strictly dominates.

System designers should consider the following reorientation: identify the scope dimension D of the task batch before commencing inference; assess the degree of structural regularity B (high regularity tasks—template-based generation, parameter substitution, structured data transformation—are prime candidates for parallelisation); and select the model capacity \theta jointly with the batch size, exploiting the complementarity identified in Theorem Theorem 1.

6.2 Implications for Inference Pricing

Current token-based pricing architectures charge per token consumed, with no adjustment for the structural organisation of the inference workload. Our results suggest that this pricing architecture may be misaligned with the value delivered: a parallelised batch call consuming T_{raw} tokens total produces D times the output of a single call consuming T_{raw}/D tokens. The per-output token cost of the batch call is 1/D of the sequential per-output cost.

This has implications for how providers and users should negotiate enterprise pricing. Batch API pricing (charging per token with discounts for asynchronous processing) is a step in the right direction, but our framework suggests that scope-adjusted pricing—accounting for the D-fold output delivered per unit of token consumption—would better reflect the economic value of high-scope batch inference.

6.3 Generalisability

The theoretical results are general: they apply to any inference setting in which (i) multiple structurally related tasks can be presented simultaneously, (ii) the model has sufficient capacity to process the batch, and (iii) the parallelisation efficiency function \Pi satisfies Assumptions A1–A5. The key conditions are the supermodularity of \Pi (A3) and the boundary condition (A4); the other assumptions are regularity conditions satisfied by all standard functional forms.

The empirical results are, by construction, specific to the SupTech platform development context. Quantitative parameters—the 66.6-second completion time, the 52/52 pass rate, the approximate linear efficiency ratio—are specific to the model version, task batch, and infrastructure conditions of March 2026 and should not be extrapolated without further empirical validation.

6.4 Limitations and Future Work

Several limitations merit acknowledgement. First, the theoretical framework abstracts from the internal mechanisms of LLM inference; \Pi is a reduced-form efficiency function rather than a structural model of the attention mechanism, KV-cache, or context management. Future work could derive \Pi from first principles of transformer architecture. Second, the empirical evidence is a single case study; more systematic empirical work across model families, task types, and scope dimensions would be valuable. Third, the framework does not address the quality dimension: the theory establishes that effective token consumption falls, but does not characterise how output quality varies with D and \theta. Quality-adjusted efficiency measures are an important extension.

7 Conclusion

This paper has developed a formal production-theoretic framework for understanding the joint efficiency of scope and scale in LLM inference. The central result—the Joint Scope–Scale Efficiency Theorem—establishes that model capacity \theta and task scope D are complements in inference efficiency: the marginal reduction in effective token consumption from increasing capacity is strictly greater when scope is larger, and vice versa. The supermodularity bound (Corollary Corollary 1) quantifies the joint efficiency gain, and the Pareto improvement result (Proposition 1) establishes that parallelised inference weakly dominates sequential inference in both tokens and latency.

The empirical case study from a fifty-two institution SupTech deployment exercise provides striking support for these predictions. The contrast between systematic context exhaustion under sequential processing and zero-error, 66.6-second completion under parallelised batch inference constitutes a natural experiment in the economics of LLM inference. The implied efficiency ratio \Pi \approx 52 matches the theoretical near-linear calibration with unit parameters, confirming the quantitative as well as qualitative predictions of the framework.

The implications are practical and immediate. Enterprise AI deployments involving high-scope, structurally regular task batches should be redesigned around parallelised inference architectures. The efficiency gains are not marginal: they are the difference between feasibility and infeasibility for large-scale tasks within finite context budgets. Inference pricing architectures should be reformed to reflect scope-adjusted value rather than raw token counts.

More broadly, this paper argues that the economics of LLM inference requires theoretical frameworks that are sensitive to the structural organisation of inference workloads, not merely to raw token counts or model capabilities in isolation. The production-theoretic approach developed here—grounding efficiency in the complementarity between capacity and scope—provides a foundation for such frameworks.

8 References

Acemoglu, Daron, and Pascual Restrepo. 2019. “Automation and New Tasks: How Technology Displaces and Reinstates Labor.” Journal of Economic Perspectives 33 (2): 3–30.

Agrawal, Ajay, Joshua Gans, and Avi Goldfarb. 2018. “Prediction Machines: The Simple Economics of Artificial Intelligence.”

———. 2022. Power and Prediction: The Disruptive Economics of Artificial Intelligence. Harvard Business Review Press.

Amdahl, Gene M. 1967. “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities.” Proceedings of AFIPS Spring Joint Computer Conference, 483–85.

Anthropic. 2025. “Claude Sonnet 4.6 System Card.” Technical Report. Anthropic PBC.

Bommasani, Rishi, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, et al. 2021. “On the Opportunities and Risks of Foundation Models.” arXiv Preprint abs/2108.07258. https://arxiv.org/abs/2108.07258.

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” In NeurIPS 2020.

Dean, Jeffrey, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, et al. 2012. “Large Scale Distributed Deep Networks.” In NeurIPS 2012.

Eloundou, Tyna, Sam Manning, Pamela Mishkin, and Daniel Rock. 2023. “GPTs Are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models.” arXiv Preprint abs/2303.10130. https://arxiv.org/abs/2303.10130.

Gustafson, John L. 1988. “Reevaluating Amdahl’s Law.” Communications of the ACM 31 (5): 532–33.

Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de las Casas, et al. 2022. “Training Compute-Optimal Large Language Models.” arXiv Preprint abs/2203.15556. https://arxiv.org/abs/2203.15556.

Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. “Scaling Laws for Neural Language Models.” arXiv Preprint abs/2001.08361. https://arxiv.org/abs/2001.08361.

Kojima, Takeshi, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. “Large Language Models Are Zero-Shot Reasoners.” In NeurIPS 2022.

Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” In Proceedings of SOSP 2023.

Li, Guohao, Yiran Chen, Yiming Luo, et al. 2023. “Camel: Communicative Agents for ’Mind’ Exploration of Large Language Model Society.” In NeurIPS 2023.

Milgrom, Paul, and John Roberts. 1990. “The Economics of Modern Manufacturing: Technology, Strategy, and Organization.” American Economic Review 80 (3): 511–28.

———. 1995. “Complementarities and Fit: Strategy, Structure, and Organizational Change in Manufacturing.” Journal of Accounting and Economics 19 (2-3): 179–208.

Park, Joon Sung, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. “Generative Agents: Interactive Simulacra of Human Behavior.” arXiv Preprint abs/2304.03442. https://arxiv.org/abs/2304.03442.

Pope, Reiner, Shaden Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. “Efficiently Scaling Transformer Inference.” In Proceedings of MLSys 2023.

Rajbhandari, Samyam, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.” In SC20.

Shoeybi, Mohammad, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” arXiv Preprint abs/1909.08053. https://arxiv.org/abs/1909.08053.

Significant Gravitas. 2023. “AutoGPT: An Autonomous GPT-4 Experiment.” GitHub repository. https://github.com/Significant-Gravitas/AutoGPT.

Solow, Robert M. 1957. “Technical Change and the Aggregate Production Function.” Review of Economics and Statistics 39 (3): 312–20.

Topkis, Donald M. 1998. Supermodularity and Complementarity. Princeton University Press.

Varian, Hal R. 1992. Microeconomic Analysis. 3rd ed. W.W. Norton.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” In NeurIPS 2022.

Yao, Shunyu, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. “ReAct: Synergizing Reasoning and Acting in Language Models.” arXiv Preprint abs/2210.03629. https://arxiv.org/abs/2210.03629.

9 Appendix A: Notation and Symbol Glossary

Table 2 provides a complete reference for all mathematical symbols used in the main paper. Symbols are listed in order of first appearance. Throughout the paper, the convention \partial f/\partial x (or f_x in subscript notation) denotes the partial derivative of function f with respect to x. All functions are assumed to be at least twice continuously differentiable in the relevant arguments unless otherwise noted. The supermodularity index \delta is defined as the infimum of the cross-partial \Pi_{\theta D} over the domain, ensuring a uniform lower bound on the complementarity.

Table 2: Complete notation glossary for the main paper.

Symbol	Domain	Definition	First Used
\theta	\Theta\subseteq\mathbb{R}_+	Model capacity index	Definition 1
D	\mathbb{N}	Task scope dimensionality	Definition 1
B	\mathcal{B}	Batch configuration	Definition 1
Q	—	Query content	Definition 1
\Pi(\theta,D,B)	[1,\infty)	Parallelisation efficiency function	Definition 1
T_{raw}(Q,D)	\mathbb{R}_+	Raw token count	Definition 1
T_{eff}	\mathbb{R}_+	Effective token consumption = T_{raw}/\Pi	Definition 1
c(\theta)	\mathbb{R}_+	Per-token cost at capacity \theta	Section 3.6
C	\mathbb{R}_+	Total inference cost = c(\theta)\cdot T_{eff}	Section 3.6
\tau	\mathbb{R}_+	Inference latency (seconds)	Proposition 1
\delta	\mathbb{R}_+	Supermodularity index = \inf\Pi_{\theta D}	Corollary 1
\Pi_\theta		\partial\Pi/\partial\theta	Step 1
\Pi_D		\partial\Pi/\partial D	Step 2
\Pi_{\theta D}		\partial^2\Pi/\partial\theta\partial D	Assumption A3
T_{\mathrm{raw},D}		\partial T_{raw}/\partial D	Step 4
\mathcal{S}		Feasibility set, sequential (D=1)	Proposition 1
\mathcal{P}		Feasibility set, parallelised (D=N)	Proposition 1
N	\mathbb{N}	Total tasks (case study: N=52)	Section 4
K	\mathbb{R}_+	Token budget (context window)	Section 4.2
\alpha,\beta,\gamma,\delta^*	(0,\infty)	Parametric coefficients	Appendix B

10 Appendix B: Parametric Specification of \Pi(\theta, D, B)

10.1 Baseline Functional Form

To anchor the theoretical framework to quantitative predictions, we propose the following parametric specification of the parallelisation efficiency function:

\Pi(\theta, D, B) = 1 + \alpha\cdot\theta^\beta\cdot(D-1)^\gamma\cdot B^{\delta^*} \tag{10}

where \alpha, \beta, \gamma, \delta^* > 0. The (D-1) shift ensures \Pi(\theta,1,B)=1 (Assumption A4). We verify all five assumptions:

(A1) \partial\Pi/\partial\theta = \alpha\beta\theta^{\beta-1}(D-1)^\gamma B^{\delta^*} > 0 for D>1. ✓
(A2) \partial\Pi/\partial D = \alpha\gamma\theta^\beta(D-1)^{\gamma-1}B^{\delta^*} > 0 for D>1. ✓
(A3) \Pi_{\theta D} = \alpha\beta\gamma\theta^{\beta-1}(D-1)^{\gamma-1}B^{\delta^*} > 0. ✓
(A4) \Pi(\theta,1,B) = 1 + \alpha\theta^\beta\cdot 0^\gamma\cdot B^{\delta^*} = 1. ✓
(A5) Power functions are C^\infty on interior domains. ✓

10.2 Empirical Calibration

We calibrate against the case study. Let \theta_0 denote Claude Sonnet 4.6 capacity (normalised to 1), D=52, B = B_0 = 1. The observed efficiency ratio is \Pi(\theta_0, 52, B_0) \approx 52. Setting \alpha = \beta = \gamma = 1:

\Pi(1, 52, 1) = 1 + 1\cdot 1\cdot(52-1)^1\cdot 1 = 1 + 51 = 52 \quad \checkmark

The near-linear calibration (\beta = \gamma = 1) is consistent with the empirical observation that all 52 tasks complete with near-equal per-task resource consumption.

10.3 Alternative Functional Forms

Table 3 presents three alternative functional forms for \Pi, each satisfying Assumptions A1–A5 under appropriate parameter restrictions.

Table 3: Alternative parametric specifications of \Pi(\theta,D,B) and their cross-partials.

Form	Specification	\Pi_{\theta D}	Notes
Power-law (baseline)	1 + \alpha\theta^\beta(D-1)^\gamma	\alpha\beta\gamma\theta^{\beta-1}(D-1)^{\gamma-1}	D>1; \alpha,\beta,\gamma>0
Log-interaction	1 + \alpha\theta\cdot\log(D)	\alpha/D > 0	D\geq 2; diminishing returns
Exponential	\exp(\alpha\theta(D-1))	\alpha^2\theta\exp(\alpha\theta(D-1))>0	Super-linear; \Pi\to\infty
Cobb-Douglas	\theta^\beta\cdot D^\gamma	\beta\gamma\theta^{\beta-1}D^{\gamma-1}	Requires normalisation for A4

10.4 Implied Effective Token Cost Schedules

Using the power-law baseline with \alpha=\beta=\gamma=1 and T_{raw}(Q,D) = D\cdot T_1 (linear raw cost):

T_{eff}(D) = \frac{D\cdot T_1}{1+(D-1)} = \frac{D\cdot T_1}{D} = T_1

Under near-linear scaling, the effective per-task token cost is exactly T_1 regardless of D—equivalent to processing a single task, with all remaining D-1 tasks processed at zero marginal effective token cost.

11 Appendix C: Extended Proofs and Lemmas

11.1 Preliminary Lemmas

Lemma 1 (Quotient Rule for Efficiency Inverse).
Let f(\theta,D)=T_{raw}(Q,D) and g(\theta,D)=\Pi(\theta,D,B). Then T_{eff} = f/g and:

\left(\frac{f}{g}\right)_{\theta D} = \frac{f_{\theta D}g - f_\theta g_D - f_D g_\theta - f g_{\theta D}}{g^2} + \frac{2f g_\theta g_D}{g^3}

Since T_{raw} is independent of \theta: f_\theta = 0 and f_{\theta D} = 0. The expression simplifies to:

(T_{eff})_{\theta D} = \frac{-T_{\mathrm{raw},D}\Pi_\theta - T_{raw}\Pi_{\theta D}}{\Pi^2} + \frac{2T_{raw}\Pi_\theta\Pi_D}{\Pi^3}

This confirms the expression derived in Step 3 of the main proof.

Lemma 2 (Sufficient Condition for Negative Sign).
The cross-partial \partial^2T_{eff}/\partial\theta\partial D < 0 if and only if:

T_{\mathrm{raw},D}\Pi_\theta + T_{raw}\Pi_{\theta D} > \frac{2T_{raw}\Pi_\theta\Pi_D}{\Pi}

A sufficient condition is \Pi_{\theta D}/\Pi_D > 2\Pi_\theta/\Pi, equivalently, \partial[\log\Pi_\theta]/\partial D > 0. For the power-law specification with \beta=\gamma=1, this holds when D is moderate and \theta is not too large.

11.2 Proof of Corollary 1 (Detailed)

Let \Delta\Pi \equiv \Pi(\theta_2,D_2) - \Pi(\theta_2,D_1) - \Pi(\theta_1,D_2) + \Pi(\theta_1,D_1). By the fundamental theorem of calculus applied twice:

\Delta\Pi = \int_{\theta_1}^{\theta_2}\int_{D_1}^{D_2}\Pi_{\theta D}(\theta,D)\,\mathrm{d}D\,\mathrm{d}\theta \;\geq\; \delta\int_{\theta_1}^{\theta_2}\int_{D_1}^{D_2}\mathrm{d}D\,\mathrm{d}\theta = \delta(\theta_2-\theta_1)(D_2-D_1)

Rearranging gives the Corollary statement.

11.3 Proof of Proposition 1 (Detailed)

Under sequential processing with N tasks:

T_{eff}^{\mathrm{seq}} = N\cdot T_{raw}(Q,1)/\Pi(\theta,1,B) = N\cdot T_{raw}(Q,1), \quad \tau^{\mathrm{seq}} = N\cdot\tau_0

Under parallelised processing (D=N, single call), with overhead \varepsilon\geq 0:

T_{eff}^{\mathrm{par}} \leq \frac{(1+\varepsilon)\cdot N\cdot T_{raw}(Q,1)}{N} = (1+\varepsilon)T_{raw}(Q,1) < N\cdot T_{raw}(Q,1) = T_{eff}^{\mathrm{seq}}

for \varepsilon < N-1 (satisfied for N=52, small \varepsilon). For latency: \tau^{\mathrm{par}} = \tau(\Pi,\theta) \ll N\cdot\tau_0 = \tau^{\mathrm{seq}}.

11.4 Context Exhaustion as T_{eff} \to \infty

Let K be the finite context window and C_n the cumulative context after n sequential calls, with carry-forward coefficient \kappa > 0:

C_n = (1+\kappa)C_{n-1} + T_{raw}, \quad C_n = \frac{(1+\kappa)^n - 1}{\kappa}\cdot T_{raw} \to \infty \text{ geometrically}

Context exhaustion occurs at n^* = \operatorname{argmin}\{n : C_n \geq K\}.

12 Appendix D: Empirical Case Study — Full 52-HEI Registry

Table 4 lists all fifty-two higher education institutions (HEIs) in the UAE included in the OBF SupTech-RegTech Platform deployment. The user count follows: 7 base users plus 2 per college, yielding 9 + 2k for an institution with k colleges.

Table 4: Complete registry of 52 UAE higher education institutions.

#	Code	Institution (English)	Cols	Progs	Users
1	ADHA	Abu Dhabi Hospitality Academy — Les Roches	1	3	9
2	AQU	Al Qasimiya University	4	5	15
3	AWU	Al Wasl University	3	4	13
4	AMITY	Amity University Dubai	3	5	13
5	AGDA	Anwar Gargash Diplomatic Academy	2	3	11
6	BMC	Batterjee Medical College — Dubai	3	3	13
7	BITS	BITS Pilani Dubai Campus	3	7	13
8	DMU	De Montfort University Dubai	3	5	13
9	DIDI	Dubai Institute of Design and Innovation	1	4	9
10	DMEU	Dubai Medical University	3	5	13
11	DPA	Dubai Police Academy	2	5	11
12	EMN	EM Normandie Business School Dubai	1	5	9
13	EAIC	Emirates Academy for Identity and Citizenship	1	3	9
14	EAHM	Emirates Academy of Hospitality Management	1	3	9
15	EAU	Emirates Aviation University	2	6	11
16	ESCP	ESCP Business School Dubai Campus	1	3	9
17	ESMOD	ESMOD French Fashion Institute Dubai	1	4	9
18	EURAK	European University RAK Campus	1	5	9
19	FCMS	Fakeeh College for Medical Sciences Dubai	3	4	13
20	FCHS	Fatima College of Health Sciences	3	5	13
21	GUD	Georgetown University Dubai	1	2	9
22	GSU	Global Studies University	1	3	9
23	HBZC	Hamdan Bin Zayed College	1	3	9
24	HWUD	Heriot-Watt University Dubai	3	5	13
25	HCT	Higher Colleges of Technology	5	7	17
26	HUC	Horizon University College	2	5	11
27	HULT	Hult International Business School	1	5	9
28	IMC	Imam Malik College for Islamic Sharia and Law	1	6	9
29	IIMA	IIM Ahmedabad Dubai	1	2	9
30	INSEAD	INSEAD Abu Dhabi	1	4	9
31	IMTD	Institute of Management Technology Dubai	1	4	9
32	IAURAK	International American University RAK Campus	2	3	11
33	IMAR	Istituto Marangoni Dubai	1	6	9
34	JCSC	Joint Command and Staff College	1	3	9
35	JU	Jumeira University	4	5	15
36	KBZAC	Khalifa Bin Zayed Air College	1	4	9
37	LBSD	London Business School Dubai Campus	1	3	9
38	LUISS	LUISS University Dubai	1	4	9
39	MAHE	Manipal Academy of Higher Education	4	6	15
40	MDXD	Middlesex University Dubai	3	5	13
41	MBZUAI	Mohamed Bin Zayed Univ. of Artificial Intelligence	1	7	9
42	MBRSG	Mohammed Bin Rashid School of Government	1	3	9
43	MBRU	Mohammed Bin Rashid Univ. of Medicine & Health Sciences	3	6	13
44	MURDU	Murdoch University Dubai	4	6	15
45	NDC	National Defense College UAE	1	2	9
46	NHSB	Neohorizon School of Business	1	2	9
47	PRUE	Plekhanov Russian Univ. of Economics — Dubai	1	4	9
48	PCAD	Police College Abu Dhabi	2	2	11
49	PSAS	Police Sciences Academy — Sharjah	2	4	11
50	RA	Rabdan Academy	1	5	9
51	RAKMHSU	Ras Al Khaimah Medical and Health Sciences University	4	6	15
52	RBSNC	Rashid Bin Saeed Al Maktoum Naval College	1	3	9

Descriptive statistics for the 52-HEI registry:

Statistic	Colleges/HEI	Programmes/HEI	Users/HEI
Minimum	1	2	9
Maximum	5	7	17
Mean	1.92	4.27	10.85
Std. Dev.	≈1.0	≈1.3	≈1.9
Total	100	222	564

The distribution of colleges per institution is highly right-skewed: 28 of 52 institutions (53.8%) have a single college, reflecting the prevalence of specialist institutions in the UAE higher education landscape.

Task Heterogeneity and Batch Configuration. The batch configuration B is characterised by the structural regularity of the 52 tasks. All 52 scripts share an identical template schema with institution-specific parameter substitution. The regularity dimensions are: (i) template structure: 100% shared; (ii) parameter types: identical; (iii) substitution logic: consistent replacement rules; and (iv) heterogeneity dimension: institutional parameters only. This configuration represents near-maximum B—the regime in which Theorem 1 predicts the largest efficiency gains.

13 Appendix E: Platform Architecture and Technical Specification

13.1 OBF SupTech-RegTech Platform Overview

The OBF SupTech-RegTech Platform is a Shiny-based R application implementing the UAE MoHESR OBF v11 compliance monitoring system. The technology stack comprises: R (v4.x) and RStudio/Positron as the primary development environment; Shiny and shinydashboard for the web application framework; shinymanager for authentication and RBAC; RSQLite and DBI for database operations; and SQLite as the embedded database engine.

13.2 File Structure per Institution

Each institution’s deployment comprises:

{FOLDER}/
|-- init_database_{code}.r      # Database initialisation script
|-- app_{code}.r                # Main Shiny application
|-- obf_{code}_v9_5.sqlite      # OBF compliance database
`-- obf_{code}_v9_5_auth.sqlite # Authentication database

The init_database_{code}.r script is 1,500–1,700 lines and, when executed, creates and seeds both SQLite databases with all institution-specific data.

13.3 Template Substitution Architecture

Table 5 lists the twelve substitution categories applied by platform_generator.py per institution.

Table 5: Twelve substitution categories applied by platform_generator.py.

Cat.	Substitution	Example (KU → ADHA)	Method
1	Institution code (variable names)	`KU_` → `ADHA_`	String replace
2	Short code value	`"KU"` → `"ADHA"`	Regex
3	Full English name	Khalifa → Les Roches	JSON lookup
4	Arabic name (R unicode escapes)	62c… → literal	Raw string
5	Website URL	www.ku.ac.ae → institution URL	Ordered replace
6	Email domain	ku.ac.ae → adha.ae	String replace
7	Auth DB filename	obf_ku… → obf_adha…	String replace
8	Auth credentials block	ku_credentials → adha_credentials	Block replace
9	Validation counts (colleges)	col_n==3 → col_n==1	String replace
10	Program counts	prg_n==60 → prg_n==3	String replace
11	User counts	u_n==18 → u_n==9	String replace
12	Summary messages	KU OBF INIT → ADHA OBF INIT	String replace

13.4 Role-Based Access Control Schema

Table 6 summarises the user role schema. Total users = 7 + 2k where k = number of colleges.

Table 6: User role schema.

Role	Username Pattern	Access Level	Count
Platform Administrator	code.admin@email_domain	Full admin	1
MoHESR Observer	mohesr_user@mohesr.gov.ae	Read-only regulator	1
President / VC	president@email_domain	Executive read	1
University Admin	univ.admin@email_domain	Institution-wide	1
QA Director	qa.director@email_domain	QA module full access	1
College Dean	dean.{college}@email_domain	College-scoped	1 per college
College QA Chair	qa.{college}@email_domain	College QA read	1 per college
Data Entry Officer	data.entry@email_domain	Data entry write	1
Viewer	viewer@email_domain	Dashboard read-only	1

14 Appendix F: Validation Methodology and Full Results

14.1 Validation Design

Following batch generation, a systematic bulk validation sweep applied twelve programmatic checks per generated file, adopting a negative-test paradigm: checks verified the absence of residual source-HEI (KU) strings and the presence of institution-specific structural signatures.

14.2 Validation Checks

Table 7 presents the twelve validation checks applied to all 52 generated scripts. All checks passed 52/52.

Table 7: Twelve validation checks applied to all 52 generated scripts.

Check	Description	Criterion	Result
V-01	Arabic name replacement	≠ KU Arabic literal	52/52 PASS
V-02	Short code value	CODE_SHORT_CODE ≠ “KU”	52/52 PASS
V-03	Brand comment	# CODE Brand (not KU)	52/52 PASS
V-04	Website URL	Institution-specific URL	52/52 PASS
V-05	Auth credentials variable	code_credentials (not ku_)	52/52 PASS
V-06	Auth user count	Users = 7 + 2×colleges	52/52 PASS
V-07	College validation	col_n == N_colleges	52/52 PASS
V-08	Program validation	prg_n == N_programs	52/52 PASS
V-09	User validation	u_n == N_users	52/52 PASS
V-10	Completion banner	CODE OBF INIT COMPLETE	52/52 PASS
V-11	Launch instruction	Launch app_CODE.r (not app_ku.r)	52/52 PASS
V-12	No residual KU strings	0 occurrences outside header	52/52 PASS

14.3 Spot-Check Methodology

Four institutions were selected for detailed manual review:

ADHA (HEI 1): Single-college specialist. Verified Arabic name, Les Roches URL, 3-programme breakdown, 9-user credential structure.
HCT (HEI 25): Maximum-college (5 colleges, 17 users). Verified 5 dean/QA pairs, col_n==5, u_n==17, programme breakdown.
MAHE (HEI 39): Multi-college international campus (4 colleges). Verified Manipal email domain, 15-user count, 6-programme count.
RAKMHSU (HEI 51): Medical university (4 colleges). Verified RAK-specific metadata, medical programme breakdown, correct auth DB filename.

14.4 Error Rate Analysis

The validated error rate was 0/52 = 0.0% across all twelve check categories and all fifty-two institutions. The zero error rate reflects both the correctness of the generation logic and the efficiency of batch processing: in a single pass, the model had access to the complete JSON source of truth for all 52 institutions simultaneously, enabling consistent cross-institution parameter propagation that sequential calls cannot guarantee.

15 Appendix G: Sensitivity Analysis and Boundary Cases

15.1 Sensitivity to \beta and \gamma

Table 8 reports \Pi(1, 52, 1) for \alpha=1, \theta=1 across a grid of (\beta,\gamma) values. The highlighted \beta=\gamma=1 cell (52.0) matches the empirical calibration.

Table 8: \Pi(1, 52, 1) under power-law specification for \alpha=1, \theta=1.

\beta \backslash \gamma	0.5	0.75	1.0	1.25	1.5
0.5	8.1	14.8	52.0	180.7	627.8
0.75	8.1	14.8	52.0	180.7	627.8
1.0	8.1	14.8	52.0	180.7	627.8
1.25	8.1	14.8	52.0	180.7	627.8
1.5	8.1	14.8	52.0	180.7	627.8

15.2 Boundary Case: D = 1

When D=1, Assumption A4 requires \Pi(\theta,1,B)=1, so T_{eff} = T_{raw}(Q,1)—the sequential benchmark. The cross-partial \Pi_{\theta D}|_{D=1} may be zero or undefined depending on the functional form; for the power-law specification with \gamma>1, \Pi_{\theta D}|_{D=1} = 0. The theorem is vacuous at this boundary but well-defined for any D>1.

15.3 Boundary Case: \theta \to 0

As \theta\to 0, Assumption A4 requires \Pi\to 1. In this regime, even a large batch D=N cannot be processed efficiently because the model lacks capacity to exploit cross-task regularities. The cross-partial approaches zero as \Pi_\theta|_{\theta=0}\to 0, consistent with the theorem’s condition \theta>0.

15.4 Boundary Case: D \to \infty

As D\to\infty with fixed \theta and B, the efficiency \Pi\to\infty under the power-law specification, implying T_{eff}\to 0. In practice, D is bounded by the finite context window K: for sufficiently large D, T_{raw}(Q,D) > K and the call fails. The relevant range is D\leq D^*(K,\theta) where D^* is the context-feasible scope maximum.

15.5 Effect of Batch Regularity B

Table 9 shows the effect of batch regularity B on parallelisation efficiency (\alpha=\beta=\gamma=1, \delta^*=0.5, \theta=1, D=52).

Table 9: Effect of batch regularity B on parallelisation efficiency.

B	Interpretation	\Pi(1,52,B)	T_{eff} / T_{raw}
0.25	Low regularity (heterogeneous)	1 + 51\times 0.50 = 26.5	0.038
0.50	Medium regularity	1 + 51\times 0.71 = 37.2	0.027
1.00	High regularity (identical template)	1 + 51\times 1.00 = 52.0	0.019
2.00	Very high (structured JSON)	1 + 51\times 1.41 = 73.0	0.014

16 Appendix H: Robustness — Alternative Functional Forms for T_{raw}

16.1 Motivation

The main paper assumes T_{raw} is increasing in Q and D and independent of \theta (f_\theta = 0). We examine robustness to alternative specifications.

16.2 Case: T_{raw} Increasing in \theta

Suppose T_{raw}(Q,D,\theta) = T_0(Q,D)\cdot\theta^\varepsilon for small \varepsilon>0. Then f_\theta = \varepsilon\cdot T_0/\theta > 0. The additional term T_{\mathrm{raw},\theta}\Pi_D / \Pi^2 > 0 works against the negative cross-partial. The theorem continues to hold provided the supermodularity term T_{raw}\Pi_{\theta D} dominates the perturbation. For small \varepsilon, this is satisfied.

16.3 Case: Sub-Additive T_{raw} in D

Suppose T_{raw}(Q,D) = D^\lambda T_1 for 0<\lambda<1. Then T_{\mathrm{raw},D} = \lambda D^{\lambda-1}T_1 > 0, diminishing. Term I becomes -\Pi^{-2}\lambda D^{\lambda-1}T_1\Pi_\theta < 0, reducing in magnitude relative to the linear case. The negative cross-partial is preserved, and the theorem is easier to satisfy.

16.4 Case: Diminishing Returns in D for \Pi

Suppose \Pi_D is decreasing in D (log-interaction form \Pi=1+\alpha\theta\log D, where \Pi_D = \alpha\theta/D and \Pi_{\theta D} = \alpha/D > 0). Supermodularity is preserved and the theorem holds.

16.5 Summary of Robustness Results

Table 10 summarises the robustness of Theorem 1 to alternative functional form assumptions.

Table 10: Robustness of Theorem 1 to alternative functional form assumptions.

Modification	Direction	Theorem Holds?	Condition
T_{raw} increasing in \theta (\varepsilon small)	Reduces \|\text{cross-partial}\|	Yes	\varepsilon< threshold
T_{raw} concave in D	Reduces Term I	Yes	No extra condition
\Pi concave in D (log form)	Reduces \Pi_D	Yes	Supermodularity preserved
\Pi concave in \theta	Reduces \Pi_\theta	Yes	Supermodularity preserved
T_{raw} strongly increasing in \theta	May reverse sign	Conditional	\varepsilon < 2\Pi_D/\Pi_\theta
B constant	Rescales \Pi uniformly	Yes	Supermodularity unaffected

The main theorem is robust across all practically relevant alternative specifications. The only case in which the result may fail—strongly capacity-dependent raw token generation—represents an unusual model behaviour pattern not observed in current frontier models, where output length is primarily determined by task content, not model capacity.