Joint Scope–Scale Efficiency in Token Consumption under Parallelised Inference: Theory and Empirical Evidence from Large Language Model Systems

Production-Theoretic Framework for LLM Inference Efficiency with Empirical Evidence from a 52-Institution SupTech Deployment

Inference Economics
Introduces a production-theoretic framework proving that model capacity and task scope are complements in LLM inference efficiency (Joint Scope–Scale Efficiency Theorem). Empirical evidence from a 52-institution UAE SupTech platform deployment shows parallelised batch inference achieves a ~52× efficiency ratio over sequential processing, with 100% validation pass rate in 66.6 seconds.
Author: Ibrahim Niankara

Affiliation: Al Ain University, College of Business; Brass Digital Lab, Abu Dhabi, UAE

Published: 8 April 2026

1 Abstract

We develop a formal production-theoretic framework for understanding how task dimensionality (scope) and model capacity (scale) jointly determine effective token consumption during large language model (LLM) inference. Introducing the parallelisation efficiency function \Pi(\theta, D, B), where \theta indexes model capacity, D is task scope dimensionality, and B is batch configuration, we prove the Joint Scope–Scale Efficiency Theorem: under regularity conditions including supermodularity of \Pi, the cross-partial derivative of effective token consumption with respect to model capacity and task scope is strictly negative, implying that capacity and scope are complements in inference efficiency. We establish a supermodularity bound on joint efficiency gains (Corollary 1) and show that parallelised inference constitutes a Pareto improvement over sequential inference in the (tokens, latency) space (Proposition 1). Empirical evidence from a 52-institution supervisory technology (SupTech) platform deployment exercise in March 2026 is fully consistent with the theoretical predictions: whereas sequential inference produced systematic context exhaustion, parallelised batch inference completed all 52 tasks in 66.6 seconds with a 100% validation pass rate. The implied parallelisation efficiency ratio \Pi \approx 52 matches the near-linear theoretical calibration. We discuss implications for LLM system design, inference pricing architecture, and the broader economics of artificial intelligence.

Keywords: large language models; inference efficiency; parallelisation; token consumption; supermodularity; scaling laws; computational economics; batch processing; scope economies; scale economies

JEL Codes: C61, D24, L86, O33

2 Introduction

Large language models have emerged as general-purpose cognitive infrastructure, deployed across an extraordinary range of tasks from legal drafting and code generation to scientific reasoning and institutional governance. As these deployments scale from individual interactions to enterprise-grade, high-throughput systems, the question of inference efficiency—how many tokens are consumed to accomplish a task of given complexity—acquires first-order economic importance. Token consumption determines computational cost, latency, and the feasibility of deploying high-capability models at scale. Yet the economics of token consumption remain theoretically underdeveloped, particularly with respect to how the structural organisation of inference workloads interacts with model capacity to determine overall efficiency.

The standard paradigm for deploying LLMs on complex, multi-dimensional tasks follows a sequential decomposition logic: break a large problem into constituent sub-tasks, submit them iteratively, and aggregate the outputs. This approach has an intuitive appeal—it mirrors how human experts decompose complex projects—and is well-suited to tasks that are genuinely sequential (each step depends on the output of the prior). However, for tasks that are structurally parallel—meaning the sub-tasks share a common template, schema, or context but differ only in specific parameter values—the sequential paradigm is surprisingly inefficient. Each call re-establishes context, re-transmits shared structural information, and fails to exploit the cross-task regularities that a sufficiently capable model could leverage in a single pass.

The inefficiency arises from a fundamental complementarity between scope and scale in LLM inference. When a high-capacity model processes a batch of structurally related tasks in parallel, it can share representations across tasks—encoding the common structural template once and applying it to all instances—thereby reducing effective token consumption per task dramatically. This complementarity is not just a feature of specific architectures; it reflects a deeper production-theoretic property: scale (model capacity) and scope (task dimensionality) are supermodular in their effect on inference efficiency.

To develop this argument rigorously, we draw on three intellectual traditions. First, the empirical scaling laws literature (Kaplan et al. 2020; Hoffmann et al. 2022) establishes systematic relationships between model capacity, training compute, and task performance. Second, the economics of complementarities and supermodularity (Milgrom and Roberts 1990; Topkis 1998) provides the formal tools for characterising joint productivity gains from simultaneously increasing scope and scale. Third, the parallel computing literature (Amdahl 1967; Gustafson 1988) supplies the conceptual framework for understanding speedup and efficiency under parallelisation, which we adapt to the LLM inference setting.

The empirical motivation for this paper emerged directly from an extended inference session conducted in March 2026 using Claude Sonnet 4.6 (Anthropic 2025). The task involved the generation of fifty-two institution-specific database initialisation scripts for a national supervisory technology platform serving UAE higher education institutions. The contrast between the sequential approach (which produced systematic context exhaustion and could not complete the task) and the parallelised batch approach (which completed all 52 scripts in 66.6 seconds with zero errors) provided a stark natural experiment in the economics of LLM inference.

This paper makes four primary contributions. First, we introduce the parallelisation efficiency function \Pi(\theta, D, B) as a formal object, characterise its properties, and prove the Joint Scope–Scale Efficiency Theorem, establishing that model capacity and task scope are complements in the production of inference efficiency. Second, we derive a supermodularity bound (Corollary 1) that quantifies the minimum joint efficiency gain from simultaneously increasing both scope and capacity. Third, we establish a Pareto improvement result (Proposition 1) showing that parallelised inference weakly dominates sequential inference in both token consumption and latency. Fourth, we present detailed empirical evidence from a large-scale real-world deployment that is quantitatively consistent with the theoretical predictions.

The remainder of the paper proceeds as follows. Section 3 reviews the related literature. Section 4 develops the theoretical framework, states and proves the main results. Section 5 presents the empirical case study. Section 6 discusses implications and limitations. Section 7 concludes.

3 Related Literature

3.1 Scaling Laws and Inference Economics

The scaling laws literature establishes that LLM performance follows predictable power-law relationships with model size, dataset size, and training compute (Kaplan et al. 2020). Hoffmann et al. (2022) refine these laws, showing that compute-optimal training allocates resources more evenly across model size and training tokens than earlier work suggested. These results characterise the training frontier; our contribution extends the scaling perspective to the inference dimension and, specifically, to the organisation of inference workloads. The inference efficiency literature (Pope et al. 2023; Kwon et al. 2023) focuses on hardware and systems-level optimisations—tensor parallelism, KV-cache management, attention decomposition—but has not developed a production-theoretic account of how task structure interacts with model capacity to determine token consumption.

3.2 Complementarities and Supermodularity

Milgrom and Roberts (1990) introduced the formal concept of complementarities in production systems, showing that when activities are supermodular, adopting them jointly dominates adopting them individually in terms of organisational performance. The mathematical treatment of supermodularity (Topkis 1998) and its application to comparative statics (Milgrom and Roberts 1995) provide the formal tools for our main theorem. The concept of supermodularity has seen application in diverse economic contexts, from contract theory (Varian 1992) to the theory of the firm (Milgrom and Roberts 1995), but has not previously been applied to LLM inference. The closest application is in the economics of distributed systems, where task complementarities determine the optimal degree of centralisation.

3.3 Parallel and Distributed Computing

Amdahl (1967) establishes an upper bound on speedup from parallelisation as a function of the serial fraction of the computation. Gustafson (1988) challenges this bound by noting that problem size itself scales with the number of processors, yielding near-linear speedup in practice. Parallel LLM inference has been explored through tensor parallelism (Shoeybi et al. 2019), pipeline parallelism (Dean et al. 2012), and ZeRO-style memory optimisation (Rajbhandari et al. 2020). Our contribution is distinct: rather than hardware parallelism (distributing computation across devices), we analyse workload parallelism—the efficiency gains from presenting structurally related tasks as a batch to a single model instance.

3.4 Foundation Models and Agentic Systems

Bommasani et al. (2021) characterise foundation models as a qualitatively new paradigm, emphasising their generality and the emergent capabilities that arise at scale. Brown et al. (2020) demonstrate few-shot learning, showing that model capacity enables in-context learning from examples without gradient updates. Chain-of-thought prompting (Wei et al. 2022) and zero-shot reasoning (Kojima et al. 2022) illustrate that model capacity interacts with prompt structure to produce qualitatively richer outputs than raw token counts would suggest. The emerging literature on LLM agents (Yao et al. 2023; Park et al. 2023; Significant Gravitas 2023) raises the question of how sequential agent loops consume tokens over extended interactions. Li et al. (2023) examine multi-agent communication overhead. Our contribution is complementary: rather than the total token consumption of agent loops, we focus on the single-call efficiency gains from scope expansion.

3.5 Economics of Artificial Intelligence

Agrawal, Gans, and Goldfarb (2018) frame AI as a prediction technology and analyse its economics through the lens of factor complementarity. Their follow-on work (Agrawal, Gans, and Goldfarb 2022) extends this to power dynamics in AI-augmented organisations. Acemoglu and Restrepo (2019) analyse the labour market implications of automation through the lens of task-based models of production; our framework adapts a similar task-theoretic structure to the inference setting. Eloundou et al. (2023) assess labour market exposure to LLMs using task-level analysis, confirming that task structure—not merely capability—determines economic impact.

4 Theoretical Framework

4.1 Primitives and Notation

We model an inference episode as a tuple (Q, \theta, D, B) where Q is the query content, \theta \in \Theta \subseteq \mathbb{R}_+ is a scalar index of model capacity (e.g., effective parameter count), D \in \mathbb{N} is the dimensionality of the task scope (the number of structurally related sub-tasks to be processed simultaneously), and B \in \mathcal{B} is the batch configuration (a summary statistic capturing the degree of structural regularity across the D sub-tasks, e.g., template homogeneity, shared context fraction, parameter type consistency).

Let T_{raw}(Q, D) denote the raw token count required to process query Q with scope dimension D in the absence of any efficiency gains from parallelisation. We assume T_{raw} is jointly increasing in Q (query complexity) and D (task dimensionality): \partial T_{raw} / \partial D > 0. We impose the normalisation T_{raw}(Q, 1) = T_1 > 0 for a single task of content Q. Crucially, T_{raw} is independent of \theta: raw token requirements are determined by query content and scope, not model capacity.

4.2 The Parallelisation Efficiency Function

Definition 1 (Parallelisation Efficiency Function).
The parallelisation efficiency function \Pi: \Theta \times \mathbb{N} \times \mathcal{B} \to [1, \infty) is defined such that:

T_{eff} = \frac{T_{raw}(Q,D)}{\Pi(\theta, D, B)} \tag{1}

where \Pi(\theta, D, B) \geq 1 for all (\theta, D, B), with equality if and only if D = 1 (no scope for parallelisation) or \theta \to 0 (zero model capacity).

The efficiency function \Pi captures the factor by which effective token consumption is reduced relative to naïve sequential processing. A value of \Pi = k means that parallelised inference consumes only 1/k of the raw token count T_{raw}(Q,D) that the same D tasks would require without parallelisation.
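Definition 1 leaves the functional form of \Pi open. The sketch below is a minimal illustration under an assumed form, \Pi = 1 + b\cdot\frac{\theta}{1+\theta}\cdot(D-1) with b \in [0,1] standing in for the batch regularity B. This form is hypothetical (it is not the paper's parametric specification) and is chosen only because it equals 1 at D = 1 and at \theta = 0 and approaches D as \theta grows with b = 1.

```python
def efficiency(theta: float, d: int, b: float) -> float:
    """Hypothetical Pi(theta, D, B) = 1 + b * theta/(1+theta) * (D - 1).

    theta: capacity index (>= 0); d: task scope (>= 1);
    b: assumed scalar summary of batch regularity, in [0, 1].
    """
    if theta < 0 or d < 1 or not 0.0 <= b <= 1.0:
        raise ValueError("require theta >= 0, d >= 1 and 0 <= b <= 1")
    saturation = theta / (1.0 + theta)  # rises from 0 towards 1 with capacity
    return 1.0 + b * saturation * (d - 1)


def effective_tokens(t_raw: float, theta: float, d: int, b: float) -> float:
    """Equation (1): T_eff = T_raw / Pi."""
    return t_raw / efficiency(theta, d, b)


if __name__ == "__main__":
    # Illustrative values only: 52 tasks, fully regular batch, theta = 9.
    print(efficiency(9.0, 52, 1.0), effective_tokens(52_000.0, 9.0, 52, 1.0))
```

Under this assumed form the efficiency factor never exceeds D, so near-linear efficiency corresponds to the high-capacity, fully regular limit.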

Assumptions (Properties of \Pi).
We impose the following regularity conditions on \Pi:

  • (A1) Monotonicity in capacity: \partial\Pi/\partial\theta > 0 for D > 1. Higher-capacity models generate strictly greater parallelisation efficiency, all else equal. (At D = 1, A4 fixes \Pi = 1 for every \theta.)
  • (A2) Monotonicity in scope: \partial\Pi/\partial D > 0. Greater task scope generates strictly greater parallelisation efficiency, all else equal.
  • (A3) Supermodularity: \partial^2\Pi/\partial\theta\,\partial D > 0. The marginal efficiency gain from increasing scope is strictly increasing in model capacity, and vice versa.
  • (A4) Boundary conditions: \Pi(\theta, 1, B) = 1 for all \theta (no gains from scope = 1) and \lim_{\theta\to 0}\Pi(\theta, D, B) = 1 for all D.
  • (A5) Twice continuously differentiable: \Pi \in C^2(\Theta \times \mathbb{N} \times \mathcal{B}), with the integer scope D treated as a continuous variable for the purposes of differentiation.

Assumption A3 (supermodularity) is the critical condition. It states that scale and scope are complements in the production of inference efficiency: the marginal value of model capacity is increasing in task scope, and the marginal value of task scope is increasing in model capacity.
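One way to see what the assumptions require is to check them numerically for a candidate functional form. The sketch below does this for a hypothetical \Pi = 1 + b\cdot\frac{\theta}{1+\theta}\cdot(D-1) (an assumption made for illustration, not a form taken from the paper), using difference quotients because D is an integer. The grid values are arbitrary.

```python
def pi(theta: float, d: float, b: float = 1.0) -> float:
    """Hypothetical efficiency function used only for this check."""
    return 1.0 + b * (theta / (1.0 + theta)) * (d - 1.0)


def cross_difference(theta: float, d: float,
                     h_theta: float = 0.5, h_d: float = 1.0) -> float:
    """Mixed second difference, the discrete analogue of Pi_{theta D} in A3."""
    return (pi(theta + h_theta, d + h_d) - pi(theta + h_theta, d)
            - pi(theta, d + h_d) + pi(theta, d))


# Arbitrary illustrative grid with D > 1 (A1 is only required there).
grid = [(theta, d) for theta in (0.5, 1.0, 2.0, 4.0) for d in (2, 8, 32)]

a1 = all(pi(t + 0.5, d) > pi(t, d) for t, d in grid)      # increasing in theta
a2 = all(pi(t, d + 1) > pi(t, d) for t, d in grid)        # increasing in D
a3 = all(cross_difference(t, d) > 0 for t, d in grid)     # supermodular
a4 = all(pi(t, 1) == 1.0 for t, _ in grid) and pi(0.0, 32) == 1.0  # boundaries

print({"A1": a1, "A2": a2, "A3": a3, "A4": a4})
```

A grid check of this kind cannot prove the assumptions, but it is a quick way to reject a candidate form that violates them.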

4.3 Main Theorem: Joint Scope–Scale Efficiency

Theorem 1 (Joint Scope–Scale Efficiency Theorem).
Under Assumptions A1–A5, effective token consumption T_{eff} = T_{raw}(Q,D)/\Pi(\theta, D, B) satisfies:

\frac{\partial^2 T_{eff}}{\partial\theta\,\partial D} < 0 \tag{2}

That is, the cross-partial derivative of effective token consumption with respect to model capacity and task scope is strictly negative: the marginal reduction in effective token consumption from increasing model capacity is strictly greater in absolute value when task scope is larger, and vice versa.

Proof. The proof proceeds in five steps.

Step 1: First partial with respect to \theta.
Since T_{eff} = T_{raw}(Q,D)\cdot[\Pi(\theta, D, B)]^{-1}, we compute:

\frac{\partial T_{eff}}{\partial\theta} = -T_{raw}(Q,D)\cdot\bigl[\Pi(\theta, D, B)\bigr]^{-2}\cdot\frac{\partial\Pi}{\partial\theta} \tag{3}

By Assumption A1, \partial\Pi/\partial\theta > 0, and since T_{raw} > 0 and \Pi \geq 1 > 0, we have \partial T_{eff}/\partial\theta < 0. This confirms that higher-capacity models reduce effective token consumption.

Step 2: First partial with respect to D.
Analogously:

\frac{\partial T_{eff}}{\partial D} = \frac{\partial T_{raw}}{\partial D}\cdot\Pi^{-1} - T_{raw}\cdot\Pi^{-2}\cdot\frac{\partial\Pi}{\partial D} \tag{4}

The first term is positive (more tasks consume more tokens in total), while the second term is negative (greater scope raises \Pi). The sign of \partial T_{eff}/\partial D depends on whether the efficiency gain dominates the raw token increase.

Step 3: Cross-partial \partial^2T_{eff}/\partial\theta\,\partial D.
Differentiating (3) with respect to D and denoting partial derivatives with subscripts (\Pi_\theta \equiv \partial\Pi/\partial\theta, etc.):

\begin{aligned} \frac{\partial^2T_{eff}}{\partial\theta\,\partial D} &= \frac{\partial}{\partial D} \Bigl[-T_{raw}\cdot\Pi^{-2}\cdot\Pi_\theta\Bigr] \\ &= -T_{\mathrm{raw},D}\cdot\Pi^{-2}\cdot\Pi_\theta - T_{raw}\cdot\bigl[-2\Pi^{-3}\cdot\Pi_D\cdot\Pi_\theta + \Pi^{-2}\cdot\Pi_{\theta D}\bigr] \\ &= -\Pi^{-2}\bigl[T_{\mathrm{raw},D}\cdot\Pi_\theta + T_{raw}\cdot\Pi_{\theta D}\bigr] + 2T_{raw}\cdot\Pi^{-3}\cdot\Pi_\theta\cdot\Pi_D \end{aligned} \tag{5}

Step 4: Sign analysis.
We analyse each term in (5):

  • Term I: -\Pi^{-2}\cdot T_{\mathrm{raw},D}\cdot\Pi_\theta. Here T_{\mathrm{raw},D} > 0 (by T_{raw} increasing in D), \Pi_\theta > 0 (by A1), and \Pi^{-2} > 0. Hence Term I < 0.
  • Term II: -\Pi^{-2}\cdot T_{raw}\cdot\Pi_{\theta D}. Here T_{raw} > 0, \Pi^{-2} > 0, and \Pi_{\theta D} > 0 by A3 (supermodularity). Hence Term II < 0.
  • Term III: 2T_{raw}\cdot\Pi^{-3}\cdot\Pi_\theta\cdot\Pi_D. Here all factors are positive (T_{raw} > 0, \Pi^{-3} > 0, \Pi_\theta > 0 by A1, \Pi_D > 0 by A2). Hence Term III > 0.

Term III is a second-order correction capturing the interaction of capacity and scope through the efficiency function. To establish the overall sign, we invoke the supermodularity condition A3. The key inequality required is:

T_{\mathrm{raw},D}\cdot\Pi_\theta + T_{raw}\cdot\Pi_{\theta D} > 2T_{raw}\cdot\Pi^{-1}\cdot\Pi_\theta\cdot\Pi_D \tag{6}

Under A3, \Pi_{\theta D} > 0, so the left-hand side of (6) is strictly increasing in \Pi_{\theta D}, while the right-hand side does not depend on \Pi_{\theta D} and is finite for finite \Pi. Specifically, dividing (6) by T_{raw} > 0, the inequality holds whenever \Pi_{\theta D} \geq \delta > 0 with \delta strictly greater than 2\Pi^{-1}\Pi_\theta\Pi_D - T_{\mathrm{raw},D}\Pi_\theta/T_{raw}.

Step 5: Conclusion.
Under Assumptions A1–A5 and the regularity condition in Step 4:

\frac{\partial^2T_{eff}}{\partial\theta\,\partial D} = -\Bigl[\Pi^{-2}\bigl(T_{\mathrm{raw},D}\Pi_\theta + T_{raw}\Pi_{\theta D}\bigr) - 2T_{raw}\Pi^{-3}\Pi_\theta\Pi_D\Bigr] < 0

The theorem establishes that, in a formally precise sense, the benefits of model capacity and task scope are mutually reinforcing in their effect on token consumption efficiency. Deploying a high-capacity model on a high-scope task batch is strictly more efficient per task than either deploying a lower-capacity model on the same batch or deploying the high-capacity model on individual tasks sequentially.
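The three-term expression (5) can be checked numerically. The sketch below assumes a hypothetical efficiency function \Pi = 1 + b\cdot\frac{\theta}{1+\theta}\cdot(D-1) and a linear raw cost T_{raw} = T_1\cdot D (both are illustrative assumptions, as is the value of T_1), evaluates Terms I to III in closed form, and compares their sum with a central finite-difference estimate of the cross-partial.

```python
T1 = 1000.0  # hypothetical single-task token count
B = 1.0      # hypothetical batch regularity in [0, 1]


def s(theta: float) -> float:
    return theta / (1.0 + theta)


def pi(theta: float, d: float) -> float:
    return 1.0 + B * s(theta) * (d - 1.0)


def t_raw(d: float) -> float:
    return T1 * d  # assumed linear in scope


def t_eff(theta: float, d: float) -> float:
    return t_raw(d) / pi(theta, d)


def cross_partial_closed_form(theta: float, d: float) -> float:
    """Sum of Terms I, II and III from equation (5)."""
    p = pi(theta, d)
    pi_theta = B * (d - 1.0) / (1.0 + theta) ** 2
    pi_d = B * s(theta)
    pi_theta_d = B / (1.0 + theta) ** 2
    term_1 = -T1 * pi_theta / p ** 2                      # uses T_raw,D = T1
    term_2 = -t_raw(d) * pi_theta_d / p ** 2
    term_3 = 2.0 * t_raw(d) * pi_theta * pi_d / p ** 3    # the positive term
    return term_1 + term_2 + term_3


def cross_partial_numeric(theta: float, d: float, h: float = 1e-3) -> float:
    """Central finite-difference estimate, treating D as continuous."""
    return (t_eff(theta + h, d + h) - t_eff(theta + h, d - h)
            - t_eff(theta - h, d + h) + t_eff(theta - h, d - h)) / (4.0 * h * h)


for theta in (0.5, 1.0, 2.0):
    for d in (2, 10, 52):
        print(theta, d, cross_partial_closed_form(theta, d),
              cross_partial_numeric(theta, d))
```

For this particular assumed form the cross-partial is negative at every grid point, so inequality (6) holds there; other functional forms would need to be checked separately, since Term III is positive.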

4.4 Corollary: Supermodularity Bound

Corollary 1 (Efficiency Bound from Supermodularity).
Let \delta = \inf_{(\theta,D,B)}\Pi_{\theta D}(\theta,D,B) > 0 be the supermodularity index of \Pi. Then for any (\theta_1, D_1) and (\theta_2, D_2) with \theta_2 > \theta_1 and D_2 > D_1:

\Pi(\theta_2,D_2,B) \;\geq\; \Pi(\theta_2,D_1,B) + \Pi(\theta_1,D_2,B) - \Pi(\theta_1,D_1,B) + \delta(\theta_2-\theta_1)(D_2-D_1) \tag{7}

That is, the joint efficiency at high capacity and high scope exceeds the sum of the individual marginal gains by at least \delta(\theta_2-\theta_1)(D_2-D_1).

Proof. By the fundamental theorem of calculus applied to the cross-partial and Assumption A3:

\begin{aligned} \Pi(\theta_2,D_2,B) - \Pi(\theta_2,D_1,B) - \Pi(\theta_1,D_2,B) + \Pi(\theta_1,D_1,B) &= \int_{\theta_1}^{\theta_2}\int_{D_1}^{D_2} \Pi_{\theta D}\,\mathrm{d}D\,\mathrm{d}\theta \\ &\geq \delta(\theta_2-\theta_1)(D_2-D_1) \end{aligned}
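As a numerical illustration of bound (7), the sketch below uses a hypothetical form \Pi = 1 + b\cdot\frac{\theta}{1+\theta}\cdot(D-1), for which \Pi_{\theta D} = b/(1+\theta)^2. Because this cross-partial tends to zero as \theta grows, the infimum is taken over the rectangle [\theta_1, \theta_2] rather than over all of \Theta. That restriction, the functional form, and the chosen points are all assumptions made for illustration.

```python
B = 1.0  # hypothetical batch regularity


def pi(theta: float, d: float) -> float:
    return 1.0 + B * (theta / (1.0 + theta)) * (d - 1.0)


def pi_theta_d(theta: float) -> float:
    """Cross-partial of the assumed form; decreasing in theta."""
    return B / (1.0 + theta) ** 2


theta_1, theta_2 = 1.0, 3.0
d_1, d_2 = 4, 52

# Infimum of the cross-partial over [theta_1, theta_2] is attained at theta_2.
delta = pi_theta_d(theta_2)

joint_gain = (pi(theta_2, d_2) - pi(theta_2, d_1)
              - pi(theta_1, d_2) + pi(theta_1, d_1))
lower_bound = delta * (theta_2 - theta_1) * (d_2 - d_1)

print(joint_gain, lower_bound, joint_gain >= lower_bound)
```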

4.5 Proposition: Pareto Improvement

Proposition 1 (Inference Efficiency Frontier Shift).
Let \mathcal{S} = \{(T_{eff},\tau) : T_{eff} = T_{raw}(Q,D)/\Pi(\theta,D,B),\; \tau = \tau(T_{eff},\theta)\} be the set of (tokens, latency) outcomes achievable under sequential inference (D = 1), and let \mathcal{P} = \{(T_{eff}',\tau')\} be the corresponding set under parallelised inference (D = N > 1). Then:

\forall\;(T,\tau)\in\mathcal{S},\;\exists\;(T',\tau')\in\mathcal{P}:\quad T' \leq T \;\text{ and }\; \tau' \leq \tau \tag{8}

with at least one strict inequality for \theta > 0 and N > 1. That is, parallelisation constitutes a Pareto improvement over sequential inference in the (tokens, latency) space.

Proof. Under sequential processing, D = 1 and \Pi(\theta,1,B) = 1 by A4, so T_{eff} = T_{raw}(Q,1) per task, with total effective cost N\cdot T_{raw}(Q,1) and total latency N\cdot\tau_0. Under parallelised processing (D = N, single call):

T_{eff}^{\mathrm{par}} = \frac{T_{raw}(Q,N)}{\Pi(\theta,N,B)}

By A4, \Pi(\theta,1,B) = 1, and by A2, \Pi is strictly increasing in D, so \Pi(\theta,N,B) > 1 for N > 1 and \theta > 0. Assuming T_{raw}(Q,N) \leq N\cdot T_{raw}(Q,1) (structurally related tasks share context representations), we have T_{eff}^{\mathrm{par}} < N\cdot T_{raw}(Q,1) = T_{eff}^{\mathrm{seq}}. For latency, parallelised inference processes all N tasks in a single call rather than N separate calls: \tau^{\mathrm{par}} \ll N\cdot\tau_0 = \tau^{\mathrm{seq}}. Hence both components are strictly reduced.
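The token comparison in the proof can be written as a single chain. Here T_1 = T_{raw}(Q,1) is the single-task normalisation from Section 4.1; the first inequality uses the subadditivity of raw token requirements across structurally related tasks, and the strict inequality uses \Pi(\theta,N,B) > 1:

```latex
T_{eff}^{\mathrm{par}}
  = \frac{T_{raw}(Q,N)}{\Pi(\theta,N,B)}
  \;\leq\; \frac{N\cdot T_1}{\Pi(\theta,N,B)}
  \;<\; N\cdot T_1
  = T_{eff}^{\mathrm{seq}}
```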

4.6 Economic Interpretation

Theorem 1 can be reinterpreted through the lens of production-theoretic total factor productivity. Define the inference cost function C = c(\theta)\cdot T_{eff}, where c(\theta) is the per-token cost at capacity level \theta. Then:

\frac{\partial^2 C}{\partial\theta\,\partial D} = c'(\theta)\cdot\frac{\partial T_{eff}}{\partial D} + c(\theta)\cdot\frac{\partial^2T_{eff}}{\partial\theta\,\partial D}

The second term is strictly negative by Theorem 1. When the per-token cost is locally constant in capacity (c'(\theta) = 0), or more generally when c'(\theta)\cdot\partial T_{eff}/\partial D \leq 0, the cross-partial of cost is strictly negative: total inference cost is submodular in (\theta, D), so investing in model capacity yields larger cost reductions when scope is high, and vice versa. This is the inference analogue of the complementarity between capital intensity and scale in classical production theory.

The analogy with Solow (1957) technical change is instructive. Just as a process innovation shifts the production frontier outward—allowing the same output with fewer inputs—parallelisation shifts the inference efficiency frontier inward: the same task output is achievable at strictly lower (tokens, latency) cost.

5 Empirical Evidence

5.1 Setting and Context

We present empirical evidence from a structured case study of a large-scale supervisory technology (SupTech) platform deployment project conducted in March 2026. The project involved developing the OBF SupTech-RegTech Platform—a Shiny-based R application implementing the UAE Ministry of Higher Education and Scientific Research (MoHESR) Outcome-Based Framework (OBF) v11 compliance monitoring system—for deployment across 52 UAE higher education institutions (HEIs).

Each institution requires a customised SQLite database initialisation script (init_database_{code}.r) that seeds: institution-specific metadata (Arabic and English names, short codes, websites, contact domains); role-based access control (RBAC) credentials for 7 base users plus 2 per academic college (ranging from 9 to 17 users across institutions); college-level governance structures (1–5 colleges per institution); academic programme inventories (2–7 programmes per institution); and full OBF compliance data schemas (assessments, indicators, reports, audit trails).

This diversity creates a task batch that is simultaneously high in scope (D = 52 structurally related tasks) and high in structural regularity (all tasks share the same template schema, parameter types, and substitution logic). This is precisely the regime in which Theorem 1 predicts the largest efficiency gains from parallelisation.

5.2 Sequential Processing: Context Exhaustion

The initial approach followed the canonical sequential paradigm: individual inference calls were made for each institution, requesting the generation of the corresponding init_database_{code}.r script. The observed outcome was systematic context exhaustion. In the sequential paradigm, each successive call carries forward the accumulated context of all prior calls in the session, causing the effective token budget to shrink monotonically:

T_{eff}(n) \to \infty \quad\text{as } n \to N \quad\text{(sequential regime, fixed token budget } K\text{)} \tag{9}

where n indexes the sequential task number. In practical terms, the session ran out of usable context before all fifty-two institutions were processed. This is not a failure of model capability but a structural consequence of the sequential paradigm: because the n-th call re-processes the context accumulated over the preceding n-1 calls, the context length grows linearly and cumulative token consumption grows quadratically in n, so the session exhausts any fixed context budget K once the accumulated context exceeds K.
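The carry-forward mechanism can be illustrated with a small simulation. The sketch below assumes every task adds the same number of tokens to the session context and that each call re-reads the whole accumulated context. The per-task token count and the budget are hypothetical round numbers, not measurements from the deployment.

```python
def sequential_session(tokens_per_task: int, budget: int,
                       n_tasks: int) -> tuple[int, int]:
    """Simulate context carry-forward under a fixed context budget.

    Returns (tasks_completed, cumulative_tokens_processed).
    """
    context = 0      # tokens currently held in the session window
    cumulative = 0   # tokens processed over all calls so far
    completed = 0
    for _ in range(n_tasks):
        if context + tokens_per_task > budget:
            break                      # context exhaustion
        context += tokens_per_task     # earlier turns stay in the window
        cumulative += context          # this call re-reads everything so far
        completed += 1
    return completed, cumulative


if __name__ == "__main__":
    # Hypothetical figures: 6,000 tokens per task, 200,000-token budget.
    print(sequential_session(6_000, 200_000, 52))
```

With these illustrative numbers the session stops before the 52nd task, and the cumulative count follows the triangular-number pattern t\cdot n(n+1)/2, which is quadratic in n.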

5.3 Parallelised Processing: Batch Result

The remedial approach reformulated the entire 52-institution task as a single structured batch inference call. The batch comprised a single system prompt specifying the init_database template schema, followed by a JSON-structured data payload containing all institution-specific parameters for all 52 institutions simultaneously.
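A batch request of this shape might be assembled as in the sketch below. The prompt wording, the field names (`code`, `name_en`, `colleges`) and the request layout are hypothetical placeholders, since the text does not specify the actual schema or the API call used.

```python
import json

# Hypothetical system prompt; "{code}" is left literal to mirror the
# init_database_{code}.r naming pattern in the text.
TEMPLATE_PROMPT = (
    "Generate one init_database_{code}.r script for each institution in the "
    "JSON payload, following the shared template schema."
)


def build_batch_request(institutions: list[dict]) -> dict:
    """Bundle the shared template and all institution parameters in one request."""
    codes = [inst["code"] for inst in institutions]
    if len(set(codes)) != len(codes):
        raise ValueError("duplicate institution codes in batch")
    return {
        "system": TEMPLATE_PROMPT,
        # ensure_ascii=False keeps non-Latin names (for example Arabic) readable.
        "payload": json.dumps({"institutions": institutions}, ensure_ascii=False),
    }


if __name__ == "__main__":
    demo = [
        {"code": "HEI01", "name_en": "Example University A", "colleges": 3},
        {"code": "HEI02", "name_en": "Example University B", "colleges": 5},
    ]
    request = build_batch_request(demo)
    print(len(json.loads(request["payload"])["institutions"]))
```

The point of the layout is that the shared template appears once, while only the institution-specific parameters vary across entries of the payload.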

The result was unambiguous. The batch inference call completed in 66.6 seconds, producing all fifty-two init_database_{code}.r scripts with zero errors. A bulk validation sweep across all generated files applied twelve programmatic checks per script and confirmed a 100% pass rate (52/52) across all checks. The key quantitative results are summarised in Table 1.
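The text does not list the twelve checks, so the sketch below only illustrates the general shape of such a sweep with three hypothetical checks. The check names, the `code` field and the `validate` helper are assumptions introduced for illustration.

```python
from typing import Callable

Check = Callable[[str, dict], bool]

# Three hypothetical checks; the deployment's actual twelve are not specified.
CHECKS: dict[str, Check] = {
    "non_empty": lambda script, inst: bool(script.strip()),
    "mentions_code": lambda script, inst: inst["code"] in script,
    "no_unfilled_placeholder": lambda script, inst: "{code}" not in script,
}


def validate(scripts: dict[str, str],
             institutions: list[dict]) -> dict[str, list[str]]:
    """Map each failing institution code to the names of the checks it failed."""
    failures: dict[str, list[str]] = {}
    for inst in institutions:
        script = scripts.get(inst["code"], "")
        failed = [name for name, check in CHECKS.items()
                  if not check(script, inst)]
        if failed:
            failures[inst["code"]] = failed
    return failures


if __name__ == "__main__":
    insts = [{"code": "HEI01"}, {"code": "HEI02"}]
    scripts = {"HEI01": "# init for HEI01", "HEI02": "# init for {code}"}
    print(validate(scripts, insts))
```

A full pass corresponds to `validate` returning an empty dictionary for all 52 scripts.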

Table 1: Comparative performance: sequential vs. parallelised inference on 52-institution SupTech platform task.
| Metric | Sequential | Parallelised batch | Ratio / change |
| Tasks completed | Partial (context exhaustion) | 52/52 | N/A |
| Completion rate | <100% | 100% | Not quantified |
| Total latency | Unbounded (context fail) | 66.6 sec | N/A |
| Per-task latency | Divergent | ~1.28 sec | N/A |
| Validation pass rate | N/A | 52/52 (100%) | N/A |
| Context exhaustion events | Multiple | 0 | Eliminated |
| Residual template errors | N/A | 0 | N/A |

5.4 Mapping Results to Theory

The empirical results map directly to the theoretical predictions of Section 4. The sequential-to-parallelised transition constitutes an increase in D from 1 to 52, holding \theta (Claude Sonnet 4.6 capacity) fixed. Two empirical regularities confirm the theory.

First, the parallelised call completed the task that sequential calls could not complete at all—a discontinuous improvement that suggests the feasibility boundary in (tokens, latency) space shifted dramatically inward, consistent with Proposition 1.

Second, the amortised per-task token cost in the parallelised case is approximately 1/52 of the single-task raw cost T_1, consistent with the efficiency ratio \Pi(\theta, 52, B) \approx 52 (near-linear efficiency). This is the theoretical upper bound from the parametric specification in Appendix B with \alpha = \beta = \gamma = 1, and the deployment outcome is consistent with this bound being attained.
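As a worked instance of equation (1) under the reported figures, suppose purely for illustration that raw token requirements scale linearly with scope, T_{raw}(Q,52) = 52\,T_1 (an assumption, not a measured quantity). Then an efficiency ratio of 52 implies that the whole batch costs about as much as one task, and the per-task latency follows from the reported 66.6 seconds:

```latex
T_{eff} = \frac{T_{raw}(Q,52)}{\Pi(\theta,52,B)} \approx \frac{52\,T_1}{52} = T_1,
\qquad
\frac{T_{eff}}{52} \approx \frac{T_1}{52},
\qquad
\frac{66.6\ \text{s}}{52} \approx 1.28\ \text{s per task}
```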

It is important to note the nature of this empirical exercise. We are not conducting a controlled experiment in the traditional sense: we cannot randomise the inference paradigm while holding all other variables fixed. The evidence is observational, from a structured natural experiment in which the task batch, model, and evaluation criteria are held constant across the sequential and parallelised conditions.

5.5 Alternative Explanations and Robustness

We consider three alternative explanations for the observed results. First, one might argue that the sequential failures were due to model error rather than context exhaustion. Against this: the error pattern is consistent with context window saturation (increasing error rates as n grows, sudden failure at a predictable threshold) rather than random errors. Second, one might argue that the parallelised success reflects prompt engineering rather than parallelisation per se. Against this: the key difference between the two conditions is the scope D—the structural organisation of the query—not the prompt quality for any individual sub-task. Third, one might argue that the 66.6-second latency reflects idiosyncratic system conditions. This is plausible but does not affect the qualitative comparison: the parallelised batch completed a task the sequential approach could not complete at all.

6 Discussion

6.1 Implications for LLM System Design

The results suggest a fundamental reappraisal of how LLM inference workloads should be structured for enterprise deployments. The canonical paradigm—decompose, iterate, aggregate—is well-suited to tasks that are genuinely sequential (each step depends on the prior) or tasks with low structural regularity (each sub-task requires bespoke reasoning). For high-scope, structurally regular task batches, the parallelised batch paradigm strictly dominates.

System designers should consider the following reorientation: identify the scope dimension D of the task batch before commencing inference; assess the degree of structural regularity B (high-regularity tasks such as template-based generation, parameter substitution, and structured data transformation are prime candidates for parallelisation); and select the model capacity \theta jointly with the batch size, exploiting the complementarity identified in Theorem 1.

6.2 Implications for Inference Pricing

Current token-based pricing architectures charge per token consumed, with no adjustment for the structural organisation of the inference workload. Our results suggest that this pricing architecture may be misaligned with the value delivered: when \Pi \approx D, a parallelised batch call delivers D outputs for roughly the token consumption of a single sequential call, so the per-output token cost of the batch call is approximately 1/D of the sequential per-output cost.
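The per-output comparison can be made explicit. The expression below assumes a flat per-token price c, linear raw token requirements T_{raw}(Q,D) \approx D\cdot T_1, and near-linear efficiency \Pi \approx D; all three are illustrative assumptions rather than measured quantities:

```latex
\underbrace{c\cdot T_1}_{\text{sequential, per output}}
\quad\text{versus}\quad
\underbrace{\frac{c\cdot T_{eff}}{D}
  = \frac{c\cdot T_{raw}(Q,D)}{D\cdot\Pi(\theta,D,B)}
  \approx \frac{c\cdot T_1}{D}}_{\text{batch, per output}}
```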

This has implications for how providers and users should negotiate enterprise pricing. Batch API pricing (charging per token with discounts for asynchronous processing) is a step in the right direction, but our framework suggests that scope-adjusted pricing—accounting for the D-fold output delivered per unit of token consumption—would better reflect the economic value of high-scope batch inference.

6.3 Generalisability

The theoretical results are general: they apply to any inference setting in which (i) multiple structurally related tasks can be presented simultaneously, (ii) the model has sufficient capacity to process the batch, and (iii) the parallelisation efficiency function \Pi satisfies Assumptions A1–A5. The key conditions are the supermodularity of \Pi (A3) and the boundary condition (A4); the other assumptions are regularity conditions satisfied by all standard functional forms.

The empirical results are, by construction, specific to the SupTech platform development context. Quantitative parameters—the 66.6-second completion time, the 52/52 pass rate, the approximate linear efficiency ratio—are specific to the model version, task batch, and infrastructure conditions of March 2026 and should not be extrapolated without further empirical validation.

6.4 Limitations and Future Work

Several limitations merit acknowledgement. First, the theoretical framework abstracts from the internal mechanisms of LLM inference; \Pi is a reduced-form efficiency function rather than a structural model of the attention mechanism, KV-cache, or context management. Future work could derive \Pi from first principles of transformer architecture. Second, the empirical evidence is a single case study; more systematic empirical work across model families, task types, and scope dimensions would be valuable. Third, the framework does not address the quality dimension: the theory establishes that effective token consumption falls, but does not characterise how output quality varies with D and \theta. Quality-adjusted efficiency measures are an important extension.

7 Conclusion

This paper has developed a formal production-theoretic framework for understanding the joint efficiency of scope and scale in LLM inference. The central result, the Joint Scope–Scale Efficiency Theorem, establishes that model capacity \theta and task scope D are complements in inference efficiency: the marginal reduction in effective token consumption from increasing capacity is strictly greater when scope is larger, and vice versa. The supermodularity bound (Corollary 1) quantifies the joint efficiency gain, and the Pareto improvement result (Proposition 1) establishes that parallelised inference weakly dominates sequential inference in both tokens and latency.

The empirical case study from a 52-institution SupTech deployment exercise provides striking support for these predictions. The contrast between systematic context exhaustion under sequential processing and zero-error, 66.6-second completion under parallelised batch inference constitutes a natural experiment in the economics of LLM inference. The implied efficiency ratio \Pi \approx 52 matches the theoretical near-linear calibration with unit parameters, which is consistent with both the quantitative and the qualitative predictions of the framework.

The implications are practical and immediate. Enterprise AI deployments involving high-scope, structurally regular task batches should be redesigned around parallelised inference architectures. The efficiency gains are not marginal: they are the difference between feasibility and infeasibility for large-scale tasks within finite context budgets. Inference pricing architectures should be reformed to reflect scope-adjusted value rather than raw token counts.

More broadly, this paper argues that the economics of LLM inference requires theoretical frameworks that are sensitive to the structural organisation of inference workloads, not merely to raw token counts or model capabilities in isolation. The production-theoretic approach developed here—grounding efficiency in the complementarity between capacity and scope—provides a foundation for such frameworks.

9 Appendix A: Notation and Symbol Glossary

Table 2 provides a complete reference for all mathematical symbols used in the main paper. Symbols are listed in order of first appearance. Throughout the paper, the convention \partial f/\partial x (or f_x in subscript notation) denotes the partial derivative of function f with respect to x. All functions are assumed to be at least twice continuously differentiable in the relevant arguments unless otherwise noted. The supermodularity index \delta is defined as the infimum of the cross-partial \Pi_{\theta D} over the domain, ensuring a uniform lower bound on the complementarity.

Table 2: Complete notation glossary for the main paper.
Symbol Domain Definition First Used
\theta \Theta\subseteq\mathbb{R}_+ Model capacity index Definition 1
D \mathbb{N} Task scope dimensionality Definition 1
B \mathcal{B} Batch configuration Definition 1
Q Query content Definition 1
\Pi(\theta,D,B) [1,\infty) Parallelisation efficiency function Definition 1
T_{raw}(Q,D) \mathbb{R}_+ Raw token count Definition 1
T_{eff} \mathbb{R}_+ Effective token consumption = T_{raw}/\Pi Definition 1
c(\theta) \mathbb{R}_+ Per-token cost at capacity \theta Section 3.6
C \mathbb{R}_+ Total inference cost = c(\theta)\cdot T_{eff} Section 3.6
\tau \mathbb{R}_+ Inference latency (seconds) Proposition 1
\delta \mathbb{R}_+ Supermodularity index = \inf\Pi_{\theta D} Corollary 1
\Pi_\theta \partial\Pi/\partial\theta Step 1
\Pi_D \partial\Pi/\partial D Step 2
\Pi_{\theta D} \partial^2\Pi/\partial\theta\partial D Assumption A3
T_{\mathrm{raw},D} \partial T_{raw}/\partial D Step 4
\mathcal{S} Feasibility set, sequential (D=1) Proposition 1
\mathcal{P} Feasibility set, parallelised (D=N) Proposition 1
N \mathbb{N} Total tasks (case study: N=52) Section 4
K \mathbb{R}_+ Token budget (context window) Section 4.2
\alpha,\beta,\gamma,\delta^* (0,\infty) Parametric coefficients Appendix B

10 Appendix B: Parametric Specification of \Pi(\theta, D, B)

10.1 Baseline Functional Form

To anchor the theoretical framework to quantitative predictions, we propose the following parametric specification of the parallelisation efficiency function:

\Pi(\theta, D, B) = 1 + \alpha\cdot\theta^\beta\cdot(D-1)^\gamma\cdot B^{\delta^*} \tag{10}

where \alpha, \beta, \gamma, \delta^* > 0. The (D-1) shift ensures \Pi(\theta,1,B)=1 (Assumption A4). We verify all five assumptions:

  • (A1) \partial\Pi/\partial\theta = \alpha\beta\theta^{\beta-1}(D-1)^\gamma B^{\delta^*} > 0 for D>1. ✓
  • (A2) \partial\Pi/\partial D = \alpha\gamma\theta^\beta(D-1)^{\gamma-1}B^{\delta^*} > 0 for D>1. ✓
  • (A3) \Pi_{\theta D} = \alpha\beta\gamma\theta^{\beta-1}(D-1)^{\gamma-1}B^{\delta^*} > 0 for D>1. ✓
  • (A4) \Pi(\theta,1,B) = 1 + \alpha\theta^\beta\cdot 0^\gamma\cdot B^{\delta^*} = 1. ✓
  • (A5) Power functions are C^\infty on interior domains. ✓
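The checks above can also be reproduced numerically. The following is a minimal sketch, assuming illustrative parameter values (\alpha=1, \beta=0.8, \gamma=1.2, \delta^*=0.5) that are not estimates from the case study; it compares the closed-form cross-partial from (A3) with a central finite-difference estimate and evaluates the boundary condition (A4).

```python
# Numerical sanity check of (A3) and (A4) for the power-law form (10).
# All parameter values are illustrative assumptions, not estimates.

def pi_power(theta, D, B, alpha=1.0, beta=0.8, gamma=1.2, delta_star=0.5):
    """Pi(theta, D, B) = 1 + alpha * theta^beta * (D - 1)^gamma * B^delta_star."""
    return 1.0 + alpha * theta**beta * (D - 1.0)**gamma * B**delta_star

def cross_partial_analytic(theta, D, B, alpha=1.0, beta=0.8, gamma=1.2, delta_star=0.5):
    """Closed-form Pi_{theta D} as stated in the (A3) line."""
    return (alpha * beta * gamma * theta**(beta - 1.0)
            * (D - 1.0)**(gamma - 1.0) * B**delta_star)

def cross_partial_numeric(f, theta, D, B, h=1e-3):
    """Central finite-difference estimate of the mixed partial in (theta, D)."""
    return (f(theta + h, D + h, B) - f(theta + h, D - h, B)
            - f(theta - h, D + h, B) + f(theta - h, D - h, B)) / (4.0 * h * h)

theta0, D0, B0 = 1.5, 10.0, 1.0
analytic = cross_partial_analytic(theta0, D0, B0)
numeric = cross_partial_numeric(pi_power, theta0, D0, B0)
print(f"Pi_thetaD analytic={analytic:.6f} numeric={numeric:.6f}")
print("A4 boundary value Pi(theta, 1, B):", pi_power(theta0, 1.0, B0))
```

D is treated as a continuous variable here, as it is in the differentiability assumption (A5).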

10.2 Empirical Calibration

We calibrate against the case study. Let \theta_0 denote Claude Sonnet 4.6 capacity (normalised to 1), D=52, B = B_0 = 1. The observed efficiency ratio is \Pi(\theta_0, 52, B_0) \approx 52. Setting \alpha = \beta = \gamma = 1:

\Pi(1, 52, 1) = 1 + 1\cdot 1\cdot(52-1)^1\cdot 1 = 1 + 51 = 52 \quad \checkmark

The near-linear calibration (\beta = \gamma = 1) is consistent with the empirical observation that all 52 tasks complete with near-equal per-task resource consumption.

10.3 Alternative Functional Forms

Table 3 presents three alternative functional forms for \Pi, each satisfying Assumptions A1–A5 under appropriate parameter restrictions.

Table 3: Alternative parametric specifications of \Pi(\theta,D,B) and their cross-partials.
Form Specification \Pi_{\theta D} Notes
Power-law (baseline) 1 + \alpha\theta^\beta(D-1)^\gamma \alpha\beta\gamma\theta^{\beta-1}(D-1)^{\gamma-1} D>1; \alpha,\beta,\gamma>0
Log-interaction 1 + \alpha\theta\cdot\log(D) \alpha/D > 0 D\geq 2; diminishing returns
Exponential \exp(\alpha\theta(D-1)) \alpha(1+\alpha\theta(D-1))\exp(\alpha\theta(D-1))>0 Super-linear; \Pi\to\infty
Cobb-Douglas \theta^\beta\cdot D^\gamma \beta\gamma\theta^{\beta-1}D^{\gamma-1} Requires normalisation for A4
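To make the note on Assumption A4 concrete, the sketch below evaluates each form in Table 3 at D=1. The unit parameter values and the choice \theta=2 are assumptions for illustration only. The first three forms return 1, whereas the unnormalised Cobb-Douglas form returns \theta^\beta, which is why it needs normalisation before A4 holds.

```python
import math

# Illustrative unit parameters (assumed, not calibrated).
ALPHA, BETA, GAMMA = 1.0, 1.0, 1.0

# The four specifications of Table 3, with B suppressed.
FORMS = {
    "power-law": lambda th, D: 1.0 + ALPHA * th**BETA * (D - 1.0)**GAMMA,
    "log-interaction": lambda th, D: 1.0 + ALPHA * th * math.log(D),
    "exponential": lambda th, D: math.exp(ALPHA * th * (D - 1.0)),
    "cobb-douglas": lambda th, D: th**BETA * D**GAMMA,
}

def boundary_values(theta):
    """Evaluate every form at D = 1, where A4 requires Pi = 1."""
    return {name: form(theta, 1.0) for name, form in FORMS.items()}

values_at_boundary = boundary_values(2.0)
for name, value in values_at_boundary.items():
    print(f"{name:16s} Pi(theta=2, D=1) = {value}")
```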

10.4 Implied Effective Token Cost Schedules

Using the power-law baseline with \alpha=\beta=\gamma=1 and T_{raw}(Q,D) = D\cdot T_1 (linear raw cost):

T_{eff}(D) = \frac{D\cdot T_1}{1+(D-1)} = \frac{D\cdot T_1}{D} = T_1

With \alpha=\beta=\gamma=1 and \theta=B=1, the effective token consumption of the whole batch is exactly T_1 regardless of D: it is equivalent to processing a single task, with the remaining D-1 tasks processed at zero marginal effective token cost.
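The schedule can be tabulated directly. This is a minimal sketch in which T1 = 1,000 tokens is a hypothetical single-task token count chosen only for illustration; the function implements T_{eff}(D) = D\cdot T_1/\Pi with the unit-parameter power-law baseline.

```python
def t_eff(D, t1, alpha=1.0, theta=1.0):
    """Effective tokens T_raw / Pi with T_raw = D * t1 and Pi = 1 + alpha*theta*(D - 1)."""
    t_raw = D * t1
    pi = 1.0 + alpha * theta * (D - 1)
    return t_raw / pi

T1 = 1_000.0  # hypothetical tokens for one task (assumption)

schedule = {D: t_eff(D, T1) for D in (1, 2, 10, 52)}
for D, tokens in schedule.items():
    print(f"D={D:2d}  T_eff={tokens:.1f}")

# A higher capacity index lowers T_eff further under the same specification.
print("theta=2, D=52:", round(t_eff(52, T1, theta=2.0), 1))
```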

11 Appendix C: Extended Proofs and Lemmas

11.1 Preliminary Lemmas

Lemma 1 (Quotient Rule for Efficiency Inverse).
Let f(\theta,D)=T_{raw}(Q,D) and g(\theta,D)=\Pi(\theta,D,B). Then T_{eff} = f/g and:

\left(\frac{f}{g}\right)_{\theta D} = \frac{f_{\theta D}g - f_\theta g_D - f_D g_\theta - f g_{\theta D}}{g^2} + \frac{2f g_\theta g_D}{g^3}

Since T_{raw} is independent of \theta: f_\theta = 0 and f_{\theta D} = 0. The expression simplifies to:

(T_{eff})_{\theta D} = \frac{-T_{\mathrm{raw},D}\Pi_\theta - T_{raw}\Pi_{\theta D}}{\Pi^2} + \frac{2T_{raw}\Pi_\theta\Pi_D}{\Pi^3}

This confirms the expression derived in Step 3 of the main proof.
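As a numerical cross-check of Lemma 1, the sketch below compares the simplified expression with a finite-difference estimate of the mixed partial of T_{eff}. It assumes the unit-parameter power-law form for \Pi, a linear raw cost T_{raw} = D\cdot T_1, and a hypothetical T_1 = 1,000; none of these values come from the case study.

```python
T1, ALPHA = 1_000.0, 1.0  # hypothetical single-task tokens and scale (assumptions)

def t_eff(theta, D):
    """T_eff = T_raw / Pi with T_raw = D*T1 and Pi = 1 + ALPHA*theta*(D - 1)."""
    return (D * T1) / (1.0 + ALPHA * theta * (D - 1.0))

def lemma1_cross_partial(theta, D):
    """Simplified Lemma 1 expression (the case f_theta = 0)."""
    t_raw, t_raw_D = D * T1, T1
    pi = 1.0 + ALPHA * theta * (D - 1.0)
    pi_theta, pi_D, pi_thetaD = ALPHA * (D - 1.0), ALPHA * theta, ALPHA
    return ((-t_raw_D * pi_theta - t_raw * pi_thetaD) / pi**2
            + 2.0 * t_raw * pi_theta * pi_D / pi**3)

def mixed_finite_difference(f, theta, D, h=1e-3):
    """Central finite-difference estimate of the mixed partial of f."""
    return (f(theta + h, D + h) - f(theta + h, D - h)
            - f(theta - h, D + h) + f(theta - h, D - h)) / (4.0 * h * h)

closed_form = lemma1_cross_partial(1.0, 10.0)
estimate = mixed_finite_difference(t_eff, 1.0, 10.0)
print(f"Lemma 1: {closed_form:.4f}  finite difference: {estimate:.4f}")
```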

Lemma 2 (Sufficient Condition for Negative Sign).
The cross-partial \partial^2T_{eff}/\partial\theta\partial D < 0 if and only if:

T_{\mathrm{raw},D}\Pi_\theta + T_{raw}\Pi_{\theta D} > \frac{2T_{raw}\Pi_\theta\Pi_D}{\Pi}

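A sufficient condition is \Pi_{\theta D}/\Pi_D > 2\Pi_\theta/\Pi, equivalently \partial[\log(\Pi_\theta/\Pi^2)]/\partial D > 0, since T_{\mathrm{raw},D}\Pi_\theta \geq 0. For the power-law specification with \beta=\gamma=1 and B=1, this sufficient condition reduces to \alpha\theta(D-1) < 1; outside that range the necessary and sufficient inequality above must be checked directly.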
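The two conditions in Lemma 2 can be compared at the calibration point. The sketch below assumes \alpha=\theta=B=1, \beta=\gamma=1, a linear raw cost T_{raw} = D\cdot T_1, and a hypothetical T_1 = 1,000 (which cancels out of the comparison). It illustrates that the necessary and sufficient inequality can hold even where the simpler sufficient condition does not.

```python
def lemma2_terms(D, theta=1.0, alpha=1.0, t1=1_000.0):
    """Return (LHS, RHS, sufficient_ok) for the Lemma 2 conditions.

    Assumes Pi = 1 + alpha*theta*(D - 1) and T_raw = D * t1 (illustrative).
    """
    t_raw, t_raw_D = D * t1, t1
    pi = 1.0 + alpha * theta * (D - 1.0)
    pi_theta, pi_D, pi_thetaD = alpha * (D - 1.0), alpha * theta, alpha
    lhs = t_raw_D * pi_theta + t_raw * pi_thetaD   # left side of the inequality
    rhs = 2.0 * t_raw * pi_theta * pi_D / pi        # right side of the inequality
    sufficient_ok = pi_thetaD / pi_D > 2.0 * pi_theta / pi
    return lhs, rhs, sufficient_ok

lhs, rhs, sufficient_ok = lemma2_terms(D=52)
print(f"LHS={lhs:.0f} RHS={rhs:.0f} inequality holds: {lhs > rhs}")
print(f"sufficient condition holds: {sufficient_ok}")
```

At D=52 the inequality holds by a narrow margin, so under these assumptions the sign of the cross-partial is sensitive to the specification of T_{raw}.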

11.2 Proof of Corollary 1 (Detailed)

Let \Delta\Pi \equiv \Pi(\theta_2,D_2) - \Pi(\theta_2,D_1) - \Pi(\theta_1,D_2) + \Pi(\theta_1,D_1). By the fundamental theorem of calculus applied twice:

\Delta\Pi = \int_{\theta_1}^{\theta_2}\int_{D_1}^{D_2}\Pi_{\theta D}(\theta,D)\,\mathrm{d}D\,\mathrm{d}\theta \;\geq\; \delta\int_{\theta_1}^{\theta_2}\int_{D_1}^{D_2}\mathrm{d}D\,\mathrm{d}\theta = \delta(\theta_2-\theta_1)(D_2-D_1)

Rearranging gives the Corollary statement.
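The bound can be illustrated with the power-law form. In this sketch the grid points (\theta_1,\theta_2,D_1,D_2) = (1, 2, 10, 52) are arbitrary illustrative choices, B is suppressed, and \delta is taken as the infimum of \Pi_{\theta D} over the rectangle considered, which is an assumption made for the example. With \beta=\gamma=1 the cross-partial is constant and the bound binds; with \gamma=1.5 it is slack.

```python
def pi_power(theta, D, alpha=1.0, beta=1.0, gamma=1.0):
    """Power-law Pi with B suppressed (B = 1)."""
    return 1.0 + alpha * theta**beta * (D - 1.0)**gamma

def double_difference(theta1, theta2, D1, D2, **params):
    """Delta Pi = Pi(t2,D2) - Pi(t2,D1) - Pi(t1,D2) + Pi(t1,D1)."""
    return (pi_power(theta2, D2, **params) - pi_power(theta2, D1, **params)
            - pi_power(theta1, D2, **params) + pi_power(theta1, D1, **params))

def corollary_bound(delta, theta1, theta2, D1, D2):
    """Lower bound delta * (theta2 - theta1) * (D2 - D1)."""
    return delta * (theta2 - theta1) * (D2 - D1)

# Case 1: beta = gamma = 1, so Pi_thetaD = alpha = 1 everywhere.
gain_linear = double_difference(1.0, 2.0, 10, 52)
bound_linear = corollary_bound(1.0, 1.0, 2.0, 10, 52)

# Case 2: beta = 1, gamma = 1.5, so Pi_thetaD = 1.5*(D-1)^0.5, smallest at D1.
delta_curved = 1.5 * (10 - 1.0) ** 0.5
gain_curved = double_difference(1.0, 2.0, 10, 52, gamma=1.5)
bound_curved = corollary_bound(delta_curved, 1.0, 2.0, 10, 52)

print(gain_linear, bound_linear)
print(round(gain_curved, 2), bound_curved)
```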

11.3 Proof of Proposition 1 (Detailed)

Under sequential processing with N tasks:

T_{eff}^{\mathrm{seq}} = N\cdot T_{raw}(Q,1)/\Pi(\theta,1,B) = N\cdot T_{raw}(Q,1), \quad \tau^{\mathrm{seq}} = N\cdot\tau_0

Under parallelised processing (D=N, single call), with overhead \varepsilon\geq 0:

T_{eff}^{\mathrm{par}} \leq \frac{(1+\varepsilon)\cdot N\cdot T_{raw}(Q,1)}{N} = (1+\varepsilon)T_{raw}(Q,1) < N\cdot T_{raw}(Q,1) = T_{eff}^{\mathrm{seq}}

for \varepsilon < N-1 (satisfied for N=52, small \varepsilon). For latency: \tau^{\mathrm{par}} = \tau(\Pi,\theta) \ll N\cdot\tau_0 = \tau^{\mathrm{seq}}.

11.4 Context Exhaustion as T_{eff} \to \infty

Let K be the finite context window and C_n the cumulative context after n sequential calls, with carry-forward coefficient \kappa > 0:

C_n = (1+\kappa)C_{n-1} + T_{raw}, \quad C_n = \frac{(1+\kappa)^n - 1}{\kappa}\cdot T_{raw} \to \infty \text{ geometrically}

Context exhaustion occurs at the first call whose cumulative context reaches the window, n^* = \min\{n : C_n \geq K\}.
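The recurrence and the exhaustion step can be computed directly. In the sketch below, T_RAW = 2,000 tokens per call, \kappa = 0.10, and K = 200,000 are hypothetical values chosen for illustration, and C_0 = 0 is assumed as the initial condition implied by the closed form.

```python
def cumulative_context(n, t_raw, kappa):
    """Closed form C_n = ((1 + kappa)^n - 1) / kappa * T_raw, with C_0 = 0."""
    return ((1.0 + kappa) ** n - 1.0) / kappa * t_raw

def exhaustion_step(t_raw, kappa, K):
    """Smallest n with C_n >= K, found by iterating the recurrence.

    Requires t_raw > 0 so that the loop terminates.
    """
    c, n = 0.0, 0
    while c < K:
        c = (1.0 + kappa) * c + t_raw  # C_n = (1 + kappa) * C_{n-1} + T_raw
        n += 1
    return n

T_RAW, KAPPA, K = 2_000.0, 0.10, 200_000.0  # hypothetical values (assumptions)
n_star = exhaustion_step(T_RAW, KAPPA, K)
print("exhaustion at call n* =", n_star)
print("C_(n*-1) =", round(cumulative_context(n_star - 1, T_RAW, KAPPA)))
print("C_(n*)   =", round(cumulative_context(n_star, T_RAW, KAPPA)))
```

Under these assumed values the window is exhausted well before 52 sequential calls complete, which is the qualitative pattern the section describes.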

12 Appendix D: Empirical Case Study — Full 52-HEI Registry

Table 4 lists all fifty-two higher education institutions (HEIs) in the UAE included in the OBF SupTech-RegTech Platform deployment. The user count follows a fixed rule: 7 base users plus 2 per college, yielding 7 + 2k for an institution with k colleges.

Table 4: Complete registry of 52 UAE higher education institutions.
# Code Institution (English) Cols Progs Users
1 ADHA Abu Dhabi Hospitality Academy — Les Roches 1 3 9
2 AQU Al Qasimiya University 4 5 15
3 AWU Al Wasl University 3 4 13
4 AMITY Amity University Dubai 3 5 13
5 AGDA Anwar Gargash Diplomatic Academy 2 3 11
6 BMC Batterjee Medical College — Dubai 3 3 13
7 BITS BITS Pilani Dubai Campus 3 7 13
8 DMU De Montfort University Dubai 3 5 13
9 DIDI Dubai Institute of Design and Innovation 1 4 9
10 DMEU Dubai Medical University 3 5 13
11 DPA Dubai Police Academy 2 5 11
12 EMN EM Normandie Business School Dubai 1 5 9
13 EAIC Emirates Academy for Identity and Citizenship 1 3 9
14 EAHM Emirates Academy of Hospitality Management 1 3 9
15 EAU Emirates Aviation University 2 6 11
16 ESCP ESCP Business School Dubai Campus 1 3 9
17 ESMOD ESMOD French Fashion Institute Dubai 1 4 9
18 EURAK European University RAK Campus 1 5 9
19 FCMS Fakeeh College for Medical Sciences Dubai 3 4 13
20 FCHS Fatima College of Health Sciences 3 5 13
21 GUD Georgetown University Dubai 1 2 9
22 GSU Global Studies University 1 3 9
23 HBZC Hamdan Bin Zayed College 1 3 9
24 HWUD Heriot-Watt University Dubai 3 5 13
25 HCT Higher Colleges of Technology 5 7 17
26 HUC Horizon University College 2 5 11
27 HULT Hult International Business School 1 5 9
28 IMC Imam Malik College for Islamic Sharia and Law 1 6 9
29 IIMA IIM Ahmedabad Dubai 1 2 9
30 INSEAD INSEAD Abu Dhabi 1 4 9
31 IMTD Institute of Management Technology Dubai 1 4 9
32 IAURAK International American University RAK Campus 2 3 11
33 IMAR Istituto Marangoni Dubai 1 6 9
34 JCSC Joint Command and Staff College 1 3 9
35 JU Jumeira University 4 5 15
36 KBZAC Khalifa Bin Zayed Air College 1 4 9
37 LBSD London Business School Dubai Campus 1 3 9
38 LUISS LUISS University Dubai 1 4 9
39 MAHE Manipal Academy of Higher Education 4 6 15
40 MDXD Middlesex University Dubai 3 5 13
41 MBZUAI Mohamed Bin Zayed Univ. of Artificial Intelligence 1 7 9
42 MBRSG Mohammed Bin Rashid School of Government 1 3 9
43 MBRU Mohammed Bin Rashid Univ. of Medicine & Health Sciences 3 6 13
44 MURDU Murdoch University Dubai 4 6 15
45 NDC National Defense College UAE 1 2 9
46 NHSB Neohorizon School of Business 1 2 9
47 PRUE Plekhanov Russian Univ. of Economics — Dubai 1 4 9
48 PCAD Police College Abu Dhabi 2 2 11
49 PSAS Police Sciences Academy — Sharjah 2 4 11
50 RA Rabdan Academy 1 5 9
51 RAKMHSU Ras Al Khaimah Medical and Health Sciences University 4 6 15
52 RBSNC Rashid Bin Saeed Al Maktoum Naval College 1 3 9

Descriptive statistics for the 52-HEI registry:

Statistic Colleges/HEI Programmes/HEI Users/HEI
Minimum 1 2 9
Maximum 5 7 17
Mean 1.92 4.27 10.85
Std. Dev. (sample) ≈1.2 ≈1.4 ≈2.3
Total 100 222 564

The distribution of colleges per institution is highly right-skewed: 28 of 52 institutions (53.8%) have a single college, reflecting the prevalence of specialist institutions in the UAE higher education landscape.
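The totals can be re-derived from the distribution of college counts. The sketch below uses a tally of the Cols column of Table 4 (28, 7, 11, 5, and 1 institutions with one to five colleges respectively) together with the 7 + 2k user rule; the variable and function names are illustrative.

```python
from statistics import mean

# Number of institutions by college count, tallied from the Cols column of Table 4.
COLLEGE_DISTRIBUTION = {1: 28, 2: 7, 3: 11, 4: 5, 5: 1}

def users_for(n_colleges):
    """User rule: 7 base users plus 2 per college."""
    return 7 + 2 * n_colleges

# Expand the tally into one entry per institution.
colleges = [k for k, count in COLLEGE_DISTRIBUTION.items() for _ in range(count)]
users = [users_for(k) for k in colleges]

print("institutions:", len(colleges))
print("total colleges:", sum(colleges), "total users:", sum(users))
print("mean colleges:", round(mean(colleges), 2), "mean users:", round(mean(users), 2))
print("single-college share:", round(100 * COLLEGE_DISTRIBUTION[1] / len(colleges), 1), "%")
```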

Task Heterogeneity and Batch Configuration. The batch configuration B is characterised by the structural regularity of the 52 tasks. All 52 scripts share an identical template schema with institution-specific parameter substitution. The regularity dimensions are: (i) template structure: 100% shared; (ii) parameter types: identical; (iii) substitution logic: consistent replacement rules; and (iv) heterogeneity dimension: institutional parameters only. This configuration represents near-maximum B—the regime in which Theorem 1 predicts the largest efficiency gains.

13 Appendix E: Platform Architecture and Technical Specification

13.1 OBF SupTech-RegTech Platform Overview

The OBF SupTech-RegTech Platform is a Shiny-based R application implementing the UAE MoHESR OBF v11 compliance monitoring system. The technology stack comprises: R (v4.x) and RStudio/Positron as the primary development environment; Shiny and shinydashboard for the web application framework; shinymanager for authentication and RBAC; RSQLite and DBI for database operations; and SQLite as the embedded database engine.

13.2 File Structure per Institution

Each institution’s deployment comprises:

{FOLDER}/
|-- init_database_{code}.r      # Database initialisation script
|-- app_{code}.r                # Main Shiny application
|-- obf_{code}_v9_5.sqlite      # OBF compliance database
`-- obf_{code}_v9_5_auth.sqlite # Authentication database

The init_database_{code}.r script is 1,500–1,700 lines and, when executed, creates and seeds both SQLite databases with all institution-specific data.
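The naming convention above can be expressed as a small helper. This is a hypothetical sketch, not part of the platform code: the function name is illustrative, and lower-casing the institution code is an assumption based on the lowercase filenames shown elsewhere in this appendix.

```python
def deployment_files(code, version="v9_5"):
    """Return the four expected filenames for one institution.

    Assumption: filenames use the lowercase institution code.
    """
    c = code.lower()
    return [
        f"init_database_{c}.r",           # database initialisation script
        f"app_{c}.r",                     # main Shiny application
        f"obf_{c}_{version}.sqlite",      # OBF compliance database
        f"obf_{c}_{version}_auth.sqlite", # authentication database
    ]

adha_files = deployment_files("ADHA")
for name in adha_files:
    print(name)
```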

13.3 Template Substitution Architecture

Table 5 lists the twelve substitution categories applied by platform_generator.py per institution.

Table 5: Twelve substitution categories applied by platform_generator.py.
Cat. Substitution Example (KU → ADHA) Method
1 Institution code (variable names) KU_ → ADHA_ String replace
2 Short code value "KU" → "ADHA" Regex
3 Full English name Khalifa → Les Roches JSON lookup
4 Arabic name (R unicode escapes) 62c… → literal Raw string
5 Website URL www.ku.ac.ae → institution URL Ordered replace
6 Email domain ku.ac.ae → adha.ae String replace
7 Auth DB filename obf_ku… → obf_adha… String replace
8 Auth credentials block ku_credentials → adha_credentials Block replace
9 Validation counts (colleges) col_n==3 → col_n==1 String replace
10 Program counts prg_n==60 → prg_n==3 String replace
11 User counts u_n==18 → u_n==9 String replace
12 Summary messages KU OBF INIT → ADHA OBF INIT String replace
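The ordered-replacement idea behind several of these categories can be sketched as follows. This is not the actual platform_generator.py, whose source is not reproduced here; the function name, the rule list, the sample template line, and the placeholder URL www.example-adha.test are all hypothetical. The point illustrated is ordering: the longer website string is replaced before the bare email domain so that the second rule cannot corrupt the first.

```python
def apply_substitutions(text, rules):
    """Apply (old, new) string replacements in the given order."""
    for old, new in rules:
        text = text.replace(old, new)
    return text

# Hypothetical KU -> ADHA rules; order matters for overlapping strings.
RULES_KU_TO_ADHA = [
    ("www.ku.ac.ae", "www.example-adha.test"),  # placeholder URL (assumption)
    ("ku.ac.ae", "adha.ae"),                    # email domain
    ("ku_credentials", "adha_credentials"),     # auth credentials block name
    ("KU OBF INIT", "ADHA OBF INIT"),           # summary message
    ("col_n==3", "col_n==1"),                   # college validation count
    ("prg_n==60", "prg_n==3"),                  # programme validation count
    ("u_n==18", "u_n==9"),                      # user validation count
]

# Hypothetical template fragment, for illustration only.
template = ('site <- "www.ku.ac.ae"; mail <- "admin@ku.ac.ae"; '
            'stopifnot(col_n==3, u_n==18)  # KU OBF INIT')
generated = apply_substitutions(template, RULES_KU_TO_ADHA)
print(generated)
```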

13.4 Role-Based Access Control Schema

Table 6 summarises the user role schema. Total users = 7 + 2k where k = number of colleges.

Table 6: User role schema.
Role Username Pattern Access Level Count
Platform Administrator code.admin@email_domain Full admin 1
MoHESR Observer mohesr_user@mohesr.gov.ae Read-only regulator 1
President / VC president@email_domain Executive read 1
University Admin univ.admin@email_domain Institution-wide 1
QA Director qa.director@email_domain QA module full access 1
College Dean dean.{college}@email_domain College-scoped 1 per college
College QA Chair qa.{college}@email_domain College QA read 1 per college
Data Entry Officer data.entry@email_domain Data entry write 1
Viewer viewer@email_domain Dashboard read-only 1
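The role schema implies a simple rule for generating accounts. The sketch below is illustrative only: the function name and the college identifiers are hypothetical, and lower-casing the institution code in the administrator username is an assumption. It reproduces the 7 + 2k count from the username patterns in Table 6.

```python
def build_usernames(code, domain, colleges):
    """Return the usernames implied by Table 6 for one institution."""
    fixed = [
        f"{code.lower()}.admin@{domain}",  # Platform Administrator
        "mohesr_user@mohesr.gov.ae",       # MoHESR Observer
        f"president@{domain}",             # President / VC
        f"univ.admin@{domain}",            # University Admin
        f"qa.director@{domain}",           # QA Director
        f"data.entry@{domain}",            # Data Entry Officer
        f"viewer@{domain}",                # Viewer
    ]
    per_college = []
    for college in colleges:
        per_college.append(f"dean.{college}@{domain}")  # College Dean
        per_college.append(f"qa.{college}@{domain}")    # College QA Chair
    return fixed + per_college

# Hypothetical college identifier for a single-college institution.
adha_users = build_usernames("ADHA", "adha.ae", ["hospitality"])
print(len(adha_users), "users")
```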

14 Appendix F: Validation Methodology and Full Results

14.1 Validation Design

Following batch generation, a systematic bulk validation sweep applied twelve programmatic checks per generated file, adopting a negative-test paradigm: checks verified the absence of residual source-HEI (KU) strings and the presence of institution-specific structural signatures.

14.2 Validation Checks

Table 7 presents the twelve validation checks applied to all 52 generated scripts. All checks passed 52/52.

Table 7: Twelve validation checks applied to all 52 generated scripts.
Check Description Criterion Result
V-01 Arabic name replacement ≠ KU Arabic literal 52/52 PASS
V-02 Short code value CODE_SHORT_CODE ≠ “KU” 52/52 PASS
V-03 Brand comment # CODE Brand (not KU) 52/52 PASS
V-04 Website URL Institution-specific URL 52/52 PASS
V-05 Auth credentials variable code_credentials (not ku_) 52/52 PASS
V-06 Auth user count Users = 7 + 2×colleges 52/52 PASS
V-07 College validation col_n == N_colleges 52/52 PASS
V-08 Program validation prg_n == N_programs 52/52 PASS
V-09 User validation u_n == N_users 52/52 PASS
V-10 Completion banner CODE OBF INIT COMPLETE 52/52 PASS
V-11 Launch instruction Launch app_CODE.r (not app_ku.r) 52/52 PASS
V-12 No residual KU strings 0 occurrences outside header 52/52 PASS
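A negative-test check of this kind can be sketched in a few lines. This is a hypothetical illustration, not the validation script used in the study: the marker list, function names, and sample strings are assumptions. It shows the two ideas the checks combine, namely the absence of source-institution strings and agreement with the 7 + 2k user rule.

```python
# Hypothetical source-institution markers whose presence signals a failed substitution.
# Plain substring matching can over-match; a production check would likely need stricter patterns.
SOURCE_MARKERS = ('"KU"', "ku.ac.ae", "ku_credentials", "KU OBF INIT", "app_ku.r")

def residual_markers(script_text, header_lines=0):
    """Return source markers still present after skipping header_lines lines."""
    body = "\n".join(script_text.splitlines()[header_lines:])
    return [marker for marker in SOURCE_MARKERS if marker in body]

def validate(script_text, n_colleges, n_users, header_lines=0):
    """Run two illustrative checks: no residual markers, and users == 7 + 2k."""
    return {
        "no_residual_ku": not residual_markers(script_text, header_lines),
        "user_count": n_users == 7 + 2 * n_colleges,
    }

clean = 'ADHA_SHORT_CODE <- "ADHA"\n# ADHA OBF INIT COMPLETE\n'
dirty = clean + 'mail <- "admin@ku.ac.ae"\n'

clean_result = validate(clean, n_colleges=1, n_users=9)
dirty_result = validate(dirty, n_colleges=1, n_users=9)
print(clean_result)
print(dirty_result)
```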

14.3 Spot-Check Methodology

Four institutions were selected for detailed manual review:

  • ADHA (HEI 1): Single-college specialist. Verified Arabic name, Les Roches URL, 3-programme breakdown, 9-user credential structure.
  • HCT (HEI 25): Maximum-college (5 colleges, 17 users). Verified 5 dean/QA pairs, col_n==5, u_n==17, programme breakdown.
  • MAHE (HEI 39): Multi-college international campus (4 colleges). Verified Manipal email domain, 15-user count, 6-programme count.
  • RAKMHSU (HEI 51): Medical university (4 colleges). Verified RAK-specific metadata, medical programme breakdown, correct auth DB filename.

14.4 Error Rate Analysis

The validated error rate was 0/52 = 0.0% across all twelve check categories and all fifty-two institutions. The zero error rate reflects both the correctness of the generation logic and the efficiency of batch processing: in a single pass, the model had access to the complete JSON source of truth for all 52 institutions simultaneously, enabling consistent cross-institution parameter propagation that sequential calls cannot guarantee.

15 Appendix G: Sensitivity Analysis and Boundary Cases

15.1 Sensitivity to \beta and \gamma

Table 8 reports \Pi(1, 52, 1) for \alpha=1, \theta=1 across a grid of (\beta,\gamma) values. Because \theta=1 implies \theta^\beta=1 for every \beta, the rows coincide and \Pi = 1 + 51^\gamma varies with \gamma only. The \gamma=1 column (52.0) matches the empirical calibration.

Table 8: \Pi(1, 52, 1) under power-law specification for \alpha=1, \theta=1.
\beta \backslash \gamma 0.5 0.75 1.0 1.25 1.5
0.5 8.1 20.1 52.0 137.3 365.2
0.75 8.1 20.1 52.0 137.3 365.2
1.0 8.1 20.1 52.0 137.3 365.2
1.25 8.1 20.1 52.0 137.3 365.2
1.5 8.1 20.1 52.0 137.3 365.2

15.2 Boundary Case: D = 1

When D=1, Assumption A4 requires \Pi(\theta,1,B)=1, so T_{eff} = T_{raw}(Q,1)—the sequential benchmark. The cross-partial \Pi_{\theta D}|_{D=1} may be zero or undefined depending on the functional form; for the power-law specification with \gamma>1, \Pi_{\theta D}|_{D=1} = 0. The theorem is vacuous at this boundary but well-defined for any D>1.

15.3 Boundary Case: \theta \to 0

As \theta\to 0, the power-law specification (10) gives \Pi\to 1 for any D and B. In this regime, even a large batch D=N cannot be processed efficiently because the model lacks capacity to exploit cross-task regularities. For \beta>1, \Pi_\theta and \Pi_{\theta D} both vanish as \theta\to 0, so the cross-partial approaches zero, consistent with the theorem’s condition \theta>0.

15.4 Boundary Case: D \to \infty

As D\to\infty with fixed \theta and B, the efficiency \Pi\to\infty under the power-law specification, so with linear T_{raw} the effective consumption per task, T_{eff}/D = T_1/\Pi, tends to zero, while T_{eff} itself tends to zero only if \gamma>1. In practice, D is bounded by the finite context window K: for sufficiently large D, T_{raw}(Q,D) > K and the call fails. The relevant range is D\leq D^*(K,\theta) where D^* is the context-feasible scope maximum.

15.5 Effect of Batch Regularity B

Table 9 shows the effect of batch regularity B on parallelisation efficiency (\alpha=\beta=\gamma=1, \delta^*=0.5, \theta=1, D=52).

Table 9: Effect of batch regularity B on parallelisation efficiency.
B Interpretation \Pi(1,52,B) T_{eff} / T_{raw}
0.25 Low regularity (heterogeneous) 1 + 51\times 0.50 = 26.5 0.038
0.50 Medium regularity 1 + 51\times 0.707 = 37.1 0.027
1.00 High regularity (identical template) 1 + 51\times 1.00 = 52.0 0.019
2.00 Very high (structured JSON) 1 + 51\times 1.414 = 73.1 0.014
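The entries can be regenerated from equation (10). The sketch below uses the parameter values stated for this table (\alpha=\beta=\gamma=1, \delta^*=0.5, \theta=1, D=52); the function name is illustrative, and the last column is computed as 1/\Pi.

```python
def pi_power(theta, D, B, alpha=1.0, beta=1.0, gamma=1.0, delta_star=0.5):
    """Equation (10): Pi = 1 + alpha * theta^beta * (D - 1)^gamma * B^delta_star."""
    return 1.0 + alpha * theta**beta * (D - 1.0)**gamma * B**delta_star

regularity_grid = (0.25, 0.50, 1.00, 2.00)
efficiency = {B: pi_power(1.0, 52, B) for B in regularity_grid}

for B, value in efficiency.items():
    # T_eff / T_raw is the reciprocal of Pi by definition of T_eff.
    print(f"B={B:<4}  Pi={value:.2f}  T_eff/T_raw={1.0 / value:.4f}")
```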

16 Appendix H: Robustness — Alternative Functional Forms for T_{raw}

16.1 Motivation

The main paper assumes T_{raw} is increasing in Q and D and independent of \theta (f_\theta = 0). We examine robustness to alternative specifications.

16.2 Case: T_{raw} Increasing in \theta

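Suppose T_{raw}(Q,D,\theta) = T_0(Q,D)\cdot\theta^\varepsilon for small \varepsilon>0. Then f_\theta = \varepsilon\cdot T_{raw}/\theta > 0 and f_{\theta D} = \varepsilon\cdot T_{\mathrm{raw},D}/\theta > 0, so the two terms of Lemma 1 that vanish in the baseline case reappear: f_{\theta D}/\Pi > 0 works against the negative cross-partial, while -f_\theta\Pi_D/\Pi^2 < 0 reinforces it. Their sum is (\varepsilon/\theta)\cdot\partial T_{eff}/\partial D, which is of order \varepsilon. The theorem continues to hold provided the supermodularity term T_{raw}\Pi_{\theta D} dominates the perturbation. For small \varepsilon, this is satisfied.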

16.3 Case: Sub-Additive T_{raw} in D

Suppose T_{raw}(Q,D) = D^\lambda T_1 for 0<\lambda<1. Then T_{\mathrm{raw},D} = \lambda D^{\lambda-1}T_1 > 0, diminishing in D. Term I becomes -\Pi^{-2}\lambda D^{\lambda-1}T_1\Pi_\theta < 0, smaller in magnitude than in the linear case. The negative cross-partial is preserved as long as the inequality of Lemma 2 continues to hold with this smaller T_{\mathrm{raw},D}.

16.4 Case: Diminishing Returns in D for \Pi

Suppose \Pi_D is decreasing in D (log-interaction form \Pi=1+\alpha\theta\log D, where \Pi_D = \alpha\theta/D and \Pi_{\theta D} = \alpha/D > 0). Supermodularity is preserved and the theorem holds.
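The log-interaction case can be checked against the inequality of Lemma 2. The sketch below assumes \alpha=\theta=1, a linear raw cost T_{raw} = D\cdot T_1, and a hypothetical T_1 = 1,000; a positive gap (left side minus right side) means the cross-partial of T_{eff} is negative at that value of D. Other values of \theta would need to be checked separately.

```python
import math

def lemma2_gap_log_form(D, theta=1.0, alpha=1.0, t1=1_000.0):
    """LHS minus RHS of the Lemma 2 inequality for Pi = 1 + alpha*theta*log(D)."""
    t_raw, t_raw_D = D * t1, t1          # linear raw cost (assumption)
    pi = 1.0 + alpha * theta * math.log(D)
    pi_theta = alpha * math.log(D)
    pi_D = alpha * theta / D
    pi_thetaD = alpha / D
    lhs = t_raw_D * pi_theta + t_raw * pi_thetaD
    rhs = 2.0 * t_raw * pi_theta * pi_D / pi
    return lhs - rhs

gaps = {D: lemma2_gap_log_form(D) for D in range(2, 53)}
print("inequality holds for every D in 2..52:", min(gaps.values()) > 0)
print("gap at D=52:", round(gaps[52], 1))
```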

16.5 Summary of Robustness Results

Table 10 summarises the robustness of Theorem 1 to alternative functional form assumptions.

Table 10: Robustness of Theorem 1 to alternative functional form assumptions.
Modification Direction Theorem Holds? Condition
T_{raw} increasing in \theta (\varepsilon small) Reduces |\text{cross-partial}| Yes \varepsilon< threshold
T_{raw} concave in D Reduces Term I Yes No extra condition
\Pi concave in D (log form) Reduces \Pi_D Yes Supermodularity preserved
\Pi concave in \theta Reduces \Pi_\theta Yes Supermodularity preserved
T_{raw} strongly increasing in \theta May reverse sign Conditional \varepsilon < 2\Pi_D/\Pi_\theta
B constant Rescales \Pi uniformly Yes Supermodularity unaffected

The main theorem is robust across all practically relevant alternative specifications. The only case in which the result may fail—strongly capacity-dependent raw token generation—represents an unusual model behaviour pattern not observed in current frontier models, where output length is primarily determined by task content, not model capacity.
