The Missing Middle: Infrastructure-Aware Intelligence and the Economics of Good-Enough AI
Frontier Intelligence vs. the Intelligence of the Last Mile
A $680-billion hyperscale arms race is accelerating on one track while 2.2 billion people remain offline. The engineering gap between the two worlds, measured in watts, tokens, and dollars per gigabyte, is now the defining challenge of applied AI.
■ Bottom Line Up Front (BLUF)
Artificial intelligence is fracturing into two irreconcilable paradigms. The first—hyperscale, centralized, energy-intensive frontier AI—is consuming electricity at 17% annual growth and absorbing more than $680 billion in capital expenditure in 2026 alone, yet remains physically and economically inaccessible to the majority of humanity. The second—distributed, quantized, bandwidth-aware edge AI—is deploying small language models (SLMs) on sub-4 GB RAM hardware at near-zero marginal cost per inference. The SLM edge deployment market is growing at 30% CAGR toward $12.85 billion by 2030. Engineering decisions made in the next three years—in silicon architecture, model compression, prompt engineering, and network economics—will determine whether AI becomes a universal technology or an accelerant of existing global inequality. Neither track is a substitute for the other; both are permanent structural features of the AI landscape.
I. Two Tracks, One Technology
The capital markets tell one story about AI. Amazon Web Services, Microsoft Azure, Google Cloud, Meta, and Oracle collectively plan to spend more than $680 billion on AI infrastructure in 2026, a figure that, if treated as national GDP, would rank among the top forty economies on earth. Alphabet alone is doubling its infrastructure capital expenditure to between $175 billion and $185 billion this year, primarily to defend its search franchise and absorb a $240-billion Google Cloud revenue backlog. The International Energy Agency (IEA) reports that data center electricity demand jumped 17% in 2025 and is on course to double by 2030, with AI-specific facilities growing even faster. By one projection, global data center electricity consumption will approach 1,050 terawatt-hours by 2026, a load that, were it a country, would rank fifth globally in electricity consumption, between Japan and Russia.
The telecommunications data tell a different story entirely. As of late 2025, the ITU reports that 6 billion people are online—but 2.2 billion remain completely offline, the overwhelming majority in low- and middle-income countries. Even among the connected, GSMA Intelligence's 2025 State of Mobile Internet Connectivity Report identifies affordability of handsets and mobile data as the primary barriers to meaningful internet adoption. In roughly 60% of low- and middle-income countries, mobile broadband remains unaffordable by the ITU's own standard of 2% of monthly GNI per capita for 1 GB of data. For users in these regions, querying a frontier cloud model is not merely slow—it is a direct, metered, often prohibitive financial transaction.
These two realities are not temporary mismatches that deployment will eventually reconcile. They represent structurally distinct computational architectures, driven by different physics, different economics, and increasingly different engineering communities. As one industry analysis framed it: "The emerging architecture is bifurcated: centralized systems dominate training, while distributed systems handle physical-world intelligence and real-time inference at the edge. The implications are structural rather than incremental."
"Frontier intelligence raises the ceiling of machine cognition. Infrastructure-aware intelligence raises the floor of access. Both are permanent features of the AI landscape—and neither is sufficient alone."
— IEEE Spectrum Analysis, June 2026
II. The Hyperscale Track: Capital, Power, and Structural Fragility
The $680-Billion Build-Out
The numbers governing the hyperscale track are staggering in both magnitude and growth rate. Synergy Research Group counted 1,297 operational hyperscale data centers worldwide as of late 2025—nearly triple the count from early 2018—with a pipeline of 770 additional facilities under development. A January 2026 Bloom analysis projects that the nameplate capacity of facilities either under construction or in planning will nearly double between 2025 and 2028, from 80 to 150 gigawatts. Meta has guided 2026 capital expenditure of $115 billion to $135 billion, up from $71.8 billion in 2025, and carries $103.8 billion in non-cancellable data-center lease obligations through 2030.
Financing the build-out has required a structural pivot from internal cash generation to debt markets. Since late 2024, the five largest hyperscalers have tapped capital markets for more than $137.5 billion in debt, an historic surge for the technology sector. As Allianz Research noted in March 2026, hyperscalers that had ignored debt markets for a decade were already issuing more debt in the first quarter of 2026 than they had in all of 2025. Oracle, which pledged $300 billion in AI infrastructure for OpenAI in September 2025, saw its stock price fall more than 57% in the ensuing months; its capital expenditure for the first half of fiscal 2026 equaled 66% of revenues.
Energy: The Binding Physical Constraint
Power availability, not capital, has emerged as the binding constraint on hyperscale expansion. A January 2026 report found that the value of data center projects either blocked or delayed by community opposition had reached $162 billion across 36 projects as of June 2025; a further 25 projects were canceled in 2025 alone in response to local resistance. In Ireland, data centers already consume 21% of national electricity, with IEA projections suggesting a rise to 32% by 2026. In Virginia, home to the world's densest data center market, about 26% of all electricity is consumed by facilities that serve computing workloads nationally. The IEA projects that, in a rapid-growth scenario, U.S. AI and data infrastructure could account for 7.4% of all national electricity consumption by 2030 and up to 15% by 2050.
Frontier model inference imposes a substantial per-query energy burden. A modern data center GPU draws 300–700 W, and every token it processes also carries the facility's overhead: cooling and power delivery (captured by Power Usage Effectiveness, or PUE) and the network energy cost of approximately 5 kWh per gigabyte of data transmitted. These costs are real, recurring, and scale with every additional user query.
III. The Inference Inequality Framework
To understand why hyperscale AI cannot simply be universalized through infrastructure deployment, it is necessary to examine what engineers are increasingly calling Inference Inequality: the structural disparity in computational access driven by hardware, network economics, and physical geography. Inference Inequality operates through three interlocking constraints.
Hardware: The Memory Wall
The most frequently overlooked constraint is not bandwidth or latency—it is RAM. Generative AI inference requires keeping model weights in fast random-access memory during processing. Global smartphone shipments declined 2.9% in early 2026, breaking a ten-quarter growth streak, not due to slack demand but because a severe memory chip shortage is driving bill-of-materials costs up 20–30% for lower-end devices. Memory component prices are projected to rise an additional 40% through mid-2026. The practical consequence: OEMs are abandoning the sub-$100 smartphone segment entirely, leaving the entry-level market underserved by devices with insufficient RAM for even lightly quantized on-device inference. In several emerging markets, retail smartphone prices have already surged 40–50%.
For edge AI, the practical RAM envelope is 2–4 GB. Below that threshold, standard Android memory management—specifically the Low Memory Killer Daemon (LMKD), which uses Pressure Stall Information (PSI) monitors introduced in Android 10—will terminate the AI process to protect baseline system functionality. Running a 3-billion-parameter model even at aggressive 4-bit integer (INT4) quantization requires approximately 2 GB of RAM; attempting to load it on a constrained device triggers PSI spikes that the OS resolves by killing the inference process.
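The 2 GB figure follows from simple arithmetic, sketched below. The 0.5 GB runtime overhead (KV cache, activations, runtime) and the 1.5 GB reserved for the operating system are illustrative assumptions, not measured values, and the function names are hypothetical.

```python
def model_ram_gb(params_billions: float, bits_per_weight: float,
                 overhead_gb: float = 0.5) -> float:
    """Rough RAM needed for quantized weights plus runtime overhead, in GB.

    overhead_gb is an assumed allowance for KV cache and activations.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb


def fits_on_device(params_billions: float, bits_per_weight: float,
                   device_ram_gb: float, os_reserved_gb: float = 1.5) -> bool:
    """True if the model fits in the RAM left after the OS takes its share.

    os_reserved_gb is an assumed figure for Android and background apps.
    """
    available_gb = device_ram_gb - os_reserved_gb
    return model_ram_gb(params_billions, bits_per_weight) <= available_gb


if __name__ == "__main__":
    print(model_ram_gb(3, 4))          # 3B parameters at INT4
    print(fits_on_device(3, 4, 4.0))   # 4 GB handset
    print(fits_on_device(3, 4, 3.0))   # 3 GB handset
```

Under these assumptions a 3B model at INT4 needs about 2 GB, fits on a 4 GB handset, and does not fit on a 3 GB handset, which is the regime where LMKD would be expected to intervene.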
Network Economics: The Data Tax
For users who might bypass local hardware limitations by routing queries to cloud models, mobile data economics impose a regressive tax on AI access. The Alliance for Affordable Internet and the ITU define affordability as 1 GB of mobile data costing no more than 2% of monthly GNI per capita. This threshold is violated in roughly 60% of low- and middle-income countries. In six African nations, 1 GB of data exceeded 10% of average monthly income in 2023; in South Sudan, Zimbabwe, and the Central African Republic, that figure exceeded 30%. In island nations such as São Tomé and Príncipe, 1 GB can cost $29.50. At these prices, a multimodal AI query involving an image is not a utility—it is a discretionary, costly financial decision.
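The affordability test is a single ratio, and the same price per gigabyte can be applied to an individual query. The sketch below assumes a hypothetical monthly income of $200 and a hypothetical 2 MB image upload; only the 2% threshold and the $29.50 price come from the text above.

```python
AFFORDABILITY_THRESHOLD = 0.02  # 1 GB should cost at most 2% of monthly GNI per capita


def data_cost_share(price_per_gb_usd: float, monthly_gni_usd: float) -> float:
    """Fraction of monthly income consumed by 1 GB of mobile data."""
    return price_per_gb_usd / monthly_gni_usd


def is_affordable(price_per_gb_usd: float, monthly_gni_usd: float) -> bool:
    """Apply the 2% affordability threshold."""
    return data_cost_share(price_per_gb_usd, monthly_gni_usd) <= AFFORDABILITY_THRESHOLD


def query_data_cost_usd(payload_mb: float, price_per_gb_usd: float) -> float:
    """Data cost of one query, treating 1 GB as 1000 MB."""
    return payload_mb / 1000 * price_per_gb_usd


if __name__ == "__main__":
    income = 200.0  # hypothetical monthly GNI per capita, in USD
    print(f"{data_cost_share(29.50, income):.2%} of monthly income per GB")
    print(f"${query_data_cost_usd(2.0, 29.50):.3f} per 2 MB image query")
```

At these illustrative values, 1 GB consumes roughly 15% of monthly income and a single image query costs about six cents in data alone.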
Geography: AI Desert Regions
The physics of fiber optics impose latency penalties that no amount of capital can fully eliminate. Routing data from an AWS Cape Town node to Hong Kong incurs median round-trip latencies of 237–240 ms; connections to Tokyo reach 281 ms. These delays, combined with inter-cloud routing inefficiencies, create what network analysts have termed AI Deserts—geographic regions where the combination of physical distance to compute and weak last-mile connectivity makes synchronous, interactive AI inference functionally unusable. Telecommunications operators are now explicitly segmenting their user bases into "High-Bandwidth AI Creators" and "Low-Bandwidth AI Consumers," acknowledging a permanent tier structure in capability access.
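The physical floor on latency can be estimated from distance alone. The sketch below assumes light travels through fiber at roughly 200,000 km/s (about two thirds of its vacuum speed) and uses approximate city-center coordinates as stand-ins for the actual data center sites; both are simplifications.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBER_SPEED_KM_S = 200_000.0  # assumed: roughly 2/3 of the speed of light in vacuum

# Approximate city-center coordinates (latitude, longitude), used as stand-ins.
CAPE_TOWN = (-33.92, 18.42)
HONG_KONG = (22.32, 114.17)


def great_circle_km(a: tuple, b: tuple) -> float:
    """Haversine great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over an ideal straight-line fiber path."""
    return 2 * distance_km / FIBER_SPEED_KM_S * 1000


if __name__ == "__main__":
    d = great_circle_km(CAPE_TOWN, HONG_KONG)
    print(f"distance ~{d:.0f} km, minimum RTT ~{min_rtt_ms(d):.0f} ms")
```

The great-circle distance is roughly 11,900 km, which gives a floor of roughly 119 ms. The measured 237–240 ms is about twice that, which is plausible given that real cable routes are longer than the great-circle path and add switching delay.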
IV. The Edge AI Track: Engineering Frugal Intelligence
Silicon Convergence at the Edge
While hyperscalers race to build larger GPU clusters, a parallel hardware revolution is occurring inside smartphones, laptops, and single-board computers. Qualcomm's Snapdragon 8 Gen 5 (shipping in 2026 flagships) delivers a 46% AI throughput improvement over its predecessor and processes up to 70 tokens per second on quantized LLMs, sufficient for genuine offline assistant experiences. The Snapdragon X2 Elite Extreme, showcased at CES 2026, features an 80 TOPS NPU, nearly double the previous generation's 45 TOPS. Apple's M4 Neural Engine delivers 38 TOPS with deep hardware-software integration. Intel's Core Ultra 300 series, built on the 18A (1.8 nm-class) process, integrates NPUs delivering 45–60 TOPS for mainstream Windows laptops. Deloitte estimates the market for inference-optimized chips will exceed $50 billion in 2026, up from $20 billion in 2025.
The consequence is a rapidly closing performance gap. Google's LiteRT benchmarks on Snapdragon 8 Elite Gen 5 demonstrated that more than 56 models run in under 5 ms on the NPU, with one vision-language model achieving time-to-first-token of 0.12 seconds on high-resolution images, a delay imperceptible to the human user. As the Edge AI and Vision Alliance documented in January 2026, where 7 billion parameters once seemed the minimum for coherent generation, sub-billion models now handle many practical tasks, with architecture and training data quality proving more decisive than parameter count at small scales.
Model Compression: The Mathematics of Deployability
The engineering bridge between frontier capability and edge deployability is Post-Training Quantization (PTQ). By reducing weight precision from 16-bit floating-point (FP16) to 4-bit integer (INT4), engineers achieve 75–93% reductions in model footprint, with the low end of that range corresponding to a straight FP16-to-INT4 conversion. The GGUF format and inference engines such as llama.cpp have standardized Q4_K_M quantization; a full PTQ workflow applied to Llama 3.2 3B shrinks the model to approximately 2 GB on disk, runnable via Ollama within the Termux environment on Android.
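The core idea of weight quantization can be shown in a few lines. The sketch below is a simplified symmetric scheme with one scale per block of weights. It is a stand-in for illustration only and is not the Q4_K_M layout used by llama.cpp, which is more elaborate. All function names are hypothetical.

```python
def quantize_int4(weights: list) -> tuple:
    """Symmetric quantization of one block of floats to integers in [-8, 7].

    Returns the integer codes and the single float scale shared by the block.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list, scale: float) -> list:
    """Recover approximate float weights from the 4-bit codes."""
    return [c * scale for c in codes]


def footprint_reduction(bits_from: int, bits_to: int) -> float:
    """Fractional size reduction from changing bits per weight (ignores scales)."""
    return 1 - bits_to / bits_from


if __name__ == "__main__":
    block = [0.82, -0.41, 0.07, -0.66, 0.19, 0.0, -0.93, 0.5]
    codes, scale = quantize_int4(block)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(codes, f"scale={scale:.4f}", f"max error={worst:.4f}")
    print(f"FP16 -> INT4 reduction: {footprint_reduction(16, 4):.0%}")
```

Each weight is stored in 4 bits instead of 16, and the rounding error per weight is bounded by half the block's scale. That bound is why smaller blocks, each with its own scale, tend to preserve accuracy better at the cost of a little extra storage.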
The major labs have converged on this paradigm. Meta's Llama 3.2 (1B and 3B parameter variants), Google's Gemma 3 (down to 270M parameters), Microsoft's Phi-4 mini (3.8B), HuggingFace's SmolLM2 (135M–1.7B), and Alibaba's Qwen 2.5 (0.5B–1.5B) all target efficient on-device deployment. Collectively, open-weight SLMs from these families have surpassed 300 million downloads. The SLM for edge deployment market is projected to grow from $3.42 billion in 2025 to $12.85 billion by 2030 at a 30.27% CAGR, driven by enterprise demand for privacy-first, low-latency solutions.
Beyond Quantization: DRAM-Flash Swapping
Even with INT4 quantization, models that exceed a device's available DRAM trigger Android's LMKD kill sequence. Architectural frameworks such as ActiveFlow (arXiv:2502.xxxxx) address this through adaptive DRAM-flash swapping. Rather than relying on the CPU-intensive zRAM compression cycle that degrades performance and drains batteries, ActiveFlow uses three specialized techniques: cross-layer active weight preloading, which uses current-layer activations to predict which weights will be required in subsequent layers and preloads them from flash while computation proceeds; sparsity-aware self-distillation, which compensates mathematically for approximations introduced during memory swapping; and pipeline orchestration, which dynamically allocates available DRAM among a hot weight cache, preloaded weights, and active computation weights based on real-time memory pressure. By keeping the inference process within LMKD thresholds, ActiveFlow enables models to run on devices that would otherwise terminate the process entirely.
V. The Economics: Cognition per Dollar and Joules per Token
The divergence between the two AI tracks becomes absolute when expressed in the metrics that ultimately determine deployment at scale: energy consumed per token generated, and useful cognitive output per dollar spent.
Energy-per-Inference
A data center GPU operating a frontier model runs within a 300–700W thermal envelope. This figure excludes the facility cooling overhead (PUE) and the network transmission energy penalty of approximately 5 kWh per gigabyte of data moved. Research published on arXiv assessing hybrid edge-cloud architectures found that a standard centralized architecture consumes an estimated 1,927 kWh per device per year; shifting appropriate workloads to the edge collapses this to approximately 674 kWh—a 65% reduction. At the micro level, benchmarking edge inference on a Raspberry Pi 4 (8 GB RAM) running Qwen 2.5 0.5B (quantized) measured 0.54 Joules per token for straightforward tasks and 3.13 Joules per token for complex reasoning tasks—orders of magnitude below the equivalent GPU energy expenditure. In emerging markets where grid electricity is rationed or unavailable, this is not an efficiency preference—it is a survival constraint for the deployment.
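The figures in this paragraph can be checked with basic unit arithmetic. The sketch below uses only numbers quoted above, plus one assumption: a 200-token response length chosen for illustration.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


def joules_to_wh(joules: float) -> float:
    """Convert joules to watt-hours (1 Wh = 3600 J)."""
    return joules / 3600


def response_energy_wh(tokens: int, joules_per_token: float) -> float:
    """Energy to generate a response of `tokens` tokens, in watt-hours."""
    return joules_to_wh(tokens * joules_per_token)


if __name__ == "__main__":
    print(f"hybrid edge saving: {percent_reduction(1927, 674):.1f}%")
    tokens = 200  # assumed response length
    print(f"simple task:  {response_energy_wh(tokens, 0.54):.3f} Wh")
    print(f"complex task: {response_energy_wh(tokens, 3.13):.3f} Wh")
```

The annual figures work out to a reduction of about 65%, matching the text. A 200-token answer on the Raspberry Pi costs roughly 0.03 Wh for a simple task and roughly 0.17 Wh for a complex one.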
Cognition-per-Dollar
Frontier API pricing structures reflect the cost reality of centralized compute. GPT-4o is priced at $5.00 per million input tokens and $15.00 per million output tokens; Claude 3.5 Sonnet at $3.00 and $15.00 respectively; Gemini 1.5 Flash offers a more aggressive $0.35/$1.05 structure. For a non-governmental organization deploying an educational chatbot to 100,000 rural students generating 50,000 monthly queries, the raw frontier API cost approaches $1,250 per month—excluding telecommunications routing costs and end-user bandwidth expenditure. In a pure L3 edge deployment running an open-weight SLM on locally provisioned hardware, the marginal cost per query is effectively zero after the upfront hardware investment. SLMs running on-device can cut cloud costs by up to 70% in hybrid configurations, according to industry benchmarking.
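The monthly bill depends heavily on tokens per query, which the paragraph does not state. The sketch below assumes 2,000 input tokens and 1,000 output tokens per query; those values are an assumption chosen because they reproduce the $1,250 figure at the quoted GPT-4o prices. The model keys are labels for this example, not API identifiers.

```python
# Prices in USD per million tokens (input, output), as quoted in the text.
PRICES = {
    "gpt-4o": (5.00, 15.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "gemini-1.5-flash": (0.35, 1.05),
}


def query_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """API cost of a single query at the listed per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def monthly_cost_usd(model: str, queries: int,
                     input_tokens: int, output_tokens: int) -> float:
    """Total monthly API cost for a fixed number of identical queries."""
    return queries * query_cost_usd(model, input_tokens, output_tokens)


if __name__ == "__main__":
    for name in PRICES:
        # 2,000 input / 1,000 output tokens per query is an assumption.
        total = monthly_cost_usd(name, 50_000, 2_000, 1_000)
        print(f"{name}: ${total:,.2f} per month")
```

With these assumptions the three models come to about $1,250, $1,050, and $87.50 per month respectively. Halving the prompt length changes the totals materially, which is the lever that prompt compression pulls.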
VI. Bandwidth-Aware AI: Bridging the Gap
For the billions of users whose devices cannot run even the smallest SLMs locally, and whose connectivity cannot support synchronous cloud queries, a third approach has emerged: bandwidth-aware AI, which optimizes the transmission of intelligence across severely constrained networks.
Algorithmic Prompt Compression
Microsoft's LLMLingua and its successor LLMLingua-2 address the bandwidth problem at the prompt level rather than the model level. The original LLMLingua deploys a compact edge model to calculate the perplexity of each token in a prompt, then applies a coarse-to-fine dynamic budget controller to prune non-essential tokens while preserving semantic content. Compression ratios of 20x are routinely achieved; experimental configurations have demonstrated ratios up to 480x while retaining 72% of original model capability.
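The pruning idea can be illustrated with a toy. The sketch below replaces the compact language model with a unigram frequency table and keeps the most surprising tokens within a fixed budget. It is a deliberately crude stand-in, not the LLMLingua algorithm or its API: the real system scores tokens with a causal language model and a budget controller. All names here are hypothetical.

```python
import math
from collections import Counter


def make_surprisal(corpus_tokens: list):
    """Build a token -> surprisal (bits) function from unigram counts.

    Add-one smoothing gives unseen tokens a finite, high surprisal.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts)

    def surprisal(token: str) -> float:
        return -math.log2((counts.get(token, 0) + 1) / (total + vocab + 1))

    return surprisal


def compress_prompt(prompt: str, surprisal, keep_ratio: float = 0.5) -> str:
    """Keep the highest-surprisal tokens, preserving their original order."""
    tokens = prompt.split()
    budget = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: surprisal(tokens[i].lower()),
                    reverse=True)
    kept = sorted(ranked[:budget])
    return " ".join(tokens[i] for i in kept)


if __name__ == "__main__":
    reference = "the a of the to is the a of and the".split()
    score = make_surprisal(reference)
    print(compress_prompt("the leaves of the maize are yellow", score))
```

On this input the common function words are dropped and the prompt shrinks from seven tokens to four: "leaves maize are yellow". The information-bearing words survive, which is the property the real perplexity-based method aims to preserve at much higher compression ratios.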
LLMLingua-2 resolves the critical latency problem of the original: compressing 48,000 tokens at 1.5x required 21 seconds of algorithmic overhead in the first version, negating any bandwidth savings. The successor replaces the sequential bottleneck with a direct token selection mechanism, reducing overhead to under 3 seconds. In practical terms, LLMLingua-2 can reduce cloud API costs by up to 80% for bandwidth-constrained deployments while speeding up prefill by up to 2.6x.
Asynchronous Gateway Architectures
For users on 2G or 3G networks without local inference capability, Level 2 architectures deliver AI through highly compressed asynchronous text interfaces—primarily the WhatsApp Business API, USSD, or structured SMS. A regional gateway absorbs the cloud API cost and manages the connection on the user's behalf. In India, WhatsApp Business API service messages cost approximately ₹0.29 ($0.003); in African markets, the cost scales higher but remains a fraction of the bandwidth expenditure of a direct cloud query. These architectures have enabled agricultural advisory deployments across Africa and India, where farmers submit queries via basic smartphones and receive AI-generated crop disease diagnoses through systems that check local cached databases before escalating to cloud inference only when necessary.
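The cache-before-cloud pattern described here can be sketched as a small gateway class. Everything in it is hypothetical: the class name, the normalization rule, and the `cloud_fn` callback, which stands in for whatever cloud model the gateway operator uses.

```python
from typing import Callable, Dict, Tuple


def normalize(query: str) -> str:
    """Canonical cache key: lowercase with collapsed whitespace."""
    return " ".join(query.lower().split())


class AdvisoryGateway:
    """Answers from a local cache first and escalates to the cloud on a miss."""

    def __init__(self, cloud_fn: Callable[[str], str]) -> None:
        self.cache: Dict[str, str] = {}
        self.cloud_fn = cloud_fn
        self.cloud_calls = 0

    def answer(self, query: str) -> Tuple[str, str]:
        """Return (reply, source), where source is 'cache' or 'cloud'."""
        key = normalize(query)
        if key in self.cache:
            return self.cache[key], "cache"
        reply = self.cloud_fn(key)      # metered call, paid for by the gateway
        self.cloud_calls += 1
        self.cache[key] = reply         # later identical queries cost nothing
        return reply, "cloud"


if __name__ == "__main__":
    gateway = AdvisoryGateway(lambda q: f"advice for: {q}")  # stub cloud model
    print(gateway.answer("Maize leaves   turning yellow"))
    print(gateway.answer("maize leaves turning yellow"))
    print("cloud calls:", gateway.cloud_calls)
```

Exact-match caching is the simplest possible policy. A production gateway would more plausibly match on crop, symptom, and region, but the cost logic is the same: only cache misses incur a cloud API charge.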
VII. Real-World Deployments: Frugal AI in the Field
The theoretical frameworks of infrastructure-aware AI are manifesting in documented field deployments across agriculture, healthcare, and education in the Global South. TechnoServe, CropIn, and platforms such as Kisan Mitra AI have demonstrated that AI advisory services delivered through L1/L2 architectures—offline edge gateways, mesh Wi-Fi, cached databases, and asynchronous WhatsApp interfaces—can serve farming communities where broadband is either absent or unaffordable. Boston Consulting Group's 2025 "AI for ALL" analysis documented how human escalation architectures, in which AI handles routine queries while routing anomalous or high-risk cases to scarce human experts, can extend the effective reach of a single agronomist or clinician across a much larger population.
In healthcare, hardware-accelerated single-board computers with dedicated NPUs are being installed in rural primary health centers, running open-source quantized medical SLMs to provide nursing staff with administrative assistance, preliminary decision support, and clinical documentation in offline environments. These deployments sidestep the data privacy vulnerabilities, latency barriers, and connectivity requirements of cloud-hosted medical APIs. When cases exceed the SLM's confidence threshold, the system queues them for asynchronous review by a connected clinician—a design pattern that exploits the asymmetry between the high volume of routine queries and the low volume of genuinely complex cases.
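The confidence-threshold pattern reduces to a small routing rule plus a queue that drains when connectivity returns. The sketch below is illustrative only: the 0.8 threshold, the class names, and the notion of a single scalar confidence score are assumptions, not details of any documented deployment.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Draft:
    """One SLM output with a self-reported confidence in [0, 1]."""
    case_id: str
    answer: str
    confidence: float


class Triage:
    """Serves confident answers locally and queues the rest for a clinician."""

    def __init__(self, threshold: float = 0.8) -> None:  # assumed threshold
        self.threshold = threshold
        self.review_queue = deque()

    def handle(self, draft: Draft) -> Optional[str]:
        """Return the answer if confident enough, else queue it and return None."""
        if draft.confidence >= self.threshold:
            return draft.answer
        self.review_queue.append(draft)
        return None

    def drain(self) -> List[Draft]:
        """Hand all queued cases to a reviewer once a connection is available."""
        pending = list(self.review_queue)
        self.review_queue.clear()
        return pending


if __name__ == "__main__":
    triage = Triage()
    print(triage.handle(Draft("c1", "routine dosage note", 0.93)))
    print(triage.handle(Draft("c2", "unusual symptom cluster", 0.41)))
    print([d.case_id for d in triage.drain()])
```

The design choice is that the queue is the only component that needs a network connection, so routine work continues offline while uncertain cases wait for human review.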
VIII. The Satellite Counterargument and Its Limits
The most common objection to the infrastructure-aware AI framework is the imminent universalization of connectivity through Low Earth Orbit (LEO) satellite constellations such as Starlink. The argument holds that if bandwidth is ubiquitous, edge AI is a stopgap, and the architectural investments of the last mile will become obsolete as hyperscale access extends to every geography.
This argument has three structural weaknesses. First, satellite terminals represent significant capital expenditure and monthly subscription costs that far exceed the 2% GNI affordability threshold for the underserved majority the argument purports to serve; universal coverage does not imply universal affordability. Second, the energy economics remain hostile even with satellite connectivity: the 5 kWh/GB network transmission penalty applies regardless of whether the signal travels through fiber or satellite uplink, while edge inference consumes 0.15–3.13 Joules per token locally. Third, and most fundamentally, satellite internet cannot inject RAM into a $100 Android device. The hardware memory constraint is a supply chain reality independent of connectivity; no improvement in network availability changes the physics of on-device inference on memory-constrained hardware.
"No amount of satellite internet can inject RAM into a depreciating $100 Android device. The hardware memory constraint is a supply-chain reality that connectivity cannot resolve."
— Infrastructure-Aware AI Framework Analysis, arXiv (2026)
IX. Market Structure and Emerging Engineering Categories
The bifurcation of AI scaling creates a distinct market for deployment engineering—startups and products focused not on building larger models but on making existing models deployable under physical and economic constraints. Several categories are consolidating.
Prompt compression middleware companies are building commercial wrappers around LLMLingua-2 and similar algorithms, positioned between low-bandwidth user channels (Twilio, WhatsApp Business API) and hyperscale LLM backends. By running compression at the gateway layer, these services reduce client cloud API costs while maintaining semantic fidelity—a cost structure that becomes increasingly attractive as API volume scales.
Low-memory OS orchestration layers are enabling budget smartphone OEMs to market devices capable of running SLMs without triggering LMKD termination events. These software layers manage RAM pressure, model loading sequences, and system stability during local inference in ways that the stock Android memory manager was not designed to handle.
AI edge caching appliances—ruggedized, often solar-powered single-board computers pre-loaded with compact models and domain-specific knowledge bases—are being designed for offline-first deployment in schools, agricultural cooperatives, and clinics. The hardware investment is upfront capital expenditure; the ongoing operational expenditure for data transmission is effectively zero.
Multilingual localization infrastructure addresses the language gap that frontier models largely ignore. While models such as GPT-4o and Claude 3.5 Sonnet are optimized for English and a handful of high-resource languages, Qwen 2.5 supports more than 29 languages natively and has demonstrated superior translation capability in low-resource language benchmarks. Edge deployments in African and South Asian markets depend critically on this multilingual capability at small model scales.
X. Conclusion: The ROI of Deployability
The prevailing narrative of artificial intelligence—that its destiny is defined by the largest models running in the largest data centers—reflects a deep proximity bias toward the infrastructure conditions of advanced economies. It is a narrative sustained by the engineers and investors who live within those conditions and who naturally measure progress by the metrics that matter in that context: benchmark performance, context window size, reasoning capability on graduate-level examinations.
A different set of metrics applies to the other track: tokens per watt on a 4 GB Android device; cost per query through a WhatsApp Business API gateway; inference latency on a quantized SLM running on a solar-powered Raspberry Pi in a rural health clinic with no internet connection. By these metrics, the progress of the last two years has been as remarkable as anything achieved in the hyperscale track. Models that would have required a 40-GB GPU in 2023 now run within 2 GB of RAM at 32 tokens per second—invisible latency to a human user—on hardware that costs less than a week's wages in the markets where it matters most.
The IEA projects that data center electricity consumption will double by 2030. The ITU projects that 2.2 billion people will remain offline through 2026, with billions more "under-connected" in ways that exclude them from meaningful AI utility. These two projections are not contradictions—they are the coordinates of the bifurcation. The highest societal return on investment from artificial intelligence will not be measured by the size of the parameter count or the rank on a reasoning benchmark. It will be measured by the geographic and economic breadth of deployability: by whether the defining technology of this era functions as an equalizer or as an accelerant of existing inequality. Both tracks are now permanent. The engineering choices made on the second one will matter as much as anything happening in the hyperscale data centers of Virginia, Ireland, and Singapore.
■ Verified References & Citations
- International Energy Agency (IEA). Key Questions on Energy and AI. April 16, 2026. https://www.iea.org/news/data-centre-electricity-use-surged-in-2025
- Brookings Institution. "Global energy demands within the AI regulatory landscape." Updated April 2, 2026. https://www.brookings.edu/articles/global-energy-demands-within-the-ai-regulatory-landscape/
- CoStar Group. "Hyperscalers' $680 Billion AI Capital Expenditure Investment Raises the Stakes." February 12, 2026. https://www.costar.com/article/907046102
- Futuriom. "Is Hyperscaler AI Spending Sustainable?" April 6, 2026. https://www.futuriom.com/articles/news/ctp-is-hyperscaler-ai-spending-sustainable/2026/04
- Allianz Research. "AI Capex Cycle: War-Proof for Now." March 25, 2026. https://www.allianz.com/…/2026_03_25_AI.pdf
- Data Center Knowledge. "Hyperscalers in 2026: What's Next for the World's Largest Data Center Operators." March 13, 2026. https://www.datacenterknowledge.com/hyperscalers/hyperscalers-in-2026
- ITU. Facts and Figures 2025: Global Number of Internet Users Increases, but Disparities Deepen. November 17, 2025. https://techxplore.com/news/2025-11-global-internet-users-disparities-deepen.html
- GSMA Intelligence. State of Mobile Internet Connectivity 2025. Cited in DataReportal, Digital 2026 Mid-Year Global Update. April 22, 2026. https://datareportal.com/reports/digital-2026-mid-year-global-update-report
- Development Aid. "Bridging the digital divide: Why connectivity alone is not enough." January 28, 2026. https://www.developmentaid.org/news-stream/post/204008/bridging-the-digital-divide
- World Bank. Atlas of Global Development 2026: Inequalities in Use of and Exposure to Artificial Intelligence. 2026. https://data360.worldbank.org/en/atlas/internet-access/
- Marqstats. Small Language Model (SLM) for Edge Deployment Market Size, Share & Forecast 2026–2030. April 7, 2026. https://marqstats.com/reports/small-language-model-edge-deployment-market/
- AI2Work / Deloitte. "On-Device AI Arrives: Edge Inference Chips Hit Consumer Hardware." March 10, 2026. https://ai2.work/blog/on-device-ai-arrives-edge-inference-chips-hit-consumer-hardware
- Edge AI and Vision Alliance. "On-Device LLMs in 2026: What Changed, What Matters, What's Next." January 28, 2026. https://www.edge-ai-vision.com/2026/01/on-device-llms-in-2026-what-changed-what-matters-whats-next/
- Zylos Research. "Small Language Models and Edge AI: The 2026 Shift to Local Intelligence." February 7, 2026. https://zylos.ai/research/2026-02-07-small-language-models-edge-ai
- ZEDEDA. "2026 Predictions: How Edge AI Is Reshaping Industrial Operations." January 20, 2026. https://zededa.com/blog/2026-predictions-how-edge-ai-is-reshaping-industrial-operations/
- Dell Technologies. "The Power of Small: Edge AI Predictions for 2026." January 7, 2026. https://www.dell.com/en-us/blog/the-power-of-small-edge-ai-predictions-for-2026/
- CTech / Calcalist. "Physical AI Is Breaking the Hyperscale Model." May 2026. https://www.calcalistech.com/ctechnews/article/b1jibpocwx
- Pew Research Center. "What We Know About Energy Use at U.S. Data Centers Amid the AI Boom." October 24, 2025. https://www.pewresearch.org/short-reads/2025/10/24/what-we-know-about-energy-use-at-us-data-centers
- Carbon Brief. "AI: Five Charts That Put Data-Centre Energy Use—and Emissions—into Context." September 17, 2025. https://www.carbonbrief.org/ai-five-charts…
- Consumer Reports. "AI Data Centers: Big Tech's Impact on Electric Bills, Water, and More." March 20, 2026. https://www.consumerreports.org/data-centers/…
- arXiv. "Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash." 2025. https://arxiv.org/
- arXiv. "Quantifying Energy and Cost Benefits of Hybrid Edge Cloud: Analysis of Traditional and Agentic Workloads." 2025. https://arxiv.org/
- arXiv. "Cloud to Edge: Benchmarking LLM Inference on Hardware-Accelerated Single-Board Computers." 2025. https://arxiv.org/
- arXiv. "LLMLingua-2 / Prompt Compression in the Wild." 2025. https://arxiv.org/
- StartUs Insights. "12 New Technology Trends in 2026." February 23, 2026. https://www.startus-insights.com/innovators-guide/new-technology-trends/
- Gaurav Kumar Singh. "The Missing Middle: Infrastructure-Aware Intelligence and the Economics of Good-Enough AI." AI Advances, Medium, June 2026. [Source document for this analysis]




