GenAI started out running in publicly accessible data centers full of GPUs, mostly Nvidia chips. Now we are on the cusp of LLMs (Large Language Models, the software engines of GenAI) moving to edge devices, notebooks, and smartphones. This technical transition, together with concerns about IP leakage and cost pressures, will make for a disruptive change and argues for caution in making large financial commitments.
In November 2024, in an article titled “Wall Street’s elites are piling into a massive AI gamble,” Bloomberg reported on a dinner meeting organized by senior people at Morgan Stanley with leaders of the largest private equity funds. Bloomberg estimates that it will require trillions of dollars to build upcoming GenAI data centers, associated nuclear power plants, and communications networks. The Morgan Stanley organizer reportedly told the group that the financing required was beyond the capacity of the banks and suggested partnering with the large private equity funds.
This economic analysis appears to depend on two key assumptions: 1) inference will continue to run in large data centers accessed over telecommunications networks; and 2) a large number of teams will keep training new foundation models in parallel. Both assumptions are open to question.
In the form of GenAI prevalent today there are two very different types of processing: inference and training. Inference is the process of using a fully trained LLM. It is called inference because, when the trained LLM is asked a question, it “infers” the answer from the question, the context window of “attention” data supplied with it, and what it learned during training. For most publicly available LLMs a single inference returns in roughly a second, and a deployed service handles thousands of inferences per minute. Although more processing intensive than a database lookup, inference is many orders of magnitude less processing intensive than training. Once a model is trained, some GenAI systems can run on 2023/24-generation notebook computers.
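To make the distinction concrete, the following is a minimal sketch of running inference locally on a notebook with the open-source llama-cpp-python library; the model file path, context size, and thread count are illustrative assumptions, not a benchmark of any particular system.

```python
# Minimal local-inference sketch using llama-cpp-python.
# The GGUF model path below is a placeholder; any quantized model small enough
# to fit in notebook RAM would serve for this illustration.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,      # size of the context ("attention") window
    n_threads=8,     # CPU threads available on a recent notebook
)

# A single inference: the model sees only the prompt and its context window,
# and answers from what it learned during training.
result = llm(
    "In one sentence, what is the difference between training and inference?",
    max_tokens=128,
)
print(result["choices"][0]["text"])
```

Each such call is one inference; serving thousands of them per minute is what GenAI data centers do today, and what edge devices are beginning to take over.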
By mid-2025, Nvidia is expected to have a chip designed for PCs. By Q4 2025, Apple is expected to have an M5 chip fully optimized to run high-performing LLMs on its notebooks. Early work on compressing LLMs to run on smartphones shows that output quality suffers with compression. By late 2026 it is reasonable to expect smartphone chips capable of running LLMs with less sacrifice of quality. The appearance of these chips calls into question the first assumption, that inference will continue to run primarily in large data centers.
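A rough sketch of why compression matters for phones: weight memory scales with parameter count times bits per parameter, so quantizing from 16-bit to 4-bit weights shrinks a model by roughly a factor of four, at some cost in output quality. The 7-billion-parameter figure below is an illustrative assumption.

```python
# Back-of-envelope weight-memory footprint for on-device LLMs.
# Activations and the KV cache add more memory on top of the weights;
# the 7B-parameter model size is an illustrative assumption.

def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight memory in gigabytes."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit weights: ~{weight_memory_gb(7, bits):.1f} GB")

# Roughly 14 GB at 16 bits, 7 GB at 8 bits, and 3.5 GB at 4 bits: only the
# most aggressive quantization fits comfortably in a high-end phone's RAM,
# which is why quality loss from compression is the central trade-off.
```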
Meanwhile, the number of groups creating new foundation models appears to be declining. A recent analysis of IaaS (Infrastructure as a Service) pricing found that the rental price of one Nvidia GPU for an hour has been falling: from a peak of $8.00/hr to under $2.00/hr, roughly $0.50 below the provider's cost of supplying it.
The analysis attributes a small portion of this price decline to the easing of a temporary Nvidia chip shortage, but the largest portion to a reduction in the number of teams training new foundation LLMs. As parameter counts have grown to hundreds of billions, training costs have become prohibitively high, and many teams funded to create new foundation models have switched to either tuning existing LLMs or building medium-sized LLMs. The analysis indicates that fewer than 50 teams worldwide are still creating foundation LLMs, and that number may be continuing to decline.
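To see why training at this scale is prohibitive, a common back-of-envelope estimate puts training compute at roughly 6 × parameters × training tokens FLOPs. The sketch below applies that approximation with assumed hardware throughput, utilization, and token counts; only the roughly $2/hr rental price comes from the analysis above, and the result is an order-of-magnitude illustration rather than a figure from that analysis.

```python
# Order-of-magnitude cost of training a foundation model.
# Training FLOPs ~= 6 * N_parameters * N_training_tokens (common approximation).
# All hardware and corpus figures below are illustrative assumptions.

params = 400e9              # hundreds of billions of parameters
tokens = 10e12              # assumed ~10 trillion training tokens
total_flops = 6 * params * tokens

effective_flops_per_gpu = 1e15   # assumed ~1 PFLOP/s per GPU, mixed precision
utilization = 0.4                # assumed fraction of peak sustained in practice
price_per_gpu_hour = 2.00        # roughly the rental price cited above, $/GPU-hr

gpu_hours = total_flops / (effective_flops_per_gpu * utilization * 3600)
compute_cost = gpu_hours * price_per_gpu_hour

print(f"Total training FLOPs: {total_flops:.1e}")
print(f"GPU-hours required:   {gpu_hours:,.0f}")      # ~17 million GPU-hours
print(f"Compute cost alone:   ${compute_cost:,.0f}")  # tens of millions of dollars
```

Even at depressed rental prices, the compute alone runs into the tens of millions of dollars per training run, before data, staff, and repeated experiments, which helps explain why so many teams have stepped back from training foundation models from scratch.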
Without high and growing demand for data center resources to train new models, demand for GenAI data center capacity may shrink. That calls the second assumption into question.