For a text answer to a question, this kind of response time may be challenging: roughly 3 minutes for a one-sentence response, about 30 minutes for a one-paragraph response, about an hour for two paragraphs, and many hours for a response in which the model shows its reasoning process.
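To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch. The token counts and the ~0.1 tokens/second generation rate are illustrative assumptions chosen to roughly match the figures above, not measured values:

    # Back-of-the-envelope timing for very slow, SSD-streamed inference.
    # The token counts and the 0.1 tokens/second rate are assumptions for
    # illustration; actual rates depend on the model and the hardware.
    TOKENS_PER_SECOND = 0.1

    responses = {
        "one sentence": 20,
        "one paragraph": 150,
        "two paragraphs": 300,
        "full reasoning trace": 5000,
    }

    for label, tokens in responses.items():
        minutes = tokens / TOKENS_PER_SECOND / 60
        print(f"{label}: ~{minutes:.0f} minutes")

At ~0.1 tokens/second this works out to roughly 3 minutes, 25 minutes, 50 minutes, and about 14 hours, in line with the estimates above.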
One way to think about the potential of extremely slow but extremely intelligent AI inference is to ask the following question: would an extremely slow Einstein or Stephen Hawking be valuable to a user? If a user could ask this virtual Hawking an important question, go to sleep, and wake up to a genius answer, how much value would that bring?
For an intelligent agent application that needs only five tokens per transaction and can tolerate a 1-minute transaction time, this latency may not be a problem, while the improved quality from these large frontier models may be very valuable.
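The same arithmetic shows why the agent case is comfortable: five tokens in one minute implies a required rate of only about 0.08 tokens/second, which even a very slow, SSD-streamed model can meet:

    # Required generation rate for a 5-token transaction on a 1-minute budget.
    required_rate = 5 / 60  # ≈ 0.083 tokens/second
    print(f"required rate: {required_rate:.3f} tokens/s")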
It’s important to remember that even modest AIs today are already more knowledgeable and much more capable in some ways than even the smartest among us. Over time, AIs of all sizes are increasing
in capability.
So while Edge hardware will be able to run increasingly good models at interactive speed (say, 15 tokens/second and faster), SSD streaming also makes it possible to run the best open-weights models at sub-interactive speed. One possible use case is to employ a fast, interactive model to refine the prompt: the user iterates on the phrasing until the model clearly understands the request, then submits the refined prompt to the extremely slow but extremely intelligent “thinks overnight” model, as sketched below.
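Here is one possible shape for that workflow. The fast_model and slow_model objects stand in for any two local inference endpoints (a small interactive model and an SSD-streamed frontier model); their generate() interface is hypothetical, not a real library API:

    # Sketch of the "refine fast, answer overnight" workflow. The model
    # objects and their generate() interface are hypothetical stand-ins
    # for whatever local inference API is actually in use.

    def refine_prompt(fast_model, draft: str) -> str:
        """Iterate with the fast interactive model until the phrasing is right."""
        prompt = draft
        while True:
            preview = fast_model.generate(prompt, max_tokens=200)
            print(preview)
            revision = input("Revised prompt (empty line to accept): ").strip()
            if not revision:
                return prompt
            prompt = revision

    def answer_overnight(slow_model, prompt: str) -> str:
        # Sub-interactive generation: this single call may take hours.
        return slow_model.generate(prompt, max_tokens=4000)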
Impact
Running GenAI on commodity hardware means that it can be run at the network Edge. Operation at the Edge has some intrinsic advantages, including reliability, privacy/IP protection, and lower network latency. There may be financial drivers as well.
The recent Amazon outage is a good example of what can happen when people or organizations depend on a network-accessible, data center AI. Having an Edge implementation, either as a standalone system or as a backup, can overcome these outage problems, as sketched below.
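A minimal sketch of that backup pattern, assuming the remote and local endpoints are supplied as simple callables (the names here are hypothetical):

    from typing import Callable

    def generate_with_fallback(
        prompt: str,
        remote: Callable[[str], str],  # data center endpoint
        local: Callable[[str], str],   # local Edge model
    ) -> str:
        # Try the data center first; fall back to the Edge model if the
        # network path is down, so an outage does not stop the application.
        try:
            return remote(prompt)
        except (TimeoutError, ConnectionError, OSError):
            return local(prompt)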
Working with vendor-provided data center AI has some inherent privacy and IP (Intellectual Property) exposures. For some applications, these exposures can be quite important. Many people think of nation-state organizations as the most protective of secrets, but experience shows that industries such as automated manufacturing, pharmaceuticals, and automotive can be even more guarded. For them, the possibility that their data will be used in training LLMs, or surface in the context windows of other users, may be too great a concern. The data may not go to others in exactly the form the data center received it, but if it is used in training, it becomes part of the reasoning data the GenAI system uses, resulting in what is termed “IP leakage”.
Some organizations may address this concern by building their own private data center. However, this still carries a data-exposure risk on the network that reaches the data center. In addition, Edge systems may be more cost-effective and have more predictable expense profiles. Finally, there may be latency issues: especially for intelligent agents, time can be critical, and just the round-trip network communication to and from the data center may be problematic. For these reasons, and possibly simple convenience, users may prefer running GenAI locally on Edge systems.
Likely Future Developments
One way to think about the future of AI is to look at the pattern seen in the introduction of the microprocessor. When the first microprocessors appeared, they were used simply to build less expensive minicomputers. At the time, few could envisage greeting cards containing a small computer that plays an audio greeting. So it is hard to say exactly what forms GenAI will take, but it is clear that the move to the Edge is well underway. This doesn’t mean that data center GenAI will disappear, but it does mean that there will be a balance of Edge and data center usage.
Conclusion
Today, GenAI is dominated by large data center deployments, but AI inference at the Edge has begun. It is driven by requirements for reliability, privacy, IP protection, latency, convenience, and cost. As time goes on, there are likely to be further significant advancements in Edge systems.