Edge AI: Changing GenAI Balance
Between Edge and Data Center

By: Brian Case, Mark Cummings, Ph.D.

Recently, a significant technical advancement has allowed computers at the network edge to run very large LLMs (Large Language Models, which are the engines that drive generative AI systems). The result has important implications for privacy, IP protection, and availability. 

Background

This capability comes from a new AI inference application called, simply, Inferencer. It allows even modest Apple Mac computers with as little as the base amount of DRAM (working memory, sometimes called scratchpad memory) to run the largest open-source LLMs. This ability is not without compromise, of course: the rate of token generation can be very slow. Where LLMs in data centers typically generate many tens or even hundreds of tokens per second, Inferencer generates only a few tokens per minute when running the largest models available, such as the full DeepSeek R1 at 671B parameters, GLM-4.6 at 357B parameters, or Kimi K2 at 1 trillion parameters.
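
To put those rates in perspective, here is a rough back-of-the-envelope comparison in Python. The specific rates (100 tokens per second in a data center, 3 tokens per minute at the Edge) and the 500-token response length are illustrative assumptions, not measured Inferencer benchmarks.

```python
# Rough arithmetic on what "a few tokens per minute" means in practice.
# The rates and response length below are illustrative assumptions,
# not measured benchmarks.
RESPONSE_TOKENS = 500  # a medium-length generated answer

scenarios = {
    "data center (100 tokens/sec)": 100.0,
    "largest model on a base Mac (3 tokens/min)": 3.0 / 60.0,
}

for label, tokens_per_sec in scenarios.items():
    minutes = RESPONSE_TOKENS / tokens_per_sec / 60.0
    print(f"{label}: ~{minutes:.1f} minutes for a {RESPONSE_TOKENS}-token response")
```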

Last year, we wrote about the ability to run interactive inference of modest but useful LLMs on laptops. Interactive means entering a prompt and getting a generated response within seconds to minutes. The modest models we covered ranged from a few billion to about thirty billion parameters. Back then, the majority of open-source models were “dense” models, meaning simply that all of the billions of parameters in the model are evaluated for every token that is generated. Since then, almost all top-ranked models, regardless of their total size or whether they are open or proprietary, have become “sparse” models, also called mixture-of-experts (MoE) models. In an MoE model, only a small subset of the total parameters is activated (evaluated) for any given token. For example, even though Kimi K2 has 1 trillion parameters, only 32B are evaluated for any generated token. This MoE architecture dramatically increases the speed of token generation during inference, even though it does not reduce memory requirements.
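
To make the MoE idea concrete, here is a minimal Python (NumPy) sketch of top-k expert routing. The sizes (a 64-dimensional hidden state, 16 experts, top-2 routing) are hypothetical and far smaller than any real model; the point is that the router scores every expert, but only the top few are actually evaluated for each token.

```python
# Minimal sketch of mixture-of-experts (MoE) routing in NumPy.
# Sizes are hypothetical; real models such as Kimi K2 or DeepSeek R1
# use far larger experts and more sophisticated routing.
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 64     # hidden size of the token representation
N_EXPERTS = 16   # total experts; most of the layer's parameters live here
TOP_K = 2        # experts actually evaluated per token

# Each expert is a small two-layer feed-forward block.
experts = [
    (rng.standard_normal((D_MODEL, 4 * D_MODEL)) * 0.02,
     rng.standard_normal((4 * D_MODEL, D_MODEL)) * 0.02)
    for _ in range(N_EXPERTS)
]
router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02  # gating weights


def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector through only TOP_K of the N_EXPERTS experts."""
    logits = x @ router                    # score every expert (cheap)
    top = np.argsort(logits)[-TOP_K:]      # indices of the best-scoring experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                   # normalized gate weights
    out = np.zeros_like(x)
    for gate, idx in zip(gates, top):
        w_in, w_out = experts[idx]
        out += gate * (np.maximum(x @ w_in, 0.0) @ w_out)  # only chosen experts run
    return out


token = rng.standard_normal(D_MODEL)
print(moe_forward(token).shape)  # (64,) -- only 2 of 16 experts were evaluated
```

Because only TOP_K experts run per token, compute scales with the active parameters, while memory must still hold every expert, which is why MoE speeds up generation without reducing memory requirements.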

In last year’s article about LLM inference on laptops, low-bit quantization of model weights (using a small number of bits to represent each value) was already popular, for two reasons: it reduces the model’s memory requirements and it speeds up inference. Quantization therefore increases the range of model sizes that can usefully run on a laptop. For example, Meta’s LLaMa 70B cannot run at all on an Apple Mac with 64GB of memory, because at 16 bits per weight its parameters alone require roughly 140GB; with 4-bit quantization, the requirement drops to about 40GB, well within the capabilities of a 64GB laptop. Quantization in this case has the additional benefit of raising token generation speed to a comfortable level. Historically, however, lower quantization levels came with some reduction in output quality.
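
The memory arithmetic behind that example is simple enough to write down. The sketch below counts only the storage for the weights themselves; KV cache, activations, and runtime overhead add more, which is why the practical figure for a 4-bit 70B model is closer to 40GB than 35GB.

```python
# Back-of-the-envelope estimate of weight storage at different
# quantization levels. Only the weights are counted; KV cache,
# activations, and runtime overhead push real requirements higher.
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

for bits in (16, 8, 4):
    gb = weight_memory_gb(70e9, bits)  # a Llama-class 70B model
    verdict = "fits" if gb < 64 else "does not fit"
    print(f"70B weights at {bits}-bit: ~{gb:.0f} GB ({verdict} in 64 GB of RAM)")
```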

In addition to the industry moving to MoE models with quantization, researchers in all the AI labs have made significant improvements in pre-training (which now refers to what used to be called simply “training”) and post-training. The result is that models of all sizes have more intelligence and capability. Moreover, training is now often done with quantization in mind; that is, model builders conduct all stages of training knowing that the final model will run in a low-bit, quantized form, whether in a data center or at the Edge. At the Edge, users care about a small computer being able to run inference at all, and at acceptable speed; in the data center, everyone cares about speed but also about size, because small, quick models consume the minimum of expensive and often scarce data-center resources. In summary, the goals of data-center inference and Edge inference are closely aligned.
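
One common way to train with quantization in mind is quantization-aware training, in which the forward pass sees weights already rounded to the low-bit grid the deployed model will use. The Python sketch below shows only that “fake quantization” step; the bit width and values are illustrative, and a real implementation would rely on a training framework’s gradient handling.

```python
# Minimal sketch of the "fake quantization" used in quantization-aware
# training (QAT): weights are rounded to a low-bit grid during the
# forward pass so the model learns to tolerate the rounding error it
# will see after deployment. Bit width and values are illustrative.
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Round weights to a symmetric low-bit grid, then map back to floats."""
    levels = 2 ** (bits - 1) - 1        # e.g. 7 representable magnitudes for 4-bit
    scale = np.abs(w).max() / levels    # per-tensor scale factor
    return np.round(w / scale) * scale  # quantize, then immediately dequantize

rng = np.random.default_rng(0)
w = rng.standard_normal(8) * 0.1
print("original  :", np.round(w, 4))
print("fake-quant:", np.round(fake_quantize(w, bits=4), 4))
# In real QAT, gradients are passed straight through the rounding step
# (the straight-through estimator), which training frameworks handle.
```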

When the AI race began with the introduction of ChatGPT in late 2022, model weights (parameters) were typically represented with 32 or 16 bits. Now, it is widely recognized that a properly trained and quantized model can achieve very nearly the same quality of output with only 4 bits per weight. This fact is so accepted that Nvidia and other hardware vendors have added hardware support for multiple 8-bit and 4-bit representations. In doing so, they have cast quantization into the ‘stone’ of their hardware.
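
To see what a 4-bit representation means in storage terms, the sketch below packs two signed 4-bit weight values into each byte and unpacks them again. This only illustrates the storage arithmetic; the actual 8-bit and 4-bit formats supported in hardware (integer and floating-point variants) differ in their details.

```python
# Illustrative packing of signed 4-bit integer weights, two per byte:
# a quarter of the storage of 16-bit weights. Real hardware formats
# (various INT4/FP4 and INT8/FP8 variants) differ in the details.
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack signed 4-bit integers (-8..7) two per byte."""
    nib = (q.astype(np.int16) & 0x0F).astype(np.uint8)  # two's-complement nibbles
    return nib[0::2] | (nib[1::2] << 4)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover signed 4-bit integers from packed bytes."""
    nib = np.empty(packed.size * 2, dtype=np.int16)
    nib[0::2] = packed & 0x0F
    nib[1::2] = (packed >> 4) & 0x0F
    return np.where(nib > 7, nib - 16, nib).astype(np.int8)  # sign-extend nibbles

q = np.array([-8, -3, 0, 1, 5, 7, -1, 2], dtype=np.int8)
packed = pack_int4(q)
print(packed.nbytes, "bytes hold", q.size, "4-bit weights")  # 4 bytes hold 8 weights
print(np.array_equal(unpack_int4(packed), q))                # True: lossless round trip
```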

Nonetheless, the largest models remained out of reach for Edge users: these trends alone were not powerful enough to overcome their sheer size.

