
LLMs on Commodity Hardware: What, Where, Why AI PCs Are Already Here


measured as tokens generated per second (tokens/second), power dissipation (the sum of CPU and GPU power when applicable), and efficiency measured as speed per Watt (tokens/second/Watt). Tables 2 and 3 cover inference with the combination of CPU and GPU (the GPU is used to speed up matrix multiplication; a CPU is always needed to coordinate overall execution of the model). Table 4 shows results for CPU-only inference; with modern multicore CPUs, LLM inference can still achieve practical, useful speed. As expected, smaller models run inference more quickly. Throughout Tables 2, 3, and 4, inference speed is reported in tokens per second (tokens/sec). A token is, in concept, a word; in practice it averages about three-quarters of a word.
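
To make the relationship among these three metrics concrete, here is a small Python sketch; the token count, elapsed time, and power readings are placeholder values, not measurements from the tables.

# Hypothetical measurements; real values come from the benchmark runs in Tables 2-4.
tokens_generated = 512      # tokens produced during the timed run
elapsed_seconds = 20.0      # wall-clock generation time
cpu_watts = 100.0           # average CPU package power during the run
gpu_watts = 300.0           # average GPU power (0 for CPU-only inference)

speed = tokens_generated / elapsed_seconds     # tokens/second
power = cpu_watts + gpu_watts                  # total dissipation in Watts
efficiency = speed / power                     # tokens/second/Watt

# A token averages roughly three-quarters of a word,
# so words per second is approximately 0.75 * tokens/second.
words_per_second = 0.75 * speed

print(f"{speed:.1f} tok/s, {power:.0f} W, {efficiency:.3f} tok/s/W, ~{words_per_second:.1f} words/s")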

The results show, as expected, that the Nvidia GPU platform yields the best performance. If the A100 were combined with a lower-power desktop chip, the efficiency (tokens/second/Watt) would be better, but it would still significantly lag the Apple M-series platforms. Specifically, we expect a desktop CPU would dissipate on the order of 100W instead of the data-center configuration’s 300W. The Nvidia A100 (or its rough consumer equivalent, the GeForce RTX 4090) would dissipate the same power regardless.
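
A back-of-the-envelope calculation makes the point; the throughput and A100 board-power figures below are assumptions chosen for illustration, not measurements from our tables.

# Illustrative efficiency comparison; all numbers here are assumptions.
a100_watts = 300.0          # assumed GPU board power; fixed regardless of host CPU
tokens_per_second = 60.0    # assumed throughput; GPU-bound, so roughly constant

for host, cpu_watts in [("data-center CPU", 300.0), ("desktop CPU", 100.0)]:
    efficiency = tokens_per_second / (cpu_watts + a100_watts)
    print(f"{host}: {efficiency:.3f} tokens/second/Watt")

Swapping the host CPU shrinks the denominator, but the fixed GPU power keeps the efficiency well below what a low-power integrated design can achieve.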

The general trend of results shows that smaller but still capable models, e.g., LLaMA-2-7B-chat with seven billion parameters, run with very good performance on most consumer hardware even without the help of a GPU. Despite being smaller than the LLaMA-2-70B-chat model, Mixtral-8x7B-instruct exceeds it on objective evaluations of accuracy (response quality, not shown). Mixtral also has faster inference than LLaMA-2-70B because it uses the modern mixture-of-experts (MoE) architecture, which engages only about one-fourth of the model’s parameters at any one time. The Xeon is a good stand-in for modern consumer x86 CPUs since the core microarchitectures are similar.
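
The sketch below illustrates the routing idea behind MoE inference: a gate scores all experts for the current token, and only the few top-scoring experts are evaluated. The NumPy implementation, dimensions, and top-2 choice here are illustrative assumptions, not Mixtral's actual code.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(x, gate_w, experts, top_k=2):
    # Route one token vector x through only the top_k highest-scoring experts.
    scores = gate_w @ x                     # one routing score per expert
    top = np.argsort(scores)[-top_k:]       # indices of the selected experts
    weights = softmax(scores[top])          # mixing weights over the selected experts
    # Only top_k expert weight matrices are touched; the rest sit idle for this token.
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
x = rng.standard_normal(d)
gate_w = rng.standard_normal((n_experts, d))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
print(moe_layer(x, gate_w, experts).shape)  # (16,)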


Table 4. CPU-only inference speed and performance-per-Watt for all model sizes that can run on each platform (the limiting factor is memory capacity on the low-end laptops). Note the high-end laptop outperforms the server on all but the largest model.

The Phi-2-2.7B model is an example of a smaller model that nonetheless produces high-quality results. Microsoft created this model with the explicit goal of training on higher-quality data to see if that could endow a smaller model with higher-quality output, and the results show that it does. Phi-2-2.7B produces accuracy on par or


