Kimi K2 quantized to 4 bits per weight, for example, still requires 500GB just to hold the model weights, even though only about 16GB of weights is needed to generate any specific token. The MoE architecture reduces the size of the set of active weights, but not the size of the model that has to be stored.
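To see where those figures come from, here is a rough sketch of the arithmetic. The parameter counts (roughly 1 trillion total and ~32 billion active per token for Kimi K2) are assumptions drawn from the model’s published description rather than from this document:

```python
# Back-of-the-envelope sizes for Kimi K2 at 4-bit quantization.
# Assumed figures: ~1 trillion total parameters, ~32 billion active per token.
total_params = 1_000_000_000_000   # ~1T weights in the full MoE model
active_params = 32_000_000_000     # ~32B weights actually used per token
bytes_per_weight = 4 / 8           # 4-bit quantization = half a byte per weight

total_storage_gb = total_params * bytes_per_weight / 1e9
active_set_gb = active_params * bytes_per_weight / 1e9

print(f"Full model on disk:   ~{total_storage_gb:.0f} GB")  # ~500 GB
print(f"Active set per token: ~{active_set_gb:.0f} GB")     # ~16 GB
```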
One solution to the sheer size requirement is simply to increase the DRAM in a laptop. That will happen over time, but today a half-terabyte or even a full terabyte of DRAM carries a very challenging cost for Edge systems.
SSD Streaming
The Inferencer solution to this problem is to leave all the model weights on the laptop’s SSD (Solid State Drive, often loosely called the “hard disk”) and stream them into DRAM as needed. This can work, but token generation is then limited by the read speed of the SSD. To understand the tradeoffs of SSD streaming, it helps to know a few details about the structure and operation of LLMs.
An LLM is built from a stack of layers; each layer implements some very large matrix operations (chiefly multiplies). Large LLMs, such as Deepseek R1 or Kimi K2, might consist of as many as a hundred or two hundred layers. Each layer computes results that are passed on to the next layer in the stack. These intermediate results are not huge compared to the weights: they are on the order of the context size (the number of tokens in the context window, i.e., the background data the LLM uses to produce the desired output) times the embedding dimension (how tokens are encoded). That could be as many as ten thousand numbers (the embedding dimension) for each of a hundred thousand positions (tokens) in the context; that is, on the order of 10,000 x 100,000, or 1,000,000,000 (one billion) values.
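As a quick check of that estimate (a sketch using the example figures above; actual embedding dimensions and context lengths vary by model, and the 2-byte activation size is an assumption):

```python
# Size of the intermediate results (activations) passed between layers.
embedding_dim = 10_000      # numbers per token position (embedding dimension)
context_tokens = 100_000    # token positions in the context window

activation_values = embedding_dim * context_tokens
print(f"{activation_values:,} values")   # 1,000,000,000 (one billion)

# At an assumed 2 bytes per value, that is ~2GB -- small next to the
# hundreds of gigabytes of weights discussed below.
activation_gb = activation_values * 2 / 1e9
print(f"~{activation_gb:.0f} GB of activations")  # ~2 GB
```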
With SSD streaming, the Inferencer app streams (loads) model weights from the SSD into working memory, where they are used once to compute the results of the current layer in the model. This is
repeated for each layer until the final layer, which computes the probabilities for the next token to be generated.
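Conceptually, the per-token streaming loop looks something like the sketch below. This is only an illustration of the idea, not Inferencer’s actual implementation; load_layer_weights and apply_layer are hypothetical helpers standing in for the disk read and the layer’s matrix math:

```python
# Conceptual sketch of SSD streaming: each layer's weights are read from
# disk, used once, and discarded before the next layer is processed.
def generate_next_token(activations, num_layers, weights_path):
    for layer_idx in range(num_layers):
        # Hypothetical helper: read only this layer's weights from the SSD.
        layer_weights = load_layer_weights(weights_path, layer_idx)
        # Hypothetical helper: run this layer's matrix multiplies.
        activations = apply_layer(layer_weights, activations)
        # layer_weights is dropped here, so DRAM only ever holds one
        # layer's weights (plus the activations) at a time.
    # The final layer's output yields the probabilities for the next token.
    return activations
```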
This approach has two desirable properties: (1) nearly any size of model can be run on the laptop (or other Edge hardware), and (2) the model can use any quantization level up to the full-precision (16-bit) representation of the weights. Thus, with Inferencer’s SSD streaming, it is possible for a laptop to run the full-size, unquantized Deepseek R1 model. But at full 16-bit precision, just storing the model requires about 1.3TB (1,342,000,000,000 bytes), so a laptop or other hardware does still need a hefty amount of free SSD storage. These days, however, 8-bit quantization is considered to deliver full quality from LLMs, so in practice at most 671GB (671,000,000,000 bytes) of free storage would be required. With 4-bit quantization, the storage requirement is reduced further to a “mere” 336GB. As discussed above, 4-bit quantization can deliver excellent inference quality when training and quantization have been done with low-bit quantization in mind.
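The storage figures follow directly from Deepseek R1’s 671 billion parameters; a short sketch of the arithmetic:

```python
# Free SSD storage needed to hold Deepseek R1's 671B weights at each precision.
params = 671_000_000_000   # total weights in Deepseek R1

for bits in (16, 8, 4):
    total_bytes = params * bits // 8
    print(f"{bits:>2}-bit: {total_bytes:>17,} bytes  (~{total_bytes / 1e9:.0f} GB)")

# 16-bit: 1,342,000,000,000 bytes  (~1342 GB, i.e. ~1.3TB)
#  8-bit:   671,000,000,000 bytes  (~671 GB)
#  4-bit:   335,500,000,000 bytes  (~336 GB)
```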
Since Deepseek R1 is an MoE model with 37B active parameters, “only” 37 billion out of the total of 671 billion weights need to be streamed in from storage to generate any given next token.
Further, at 4-bit quantization, only ~18.5 billion bytes need to be streamed in from the SSD.
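A quick check of those numbers (37 billion of Deepseek R1’s 671 billion weights active, half a byte per weight at 4-bit):

```python
# Per-token streaming volume for Deepseek R1 at 4-bit quantization.
total_weights = 671_000_000_000    # all weights, stored on the SSD
active_weights = 37_000_000_000    # weights actually used per token (MoE)
bytes_per_weight = 0.5             # 4 bits = half a byte

active_fraction = active_weights / total_weights     # ~5.5% of the model
streamed_bytes = active_weights * bytes_per_weight   # ~18.5 billion bytes

print(f"~{active_fraction:.1%} of the weights are active; "
      f"~{streamed_bytes / 1e9:.1f} billion bytes streamed per token")
```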
SSD Streaming Latency
The performance of running an LLM with Inferencer will be limited by one of two things: (1) the speed at which weights can be streamed from the SSD into the laptop’s working memory, or (2) the speed of the matrix-multiply compute in the processor. For a modern laptop, however, compute will not be the bottleneck, because even an entry-level laptop CPU can execute hundreds of billions of multiply-accumulate operations per second.
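To see why, note that each active weight contributes roughly one multiply-accumulate per generated token. The sketch below uses 200 billion MACs per second as an assumed stand-in for the “hundreds of billions” figure; the exact rate depends on the CPU:

```python
# Rough compute time per token on a laptop CPU (illustrative, assumed rate).
active_weights = 37_000_000_000   # Deepseek R1 active parameters per token
macs_per_second = 200e9           # assumed throughput: 200 billion MACs/s

compute_seconds_per_token = active_weights / macs_per_second
print(f"~{compute_seconds_per_token:.3f} s of compute per token")  # ~0.185 s
```

That fraction of a second of compute is well under the roughly six seconds of SSD streaming estimated next, so the SSD remains the limiting factor.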
For a modern laptop, the internal SSD can stream at least 3GB/s (some will be significantly faster, some slower). Given such an SSD, the speed of inference on Deepseek R1 with its 37B active parameters can be approximated as follows: at 4-bit quantization, the set of active weights is ~19GB (37 billion weights times half a byte each, since 4 bits is half a byte). Dividing 19GB by 3GB/s gives about 6 seconds per trip through the set of active weights; that is, ~6 seconds per token. That works out to about 10 tokens per minute, or roughly 0.16 tokens/second of inference speed.
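The same estimate as a short calculation, using the 3GB/s SSD figure and Deepseek R1’s 37B active parameters at 4-bit quantization:

```python
# Token rate when SSD read bandwidth is the bottleneck.
active_weights = 37_000_000_000   # Deepseek R1 active parameters per token
bytes_per_weight = 0.5            # 4-bit quantization = half a byte per weight
ssd_bytes_per_second = 3e9        # ~3 GB/s internal SSD read speed

active_set_bytes = active_weights * bytes_per_weight          # ~18.5 GB
seconds_per_token = active_set_bytes / ssd_bytes_per_second   # ~6.2 s
tokens_per_second = 1 / seconds_per_token                     # ~0.16
tokens_per_minute = 60 * tokens_per_second                    # ~9.7, i.e. ~10

print(f"~{seconds_per_token:.1f} s/token, ~{tokens_per_second:.2f} tokens/s, "
      f"~{tokens_per_minute:.0f} tokens/min")
```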