In the earlier articles in this series, Deep Understanding of DeepSeek and Enterprise Practices (Part 1): Distillation, Deployment, and Evaluation and Deep Understanding of DeepSeek and Enterprise Practices (Part 2): Principles, Hardware Cooling, and Performance Testing of 32B Multi-GPU Inference, we introduced the relationships between the different DeepSeek R1 models and their core metrics, and completed the deployment and evaluation of several distilled models on ZStack AIOS. In our tests, the distilled versions often outperform the models they were distilled from in areas such as mathematics and coding. For more complex tasks, however (e.g., writing hundreds of lines of code), their performance can fall short. At that point, it is worth considering the DeepSeek-R1 671B model, commonly referred to online as the "full-power version."
However, with 671B parameters the model is massive: if the hardware doesn't support FP8, the model weights alone require about 1.3 TB in BF16, making costs prohibitively high. This article therefore focuses on deploying the near-trillion-parameter DeepSeek-R1 671B model at the lowest possible cost, assessing the real-world performance of quantized versions, measuring the loss relative to the unquantized model, and analyzing the cost-effectiveness and suitable scenarios of different hardware configurations.
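As a quick sanity check on these figures, the back-of-the-envelope arithmetic below estimates weight memory at several precisions. It counts parameters only and ignores KV cache, activations, and per-format overhead, so treat the numbers as rough lower bounds rather than exact requirements.

```python
# Back-of-the-envelope weight-memory estimate for a 671B-parameter model.
# Counts parameters only; KV cache, activations, and per-format metadata
# are ignored, so real requirements are higher.
PARAMS = 671e9

def weight_gb(bits_per_weight: float) -> float:
    """Approximate weight size in GB at the given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("4-bit (AWQ-class)", 4), ("1.58-bit", 1.58)]:
    print(f"{name:>18}: ~{weight_gb(bits):,.0f} GB")
# BF16     -> ~1,342 GB  (the ~1.3 TB figure mentioned above)
# FP8      -> ~671 GB
# 4-bit    -> ~336 GB
# 1.58-bit -> ~133 GB
```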
Currently, there are many quantization schemes for DeepSeek R1 671B. We won't delve into the specific meanings of methods such as IQ1_S or AWQ here; instead, we directly compare several typical schemes and their VRAM requirements below:
(Note: the VRAM figures include the minimum KV cache and system overhead and therefore represent lower bounds; actual requirements depend on context window size, KV cache precision, and other factors. The GGUF and safetensors formats also differ in VRAM usage because of their inference engines and parallelism methods, so the two cannot be compared directly.)
From the table above, it's clear that a single 8-GPU 3090 server just meets the minimum requirements for 671B-1.58b! Additionally, all weights can be loaded onto the GPUs, ensuring decent inference speed. However, note that since the format is GGUF, the llama.cpp inference framework is required. ZStack AIOS supports multiple inference frameworks, allowing users to choose based on their needs.
1. Environment Preparation: Install ZStack AIOS and ensure the system meets operational requirements.
2. One-Click Deployment:
a. Use ZStack AIOS to select the model and link it to an appropriate inference template (llama.cpp) and image.
b. Specify the GPU and compute specifications for running the model, then deploy.
3. Test Run: Try a conversation in the interactive window, or integrate the service into other applications via its API (see the client sketch below).
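For reference, a minimal client-side sketch is shown below. It assumes the deployed service exposes an OpenAI-compatible chat-completions endpoint (as llama.cpp's server does); the base URL, API key, and model name are placeholders for this example, not actual ZStack AIOS values.

```python
# Minimal chat call against an OpenAI-compatible endpoint.
# The URL, key, and model name below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://<aios-endpoint>/v1",  # hypothetical service address
    api_key="sk-placeholder",              # hypothetical key
)

response = client.chat.completions.create(
    model="DeepSeek-R1-671B-1.58b",        # name as registered in the platform
    messages=[{"role": "user", "content": "Briefly explain MoE models."}],
    temperature=0.6,
    max_tokens=512,
)
print(response.choices[0].message.content)
```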
We successfully ran 671B-1.58b! Unfortunately, we were limited to a 4K context due to llama.cpp's layer-split GPU loading approach, in which every layer must reserve space for the full context to operate. Moreover, because of llama.cpp's inference mechanics, increasing concurrency does not raise total throughput and even reduces the per-session context size, so we did not pursue higher concurrency.
However, a 4K context is far too small for DeepSeek-R1: responses are easily truncated, making it impossible to complete standard evaluations such as MMLU or C-Eval. To increase the context size, we tested a multi-machine deployment of DeepSeek-R1-671B-1.58b.
Since the context on a single 3090 server was too limited, we expanded the 671B-1.58b model's context using a cluster. By distributing the weights across 16 cards, we theoretically freed up nearly 14 GB for the KV cache, equivalent to roughly a 14K context. (Note: multi-machine parallel inference with llama.cpp is not recommended for production; this was for testing only.)
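The rough relationship between free VRAM and usable context can be sketched as below. The bytes-per-token figure is an assumption chosen to match the ~14 GB ≈ 14K-token ratio seen here; the real value depends on the model architecture, KV cache precision, and inference engine.

```python
# Rough context-length estimate from free VRAM.
# BYTES_PER_TOKEN is an assumed figure matching the ~14 GB ≈ 14K tokens
# ratio observed in this test; it varies with model, KV precision, and engine.
BYTES_PER_TOKEN = 1 * 1024**2  # ~1 MiB of KV cache per token (assumption)

def max_context_tokens(free_vram_gb: float) -> int:
    """Approximate number of context tokens that fit in the given free VRAM."""
    return int(free_vram_gb * 1024**3 / BYTES_PER_TOKEN)

print(max_context_tokens(14))  # ~14K tokens across the 16-GPU cluster
```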
Long-context scenarios show a slight increase in model throughput.
Due to llama.cpp’s architecture and the 3090’s bandwidth limits, higher concurrency doesn’t effectively improve GPU utilization.
Using ZStack AIOS’s service evaluation feature, we tested DeepSeek-R1-671B-1.58b on MMLU, C-Eval, HumanEval, etc., comparing its quantized performance across dimensions and against distilled versions.
Due to excessively long runtimes, some evaluations were sampled. Results are summarized below:
*Marked data was measured in the ZStack experimental environment and is not from the official papers.
**With only a 14K context, the AIME24 test could not be completed normally.
From the data, the 1.58-bit quantization does affect performance, but less severely than expected; the model still holds a clear advantage over GPT-4o and Claude-3.5 in English comprehension, Chinese comprehension, and coding ability.
We also tried the online tips for "identifying a full-power model"; they showed some ability to distinguish versions:
Limited by llama.cpp's design and the 1.58-bit quantization's data layout, further performance gains are difficult. For enterprises, one or two 8-GPU 3090 servers are cost-effective, but the context window and concurrency are heavily constrained.
Thus, we explored another quantization method: AWQ. Theoretically, AWQ needs only eight GPUs with 64 GB+ of VRAM each. Given the resources available to us, we tested AWQ performance on H20 96 GB GPUs.
Environment Setup
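The exact environment is handled by the ZStack AIOS inference template, but for readers reproducing this outside the platform, the sketch below shows one common way to serve an AWQ checkpoint with vLLM across eight GPUs. The model path and parameter values are illustrative assumptions, not the configuration used in our tests.

```python
# Illustrative vLLM setup for an AWQ-quantized checkpoint on 8 GPUs.
# The model path and parameter values are assumptions, not the exact
# configuration used in this article's tests.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/DeepSeek-R1-AWQ",  # hypothetical local checkpoint path
    quantization="awq",               # use AWQ kernels
    tensor_parallel_size=8,           # split across 8 GPUs
    max_model_len=32768,              # context length to reserve KV cache for
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Write a haiku about quantization."], params)
print(outputs[0].outputs[0].text)
```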
Performance Results
The figure below shows the performance of DeepSeek-R1-AWQ dialogues on ZStack AIOS:
Performance Analysis
1. AWQ quantization doesn't support MLA or FP8 acceleration, which limits performance in high-concurrency scenarios.
2. As concurrency rises, aggregate throughput approaches 400 tokens/s, but per-session throughput drops sharply, so this setup mainly suits offline use.
3. Because our testing method was strict, KV cache hit rates were low; with similar prompts to boost the hit rate, throughput could reach up to 910 TPS (see the sketch below).
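To illustrate the "similar prompts" point: requests that share a long common prefix are the pattern that lets engines with prefix caching (for example, vLLM's automatic prefix caching) reuse KV cache across requests. The sketch below shows that pattern; the endpoint, key, and model name are placeholders as before.

```python
# Requests sharing a long common prefix can reuse cached KV entries on
# engines that support prefix caching, raising aggregate throughput.
# Endpoint, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://<aios-endpoint>/v1", api_key="sk-placeholder")

SHARED_PREFIX = (
    "You are a senior Python reviewer. Follow the team style guide strictly, "
    "explain each issue briefly, and suggest a concrete fix.\n\n"
)

snippets = ["def add(a,b):return a+b", "x=[i for i in range(10**7)]", "print('ok')"]

for code in snippets:
    resp = client.chat.completions.create(
        model="DeepSeek-R1-AWQ",
        messages=[{"role": "user", "content": SHARED_PREFIX + "Review:\n" + code}],
        max_tokens=256,
    )
    print(resp.choices[0].message.content[:120], "...")
```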
Capability Comparison:
As with the 1.58-bit version, we ran the online "full-power version" tests:
Earlier, we saw that quantizing DeepSeek-R1-671B has minimal impact on code generation. However, a HumanEval score of 90 may not be intuitive for readers. Following Unsloth's testing approach, we therefore asked the model to create the Flappy Bird game three times (pass@3), scored each run against 10 criteria (e.g., random colors, random shapes, successful execution), and averaged the results (a sketch of this scoring loop follows the prompt below). The temperature was set to 0.6.
Below is the Prompt used:
Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
9. The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.
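For concreteness, a minimal sketch of the pass@3 scoring loop is shown below. The criterion checks are hypothetical stand-ins (the real scoring involved running the generated game and inspecting it), and the `generate` callable stands for any of the client patterns shown earlier.

```python
# Sketch of the pass@3 scoring loop: generate the game three times,
# score each attempt against a 10-item checklist, and keep the best.
# The checklist entries here are hypothetical stand-ins; real scoring
# involved executing and inspecting the generated game.

def score_attempt(code: str) -> float:
    """Fraction of the 10 criteria the generated code satisfies."""
    criteria = [
        "pygame" in code,   # uses pygame
        "random" in code,   # uses randomness for colors/shapes
        # ... 8 further checks (runs successfully, scoring, pipes, restart, ...)
    ]
    return sum(criteria) / 10

def best_of_three(generate) -> float:
    """Run the prompt three times (temperature 0.6) and keep the best score."""
    return max(score_attempt(generate(temperature=0.6)) for _ in range(3))
```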
The results below reflect the best outcome from three runs, with the average score:
| Model | Execution Result | Score |
| --- | --- | --- |
| DeepSeek-R1-671B-1.58b Quantized | (screenshot) | 78.5% |
| DeepSeek-R1-671B-AWQ Quantized | (screenshot) | 91.0% |
| DeepSeek-R1-671B Unquantized | (screenshot) | 92.5% |
Even in longer code-generation tasks, AWQ quantization shows only a slight decline, and overall performance remains close to the unquantized DeepSeek-R1.
GPU model prices vary widely due to hardware specs and market supply. Here, we estimate based on popular online computing platforms like AutoDL:
Note that monthly rental and purchase costs aren’t always proportional. For example, an H20 8-GPU server’s monthly rent is six times that of a 3090 8-GPU server, but its purchase cost isn’t necessarily six times higher.
A single 8-GPU 3090 server suits simple testing, but not tasks like math or coding that need longer contexts.
Multi-server 3090 setups are suitable for individuals or small-scale code generation; a 14K context is sufficient for coding but not for complex mathematical reasoning.
While the 3090 route is low-cost, it offers no advantage in batch scenarios; higher-end configurations deliver better cost-effectiveness when larger context windows and higher parallelism are needed.
Due to the massive compute and memory demands of a model this size, a single user struggles to exceed 30–50 TPS in inference speed (a rough estimate of why follows below).
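To see why single-stream decoding is hard to push much higher, the rough roofline estimate below treats decoding as memory-bandwidth-bound: every generated token must read all active weights at least once. The parameter values (37B active parameters per token, bits per weight, effective bandwidth on the critical path) are illustrative assumptions, not measurements from our setup.

```python
# Rough roofline estimate for single-stream decoding speed, assuming the
# decode step is memory-bandwidth-bound: each new token must stream all
# active weights through the GPUs at least once. Values are illustrative.
ACTIVE_PARAMS = 37e9      # DeepSeek-R1 activates ~37B parameters per token
BITS_PER_WEIGHT = 1.58    # e.g. the 1.58-bit GGUF quantization
EFFECTIVE_BW_GBS = 900    # assumed usable bandwidth (GB/s) on the critical path

bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8        # ~7.3 GB
tps_upper_bound = EFFECTIVE_BW_GBS * 1e9 / bytes_per_token   # tokens/s ceiling
print(f"~{bytes_per_token / 1e9:.1f} GB read per token, "
      f"upper bound ~{tps_upper_bound:.0f} tokens/s before any overhead")
```

Real speeds fall well below this ceiling because KV cache reads, inter-GPU communication, dequantization, and scheduling overhead are not included in the estimate.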
However, since DeepSeek-R1 is still relatively new, inference engines are likely to see further optimization for it, which may shift these outcomes in the future.
Through this exploration, we've seen how 1.58-bit quantization lowers the cost of trying out DeepSeek-R1-671B, and how AWQ quantization lowers the cost of running it with ultra-long contexts. We tested runtime performance and capabilities, and found that complex mathematical reasoning is the most prone to degradation under quantization. In addition, the context size available at deployment time has a critical effect on usability.
In future articles, we’ll explore:
Full-Precision Deployment Strategies: How to properly deploy models in high-performance computing environments to maximize large-model capabilities and hardware resources.
By comparing models of different scales and precisions, we aim to provide comprehensive, detailed deployment solutions for enterprise applications, helping more industries adopt large language model technology swiftly and realize business value.
Note: Some data in this article is illustrative. Actual conditions may vary, so detailed testing and validation are recommended during implementation.