In the earlier articles in this series, Deep Understanding of DeepSeek and Enterprise Practices (Part 1): Distillation, Deployment, and Evaluation and Deep Understanding of DeepSeek and Enterprise Practices (Part 2): Principles, Hardware Cooling, and Performance Testing of 32B Multi-GPU Inference, we introduced the relationships between the different DeepSeek R1 models and their core metrics, and completed the deployment and evaluation of several distilled models on ZStack AIOS. In our tests, the distilled versions often outperform the models they were distilled from in areas such as mathematics and coding. For more complex tasks, however (e.g., writing hundreds of lines of code), their performance can fall short. At that point, it is worth considering the DeepSeek-R1 671B model, commonly referred to online as the "full-power version."
However, with 671B parameters the model is massive: if the hardware doesn't support FP8, the model weights alone require about 1.3 TB in BF16, making costs prohibitively high. This article therefore focuses on deploying the near-trillion-parameter DeepSeek-R1 671B model at the lowest possible cost, assessing the real-world performance of quantized versions, measuring the loss relative to the unquantized model, and analyzing the cost-effectiveness and suitable scenarios of different hardware configurations.
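As a quick sanity check on these figures, the back-of-the-envelope arithmetic below estimates weight memory at several precisions. It counts parameters only and ignores KV cache, activations, and per-format overhead, so treat the numbers as rough lower bounds rather than exact requirements.

```python
# Back-of-the-envelope weight-memory estimate for a 671B-parameter model.
# Counts parameters only; KV cache, activations, and per-format metadata
# are ignored, so real requirements are higher.
PARAMS = 671e9

def weight_gb(bits_per_weight: float) -> float:
    """Approximate weight size in GB at the given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("4-bit (AWQ-class)", 4), ("1.58-bit", 1.58)]:
    print(f"{name:>18}: ~{weight_gb(bits):,.0f} GB")
# BF16     -> ~1,342 GB  (the ~1.3 TB figure mentioned above)
# FP8      -> ~671 GB
# 4-bit    -> ~336 GB
# 1.58-bit -> ~133 GB
```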
Currently, there are many quantization schemes for DeepSeek R1 671B. We won't delve into the specific meanings of methods such as IQ1_S or AWQ here; instead, we directly compare several typical schemes and their VRAM requirements below:
(Note: the VRAM figures include the minimum KV cache and system overhead and therefore represent lower bounds; actual requirements depend on context window size, KV cache precision, and other factors. The GGUF and safetensors formats also differ in VRAM usage because of their inference engines and parallelism methods, so the two cannot be compared directly.)
From the table above, it's clear that a single 8-GPU 3090 server just meets the minimum requirements for 671B-1.58b! Additionally, all weights can be loaded onto the GPUs, ensuring decent inference speed. However, note that since the format is GGUF, the llama.cpp inference framework is required. ZStack AIOS supports multiple inference frameworks, allowing users to choose based on their needs.
1. Environment Preparation: Install ZStack AIOS and ensure the system meets operational requirements.
2. One-Click Deployment:
a. Use ZStack AIOS to select the model and link it to an appropriate inference template (llama.cpp) and image.
b. Specify the GPU and compute specifications for running the model, then deploy.
3. Test Run: Try a conversation in the interactive window, or integrate the service into other applications via its API (see the client sketch below).
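For reference, a minimal client-side sketch is shown below. It assumes the deployed service exposes an OpenAI-compatible chat-completions endpoint (as llama.cpp's server does); the base URL, API key, and model name are placeholders for this example, not actual ZStack AIOS values.

```python
# Minimal chat call against an OpenAI-compatible endpoint.
# The URL, key, and model name below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://<aios-endpoint>/v1",  # hypothetical service address
    api_key="sk-placeholder",              # hypothetical key
)

response = client.chat.completions.create(
    model="DeepSeek-R1-671B-1.58b",        # name as registered in the platform
    messages=[{"role": "user", "content": "Briefly explain MoE models."}],
    temperature=0.6,
    max_tokens=512,
)
print(response.choices[0].message.content)
```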
We successfully ran 671B-1.58b! Unfortunately, we were limited to a 4K context due to llama.cpp's layer-split GPU loading approach, in which every layer must reserve space for the full context to operate. Moreover, because of llama.cpp's inference mechanics, increasing concurrency does not raise total throughput and even reduces the per-session context size, so we did not pursue higher concurrency.
However, a 4K context is far too small for DeepSeek-R1: responses are easily truncated, making it impossible to complete standard evaluations such as MMLU or C-Eval. To increase the context size, we tested a multi-machine deployment of DeepSeek-R1-671B-1.58b.
Since the context on a single 3090 server was too limited, we expanded the 671B-1.58b model's context using a cluster. By distributing the weights across 16 cards, we theoretically freed up nearly 14 GB for the KV cache, equivalent to roughly a 14K context. (Note: multi-machine parallel inference with llama.cpp is not recommended for production; this was for testing only.)
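The rough relationship between free VRAM and usable context can be sketched as below. The bytes-per-token figure is an assumption chosen to match the ~14 GB ≈ 14K-token ratio seen here; the real value depends on the model architecture, KV cache precision, and inference engine.

```python
# Rough context-length estimate from free VRAM.
# BYTES_PER_TOKEN is an assumed figure matching the ~14 GB ≈ 14K tokens
# ratio observed in this test; it varies with model, KV precision, and engine.
BYTES_PER_TOKEN = 1 * 1024**2  # ~1 MiB of KV cache per token (assumption)

def max_context_tokens(free_vram_gb: float) -> int:
    """Approximate number of context tokens that fit in the given free VRAM."""
    return int(free_vram_gb * 1024**3 / BYTES_PER_TOKEN)

print(max_context_tokens(14))  # ~14K tokens across the 16-GPU cluster
```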
Long-context scenarios show a slight increase in model throughput.
Due to llama.cpp’s architecture and the 3090’s bandwidth limits, higher concurrency doesn’t effectively improve GPU utilization.
Using ZStack AIOS’s service evaluation feature, we tested DeepSeek-R1-671B-1.58b on MMLU, C-Eval, HumanEval, etc., comparing its quantized performance across dimensions and against distilled versions.
Due to excessively long runtimes, some evaluations were sampled. Results are summarized below:
*Marked data was measured in the ZStack experimental environment and is not from the official papers.
**With only a 14K context, the AIME24 test could not be completed normally.
From the data, the 1.58-bit quantization does affect performance, but less severely than expected; the model still holds a clear advantage over GPT-4o and Claude-3.5 in English comprehension, Chinese comprehension, and coding ability.
We also tried the online tips for "identifying a full-power model"; they showed some ability to distinguish versions:
Limited by llama.cpp's design and the 1.58-bit quantization's data layout, further performance gains are difficult. For enterprises, one or two 8-GPU 3090 servers are cost-effective, but the context window and concurrency are heavily constrained.
Thus, we explored another quantization method: AWQ. Theoretically, AWQ needs only eight GPUs with 64 GB+ of VRAM each. Given the resources available to us, we tested AWQ performance on H20 96 GB GPUs.
Environment Setup
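The exact environment is handled by the ZStack AIOS inference template, but for readers reproducing this outside the platform, the sketch below shows one common way to serve an AWQ checkpoint with vLLM across eight GPUs. The model path and parameter values are illustrative assumptions, not the configuration used in our tests.

```python
# Illustrative vLLM setup for an AWQ-quantized checkpoint on 8 GPUs.
# The model path and parameter values are assumptions, not the exact
# configuration used in this article's tests.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/DeepSeek-R1-AWQ",  # hypothetical local checkpoint path
    quantization="awq",               # use AWQ kernels
    tensor_parallel_size=8,           # split across 8 GPUs
    max_model_len=32768,              # context length to reserve KV cache for
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Write a haiku about quantization."], params)
print(outputs[0].outputs[0].text)
```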
Performance Results
The figure below shows the performance of DeepSeek-R1-AWQ dialogues on ZStack AIOS:
Performance Analysis
1. AWQ quantization doesn't support MLA or FP8 acceleration, which limits performance in high-concurrency scenarios.
2. As concurrency rises, aggregate throughput approaches 400 tokens/s, but per-session throughput drops sharply, so this setup mainly suits offline use.
3. Because our testing method was strict, KV cache hit rates were low; with similar prompts to boost the hit rate, throughput could reach up to 910 TPS (see the sketch below).
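To illustrate the "similar prompts" point: requests that share a long common prefix are the pattern that lets engines with prefix caching (for example, vLLM's automatic prefix caching) reuse KV cache across requests. The sketch below shows that pattern; the endpoint, key, and model name are placeholders as before.

```python
# Requests sharing a long common prefix can reuse cached KV entries on
# engines that support prefix caching, raising aggregate throughput.
# Endpoint, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://<aios-endpoint>/v1", api_key="sk-placeholder")

SHARED_PREFIX = (
    "You are a senior Python reviewer. Follow the team style guide strictly, "
    "explain each issue briefly, and suggest a concrete fix.\n\n"
)

snippets = ["def add(a,b):return a+b", "x=[i for i in range(10**7)]", "print('ok')"]

for code in snippets:
    resp = client.chat.completions.create(
        model="DeepSeek-R1-AWQ",
        messages=[{"role": "user", "content": SHARED_PREFIX + "Review:\n" + code}],
        max_tokens=256,
    )
    print(resp.choices[0].message.content[:120], "...")
```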
Capability Comparison:
As with the 1.58-bit version, we ran the online "full-power version" tests:
Earlier, we saw that quantizing DeepSeek-R1-671B has minimal impact on code generation. However, a HumanEval score of 90 may not be intuitive for readers. Following Unsloth's testing approach, we therefore asked the model to create the Flappy Bird game three times (pass@3), scored each run against 10 criteria (e.g., random colors, random shapes, successful execution), and averaged the results (a sketch of this scoring loop follows the prompt below). The temperature was set to 0.6.
Below is the Prompt used:
Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
9. The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.
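For concreteness, a minimal sketch of the pass@3 scoring loop is shown below. The criterion checks are hypothetical stand-ins (the real scoring involved running the generated game and inspecting it), and the `generate` callable stands for any of the client patterns shown earlier.

```python
# Sketch of the pass@3 scoring loop: generate the game three times,
# score each attempt against a 10-item checklist, and keep the best.
# The checklist entries here are hypothetical stand-ins; real scoring
# involved executing and inspecting the generated game.

def score_attempt(code: str) -> float:
    """Fraction of the 10 criteria the generated code satisfies."""
    criteria = [
        "pygame" in code,   # uses pygame
        "random" in code,   # uses randomness for colors/shapes
        # ... 8 further checks (runs successfully, scoring, pipes, restart, ...)
    ]
    return sum(criteria) / 10

def best_of_three(generate) -> float:
    """Run the prompt three times (temperature 0.6) and keep the best score."""
    return max(score_attempt(generate(temperature=0.6)) for _ in range(3))
```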
The results below reflect the best outcome from three runs, with the average score:
| Model | Execution Result | Score |
| --- | --- | --- |
| DeepSeek-R1-671B-1.58b Quantized | (screenshot) | 78.5% |
| DeepSeek-R1-671B-AWQ Quantized | (screenshot) | 91.0% |
| DeepSeek-R1-671B Unquantized | (screenshot) | 92.5% |
Even in longer code-generation tasks, AWQ quantization shows only a slight decline, and overall performance remains close to the unquantized DeepSeek-R1.
GPU model prices vary widely due to hardware specs and market supply. Here, we estimate based on popular online computing platforms like AutoDL:
Note that monthly rental and purchase costs aren’t always proportional. For example, an H20 8-GPU server’s monthly rent is six times that of a 3090 8-GPU server, but its purchase cost isn’t necessarily six times higher.
A single 8-GPU 3090 server suits simple testing, but not tasks like math or coding that need longer contexts.
Multi-server 3090 setups are suitable for individuals or small-scale code generation; a 14K context is sufficient for coding but not for complex mathematical reasoning.
While the 3090 route is low-cost, it offers no advantage in batch scenarios; higher-end configurations deliver better cost-effectiveness when larger context windows and higher parallelism are needed.
Due to the massive compute and memory demands of a model this size, a single user struggles to exceed 30–50 TPS in inference speed (a rough estimate of why follows below).
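To see why single-stream decoding is hard to push much higher, the rough roofline estimate below treats decoding as memory-bandwidth-bound: every generated token must read all active weights at least once. The parameter values (37B active parameters per token, bits per weight, effective bandwidth on the critical path) are illustrative assumptions, not measurements from our setup.

```python
# Rough roofline estimate for single-stream decoding speed, assuming the
# decode step is memory-bandwidth-bound: each new token must stream all
# active weights through the GPUs at least once. Values are illustrative.
ACTIVE_PARAMS = 37e9      # DeepSeek-R1 activates ~37B parameters per token
BITS_PER_WEIGHT = 1.58    # e.g. the 1.58-bit GGUF quantization
EFFECTIVE_BW_GBS = 900    # assumed usable bandwidth (GB/s) on the critical path

bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8        # ~7.3 GB
tps_upper_bound = EFFECTIVE_BW_GBS * 1e9 / bytes_per_token   # tokens/s ceiling
print(f"~{bytes_per_token / 1e9:.1f} GB read per token, "
      f"upper bound ~{tps_upper_bound:.0f} tokens/s before any overhead")
```

Real speeds fall well below this ceiling because KV cache reads, inter-GPU communication, dequantization, and scheduling overhead are not included in the estimate.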
However, since DeepSeek-R1 is still relatively new, inference engines are likely to see further optimization for it, which may shift these outcomes in the future.
Through this exploration, we've seen how 1.58-bit quantization lowers the cost of trying out DeepSeek-R1-671B, and how AWQ quantization lowers the cost of running it with ultra-long contexts. We tested runtime performance and capabilities, and found that complex mathematical reasoning is the most prone to degradation under quantization. In addition, the context size available at deployment time has a critical effect on usability.
In future articles, we’ll explore:
Full-Precision Deployment Strategies: How to properly deploy models in high-performance computing environments to maximize large-model capabilities and hardware resources.
By comparing models of different scales and precisions, we aim to provide comprehensive, detailed deployment solutions for enterprise applications, helping more industries adopt large language model technology swiftly and realize business value.
Note: Some data in this article is illustrative. Actual conditions may vary, so detailed testing and validation are recommended during implementation.