Deep Understanding of DeepSeek and Enterprise Practices (Part 4): 671B Full-Power Deployment and Performance Optimization Strategies

2025-03-20 11:30

Preface

In the previous articles of this series over the past few days, we’ve deeply explored DeepSeek’s distillation techniques, quantization strategies, and the deployment essentials and performance evaluations of the 7B, 32B, and 671B quantized models. This has helped readers select suitable model deployment solutions under varying resource constraints.

  • Deep Understanding of DeepSeek and Enterprise Practices (Part 1): Distillation, Deployment, and Evaluation
  • Deep Understanding of DeepSeek and Enterprise Practices (Part 2): Principles, Hardware Cooling, and Performance Testing of 32B Multi-GPU Inference
  • Deep Understanding of DeepSeek and Enterprise Practices (Part 3): 671B Ultra-Low-Cost Deployment Methods and Performance Evaluation

As enterprises deepen their exploration of AI applications, the DeepSeek series’ 671B full-power model, with its exceptional reasoning capabilities for ultra-complex tasks, has become a key asset for boosting competitiveness. However, its massive parameter size means single-GPU or single-machine deployments cannot fully unleash its potential. Multi-machine, multi-GPU deployments combined with the ZStack AIOS platform are critical to unlocking its capabilities. This article will detail the practical process of deploying the 671B full-power model on the AIOS platform using multiple machines and GPUs, analyze its performance, and provide robust support and guidance for enterprises adopting AI technology.

1. Theoretical Analysis of DeepSeek Model Inference Performance

For today’s large models, the GPU operation process can be simplified into the following steps:

  1. Convert input text (e.g., Chinese characters or words) into numbers (vectors and positional encodings) that the model can understand.
  2. Perform computations based on the model’s parameters. For example, with Qwen2.5-72B this means streaming roughly 145 GB of weight data (at FP16/BF16 precision) into the compute units; a short sketch of this arithmetic follows the list.
  3. Generate responses, essentially producing candidate words and their probability distributions.
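
To make step 2 concrete, here is a minimal Python sketch of the weight-footprint arithmetic. The ~72.7B parameter count is an approximation for Qwen2.5-72B and only the order of magnitude matters; the bytes-per-parameter values are exact for each precision.

```python
# Approximate weight footprint of a dense model at different precisions.
params = 72.7e9  # ~72.7 billion parameters (approximation for Qwen2.5-72B)

bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "FP8/INT8": 1}
for dtype, nbytes in bytes_per_param.items():
    size_gb = params * nbytes / 1e9  # decimal GB, to match the ~145 GB figure above
    print(f"{dtype:>10}: ~{size_gb:.0f} GB of weights to stream through the GPU")
# FP16/BF16 gives ~145 GB, the figure quoted for Qwen2.5-72B in step 2.
```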

In this process, two GPU hardware parameters are most critical:

  1. Matrix multiplication performance, commonly referred to as GPU TFlops.
  2. GPU memory bandwidth, as model parameters must be read from memory. This depends on whether GDDR or HBM memory is used.

For modern GPUs, the “bottleneck effect” of the latter often outweighs the former. Here’s a comparison of computational power and memory bandwidth for some common GPUs:

Take the RTX 4090 as an example: at FP8 precision its compute units can consume data on the order of 82 TB per second, yet its memory bandwidth can only deliver about 1 TB per second. In large model inference, memory bandwidth is therefore the typical bottleneck at low concurrency; only when concurrency is high enough does the bottleneck shift from “memory” to “compute power.” This explains why many 671B model tests show throughput rising as concurrency increases (a roofline-style sketch of this crossover follows the figure below).

[Figure 3: batch size]
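
To illustrate that crossover, here is a hedged, roofline-style sketch in Python. The bandwidth and compute numbers are illustrative placeholders in the spirit of the RTX 4090 comparison (about 1 TB/s of memory bandwidth versus tens of trillions of operations per second), not vendor specifications, and the 2-FLOPs-per-active-parameter rule is a simplification.

```python
# Rough estimate of whether a decoding step is memory-bound or compute-bound.
bandwidth_Bps = 1.0e12   # ~1 TB/s memory bandwidth (illustrative)
compute_flops = 82e12    # ~82 trillion operations/s of usable compute (illustrative)

active_params = 37e9     # activated parameters per token (DeepSeek-V3/R1)
bytes_per_param = 1      # FP8 weights

for batch in (1, 8, 32, 128, 512):
    # The activated weights are read once per decoding step, shared by the whole batch.
    t_mem = active_params * bytes_per_param / bandwidth_Bps
    # Each sequence in the batch needs roughly 2 FLOPs per active parameter.
    t_compute = batch * 2 * active_params / compute_flops
    bound = "memory-bound" if t_mem > t_compute else "compute-bound"
    print(f"batch={batch:4d}: memory {t_mem*1e3:.1f} ms, compute {t_compute*1e3:.1f} ms -> {bound}")
```

At small batch sizes the memory term dominates; only once the batch is large enough does compute become the limiter, which is exactly why throughput keeps rising with concurrency.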

Theoretical Performance Estimation for the 671B Model

For DeepSeek V3 and R1, the total parameter count is 671B. Thanks to the MoE (Mixture of Experts) architecture, only 37B parameters are activated during runtime. With FP8 representation (1 byte per parameter), the data read per token is:

37B × 1 byte = 37 GB

Note: For FP16 representation, this doubles to 74 GB/token.

Assuming a GPU memory bandwidth of approximately 1979 GB/s and no parallel splitting across GPUs, the time to stream the activated weights for one token is:

37 GB ÷ 1979 GB/s ≈ 18.7 ms per token

This corresponds to a throughput of about 53.5 tokens/s.

Note: This is an idealized theoretical estimate made under “extreme” assumptions. In practice, factors such as computation/communication overlap, cache hits, KV-cache reads (which grow with sequence length), available VRAM, and various optimization techniques will shift the result.

This estimate is admittedly rough: it ignores tensor-parallel splitting (where each GPU loads only a fraction of the activated parameters), but it also ignores the communication and synchronization overhead that tensor parallelism introduces and the imperfect utilization of memory bandwidth. In practice these effects largely offset each other, and the figure matches our actual single-user inference tests for DeepSeek closely: without aggressive optimization, single-user inference rarely exceeds 53.5 tokens/s. The sketch below reproduces this arithmetic.
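
The following is a minimal sketch of that calculation with the same assumed numbers (1979 GB/s of bandwidth, 37B activated FP8 parameters per token); it is an idealized bound, not a performance prediction.

```python
# Idealized single-stream decode estimate: every token must stream all
# activated weights through memory once, so bandwidth / bytes-per-token
# bounds the achievable tokens/s.
bandwidth_GBps = 1979      # assumed GPU memory bandwidth (GB/s)
active_params_B = 37       # activated parameters per token, in billions
bytes_per_param = 1        # FP8 weights (use 2 for FP16, which roughly halves the result)

bytes_per_token_GB = active_params_B * bytes_per_param   # 37 GB read per token
time_per_token_s = bytes_per_token_GB / bandwidth_GBps   # ~0.0187 s
throughput_tps = bandwidth_GBps / bytes_per_token_GB     # ~53.5 tokens/s
print(f"~{time_per_token_s * 1e3:.1f} ms per token, ~{throughput_tps:.1f} tokens/s")
```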

2. Optimization Strategies for DeepSeek Model Inference Performance

For large model inference, optimization strategies fall into three categories:

  1. Data-Level Optimization: For example, compressing prompts or reducing unnecessary tokens. However, our current bottleneck isn’t in prompt decoding, and our goal is TPS (tokens per second) rather than QPS (queries per second), so this isn’t a priority now.
  2. Model-Level Optimization: DeepSeek implements MLA, MoE, and FP8 training. Here’s a brief overview:
    a. MLA Architecture: Compared with traditional MHA (Multi-Head Attention), MLA (Multi-head Latent Attention) maintains strong expressive power while significantly reducing KV-cache size, lowering both memory bandwidth and VRAM demands.
    b. MoE Architecture: By splitting a dense model into many specialized experts and activating only a subset per token (DeepSeek-V3 uses 8 routed experts plus 1 shared expert), each token touches only 37B of the 671B weights, greatly reducing computation and memory-access costs.
    [Figure 5: MoE architecture]
    c. Low-Precision FP8 Training and Quantization: FP8 weights halve the data read and written compared with FP16, while quantizing the KV-cache (e.g., DeepSeek-V2 compresses it to an average of about 6 bits) further cuts memory usage without a meaningful loss of precision.
  3. System-Level Optimization: Includes increasing parallelism, speculative decoding, and other computation enhancements. Most of these are universal, but MTP-based speculative decoding is a DeepSeek-specific optimization:
    a. MTP (Multi-Token Prediction) Module: Used mainly during training to strengthen next-token prediction, during inference it improves decoding efficiency via speculative sampling. Official figures report an 85%–90% acceptance rate for the extra predicted token, yielding roughly a 1.8x TPS boost (a rough sanity check follows the figure below).

[Figure 6: MTP module]
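
As a rough sanity check of the ~1.8x figure: if the MTP head drafts one extra token per decoding step and that token is accepted with probability p, each step emits 1 + p tokens on average. The per-step overhead for drafting and verification below is an assumed value, not a measured one.

```python
# Hedged estimate of the TPS multiplier from MTP-style speculative decoding
# that drafts a single extra token per step.
for acceptance in (0.85, 0.90):            # official acceptance range quoted above
    for overhead in (0.0, 0.05):           # assumed extra cost per step (fractional)
        speedup = (1 + acceptance) / (1 + overhead)
        print(f"acceptance={acceptance:.2f}, overhead={overhead:.2f} -> ~{speedup:.2f}x TPS")
# With 85%-90% acceptance and a few percent overhead, roughly 1.8x falls out.
```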

3. Enterprise-Level Deployment and Practice

Balancing Cost and Performance

The deployment scheme in the DeepSeek-V3 paper (using 352 H800 GPUs per unit on an H800 cluster) leverages high parallelism to maximize GPU performance, achieving very high throughput but at a steep cost. To achieve high throughput at a lower cost, we first tested performance with fewer GPUs:

  1. Single H200 8-GPU Scenario

Environment Setup

Performance Results

Without speculative decoding:
We also tested enabling MTP speculative decoding together with additional optimizations:

Key observations after enabling MTP speculative decoding and other optimizations:

  • Throughput vs. First-Token Latency: At low concurrency (1-32), optimizations increase throughput while maintaining or reducing first-token latency—a win-win.
  • High-Concurrency Tradeoffs: At 128 concurrency, both first-token latency and throughput underperform compared to pre-optimization data.

Overall, MTP speculative decoding maintains good throughput while offering decent first-token response times in most scenarios. However, at very high concurrency, response times increase due to the computational overhead of speculative decoding, which may offset its benefits in large-scale parallel settings.

  2. Dual H20 96GB 16-GPU Scenario

Since H200 GPUs are harder to obtain, we also tested with two servers, each equipped with eight H20 96GB GPUs. After configuring the network, we observed performance with TP=16 under varying concurrency levels and network latencies.

Note: TP refers to Tensor Parallelism; the toy sketch below illustrates how a weight matrix is sharded under TP.
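
As a toy illustration of what TP means here, the NumPy sketch below splits one linear layer’s weight matrix column-wise across 16 shards, computes partial outputs, and gathers them back together. The shapes are arbitrary and real deployments overlap this communication with computation; the point is simply that every layer ends with a collective operation, which is why inter-node latency matters so much at TP=16.

```python
import numpy as np

tp_degree = 16                    # TP=16, matching the dual-node setup
x = np.random.randn(4, 1024)      # a small batch of activations
w = np.random.randn(1024, 4096)   # one linear layer's weight matrix

# Split the weight column-wise, one shard per GPU.
w_shards = np.split(w, tp_degree, axis=1)

# Each GPU multiplies the same input by its own shard (in parallel on real
# hardware); concatenating the partial outputs stands in for the all-gather
# whose latency the network experiments below measure.
partial_outputs = [x @ w_i for w_i in w_shards]
y_tp = np.concatenate(partial_outputs, axis=1)

# The sharded result matches the unsharded computation.
assert np.allclose(y_tp, x @ w)
```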

Environment Setup

Server internal hardware topology diagram:

[Figure 12: Server internal hardware topology diagram]

Deployment results on the ZStack AIOS platform:

Next, we tested performance using ZStack AIOS’s service evaluation tool:

TP16 Performance Results

To assess network latency’s impact on the TP16 deployment scheme, we artificially introduced delays using tc and compared throughput (TPS) under different network latencies:
The results are summarized in the chart below:

Key findings:
From the table and chart, as network latency increases from 0.193ms to 2.193ms, TP16’s throughput drops from 18.943 tokens/s to 4.85 tokens/s—a maximum performance decline of 74%. This shows that rising network latency significantly reduces TP16 throughput.

Since this was a single-concurrency test, the impact of network latency on TP16 throughput is already evident. Thus, when designing and deploying TP16 solutions, minimizing network latency is critical to optimizing throughput and performance.
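
For readers who want to reproduce this kind of experiment, the sketch below shows one way to inject artificial latency with Linux tc/netem from Python. The interface name and delay value are placeholders rather than our exact test harness, and the commands require root privileges on the nodes carrying the TP traffic.

```python
import subprocess

IFACE = "eth0"  # assumed name of the NIC carrying inter-node TP traffic

def add_delay(ms: float) -> None:
    """Add a fixed egress delay to all traffic on IFACE via netem."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "delay", f"{ms}ms"],
        check=True,
    )

def clear_delay() -> None:
    """Remove the netem qdisc and restore normal latency."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"], check=True)

if __name__ == "__main__":
    add_delay(2.0)   # e.g. +2 ms, close to the worst case in the table above
    # ... run the single-concurrency benchmark against the TP16 endpoint here ...
    clear_delay()
```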

4. Further Optimization Strategies for Production Applications

Although the above methods have significantly improved inference efficiency, more aggressive optimization strategies in large-scale cluster environments could potentially multiply performance further:

  • Hybrid Parallelism with DP+EP, TP+EP:
    a. Principle: DP (Data Parallelism) boosts overall inference throughput for large batch inputs via parallel computation without overburdening individual devices, while EP (Expert Parallelism) leverages MoE’s partial expert activation to reduce per-device resource use and increase speed. Combining the two enhances large-model inference performance.
    b. Case Study: Just yesterday, DeepSeek open-sourced DeepEP, a communication library tailored for Mixture of Experts (MoE) and Expert Parallelism (EP). It offers load balancing and communication strategies, addressing load imbalance and high communication overhead in traditional DP+EP setups, achieving higher computational efficiency and scalability in large-scale MoE model training. It also supports low-precision operations, including FP8.
  • Optimizing Redundant Expert Strategies: Beyond dynamically adjusting the number of redundant experts per GPU, future strategies could add smarter global routing to further balance load across cards. Current strategies, such as DeepSeek’s replication of high-load experts with periodic adjustments every 10 minutes during the prefill phase, already achieve some load balancing; however, as cluster scale and application complexity grow, intelligent global routing could adapt in real time and optimize load distribution further (a toy sketch of the replication idea follows this list).
  • Deepening Communication and PD Separation: Communication optimization for intra-node NVLink and cross-node IB can leverage hardware-level accelerators or network co-processors to further reduce latency. In large clusters with heavy inter-node communication (e.g., during the Decode phase), techniques such as IB point-to-point transmission and IBGDA already lower latency; as inference demands keep rising, hardware-level optimizations can fundamentally boost communication efficiency, alleviating network congestion and ensuring fast data transfer to meet stringent low-latency requirements.
    [Figure 19: Deepening communication and PD separation]
  • Expanding Multi-Microbatch Overlap: Processing two microbatches simultaneously can better hide idle time during forward and backward communication, approaching theoretical throughput limits. This strategy shines in large-cluster inference. For instance, DeepSeek uses two equally sized microbatches in the Prefill phase, overlapping one microbatch’s Attention and MoE computation with another’s Dispatch and Combine operations, boosting throughput. In the Decode phase, similar approaches are being explored, overlapping one microbatch’s attention computation with another’s Dispatch + MoE + Combine operations. Further expanding this could unlock even greater performance potential.
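
As a toy illustration of the redundant-expert idea referenced above, the sketch below keeps adding replicas to whichever expert carries the highest load per replica, given per-expert token counts collected over some window. It is a simplification for illustration only, not DeepSeek’s actual policy, which also handles expert placement and the periodic adjustment cycle.

```python
def plan_redundant_experts(expert_load: dict[int, int], num_redundant: int) -> dict[int, int]:
    """Return {expert_id: replica_count} after adding `num_redundant` extra copies."""
    replicas = {expert: 1 for expert in expert_load}
    for _ in range(num_redundant):
        # Replicate whichever expert currently carries the highest load per replica.
        hottest = max(expert_load, key=lambda e: expert_load[e] / replicas[e])
        replicas[hottest] += 1
    return replicas

# Hypothetical load counts: expert 3 is heavily oversubscribed, so it
# receives all of the extra replicas in this example.
load = {0: 120, 1: 95, 2: 110, 3: 900, 4: 130}
print(plan_redundant_experts(load, num_redundant=3))  # {0: 1, 1: 1, 2: 1, 3: 4, 4: 1}
```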

5. Conclusion

Through the above theoretical analysis and experiments, we’ve validated large model performance bottlenecks under varying concurrency levels. By leveraging DeepSeek’s unique MLA and MoE architecture advantages, combined with FP8 quantization and the MTP module, GPU hardware performance can be fully utilized. On the network side, flexible parallel strategies can be configured based on network conditions to optimize system throughput.

In the future, strategies like expert parallelism, data parallelism, redundant experts, communication optimization, and multi-microbatch overlap can further enhance system performance, providing a solid technical foundation for large-scale deployment.

This concludes a comprehensive analysis and enterprise deployment outlook based on current theory and DeepSeek model deployment practices. We hope this article offers reference and inspiration for engineers and enterprise decision-makers in large model deployment.

6. Outlook

In the AI field, model iterations evolve rapidly, and the next disruptive model could emerge at any moment. Thus, enterprises must establish long-term model selection and evaluation mechanisms to stay ahead of technological trends. When choosing AI models, enterprises should select models with appropriate parameter sizes and hardware deployment schemes based on actual business needs, striking an optimal balance between inference performance and cost.

In future articles, we’ll explore:

  • Domestic GPU Deployment Strategies: How to run DeepSeek models on domestic GPUs, along with their inference performance and efficiency.

Stay tuned to the ZStack public account! We’ll continue optimizing and focusing on DeepSeek model inference performance and cost-effectiveness solutions, offering comprehensive and detailed deployment strategies for enterprise applications. This will help more industries swiftly adopt large language model technology and realize business value.

 
