In the previous articles of this series, we explored DeepSeek's distillation techniques, quantization strategies, and the deployment essentials and performance evaluations of the 7B, 32B, and 671B quantized models, helping readers select suitable deployment solutions under varying resource constraints.
As enterprises deepen their exploration of AI applications, the DeepSeek series’ 671B full-power model, with its exceptional reasoning capabilities for ultra-complex tasks, has become a key asset for boosting competitiveness. However, its massive parameter size means single-GPU or single-machine deployments cannot fully unleash its potential. Multi-machine, multi-GPU deployments combined with the ZStack AIOS platform are critical to unlocking its capabilities. This article will detail the practical process of deploying the 671B full-power model on the AIOS platform using multiple machines and GPUs, analyze its performance, and provide robust support and guidance for enterprises adopting AI technology.
For today's large models, the GPU workflow for each generated token can be simplified into the following steps: the model's activated weights are read from GPU memory into the compute units, the matrix multiplications for the current step are performed, and the next token is produced before the cycle repeats.
In this process, two GPU hardware parameters are most critical: compute throughput (FLOPS) and memory bandwidth.
For modern GPUs, memory bandwidth is far more often the limiting factor than raw compute. Here is a comparison of compute throughput and memory bandwidth for some common GPUs:
Take the RTX 4090 as an example: at FP8 precision its compute units could consume roughly 82 TB of operand data per second, yet its memory bandwidth can only deliver about 1 TB per second. In large model inference, memory bandwidth is therefore typically the bottleneck when concurrency is low; only when concurrency is high enough does the bottleneck shift from memory to compute. This explains why many 671B model tests show throughput rising as concurrency increases.
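To make this trade-off concrete, here is a rough roofline-style sketch (our own illustration, reusing the approximate 82 TB/s and 1 TB/s figures above). During decoding, the weights are read from memory once per step but reused by every request in the batch, so compute work grows with concurrency while weight traffic does not. The exact crossover point in practice depends on kernel efficiency, KV-cache traffic, and other factors.

```python
# Rough roofline-style estimate: at what concurrency does the bottleneck
# shift from memory bandwidth to compute? The figures are the approximate
# RTX 4090 numbers quoted above and are illustrative only.
compute_bytes_per_s = 82e12   # ~82 TB/s of FP8 operands the compute units could consume
mem_bandwidth = 1e12          # ~1 TB/s of memory bandwidth

def bottleneck(batch_size: int) -> str:
    # Time to stream one byte of weights from memory (normalized per byte).
    t_mem = 1 / mem_bandwidth
    # Time for the compute units to process that byte for every request in the batch.
    t_compute = batch_size / compute_bytes_per_s
    return "compute-bound" if t_compute > t_mem else "memory-bound"

for bs in (1, 8, 32, 82, 128):
    print(f"batch size {bs:>3}: {bottleneck(bs)}")

# With these numbers the crossover sits around a batch size of ~82, which is
# why throughput keeps climbing as concurrency rises until compute saturates.
```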
Theoretical Performance Estimation for the 671B Model
For DeepSeek V3 and R1, the total parameter count is 671B. Thanks to the MoE (Mixture of Experts) architecture, only 37B parameters are activated during runtime. With FP8 representation (1 byte per parameter), the data read per token is:
37B × 1 byte = 37 GB
Note: For FP16 representation, this doubles to 74 GB/token.
Assuming a GPU memory bandwidth of approximately 1979 GB/s and no parallel splitting across GPUs, the time needed to stream the activated weights for one token is:
37 GB ÷ 1979 GB/s ≈ 18.7 ms
This corresponds to a throughput of about 1979 ÷ 37 ≈ 53.5 tokens/s.
Note: This calculation is an idealized theoretical limit for a single request stream. In practice, factors such as computation overlap, cache hits, KV-cache reads (which grow with sequence length), and various optimization techniques or GPU memory conditions will shift the result.
While this estimate is rough and ignores tensor parallelism (which lets each GPU load only a fraction of the activated parameters), the communication and synchronization overhead that tensor parallelism introduces, together with imperfect memory bandwidth utilization, largely cancels that advantage out. The figure therefore lines up closely with our actual single-user inference tests for DeepSeek: without aggressive optimization, single-user performance rarely exceeds 53.5 tokens/s.
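For readers who want to reproduce the arithmetic, here is a minimal sketch of the estimate, using only the figures quoted above and deliberately ignoring tensor parallelism, KV-cache traffic, and compute/communication overlap.

```python
# Back-of-envelope single-user decode estimate for DeepSeek V3/R1,
# using the figures from the text. This is a simplified single-GPU model.
activated_params = 37e9          # 37B activated parameters (MoE)
bytes_per_param = 1              # FP8: 1 byte per parameter (2 for FP16)
mem_bandwidth_gb_s = 1979        # assumed GPU memory bandwidth in GB/s

bytes_per_token_gb = activated_params * bytes_per_param / 1e9   # 37 GB
time_per_token_s = bytes_per_token_gb / mem_bandwidth_gb_s      # ~0.0187 s
tokens_per_s = 1 / time_per_token_s                             # ~53.5

print(f"weights read per token : {bytes_per_token_gb:.0f} GB")
print(f"time per token         : {time_per_token_s * 1000:.1f} ms")
print(f"theoretical throughput : {tokens_per_s:.1f} tokens/s")
```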
For large model inference, optimization strategies fall into three categories:
Balancing Cost and Performance
The deployment scheme in the DeepSeek-V3 paper (using 352 H800 GPUs per unit on an H800 cluster) leverages high parallelism to maximize GPU performance, achieving very high throughput but at a steep cost. To achieve high throughput at a lower cost, we first tested performance with fewer GPUs:
Environment Setup
Performance Results
Without speculative decoding:
We also tested performance with MTP (Multi-Token Prediction) speculative decoding enabled, along with additional optimizations:
Key observations after enabling MTP speculative decoding and other optimizations:
Overall, MTP speculative decoding maintains good throughput while offering decent first-token response times in most scenarios. However, at very high concurrency, response times increase due to the computational overhead of speculative decoding, which may offset its benefits in large-scale parallel settings.
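To illustrate why this trade-off exists, the sketch below uses a standard expected-tokens-per-step model for speculative decoding. The draft length and acceptance rates are hypothetical values chosen for illustration, not statistics measured in our MTP tests.

```python
# Simple model of speculative decoding: with k drafted tokens per step and
# an (assumed) independent per-token acceptance rate p, the expected number
# of tokens produced per verification step is E = (1 - p**(k + 1)) / (1 - p).
# The values of k and p below are hypothetical, not measured MTP statistics.
def expected_tokens_per_step(k: int, p: float) -> float:
    if p >= 1.0:
        return k + 1.0
    return (1 - p ** (k + 1)) / (1 - p)

for p in (0.6, 0.8, 0.9):
    e = expected_tokens_per_step(k=1, p=p)
    print(f"acceptance rate {p:.1f}, 1 draft token: {e:.2f} tokens per step")

# Each step still pays for drafting plus a verification pass, so once the
# GPUs are already compute-bound at very high concurrency, that extra work
# can cancel out the benefit of producing more tokens per step.
```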
Since H200 GPUs are harder to obtain, we tested with two servers, each equipped with eight H20 96 GB GPUs. After configuring the network, we measured performance with TP=16 across varying concurrency levels and network latencies.
Note: TP refers to Tensor Parallelism.
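As a quick sizing check (our own approximation, which ignores activations and runtime buffers), TP=16 across sixteen H20 96 GB GPUs leaves comfortable headroom for the FP8 weights plus KV cache:

```python
# Rough memory sizing for a TP=16 deployment of the 671B model on
# 16 x H20 96 GB. Ignores activations and runtime buffers, so the real
# headroom is somewhat smaller than shown here.
total_params = 671e9        # total parameters (all experts resident on GPU)
bytes_per_param = 1         # FP8 weights
num_gpus = 16               # TP=16 across two 8-GPU servers
gpu_mem_gb = 96             # H20 96 GB

weights_per_gpu_gb = total_params * bytes_per_param / num_gpus / 1e9
headroom_gb = gpu_mem_gb - weights_per_gpu_gb

print(f"weights per GPU  : {weights_per_gpu_gb:.1f} GB")   # ~41.9 GB
print(f"approx. headroom : {headroom_gb:.1f} GB for KV cache and buffers")
```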
Environment Setup
Server internal hardware topology diagram:
Deployment results on the ZStack AIOS platform:
Next, we tested performance using ZStack AIOS's service evaluation tool.
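For readers who want to reproduce a similar measurement outside the platform, a minimal sketch along the following lines records time-to-first-token and decode rate against an OpenAI-compatible endpoint; the endpoint URL, API key, and model name are placeholders, and this is not the AIOS evaluation tool itself.

```python
# Minimal hand-rolled latency/throughput probe against an OpenAI-compatible
# inference endpoint. The URL, key and model name are placeholders; this is
# an illustration of the metrics, not the ZStack AIOS evaluation tool.
import time
from openai import OpenAI

client = OpenAI(base_url="http://<aios-endpoint>/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="deepseek-r1",  # placeholder model name
    messages=[{"role": "user", "content": "Briefly explain tensor parallelism."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1
end = time.perf_counter()

print(f"time to first token : {first_token_at - start:.2f} s")
print(f"approx. decode rate : {chunks / (end - first_token_at):.1f} chunks/s")
```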
TP16 Performance Results
To assess network latency's impact on the TP16 deployment scheme, we artificially introduced delays with the Linux tc tool (a sketch of the approach appears at the end of this subsection) and compared throughput (TPS) under different network latencies:
Summarized in a chart:
Key findings:
As the table and chart show, when network latency rises from 0.193 ms to 2.193 ms, TP16 throughput falls from 18.943 tokens/s to 4.85 tokens/s, a decline of roughly 74%. Increasing network latency therefore reduces TP16 throughput dramatically.
Since this was a single-concurrency test, the impact of network latency on TP16 throughput is already evident. Thus, when designing and deploying TP16 solutions, minimizing network latency is critical to optimizing throughput and performance.
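For reference, the artificial delays in this test were injected with tc/netem. A small helper along the following lines (the network interface name is a placeholder, and root privileges are required) can add and remove such a delay around each benchmark run.

```python
# Helper for injecting artificial latency with tc/netem, similar in spirit
# to how the delays in this test were introduced. "eth0" is a placeholder
# for the actual inter-node interface; running tc requires root privileges.
import subprocess

IFACE = "eth0"

def add_delay(ms: float) -> None:
    """Attach a netem qdisc that delays all egress traffic by `ms` milliseconds."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "delay", f"{ms}ms"],
        check=True,
    )

def clear_delay() -> None:
    """Remove the netem qdisc, restoring the original network behaviour."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"], check=True)

# Example: add 1 ms of delay, run the TP16 throughput test, then clean up.
# add_delay(1)
# ...
# clear_delay()
```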
Although the methods above have already improved inference efficiency significantly, more aggressive optimization strategies in large-scale cluster environments, such as those outlined in the outlook below, could multiply performance further.
Through the theoretical analysis and experiments above, we've identified where large-model performance bottlenecks lie at different concurrency levels. By leveraging the architectural advantages of DeepSeek's MLA (Multi-head Latent Attention) and MoE designs, combined with FP8 quantization and the MTP module, GPU hardware can be used to its full potential. On the network side, parallel strategies can be configured flexibly according to network conditions to optimize system throughput.
In the future, strategies like expert parallelism, data parallelism, redundant experts, communication optimization, and multi-microbatch overlap can further enhance system performance, providing a solid technical foundation for large-scale deployment.
This concludes our analysis and enterprise deployment outlook, based on current theory and hands-on DeepSeek deployment practice. We hope it offers useful reference and inspiration for engineers and enterprise decision-makers working on large model deployment.
In the AI field, model iterations evolve rapidly, and the next disruptive model could emerge at any moment. Thus, enterprises must establish long-term model selection and evaluation mechanisms to stay ahead of technological trends. When choosing AI models, enterprises should select models with appropriate parameter sizes and hardware deployment schemes based on actual business needs, striking an optimal balance between inference performance and cost.
In future articles, we'll continue to explore these topics in more depth. Stay tuned to the ZStack public account! We'll keep optimizing DeepSeek model inference for performance and cost-effectiveness, offering comprehensive and detailed deployment strategies for enterprise applications and helping more industries swiftly adopt large language model technology and realize business value.