Before R2 arrives, DeepSeek throws out another smoke bomb

Around May, DeepSeek made a flurry of moves, but none of them was the R2 everyone has been waiting for. Still, this series of appetizers has laid the groundwork for R2.

On May 14, a DeepSeek V3 paper revealed how Liang Wenfeng's team achieved "extreme cost reduction". The paper also gave the industry a glimpse of the "internal skills" this technology-driven company has been quietly cultivating.

Unlike the V3 technical report released earlier, this paper details how DeepSeek achieves extreme cost-effectiveness under tight hardware constraints through meticulous hardware-software co-design. (Huxiu Note: DeepSeek-V3 was trained on only 2,048 Nvidia H800 GPUs)

In the cash-burning race of large AI models, computing power is power, but it can also be the straw that breaks the camel's back. At its core, the DeepSeek V3 paper speaks to an industry pain point: how can large models stop being an exclusive game for a handful of giants?

In the paper, DeepSeek shared several of its secrets for "cutting costs and boosting efficiency". Between the lines, these technical details reveal how thoroughly it squeezes the potential out of existing hardware, and they hint at the performance and efficiency ambitions of future DeepSeek models:

First, put the model's "memory system" on a diet. When AI handles long texts and multi-turn conversations, it has to remember a huge amount of context. This "memory" (the KV cache) consumes a staggering amount of GPU memory. DeepSeek V3 uses Multi-Head Latent Attention (MLA), which works like efficient compression software for the model's memory, condensing lengthy information down to its essence and dramatically cutting memory use. This means the model stays comfortable even as contexts grow longer and longer, which is crucial for unlocking more complex application scenarios.
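
To make the "compression software" analogy a bit more concrete, here is a minimal sketch of the latent KV-cache idea in PyTorch. It is only an illustration, not DeepSeek's implementation; the class name LatentKVCache and all of the dimensions are made-up assumptions. The point is simply that each token is cached as one small latent vector and expanded back into full per-head keys and values only when attention actually needs them.

```python
import torch
import torch.nn as nn

# Illustrative sketch of latent KV compression (not DeepSeek's code):
# instead of caching full per-head keys/values for every token, cache one
# small latent vector per token and expand it back only when attention runs.

class LatentKVCache(nn.Module):
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress token state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to values
        self.cache = []  # per-token latents, far smaller than full K/V

    def append(self, hidden):   # hidden: (batch, d_model) for the newest token
        self.cache.append(self.down(hidden))

    def expand(self):           # rebuild K/V for all cached tokens on demand
        latents = torch.stack(self.cache, dim=1)       # (batch, seq, d_latent)
        return self.up_k(latents), self.up_v(latents)

cache = LatentKVCache()
for _ in range(8):                                     # simulate decoding 8 tokens
    cache.append(torch.randn(1, 4096))
k, v = cache.expand()
print(k.shape)   # (1, 8, 4096) -- yet the cached state per token was only 512 floats
```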

Second, build a team where each expert does their own job. Faced with a model whose parameter count is astronomical, the traditional "one person does everything" approach simply cannot keep up. DeepSeek V3 continues and optimizes the Mixture-of-Experts (MoE) architecture. Imagine a huge project broken down among a group of experts, each with their own strengths: when a specific problem comes up, the system automatically "calls in" only the most relevant experts to work on it. This not only improves computational efficiency but also keeps the model's "effective size" under control, avoiding needless waste of resources. DeepSeek's optimizations make communication between these "experts" smoother and the cost of collaboration lower.
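
Here is a minimal sketch of the "call in only the right experts" idea, again purely illustrative rather than DeepSeek's actual architecture (TinyMoE, the layer sizes, and top_k=2 are all assumptions): a small router scores every expert for every token, but only the top-k experts are ever run.

```python
import torch
import torch.nn as nn

# Illustrative mixture-of-experts routing: the router scores all experts per token,
# but each token is processed by only its top-k experts, so most experts stay idle.

class TinyMoE(nn.Module):
    def __init__(self, d_model=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # only the chosen experts do any work
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

moe = TinyMoE()
y = moe(torch.randn(16, 256))
print(y.shape)   # torch.Size([16, 256]) -- each token touched only 2 of the 8 experts
```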

Third, boldly embrace numbers that are "only as precise as they need to be". In AI training, higher numerical precision is not always better. FP8, a low-precision number format, is a bit like everyday life: sometimes it is enough to round to the nearest dime, or even the nearest dollar, with no need to count every cent. DeepSeek V3 introduces FP8 mixed precision into training, doing "rough" arithmetic in the parts that are less sensitive to precision. The direct effect is that compute and memory usage drop significantly, and training becomes faster and more power-efficient. Crucially, the paper shows that this "corner-cutting" does not noticeably sacrifice the model's final performance.
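
The mixed-precision principle can be illustrated with a rough sketch (not DeepSeek's FP8 recipe; real FP8 kernels require specific hardware, so this example uses bfloat16 as the stand-in "rough" format): the bulky matrix multiplications run in low precision, while a full-precision master copy of the weights keeps hold of the tiny updates that low precision would otherwise lose.

```python
import torch

# Illustrative mixed-precision sketch: do the expensive matmul in a low-precision
# format, keep a full-precision (FP32) master copy of the weights for updates.

torch.manual_seed(0)
master_w = torch.randn(1024, 1024)   # FP32 master copy of the weights
x = torch.randn(64, 1024)

# Forward pass: cast down, multiply cheaply, read the result back in FP32.
y_low = (x.to(torch.bfloat16) @ master_w.to(torch.bfloat16)).float()
y_ref = x @ master_w                 # full-precision reference

rel_err = (y_low - y_ref).norm() / y_ref.norm()
print(f"relative error of low-precision matmul: {rel_err.item():.4%}")  # small, typically ~1% or less

# The weight update stays in FP32, so tiny gradient steps still register;
# a change of 1e-5 on weights of size ~1 would vanish entirely in bfloat16.
grad = torch.full_like(master_w, 1e-4)
master_w -= 0.1 * grad
```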

Fourth, lay down a smoother "information expressway". Large-scale training depends on GPU clusters working in concert, and the efficiency of data exchange between GPUs directly determines overall training speed. DeepSeek V3 adopts a multi-plane network topology to optimize data transmission paths within the cluster, reducing congestion and bottlenecks so that information flows as smoothly as silk.
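
The intuition can be shown with a toy calculation, using purely hypothetical numbers that have nothing to do with DeepSeek's actual cluster: if each GPU's traffic is spread across several independent network planes instead of one shared fabric, no single plane turns into a hot spot.

```python
# Toy illustration of the multi-plane intuition (hypothetical numbers, not DeepSeek's
# cluster layout): spread each GPU's message shards across independent network planes
# so no single plane has to carry all of the traffic.

N_GPUS, N_PLANES, SHARDS_PER_GPU = 8, 4, 16

plane_load = [0] * N_PLANES
for gpu in range(N_GPUS):
    for shard in range(SHARDS_PER_GPU):
        plane_load[shard % N_PLANES] += 1   # round-robin shard -> plane assignment

print(plane_load)        # [32, 32, 32, 32]: traffic balanced evenly across planes
print(sum(plane_load))   # 128 messages in total; a single shared plane would carry them all
```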

In a sense, the latest DeepSeek V3 paper reads more like a demonstration of technical confidence. It sends a clear signal: even without top-tier hardware, extreme engineering optimization and algorithmic innovation can still produce a large model with industry-leading potential. For everyone wrestling with compute anxiety, this offers a more attainable path up the mountain.

In fact, DeepSeek had already served another "appetizer" on April 30.

That day, DeepSeek Prover V2 was unveiled with 671B parameters; the previous Prover model DeepSeek released had only 7B. Industry observers generally see Prover V2 as an important step for AI into mathematics, a field that assists scientific discovery and, above all, tests the limits of human intelligence.

Placed against the macro backdrop of today's AI model industry, DeepSeek's recent series of moves becomes all the more intriguing.

On the one hand, we see leading players locked in all-out "involution" over model parameters, multimodal capabilities, and application ecosystems; the pace of iteration is dizzying and capital remains enthusiastic. On the other hand, ever-rising compute costs, unclear paths to commercialization, and the question of how to create real value once "intelligence" emerges all hang like swords of Damocles over every practitioner.

Against this backdrop, the "cost-effectiveness" and "hardware-software co-design" emphasized in the DeepSeek V3 paper, together with the deep cultivation of high-barrier niches represented by Prover V2, seem to send a different signal: alongside the race to get bigger and stronger, the relentless pursuit of efficiency and the deep exploration of specific high-value scenarios may become the key variables in the second half of the AI competition.

When the marginal returns of "brute force works miracles" begin to diminish, and the market's attention shifts from the technology itself to its practical value, the players who use existing resources more intelligently, hit real needs more precisely, and understand and solve complex problems more deeply may be the ones who have the last laugh once the noise dies down.

These "appetizers" from DeepSeek undoubtedly aroused the market's appetite. What people are looking forward to is not only a R2 model with stronger performance, but also a DeepSeek that can bring new ideas and new variables to the industry. At the AI ​​card table, this company with technology as its background obviously wants to play more unexpected cards.

"Special statement: The content of the above works (including videos, pictures or audio) is uploaded and published by users of the "Dafenghao" self-media platform under Phoenix.com. This platform only provides information storage space services.

Notice: The content above (including the videos, pictures and audios if any) is uploaded and posted by the user of Dafeng Hao, which is a social media platform and merely provide information storage space services."

[Editor in charge: Zhang Gege PT105]
