Does DeepSeek Break the CUDA Moat?
The DeepSeek-V3 technical report says:
In addition, both dispatching and combining kernels overlap with the computation stream,
so we also consider their impact on other SM computation kernels.
Specifically, we employ customized PTX (Parallel Thread Execution) instructions and
auto-tune the communication chunk size, which significantly reduces the use of the
L2 cache and the interference to other SMs.
Since then, people have been saying that DeepSeek is breaking Nvidia's core moat, CUDA, by using PTX directly. But is that true?
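For readers who have not seen PTX in practice, here is a minimal sketch of what "employing PTX instructions" can look like in source code. It is not DeepSeek's code: the report does not say which instructions they customized, so this example just reads the `%laneid` special register through an inline PTX statement. The names `lane_id_kernel` and `read_lane_ids` are made up for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread reads its lane index within its warp from the PTX special
// register %laneid, using an inline PTX statement instead of CUDA C++.
__global__ void lane_id_kernel(unsigned *out) {
    unsigned lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    out[blockIdx.x * blockDim.x + threadIdx.x] = lane;
}

// Hypothetical host helper: launches one 1D block of n_threads threads
// (n_threads must not exceed the block size limit, typically 1024) and
// copies the lane index recorded by each thread back to the host.
std::vector<unsigned> read_lane_ids(int n_threads) {
    unsigned *d_out = nullptr;
    cudaMalloc(&d_out, n_threads * sizeof(unsigned));
    lane_id_kernel<<<1, n_threads>>>(d_out);
    std::vector<unsigned> h_out(n_threads);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, n_threads * sizeof(unsigned),
               cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main() {
    std::vector<unsigned> ids = read_lane_ids(64);
    printf("thread 0 -> lane %u, thread 33 -> lane %u\n", ids[0], ids[33]);
    return 0;
}
```

In a 1D block like this one, the lane index is the thread index modulo the warp size of 32. The sketch builds with `nvcc` like any other `.cu` file.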