Hi, I'm Kante Yin 👋

R&D Engineer • Open Source Enthusiast • Cat Owner • Sports Fan ⚽️ 🏀 🥊


KubeCon Europe London

Session: Sailing Multi-Host Inference with LWS

[Slides]

Inference workloads are becoming increasingly prevalent and vital in the Cloud Native world. However, serving them is not easy: one of the biggest challenges is that large foundation models cannot fit into a single node, which calls for distributed inference with model parallelism and, in turn, makes serving inference workloads even more complicated.

Read more...

KServe, AIBrix, and llmaz

As a follower of and active contributor to inference platforms, I created the llmaz project to provide a unified inference platform for LLMs, and I also joined the AIBrix community to build the next-gen GenAI infrastructure.

Read more...

Does DeepSeek Break CUDA Moat?

The DeepSeek-V3 technical report says:

In addition, both dispatching and combining kernels overlap with the computation stream, so we also consider their impact on other SM computation kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.

Based on this, people have been claiming that DeepSeek is breaking Nvidia's core moat, CUDA, by employing PTX directly. But is that true?
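
To make the quote concrete: in practice, "customized PTX instructions" are usually embedded inside CUDA C++ via inline assembly rather than written as a standalone replacement for CUDA. The kernel below is a minimal sketch of my own, not DeepSeek's code, showing one such instruction: an ld.global.cs streaming load, a cache-policy hint of the kind that can reduce L2 cache pressure.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical illustration, not DeepSeek's kernel: CUDA C++ lets you embed
// raw PTX via inline assembly. ld.global.cs ("cache streaming") marks the
// loaded data as evict-first, reducing pressure on the L2 cache -- the kind
// of cache-policy tuning the DeepSeek-V3 report alludes to.
__device__ __forceinline__ float load_streaming(const float* ptr) {
    float v;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(v) : "l"(ptr));
    return v;
}

__global__ void scale(const float* __restrict__ in, float* __restrict__ out,
                      float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = alpha * load_streaming(in + i);
    }
}

int main() {
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);  // expect 2.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Note that this still compiles with nvcc and runs on the ordinary Nvidia toolchain: PTX here is an escape hatch within CUDA, not an alternative to it, which already hints at the answer.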

Read more...