10 Principles I've Learned in My Life

Posted on 2025-01-10 in Life • 154 words • 1 minute read

Tags: 2025, thoughts

10 principles I’ve learned in my life — continuously updated. No particular order, and not meant to teach — just sharing my experience.

Dynamo KVRouter

Posted on 2026-01-18 in Work • 785 words • 4 minute read

Tags: 2026, dynamo, ai, inference

Dynamo is a high-throughput, low-latency inference framework for distributed setups. Recently, I’ve been working on optimizing the Dynamo system, copiloted by our own product, the Hive, an evolutionary AI agent. The router is one of the key components in Dynamo, especially the KV-aware router, which seemed like a good starting point after talking with the Dynamo team.

Recap 2025

Posted on 2025-12-30 in Life • 1096 words • 6 minute read

Tags: 2025, recap

Review the Goals of 2025

Looking back on 2025, I set three goals (all about work) at the beginning of the year. It turns out that the results are not that satisfying:

What I Hope to Do in the Long Term

Posted on 2025-11-21 in Work • 545 words • 3 minute read

Tags: 2025, thoughts

It was my birthday last week, but I was too busy with work to write anything. Still, I think it’s a good time to imagine what I want to achieve in the next few years, or in other words, my long-term vision for 2025. Especially with AI changing the world so fast, it’s important to have a clear vision of where I want to go.

What Will AI Look Like in 2030?

Posted on 2025-10-22 in Work • 252 words • 2 minute read

Tags: 2025, ai, thoughts

9 bold conjectures about what AI will be like in 5 years and how it will change our lives:

We're Coming Out of Stealth Mode

Posted on 2025-09-24 in Work • 23 words • 1 minute read

Tags: 2025, ai, Cambridge

After months of hard work, we’re coming out of stealth today! Come join us in building AI agents that democratize algorithms -> Hiverge.ai.

The Headroom of Optimization

Posted on 2025-08-14 in Work • 563 words • 3 minute read

Tags: 2025, optimization

I’ve been working on system optimization these days. I’ve joined a lot of meetings, talked a lot with potential customers, and been questioned a lot by them. Most of them, or even all of them, show great interest in the optimization solution. Who can reject the idea of optimization?

New Chapter in 2025

Posted on 2025-06-29 in Work • 590 words • 3 minute read

Tags: 2025

I joined DaoCloud on July 7th, 2021, almost 4 years ago, and I have to say it has been an amazing journey for me. I’ve learned a lot, experienced a lot, and grown a lot during this time. I am really grateful for all the support and trust DaoCloud has given me; without it, I would not be who I am today.

KubeCon HongKong - Build a Large Model Inference Platform for Heterogeneous Chips Based on vLLM (Keynote)

Posted on 2025-06-11 in Work • 138 words • 1 minute read

Tags: 2025, kubecon, inference, opensource, keynote, talk, hongkong

[Slides] [Project]

With the growing demand for heterogeneous computing power, Chinese users are gradually adopting domestic GPUs, especially for inference. vLLM, the most popular open-source inference project, has drawn widespread attention but does not support domestic chips. Chinese inference engines are still developing in functionality, performance, and ecosystem. In this session, we’ll introduce how to adapt vLLM to support domestic GPUs, enabling acceleration features like PagedAttention, Continuous Batching, and Chunked Prefill. We’ll also cover performance bottleneck analysis and chip operator development to maximize hardware potential. Additionally, Kubernetes has become the standard for container orchestration and is the preferred platform for inference services. We’ll show how to deploy the adapted vLLM engine on Kubernetes using the open-source llmaz project with a few lines of code, and explore how llmaz handles heterogeneous GPU scheduling and our practices for monitoring and elastic scaling.

KubeCon HongKong - New Pattern for Sailing Multi-Host LLM Inference

Posted on 2025-06-10 in Work • 140 words • 1 minute read

Tags: 2025, kubecon, inference, opensource, keynote, talk, hongkong

[Slides] [Project]

Inference workloads are becoming increasingly prevalent and vital in the Cloud Native world. However, serving them is not easy. One of the biggest challenges is that large foundation models, like Llama 3.1-405B or DeepSeek R1, cannot fit into a single node. This calls for distributed inference with model parallelism, which in turn makes serving inference workloads more complicated.
