What I Hope to Do in the Long Term
It was my birthday last week, but I was too busy with work to write anything. Still, I think it’s a good time to imagine what I want to achieve in the next few years — call it my long-term vision for 2025. Especially when AI is changing the world so fast, it’s important to have a clear picture of where I want to go.
If we look at AI development over the past few months or even years, both the giant companies and the startups are building four things: AI terminals, AI browsers, AI agents, and AI IDEs. AI IDEs are cool, and I’ve been using IDEs like VS Code for years (I used to be a Vim user; I switched to VS Code because the Kubernetes code base is huge, my Vim was slow to respond, and I didn’t want to spend more time configuring Vim plugins). However, an IDE isn’t something I want to focus on out of personal preference alone. I want to build something more interesting and more challenging: a system that combines the other three together. I call it the AI-driven operating system.
I use the terminal every day. I like interacting with the computer by typing commands; it makes me feel in control. However, today’s so-called AI terminals break the habits I’ve built over years: they are actually terminal-in-terminal applications, no longer terminal-native. For example, if you use the Gemini CLI, some shortcuts don’t work as expected, the output isn’t compact enough — everything is just not as smooth as running in a native terminal. I hope to change this.
Another drawback of the terminal is that it’s text-only, while today’s generative AI models are multi-modal: they can understand and generate images, videos, audio, and more. That’s why AI browsers are emerging. But again, browsers have their own limitations — they’re not truly interactive. We’ve seen attempts to make browsers more interactive, like Vimium, but they’re still browsers; they cannot break the fundamental limitations of web pages.
That’s why I hope to build an AI-driven operating system that is interactive, multi-modal, and agentic. It should be terminal-native, it should support multi-modal inputs and outputs, and it should be able to act as an agent that helps users complete their tasks, not just passively respond to commands.
To build such a product, several components are necessary: an agentic system for orchestration, an LLM inference engine for serving, a long-term memory system for vectorizing and compressing data, an interactive GUI system for inputs and outputs, and a NotebookLM-like system for understanding the data.
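To make the shape of these components concrete, here is a rough sketch of their interfaces in Python. Every name here is hypothetical — just my guess at the minimal contracts each piece would need, not a real design:

```python
from typing import Protocol, Sequence


class InferenceEngine(Protocol):
    """Local LLM serving: embeddings plus text generation."""

    def embed(self, text: str) -> list[float]: ...
    def generate(self, prompt: str) -> str: ...


class MemorySystem(Protocol):
    """Long-term memory: vectorize and compress data, retrieve by similarity."""

    def store(self, item: bytes, kind: str) -> str: ...  # returns an item id
    def search(self, query_embedding: Sequence[float], top_k: int) -> list[str]: ...


class Agent(Protocol):
    """Orchestration: understand intent, retrieve context, act on the user's behalf."""

    def run(self, goal: str) -> str: ...
```

The point of sketching them as protocols is that each component stays swappable: you could serve models with any local engine, or back the memory system with any vector store, as long as the contract holds.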
Then, once you launch the AIOS, all your data — files, voice recordings, images, videos — will be vectorized and stored in the memory system. You can interact with the system by typing commands, speaking, or even drawing. The AIOS will understand your intent, retrieve relevant data from the memory system, and use the LLM inference engine to generate responses. You can also extend the AIOS via plugins, just like you extend your terminal with shell scripts. It’s also privacy-preserving, because all your data and models are stored and run locally — no need to worry about data leakage. It’s cool, isn’t it?
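The store-then-retrieve flow above can be illustrated with a tiny, self-contained sketch. The "embedding" here is just a bag-of-words counter — a real AIOS would use a local embedding model — but the loop (vectorize on store, rank by similarity on query) is the same idea:

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts. Stands in for a real
    # locally-run embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class Memory:
    """Minimal long-term memory: store vectorized items, retrieve by similarity."""

    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def store(self, text: str) -> None:
        self.items.append((embed(text), text))

    def retrieve(self, query: str, top_k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]


memory = Memory()
memory.store("notes about the kubernetes scheduler")
memory.store("photo of my cat, tagged: cat, sofa")
print(memory.retrieve("where is my cat photo?"))
# → ['photo of my cat, tagged: cat, sofa']
```

In the real system, the retrieved items would be stuffed into the LLM's prompt as context before generating a response — that retrieval-augmented loop is what lets everything run locally without shipping your data anywhere.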
Good luck and happy birthday to me then. Let’s see how far I can go!