
Stop Building AI Platforms | Towards Data Science

Small and medium-sized companies have been successful in building data and ML platforms, but building an AI platform is now far more challenging. This post discusses three key reasons why you should be careful about building an AI platform, and presents my thoughts on promising directions.

Disclaimer: this post is based on personal opinion and does not apply to cloud providers or data/ML SaaS companies. If anything, they should double down on their AI platform research.

Where I’m coming from

In my previous post, From Data Platform to ML Platform, published in Towards Data Science, I shared how a data platform evolves into an ML platform. That journey is similar for most small and medium-sized companies. However, there is no equally clear path for small and medium-sized companies to keep evolving their platforms into AI platforms. On the way up to an AI platform, the road forks in two directions:

  • AI infrastructure: the “new electricity” (AI inference) is most efficient when generated centrally. This is a game for big tech and the large model providers.
  • AI application platform: it is impossible to build a “waterfront house” (an AI platform) on ever-shifting ground. Evolving AI capabilities and new development paradigms make lasting standardization hard to achieve.

Even as AI models continue to evolve, there are still some directions that are likely to remain important. I cover them at the end of this post.

High barriers to AI infrastructure

While Databricks may be several times more efficient than your self-managed Spark jobs, DeepSeek may be 100 times more efficient than your self-hosted LLM inference. Training and serving LLMs requires a much larger infrastructure investment, and control over the model architecture matters a great deal.

Image generated by OpenAI ChatGPT-4o

In this series, I briefly shared the infrastructure behind LLM training, which includes parallel training strategies, topology design, and training acceleration. On the hardware side, most of the cost goes to network setup and high-performance storage services, on top of the high-performance GPUs and TPUs themselves. A cluster needs an additional RDMA network to enable non-blocking, point-to-point connections for data exchange between instances. Orchestration services must support complex job scheduling, failover policies, hardware fault detection, and GPU resource abstraction and pooling. The training SDK needs to support asynchronous checkpointing (sketched below), data processing, and model quantization.
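To make one of those SDK capabilities concrete, here is a minimal sketch of asynchronous checkpointing in PyTorch: the slow disk write happens on a background thread so the training loop is not blocked. The function name and file layout are illustrative only; real systems add sharding, retries, and distributed coordination.

```python
# Minimal sketch of asynchronous checkpointing (illustrative, not a real SDK).
import threading
import torch

def save_checkpoint_async(model: torch.nn.Module, step: int, prefix: str = "ckpt"):
    # Snapshot weights to CPU memory first (fast), so training can resume
    # immediately while the slow disk write happens on a background thread.
    state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

    def _write():
        torch.save({"step": step, "model": state}, f"{prefix}-{step}.pt")

    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer  # call .join() before exit to guarantee the write completed

model = torch.nn.Linear(16, 4)
handle = save_checkpoint_async(model, step=100)
handle.join()
```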

On the model-serving side, model providers often bake inference efficiency into the model development phase. A model provider may have a better quantization strategy that preserves model quality at a significantly smaller model size. Because they control the model architecture, providers can also develop better model-parallelism strategies, which lets them increase the batch size during LLM inference and thus effectively raise GPU utilization (see the rough model below). Additionally, the large LLM players have supply-chain advantages that give them access to cheaper routers, mainframes, and GPU chips. More importantly, stronger control over the model architecture and better model-parallelism capabilities mean model providers can even exploit cheaper GPU devices. GPU depreciation may therefore be a bigger problem for model consumers who rely on open-source models.
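To see why batch size drives GPU utilization, consider a rough memory-bandwidth model of decoding: each decode step must stream the model weights from HBM once regardless of batch size, so tokens per second scale almost linearly with batch until compute saturates. All numbers below are illustrative assumptions, not measurements.

```python
# Rough model of decode throughput vs. batch size (assumptions, not benchmarks).
# Ignores KV-cache traffic and compute limits; real curves flatten at large batch.
WEIGHT_BYTES = 37e9 * 2   # e.g., ~37B active params at FP16
HBM_BANDWIDTH = 4.8e12    # e.g., H200 HBM3e peak, bytes/second

def decode_tokens_per_second(batch_size: int) -> float:
    step_time = WEIGHT_BYTES / HBM_BANDWIDTH  # time to stream weights once
    return batch_size / step_time             # one token per sequence per step

for b in (1, 8, 64):
    print(f"batch={b:>3}: ~{decode_tokens_per_second(b):,.0f} tokens/s")
```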

Take DeepSeek R1 as an example. Assume you run it on an AWS p5e.48xlarge instance, which carries eight NVLink-connected H200 chips and costs about $35 per hour. Suppose you match NVIDIA’s benchmark figure of 151 tokens/second. Generating 1 million output tokens would then cost you about $64 (1,000,000 / (151 × 3600) × $35). And what does DeepSeek charge per million output tokens? Only $2! Assuming DeepSeek runs at a 50% margin, that implies roughly 60 times the efficiency of a self-hosted cloud deployment.
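Here is the same back-of-the-envelope calculation as a script; the instance price, throughput, API price, and margin are all assumptions carried over from the paragraph above, not live quotes.

```python
# Back-of-the-envelope cost comparison (all inputs are assumptions from the text).
instance_cost_per_hour = 35.0    # p5e.48xlarge (8x H200), USD/hour
tokens_per_second = 151          # assumed output throughput for DeepSeek R1

hours_per_million = 1_000_000 / tokens_per_second / 3600
self_hosted = hours_per_million * instance_cost_per_hour
print(f"self-hosted: ${self_hosted:.0f} per 1M output tokens")  # ~$64

api_price = 2.0                  # DeepSeek's listed price per 1M output tokens
provider_cost = api_price * 0.5  # assuming a 50% margin
print(f"efficiency gap: ~{self_hosted / provider_cost:.0f}x")   # ~64x
```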

Therefore, LLM inference really is like electricity. It powers a huge diversity of applications, and it is most efficient when generated centrally. You should still run a self-hosted LLM service for privacy-sensitive use cases, though, just as a hospital keeps an emergency generator.

Constantly shifting ground

If investing in AI infrastructure is a game for the bold, building a lightweight platform for AI applications comes with its own hidden pitfalls. With AI model capabilities evolving rapidly, there is no agreed-upon paradigm for AI applications, and therefore no solid foundation on which to build an application platform.

Image generated by OpenAI ChatGPT-4o

The simple answer is: be patient.

If we take a bird’s-eye view of data and ML platforms, development paradigms only emerge once the algorithms’ capabilities converge.
| Domain | Algorithms emerge | Solutions emerge | Major platforms emerge |
| --- | --- | --- | --- |
| Data platform | 2004: MapReduce (Google) | 2010-2015: Spark, Flink, Presto, Kafka | 2020-now: Databricks, Snowflake |
| ML platform | 2012: ImageNet (AlexNet, CNN breakthrough) | 2015-2017: TensorFlow, PyTorch, scikit-learn | 2018-now: SageMaker, MLflow, Kubeflow, Databricks ML |
| AI platform | 2017: Transformers (“Attention Is All You Need”) | 2020-2022: ChatGPT, Claude, Gemini, DeepSeek | 2023-now: ?? |

After several years of fierce competition, a handful of large model players are still standing in the arena. However, the evolution of AI capabilities has not yet converged. As model capabilities keep developing, existing development paradigms can quickly become obsolete. The large players have only just begun taking stabs at agent development platforms, and new solutions keep popping up like popcorn in the oven. I believe winners will eventually emerge, but for now, standardizing agent development in-house is a risky call for small and medium-sized companies.

Path dependency on old successes

Another challenge in building an AI platform is more subtle. It lies in the mindset of the platform builders: whether there is path dependency on their previous success in building data and ML platforms.

Image generated by OpenAI ChatGPT-4o

As I shared previously, data and ML development paradigms have been well aligned since 2017, and the most critical tasks of an ML platform are standardization and abstraction. However, no such development paradigm has been established for AI applications yet. If teams simply replay their success stories from building data and ML platforms, they may end up prioritizing standardization at the wrong time. Possible directions include:

  • Building an AI model gateway: provide centralized auditing and logging of requests to LLM models (see the sketch after this list).
  • Building an AI agent framework: develop an in-house SDK for building AI agents augmented with connections to the internal ecosystem.
  • Standardizing RAG practice: build standard data-indexing flows to lower the bar for engineers building knowledge services.
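To make the first direction concrete, here is a minimal sketch of an AI model gateway, assuming an OpenAI-compatible upstream: one internal endpoint that forwards chat requests to a provider and writes a centralized audit record. The route, environment variables, and logging setup are illustrative assumptions, not a production design.

```python
# Minimal sketch of an AI model gateway (illustrative, not production-ready).
import json
import logging
import os
import time

import httpx
from fastapi import FastAPI, Request

app = FastAPI()
audit_log = logging.getLogger("llm_audit")
logging.basicConfig(level=logging.INFO)

# Assumed upstream and credentials; swap in your provider of choice.
UPSTREAM = os.environ.get("LLM_UPSTREAM", "https://api.openai.com/v1/chat/completions")

@app.post("/v1/chat/completions")
async def proxy_chat(request: Request):
    payload = await request.json()
    start = time.time()
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(
            UPSTREAM,
            json=payload,
            headers={"Authorization": f"Bearer {os.environ['LLM_API_KEY']}"},
        )
    body = resp.json()
    # Centralized audit record: which model, how long, how many tokens.
    audit_log.info(json.dumps({
        "model": payload.get("model"),
        "latency_s": round(time.time() - start, 3),
        "usage": body.get("usage"),
    }))
    return body
```

Even a thin layer like this keeps auditing and cost attribution in one place while letting teams keep the provider’s native API shape.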

These moves are genuinely valuable, but their ROI depends on the size of your company. Either way, you will face the following challenges:

  • Keeping up with the pace of AI development.
  • Users can easily bypass your abstractions, so adoption is hard to sustain.

If builders of data and ML platforms are like “closet organizers,” AI platform builders now need to be more like “fashion designers”: embracing new ideas, running quick experiments, and even accepting a certain degree of imperfection.

My thoughts on promising directions

Even with all these challenges, bear in mind that working on an AI platform is still rewarding right now, because you have leverage you did not have before:

  • AI’s transformative power is greater than that of data and ML capabilities.
  • The motivation to adopt AI is stronger than ever.

If you choose the right direction and strategy, the transformation you can bring to your organization is significant. Here are my thoughts on directions that are less likely to be disrupted as AI models evolve further. I consider them just as important as the AI platform itself:

  • High-quality, semantically rich data products: data products with high accuracy and accountability, rich descriptions, and trustworthy metrics will only become more impactful as AI models grow.
  • Multi-modal data services: the scalable knowledge services behind MCP servers may require multiple types of databases (OLTP, OLAP, NoSQL, Elasticsearch) to support high-performance data serving. Maintaining a single source of truth, and sustaining performance through continuous reverse-ETL jobs, is a challenge.
  • AI DevOps: AI-centric software development, maintenance, and analytics. The accuracy of code generation has improved dramatically over the past 12 months.
  • Experimentation and monitoring: given the increased uncertainty of AI applications, evaluating and monitoring them becomes even more critical (a minimal sketch follows this list).
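As an illustration of that last point, here is a minimal sketch of an evaluation harness: run a small golden set through your application on every release and track accuracy over time. The `answer` function and the golden set are hypothetical placeholders for your real application and test data.

```python
# Minimal sketch of an LLM-application evaluation harness (placeholders only).
from dataclasses import dataclass

@dataclass
class Case:
    question: str
    expected: str

GOLDEN_SET = [
    Case("What is the capital of France?", "Paris"),
    Case("2 + 2 = ?", "4"),
]

def answer(question: str) -> str:
    # Placeholder for the real application call (RAG pipeline, agent, etc.).
    return {"What is the capital of France?": "Paris", "2 + 2 = ?": "4"}[question]

def evaluate(cases: list[Case]) -> float:
    # Exact match is the simplest metric; real harnesses add rubric or LLM judges.
    hits = sum(answer(c.question).strip() == c.expected for c in cases)
    return hits / len(cases)

if __name__ == "__main__":
    print(f"accuracy: {evaluate(GOLDEN_SET):.0%}")  # re-run on every release
```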

Those are my thoughts on building an AI platform. Let me know what you think. Cheers!
