Abstract

Mixtures of experts (MoEs) are a class of statistical machine learning models that combine several simpler models, known as experts, into a more complex and accurate one. They have been integrated into deep learning architectures to better capture the heterogeneity of the data and to scale up these architectures without increasing the computational cost, and they have become the backbone of important large-scale AI models, including GPT-4 (OpenAI), DeepSeek-V3 (DeepSeek), and Mixtral (Mistral). In a mixture of experts, each expert specializes in a different aspect of the data, and a gating function combines the expert outputs to produce the final output. Parameter and expert estimates therefore play a crucial role, enabling statisticians and data scientists to articulate and make sense of the diverse patterns present in the data. However, the statistical behavior of parameter and expert estimation in mixtures of experts has remained poorly understood, owing to the complex interaction between the gating function and the expert parameters.
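
As a concrete illustration of this structure, the following sketch implements a generic softmax-gated mixture of experts in which gating weights mix the outputs of hypothetical affine experts; it is a minimal illustrative example, not one of the specific models analyzed in the minicourse.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, gate_W, gate_b, expert_params):
    """Softmax-gated mixture of experts for a single input x.

    gate_W, gate_b : gating parameters (k x d matrix, k-vector)
    expert_params  : list of k (A_i, b_i) pairs; each expert is the affine map x -> A_i x + b_i
    Returns the gate-weighted combination of the expert outputs.
    """
    gate_scores = gate_W @ x + gate_b              # one score per expert
    weights = softmax(gate_scores)                 # softmax gating weights, summing to 1
    expert_outputs = np.stack([A @ x + b for (A, b) in expert_params])
    return weights @ expert_outputs                # convex combination of the experts

# Toy usage: d = 2 inputs, k = 3 experts, scalar outputs.
rng = np.random.default_rng(0)
d, k = 2, 3
gate_W, gate_b = rng.normal(size=(k, d)), rng.normal(size=k)
expert_params = [(rng.normal(size=(1, d)), rng.normal(size=1)) for _ in range(k)]
print(moe_forward(rng.normal(size=d), gate_W, gate_b, expert_params))
```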

In the first part of the minicourse, we develop new theoretical insights and tools for understanding mixtures of experts. We propose Voronoi loss functions among parameters and connect the convergence rates of parameter and expert estimation to the solvability of systems of polynomial equations under these loss functions. These theories are then extended to the top-K sparse softmax gating Gaussian MoE, an important model inside several massive deep learning architectures. They further characterize the effect of the sparsity of the gating function on the behavior of parameter estimation and justify the practical benefits of the top-1 sparse softmax gating MoE. Finally, we analyze the least squares estimator (LSE) under a deterministic softmax MoE model in which the data are sampled according to a regression model. We introduce a strong identifiability condition for expert functions and demonstrate that the rates for estimating strongly identifiable experts, such as feed-forward networks with sigmoid or tanh activation functions, are substantially faster than those for polynomial experts, which we show to exhibit a surprisingly slow estimation rate.
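
For concreteness, the following sketch implements the top-K sparse softmax gating rule in its common formulation, where only the K largest gating scores are retained before the softmax so that at most K experts receive nonzero weight; it is an illustrative sketch under that assumption, not the exact estimator studied in our analysis.

```python
import numpy as np

def topk_sparse_softmax_gate(scores, K):
    """Top-K sparse softmax gating: keep the K largest gating scores,
    mask the rest to -inf, and apply a softmax so that only K experts
    receive nonzero weight (top-1 routes all mass to a single expert)."""
    idx = np.argsort(scores)[::-1][:K]                  # indices of the K largest scores
    masked = np.full_like(scores, -np.inf, dtype=float)
    masked[idx] = scores[idx]
    masked = masked - masked[idx].max()                 # stabilize the exponentials
    e = np.exp(masked)                                  # exp(-inf) = 0 zeroes out non-selected experts
    return e / e.sum()

scores = np.array([1.2, -0.3, 0.8, 2.1])
print(topk_sparse_softmax_gate(scores, K=1))   # all mass on the single best expert
print(topk_sparse_softmax_gate(scores, K=2))   # mass shared by the top two experts
```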

In the second part of the minicourse, we leverage the theoretical developments from the first part to investigate the performance of DeepSeekMoE, which stands out because of two unique features: the deployment of a shared expert strategy and the use of a normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to theoretically justify the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency from both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To empirically verify our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks.
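
To make these two features concrete, the following schematic sketch combines a shared expert, which processes every input without gating, with normalized sigmoid gating over the routed experts; the affine experts, the router parameterization, and the function name deepseek_style_layer are illustrative assumptions and do not reproduce DeepSeek's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deepseek_style_layer(x, shared_experts, routed_experts, router_W, K):
    """Schematic MoE layer with (i) shared experts applied to every input and
    (ii) normalized sigmoid gating over the routed experts: sigmoid affinity
    scores are computed per routed expert, the top-K are kept, and their
    scores are renormalized to sum to one.  Experts here are hypothetical
    affine maps; real architectures use feed-forward networks."""
    # Shared experts: always active, no gating.
    shared_out = sum(A @ x + b for (A, b) in shared_experts)

    # Routed experts: normalized sigmoid gating over the top-K affinities.
    affinities = sigmoid(router_W @ x)                   # one score in (0, 1) per routed expert
    top = np.argsort(affinities)[::-1][:K]               # indices of the K selected experts
    gates = affinities[top] / affinities[top].sum()      # normalization of the sigmoid scores
    routed_out = sum(g * (routed_experts[i][0] @ x + routed_experts[i][1])
                     for g, i in zip(gates, top))
    return shared_out + routed_out

# Toy usage: 1 shared expert, 4 routed experts, routing to K = 2 of them.
rng = np.random.default_rng(1)
d = 3
shared = [(rng.normal(size=(1, d)), rng.normal(size=1))]
routed = [(rng.normal(size=(1, d)), rng.normal(size=1)) for _ in range(4)]
router_W = rng.normal(size=(4, d))
print(deepseek_style_layer(rng.normal(size=d), shared, routed, router_W, K=2))
```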