Scaling Capability in Token Space: An Analysis of Large Vision Language Model

Authors
Tenghui Li Guoxu Zhou Xuyang Zhao Qibin Zhao
Paper Information
  • Journal:
    Journal of Machine Learning Research
  • Added to Tracker:
    Dec 30, 2025
Abstract

Large language models have demonstrated predictable scaling behaviors with respect to model parameters and training data. This study investigates whether a similar scaling relationship exists for vision-language models with respect to the number of vision tokens. A mathematical framework is developed to characterize the relationship between the number of vision tokens and the expected divergence of the distance between vision-referencing sequences. The theoretical analysis reveals two distinct scaling regimes: sublinear scaling for smaller numbers of vision tokens and linear scaling for larger numbers. This aligns with a model performance relationship of the form \(S(n) \approx c / n^{\alpha(n)}\), where the scaling exponent relates to the correlation structure among vision token representations. Empirical validations across multiple vision-language benchmarks show that model performance matches the predictions of this scaling relationship. The findings contribute to understanding vision token scaling in transformers through a theoretical framework that complements empirical observations.
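
As a rough illustration of the scaling form above, the sketch below fits the constant c and a single exponent alpha to hypothetical benchmark errors measured at several vision-token counts n. This is a minimal sketch, not the paper's experimental setup: the token counts, error values, the use of a constant exponent (the paper's exponent \(\alpha(n)\) varies with n), and the choice of SciPy's curve_fit are all illustrative assumptions.

import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, c, alpha):
    # Predicted performance gap S(n) = c / n**alpha for n vision tokens.
    return c / n**alpha

# Hypothetical (token count, error) pairs; real values would come from
# evaluating a vision-language model under different vision-token budgets.
n_tokens = np.array([16, 32, 64, 128, 256, 576], dtype=float)
errors = np.array([0.42, 0.33, 0.27, 0.22, 0.18, 0.15])

(c_hat, alpha_hat), _ = curve_fit(scaling_law, n_tokens, errors, p0=(1.0, 0.5))
print(f"fitted c = {c_hat:.3f}, alpha = {alpha_hat:.3f}")
print(f"predicted error at n = 1024 tokens: {scaling_law(1024.0, c_hat, alpha_hat):.3f}")

Under this simplification, a log-log plot of error against token count would be a straight line with slope -alpha; the paper's two regimes correspond to the effective exponent changing as the token count grows.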

Citation Information
APA Format
Tenghui Li, Guoxu Zhou, Xuyang Zhao, & Qibin Zhao. Scaling Capability in Token Space: An Analysis of Large Vision Language Model. Journal of Machine Learning Research.
BibTeX Format
@article{paper677,
  title   = {Scaling Capability in Token Space: An Analysis of Large Vision Language Model},
  author  = {Tenghui Li and Guoxu Zhou and Xuyang Zhao and Qibin Zhao},
  journal = {Journal of Machine Learning Research},
  url     = {https://www.jmlr.org/papers/v26/24-2243.html}
}