Scaling Capability in Token Space: An Analysis of Large Vision Language Model
Authors
Tenghui Li, Guoxu Zhou, Xuyang Zhao, Qibin Zhao
Paper Information
Journal: Journal of Machine Learning Research
Added to Tracker: Dec 30, 2025
Abstract
Large language models have demonstrated predictable scaling behaviors with respect to model parameters and training data. This study investigates whether a similar scaling relationship exists for vision-language models with respect to the number of vision tokens. A mathematical framework is developed to characterize the relationship between the number of vision tokens and the expected divergence of distances between vision-referencing sequences. The theoretical analysis reveals two distinct scaling regimes: sublinear scaling for fewer vision tokens and linear scaling for more vision tokens. This aligns with model performance relationships of the form \(S(n) \approx c / n^{\alpha(n)}\), where the scaling exponent relates to the correlation structure between vision token representations. Empirical validation across multiple vision-language benchmarks shows that model performance matches the predictions of the scaling relationship. The findings contribute to understanding vision token scaling in transformers through a theoretical framework that complements empirical observations.
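The scaling form \(S(n) \approx c / n^{\alpha(n)}\) from the abstract can be checked against benchmark measurements with a simple curve fit. The sketch below is a minimal, hypothetical example: the token counts and scores are invented for illustration, and a constant exponent \(\alpha\) is assumed in place of the paper's token-dependent \(\alpha(n)\).

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data: number of vision tokens n and an error-style benchmark
# score S(n) (lower is better); these values are placeholders, not the paper's.
n_tokens = np.array([16, 36, 64, 144, 256, 576, 1024], dtype=float)
scores = np.array([0.42, 0.35, 0.31, 0.26, 0.23, 0.20, 0.18])

def power_law(n, c, alpha):
    """Scaling-law form S(n) = c / n**alpha with a constant exponent."""
    return c / n**alpha

# Fit c and alpha by nonlinear least squares.
(c_hat, alpha_hat), _ = curve_fit(power_law, n_tokens, scores, p0=(1.0, 0.2))

print(f"fitted c = {c_hat:.3f}, alpha = {alpha_hat:.3f}")
# Compare the fitted curve against the measured scores.
print(np.round(power_law(n_tokens, c_hat, alpha_hat), 3))
```

A change in the fitted exponent between small and large token counts would be consistent with the two scaling regimes described in the abstract.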
Author Details
Tenghui Li
Guoxu Zhou
Xuyang Zhao
Qibin Zhao
Citation Information
APA Format
Tenghui Li, Guoxu Zhou, Xuyang Zhao, & Qibin Zhao. Scaling Capability in Token Space: An Analysis of Large Vision Language Model. Journal of Machine Learning Research.
BibTeX Format
@article{paper677,
  title   = {Scaling Capability in Token Space: An Analysis of Large Vision Language Model},
  author  = {Tenghui Li and Guoxu Zhou and Xuyang Zhao and Qibin Zhao},
  journal = {Journal of Machine Learning Research},
  url     = {https://www.jmlr.org/papers/v26/24-2243.html}
}