Weakly supervised vision pre-training on image-text data
This paper introduces weakly supervised pre-training of vision models on large-scale image-text data, reframing pre-training as a classification task. Casting the problem as classification removes the pairwise similarity computations required by a contrastive loss, reducing the computational burden while delivering a 2.7% increase in accuracy.
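To make the contrast concrete, below is a minimal sketch of the two objectives. It assumes a CLIP-style contrastive loss that builds a batch-by-batch similarity matrix, and a multi-label binary cross-entropy over a fixed vocabulary of caption-derived labels as the classification stand-in; the shapes, the hypothetical `classifier_weight`, and the multi-hot targets are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: B image-text pairs per batch, D embedding dim,
# V-sized label vocabulary derived from captions (illustrative only).
B, D, V = 8, 128, 1000


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style objective: requires a B x B pairwise similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity of every image against every text in the batch.
    logits = image_emb @ text_emb.t() / temperature           # (B, B)
    targets = torch.arange(image_emb.size(0))                 # matches lie on the diagonal
    # Symmetric cross-entropy over image->text and text->image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def classification_loss(image_emb, label_targets, classifier_weight):
    """Classification-style objective: no pairwise similarities needed."""
    logits = image_emb @ classifier_weight.t()                 # (B, V)
    # Multi-label BCE: each image may be associated with several labels.
    return F.binary_cross_entropy_with_logits(logits, label_targets)


if __name__ == "__main__":
    image_emb = torch.randn(B, D)
    text_emb = torch.randn(B, D)
    classifier_weight = torch.randn(V, D)
    # Multi-hot targets standing in for labels extracted from each caption.
    label_targets = (torch.rand(B, V) < 0.01).float()
    print("contrastive:", contrastive_loss(image_emb, text_emb).item())
    print("classification:", classification_loss(image_emb, label_targets, classifier_weight).item())
```

The sketch highlights where the savings come from: the contrastive loss scales with the square of the batch size through its similarity matrix, whereas the classification loss only needs a single linear projection onto a fixed label vocabulary per image.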