Vision evaluations

Are evaluations just for text-to-text models, or is there an efficient way to evaluate image-text models, like MobileClip2 or YOLOE?

At first blush, it seems like that would just mean redoing the training, testing, and validation on the dataset. Or am I missing something? (newbie)
