Computer Vision and Foundation Models

Is the Foundation Models framework mature enough to take input from the Apple Vision framework and generate responses? Something similar to what Google's Gemini does, although on a much smaller scale and for a very specific niche.
