Computer Vision and Foundation Models

Are foundation models mature enough to take input from the Apple Vision framework and generate responses? Something similar to what Google's Gemini does, although on a much smaller scale and for a very specific niche.

So it depends. Currently, the Foundation Models framework only supports text input, so you can't pass an image in directly.

However, a great use case for combining the Apple Vision framework and Foundation Models is OCR (optical character recognition, that is, detecting text in images). You can use the Vision framework to detect text from an image or live camera stream, and then pass that text to Foundation Models to generate responses about it. This should work well for images of things like books, flyers, or notes. A rough sketch of the flow is shown below.
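Here's a minimal sketch of that pipeline, assuming you're on an OS version that includes the Foundation Models framework. The function name `describeText(in:)` and the prompt wording are just illustrative; the Vision portion uses the standard `VNRecognizeTextRequest` API, and the Foundation Models portion uses a `LanguageModelSession` with a plain text prompt.

```swift
import UIKit
import Vision
import FoundationModels

enum OCRSummaryError: Error {
    case invalidImage
}

// Hypothetical helper: recognize text in an image with Vision,
// then ask the on-device model to respond about that text.
func describeText(in image: UIImage) async throws -> String {
    guard let cgImage = image.cgImage else {
        throw OCRSummaryError.invalidImage
    }

    // Step 1: Detect text in the image with the Vision framework.
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate

    let handler = VNImageRequestHandler(cgImage: cgImage)
    try handler.perform([request])

    // Join the top candidate string from each observation.
    let recognizedText = (request.results ?? [])
        .compactMap { $0.topCandidates(1).first?.string }
        .joined(separator: "\n")

    // Step 2: Pass the recognized text to Foundation Models as plain text.
    let session = LanguageModelSession()
    let response = try await session.respond(
        to: "Summarize the following text from a photo:\n\(recognizedText)"
    )
    return response.content
}
```

You could swap the summarization prompt for anything else that fits your niche, such as answering questions about the detected text or extracting structured details from it.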

Here's a sample project from the Vision documentation that shows the OCR part: https://developer.apple.com/documentation/vision/locating-and-displaying-recognized-text
