Is foundation models matured enough to take input from the Apple Vision framework to generate responses? Something similar to what google's gemini does although in a much smaller scale and for a very specific niche.
Vision
RSS for tagApply computer vision algorithms to perform a variety of tasks on input images and video using Vision.
Posts under Vision tag
77 Posts
Selecting any option will automatically load the page
Post
Replies
Boosts
Views
Activity
I'm on Tahoe 26.1 / M3 Macbook Air. I'm using VNDetectFaceRectanglesRequest as properly as possible, as in the minimal command line program attached below. For some reason, I always get:
MLE5Engine is disabled through the configuration
printed. I couldn't find any notes on developer docs saying that VNDetectFaceRectanglesRequest can not use the Apple Neural Engine. I'm assuming there is something wrong with my code however I wasn't able to find any remarks from documentation where it might be. I wasn't able to find the above error message online either. I would appreciate your help a lot and thank you in advance.
The code below accesses the video from AVCaptureDevice.DeviceType.builtInWideAngleCamera. Currently it directly chooses the 0th format which has the largest resolution (Full HD on my M3 MBA) and "4:2:0" color "v" reduced color component spectrum encoding ("420v").
After accessing video, it performs a VNDetectFaceRectanglesRequest. It prints "VNDetectFaceRectanglesRequest completion Handler called" many times, then prints the error message above, then continues printing "VNDetectFaceRectanglesRequest completion Handler called" until the user quits it.
To run it in Xcode, File > New project > Mac command line tool. Pasting the code below, then click on the root file > Targets > Signing & Capabilities > Hardened Runtime > Resource Access > Camera.
A possible explanation could be that either Apple's internal CoreML code for this function works on GPU/CPU only or it doesn't accept 420v as supplied by the Macbook Air camera
import AVKit
import Vision
var videoDataOutput: AVCaptureVideoDataOutput = AVCaptureVideoDataOutput()
var detectionRequests: [VNDetectFaceRectanglesRequest]?
var videoDataOutputQueue: DispatchQueue = DispatchQueue(label: "queue")
class XYZ: /*NSViewController or NSObject*/NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
func viewDidLoad() {
//super.viewDidLoad()
let session = AVCaptureSession()
let inputDevice = try! self.configureFrontCamera(for: session)
self.configureVideoDataOutput(for: inputDevice.device, resolution: inputDevice.resolution, captureSession: session)
self.prepareVisionRequest()
session.startRunning()
}
fileprivate func highestResolution420Format(for device: AVCaptureDevice) -> (format: AVCaptureDevice.Format, resolution: CGSize)? {
let deviceFormat = device.formats[0]
print(deviceFormat)
let dims = CMVideoFormatDescriptionGetDimensions(deviceFormat.formatDescription)
let resolution = CGSize(width: CGFloat(dims.width), height: CGFloat(dims.height))
return (deviceFormat, resolution)
}
fileprivate func configureFrontCamera(for captureSession: AVCaptureSession) throws -> (device: AVCaptureDevice, resolution: CGSize) {
let deviceDiscoverySession = AVCaptureDevice.DiscoverySession(deviceTypes: [AVCaptureDevice.DeviceType.builtInWideAngleCamera], mediaType: .video, position: AVCaptureDevice.Position.unspecified)
let device = deviceDiscoverySession.devices.first!
let deviceInput = try! AVCaptureDeviceInput(device: device)
captureSession.addInput(deviceInput)
let highestResolution = self.highestResolution420Format(for: device)!
try! device.lockForConfiguration()
device.activeFormat = highestResolution.format
device.unlockForConfiguration()
return (device, highestResolution.resolution)
}
fileprivate func configureVideoDataOutput(for inputDevice: AVCaptureDevice, resolution: CGSize, captureSession: AVCaptureSession) {
videoDataOutput.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
captureSession.addOutput(videoDataOutput)
}
fileprivate func prepareVisionRequest() {
let faceDetectionRequest: VNDetectFaceRectanglesRequest = VNDetectFaceRectanglesRequest(completionHandler: { (request, error) in
print("VNDetectFaceRectanglesRequest completion Handler called")
})
// Start with detection
detectionRequests = [faceDetectionRequest]
}
// MARK: AVCaptureVideoDataOutputSampleBufferDelegate
// Handle delegate method callback on receiving a sample buffer.
public func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
var requestHandlerOptions: [VNImageOption: AnyObject] = [:]
let cameraIntrinsicData = CMGetAttachment(sampleBuffer, key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix, attachmentModeOut: nil)
if cameraIntrinsicData != nil {
requestHandlerOptions[VNImageOption.cameraIntrinsics] = cameraIntrinsicData
}
let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer)!
// No tracking object detected, so perform initial detection
let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
orientation: CGImagePropertyOrientation.up, options: requestHandlerOptions)
try! imageRequestHandler.perform(detectionRequests!)
}
}
let X = XYZ()
X.viewDidLoad()
sleep(9999999)
On iOS 26.1, this throws on the 2020 iPad Pro (4th gen) but works fine on an M4 iPad Pro or iPhone 15 Pro:
guard let device = AVCaptureDevice.default(.builtInLiDARDepthCamera, for: .video, position: .back) else {
throw ConfigurationError.lidarDeviceUnavailable
}
It's just the standard code from Apple's own sample code so obviously used to work:
https://developer.apple.com/documentation/AVFoundation/capturing-depth-using-the-lidar-camera
Does it fail because Apple have silently dumped support for the older LiDAR sensor used prior to the M4 iPad Pro, or is there another reason? What about the 5th and 6th gen iPad Pro, does it still work on those?
Hi all! Nice to meet you.,
I am planning to build an iOS application that can:
Capture an image using the camera or select one from the gallery.
Remove the background and keep only the detected main object.
Add a border (outline) around the detected object’s shape.
Apply an animation along that border (e.g., moving light or glowing effect).
Include a transition animation when removing the background — for example, breaking the background into pieces as it disappears.
The app Capword has a similar feature for object isolation, and I’d like to build something like that.
Could you please provide any guidance, frameworks, or sample code related to:
Object segmentation and background removal in Swift (Vision or Core ML).
Applying custom borders and shape animations around detected objects.
Recognizing the object name (e.g., “person”, “cat”, “car”) after segmentation.
Thank you very much for your support.
Best regards,
SINN SOKLYHOR
Hi,
I'm trying to use the new RecognizeDocumentsRequest from the Vision Framework to read a receipt. It looks very promising by being able to read paragraphs, lines and detect data. So far it unfortunately seems to read every line on the receipt as a paragraph and when there is more space on one line it creates two paragraphs.
Is there perhaps an Apple Engineer who knows if this is expected behaviour or if I should file a Feedback for this?
Code setup:
let request = RecognizeDocumentsRequest()
let observations = try await request.perform(on: image)
guard let document = observations.first?.document else {
return
}
for paragraph in document.paragraphs {
print(paragraph.transcript)
for data in paragraph.detectedData {
switch data.match.details {
case .phoneNumber(let data):
print("Phone: \(data)")
case .postalAddress(let data):
print("Postal: \(data)")
case .calendarEvent(let data):
print("Calendar: \(data)")
case .moneyAmount(let data):
print("Money: \(data)")
case .measurement(let data):
print("Measurement: \(data)")
default:
continue
}
}
}
See attached image as an example of a receipt I'd like to parse. The top 3 lines are the name, street, and postal code + city. These are all separate paragraphs. Checking on detectedData does see the street (2nd line) as PostalAddress, but not the complete address. Might that be a location thing since it's a Dutch address.
And lower on the receipt it sees the block with "Pomp 1 95 Ongelood" and the things below also as separate paragraphs. First picking up the left side and after that the right side. So it's something like this:
*
Pomp 1
Volume
Prijs
€
TOTAAL
*
BTW
Netto
21.00 %
95 Ongelood
41,90 l
1.949/ 1
81.66
€
14.17
67.49
Fundamentally, my questions are: is there a known transform I can apply onto a given (pixel) position (passed into a Metal Fragment Function) to correctly sample a texture provided by the main cameras + processed by a Vision request. If so, what is it? If not, how can I accurately sample my masks?
My goal is to highlight people in a Vision Pro app using Compositor Services.
To start, I asynchronously receive camera frames for the main left and right cameras. This is the breakdown of the specific CameraVideoFormat I pass along to the CameraFrameProvider:
minFrameDuration: 0.03
maxFrameDuration: 0.033333335
frameSize: (1920.0, 1080.0)
pixelFormat: 875704422
cameraType: main
cameraPositions: [left, right]
cameraRectification: mono
From each camera frame sample, I extract the left and right buffers (CVReadOnlyPixelBuffer.withUnsafebuffer ==> CVPixelBuffer).
I asynchronously process the extracted buffers by performing a VNGeneratePersonSegmentationRequest on both of them:
// NOTE: This block of code and all following code blocks contain simplified representations of my code for clarity's sake.
var request = VNGeneratePersonSegmentationRequest()
request.qualityLevel = .balanced
request.outputPixelFormat = kCVPixelFormatType_OneComponent8
...
let lHandler = VNSequenceRequestHandler()
let rHandler = VNSequenceRequestHandler()
...
func processBuffers() async {
try lHandler.perform([request], on: lBuffer)
guard let lMask = request.results?.first?.pixelBuffer else {...}
try rHandler.perform([request], on: rBuffer)
guard let rMask = request.results?.first?.pixelBuffer else {...}
appModel.latestPersonMasks = (lMask, rMask)
}
I store the two resulting CVPixelBuffers in my appModel. For both of these buffers aka grayscale masks:
width (in pixels) = 512
height (in pixels) = 384
byters per row = 512
plane count = 0
pixel format type = 1278226488
I am using Compositor Services to render my content in Immersive Space. My implementation of Compositor Services is based off of the same code from Interacting with virtual content blended with passthrough.
Within the Shaders.metal, the tint's Fragment Shader is now passed the grayscale masks (converted from CVPixelBuffer to MTLTexture via CVMetalTextureCacheCreateTextureFromImage() at the beginning of the main render pipeline).
fragment float4 tintFragmentShader(
TintInOut in [[stage_in]],
ushort amp_id [[amplification_id]],
texture2d<uint> leftMask [[texture(0)]],
texture2d<uint> rightMask [[texture(1)]]
)
{
if (in.color.a <= 0.0) {
discard_fragment();
}
float2 uv;
if (amp_id == 0) { // LEFT
uv = ??????????????????????;
} else { // RIGHT
uv = ??????????????????????;
}
constexpr sampler linearSampler (mip_filter::linear, mag_filter::linear, min_filter::linear);
// Sample the PersonSegmentation grayscale mask
float maskValue = 0.0;
if (amp_id == 0) { // LEFT
if (leftMask.get_width() > 0) {
maskValue = rightMask.sample(linearSampler, uv).r;
}
} else { // RIGHT
if (rightMask.get_width() > 0) {
maskValue = rightMask.sample(linearSampler, uv).r;
}
}
if (maskValue > 250) {
return (1.0, 1.0, 1.0, 0.5)
}
return in.color;
}
I need to correctly sample the masks for a given fragment.
The LayerRenderer.Layout is set to .layered. From Developer Documentation.
A layout that specifies each view’s content as a slice of a single texture.
Using the Metal debugger, I know that the final render target texture for each view / eye is 1888 x 1792 pixels, giving an aspect ratio of 59:56.
The initial CVPixelBuffer provided by the main left and right cameras is 1920x1080 (16:9).
The grayscale CVPixelBuffer returned by the VNPersonSegmentationRequest is 512x384 (4:3).
All of these aspect ratios are different.
My questions come down to: is there a known transform I can apply onto a given (pixel) position to correctly sample a texture provided by the main cameras + processed by a Vision request. If so, what is it? If not, how can I accurately sample my masks?
Within the tint's Vertex Shader, after applying the modelViewProjectionMatrix, I have tried every version I have been able to find that takes the pixel space position (= vertices[vertexID].position.xy) and the viewport size (1888x1792) to compute the correct clip space position (maybe = pixel space position.xy / (viewport size * 0.5)???) of the grayscale masks but nothing has worked. The "highlight" of the person segmentations is off: scaled a little too big, offset little to far up and off to the side.
The new Mac virtual display feature on visionOS 2 offers a curved/panoramic window. I was wondering if this is simply a property that can be applied to a window, or if it involves an immersive mode or SceneKit/RealityKit?
Hello,
If you add a ManipulationComponent to a RealityKit entity and then continue to add instructions, sooner or later you will encounter a crash with the following error message:
Attempting to move entity “%s” (%p) under “%s” (%p), but the new parent entity is currently being removed. Changing the parent/child entities of an entity in an event handler while that entity is already being reassigned is not supported.
CoreSimulator 1048 – Device: Apple Vision Pro 4K (B87DD32A-E862-4791-8B71-92E50CE6EC06) – Runtime: visionOS 26.0 (23M336) – Device Type: Apple Vision Pro
The problem occurs precisely with this code:
ManipulationComponent.configureEntity(object)
I adapted Apple's ObjectPlacementExample and made the changes available via GitHub.
The desired behavior is that I add entities to ManipulationComponent and then Realitiykit runs stably and does not crash randomly.
GitHub Repo
Thanks
Andre
I watched this year WWDC25 "Read Documents using the Vision framework". At the end of video there is mention of new DetectHandPoseRequest model for hand pose detection in Vision API.
I looked Apple documentation and I don't see new revision. Moreover probably typo in video because there is only DetectHumanPoseRequst (swift based) and
VNDetectHumanHandPoseRequest (obj-c based) (notice lack of Human prefix in WWDC video)
First one have revision only added in iOS 18+:
https://developer.apple.com/documentation/vision/detecthumanhandposerequest/revision-swift.enum/revision1
Second one have revision only added in iOS14+:
https://developer.apple.com/documentation/vision/vndetecthumanhandposerequestrevision1
I don't see any new revision targeting iOS26+
Hi there,
I have a custom keypoint detection model and want to use it via vision's CoremlRequest API. Here's some complication for input and output:
For input My model expect 512x512 a image. Which would be resized and padded from a 1920x1080 frame. I use the .scaleToFit option, but can I also specify the color used for padding?
For output:
My model output a CoreMLFeatureValueObservation, can I have it output in a format vision recognizes? such as joints/keypoints
If my model is able to output in a format vision recognizes, would it take care to restoring the coordinates back to the original frame? (undo the padding) If not, how do I restore it from .scaletofit option?
Best,
I'm playing with the new Vision API for iOS18, specifically with the new CalculateImageAestheticsScoresRequest API.
When I try to perform the image observation request I get this error:
internalError("Error Domain=NSOSStatusErrorDomain Code=-1 \"Failed to create espresso context.\" UserInfo={NSLocalizedDescription=Failed to create espresso context.}")
The code is pretty straightforward:
if let image = image {
let request = CalculateImageAestheticsScoresRequest()
Task {
do {
let cgImg = image.cgImage!
let observations = try await request.perform(on: cgImg)
let description = observations.description
let score = observations.overallScore
print(description)
print(score)
} catch {
print(error)
}
}
}
I'm running it on a M2 using the simulator.
Is it a bug? What's wrong?
For iPhones 16 and below, orientation is applied in UIImage or CIImage, but not for iPhone 17.
The camera is front-facing, and it uses Vision to capture facial images.
Thanks for your help.
Topic:
Developer Tools & Services
SubTopic:
Apple Developer Program
Tags:
Vision
Camera
Core Image
AVFoundation
I'm receiving output from avcapturesession and capturing an image using Vision, but the image is output in landscape orientation instead of portrait.
Even when I set the orientation to up in ciimage, cgimage, and uiimage, the image is still output in landscape orientation.
On iPhones 16 and below, the image is output in portrait orientation.
But on iPhones 17 and above, the image is output in landscape orientation.
Please help.
A is there a way to get big huge notitifications for Shareplay invitations ?
B can i have the notifications inside the app ?
we have a corporate app to check archtecture projects
we want to share these 3d spaces walking inside with near users in the same place to discuss about the project
.. but it takes too long
shareplay invitation is a small circle on top, if the others users just put the vision without configuring eyes and hands... it's gonna be impossible
thanks for sharing and giving us support
after launching a nearby exoerience on quick look or inside our app, all the user in the group watch sometimes teh model blinking abd becoming transparet...
... just one user hasnt the issue, either the one who launched shareplay or the user who force align the immersive space in front
weird
it looks like one week after accepting as a nearby other AVP device... it expires
since we are providing our clients for a timeless app to walk inside archtiecture, it's a shame that not technical staff should connect every week 5 devices to work together
is there any roundabout for this issue or straight to the wishlist ?
thanks for the support !!
Is the face and body detection service in the Vision framework a local model or a cloud model?
https://developer.apple.com/documentation/vision
Hello,
Does anyone have a recipe on how to raycast VNFaceLandmarkRegion2D points obtained from a frame's capturedImage?
More specifically, how to construct the "from" parameter of the frame's raycastQuery from a VNFaceLandmarkRegion2D point?
Do the points need to be flipped vertically? Is there any other transformation that needs to be performed on the points prior to passing them to raycastQuery?
Hi all, I am interested in unlocking unique applications with the new foundational models. I have a few questions regarding the availability of the following features:
Image Input: The update in June 2025 mentions "image" 44 times (https://machinelearning.apple.com/research/apple-foundation-models-2025-updates) - however I can't seem to find any information about having images as the input/prompt for the foundational models. When will this be available? I understand that there are existing Vision ML APIs, but I want image input into a multimodal on-device LLM (VLM) instead for features like "Which player is holding the ball in the image", etc (image understanding)
Cloud Foundational Model - when will this be available?
Thanks!
Clement :)
Topic:
Machine Learning & AI
SubTopic:
Foundation Models
Tags:
Vision
Machine Learning
Core ML
Apple Intelligence
Is the face and body detection service in the Vision framework a local model or a cloud model? Is there a performance report?
https://developer.apple.com/documentation/vision