Vision

Apply computer vision algorithms to perform a variety of tasks on input images and video using Vision.

Vision Documentation

Post

Replies

Boosts

Views

Activity

builtInLiDARDepthCamera doesn't work on the 2020 iPad Pro on iOS 26

On iOS 26.1, this throws on the 2020 iPad Pro (4th gen) but works fine on an M4 iPad Pro or iPhone 15 Pro:

    guard let device = AVCaptureDevice.default(.builtInLiDARDepthCamera, for: .video, position: .back) else {
        throw ConfigurationError.lidarDeviceUnavailable
    }

It's just the standard code from Apple's own sample code, so it obviously used to work: https://developer.apple.com/documentation/AVFoundation/capturing-depth-using-the-lidar-camera

Does it fail because Apple has silently dropped support for the older LiDAR sensor used prior to the M4 iPad Pro, or is there another reason? What about the 5th and 6th gen iPad Pro: does it still work on those?

Media Technologies Photos & Camera iOS ARKit iPadOS Vision

509

Nov ’25

Inquiry About Building an App for Object Detection, Background Removal, and Animation

Hi all! Nice to meet you.

I am planning to build an iOS application that can:

- Capture an image using the camera or select one from the gallery.
- Remove the background and keep only the detected main object.
- Add a border (outline) around the detected object’s shape.
- Apply an animation along that border (e.g., a moving light or glowing effect).
- Include a transition animation when removing the background, for example breaking the background into pieces as it disappears.

The app Capword has a similar feature for object isolation, and I’d like to build something like that.

Could you please provide any guidance, frameworks, or sample code related to:

- Object segmentation and background removal in Swift (Vision or Core ML).
- Applying custom borders and shape animations around detected objects.
- Recognizing the object name (e.g., “person”, “cat”, “car”) after segmentation.

Thank you very much for your support.

Best regards, SINN SOKLYHOR

Machine Learning & AI General Vision Camera

197

Nov ’25

RecognizeDocumentsRequest for receipts

Hi, I'm trying to use the new RecognizeDocumentsRequest from the Vision framework to read a receipt. It looks very promising, being able to read paragraphs and lines and to detect data. So far, unfortunately, it seems to read every line on the receipt as a paragraph, and when there is more space within one line it creates two paragraphs. Is there perhaps an Apple engineer who knows whether this is expected behaviour or whether I should file a Feedback for this?

Code setup:

    let request = RecognizeDocumentsRequest()
    let observations = try await request.perform(on: image)
    guard let document = observations.first?.document else { return }
    for paragraph in document.paragraphs {
        print(paragraph.transcript)
        for data in paragraph.detectedData {
            switch data.match.details {
            case .phoneNumber(let data): print("Phone: \(data)")
            case .postalAddress(let data): print("Postal: \(data)")
            case .calendarEvent(let data): print("Calendar: \(data)")
            case .moneyAmount(let data): print("Money: \(data)")
            case .measurement(let data): print("Measurement: \(data)")
            default: continue
            }
        }
    }

See the attached image as an example of a receipt I'd like to parse. The top 3 lines are the name, street, and postal code + city. These are all separate paragraphs. Checking detectedData does see the street (2nd line) as a PostalAddress, but not the complete address. Might that be a locale thing, since it's a Dutch address?

Lower on the receipt, it also sees the block with "Pomp 1 95 Ongelood" and the things below it as separate paragraphs, first picking up the left side and after that the right side. So it's something like this:

* Pomp 1 Volume Prijs € TOTAAL
* BTW Netto 21.00 % 95 Ongelood 41,90 l 1.949/ 1 81.66 € 14.17 67.49
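One possible workaround for the left-column-then-right-column ordering described above is to regroup the recognized fragments into visual rows by their vertical position before interpreting them. The sketch below is not part of the Vision API: `TextFragment`, `rows(from:tolerance:)`, and the sample coordinates are hypothetical stand-ins, and it assumes normalized bounding boxes with a lower-left origin (larger y means higher on the page).

```swift
// Hypothetical stand-in for one recognized piece of text; not a Vision type.
struct TextFragment {
    let text: String
    let minX: Double    // left edge, normalized 0...1
    let midY: Double    // vertical center, normalized 0...1, lower-left origin
    let height: Double  // normalized height of the fragment
}

/// Groups fragments into visual rows (top to bottom) and joins each row left to right.
/// Two fragments share a row when their vertical centers differ by at most
/// `tolerance` times the taller fragment's height.
func rows(from fragments: [TextFragment], tolerance: Double = 0.5) -> [String] {
    // With a lower-left origin, a larger midY is closer to the top of the page.
    let sorted = fragments.sorted { $0.midY > $1.midY }
    var groups: [[TextFragment]] = []
    for fragment in sorted {
        if let last = groups.last, let anchor = last.first,
           abs(anchor.midY - fragment.midY) <= tolerance * max(anchor.height, fragment.height) {
            groups[groups.count - 1].append(fragment)
        } else {
            groups.append([fragment])
        }
    }
    return groups.map { row in
        row.sorted { $0.minX < $1.minX }.map(\.text).joined(separator: " ")
    }
}

// Made-up coordinates for illustration only.
let sampleFragments = [
    TextFragment(text: "41,90 l", minX: 0.5, midY: 0.40, height: 0.02),
    TextFragment(text: "Pomp 1", minX: 0.1, midY: 0.45, height: 0.02),
    TextFragment(text: "95 Ongelood", minX: 0.1, midY: 0.401, height: 0.02),
    TextFragment(text: "Volume", minX: 0.5, midY: 0.452, height: 0.02),
]
let sampleRows = rows(from: sampleFragments)
print(sampleRows)
```

The tolerance is a judgment call: a tilted or curled receipt would likely need a larger value or a deskew step first.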

Machine Learning & AI General Vision VisionKit

593

Nov ’25

How-to highlight people in a Vision Pro app using Compositor Services

Fundamentally, my questions are: is there a known transform I can apply to a given (pixel) position (passed into a Metal fragment function) to correctly sample a texture provided by the main cameras and processed by a Vision request? If so, what is it? If not, how can I accurately sample my masks?

My goal is to highlight people in a Vision Pro app using Compositor Services. To start, I asynchronously receive camera frames for the main left and right cameras. This is the breakdown of the specific CameraVideoFormat I pass along to the CameraFrameProvider:

    minFrameDuration: 0.03
    maxFrameDuration: 0.033333335
    frameSize: (1920.0, 1080.0)
    pixelFormat: 875704422
    cameraType: main
    cameraPositions: [left, right]
    cameraRectification: mono

From each camera frame sample, I extract the left and right buffers (CVReadOnlyPixelBuffer.withUnsafeBuffer ==> CVPixelBuffer). I asynchronously process the extracted buffers by performing a VNGeneratePersonSegmentationRequest on both of them:

    // NOTE: This block of code and all following code blocks contain simplified representations of my code for clarity's sake.
    var request = VNGeneratePersonSegmentationRequest()
    request.qualityLevel = .balanced
    request.outputPixelFormat = kCVPixelFormatType_OneComponent8
    ...
    let lHandler = VNSequenceRequestHandler()
    let rHandler = VNSequenceRequestHandler()
    ...
    func processBuffers() async {
        try lHandler.perform([request], on: lBuffer)
        guard let lMask = request.results?.first?.pixelBuffer else {...}
        try rHandler.perform([request], on: rBuffer)
        guard let rMask = request.results?.first?.pixelBuffer else {...}
        appModel.latestPersonMasks = (lMask, rMask)
    }

I store the two resulting CVPixelBuffers in my appModel. Both of these buffers (the grayscale masks) have:

    width (in pixels) = 512
    height (in pixels) = 384
    bytes per row = 512
    plane count = 0
    pixel format type = 1278226488

I am using Compositor Services to render my content in an Immersive Space. My implementation of Compositor Services is based on the code from "Interacting with virtual content blended with passthrough". Within Shaders.metal, the tint's fragment shader is now passed the grayscale masks (converted from CVPixelBuffer to MTLTexture via CVMetalTextureCacheCreateTextureFromImage() at the beginning of the main render pipeline).

    fragment float4 tintFragmentShader(
        TintInOut in [[stage_in]],
        ushort amp_id [[amplification_id]],
        texture2d<uint> leftMask [[texture(0)]],
        texture2d<uint> rightMask [[texture(1)]]
    ) {
        if (in.color.a <= 0.0) {
            discard_fragment();
        }
        float2 uv;
        if (amp_id == 0) { // LEFT
            uv = ??????????????????????;
        } else { // RIGHT
            uv = ??????????????????????;
        }
        constexpr sampler linearSampler (mip_filter::linear, mag_filter::linear, min_filter::linear);
        // Sample the PersonSegmentation grayscale mask
        float maskValue = 0.0;
        if (amp_id == 0) { // LEFT
            if (leftMask.get_width() > 0) {
                maskValue = leftMask.sample(linearSampler, uv).r;
            }
        } else { // RIGHT
            if (rightMask.get_width() > 0) {
                maskValue = rightMask.sample(linearSampler, uv).r;
            }
        }
        if (maskValue > 250) {
            return float4(1.0, 1.0, 1.0, 0.5);
        }
        return in.color;
    }

I need to correctly sample the masks for a given fragment. The LayerRenderer.Layout is set to .layered. From the developer documentation: "A layout that specifies each view’s content as a slice of a single texture."

Using the Metal debugger, I know that the final render target texture for each view / eye is 1888 x 1792 pixels, giving an aspect ratio of 59:56. The initial CVPixelBuffer provided by the main left and right cameras is 1920x1080 (16:9). The grayscale CVPixelBuffer returned by the VNGeneratePersonSegmentationRequest is 512x384 (4:3). All of these aspect ratios are different.

My questions come down to: is there a known transform I can apply to a given (pixel) position to correctly sample a texture provided by the main cameras and processed by a Vision request? If so, what is it? If not, how can I accurately sample my masks?

Within the tint's vertex shader, after applying the modelViewProjectionMatrix, I have tried every version I have been able to find that takes the pixel space position (= vertices[vertexID].position.xy) and the viewport size (1888x1792) to compute the correct clip space position (maybe = pixel space position.xy / (viewport size * 0.5)???) of the grayscale masks, but nothing has worked. The "highlight" of the person segmentations is off: scaled a little too big, and offset a little too far up and off to the side.

Media Technologies Photos & Camera Metal Vision

459

Nov ’25

Curved/panorama window in visionOS 2?

The new Mac virtual display feature on visionOS 2 offers a curved/panoramic window. I was wondering if this is simply a property that can be applied to a window, or if it involves an immersive mode or SceneKit/RealityKit?

Spatial Computing General Vision visionOS

1.5k

Nov ’25

ManipulationComponent create parent/child crash

Hello,

If you add a ManipulationComponent to a RealityKit entity and then continue to add instructions, sooner or later you will encounter a crash with the following error message:

Attempting to move entity “%s” (%p) under “%s” (%p), but the new parent entity is currently being removed. Changing the parent/child entities of an entity in an event handler while that entity is already being reassigned is not supported.

CoreSimulator 1048 – Device: Apple Vision Pro 4K (B87DD32A-E862-4791-8B71-92E50CE6EC06) – Runtime: visionOS 26.0 (23M336) – Device Type: Apple Vision Pro

The problem occurs precisely with this code:

    ManipulationComponent.configureEntity(object)

I adapted Apple's ObjectPlacementExample and made the changes available via GitHub. The desired behavior is that I can add a ManipulationComponent to entities and RealityKit then runs stably and does not crash randomly.

GitHub Repo

Thanks, Andre

Spatial Computing General Vision RealityKit

516

Oct ’25

Updated DetectHandPoseRequest revision from WWDC25 doesn't exist

I watched this year's WWDC25 session "Read Documents using the Vision framework". At the end of the video there is a mention of a new DetectHandPoseRequest model for hand pose detection in the Vision API. I looked at the Apple documentation and I don't see a new revision. Moreover, there is probably a typo in the video, because there are only DetectHumanHandPoseRequest (Swift-based) and VNDetectHumanHandPoseRequest (Obj-C-based); notice the lack of the Human prefix in the WWDC video.

The first one only has a revision added in iOS 18+: https://developer.apple.com/documentation/vision/detecthumanhandposerequest/revision-swift.enum/revision1

The second one only has a revision added in iOS 14+: https://developer.apple.com/documentation/vision/vndetecthumanhandposerequestrevision1

I don't see any new revision targeting iOS 26+.

Machine Learning & AI General Vision

163

Oct ’25

Custom keypoint detection model through vision api

Hi there,

I have a custom keypoint detection model and want to use it via Vision's CoreMLRequest API. There are some complications with the input and output.

For input: my model expects a 512x512 image, which would be resized and padded from a 1920x1080 frame. I use the .scaleToFit option, but can I also specify the color used for padding?

For output: my model outputs a CoreMLFeatureValueObservation. Can I have it output in a format Vision recognizes, such as joints/keypoints? If my model is able to output in a format Vision recognizes, would Vision take care of restoring the coordinates back to the original frame (undoing the padding)? If not, how do I restore them from the .scaleToFit option?

Best,
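For the last question, undoing the padding by hand is plain arithmetic once the letterbox geometry is known. The sketch below assumes that scale-to-fit applies one uniform scale factor and centers the scaled frame inside the model input, with padding split equally on both sides; that assumption should be checked against the actual behavior of the .scaleToFit option. `Point`, `Size`, and `unletterbox` are hypothetical names, not Vision API.

```swift
// Hypothetical plain value types so the sketch has no framework dependencies.
struct Point { var x: Double; var y: Double }
struct Size { var width: Double; var height: Double }

/// Maps a keypoint from letterboxed model-input pixels back to source-frame pixels.
/// Assumes a uniform scale and symmetric (centered) padding.
func unletterbox(_ p: Point, modelSize: Size, sourceSize: Size) -> Point {
    // Scale-to-fit uses the smaller of the two axis ratios so the whole frame fits.
    let scale = min(modelSize.width / sourceSize.width,
                    modelSize.height / sourceSize.height)
    // Padding left over on each axis, split equally between both sides.
    let padX = (modelSize.width - sourceSize.width * scale) / 2
    let padY = (modelSize.height - sourceSize.height * scale) / 2
    return Point(x: (p.x - padX) / scale, y: (p.y - padY) / scale)
}

let model = Size(width: 512, height: 512)
let frame = Size(width: 1920, height: 1080)

// The center of the model input should land on the center of the frame.
let center = unletterbox(Point(x: 256, y: 256), modelSize: model, sourceSize: frame)
print(center.x, center.y)
```

For a 1920x1080 frame in a 512x512 input, the scale is 512/1920, the scaled frame is 512x288, and 112 rows of padding sit above and below it. Keypoints that fall inside the padding map to coordinates outside the frame and are probably best discarded.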

Machine Learning & AI Core ML Vision Core ML

935

Oct ’25

Vision and iOS18 - Failed to create espresso context.

I'm playing with the new Vision API for iOS 18, specifically with the new CalculateImageAestheticsScoresRequest API. When I try to perform the image observation request I get this error:

    internalError("Error Domain=NSOSStatusErrorDomain Code=-1 \"Failed to create espresso context.\" UserInfo={NSLocalizedDescription=Failed to create espresso context.}")

The code is pretty straightforward:

    if let image = image {
        let request = CalculateImageAestheticsScoresRequest()
        Task {
            do {
                let cgImg = image.cgImage!
                let observations = try await request.perform(on: cgImg)
                let description = observations.description
                let score = observations.overallScore
                print(description)
                print(score)
            } catch {
                print(error)
            }
        }
    }

I'm running it on an M2 using the simulator. Is it a bug? What's wrong?

Machine Learning & AI General Vision

1.7k

Sep ’25

Starting with iPhone 17, the output image of avcapturesession is displayed horizontally.

For iPhones 16 and below, orientation is applied in UIImage or CIImage, but not for iPhone 17. The camera is front-facing, and it uses Vision to capture facial images. Thanks for your help.

Developer Tools & Services Apple Developer Program Vision Camera Core Image AVFoundation

272

Sep ’25

vision pro notifications too small for shareplay

A. Is there a way to get much larger notifications for SharePlay invitations?
B. Can I have the notifications inside the app?

We have a corporate app for reviewing architecture projects. We want to share these 3D spaces and walk through them with nearby users in the same place to discuss the project, but it takes too long: the SharePlay invitation is a small circle at the top, and if the other users just put on the Vision Pro without configuring eyes and hands, it's going to be impossible.

Thanks for sharing and giving us support.

UI Frameworks General Vision visionOS

201

Sep ’25

immersive scene blinking on nearby experience on app

After launching a nearby experience in Quick Look or inside our app, all the users in the group sometimes see the model blinking and becoming transparent. Just one user doesn't have the issue: either the one who launched SharePlay or the user who force-aligned the immersive space in front. Weird.

Graphics & Games RealityKit Vision

132

Sep ’25

vision shareplay nearby codes expired

It looks like the pairing expires one week after accepting another AVP device as nearby. Since we are providing our clients with a timeless app for walking inside architecture, it's a shame that non-technical staff have to connect 5 devices every week to work together. Is there any workaround for this issue, or does it go straight to the wishlist? Thanks for the support!

Spatial Computing General Vision

Sep ’25

face and body detection is local model or a cloud model？

Is the face and body detection service in the Vision framework a local model or a cloud model? https://developer.apple.com/documentation/vision

Machine Learning & AI Apple Intelligence Vision

746

Sep ’25

Raycasting VNFaceLandmarkRegion2D

Hello, Does anyone have a recipe on how to raycast VNFaceLandmarkRegion2D points obtained from a frame's capturedImage? More specifically, how to construct the "from" parameter of the frame's raycastQuery from a VNFaceLandmarkRegion2D point? Do the points need to be flipped vertically? Is there any other transformation that needs to be performed on the points prior to passing them to raycastQuery?
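On the flipping question: as I understand the documentation, VNFaceLandmarkRegion2D.normalizedPoints are relative to the face's bounding box, Vision's normalized coordinates use a lower-left origin, and ARFrame's raycastQuery(from:allowing:alignment:) expects a normalized image point with a top-left origin. Under those assumptions, the conversion is a scale into the bounding box followed by a vertical flip. The types and the function name below are hypothetical stand-ins (in place of CGPoint and CGRect) so the sketch stays dependency-free, and it does not handle any image orientation passed to the Vision request.

```swift
// Hypothetical plain value types standing in for CGPoint / CGRect.
struct NormalizedPoint { var x: Double; var y: Double }
struct NormalizedRect { var x: Double; var y: Double; var width: Double; var height: Double }

/// Converts a landmark point (normalized to the face bounding box, lower-left origin)
/// into a point normalized to the whole image with a top-left origin.
func imagePoint(forLandmark p: NormalizedPoint, inFaceBox box: NormalizedRect) -> NormalizedPoint {
    // Step 1: expand from face-box space into image space (still lower-left origin).
    let x = box.x + p.x * box.width
    let yLowerLeft = box.y + p.y * box.height
    // Step 2: flip vertically to get a top-left origin.
    return NormalizedPoint(x: x, y: 1 - yLowerLeft)
}

// Example: a face box covering the middle half of the image horizontally,
// and a landmark at the top-center of that box.
let faceBox = NormalizedRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25)
let landmark = NormalizedPoint(x: 0.5, y: 1.0)
let converted = imagePoint(forLandmark: landmark, inFaceBox: faceBox)
print(converted.x, converted.y)
```

If the Vision request was performed with a non-default orientation, the point would likely need an additional rotation back into the captured image's native orientation before being used for a raycast.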

Media Technologies Photos & Camera ARKit Vision

325

Sep ’25

Foundational Model - Image as Input? Timeline

Hi all,

I am interested in unlocking unique applications with the new foundational models. I have a few questions regarding the availability of the following features:

- Image input: The update in June 2025 mentions "image" 44 times (https://machinelearning.apple.com/research/apple-foundation-models-2025-updates), but I can't seem to find any information about using images as the input/prompt for the foundational models. When will this be available? I understand that there are existing Vision ML APIs, but I want image input into a multimodal on-device LLM (VLM) instead, for image-understanding features like "Which player is holding the ball in the image?"
- Cloud foundational model: When will this be available?

Thanks! Clement :)

Machine Learning & AI Foundation Models Vision Machine Learning Core ML Apple Intelligence

592

Sep ’25

face and body detection in the Vision framework a local model or a cloud model?

Is the face and body detection service in the Vision framework a local model or a cloud model? Is there a performance report? https://developer.apple.com/documentation/vision

Machine Learning & AI Foundation Models Vision

505

Sep ’25

iOS: How to maintain good app icon contrast in grayscale mode?

I’m developing an iOS app, and I’ve noticed that when the user enables Accessibility → Display & Text Size → Color Filters → Grayscale, my app icon loses a lot of visual contrast. The original colored version looks fine, but in grayscale it appears “flat” and harder to distinguish, unlike a pure black-and-white design.

What I want to achieve:

- Ensure the app icon remains visually clear and high-contrast even when iOS renders it in grayscale.
- Ideally, provide an alternate “high-contrast” app icon version when grayscale mode is enabled.

What I’ve tried:

- Increased color contrast in the original icon design.
- Added outlines and stronger shapes.
- Tested with grayscale filters in design tools.
- Researched Asset Catalog and alternate icons, but found no documented API to detect or respond to grayscale mode.

Questions:

- Is there any API in iOS that allows detecting when the system is in grayscale mode so that I can programmatically switch to an alternate app icon?
- If not, are there Apple-recommended best practices for designing app icons that still look clear in grayscale?
- Are there any accessibility guidelines specifically addressing icon design for grayscale or color-blind modes?

Additional info:

- iOS version tested: iOS 17.5
- Development in Swift + SwiftUI, using Asset Catalog for icons.
- I am aware that iOS supports alternate icons via setAlternateIconName, but I haven’t found a trigger for grayscale mode.

Accessibility & Inclusion General Vision Visual Design ColorSync

466

Aug ’25

How to obtain the physical memory size of VisionPro and how much memory is currently available

UI Frameworks SwiftUI Vision VisionKit

132

Aug ’25

Real Time Text detection using iOS18 RecognizeTextRequest from video buffer returns gibberish

Hey Devs,

I'm trying to create my own real-time text detection like this Apple project: https://developer.apple.com/documentation/vision/extracting-phone-numbers-from-text-in-images

I want to use the new iOS 18 RecognizeTextRequest instead of the old VNRecognizeTextRequest in my SwiftUI project. This is my delegate code with the camera setup. I removed the region of interest for debugging, but I'm trying to scan English words in books. The idea is to get one word in the ROI in the future. But I can't even get proper words, so I'm testing without the ROI in case my math is wrong.

    @Observable
    class CameraManager: NSObject, AVCapturePhotoCaptureDelegate ...
        override init() {
            super.init()
            setUpVisionRequest()
        }

        private func setUpVisionRequest() {
            textRequest = RecognizeTextRequest(.revision3)
        }

        ...

        func setup() -> Bool {
            captureSession.beginConfiguration()
            guard let captureDevice = AVCaptureDevice.default(
                .builtInWideAngleCamera, for: .video, position: .back) else {
                return false
            }
            self.captureDevice = captureDevice
            guard let deviceInput = try? AVCaptureDeviceInput(device: captureDevice) else {
                return false
            }
            /// Check whether the session can add input.
            guard captureSession.canAddInput(deviceInput) else {
                print("Unable to add device input to the capture session.")
                return false
            }
            /// Add the input and output to session
            captureSession.addInput(deviceInput)
            /// Configure the video data output
            videoDataOutput.setSampleBufferDelegate(
                self, queue: videoDataOutputQueue)
            if captureSession.canAddOutput(videoDataOutput) {
                captureSession.addOutput(videoDataOutput)
                videoDataOutput.connection(with: .video)?
                    .preferredVideoStabilizationMode = .off
            } else {
                return false
            }
            // Set zoom and autofocus to help focus on very small text
            do {
                try captureDevice.lockForConfiguration()
                captureDevice.videoZoomFactor = 2
                captureDevice.autoFocusRangeRestriction = .near
                captureDevice.unlockForConfiguration()
            } catch {
                print("Could not set zoom level due to error: \(error)")
                return false
            }
            captureSession.commitConfiguration()
            // potential issue with background vs dispatchqueue ??
            Task(priority: .background) {
                captureSession.startRunning()
            }
            return true
        }
    }

    // Issue here ???
    extension CameraManager: AVCaptureVideoDataOutputSampleBufferDelegate {
        func captureOutput(
            _ output: AVCaptureOutput,
            didOutput sampleBuffer: CMSampleBuffer,
            from connection: AVCaptureConnection
        ) {
            guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
            Task {
                textRequest.recognitionLevel = .fast
                textRequest.recognitionLanguages = [Locale.Language(identifier: "en-US")]
                do {
                    let observations = try await textRequest.perform(on: pixelBuffer)
                    for observation in observations {
                        let recognizedText = observation.topCandidates(1).first
                        print("recognized text \(recognizedText)")
                    }
                } catch {
                    print("Recognition error: \(error.localizedDescription)")
                }
            }
        }
    }

The results I get look like this (a full page of English from a book):

    recognized text Optional(RecognizedText(string: e bnUI W4, confidence: 0.5))
    recognized text Optional(RecognizedText(string: ?'U, confidence: 0.3))
    recognized text Optional(RecognizedText(string: traQt4, confidence: 0.3))
    recognized text Optional(RecognizedText(string: li, confidence: 0.3))
    recognized text Optional(RecognizedText(string: 15,1,#, confidence: 0.3))
    recognized text Optional(RecognizedText(string: jllÈ, confidence: 0.3))
    recognized text Optional(RecognizedText(string: vtrll, confidence: 0.3))
    recognized text Optional(RecognizedText(string: 5,1,: 11, confidence: 0.5))
    recognized text Optional(RecognizedText(string: 1141, confidence: 0.3))
    recognized text Optional(RecognizedText(string: jllll ljiiilij41, confidence: 0.3))
    recognized text Optional(RecognizedText(string: 2f4, confidence: 0.3))
    recognized text Optional(RecognizedText(string: ktril, confidence: 0.3))
    recognized text Optional(RecognizedText(string: ¥LLI, confidence: 0.3))
    recognized text Optional(RecognizedText(string: 11[Itl,, confidence: 0.3))
    recognized text Optional(RecognizedText(string: 'rtlÈ131, confidence: 0.3))

Even with the ROI set to a specific rectangle normalized to Vision coordinates, I get the same results, with single characters returning gibberish. Any help would be amazing, thank you.

Am I using the buffer right? Am I using the new perform(on: CVPixelBuffer) right? Maybe I didn't set up my camera properly? I can provide more code.

Machine Learning & AI General Vision

363

Jul ’25

builtInLiDARDepthCamera doesn't work on the 2020 iPad Pro on iOS 26

Media Technologies Photos & Camera iOS ARKit iPadOS Vision

Replies: 2
Boosts: 0
Views: 509
Activity: Nov ’25

Inquiry About Building an App for Object Detection, Background Removal, and Animation

Machine Learning & AI General Vision Camera

Replies: 0
Boosts: 0
Views: 197
Activity: Nov ’25

RecognizeDocumentsRequest for receipts

Machine Learning & AI General Vision VisionKit

Replies: 3
Boosts: 1
Views: 593
Activity: Nov ’25

How-to highlight people in a Vision Pro app using Compositor Services

Media Technologies Photos & Camera Metal Vision

Replies: 1
Boosts: 0
Views: 459
Activity: Nov ’25

Curved/panorama window in visionOS 2?

Spatial Computing General Vision visionOS

Replies: 5
Boosts: 0
Views: 1.5k
Activity: Nov ’25

ManipulationComponent create parent/child crash

Spatial Computing General Vision RealityKit

Replies: 3
Boosts: 0
Views: 516
Activity: Oct ’25

Updated DetectHandPoseRequest revision from WWDC25 doesn't exist

Machine Learning & AI General Vision

Replies: 0
Boosts: 0
Views: 163
Activity: Oct ’25

Custom keypoint detection model through vision api

Machine Learning & AI Core ML Vision Core ML

Replies: 1
Boosts: 0
Views: 935
Activity: Oct ’25

Vision and iOS18 - Failed to create espresso context.

Machine Learning & AI General Vision

Replies: 3
Boosts: 1
Views: 1.7k
Activity: Sep ’25

Starting with iPhone 17, the output image of avcapturesession is displayed horizontally.

Developer Tools & Services Apple Developer Program Vision Camera Core Image AVFoundation

Replies: 1
Boosts: 0
Views: 272
Activity: Sep ’25

vision pro notifications too small for shareplay

UI Frameworks General Vision visionOS

Replies: 4
Boosts: 0
Views: 201
Activity: Sep ’25

immersive scene blinking on nearby experience on app

Graphics & Games RealityKit Vision

Replies: 0
Boosts: 0
Views: 132
Activity: Sep ’25

vision shareplay nearby codes expired

Spatial Computing General Vision

Replies: 0
Boosts: 0
Views: 82
Activity: Sep ’25

face and body detection is local model or a cloud model？

Machine Learning & AI Apple Intelligence Vision

Replies: 1
Boosts: 0
Views: 746
Activity: Sep ’25

Raycasting VNFaceLandmarkRegion2D

Media Technologies Photos & Camera ARKit Vision

Replies: 4
Boosts: 0
Views: 325
Activity: Sep ’25

Foundational Model - Image as Input? Timeline

Machine Learning & AI Foundation Models Vision Machine Learning Core ML Apple Intelligence

Replies: 1
Boosts: 0
Views: 592
Activity: Sep ’25

face and body detection in the Vision framework a local model or a cloud model?

Machine Learning & AI Foundation Models Vision

Replies: 1
Boosts: 0
Views: 505
Activity: Sep ’25

iOS: How to maintain good app icon contrast in grayscale mode?

Accessibility & Inclusion General Vision Visual Design ColorSync

Replies: 0
Boosts: 1
Views: 466
Activity: Aug ’25

How to obtain the physical memory size of VisionPro and how much memory is currently available

UI Frameworks SwiftUI Vision VisionKit

Replies: 0
Boosts: 0
Views: 132
Activity: Aug ’25

Real Time Text detection using iOS18 RecognizeTextRequest from video buffer returns gibberish

Machine Learning & AI General Vision

Replies: 1
Boosts: 0
Views: 363
Activity: Jul ’25
