I've been dealing with a puzzling issue for some time now, and I’m hoping someone here might have insights or suggestions.
The Problem:
We’re observing an occasional crash in our app that seems to originate from the Vision framework.
Frequency: It happens randomly, after many successful executions of the same code. It's hard to tell how long the app had been running before each crash, but in some cases it ran for about a month without any issues.
Devices: The issue doesn't seem device-dependent (we’ve seen it on various iPad models).
OS Versions: The crashes started occurring with iOS 18.0.1 and are still present in 18.1 and 18.1.1.
What I suspect: The crash logs point to a potential data race within the Vision framework.
The relevant section of the code where the crash happens:
guard let cgImage = image.cgImage else {
    throw ...
}
let request = VNCoreMLRequest(model: visionModel)
try VNImageRequestHandler(cgImage: cgImage).perform([request]) // <- the line causing the crash
Since the code is rather simple, I'm not sure what else could be missing here.
The images sent here are uniform (fixed size).
The model is loaded and working; the crash occurs randomly after a period of time, after the call has already succeeded many times. Also, the model variable is not an optional.
Here is the crash log:
libobjc.A objc_exception_throw
CoreFoundation -[NSMutableArray removeObjectsAtIndexes:]
Vision -[VNWeakTypeWrapperCollection _enumerateObjectsDroppingWeakZeroedObjects:usingBlock:]
Vision -[VNWeakTypeWrapperCollection addObject:droppingWeakZeroedObjects:]
Vision -[VNSession initWithCachingBehavior:]
Vision -[VNCoreMLTransformer initWithOptions:model:error:]
Vision -[VNCoreMLRequest internalPerformRevision:inContext:error:]
Vision -[VNRequest performInContext:error:]
Vision -[VNRequestPerformer _performOrderedRequests:inContext:error:]
Vision -[VNRequestPerformer _performRequests:onBehalfOfRequest:inContext:error:]
Vision -[VNImageRequestHandler performRequests:gatheredForensics:error:]
OurApp ModelWrapper.perform
And I'm a bit lost at this point; I've tried everything I could imagine so far.
I've tried putting a symbolic breakpoint on removeObjectsAtIndexes to check whether a library we use (e.g. a crash reporter) swapped in its own implementation. None did, and if anything were swizzling the method, I'd expect it to show up in the stack trace before the original code is called. I also looked into the preceding functions and noticed a lock in one of the Vision methods, so as I understand it, a data race in this code shouldn't be possible at all. I've also put breakpoints in the NSLock variants to check for swizzling or an override via a category that might interfere with the locking; again, nothing was there.
There is also another model running on a separate queue, but after seeing the lock in the debugger, I don't think it could cause a problem, at least not in this specific spot.
Is there something I'm missing here, or something I'm doing wrong?
Thanks in advance for your help!
If I set UIApplicationPreferredDefaultSceneSessionRole to UISceneSessionRoleImmersiveSpaceApplication, my immersive space for the image works fine. But when I use UIWindowSceneSessionRoleApplication instead and try to open the immersive space from a particular sub-screen, the image isn't shown (the immersive space doesn't open).
Does anyone have an idea what the issue is?
My Info.plist value is as described above.
Hi all,
I am developing an app that scans barcodes using VisionKit, but I am facing some difficulties.
The accuracy is not where I hoped it would be. Changing the “qualityLevel” parameter from balanced to accurate made barcode reading slightly better, but it still misreads some codes. I previously implemented the same barcode scanning app with AVFoundation, and it had much better accuracy: when I tested them, barcodes that were read correctly with AVFoundation were read incorrectly with VisionKit. Is there any way to improve barcode reading accuracy in VisionKit, or is this built in and out of the developer's control? Either way, any ideas on how to improve reading accuracy would help.
Thanks in advance!
When I try to open an immersive space, I get the following error:
HALC_ProxyIOContext::IOWorkLoop: skipping cycle due to overload
Any idea how to solve it?
Hi everyone,
I'm working on an iOS app that uses VisionKit and I'm exploring the .visualLookUp feature. Specifically, I want to extract the detailed information that Visual Look Up provides after identifying an object in an image (e.g., if the object is a flower, retrieve its name; if it’s a clothing tag, get the tag's content).
I'm working with the Vision framework to detect barcodes. I tested both EAN-13 and Data Matrix detection, and both work fine except for the QuadrilateralProviding values in the returned BarcodeObservation. The topLeft, topRight, bottomRight, and bottomLeft coordinates are rotated 90° counterclockwise (the physical bottom-left corner of the Data Matrix, i.e. the corner of the "L", is returned as the topLeft point in the observation). The same behaviour happens with EAN-13 barcodes.
Has anyone else experienced the same orientation issue? Is this expected behaviour, or should we expect a fix in a future release of the Vision framework?
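If the corners are consistently shifted by one position, one possible stopgap is to relabel them before use. The sketch below is a workaround under explicit assumptions, not a fix: it uses hypothetical `Point`/`Quad` types in place of the observation's corner properties, and it assumes the returned corners are a one-step cyclic shift that keeps the winding order (returned topLeft is the physical bottomLeft, returned topRight is the physical topLeft, and so on). A separate thing worth ruling out, also an assumption, is an orientation mismatch between the image handed to the request and how the camera buffer is actually oriented, since that could produce this kind of 90° offset.

```swift
/// Simple 2D point, standing in for CGPoint so the sketch has no framework dependencies.
struct Point: Equatable {
    var x: Double
    var y: Double
}

/// Hypothetical stand-in for the four corners exposed by a barcode observation.
struct Quad: Equatable {
    var topLeft: Point
    var topRight: Point
    var bottomRight: Point
    var bottomLeft: Point
}

/// Undoes a one-step cyclic shift of the corner labels.
/// Assumed shift (verify against your own observations first):
///   returned.topLeft     == physical.bottomLeft
///   returned.topRight    == physical.topLeft
///   returned.bottomRight == physical.topRight
///   returned.bottomLeft  == physical.bottomRight
func relabelShiftedCorners(_ returned: Quad) -> Quad {
    Quad(topLeft: returned.topRight,
         topRight: returned.bottomRight,
         bottomRight: returned.bottomLeft,
         bottomLeft: returned.topLeft)
}
```

If the shift turns out to go the other way, or the quad is mirrored, the mapping in `relabelShiftedCorners` has to change accordingly.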
I'm trying to set up Facebook AI's "Segment Anything" MLModel to compare its performance and efficacy on-device against the Vision library's Foreground Instance Mask Request.
The Vision request accepts any reasonably sized image and has a method to produce output at the same resolution as the input image. By contrast, the Segment Anything MLModel takes a 1024x1024 image as input and produces a 1024x1024 output.
What is the best way to work with non-square images, such as 4:3 camera photos? I can basically think of 3 methods for accomplishing this:
Scale the image to 1024x1024, ignoring aspect ratio, then inversely scale the output back to the original size. However, I have a big concern that squashing the content will result in poor inference results.
Scale the image, preserving its aspect ratio, so that its minimum dimension is 1024, then run the model multiple times on a sliding 1024x1024 window and aggregate the results. My main concern here is the complexity of de-duplicating the output, since each run could produce different outputs depending on how objects are cropped.
Fit the image within 1024x1024 and pad with black pixels to make a square. I'm not sure if the border will muck up the inference.
Anyway, this seems like it must be a well-solved problem in ML, but I'm having difficulty finding an authoritative best practice.
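The third option (usually called letterboxing) is easy to undo because the mapping is one uniform scale plus an offset. The sketch below computes that mapping for a 1024x1024 input and maps a point in the model's output square back to the original image. `Letterbox` and `toOriginal` are hypothetical names. As far as I know, the reference Segment Anything preprocessing resizes the longest side to 1024 and pads only on the bottom and right, so that is the default here; check the preprocessing of the specific Core ML conversion you use. This says nothing about whether padding degrades the results.

```swift
/// Describes how an image is fit into a square model input (e.g. 1024x1024)
/// by scaling its longest side to `side` and padding the rest.
struct Letterbox {
    let scale: Double        // uniform scale from original pixels to model pixels
    let scaledWidth: Double  // size of the image content inside the square
    let scaledHeight: Double
    let offsetX: Double      // padding before the content, in model pixels
    let offsetY: Double

    /// `centered == false` pads only on the right/bottom; `true` splits the padding evenly.
    init(width: Double, height: Double, side: Double = 1024, centered: Bool = false) {
        let s = side / max(width, height)
        let w = (width * s).rounded()
        let h = (height * s).rounded()
        scale = s
        scaledWidth = w
        scaledHeight = h
        offsetX = centered ? ((side - w) / 2).rounded(.down) : 0
        offsetY = centered ? ((side - h) / 2).rounded(.down) : 0
    }

    /// Maps a point in the model's input/output square back to original-image pixels.
    /// Returns nil for points that fall in the padding.
    func toOriginal(x: Double, y: Double) -> (x: Double, y: Double)? {
        let cx = x - offsetX
        let cy = y - offsetY
        guard cx >= 0, cy >= 0, cx < scaledWidth, cy < scaledHeight else { return nil }
        return (cx / scale, cy / scale)
    }
}
```

For a 4032x3024 photo this gives a 1024x768 content region. To bring a whole mask back, crop the 1024x1024 output to (offsetX, offsetY, scaledWidth, scaledHeight) and resize that crop to the original size, which is the same mapping applied to every pixel.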
I'm trying to analyze images in my Photos library with the following code:
func analyzeImages(_ inputIDs: [String]) {
    let manager = PHImageManager.default()
    let option = PHImageRequestOptions()
    option.isSynchronous = true
    option.isNetworkAccessAllowed = true
    option.resizeMode = .none
    option.deliveryMode = .highQualityFormat
    let concurrentTasks = 1
    let clock = ContinuousClock()
    let duration = clock.measure {
        let group = DispatchGroup()
        let sema = DispatchSemaphore(value: concurrentTasks)
        for entry in inputIDs {
            if let asset = PHAsset.fetchAssets(withLocalIdentifiers: [entry], options: nil).firstObject {
                print("analyzing asset: \(entry)")
                manager.requestImage(for: asset, targetSize: PHImageManagerMaximumSize, contentMode: .aspectFit, options: option) { (result, info) in
                    if let result = result {
                        Task {
                            print("retrieved asset: \(entry)")
                            let aestheticsRequest = CalculateImageAestheticsScoresRequest()
                            let fingerprintRequest = GenerateImageFeaturePrintRequest()
                            let inputImage = result.cgImage!
                            let handler = ImageRequestHandler(inputImage)
                            let (aesthetics, fingerprint) = try await handler.perform(aestheticsRequest, fingerprintRequest)
                            // save Results
                            print("finished asset: \(entry)")
                        }
                    } else {
                        // ...
                    }
                }
            }
        }
    }
    print("analyzeImages: Duration \(duration)")
}
When running this code, only two requests are processed simultaneously (due to the semaphore). However, if I call the function with a large list of images (>100), memory usage balloons to over 1.6 GB and the app crashes. If I call it with fewer images, the loop completes and the memory is freed.
When I use Instruments to look for memory leaks, it reports no leaks, but there are 150+ VM: IOSurface allocations from CMPhoto, CoreVideo, and CoreGraphics at 35 MB each. Shouldn't each surface be released when its task completes?
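The fact that the memory is freed once the loop completes is consistent with autoreleased objects (such as the decoded images handed to the callback) piling up until the enclosing autorelease pool drains; a commonly suggested remedy is wrapping the body of each loop iteration in `autoreleasepool { }`. That is a hypothesis, not a confirmed diagnosis. Another option is to replace the semaphore plus `Task { }` combination with structured concurrency, so that at most N items are in flight and a new image is only requested once a slot frees up. The helper below is a hypothetical, generic sketch (`forEachBounded` is not a system API); the per-item closure is where the Photos request and Vision calls would go.

```swift
/// Runs `body` on every element of `items` with at most `limit` calls in flight,
/// returning the results in the original order. A new item is only started
/// after a running one has finished.
func forEachBounded<T: Sendable, R: Sendable>(
    _ items: [T],
    limit: Int,
    body: @escaping @Sendable (T) async throws -> R
) async throws -> [R] {
    precondition(limit > 0, "limit must be positive")
    var results = [R?](repeating: nil, count: items.count)
    try await withThrowingTaskGroup(of: (Int, R).self) { group in
        var next = 0
        // Start up to `limit` tasks.
        while next < min(limit, items.count) {
            let index = next
            group.addTask {
                let value = try await body(items[index])
                return (index, value)
            }
            next += 1
        }
        // Each time a task finishes, store its result and start the next item.
        while let finished = try await group.next() {
            results[finished.0] = finished.1
            if next < items.count {
                let index = next
                group.addTask {
                    let value = try await body(items[index])
                    return (index, value)
                }
                next += 1
            }
        }
    }
    return results.map { $0! }
}
```

Because each child task holds its image only until it returns, peak memory should scale with `limit` rather than with the number of identifiers, provided nothing else keeps the images alive.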
Which projection method does an immersive space use: ERP (equirectangular), fisheye, or cube map?
We want to achieve the same effect as Apple Immersive Video.
Hi everyone,
I'm working on integrating object recognition from live video feeds into my existing app by following Apple's sample code. My original project captures video and records it successfully. However, after integrating the Vision-based object detection components (VNCoreMLRequest), no detections occur, and the callback for the request is never triggered.
To debug this issue, I’ve added the following functionality:
Set up AVCaptureVideoDataOutput for processing video frames.
Created a VNCoreMLRequest using my Core ML model.
The video recording functionality works as expected, but no object detection happens. I’d like to know:
How can I debug this further? Which key debug points or logs could help identify where the issue lies?
Have I missed any key configurations? Below is a diff of the modifications I’ve made to my project for the new feature.
Diff of Changes:
Specific Observations:
The captureOutput method is invoked correctly, but there is no output or error from the Vision request callback.
Print statements in my setup function setForVideoClassify() show that the setup executes without errors.
Could this be due to issues with my Core ML model compatibility or configuration?
Is the VNCoreMLRequest setup incorrect, or do I need to ensure specific image formats for processing?
Xcode 16.1, iOS 18.1, Swift 5, SwiftUI, iPhone 11
Darwin MacBook-Pro.local 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:02:27 PDT 2024; root:xnu-11215.41.3~2/RELEASE_X86_64 x86_64
Any guidance or advice is appreciated! Thanks in advance.
I decided to use a club to hit a ball and let it roll on the turf in RealityKit, but the ball only slides; it doesn't roll.
I added collision to the turf (static), the club (kinematic), and the ball (dynamic), and set some parameters: radius and mass.
Using these parameters, I calculate the linear damping and inertia; I also use the time between frames and the club's position to calculate its speed. The code looks like this:
let radius: Float = 0.025
let mass: Float = 0.04593 // mass, in kg
let inertia = 2 / 5 * mass * pow(radius, 2) // solid sphere: I = 2/5 * m * r^2
let currentPosition = entity.position(relativeTo: nil)
let travelled = distance(currentPosition, rgfc.lastPosition) // renamed so it doesn't shadow simd's distance(_:_:)
let deltaTime = Float(context.deltaTime)
let speed = travelled / deltaTime
let C_d: Float = 0.47 // drag coefficient
// linear damping from 0.5 * ρ * v² * A * C_d, with ρ = 1.2 kg/m³ (air density)
// Note: this expression is an aerodynamic drag force in newtons, not a damping coefficient.
let linearDamping = 0.5 * 1.2 * pow(speed, 2) * .pi * pow(radius, 2) * C_d
entity.components[PhysicsBodyComponent.self]?.massProperties.inertia = SIMD3<Float>(inertia, inertia, inertia)
entity.components[PhysicsBodyComponent.self]?.linearDamping = linearDamping
// force
let acceleration = speed / deltaTime
let forceDirection = normalize(currentPosition - rgfc.lastPosition)
let forceMultiplier: Float = 1.0
let appliedForce = forceDirection * mass * acceleration * forceMultiplier
entityCollidedWith.addForce(appliedForce, at: rgfc.hitPosition, relativeTo: nil)
I also tried applying an impulse instead of using addForce:
let linearImpulse = forceDirection * speed * forceMultiplier * mass
No matter how I adjust the friction (static and dynamic) and restitution, and whether I use addForce or applyImpulse, the ball only slides. How can I solve this problem?
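One general physics point that may explain the sliding: a force or impulse whose line of action passes through the ball's center produces no torque, so the ball gets linear velocity but no spin, and only friction with the turf can spin it up afterwards. A horizontal push at the ball's equator, along the direction of travel, is such a case. A common approach is to also give the ball the angular velocity it would have when rolling without slipping, ω = (1/r) · (up × v). The pure-Swift sketch below computes that (plus the solid-sphere inertia the code above uses) with a hypothetical `Vec3` type and assumes +Y is up. Applying the result in RealityKit, for example through `PhysicsMotionComponent`'s `angularVelocity` when the ball is struck, is an assumption to try rather than a confirmed fix.

```swift
/// Minimal 3D vector so the sketch has no framework dependencies.
struct Vec3: Equatable {
    var x: Double, y: Double, z: Double

    static func * (v: Vec3, s: Double) -> Vec3 { Vec3(x: v.x * s, y: v.y * s, z: v.z * s) }
    static func - (a: Vec3, b: Vec3) -> Vec3 { Vec3(x: a.x - b.x, y: a.y - b.y, z: a.z - b.z) }
    func dot(_ o: Vec3) -> Double { x * o.x + y * o.y + z * o.z }
    func cross(_ o: Vec3) -> Vec3 {
        Vec3(x: y * o.z - z * o.y, y: z * o.x - x * o.z, z: x * o.y - y * o.x)
    }
}

/// Moment of inertia of a solid sphere about an axis through its center: I = 2/5 * m * r^2.
func solidSphereInertia(mass: Double, radius: Double) -> Double {
    0.4 * mass * radius * radius
}

/// Angular velocity for rolling without slipping on a flat surface whose unit normal is `up`.
/// The contact point sits at -r * up, so its velocity v + ω × (-r * up) is zero
/// when ω = (1 / r) * (up × v), with v restricted to the surface plane.
func rollingAngularVelocity(linearVelocity v: Vec3, radius: Double,
                            up: Vec3 = Vec3(x: 0, y: 1, z: 0)) -> Vec3 {
    let planar = v - up * v.dot(up)   // drop any component along the surface normal
    return up.cross(planar) * (1 / radius)
}
```

For a ball moving at 2 m/s along +X with r = 0.025 m, this gives a spin of 80 rad/s about -Z. RealityKit works in `SIMD3<Float>`, so the same formulas would need converting to that type.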
I'm seeking detailed information about the rotation matrix of the iPhone's front-facing (selfie) camera when using ARKit.
Specifically, I need to understand:
The exact rotation matrix applied to the front-facing camera's output in ARKit.
Whether this matrix is consistent across all iPhone models or if there are variations.
If there are any transformations applied to align the camera's coordinate system with the device's orientation, particularly in portrait mode.
How this rotation matrix relates to the `transform` property of `ARFrame.camera`.
I am using Apple’s Vision framework with DetectHorizonRequest to detect the horizon in an image. Here is my code:
func processHorizonImage(_ ciImage: CIImage) async {
    let request = DetectHorizonRequest()
    do {
        let result = try await request.perform(on: ciImage)
        // result is nil at this point
    } catch {
        print(error)
    }
}
After calling the perform method, `result` is nil. To make sure the request is correct, I have verified the following:
The input CIImage is valid and contains a visible horizon.
No errors are being thrown.
The relevant frameworks are properly imported.
Given that my image contains a clear horizon, why am I still not getting any results? I would appreciate any help or suggestions to resolve this issue.
Thank you for your support!
This is the image
I’m working on a program that analyzes video files frame by frame to detect human poses in each frame. However, during the process of reading observations from the stream, the analysis frequently stops with the following error:
[LOG_ERROR] /Library/Caches/com.apple.xbs/Sources/MediaAnalysis/VideoProcessing/VCPHumanPoseImageRequest.mm[85]: code -18
[LOG_ERROR] /Library/Caches/com.apple.xbs/Sources/MediaAnalysis/VideoProcessing/VCPHumanPoseImageRequest.mm[178]: code -18
The error was caught and printed using a do-catch block, and here is the output:
Error Domain=NSOSStatusErrorDomain Code=-18 "Error: failed to processImage" UserInfo={NSLocalizedDescription=Error: failed to processImage}
While the do-catch block helps prevent the app from crashing, the frames following the error cannot be analyzed.
I’m hoping to understand the cause of this error, or find a way to skip the problematic frames and continue analyzing the subsequent ones.
My development environment is Xcode Version 16.0 (16A242d) and iOS 18.0.
Thank you for your help. (Attaching my code below.)
let videoProcessor = VideoProcessor(videoURL)
let bodyPoseRequest = DetectHumanBodyPoseRequest()
let asset = AVURLAsset(url: videoURL)
let videoTrack = try await asset.loadTracks(withMediaType: .video).first
let bodyPoseStream = try await videoProcessor.addRequest(bodyPoseRequest)
do {
    for try await observations in bodyPoseStream {
        guard let observation = observations.first else { continue }
        if let timeRange = observation.timeRange {
            /// do something...
        }
    }
} catch {
    print(error)
}
Hey everyone,
I've been updating my code to take advantage of the new Vision API for text recognition in macOS 15. However, I'm noticing some very odd behavior: in general, the new Vision API seems to consistently produce worse results than the old API. For reference, here is how I'm setting up my request:
var request = RecognizeTextRequest()
request.recognitionLevel = getOCRMode() // generally accurate
request.usesLanguageCorrection = !disableLanguageCorrection // generally true
request.recognitionLanguages = language.split(separator: ",").map { Locale.Language(identifier: String($0)) } // generally 'en'
let observations = try? await request.perform(on: image) as [RecognizedTextObservation]
Then I process the results and take the top candidate, which, as mentioned above, is typically of worse quality than the result of the same request made with the old API.
Am I doing something wrong here?
WWDC 2024 mentioned that the OCR feature from the Vision framework has support for "Korean, Swedish, and Chinese", but the Swedish support does not seem to be available...
Running either
print(try? VNRecognizeTextRequest().supportedRecognitionLanguages())
var ocrRequest = RecognizeTextRequest(.revision3)
did not print out Swedish as one of the supported languages, but Korean and Chinese are.
Tested on early versions of iOS 18 developer beta, and the latest version of iOS 18.1 (22B5054e).
I would like to offer the functionality that the user aims the camera at a graph (including axes and scales) and the app detects the graph and the app replicates the graph using the image.
I have the whole camera setup finished with an AVCaptureSession, VNDetectContoursRequest, VNImageRequestHandler, etc.
However, I now get a very large number of results, so I guess I need to tell the image processing what I am looking for, i.e. filter the VNContoursObservations.
I 'think' I first need to detect two perpendicular lines (the two axes). How do I do that? If I do not see them, I can just ignore that input and wait for the next VNContoursObservation.
When I found the axes of the graph, I will need to find the curve (graph) that I need to scan. Any tips on how I can find that curve and turn that curve into a bunch of coordinates?
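Two small pieces of geometry might help once candidate straight segments have been pulled out of the contours (extracting those segments is not covered here). The plain-Swift sketch below shows a perpendicularity test for two candidate axis segments, and a linear calibration that turns curve points in image coordinates into graph units once two tick marks on each axis have been located and read. The names (`P`, `Segment`, `AxisCalibration`) are hypothetical, and linear (not logarithmic) axes aligned with the image are assumed.

```swift
import Foundation

struct P { var x: Double; var y: Double }

struct Segment {
    var a: P
    var b: P
    /// Unit direction vector from a to b.
    var direction: (dx: Double, dy: Double) {
        let dx = b.x - a.x, dy = b.y - a.y
        let len = (dx * dx + dy * dy).squareRoot()
        return (dx / len, dy / len)
    }
}

/// True if the angle between the segments is within `toleranceDegrees` of 90°.
/// |cos θ| is 0 for perpendicular lines; |θ - 90°| <= tol  <=>  |cos θ| <= sin(tol).
func arePerpendicular(_ s1: Segment, _ s2: Segment, toleranceDegrees: Double = 5) -> Bool {
    let d1 = s1.direction, d2 = s2.direction
    let cosAngle = abs(d1.dx * d2.dx + d1.dy * d2.dy)
    return cosAngle <= sin(toleranceDegrees * .pi / 180)
}

/// Linear map along one axis defined by two reference ticks:
/// pixel p0 corresponds to value v0, pixel p1 to value v1.
struct AxisCalibration {
    var p0: Double, v0: Double
    var p1: Double, v1: Double
    func value(atPixel p: Double) -> Double {
        v0 + (p - p0) * (v1 - v0) / (p1 - p0)
    }
}

/// Converts curve points (image coordinates) into graph coordinates.
func graphCoordinates(of points: [P], x: AxisCalibration, y: AxisCalibration) -> [P] {
    points.map { P(x: x.value(atPixel: $0.x), y: y.value(atPixel: $0.y)) }
}
```

Because each axis is calibrated from two reference points, it doesn't matter whether image y grows upward (as in Vision's normalized coordinates) or downward; the sign is absorbed into the calibration.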
Hi, our app has an immersive space, and we thought the palm menu button was not available in immersive spaces, but when I look at my hand and tap, the menu button appears. Is it possible to keep it hidden? We have a hand-tracking feature on the palm, and when users try to press a button that overlaps the palm, it triggers the menu button; if the user then presses again by mistake, the application goes to the background.
This is very important for us because we would like to release this hand-tracking feature as soon as possible.
Here is a link to a video showing the problem:
Hi everyone,
I'm working on an iOS app built in Swift using Xcode, where I'm integrating Roboflow's object detection API to extract items from grocery receipts. My goal is to identify key information (like items, total, tax, etc.) from the images of these receipts.
I'm successfully sending images to the Roboflow API and receiving predictions with bounding box data, but when I attempt to extract text from the detected regions (bounding boxes), it appears that the text extraction is failing—no text is being recognized. The issue seems to be that the bounding boxes are either not properly being handled or something is going wrong in the way I process the API response.
Here's a brief breakdown of what I'm doing:
The image is captured, converted to base64, and sent to the Roboflow API.
The API response comes back with bounding boxes for the detected elements (items, date, subtotal, etc.).
The problem occurs when I try to extract the text from the image using the bounding box data—it seems like the bounding boxes are being found, but no text is returned.
I suspect the issue might be happening because the app’s segue to the results view controller is triggered before the OCR extraction completes, or there might be a problem in my code handling the bounding box response.
Response Data:
{
  "inference_id": "77134cce-91b5-4600-a59b-fab74350ca06",
  "time": 0.09240847699993537,
  "image": {
    "width": 370,
    "height": 502
  },
  "predictions": [
    {
      "x": 163.5,
      "y": 250.5,
      "width": 313.0,
      "height": 127.0,
      "confidence": 0.9357666373252869,
      "class": "Item",
      "class_id": 1,
      "detection_id": "753341d5-07b6-42a1-8926-ecbc61128243"
    },
    {
      "x": 52.5,
      "y": 417.5,
      "width": 89.0,
      "height": 23.0,
      "confidence": 0.8819760680198669,
      "class": "Date",
      "class_id": 0,
      "detection_id": "b4681149-d538-47b1-8700-d9528bf1daa0"
    }
  ]
}
And the log showing bounding boxes:
Prediction: ["width": 313, "y": 250.5, "x": 163.5, "detection_id": 753341d5-07b6-42a1-8926-ecbc61128243, "class": Item, "height": 127, "confidence": 0.9357666373252869, "class_id": 1]
No bounding box found in prediction.
I've double-checked the bounding box coordinates, and everything seems fine. Does anyone have experience with using OCR alongside object detection APIs in Swift? Any help on how to ensure the bounding boxes are properly processed and used for OCR would be greatly appreciated!
Also, would it help to delay the segue to the results view controller until OCR is complete?
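Judging only from the response and log above, each prediction carries flat `x`, `y`, `width`, `height` keys rather than a nested bounding-box object, so code that looks for a separate bounding-box key would print "No bounding box found" even though the data is present. That is an inference, since the parsing code isn't shown. The sketch below decodes the response shape shown above with `Codable`, converts each box to a top-left-origin rectangle in pixels (what a crop needs), and also produces the normalized, bottom-left-origin rectangle that Vision uses for properties such as `regionOfInterest`. The type names are hypothetical. It assumes `x`/`y` are box centers in the pixel space of the reported `image` size (the numbers above fit that: the Item box spans x 7 to 320 in a 370-pixel-wide image), so the image used for OCR must be that same image, or the boxes must be rescaled. On the segue question: if OCR completes asynchronously, performing the segue from its completion handler avoids showing empty results.

```swift
import Foundation

struct RoboflowResponse: Decodable {
    struct ImageSize: Decodable { let width: Double; let height: Double }
    struct Prediction: Decodable {
        let x: Double          // box center, pixels
        let y: Double          // box center, pixels
        let width: Double
        let height: Double
        let confidence: Double
        let className: String  // "class" is a Swift keyword, so it is renamed here
        enum CodingKeys: String, CodingKey {
            case x, y, width, height, confidence
            case className = "class"
        }
    }
    let image: ImageSize
    let predictions: [Prediction]
}

struct Rect: Equatable { var x: Double; var y: Double; var width: Double; var height: Double }

extension RoboflowResponse.Prediction {
    /// Top-left-origin rectangle in pixels, suitable for cropping the source image.
    var pixelRect: Rect {
        Rect(x: x - width / 2, y: y - height / 2, width: width, height: height)
    }

    /// Normalized rectangle with a bottom-left origin (Vision's convention).
    func normalizedVisionRect(imageWidth: Double, imageHeight: Double) -> Rect {
        let r = pixelRect
        return Rect(x: r.x / imageWidth,
                    y: 1 - (r.y + r.height) / imageHeight,
                    width: r.width / imageWidth,
                    height: r.height / imageHeight)
    }
}
```

Keys the sketch doesn't model, such as `class_id` and `detection_id`, are simply ignored by `JSONDecoder`.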
Thank you!
When I use VNGenerateForegroundInstanceMaskRequest to generate the mask in a SwiftUI app running in the simulator, I get the error "Could not create inference context".
I then added code to make Vision run on the CPU:
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(ciImage: inputImage)
#if targetEnvironment(simulator)
if #available(iOS 18.0, *) {
    // Select the CPU explicitly: calling setComputeDevice for every device in a loop
    // leaves whichever device happens to come last in the list.
    for device in MLComputeDevice.allComputeDevices {
        if case .cpu = device {
            request.setComputeDevice(device, for: .main)
        }
    }
} else {
    // Fallback on earlier versions
    request.usesCPUOnly = true
}
#endif
do {
    try handler.perform([request])
    if let result = request.results?.first {
        let mask = try result.generateScaledMaskForImage(forInstances: result.allInstances, from: handler)
        return CIImage(cvPixelBuffer: mask)
    }
} catch {
    print(error)
}
Even when I force the code to run on the CPU in the simulator, I still get the error: "Could not create inference context"