Image object detection with video sizing issue

I'm working on my first model that detects bowling score screens, and I have it working with pictures no problem. But when it comes to video, I have a sizing issue.

I added my model to a small app I wrote for taking a picture of a bowling scoring screen, where the model frames the screens it detects in the camera's video feed. The model works, but my boxes come out about 2/3 the size of the screens being detected. I don't understand the geometry of the video stream the camera is feeding me. What I mean is that I don't want to hack around it by just making my rectangles larger, and I'm not sure whether the video feed is larger than the frame I'm running detection on in code.

One question I have: is the video feed a certain resolution, like 1920×something, or a much higher resolution in the 12-megapixel range?

On a static image of, say, 1920×something, my alignment is perfect.

AI tools tell me it's my model training: that I'm training on square images while the video is 16:9, or that I'm producing 4:3 images in a 16:9 environment.

I'm missing something here but I'm not sure what. I already wrote code to force the boxes to fit, but reverted it in favor of a natural fit.

Answered by jkirkerx in 872900022

Problem: Misaligned Vision Bounding Boxes in SwiftUI

When using AVCaptureVideoPreviewLayer inside a UIViewRepresentable, my Vision bounding boxes were either stretched, offset to the side, or shrunken toward the center. The standard GeometryReader approach failed because it reported the full screen dimensions, while the camera feed was being letterboxed or aspect-filled.

The Solution: Native Layout Bridging

I moved away from artificial scaling factors (like 1.33 or manual multipliers) and implemented a "Pure Natural" layout bridge.

  1. Capturing the "Real" Video Frame: I added a closure (onLayout) to my UIView subclass. This allowed the UIKit layer to report its actual non-zero dimensions (e.g., 832×420) back to SwiftUI only after the system had finalized the layout math. This eliminated the (0, 0) size errors during initialization.
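A minimal sketch of what this layout bridge might look like. The `PreviewView` and `onLayout` names follow the description above; the `videoFeedSize` binding is an assumption about how the size is handed back to SwiftUI.

```swift
import SwiftUI
import AVFoundation

final class PreviewView: UIView {
    // Back this view with an AVCaptureVideoPreviewLayer.
    override class var layerClass: AnyClass { AVCaptureVideoPreviewLayer.self }

    var previewLayer: AVCaptureVideoPreviewLayer { layer as! AVCaptureVideoPreviewLayer }

    // Called once UIKit has finished a layout pass with real, non-zero bounds.
    var onLayout: ((CGSize) -> Void)?

    override func layoutSubviews() {
        super.layoutSubviews()
        // Report the actual rendered size (e.g. 832×420) back to SwiftUI,
        // skipping the transient (0, 0) frames seen during initialization.
        if bounds.size != .zero {
            onLayout?(bounds.size)
        }
    }
}

struct CameraPreview: UIViewRepresentable {
    @Binding var videoFeedSize: CGSize

    func makeUIView(context: Context) -> PreviewView {
        let view = PreviewView()
        view.previewLayer.videoGravity = .resizeAspectFill
        view.onLayout = { size in
            DispatchQueue.main.async { videoFeedSize = size }
        }
        return view
    }

    func updateUIView(_ uiView: PreviewView, context: Context) {}
}
```

Because `layoutSubviews` fires whenever the system resizes the view, the binding also stays correct across rotation.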

  2. Standardizing the Vision Request: I set the VNImageCropAndScaleOption to .scaleFill. This ensures that the Vision coordinate system (0.0 to 1.0) maps exactly to the edges of the reported PreviewView bounds, making the math predictable regardless of the device's aspect ratio.
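Configuring the request might look like the sketch below. The `makeDetectionRequest` helper and its `handler` parameter are illustrative names; `model` is assumed to be your compiled Core ML detector wrapped in a `VNCoreMLModel`.

```swift
import Vision

// Build a Vision request whose normalized (0–1) output maps
// edge-to-edge onto the preview bounds.
func makeDetectionRequest(model: VNCoreMLModel,
                          handler: @escaping ([VNRecognizedObjectObservation]) -> Void) -> VNCoreMLRequest {
    let request = VNCoreMLRequest(model: model) { request, _ in
        let boxes = request.results as? [VNRecognizedObjectObservation] ?? []
        handler(boxes)
    }
    // .scaleFill stretches the frame to the model's input size, so the
    // returned normalized coordinates line up with the full preview rectangle
    // instead of a center crop.
    request.imageCropAndScaleOption = .scaleFill
    return request
}
```

The trade-off is that .scaleFill distorts the image the model sees; .centerCrop or .scaleFit avoid that but then require crop- or letterbox-aware math when mapping boxes back.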

  3. The "Natural" Coordinate Flip: By using the actual reported videoFeedSize, converting from Vision's bottom-left origin to SwiftUI's top-left origin became a simple subtraction, with no complex ratios needed.
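The flip in step 3 can be sketched as a pure function. This assumes the .scaleFill mapping from step 2, so each normalized coordinate just scales by the reported size; the function name is illustrative.

```swift
import CoreGraphics

// Vision bounding boxes are normalized (0–1) with a bottom-left origin;
// SwiftUI draws from the top-left. Under .scaleFill the conversion is a
// multiply per axis plus one subtraction to flip y.
func convert(boundingBox: CGRect, to videoFeedSize: CGSize) -> CGRect {
    let w = boundingBox.width * videoFeedSize.width
    let h = boundingBox.height * videoFeedSize.height
    let x = boundingBox.origin.x * videoFeedSize.width
    // Flip: Vision's origin.y measures up from the bottom edge, so the
    // top edge in SwiftUI space is 1 − (origin.y + height).
    let y = (1.0 - boundingBox.origin.y - boundingBox.height) * videoFeedSize.height
    return CGRect(x: x, y: y, width: w, height: h)
}
```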

This is the closest I've gotten so far. It's not bad, considering it's a test run against a photo on the computer screen rather than a real live scoring system.

