Inference with non-square Images

I'm trying to set up Facebook AI's "Segment Anything" MLModel to compare its on-device performance and efficacy against the Vision framework's foreground instance mask request (VNGenerateForegroundInstanceMaskRequest).

The Vision request accepts any reasonably sized image and provides a method to generate a mask at the same resolution as the input image. The Segment Anything MLModel, by contrast, takes a 1024x1024 input and produces a 1024x1024 output.
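For reference, the Vision side is roughly this (a minimal sketch, assuming the iOS 17+ VNGenerateForegroundInstanceMaskRequest / VNInstanceMaskObservation API; `foregroundMask(for:)` is just a placeholder name):

```swift
import Vision
import CoreGraphics
import CoreVideo

// Sketch: the request takes the image at its native size, and the observation
// can generate a mask scaled back to the original resolution.
func foregroundMask(for cgImage: CGImage) throws -> CVPixelBuffer? {
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    let request = VNGenerateForegroundInstanceMaskRequest()
    try handler.perform([request])
    guard let observation = request.results?.first else { return nil }
    // The returned pixel buffer matches the input image's resolution.
    return try observation.generateScaledMaskForImage(
        forInstances: observation.allInstances,
        from: handler)
}
```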

What is the best way to work with non-square images, such as 4:3 camera photos? I can think of three ways to handle this:

  1. Scale the image to 1024x1024, ignoring aspect ratio, then inversely scale the output back to the original size. My main concern is that squashing the content will degrade the inference results.
  2. Scale the image, preserving its aspect ratio so its minimum dimension is 1024, then run the model multiple times over a sliding 1024x1024 window and aggregate the results. My main concern here is the complexity of de-duping the output, since each run could produce different results depending on how objects are cropped.
  3. Fit the image within 1024x1024 and pad with black pixels to make a square (see the sketch after this list). I'm not sure if the border will muck up the inference.
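For concreteness, option 3 would look something like the sketch below, using only Core Graphics. `letterbox(_:to:)` and the returned `contentRect` are placeholder names I'm using to illustrate the idea, not anything from SAM's sample code:

```swift
import CoreGraphics

// Sketch of option 3: aspect-fit the image into a 1024x1024 black canvas.
func letterbox(_ image: CGImage, to side: Int = 1024) -> (padded: CGImage?, contentRect: CGRect) {
    let scale = CGFloat(side) / CGFloat(max(image.width, image.height))
    let fitted = CGSize(width: CGFloat(image.width) * scale,
                        height: CGFloat(image.height) * scale)
    // Center the scaled image; the remainder of the canvas stays black.
    let contentRect = CGRect(x: (CGFloat(side) - fitted.width) / 2,
                             y: (CGFloat(side) - fitted.height) / 2,
                             width: fitted.width,
                             height: fitted.height)

    let context = CGContext(data: nil, width: side, height: side,
                            bitsPerComponent: 8, bytesPerRow: 0,
                            space: CGColorSpaceCreateDeviceRGB(),
                            bitmapInfo: CGImageAlphaInfo.noneSkipLast.rawValue)
    context?.setFillColor(gray: 0, alpha: 1)
    context?.fill(CGRect(x: 0, y: 0, width: side, height: side))
    context?.interpolationQuality = .high
    context?.draw(image, in: contentRect)

    // After inference, crop the 1024x1024 mask to contentRect and scale it
    // back up by 1/scale to recover a mask at the original resolution.
    return (context?.makeImage(), contentRect)
}
```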

Anyway, this seems like it must be a well-solved problem in ML, but I'm having difficulty finding an authoritative best practice.
