Confidence of Vision different from CoreML output

Hi,

I have a custom object detection CoreML model and I notice something strange when using the model with the Vision framework.

I have tried two different approaches to processing an image and running inference on the CoreML model. The first one uses CoreML directly ("raw"): initialising the model, preparing the input image and calling the model's .prediction() function to get the model's output.
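For reference, here is a minimal sketch of that first approach. The class name MyDetector, the input name image and the output names confidence/coordinates are placeholders for whatever Xcode generates from your .mlmodel, and pixelBuffer stands for an already-prepared CVPixelBuffer:

```swift
import CoreML

// Hypothetical names: "MyDetector" and the feature names "image", "confidence"
// and "coordinates" stand in for whatever Xcode generates from your .mlmodel.
let model = try MyDetector(configuration: MLModelConfiguration())

// pixelBuffer: a CVPixelBuffer already sized to the model's expected input dimensions.
let output = try model.prediction(image: pixelBuffer)

// Typical object detection outputs: "confidence" is N x numberOfClasses (3 here),
// "coordinates" is N x 4 (normalized x, y, width, height).
let confidences: MLMultiArray = output.confidence
let coordinates: MLMultiArray = output.coordinates
```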

The second one uses Vision to wrap the CoreML model in a VNCoreMLModel, creating a VNCoreMLRequest and using a VNImageRequestHandler to actually perform the model inference. The results of the VNCoreMLRequest are of type VNRecognizedObjectObservation.
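And a corresponding sketch of the Vision approach, reusing the same hypothetical MyDetector class and pixelBuffer from above:

```swift
import Vision
import CoreML

// Wrap the same model for Vision. "MyDetector" is the hypothetical generated
// class from the sketch above; pixelBuffer is the same CVPixelBuffer as before.
let vnModel = try VNCoreMLModel(for: MyDetector(configuration: MLModelConfiguration()).model)

let request = VNCoreMLRequest(model: vnModel) { request, error in
    guard let observations = request.results as? [VNRecognizedObjectObservation] else { return }
    for observation in observations {
        // boundingBox is in normalized coordinates; labels is sorted by confidence.
        print(observation.boundingBox,
              observation.confidence,
              observation.labels.first?.identifier ?? "")
    }
}

let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
try handler.perform([request])
```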

The issue I now face is the difference between the outputs of the two methods. The first method gives back the raw output of the CoreML model: confidence and coordinates. The confidence is an array with size equal to the number of classes in my model (3 in my case). The second method gives back the boundingBox, confidence and labels. However here the confidence is only the confidence for the most likely class (so size is equal to 1). But the confidence I get from the second approach is quite different from the confidence I get from the first approach.

I can use either one of the approaches in my application. However, I really want to find out what is going on and understand how this difference occurred.

Thanks!


Hello,

Are you doing any sort of scaling to your input image when using the CoreML model directly?

By default, VNCoreMLRequest will use an imageCropAndScaleOption of centerCrop.

If you aren't applying the same cropping and scaling when using CoreML directly, then that could explain why you are seeing different results.
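For example (a sketch, reusing the request variable from the Vision example above), you could set the option explicitly so that it matches whatever preprocessing you do on the direct CoreML path:

```swift
// Default is .centerCrop; pick the option that matches the cropping/scaling you
// apply yourself before calling .prediction() directly.
request.imageCropAndScaleOption = .scaleFit
```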

Hi,

Both approaches use the same kind of scaling.

However, as a test setup I stored an image on disk with the exact dimensions (width, height) expected by my CoreML model. So normally there is no need for any scaling, but to be sure I have added the scaleFit scaling option to both approaches.

So both approaches should have the exact same input image, but the difference in confidence still occurs...

What happens internally to get/calculate the confidence of a VNRecognizedObjectObservation? The documentation only says it is a normalised value. For the boundingBox it is clear that it is "recalculated" to another coordinate origin, but it is unclear to me what happens internally with the confidence score...

Thanks!

Accepted Answer

Hello,

However here the confidence is only the confidence for the most likely class (so size is equal to 1).

That is not actually what this confidence value represents. Currently, the "overall" confidence value of a VNRecognizedObjectObservation is the sum of the confidences of each label in the labels array.

Note that this is an implementation detail that is subject to change, so you should not rely on this behavior always being the case.

The confidence scores for each label should match what you receive when you run inference through CoreML.
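For illustration, a small sketch of that relationship (assuming observation is one of the VNRecognizedObjectObservation results from the request above):

```swift
// `observation` is one VNRecognizedObjectObservation from the Vision results.
// Under the behavior described above (an implementation detail that may change),
// the overall confidence is roughly the sum of the per-label confidences.
let summedLabelConfidence = observation.labels
    .map { $0.confidence }
    .reduce(0, +)
print(observation.confidence, summedLabelConfidence) // expected to be approximately equal
```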

Hi @gchiste

I believe the above explanation is out of date. Could you explain how the new implementation works? According to the documentation of VNRecognizedObjectObservation you have to

Multiply the classification confidence with the confidence of this observation.

but how does Vision acquire these different confidences (both for the object observation and for the classification)? The underlying CoreML detection model (e.g. YOLOv3-Tiny) has only one output that generates confidences, so I am wondering how Vision is able to decouple this confidence into two different ones.

Could you also give me an idea of when this implementation changed?

Thanks!

Hello, and thanks for stopping by at WWDC24!

Object detection models have two types of confidence. The first type is a confidence score associated with each bounding box. This represents the confidence that some object of interest has been detected at the given location. This is the confidence score reported on VNRecognizedObjectObservation.

Each detected object also has a set of classification labels. These labels each have a confidence score representing the confidence that the detected object is the given label, assuming that there is a valid detection. These confidence scores are reported on the VNClassificationObservation objects in the labels array. Vision has reported confidence scores in this way since iOS 12.0 / macOS 10.14. Small changes in confidences are to be expected across different devices and operating systems.

You can combine the bounding box’s confidence with a label’s confidence by multiplying them together. This will give you the overall confidence that an object with label X was detected at location Y. This combined confidence should correspond to the raw output of the model. The bounding box confidence is derived by summing the raw outputs, and the class label confidences are the raw outputs normalized to sum to 1.
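To make that concrete, here is a rough sketch (observation is a VNRecognizedObjectObservation from the request's results; the raw scores are made-up numbers purely for illustration):

```swift
// `observation` is a VNRecognizedObjectObservation; the raw scores below are
// made-up values standing in for one row of the model's raw class output.
for label in observation.labels {
    // Overall confidence that an object with this label is at this location.
    let combined = observation.confidence * label.confidence
    print(label.identifier, combined)
}

// Relationship to the raw model output, as described above:
let rawScores: [Float] = [0.10, 0.75, 0.05]                    // hypothetical raw class scores
let boxConfidence = rawScores.reduce(0, +)                     // ≈ observation.confidence
let labelConfidences = rawScores.map { $0 / boxConfidence }    // ≈ observation.labels[i].confidence
print(boxConfidence, labelConfidences)
```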

Also, note that the MLModel preview only shows the normalized class label confidences, not the bounding box confidence.
