Could I please ask what the deep learning architecture behind Apple's custom pose models in the Vision framework is (for example, the model backing `VNDetectHumanBodyPoseRequest`)? Is it based on a publicly known architecture (such as ResNet) with modifications and/or a custom Apple dataset, or is it a fully custom design?
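For context, this is roughly how we obtain the body pose keypoints (a minimal sketch using the public Vision API; `cgImage` is an assumed input image, and error handling is elided):

```swift
import Vision

// Minimal sketch: run body pose detection on a single CGImage.
// `cgImage` is an assumed input; wrap in do/catch in real code.
let request = VNDetectHumanBodyPoseRequest()
let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
try handler.perform([request])

if let observation = request.results?.first {
    // Maps each joint name to a normalized image point with a confidence score.
    let joints = try observation.recognizedPoints(.all)
    for (name, point) in joints where point.confidence > 0.3 {
        print(name.rawValue, point.location)
    }
}
```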
I was not able to find this information anywhere in Apple's documentation, and it would be highly beneficial to know, as we are using this data in research we intend to publish in a paper.
Thanks in advance!