Hi everyone,
I’ve been analyzing the current state of Sign Language accessibility tools, and I noticed a significant gap on the learning side: students get no real-time feedback (e.g., "Is my hand position correct?").
Most current solutions rely on 2D video processing, which struggles with depth perception and occlusion, yet hand-over-hand and hand-over-face gestures are critical in Sign Language grammar.
I'd like to propose (and discuss) an architecture that leverages the LiDAR + Neural Engine capabilities of current iPhones to address this.
The Concept: Skeleton-based Normalization
Instead of training ML models on raw video frames (which introduces noise from lighting, skin tone, and clothing), we could use ARKit's Body Tracking to abstract the input.
- Capture: Use ARKit/LiDAR to track the user's upper body and hand joints in 3D space.
- Data Normalization: Extract only the joint coordinates (X, Y, Z per joint). This yields a "clean" dataset that effectively normalizes away the user's physical appearance.
- Comparison: Feed these vectors into a CoreML model trained on "Reference Skeletons" (recorded by native signers).
- Feedback Loop: The app calculates the geometric distance between the user's pose and the reference pose and turns it into a specific correction (e.g., "Raise your elbow 10 degrees"); see the sketch after this list.
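
To make the capture → normalization → distance idea concrete, here's a rough, untested Swift sketch of the direction I'm imagining. It assumes an ARSession already running with ARBodyTrackingConfiguration; `PoseFrame`, `PoseCaptureDelegate`, and `poseDistance` are just hypothetical names for this post, and I've deliberately kept the joint set coarse (head, shoulders, hands), since how much finger-level detail ARKit's body skeleton gives us in practice is exactly one of the things I'd like to hear experiences about.

```swift
import ARKit
import simd

// One captured frame: root-relative 3D positions for a fixed set of upper-body joints.
// (Joint names are ARKit's ARSkeleton.JointName constants.)
struct PoseFrame {
    static let trackedJoints: [ARSkeleton.JointName] = [
        .head, .leftShoulder, .rightShoulder, .leftHand, .rightHand
    ]
    var positions: [simd_float3]   // same order as trackedJoints
}

final class PoseCaptureDelegate: NSObject, ARSessionDelegate {
    var onFrame: ((PoseFrame) -> Void)?

    func session(_ session: ARSession, didUpdate anchors: [ARAnchor]) {
        for case let body as ARBodyAnchor in anchors {
            let skeleton = body.skeleton
            // modelTransform(for:) is expressed relative to the body's root joint,
            // which already gives us the "appearance-free" normalization from step 2.
            let positions = PoseFrame.trackedJoints.compactMap { name -> simd_float3? in
                guard let t = skeleton.modelTransform(for: name) else { return nil }
                let p = t.columns.3
                return simd_float3(p.x, p.y, p.z)
            }
            guard positions.count == PoseFrame.trackedJoints.count else { continue }
            onFrame?(PoseFrame(positions: positions))
        }
    }
}

// Per-frame geometric distance between the learner's pose and a reference pose:
// mean Euclidean distance across the tracked joints (ARKit works in metres).
func poseDistance(_ a: PoseFrame, _ b: PoseFrame) -> Float {
    zip(a.positions, b.positions)
        .map { simd_distance($0, $1) }
        .reduce(0, +) / Float(a.positions.count)
}
```

The main point of the sketch is that the root-relative joint transforms are the "clean" data we'd store and compare, so nothing about the user's appearance ever enters the pipeline.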
Why this approach?
- Solves Occlusion: LiDAR handles depth much better than standard RGB cameras when hands cross the body.
- Privacy: We are processing coordinates, not video streams.
- Efficiency: Comparing vector sequences is computationally cheaper than video analysis, preserving battery life.
Has anyone experimented with using ARKit Body Anchors specifically for comparing complex gesture sequences against a stored "correct" database? I believe this "Skeleton First" approach is the key to scalable Sign Language education apps.
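
On the sequence-comparison part of that question, my current thinking is something like dynamic time warping over the normalized joint vectors, so a learner who signs slower or faster than the reference recording isn't penalized for timing alone. A rough sketch, building on the hypothetical `PoseFrame` / `poseDistance` names from above (same caveats apply):

```swift
// Dynamic time warping over two pose sequences: aligns the learner's attempt with
// the stored reference so differences in signing speed don't dominate the score.
// Returns the average per-step pose distance along the optimal alignment path.
func dtwScore(attempt: [PoseFrame], reference: [PoseFrame]) -> Float {
    let n = attempt.count, m = reference.count
    guard n > 0, m > 0 else { return .infinity }

    // cost[i][j] = best cumulative cost aligning the first i attempt frames
    // with the first j reference frames.
    var cost = [[Float]](repeating: [Float](repeating: .infinity, count: m + 1), count: n + 1)
    cost[0][0] = 0

    for i in 1...n {
        for j in 1...m {
            let d = poseDistance(attempt[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],     // learner lingers on a frame
                                 cost[i][j - 1],     // learner skips ahead
                                 cost[i - 1][j - 1]) // frames match up
        }
    }
    // Normalize by a path-length bound so longer signs aren't penalized.
    return cost[n][m] / Float(n + m)
}
```

A threshold on that score could gate "sign accepted", while the per-joint distances at the matched frames are what would drive the specific corrections like "raise your elbow".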
Looking forward to hearing your thoughts.