What is the most accurate way to classify a tennis video into rallies and the downtime between rallies?

My goal is to mark, for any tennis video, the timestamps at which each rally/point starts and ends. I tried trajectory detection, but the "end time" it gives is the moment the ball bounces rather than the moment the rally/point actually ends. I'm not sure which direction to take to improve on this. Would action classification of body poses in each frame (two classes, "playing" and "not playing") be the best way to split the video into segments, or would a different technique be better suited?
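
To make the two-class idea concrete, here is a minimal sketch of the post-processing step only: turning per-frame "playing" probabilities into rally start/end timestamps. It assumes some classifier (pose-based or otherwise) already produces one probability per frame. The function names, the moving-average smoothing, and the `threshold`, `window`, and `min_len_s` parameters are all hypothetical choices for illustration, not an established method.

```python
from typing import List, Tuple


def smooth(probs: List[float], window: int = 5) -> List[float]:
    """Centered moving average; the window shrinks at the edges of the video."""
    half = window // 2
    out = []
    for i in range(len(probs)):
        lo = max(0, i - half)
        hi = min(len(probs), i + half + 1)
        out.append(sum(probs[lo:hi]) / (hi - lo))
    return out


def rally_segments(
    probs: List[float],
    fps: float,
    threshold: float = 0.5,   # assumed cut-off between "playing" and "not playing"
    window: int = 5,          # assumed smoothing window, in frames
    min_len_s: float = 1.0,   # assumed minimum rally duration, in seconds
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) for each run of frames classified as "playing"."""
    playing = [p >= threshold for p in smooth(probs, window)]
    segments = []
    start = None
    # A trailing False closes a rally that runs to the last frame.
    for i, flag in enumerate(playing + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            # Drop runs shorter than min_len_s as likely misclassifications.
            if (i - start) / fps >= min_len_s:
                segments.append((start / fps, i / fps))
            start = None
    return segments
```

For example, with a 10 fps clip where frames 10 to 29 are labelled "playing", `rally_segments([0.0] * 10 + [1.0] * 20 + [0.0] * 10, fps=10)` returns `[(1.0, 3.0)]`. The smoothing and minimum-length filter are there so that a few misclassified frames do not split one rally into several or create spurious short ones.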
