We are a research team conducting a study that collects subjects' SensorKit speech data, and we've encountered some questions we couldn't resolve ourselves or through the online SensorKit documentation:
Microphone Activation: In general, how is the microphone turned on to capture a speech session? And how is each session determined to be an independent session?
Negative Values: In the speech classification data, there are entries where some of the start and end values are negative (see screenshot below). How should we interpret and handle these values? Is it safe to filter them out?
Duplicated sessions: From the same screenshot you can see there are multiple session identifiers linked to the same subject with the same timestamp - what does this represent?
More Negative Values: The same question applies to the speech recognition data's average pause duration - what does the -1 mean, and should we remove those entries as well?
(Note that these screenshots have the subject IDs removed for privacy purposes, but each screenshot is from a single subject.)
We greatly appreciate your time and help.
Let me try to clarify some of this:
Microphone Activation:
Q: how is the microphone being turned on to capture a speech session?
A: Speech Metrics are collected when the user has already engaged the microphone - through a Siri utterance or through telephony (a VoIP app, the Phone app, FaceTime). SensorKit does not manipulate the microphone itself.
Q: how is each session determined to be an independent session?
A: Each session mainly marks a Siri utterance or a phone call. However, changing conditions in the audio subsystem during a session (for example, a long phone call where the user starts the call, switches to a Bluetooth headset, and perhaps then connects to car audio) may change session IDs.
Negative Values:
Q: In the speech classification data, there are entries where some of the start and end values are negative
A: This is not expected behavior, and we would need to see what exactly is being sent to your app.
Please file a Feedback Report with as much detail about the occurrence as possible, including logs from your own app (rather than screenshots); also create a system diagnostic after reproducing the issue and attach it to the Feedback Report.
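For triage on your side while the report is being investigated, here is a minimal sketch of how you might separate out the suspect entries, assuming your app receives them as SRSpeechMetrics samples with an SNClassificationResult attached (the partitionByTimeRange helper is hypothetical, not part of the framework):

```swift
import SensorKit
import SoundAnalysis
import CoreMedia

// Sketch: partition fetched SRSpeechMetrics samples whose sound-classification
// time range dips below zero, so the suspect records can be logged and
// attached to the Feedback Report instead of being silently analyzed.
func partitionByTimeRange(_ samples: [SRSpeechMetrics])
    -> (valid: [SRSpeechMetrics], suspect: [SRSpeechMetrics]) {
    var valid: [SRSpeechMetrics] = []
    var suspect: [SRSpeechMetrics] = []
    for sample in samples {
        // Samples without a classification result are left in the valid set.
        guard let result = sample.soundClassification else {
            valid.append(sample)
            continue
        }
        let start = result.timeRange.start.seconds
        let end = result.timeRange.end.seconds
        if start < 0 || end < 0 {
            suspect.append(sample)
        } else {
            valid.append(sample)
        }
    }
    return (valid, suspect)
}
```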
Q: in speech recognition data's average pause duration, what does the -1 mean and should we remove them as well?
A: averagePauseDuration == -1 in this context means no pauses were detected, possibly because only one word was uttered. Whether to remove those entries or not depends on your logic.
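For example, a minimal sketch, assuming you read the value from SFSpeechRecognitionMetadata (the normalizedAveragePauseDuration helper is hypothetical):

```swift
import Foundation
import Speech

// Sketch: treat the -1 sentinel as "no pauses detected" rather than as a
// real duration, so it cannot skew downstream statistics. Whether you then
// drop or keep such entries is up to your analysis.
func normalizedAveragePauseDuration(
    _ metadata: SFSpeechRecognitionMetadata?
) -> TimeInterval? {
    guard let duration = metadata?.averagePauseDuration, duration >= 0 else {
        return nil   // no pauses, e.g. a single-word utterance
    }
    return duration
}
```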
Duplicated sessions:
Q: there are multiple session identifiers linked to the same subject with the same timestamp - what does this represent?
A: Each sound classification within the same utterance or call (for example, speech, laughter, shouting) produces a separate entry carrying the same session identifier, even though they come from the same audio “session”. For short utterances, the time ranges may be identical because the classifier can recognize multiple outputs within the same span. For longer sessions, you may receive more sound classifications, and each entry's time range should reflect which part of the audio stream it refers to.
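If it helps to reason about those duplicates, here is a minimal sketch that groups entries per session, assuming SRSpeechMetrics exposes sessionIdentifier on the OS versions you are targeting (the groupBySession helper is hypothetical):

```swift
import Foundation
import SensorKit

// Sketch: group classification entries by session identifier so all the
// outputs for one utterance or call can be inspected together, ordered
// by their sample timestamps.
func groupBySession(_ samples: [SRSpeechMetrics]) -> [String: [SRSpeechMetrics]] {
    Dictionary(grouping: samples, by: { $0.sessionIdentifier })
        .mapValues { $0.sorted { $0.timestamp < $1.timestamp } }
}
```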
Argun Tekant / DTS Engineer / Core Technologies