SensorKit Speech data question

We are a research team conducting a study that collects subjects' SensorKit speech data, and we've encountered some questions we couldn't resolve on our own or by consulting the online SensorKit documentation:

Microphone Activation: In general, how is the microphone turned on to capture a speech session? And how is each session determined to be an independent session?

Negative Values: In the speech classification data, there are entries where some of the start and end values are negative (see screenshot below). How should we interpret and handle these values? Is it safe to filter them out?

Duplicated Sessions: In the same screenshot you can see multiple session identifiers linked to the same subject with the same timestamp - what does this represent?

More Negative Values: The same question applies to the speech recognition data's average pause duration: what does -1 mean, and should we remove those entries as well?

(Note that subject IDs were removed from these screenshots for privacy purposes, but each screenshot is from a single subject.)

We greatly appreciate your time and help.

Answered by DTS Engineer in 853886022

Let me try to clarify some of this:

Microphone Activation:

Q: How is the microphone turned on to capture a speech session?

A: Speech Metrics are collected when the user has already engaged the microphone - through a Siri utterance or through telephony (a VoIP app, the Phone app, FaceTime). SensorKit does not manipulate the microphone itself.

Q: How is each session determined to be an independent session?

A: Each session generally corresponds to a Siri utterance or a phone call. But changing conditions in the audio subsystem during a session (a long phone call where the user starts the call, switches to a Bluetooth headset, and perhaps then connects to car audio, for example) may change session IDs.
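For context, a minimal sketch of how such per-session samples might be fetched is below. The class name is hypothetical; it assumes the app holds the SensorKit entitlement, has already called SRSensorReader's authorization API, and respects the 24-hour holding period SensorKit applies before data becomes fetchable. `.telephonySpeechMetrics` is shown; `.siriSpeechMetrics` works the same way.

```swift
import SensorKit

// Hypothetical fetcher: reads telephony speech-metrics samples.
final class SpeechMetricsFetcher: NSObject, SRSensorReaderDelegate {
    private let reader = SRSensorReader(sensor: .telephonySpeechMetrics)

    override init() {
        super.init()
        reader.delegate = self
    }

    func fetchRecentSamples() {
        // SensorKit holds samples for 24 hours before they become
        // fetchable, so the window must end at least a day in the past.
        let request = SRFetchRequest()
        request.from = SRAbsoluteTimeGetCurrent() - 72 * 3_600
        request.to = SRAbsoluteTimeGetCurrent() - 24 * 3_600
        reader.fetch(request)
    }

    func sensorReader(_ reader: SRSensorReader,
                      fetching fetchRequest: SRFetchRequest,
                      didFetchResult result: SRFetchResult<AnyObject>) -> Bool {
        if let metrics = result.sample as? SRSpeechMetrics {
            // The same sessionIdentifier can appear on many samples
            // from one utterance or call.
            print(metrics.sessionIdentifier, metrics.timestamp)
        }
        return true // continue receiving results
    }
}
```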

Negative Values:

Q: In the speech classification data, there are entries where some of the start and end values are negative

A: This is not expected behavior, and we would need to see exactly what is being sent to your app.

Please file a Feedback Report with as many details of the occurrence as possible and logs from your own app (rather than screenshots); also create a system diagnostic after reproducing the issue and attach it to the Feedback Report.

Q: In the speech recognition data's average pause duration, what does -1 mean, and should we remove those entries as well?

A: averagePauseDuration == -1 in this context means no pauses were detected, possibly because only one word was uttered. Whether to remove those entries depends on your own logic.
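As a sketch of one way to handle this sentinel without dropping the whole sample (the helper name is hypothetical), you could map -1 to nil when reading the value off a fetched SRSpeechMetrics sample:

```swift
import SensorKit
import Speech

// Hypothetical helper: treat averagePauseDuration == -1 as
// "no pauses detected" rather than discarding the sample.
func averagePause(for metrics: SRSpeechMetrics) -> TimeInterval? {
    guard let metadata = metrics.speechRecognition?.speechRecognitionMetadata else {
        return nil
    }
    let duration = metadata.averagePauseDuration
    return duration < 0 ? nil : duration // -1 sentinel -> nil
}
```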

Duplicated Sessions:

Q: There are multiple session identifiers linked to the same subject with the same timestamp - what does this represent?

A: Each sound classification within the same utterance or call (for example, speech, laughter, shouting) will output a separate entry with the same session identifier, even from the same audio “session”. For short utterances, the time scopes could be the same, as the classifier could recognize multiple outputs. For longer sessions, you could get more sound classifications, and the time range should reflect which part of the audio stream each entry refers to.
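For illustration, a hedged sketch of regrouping these entries back into sessions (assuming samples fetched as SRSpeechMetrics, as above; the function name is hypothetical):

```swift
import SensorKit
import SoundAnalysis

// Sketch: group fetched speech-metrics samples by session identifier,
// then list every sound classification reported for that session.
func classificationsBySession(
    _ samples: [SRSpeechMetrics]
) -> [String: [(label: String, confidence: Double)]] {
    var sessions: [String: [(label: String, confidence: Double)]] = [:]
    for sample in samples {
        guard let result = sample.soundClassification else { continue }
        for classification in result.classifications {
            sessions[sample.sessionIdentifier, default: []]
                .append((label: classification.identifier,
                         confidence: classification.confidence))
        }
    }
    return sessions
}
```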


Argun Tekant / DTS Engineer / Core Technologies

Thank you for your answers, Argun!

While most of the answers are very helpful, I am puzzled by the one regarding duplicated sessions. As you can see in the screenshot, for example, session ID 5B155CE8-6AA9-4A3F-BCD0-9D88AF69F196;1 was linked to all three different classifications (laughter, shouting, speech), which is the opposite of the "each sound classification within the same utterance or call will output a separate identifier" explanation. It's the same for all the other session IDs, and all of these session IDs are linked with the same timestamp and the same start time.

And it's not uncommon; below is another example from the same subject. I marked records sharing the same time with the same name; note that the blue records are again a single session ID linked with three different classifications.

And in general, why would the same segment be classified into different categories? Does that mean all of them are possible but none is certain? If so, is there a recommended confidence cut-off (e.g., 0.5, 0.9, etc.)?

As for sending in raw logs, I'm afraid that's very difficult, as the data comes from our participants rather than from ourselves. I would assume this is an obstacle for all research teams.

Thank you.

It seems I misspoke when explaining the classification behavior. Indeed, it is possible to get multiple entries with the same ID and time range.

You will get a separate output from each sound classifier (laughter, shouting, speech) for the same audio input.

These will have the same session ID since they're from the same session. For short sessions, you could get the same time values.

So, for a short utterance, you might get multiple entries with different classifications: the classifier may think the audio is shouting or laughing but not be sure, and it will output two entries with the same identifier and the same time scope.

For longer sessions, you may get more sound classifications, and the time range should reflect which part of the audio stream each entry refers to.

I hope this clears up your question. I will also go back and correct the initial answer so that others searching for the same topic will see the correct information.

As for what confidence level should be filtered out or accepted, that will depend entirely on your use case and research requirements. Depending on your specific needs, you may need to run some tests to determine what the confidence values represent for your app in real situations.
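Purely as an illustration (the 0.5 default below is an arbitrary placeholder, not a recommendation; calibrate any threshold against your own validation data), such a cut-off could be applied like this:

```swift
import SoundAnalysis

// Sketch: keep only classifications at or above a study-chosen
// threshold. 0.5 is an arbitrary placeholder value.
func confidentLabels(in result: SNClassificationResult,
                     threshold: Double = 0.5) -> [SNClassification] {
    result.classifications.filter { $0.confidence >= threshold }
}
```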


Argun Tekant / DTS Engineer / Core Technologies
