Important: Inside Macintosh: Sound is deprecated as of Mac OS X v10.5. For new audio development in Mac OS X, use Core Audio. See the Audio page in the ADC Reference Library.
Writing Embedded Speech Commands
Embedded speech commands let you fine-tune the quality of speech output, making speech considerably easier to understand than text spoken in the synthesizer's default manner. An embedded speech command is a command placed within a text buffer to be spoken by the Speech Manager; it instructs the Speech Manager to take a particular action. For example, you could use an embedded speech command to emphasize a particular word in a text string so that it stands out to the user.

An advantage of this technique is that your application needs to call only the standard functions that generate speech: SpeakString, SpeakText, or SpeakBuffer. To change the way a phrase is generated, you do not need to change any of your application's code; you merely need to change the embedded command text.

Your application can also use embedded speech commands even if it speaks text created by the user, as opposed to a limited set of phrases. Before passing text to the Speech Manager, your application could embed various commands within the text. For example, a word-processing application might embed commands that tell the Speech Manager to put extra emphasis around words that the user has boldfaced or underlined.
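For instance, the following sketch speaks a string containing an emphasis command by passing it straight to SpeakText. This assumes the Carbon C interfaces to the Speech Synthesis Manager; the channel setup, the sample text, and the crude busy-wait are illustrative only.

    #include <ApplicationServices/ApplicationServices.h>  /* Speech Synthesis Manager (Carbon); classic Mac OS uses <Speech.h> */
    #include <string.h>

    /* Minimal sketch: the embedded [[emph +]] command simply rides along in the
       ordinary text buffer; no special call is needed to make use of it. */
    static void SpeakWithEmphasis(void)
    {
        SpeechChannel chan = NULL;
        const char   *text = "Your meeting has been [[emph +]] cancelled.";

        if (NewSpeechChannel(NULL, &chan) != noErr)   /* NULL voice means the default voice */
            return;

        SpeakText(chan, text, strlen(text));          /* speaks asynchronously */

        while (SpeechBusy() > 0)                      /* crude wait; a real application would keep servicing its event loop */
            ;

        DisposeSpeechChannel(chan);
    }

Because the emphasis lives in the text rather than in the code, changing the spoken result is just a matter of editing the command text.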
Embedded Command Delimiters

When processing input text data, speech synthesizers look for special sequences of characters called command delimiters. These character sequences are usually defined to be unusual pairings of printable characters that would not normally appear in the text. When a begin command delimiter string is encountered in the text, the following characters are assumed to contain one or more commands. The synthesizer attempts to parse and process these commands until an end command delimiter string is encountered. By default, the begin command delimiter string is "[[", and the end command delimiter string is "]]".

You can change the command delimiters if necessary, but you should be sure to use printable characters that are not in common use, and you should change the delimiters back to the default characters when you are done with the speech processing for which you changed them. For example, if your application needs to speak text that naturally contains the default delimiter characters, it should temporarily change the delimiters to sequences not found in that text. Or, if your application does not wish to support embedded speech commands, it can disable such processing entirely by setting both the begin command delimiter and the end command delimiter to two NIL bytes.
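As a sketch (assuming the Carbon C interfaces, where the soCommandDelimiter selector takes a DelimiterInfo structure and chan is an already-open speech channel), switching to temporary delimiters and later restoring the defaults might look like this:

    #include <ApplicationServices/ApplicationServices.h>  /* Speech Synthesis Manager (Carbon); classic Mac OS uses <Speech.h> */

    /* Sketch: change the command delimiters on an open channel, then restore
       the defaults when the special-case text has been spoken. */
    static OSErr UseCustomDelimiters(SpeechChannel chan)
    {
        DelimiterInfo custom   = { { '{', '{' }, { '}', '}' } };  /* temporary "{{" ... "}}" */
        DelimiterInfo defaults = { { '[', '[' }, { ']', ']' } };  /* the standard "[[" ... "]]" */
        OSErr err = SetSpeechInfo(chan, soCommandDelimiter, &custom);

        /* ... speak text that legitimately contains "[[" or "]]" here ... */

        if (err == noErr)
            err = SetSpeechInfo(chan, soCommandDelimiter, &defaults);  /* restore when done */
        return err;
    }

    /* To disable embedded command processing entirely, pass delimiters of
       two NIL bytes each:  DelimiterInfo off = { { 0, 0 }, { 0, 0 } };   */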
Syntax of Embedded Speech Commands

This section describes the syntax of embedded speech commands in detail. All embedded speech commands must be enclosed by the begin command delimiter and the end command delimiter, as follows:

[[emph +]]

All speech commands require parameters, which immediately follow the speech command. The parameter to the emphasis command above is the plus sign. The format of the parameter depends on the command issued. Numeric parameters include fixed-point numbers, bytes, integers, and 32-bit values. Hexadecimal numbers may be entered using either Pascal or C syntax; $1A22 and 0x1A22 are both acceptable. A common type of parameter is an operating-system type parameter, used generally to specify a particular selector. For example, [[inpt PHON]] changes the text-processing mode so that the Speech Manager interprets text to be composed of phonemes.

Some commands allow you to specify an absolute value by including just a number as the parameter, or to specify a relative value by adding a + or - character. For example, the following command raises the speech volume by 0.1:
[[volm +0.1]]

Your application can place multiple commands within a single set of delimiters by separating them with semicolons; for example:

[[volm 0.3 ; rate 165]]

It is suggested that you precede all other embedded speech commands with a format version command, which tells speech synthesizers the format version that all subsequent embedded speech commands will use. The current format version is 1. You could write a format version command for the current format version like this:

[[vers $00000001]]
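As a sketch of doing this from application code (assuming an already-open speech channel and the C SpeakText call; the buffer size and the volume and rate values are illustrative), the command block can simply be prefixed to the text to be spoken:

    #include <ApplicationServices/ApplicationServices.h>  /* Speech Synthesis Manager (Carbon); classic Mac OS uses <Speech.h> */
    #include <stdio.h>
    #include <string.h>

    /* Sketch: lead with one command block giving the format version, volume,
       and rate, then hand the whole buffer to SpeakText. */
    static OSErr SpeakQuietlyAndSlowly(SpeechChannel chan, const char *phrase)
    {
        char buf[512];                                /* illustrative fixed-size buffer */

        snprintf(buf, sizeof buf, "[[vers $00000001; volm 0.3; rate 165]] %s", phrase);
        return SpeakText(chan, buf, strlen(buf));
    }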
Table 4-1 provides a formalization of the embedded command syntax structure, subject to these conventions:

- Items enclosed in angle brackets (< and >) represent logical units that either are defined further below in the table or are atomic units that should be self-explanatory, in which case explanations are provided in italic type. All logical units are listed in the first column.
- Items enclosed in brackets ([ and ]) are optional.
- Items followed by an ellipsis (...) may be repeated one or more times.
- For items separated by a vertical bar (|), any one of the listed items may be used.
- Multiple space characters between tokens may be used if desired.
- Multiple commands within a single set of delimiters should be separated by semicolons.
Table 4-1  The embedded command syntax structure

Identifier        Syntax
CommandBlock      <BeginDelimiter> <CommandList> <EndDelimiter>
BeginDelimiter    <String1> | <String2>
EndDelimiter      <String1> | <String2>
CommandList       <Command> [; <Command>]...
Command           <CommandSelector> [parameter]...
CommandSelector   <OSType>
Parameter         <OSType> | <String1> | <String2> | <StringN> | <FixedPointValue> | <32BitValue> | <16BitValue> | <8BitValue>
String1           <Character>
String2           <Character> <Character>
StringN           [<Character>...]
OSType            <Character> <Character> <Character> <Character>
32BitValue        <OSType> | <LongInt> | <HexLongInt>
16BitValue        <Integer> | <HexInteger>
8BitValue         <Byte> | <HexByte>
FixedPointValue   <Decimal number: 0.0000 ≤ N ≤ 65,535.9999>
LongInt           <Decimal number: 0 ≤ N ≤ 4,294,967,295>
HexLongInt        <Hex number: 0x00000000 ≤ N ≤ 0xFFFFFFFF>
Integer           <Decimal number: 0 ≤ N ≤ 65,535>
HexInteger        <Hex number: 0x0000 ≤ N ≤ 0xFFFF>
Character         <Any printable character (for example, A, b, *, #, x)>
Byte              <Decimal number: 0 ≤ N ≤ 255>
HexByte           <Hex number: 0x00 ≤ N ≤ 0xFF>
Table 4-2 describes the currently defined embedded speech commands in alphabetical order, using the same syntax conventions as Table 4-1. Note that when you write an embedded speech command, you omit symbols such as the angle brackets, brackets, and ellipses; they are used here only for explanatory purposes.
Table 4-2  Embedded speech commands

Character mode (char)
char NORM | LTRL
The character mode command sets the word-speaking mode of the speech channel. When NORM mode is selected, the synthesizer attempts to automatically convert words into speech. This is the most basic function of the text-to-speech synthesizer. When LTRL mode is selected, the synthesizer speaks every word, number, and symbol character by character. Embedded command processing continues to function normally, however. This embedded speech command is analogous to the soCharacterMode speech information selector.

Comment (cmnt)
cmnt [<Character>...]
The comment command is ignored by speech synthesizers. It enables a developer to insert a comment into a text stream for documentation purposes; the comment is not spoken. Note that all characters following the cmnt selector up to the <EndDelimiter> are part of the comment.

Delimiter (dlim)
dlim <BeginDelimiter> <EndDelimiter>
The delimiter command changes the character sequences that mark the beginning and end of all subsequent commands to the character sequences specified. The new delimiters take effect after the command list containing this command has been completely processed. If the delimiter strings are empty, an error is generated. This embedded speech command is analogous to the soCommandDelimiter speech information selector.
Emphasis (emph)
emph + | -
The emphasis command causes the next word to be spoken with either greater emphasis or less emphasis than would normally be used. Using + forces added emphasis, while using - forces reduced emphasis. For an illustration of the emphasis command, see the section "Examples of Embedded Speech Commands" beginning on page 4-30.

Input mode (inpt)
inpt TEXT | PHON
The input mode command switches the input-processing mode to either normal text mode or phoneme mode. Passing TEXT sets the mode to text mode; passing PHON sets the mode to phoneme mode. Some speech synthesizers might define additional speech input mode selectors. In phoneme mode, characters are interpreted as representing phonemes, as described in "Phonemic Representation of Speech" on page 4-32. This embedded speech command is analogous to the soInputMode speech information selector.

Number mode (nmbr)
nmbr NORM | LTRL
The number mode command sets the number-speaking mode of the speech synthesizer. When NORM mode is selected, the synthesizer attempts to automatically speak numeric strings as intelligently as possible. When LTRL mode is selected, numeric strings are spoken digit by digit. When the word-speaking mode is set to literal via the character mode command or the soCharacterMode speech information selector, numbers are spoken digit by digit regardless of the current number-speaking mode. This embedded speech command is analogous to the soNumberMode speech information selector.
Baseline pitch (pbas)
pbas [+ | -] <FixedPointValue>
The baseline pitch command changes the current speech pitch for the speech channel to the fixed-point value specified. If the pitch value is preceded by a + or - character, the speech pitch is adjusted relative to its current value. Baseline pitch values are always positive numbers in the range from 1.000 to 127.000. This embedded speech command is analogous to the soPitchBase speech information selector. For a discussion of speech pitch, see the section "Speech Attributes" beginning on page 4-6.

Pitch modulation (pmod)
pmod [+ | -] <FixedPointValue>
The pitch modulation command changes the modulation range for the speech channel, based on the modulation depth fixed-point value specified. The actual pitch of generated speech might vary from the baseline pitch up or down by as much as the modulation depth. If the modulation depth value is preceded by a + or - character, the pitch modulation is adjusted relative to its current value. Speech pitches fall in the range of 0.000 to 127.000. This embedded speech command is analogous to the soPitchMod speech information selector. For a discussion of speech pitch, see the section "Speech Attributes" beginning on page 4-6.

Speech rate (rate)
rate [+ | -] <FixedPointValue>
The speech rate command sets the speech rate in words per minute on the speech channel to the fixed-point value specified. If the rate value is preceded by a + or - character, the speech rate is adjusted relative to its current value. Speech rates fall in the range 0.000 to 65535.999, which translates into 50 to 500 words per minute. Normal human speech rates are around 180 to 220 words per minute. This embedded speech command is analogous to the soRate speech information selector. For a discussion of speech rate, see the section "Speech Attributes" beginning on page 4-6.
Reset (rset)
rset <32BitValue>
The reset command resets the speech channel's voice and speech attributes back to their default values. The parameter has no effect; it should be set to 0. This embedded speech command is analogous to the soReset speech information selector.

Silence (slnc)
slnc <32BitValue>
The silence command causes the synthesizer to generate silence for the number of milliseconds specified. The timing of the silence will vary widely between synthesizers. For an illustration of the silence command, see the section "Examples of Embedded Speech Commands" beginning on page 4-30.

Synchronization (sync)
sync <32BitValue>
The synchronization command causes the application's synchronization callback procedure to be executed. The callback is made as the audio corresponding to the next word begins to sound. The callback procedure is passed the 32-bit value specified in the command. Synchronization callback procedures are described in "Synchronization Callback Procedure" beginning on page 4-85.

Format version (vers)
vers <32BitValue>
The format version command informs the speech synthesizer of the format version that subsequent embedded speech commands will use. This command is optional but is recommended to ensure that embedded speech commands are compatible with all versions of the Speech Manager. The current format version is $0001.
Speech volume (volm)
volm [+ | -] <FixedPointValue>
The speech volume command changes the speech volume on the speech channel to the fixed-point value specified. If the volume value is preceded by a + or - character, the speech volume is adjusted relative to its current value. Volumes are expressed in fixed-point units ranging from 0.000 through 1.000. A value of 0.0 corresponds to silence, and a value of 1.0 corresponds to the maximum possible volume. Volume units lie on a scale that is linear with amplitude or voltage. A doubling of perceived loudness corresponds to a doubling of the volume. This embedded speech command is analogous to the soVolume speech information selector.

Synthesizer-specific (xtnd)
xtnd <OSType> [<Parameter>...]
The synthesizer-specific command enables synthesizer-specific commands to be embedded in the input text stream. Synthesizer-specific speech commands are processed by the speech synthesizer whose creator ID is specified in the first parameter and by other speech synthesizers that support commands aimed at the synthesizer with that creator ID. The format of the data following the parameter is entirely dependent on the synthesizer being used. This embedded speech command is analogous to the soSynthExtension speech information selector, described in "Speech Information Selectors" beginning on page 4-39.

While embedded speech commands are being processed, several types of errors might be detected and reported to your application. If you have enabled error callbacks by using the SetSpeechInfo function with the soErrorCallBack selector, the error callback procedure will be executed once for every error that is detected, as described in "Error Callback Procedure" beginning on page 4-86. If you have not enabled error callbacks, you can still obtain information about the errors encountered by calling the GetSpeechInfo function with the soErrors selector.
A number of errors specific to the processing of embedded speech commands might be detected and reported in these ways.
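For example, a sketch of polling a channel for accumulated command errors (assuming the Carbon C interfaces, where the soErrors selector fills in a SpeechErrorInfo structure) might look like this:

    #include <ApplicationServices/ApplicationServices.h>  /* Speech Synthesis Manager (Carbon); classic Mac OS uses <Speech.h> */
    #include <stdio.h>

    /* Sketch: ask the channel what embedded-command errors it has recorded
       and report the most recent one along with its position in the text. */
    static void ReportEmbeddedCommandErrors(SpeechChannel chan)
    {
        SpeechErrorInfo info;

        if (GetSpeechInfo(chan, soErrors, &info) == noErr && info.count > 0) {
            printf("%d embedded command error(s); most recent error %d near byte offset %ld\n",
                   (int) info.count, (int) info.newest, (long) info.newPos);
        }
    }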
Examples of Embedded Speech Commands

If you use just a few of the embedded speech commands, you can markedly increase the understandability of text spoken by your application. Your application knows more about the speech being produced than a speech synthesizer does. A synthesizer speaks text according to a predetermined set of rules about language production; as a result, the voices available on a Macintosh computer with the Speech Manager installed sound synthetic and sometimes robotic, because the pronunciation rules are formalized. You can make the speech produced by the synthesizer sound much more human by observing some simple rules of human speech and embedding speech commands in text accordingly. The techniques presented in this section can be applied when your application is carrying on a dialog with the user or speaking error messages or announcements.

The most common technique humans use in speaking is emphasizing or deemphasizing words in a sentence. This change in emphasis vocally highlights new and important information for the listener, making it easier to recognize the important or different words in a sentence. For example, in a calendar-scheduling program, your application might speak a list of appointments for a day. The following text strings would all be spoken with the same tune and rhythm:

At 4pm you have a meeting with Kim Silver. At 6pm you have a meeting with Tim Johnson. At 7pm you have a meeting with Mark Smith.

The example that follows shows how you can use embedded speech commands to deemphasize repeated words in similar sentences and highlight the new information in each sentence. The first sentence of the example sounds fairly acceptable as is. The second sentence deemphasizes the repeated words have and meeting to point out the new information: with whom the meeting is. The choice of which words to emphasize or deemphasize is based on what was spoken in the preceding sentence.

To use the embedded command emph (emphasis), you insert it, followed by a plus or minus sign, before the word you want emphasized or deemphasized. The emph command lasts for a duration of one word.
At 4:15 you have a meeting with Ray Chiang. At 6:30, you [[emph -]] have a [[emph -]] meeting with William Ortiz. At 7pm, you [[emph -]] have a [[emph -]] meeting with Eric Braz Ford.

As shown in the next example, you can further enhance this text by spelling out the numbers so that you can emphasize changes in increments of time. For example, the following sentences deemphasize the repeated word six to highlight the difference between the meetings, which both occur between six and seven o'clock.
At four fifteen you have a meeting with Lori Kaplan. At six [[emph -]] fifteen, you [[emph -]] have a [[emph -]] meeting with Tim Monroe. At [[emph -]] six thirty, you [[emph -]] have a [[emph -]] meeting with Michael Abrams.

Another use of the emphasis command is to make confusing, boring, or mechanical-sounding text more understandable. One example of this is strings of nouns that refer to one entity (called complex nominals) that, when spoken differently, have a different meaning.
1a. Steel warehouse.
1b. Steel [[emph -]] warehouse.
2a. French teachers.
2b. French [[emph -]] teachers.

In the first example, phrase 1a, steel warehouse, refers to a warehouse made of steel, in which anything could be stored. But phrase 1b describes a warehouse of unspecified construction in which steel is stored. In the second example, phrase 2a, French teachers, refers to teachers from France who teach any subject. In the same example, phrase 2b specifies people from anywhere who teach French classes. You can use this technique of deemphasizing words in phrases to help users correctly understand the meaning of text spoken by your application.

You can also use the emph command to emphasize words in order to contrast them. You contrast words that are similar to words found later in a sentence to help the listener distinguish the new information.
You have [[emph +]] 3 text [[emph -]] messages, two fax [[emph -]] messages, and [[emph +]] one [[emph +]] voice [[emph -]] message.

This example emphasizes the words related to the number of messages and the type of messages to help the listener discern the different kinds of information being presented.

Another common speaking technique that humans use is to pause before starting to speak about a new idea or before beginning a new paragraph. Adding a slnc (silence) command before beginning to speak a new idea or paragraph makes the synthetic voice sound the way a person does when taking a breath between ideas. This technique works best if you also raise the pitch range (using the pmod and pbas embedded commands) for the first sentence of the new paragraph. You must remember to lower the pitch range again to achieve the desired effect.
[[emph -; pmod +1; pbas +1]] Good morning! [[pmod -1; pbas -1]] This is a [[emph +]] newer [[emph -]] version of Apple's speech synthesis. The previous [[emph -]] version has already been [[emph -]] adopted by many developers. Users have sent us many positive [[emph +]] reports. [[slnc 500; pmod +1; pbas +1]] This newer [[emph -]] version has better signal [[emph -]] processing [[pmod -1; pbas -1]], new pitch [[emph -]] contours, and a new compression. It still doesn't [[emph -]] sound perfect, but people find it easier to understand.

This example deemphasizes the first word of the utterance but raises the pitch to make the greeting sound more like a human would speak it. Then words are emphasized or deemphasized according to the techniques discussed previously. Silence is introduced before the new paragraph to signal a change in thought process. The pitch is raised and then lowered again after the first phrase. Note that you don't have to wait a full sentence before changing the pitch back to its previous value. It's best to work with these techniques until you find the most human-sounding utterances.
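To apply these conventions from application code, the calendar example above could be generated along the following lines. This is a minimal sketch; the SpeakAppointment name, the buffer size, and the parameters are illustrative, and chan is assumed to be an already-open speech channel.

    #include <ApplicationServices/ApplicationServices.h>  /* Speech Synthesis Manager (Carbon); classic Mac OS uses <Speech.h> */
    #include <stdio.h>
    #include <string.h>

    /* Sketch: deemphasize the words repeated in every appointment sentence
       ("have", "meeting") so that the new information (the time and the name)
       stands out, as in the calendar example above. */
    static OSErr SpeakAppointment(SpeechChannel chan, const char *timeStr, const char *personName)
    {
        char buf[256];

        snprintf(buf, sizeof buf,
                 "At %s, you [[emph -]] have a [[emph -]] meeting with %s.",
                 timeStr, personName);
        return SpeakText(chan, buf, strlen(buf));
    }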