Important: Inside Macintosh: Sound is deprecated as of Mac OS X v10.5. For new audio development in Mac OS X, use Core Audio. See the Audio page in the ADC Reference Library.
Writing Embedded Speech Commands
Embedded speech commands let you fine-tune the quality of speech output, making speech considerably easier to understand than text spoken in the synthesizer's default manner. An embedded speech command is a command placed within a text buffer to be spoken by the Speech Manager; it instructs the Speech Manager to take a particular action. For example, you could use an embedded speech command to emphasize a particular word in a text string so that it stands out to the user.

An advantage of this technique is that your application needs to call only the standard functions that generate speech: SpeakString, SpeakText, or SpeakBuffer. To change the way a phrase is generated, you do not need to change any of your application's code; you merely need to change the embedded command text.

Your application can also use embedded speech commands even if it speaks text created by the user, as opposed to a limited set of phrases. Before passing text to the Speech Manager, your application could embed various commands within the text. For example, a word-processing application might embed commands that tell the Speech Manager to put extra emphasis around words that the user has boldfaced or underlined.
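For instance, the following sketch speaks a string containing an emphasis command by passing it straight to SpeakText. This assumes the Carbon C interfaces to the Speech Synthesis Manager; the channel setup, the sample text, and the crude busy-wait are illustrative only.

    #include <ApplicationServices/ApplicationServices.h>  /* Speech Synthesis Manager (Carbon); classic Mac OS uses <Speech.h> */
    #include <string.h>

    /* Minimal sketch: the embedded [[emph +]] command simply rides along in the
       ordinary text buffer; no special call is needed to make use of it. */
    static void SpeakWithEmphasis(void)
    {
        SpeechChannel chan = NULL;
        const char   *text = "Your meeting has been [[emph +]] cancelled.";

        if (NewSpeechChannel(NULL, &chan) != noErr)   /* NULL voice means the default voice */
            return;

        SpeakText(chan, text, strlen(text));          /* speaks asynchronously */

        while (SpeechBusy() > 0)                      /* crude wait; a real application would keep servicing its event loop */
            ;

        DisposeSpeechChannel(chan);
    }

Because the emphasis lives in the text rather than in the code, changing the spoken result is just a matter of editing the command text.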
Embedded Command Delimiters

When processing input text data, speech synthesizers look for special sequences of characters called command delimiters. These character sequences are usually defined to be unusual pairings of printable characters that would not normally appear in the text. When a begin command delimiter string is encountered in the text, the following characters are assumed to contain one or more commands. The synthesizer attempts to parse and process these commands until an end command delimiter string is encountered. By default, the begin command delimiter string is "[[", and the end command delimiter string is "]]".

You can change the command delimiters if necessary, but you should be sure to use printable characters that are not in common use, and you should change the delimiters back to the default characters when you are done with the speech processing for which you changed them. For example, if your application needs to speak text that naturally contains the default delimiter characters, it should temporarily change the delimiters to sequences not found in that text. Or, if your application does not wish to support embedded speech commands, it can disable such processing entirely by setting both the begin command delimiter and the end command delimiter to two NIL bytes.
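As a sketch (assuming the Carbon C interfaces, where the soCommandDelimiter selector takes a DelimiterInfo structure and chan is an already-open speech channel), switching to temporary delimiters and later restoring the defaults might look like this:

    #include <ApplicationServices/ApplicationServices.h>  /* Speech Synthesis Manager (Carbon); classic Mac OS uses <Speech.h> */

    /* Sketch: change the command delimiters on an open channel, then restore
       the defaults when the special-case text has been spoken. */
    static OSErr UseCustomDelimiters(SpeechChannel chan)
    {
        DelimiterInfo custom   = { { '{', '{' }, { '}', '}' } };  /* temporary "{{" ... "}}" */
        DelimiterInfo defaults = { { '[', '[' }, { ']', ']' } };  /* the standard "[[" ... "]]" */
        OSErr err = SetSpeechInfo(chan, soCommandDelimiter, &custom);

        /* ... speak text that legitimately contains "[[" or "]]" here ... */

        if (err == noErr)
            err = SetSpeechInfo(chan, soCommandDelimiter, &defaults);  /* restore when done */
        return err;
    }

    /* To disable embedded command processing entirely, pass delimiters of
       two NIL bytes each:  DelimiterInfo off = { { 0, 0 }, { 0, 0 } };   */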
Syntax of Embedded Speech Commands

This section describes the syntax of embedded speech commands in detail. All embedded speech commands must be enclosed by the begin command delimiter and the end command delimiter, as follows:

[[emph +]]

All speech commands require parameters, which immediately follow the speech command. The parameter to the emphasis command above is the plus sign. The format of the parameter depends on the command issued. Numeric parameters include fixed-point numbers, bytes, integers, and 32-bit values. Hexadecimal numbers may be entered using either Pascal or C syntax; $1A22 and 0x1A22 are both acceptable. A common type of parameter is an operating-system type parameter, used generally to specify a particular selector. For example, [[inpt PHON]] changes the text-processing mode so that the Speech Manager interprets text to be composed of phonemes.

Some commands allow you to specify an absolute value by including just a number as the parameter, or to specify a relative value by adding a + or - character. For example, the following command raises the speech volume by 0.1:
[[volm +0.1]]

Your application can place multiple commands within a single set of delimiters by separating them with semicolons; for example:

[[volm 0.3 ; rate 165]]

It is suggested that you precede all other embedded speech commands with a format version command, which tells speech synthesizers the format version that all subsequent embedded speech commands will use. The current format version is 1. You could write a format version command for the current format version like this:

[[vers $00000001]]
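As a sketch of doing this from application code (assuming an already-open speech channel and the C SpeakText call; the buffer size and the volume and rate values are illustrative), the command block can simply be prefixed to the text to be spoken:

    #include <ApplicationServices/ApplicationServices.h>  /* Speech Synthesis Manager (Carbon); classic Mac OS uses <Speech.h> */
    #include <stdio.h>
    #include <string.h>

    /* Sketch: lead with one command block giving the format version, volume,
       and rate, then hand the whole buffer to SpeakText. */
    static OSErr SpeakQuietlyAndSlowly(SpeechChannel chan, const char *phrase)
    {
        char buf[512];                                /* illustrative fixed-size buffer */

        snprintf(buf, sizeof buf, "[[vers $00000001; volm 0.3; rate 165]] %s", phrase);
        return SpeakText(chan, buf, strlen(buf));
    }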
Table 4-1 provides a formalization of the embedded command syntax structure, subject to these conventions:

- Items enclosed in angle brackets (< and >) represent logical units that either are defined further below in the table or are atomic units that should be self-explanatory, in which case explanations are provided in italic type. All logical units are listed in the first column.
- Items enclosed in brackets ([ and ]) are optional.
- Items followed by an ellipsis (...) may be repeated one or more times.
- For items separated by a vertical bar (|), any one of the listed items may be used.
- Multiple space characters between tokens may be used if desired.
- Multiple commands within a single set of delimiters should be separated by semicolons.
Table 4-1  The embedded command syntax structure

Identifier        Syntax
CommandBlock      <BeginDelimiter> <CommandList> <EndDelimiter>
BeginDelimiter    <String1> | <String2>
EndDelimiter      <String1> | <String2>
CommandList       <Command> [; <Command>]...
Command           <CommandSelector> [parameter]...
CommandSelector   <OSType>
Parameter         <OSType> | <String1> | <String2> | <StringN> | <FixedPointValue> | <32BitValue> | <16BitValue> | <8BitValue>
String1           <Character>
String2           <Character> <Character>
StringN           [<Character>...]
OSType            <Character> <Character> <Character> <Character>
32BitValue        <OSType> | <LongInt> | <HexLongInt>
16BitValue        <Integer> | <HexInteger>
8BitValue         <Byte> | <HexByte>
FixedPointValue   <Decimal number: 0.0000 ≤ N ≤ 65,535.9999>
LongInt           <Decimal number: 0 ≤ N ≤ 4,294,967,295>
HexLongInt        <Hex number: 0x00000000 ≤ N ≤ 0xFFFFFFFF>
Integer           <Decimal number: 0 ≤ N ≤ 65,535>
HexInteger        <Hex number: 0x0000 ≤ N ≤ 0xFFFF>
Character         <Any printable character (for example, A, b, *, #, x)>
Byte              <Decimal number: 0 ≤ N ≤ 255>
HexByte           <Hex number: 0x00 ≤ N ≤ 0xFF>
Table 4-2 describes the currently defined embedded speech commands in alphabetical order, using the same syntax conventions as Table 4-1. Note that when you write an embedded speech command, you omit symbols such as the angle brackets, brackets, and ellipses; they are used here only for explanatory purposes.
Table 4-2  Embedded speech commands

Character mode (char)
char NORM | LTRL
The character mode command sets the word-speaking mode of the speech channel. When NORM mode is selected, the synthesizer attempts to automatically convert words into speech. This is the most basic function of the text-to-speech synthesizer. When LTRL mode is selected, the synthesizer speaks every word, number, and symbol character by character. Embedded command processing continues to function normally, however. This embedded speech command is analogous to the soCharacterMode speech information selector.

Comment (cmnt)
cmnt [<Character>...]
The comment command is ignored by speech synthesizers. It enables a developer to insert a comment into a text stream for documentation purposes; the comment is not spoken. Note that all characters following the cmnt selector up to the <EndDelimiter> are part of the comment.

Delimiter (dlim)
dlim <BeginDelimiter> <EndDelimiter>
The delimiter command changes the character sequences that mark the beginning and end of all subsequent commands to the character sequences specified. The new delimiters take effect after the command list containing this command has been completely processed. If the delimiter strings are empty, an error is generated. This embedded speech command is analogous to the soCommandDelimiter speech information selector.
Emphasis (emph)
emph + | -
The emphasis command causes the next word to be spoken with either greater emphasis or less emphasis than would normally be used. Using + forces added emphasis, while using - forces reduced emphasis. For an illustration of the emphasis command, see the section "Examples of Embedded Speech Commands" beginning on page 4-30.

Input mode (inpt)
inpt TEXT | PHON
The input mode command switches the input-processing mode to either normal text mode or phoneme mode. Passing TEXT sets the mode to text mode; passing PHON sets the mode to phoneme mode. Some speech synthesizers might define additional speech input mode selectors. In phoneme mode, characters are interpreted as representing phonemes, as described in "Phonemic Representation of Speech" on page 4-32. This embedded speech command is analogous to the soInputMode speech information selector.

Number mode (nmbr)
nmbr NORM | LTRL
The number mode command sets the number-speaking mode of the speech synthesizer. When NORM mode is selected, the synthesizer attempts to automatically speak numeric strings as intelligently as possible. When LTRL mode is selected, numeric strings are spoken digit by digit. When the word-speaking mode is set to literal via the character mode command or the soCharacterMode speech information selector, numbers are spoken digit by digit regardless of the current number-speaking mode. This embedded speech command is analogous to the soNumberMode speech information selector.
Baseline pitch (pbas)
pbas [+ | -] <FixedPointValue>
The baseline pitch command changes the current speech pitch for the speech channel to the fixed-point value specified. If the pitch value is preceded by a + or - character, the speech pitch is adjusted relative to its current value. Baseline pitch values are always positive numbers in the range from 1.000 to 127.000. This embedded speech command is analogous to the soPitchBase speech information selector. For a discussion of speech pitch, see the section "Speech Attributes" beginning on page 4-6.

Pitch modulation (pmod)
pmod [+ | -] <FixedPointValue>
The pitch modulation command changes the modulation range for the speech channel, based on the modulation depth fixed-point value specified. The actual pitch of generated speech might vary from the baseline pitch up or down by as much as the modulation depth. If the modulation depth value is preceded by a + or - character, the pitch modulation is adjusted relative to its current value. Speech pitches fall in the range of 0.000 to 127.000. This embedded speech command is analogous to the soPitchMod speech information selector. For a discussion of speech pitch, see the section "Speech Attributes" beginning on page 4-6.

Speech rate (rate)
rate [+ | -] <FixedPointValue>
The speech rate command sets the speech rate in words per minute on the speech channel to the fixed-point value specified. If the rate value is preceded by a + or - character, the speech rate is adjusted relative to its current value. Speech rates fall in the range 0.000 to 65535.999, which translates into 50 to 500 words per minute. Normal human speech rates are around 180 to 220 words per minute. This embedded speech command is analogous to the soRate speech information selector. For a discussion of speech rate, see the section "Speech Attributes" beginning on page 4-6.
Reset (rset)
rset <32BitValue>
The reset command resets the speech channel's voice and speech attributes back to their default values. The parameter has no effect; it should be set to 0. This embedded speech command is analogous to the soReset speech information selector.

Silence (slnc)
slnc <32BitValue>
The silence command causes the synthesizer to generate silence for the number of milliseconds specified. The timing of the silence will vary widely between synthesizers. For an illustration of the silence command, see the section "Examples of Embedded Speech Commands" beginning on page 4-30.

Synchronization (sync)
sync <32BitValue>
The synchronization command causes the application's synchronization callback procedure to be executed. The callback is made as the audio corresponding to the next word begins to sound. The callback procedure is passed the 32-bit value specified in the command. Synchronization callback procedures are described in "Synchronization Callback Procedure" beginning on page 4-85.

Format version (vers)
vers <32BitValue>
The format version command informs the speech synthesizer of the format version that subsequent embedded speech commands will use. This command is optional but is recommended to ensure that embedded speech commands are compatible with all versions of the Speech Manager. The current format version is $0001.
Speech volume (volm)
volm [+ | -] <FixedPointValue>
The speech volume command changes the speech volume on the speech channel to the fixed-point value specified. If the volume value is preceded by a + or - character, the speech volume is adjusted relative to its current value. Volumes are expressed in fixed-point units ranging from 0.000 through 1.000. A value of 0.0 corresponds to silence, and a value of 1.0 corresponds to the maximum possible volume. Volume units lie on a scale that is linear with amplitude or voltage. A doubling of perceived loudness corresponds to a doubling of the volume. This embedded speech command is analogous to the soVolume speech information selector.

Synthesizer-specific (xtnd)
xtnd <OSType> [<Parameter>...]
The synthesizer-specific command enables synthesizer-specific commands to be embedded in the input text stream. Synthesizer-specific speech commands are processed by the speech synthesizer whose creator ID is specified in the first parameter and by other speech synthesizers that support commands aimed at the synthesizer with that creator ID. The format of the data following the parameter is entirely dependent on the synthesizer being used. This embedded speech command is analogous to the soSynthExtension speech information selector, described in "Speech Information Selectors" beginning on page 4-39.

While embedded speech commands are being processed, several types of errors might be detected and reported to your application. If you have enabled error callbacks by using the SetSpeechInfo function with the soErrorCallBack selector, the error callback procedure will be executed once for every error that is detected, as described in "Error Callback Procedure" beginning on page 4-86. If you have not enabled error callbacks, you can still obtain information about the errors encountered by calling the GetSpeechInfo function with the soErrors selector.
A number of errors specific to the processing of embedded speech commands might be detected and reported in these ways.
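For example, a sketch of polling a channel for accumulated command errors (assuming the Carbon C interfaces, where the soErrors selector fills in a SpeechErrorInfo structure) might look like this:

    #include <ApplicationServices/ApplicationServices.h>  /* Speech Synthesis Manager (Carbon); classic Mac OS uses <Speech.h> */
    #include <stdio.h>

    /* Sketch: ask the channel what embedded-command errors it has recorded
       and report the most recent one along with its position in the text. */
    static void ReportEmbeddedCommandErrors(SpeechChannel chan)
    {
        SpeechErrorInfo info;

        if (GetSpeechInfo(chan, soErrors, &info) == noErr && info.count > 0) {
            printf("%d embedded command error(s); most recent error %d near byte offset %ld\n",
                   (int) info.count, (int) info.newest, (long) info.newPos);
        }
    }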
Examples of Embedded Speech Commands

If you use just a few of the embedded speech commands, you can markedly increase the understandability of text spoken by your application. Your application knows more about the speech being produced than a speech synthesizer does. A synthesizer speaks text according to a predetermined set of rules about language production; as a result, the voices available on a Macintosh computer with the Speech Manager installed sound synthetic and sometimes robotic, because the pronunciation rules are formalized. You can make the speech produced by the synthesizer sound much more human by observing some simple rules of human speech and embedding speech commands in text accordingly. The techniques presented in this section can be applied when your application is carrying on a dialog with the user or speaking error messages or announcements.

The most common technique humans use in speaking is emphasizing or deemphasizing words in a sentence. This change in emphasis vocally highlights new and important information for the listener, making it easier to recognize the important or different words in a sentence. For example, in a calendar-scheduling program, your application might speak a list of appointments for a day. The following text strings would all be spoken with the same tune and rhythm:

At 4pm you have a meeting with Kim Silver. At 6pm you have a meeting with Tim Johnson. At 7pm you have a meeting with Mark Smith.

The example that follows shows how you can use embedded speech commands to deemphasize repeated words in similar sentences and highlight the new information in each sentence. The first sentence of the example sounds fairly acceptable as is. The second sentence deemphasizes the repeated words have and meeting to point out the new information: with whom the meeting is. The choice of which words to emphasize or deemphasize is based on what was spoken in the preceding sentence.

To use the embedded command emph (emphasis), you insert it, followed by a plus or minus sign, before the word you want emphasized or deemphasized. The emph command lasts for a duration of one word.
At 4:15 you have a meeting with Ray Chiang. At 6:30, you [[emph -]] have a [[emph -]] meeting with William Ortiz. At 7pm, you [[emph -]] have a [[emph -]] meeting with Eric Braz Ford.

As shown in the next example, you can further enhance this text by spelling out the numbers so that you can emphasize changes in increments of time. For example, the following sentences deemphasize the repeated word six to highlight the difference between the meetings, which both occur between six and seven o'clock.
At four fifteen you have a meeting with Lori Kaplan. At six [[emph -]] fifteen, you [[emph -]] have a [[emph -]] meeting with Tim Monroe. At [[emph -]] six thirty, you [[emph -]] have a [[emph -]] meeting with Michael Abrams.

Another use of the emphasis command is to make confusing, boring, or mechanical-sounding text more understandable. One example of this is strings of nouns that refer to one entity (called complex nominals) that, when spoken differently, have a different meaning.
1a. Steel warehouse.
1b. Steel [[emph -]] warehouse.
2a. French teachers.
2b. French [[emph -]] teachers.

In the first example, phrase 1a, steel warehouse, refers to a warehouse made of steel, in which anything could be stored. But phrase 1b describes a warehouse of unspecified construction in which steel is stored. In the second example, phrase 2a, French teachers, refers to teachers from France who teach any subject. In the same example, phrase 2b specifies people from anywhere who teach French classes. You can use this technique of deemphasizing words in phrases to help users correctly understand the meaning of text spoken by your application.

You can also use the emph command to emphasize words in order to contrast them. You contrast words that are similar to words found later in a sentence to help the listener distinguish the new information.
You have [[emph +]] 3 text [[emph -]] messages, two fax [[emph -]] messages, and [[emph +]] one [[emph +]] voice [[emph -]] message.

This example emphasizes the words related to the number of messages and the type of messages to help the listener discern the different kinds of information being presented.

Another common speaking technique that humans use is to pause before starting to speak about a new idea or before beginning a new paragraph. Adding a slnc (silence) command before beginning to speak a new idea or paragraph makes the synthetic voice sound the way a person does when taking a breath between ideas. This technique works best if you also raise the pitch range (using the pmod and pbas embedded commands) for the first sentence of the new paragraph. You must remember to lower the pitch range again to achieve the desired effect.
[[emph -; pmod +1; pbas +1]] Good morning! [[pmod -1; pbas -1]] This is a [[emph +]] newer [[emph -]] version of Apple's speech synthesis. The previous [[emph -]] version has already been [[emph -]] adopted by many developers. Users have sent us many positive [[emph +]] reports. [[slnc 500; pmod +1; pbas +1]] This newer [[emph -]] version has better signal [[emph -]] processing [[pmod -1; pbas -1]], new pitch [[emph -]] contours, and a new compression. It still doesn't [[emph -]] sound perfect, but people find it easier to understand.

This example deemphasizes the first word of the utterance but raises the pitch to make the greeting sound more like a human would speak it. Then words are emphasized or deemphasized according to the techniques discussed previously. Silence is introduced before the new paragraph to signal a change in thought process. The pitch is raised and then lowered again after the first phrase. Note that you don't have to wait a full sentence before changing the pitch back to its previous value. It's best to work with these techniques until you find the most human-sounding utterances.
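To apply these conventions from application code, the calendar example above could be generated along the following lines. This is a minimal sketch; the SpeakAppointment name, the buffer size, and the parameters are illustrative, and chan is assumed to be an already-open speech channel.

    #include <ApplicationServices/ApplicationServices.h>  /* Speech Synthesis Manager (Carbon); classic Mac OS uses <Speech.h> */
    #include <stdio.h>
    #include <string.h>

    /* Sketch: deemphasize the words repeated in every appointment sentence
       ("have", "meeting") so that the new information (the time and the name)
       stands out, as in the calendar example above. */
    static OSErr SpeakAppointment(SpeechChannel chan, const char *timeStr, const char *personName)
    {
        char buf[256];

        snprintf(buf, sizeof buf,
                 "At %s, you [[emph -]] have a [[emph -]] meeting with %s.",
                 timeStr, personName);
        return SpeakText(chan, buf, strlen(buf));
    }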