Locate relevant passages in a document by asking the Bidirectional Encoder Representations from Transformers (BERT) model a question.
- iOS 13.0+
- Xcode 11.0+
- Mac Catalyst 13.0+
- Core ML
This sample app leverages the BERT model to find the answer to a user’s question in a body of text. The model accepts text from a document and a question, in natural English, about the document. The model responds with the location of a passage within the document text that answers the question. For example, given the text, “The quick brown fox jumps over the lethargic dog.”, with the question “Who jumped over the dog?”, the BERT model’s predicted answer is, “the quick brown fox”.
The BERT model does not generate new sentences to answer a given question. It finds the passage in a document that’s most likely to answer the question.
The sample leverages the BERT model by:
Importing the BERT model’s vocabulary into a dictionary
Breaking up the document and question texts into tokens
Converting the tokens to ID numbers using the vocabulary dictionary
Packing the converted token IDs into the model’s input format
Calling the BERT model’s prediction(from:) method
Locating the answer by analyzing the BERT model’s output
Extracting that answer from the original document text
Configure the Sample Code Project
Before you run the sample code project in Xcode, use a device with either:
iOS 13 or later
macOS 10.15 or later
Build the Vocabulary
The first step to using the BERT model is to import its vocabulary. The sample creates a vocabulary dictionary by splitting the vocabulary file into lines, each of which has one token.
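A minimal sketch of this step, assuming the vocabulary file's contents are already in memory as a string with one token per line. The function name makeVocabulary(from:) is hypothetical and isn't taken from the sample; each token's zero-based line number becomes its ID.

```swift
// Hypothetical helper, not the sample's own code: builds a token-to-ID
// dictionary from the text of a vocabulary file with one token per line.
func makeVocabulary(from fileContents: String) -> [String: Int] {
    var vocabulary: [String: Int] = [:]
    let newline: Character = "\n"
    // Keep empty lines while splitting so later line numbers don't shift.
    let lines = fileContents.split(separator: newline, omittingEmptySubsequences: false)
    for (lineNumber, line) in lines.enumerated() where !line.isEmpty {
        // The zero-based line number is the token's ID.
        vocabulary[String(line)] = lineNumber
    }
    return vocabulary
}

// A three-line toy vocabulary; a real vocabulary file is much larger.
let toyVocabulary = makeVocabulary(from: "[PAD]\n[unused0]\nthe\n")
```

This sketch assumes the file uses "\n" line endings; a file with "\r\n" endings would need a different split.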
The load method creates a dictionary entry for each token, and each token occupies an entire line in the vocabulary text file. The method assigns each token’s (zero-based) line number as its value. For example, the first token, "[PAD]", has an ID of 0, and the 5,001st token, "knight", has an ID of 5000.
Split the Text into Word Tokens
The BERT model requires you to convert each word into one or more token IDs. Before you can use the vocabulary dictionary to find those IDs, you must divide the document’s text and the question’s text into word tokens.
The sample does this by using an NLTagger, which breaks up a string into word tokens, each of which is a substring of the original. The sample’s word method adds each word token to an array as the tagger enumerates through them.
The sample app uses the tagger to split each string into tokens by calling its enumerateTags(in:unit:scheme:options:using:) method with the tokenType tagging scheme and the NLTokenUnit.word token unit.
Convert Word or Wordpiece Tokens into Their IDs
For speed and efficiency, the BERT model operates on token IDs, which are numbers that represent tokens, rather than on the text tokens themselves. The sample’s wordpiece method converts each word token into its ID by looking it up in the vocabulary dictionary.
If a word token doesn’t exist in the vocabulary, the method looks for subtokens, or wordpieces. A wordpiece is a component of a larger word token. For example, the word lethargic isn’t in the vocabulary but its wordpieces, let, har, and gic are. Dividing the vocabulary’s large words into wordpieces reduces the vocabulary size and makes the BERT model more flexible. The model can understand words that aren’t explicitly in the vocabulary by combining their wordpieces.
Secondary wordpieces, such as har and gic, each appear in the vocabulary with two leading pound signs, as ##har and ##gic.
Continuing the example, the method converts document text into the word and wordpiece token IDs shown in the following figure.
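The wordpiece lookup can be sketched as follows. This is an illustration only: it assumes the greedy longest-match-first strategy commonly used for WordPiece vocabularies and an uncased vocabulary, neither of which the text above confirms for the sample. The function name, the toy vocabulary, and its ID values are all hypothetical.

```swift
// Hypothetical sketch: converts one word token into one or more IDs.
// Tries the longest possible piece first, then shrinks it until a match is found.
func wordpieceIDs(for word: String, vocabulary: [String: Int], unknownID: Int) -> [Int] {
    let characters = Array(word.lowercased())  // assumes an uncased vocabulary
    var ids: [Int] = []
    var start = 0
    while start < characters.count {
        var end = characters.count
        var matchedID: Int? = nil
        while end > start {
            var piece = String(characters[start..<end])
            // Secondary wordpieces carry two leading pound signs.
            if start > 0 { piece = "##" + piece }
            if let id = vocabulary[piece] {
                matchedID = id
                break
            }
            end -= 1
        }
        // No piece matched at this position: treat the whole word as unknown.
        guard let id = matchedID else { return [unknownID] }
        ids.append(id)
        start = end
    }
    return ids
}

// Toy vocabulary with made-up IDs, for illustration only.
let toyVocabulary = ["let": 10, "##har": 11, "##gic": 12, "fox": 13]
let lethargicIDs = wordpieceIDs(for: "lethargic", vocabulary: toyVocabulary, unknownID: 99)
let foxIDs = wordpieceIDs(for: "fox", vocabulary: toyVocabulary, unknownID: 99)
let zebraIDs = wordpieceIDs(for: "zebra", vocabulary: toyVocabulary, unknownID: 99)
```

With this toy vocabulary, lethargic splits into let, ##har, and ##gic, fox maps to a single ID, and zebra falls back to the unknown ID.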
Prepare the Model Input
The BERT model has two inputs:
wordIDs: Accepts the document and question texts
wordTypes: Tells the BERT model which elements of wordIDs are from the document
The sample creates the wordIDs array by arranging the token IDs in the following order:
A classification start token ID, which has a value of 101 and appears as "[CLS]" in the vocabulary file
The token IDs from the question string
A separator token ID, which has a value of 102 and appears as "[SEP]" in the vocabulary file
The token IDs from the document text string
Another separator token ID
One or more padding token IDs for the remaining, unused elements, which have a value of 0 and appear as "[PAD]" in the vocabulary file
Next, the sample prepares the wordTypes input by creating an array of the same length, where all the elements that correspond to the document text are 1 and all others are 0.
Continuing the example, the sample arranges the two input arrays with the values shown in the figure below.
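The arrangement described above can be sketched with plain Swift arrays, before any conversion to the model's input type. The type ModelInput, the function makeModelInput, and the property names wordIDs and wordTypes are assumptions made for this illustration; the special token values 101, 102, and 0 and the 384-element length come from this article.

```swift
// Hypothetical container for the two input arrays.
struct ModelInput: Equatable {
    var wordIDs: [Int]
    var wordTypes: [Int]
}

// Hypothetical helper: packs question and document token IDs into
// fixed-length arrays, or returns nil if they don't fit.
func makeModelInput(questionIDs: [Int], documentIDs: [Int], sequenceLength: Int = 384) -> ModelInput? {
    let classificationID = 101  // "[CLS]"
    let separatorID = 102       // "[SEP]"
    let paddingID = 0           // "[PAD]"

    // Three overhead tokens: one classification start and two separators.
    let usedCount = questionIDs.count + documentIDs.count + 3
    guard usedCount <= sequenceLength else { return nil }

    var wordIDs = [classificationID]
    wordIDs += questionIDs
    wordIDs.append(separatorID)
    let documentStart = wordIDs.count
    wordIDs += documentIDs
    wordIDs.append(separatorID)
    wordIDs += Array(repeating: paddingID, count: sequenceLength - usedCount)

    // Mark only the elements that hold document tokens with 1.
    var wordTypes = Array(repeating: 0, count: sequenceLength)
    for index in documentStart..<(documentStart + documentIDs.count) {
        wordTypes[index] = 1
    }
    return ModelInput(wordIDs: wordIDs, wordTypes: wordTypes)
}

// Toy IDs and a short sequence length keep the result easy to read.
let example = makeModelInput(questionIDs: [7, 8], documentIDs: [20, 21, 22], sequenceLength: 10)
let tooLong = makeModelInput(questionIDs: [1, 2, 3, 4, 5], documentIDs: [6, 7, 8, 9, 10], sequenceLength: 10)
```

Returning nil when the tokens don't fit is one possible design choice; an app could instead truncate the document or split it into sections.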
Next, the sample creates an MLMultiArray for each input and copies the contents from the arrays. It then uses these multiarrays to create a BERTQAFP16Input feature provider.
Make a Prediction
To predict where the answer to the question lies in the document text, give the BERT model your input feature provider, which holds the input MLMultiArray instances. The sample then calls the model’s
prediction(from:) method in the app’s
Find the Answer
You locate the answer to the question by analyzing the output from the BERT model. The model produces two outputs, startLogits and endLogits. Each logit is a raw confidence score for where the BERT model predicts an answer begins or ends.
In this example, the best start and end logits are
7 for the tokens
"fox", respectively. The sample finds the indices of the highest-value starting and ending logits by:
Converting each output logit
Isolating the logits relevant to the document.
Finding the indices, in each array, of the 20 logits with the highest values.
Searching through the 20 x 20 or fewer combinations of logits for the best combination.
In this example, the indices of the best start and end logits are
11, respectively. The answer substring, located between indices
11 of the original text, is
“the quick brown fox”.
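The candidate search can be sketched as below. The article doesn't state how the sample scores a combination, so this sketch assumes the best pair is the one with the highest sum of start and end logits whose end index isn't before its start index. The function names are hypothetical, and the logit arrays are assumed to already be plain Double arrays limited to the document's tokens.

```swift
// Hypothetical helper: indices of the `count` largest values, best first.
func topIndices(of logits: [Double], count: Int) -> [Int] {
    let sortedIndices = logits.indices.sorted { logits[$0] > logits[$1] }
    return Array(sortedIndices.prefix(count))
}

// Hypothetical helper: searches the candidate start and end indices for the
// pair with the highest combined score (an assumed scoring rule).
func bestSpan(startLogits: [Double], endLogits: [Double], candidates: Int = 20) -> (start: Int, end: Int)? {
    let startCandidates = topIndices(of: startLogits, count: candidates)
    let endCandidates = topIndices(of: endLogits, count: candidates)

    var best: (start: Int, end: Int, score: Double)? = nil
    for start in startCandidates {
        // An answer can't end before it begins.
        for end in endCandidates where end >= start {
            let score = startLogits[start] + endLogits[end]
            if let current = best, current.score >= score { continue }
            best = (start: start, end: end, score: score)
        }
    }
    guard let found = best else { return nil }
    return (start: found.start, end: found.end)
}

// Made-up logits for a five-token document.
let span = bestSpan(startLogits: [0.1, 0.2, 5.0, 0.3, 0.1],
                    endLogits: [0.0, 4.0, 0.1, 0.2, 3.0])
```

Here the highest end logit sits at index 1, before the highest start logit at index 2, so the search settles on the span from index 2 to index 4 instead.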
Scale for Larger Documents
The BERT model included in this sample can process up to 384 tokens, including the three overhead tokens—one “classification start” token and two separator tokens—leaving 381 tokens for your text and question, combined. For larger texts that exceed this limitation, consider using one of these techniques:
Use a search mechanism to narrow down the relevant document text.
Break up the document text into sections, such as by paragraph, and make a prediction for each section.
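As a rough illustration of the second technique, the sketch below splits a document's token IDs into sections that each fit alongside the question. It splits by token budget rather than by paragraph, which is a simplification of the suggestion above, and the function name is hypothetical. A fixed split can cut an answer in half at a section boundary, which a paragraph-based split would make less likely.

```swift
// Hypothetical helper: breaks document token IDs into sections that leave
// room for the question and the three overhead tokens.
func sections(of documentIDs: [Int], questionLength: Int, sequenceLength: Int = 384) -> [[Int]] {
    let budget = sequenceLength - 3 - questionLength
    // A question that fills the whole input leaves no room for document text.
    guard budget > 0 else { return [] }
    return stride(from: 0, to: documentIDs.count, by: budget).map { start in
        let end = min(start + budget, documentIDs.count)
        return Array(documentIDs[start..<end])
    }
}

// Toy sizes: a 9-element input and a 2-token question leave 4 tokens per section.
let toySections = sections(of: Array(0..<10), questionLength: 2, sequenceLength: 9)
```

An app would then make one prediction per section and keep the answer with the best score.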