FoundationModel, context length, and testing

I am working on an app using FoundationModels to process web pages.

I am looking for ways to filter the input so that it fits within the token limits.

I have unit tests, UI tests, and the app running on an iPad in the simulator. It appears that the different configurations of the test environment affect the token limits.

That is, the same input in a unit test and UI test will hit different token limits.

Is this correct? Or is this an artifact of my test tooling?

Answered by MB-Researcher in 865200022

Accepted Answer

The token limit on the SystemLanguageModel is currently 4096 tokens. This limit is fixed; there is no possibility of it changing between runs or environments.

See this tech note for more discussion of the context window: TN3193: Managing the on-device foundation model’s context window | Apple Developer Documentation

The tokenizer will always produce the same number of tokens for the same input, so you shouldn't see any variation.

... One source of confusion might be that, currently, the error message for GenerationError.exceededContextWindowSize prints a token count measured at the moment your content trips over the 4096 limit. So it might report that your content is 4090 or maybe 4100 tokens; that number reflects when the error was triggered, but the actual limit is still always 4096.
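
For example, here is a minimal sketch of handling that error when processing a long web page. This assumes the error is surfaced as LanguageModelSession.GenerationError.exceededContextWindowSize, and the halving of the text is only a rough character-based heuristic, not a token-accurate trim:

    import FoundationModels

    func summarize(_ pageText: String) async throws -> String {
        let instructions = "Summarize the following web page text."
        let session = LanguageModelSession(instructions: instructions)
        do {
            return try await session.respond(to: pageText).content
        } catch LanguageModelSession.GenerationError.exceededContextWindowSize {
            // Instructions + prompt + response tripped over the 4096-token window.
            // Retry on a fresh session with a shorter input.
            let trimmed = String(pageText.prefix(pageText.count / 2))
            let retrySession = LanguageModelSession(instructions: instructions)
            return try await retrySession.respond(to: trimmed).content
        }
    }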

I am wondering how you are observing the same input hitting different token limits in different testing environments. If you can share more details, ideally the transcript of your language model session, I may be able to explain.

The on-device foundation model currently has a context window of 4096 tokens per language model session, and all of the input and the response in the generation process contribute tokens to the context window, as mentioned here.

That being said, the token limit shouldn't change. However, because an LLM doesn't always generate the same response for the same input, you may or may not reach the limit when the response differs.
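
As an illustration, here is a minimal sketch of how a session's window fills up across turns and how to capture the transcript requested above; the instruction and prompt strings are placeholders:

    import FoundationModels

    func runAndCaptureTranscript() async throws {
        let session = LanguageModelSession(instructions: "You are a teacher. You must only answer questions about New York.")
        _ = try await session.respond(to: "Talk about the statue of liberty")
        // A second turn on the same session: its prompt and response also count
        // toward the same 4096-token context window.
        _ = try await session.respond(to: "And the Brooklyn Bridge?")

        // The transcript records everything the session has accumulated so far,
        // which is the most useful thing to share when reporting an issue like this.
        print(session.transcript)
    }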

Best,
——
Ziqiao Chen
Worldwide Developer Relations

I have realized that the input token count is variable. Most of the variation appears when the sampling strategy is the default, and it is around 100 tokens. I don't think that amount is enough to cause a crash from exceeded context capacity, but it is curious.

Another thing I realized is that when the request defines values such as the probability threshold, temperature, and so on, the variability decreases, and it disappears completely when we use "greedy" sampling or a seed.
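
For reference, here is a minimal sketch of the sampling options being compared here, assuming the GenerationOptions API of Foundation Models; the instruction and prompt strings are placeholders:

    import FoundationModels

    func compareSampling() async throws {
        // Greedy decoding: the same input should produce the same output, and
        // therefore the same token counts, on every run.
        let greedy = GenerationOptions(sampling: .greedy)

        // Seeded random sampling: reproducible across runs with the same seed.
        let seeded = GenerationOptions(sampling: .random(top: 40, seed: 7))

        // Default options: the response, and so the total token usage, can vary.
        let defaults = GenerationOptions()

        for options in [greedy, seeded, defaults] {
            // A fresh session per run, so only the options differ between runs.
            let session = LanguageModelSession(instructions: "You are a teacher. You must only answer questions about New York.")
            _ = try await session.respond(to: "Talk about the statue of liberty", options: options)
        }
    }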

I uploaded a video to YouTube with several runs of the same request, measured with Instruments:

🎦 https://youtu.be/eSMA-e1j4ps

Another thing I don't understand is why input tokens are reported when the response is completed; my feeling is that some checks are done before the Foundation Models API returns a response. Could you confirm? Thank you in advance.

‼️ I'm also using this post to point out an issue: when the model is in "stream" response mode ➡️ it seems the output tokens are not measured; the count seems to stop after the first chunk of output tokens is received, and it is always 2. Check the video at the 30-second mark.

Hi Eduardo,

If you're comfortable doing so, can you send the exact code you're using to run inference, either here or in a Feedback Assistant report to Foundation Models?

I tried replicating the behavior from your YouTube video (very helpful by the way, thanks!) but I was not able to see this behavior in my Xcode and Instruments, running the same prompt as in your video. Incidentally, I noticed my tokenizer returns just 52 input tokens for the input:

let instructions = "You are a teacher. Today you talk about New Work City in USA. You must only answer questions about New York."

let prompt = "Talk about the statue of liberty"

...which is significantly fewer than the 163 tokens in your video. Additionally, both with streaming and non-streaming, I see 52 input tokens, and the count never changes.
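
For reference, a minimal sketch of that kind of run; the exact measurement harness isn't shown in this thread, so everything beyond the two strings above is an assumption:

    import FoundationModels

    func replicate() async throws {
        let instructions = "You are a teacher. Today you talk about New Work City in USA. You must only answer questions about New York."
        let prompt = "Talk about the statue of liberty"

        // Non-streaming: a single respond(to:) call on a fresh session.
        let session = LanguageModelSession(instructions: instructions)
        _ = try await session.respond(to: prompt)

        // Streaming: iterate the partial snapshots on another fresh session.
        let streamingSession = LanguageModelSession(instructions: instructions)
        for try await partial in streamingSession.streamResponse(to: prompt) {
            _ = partial.content
        }
    }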

This makes me suspect there might be something in your code that's triggering additional tokenization, but without seeing your inference code, I can't diagnose further.

Thanks! - MB Researcher

Hi

Here is the code of the function that launches the streaming query:

    func lanzarPeticion() async {
        // Re-enable the launch button when the request finishes, to avoid double taps.
        defer { 👎🔘lanzarConsulta.toggle() }
        do {
            let sessionStream = LanguageModelSession(model: modelStream, instructions: instructions)

            // Build the GenerationOptions from the values currently set in the view.
            let genOptions = generarSamplingMode(temperatura: temperatura,
                                                 maxResponseTokens: maxResponseTokens,
                                                 probabilityThreshold: probabilityThreshold,
                                                 topK: topK,
                                                 seed: semilla)

            let responseStream = sessionStream.streamResponse(to: prompt, options: genOptions)
            for try await partial in responseStream {
                withAnimation {
                    self.respuesta = partial.content
                    t.createPartialTime()
                }
            }
            withAnimation {
                t.stop()
            }
        } catch {
            // all the logic for errors
        }
    }

Clarifications about the code:

First of all, English is dominant in my code, but I sometimes use Spanish, my native language, especially when terms overlap with Swift or framework ones. I have my own emoji rules to avoid extra characters and improve readability (at this moment I'm an indie developer, so I set the rules).

The defer block is there to "free" the button that launches the request, to avoid double clicks that would generate two successive requests.

The t instance is a timer.

The function "generarSamplingMode" collects all the values from the view and sets the sampling mode.
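
For illustration only, here is a hypothetical sketch of such a function; the real implementation isn't shown in this thread, and the parameter names simply mirror the call site above:

    import FoundationModels

    func generarSamplingMode(temperatura: Double?,
                             maxResponseTokens: Int?,
                             probabilityThreshold: Double?,
                             topK: Int?,
                             seed: UInt64?) -> GenerationOptions {
        // Pick a sampling mode: top-k or probability-threshold (nucleus) sampling
        // when those values are set (optionally seeded), otherwise greedy.
        let sampling: GenerationOptions.SamplingMode
        if let topK {
            sampling = .random(top: topK, seed: seed)
        } else if let probabilityThreshold {
            sampling = .random(probabilityThreshold: probabilityThreshold, seed: seed)
        } else {
            sampling = .greedy
        }
        return GenerationOptions(sampling: sampling,
                                 temperature: temperatura,
                                 maximumResponseTokens: maxResponseTokens)
    }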

I'm using Xcode Version 26.0 (17A324)

The instructions and prompt instances are State properties that take their values from the screen; it's straightforward.

The other issue is the output tokens in stream mode. Please re-watch between 0:29 and 0:56: I executed a batch of three runs in streaming mode and the output is always 2 tokens. I get the same result with other apps. My intuition is that maybe it stops counting when it receives the first output, or the counter isn't able to keep up with a continuous flow of updates.
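
As a cross-check on the Instruments counter, the partial snapshots delivered by the stream can be counted on the client side. A sketch, assuming the same session setup as in the code above; this counts chunks, not tokens, so it only shows whether the stream keeps producing output after the first chunk:

    import FoundationModels

    func countStreamedChunks(instructions: String, prompt: String, options: GenerationOptions) async throws {
        let session = LanguageModelSession(instructions: instructions)
        var chunkCount = 0
        for try await partial in session.streamResponse(to: prompt, options: options) {
            chunkCount += 1
            // partial.content is the accumulated text so far, as in the code above.
            print("chunk \(chunkCount): \(partial.content.count) characters so far")
        }
        print("stream delivered \(chunkCount) partial snapshots")
    }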
