FoundationModel, context length, and testing

I am working on an app using FoundationModels to process web pages.

I am looking to find ways to filter the input to fit within the token limits.
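As an illustration of the kind of filtering in question, here is a minimal sketch; the roughly-4-characters-per-token heuristic and the 3,000-token budget are assumptions on my part, not values from the framework:

```swift
import FoundationModels

// Minimal sketch of pre-trimming web page text before prompting the model.
// Assumption: roughly 4 characters per token for English text, and a budget
// comfortably below the 4096-token context window. Neither number comes from
// the FoundationModels API.
func trimmedForPrompt(_ pageText: String, tokenBudget: Int = 3_000) -> String {
    let approximateCharacterBudget = tokenBudget * 4
    guard pageText.count > approximateCharacterBudget else { return pageText }
    return String(pageText.prefix(approximateCharacterBudget))
}

func summarize(_ pageText: String) async throws -> String {
    let session = LanguageModelSession(instructions: "Summarize the web page.")
    let response = try await session.respond(to: trimmedForPrompt(pageText))
    return response.content
}
```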

I have unit tests, UI tests, and the app running on an iPad in the simulator. It appears that the different configurations of the test environment affect the token limits.

That is, the same input in a unit test and UI test will hit different token limits.

Is this correct? Or is this an artifact of my test tooling?

The token limit on the SystemLanguageModel is currently 4096 tokens. That limit is fixed; it does not vary between runs or test environments.

See this tech note for more discussion of the context window: TN3193: Managing the on-device foundation model’s context window

The tokenizer will always produce the same number of tokens for the same input, so you shouldn't see any variation.

... One source of confusion might be that, currently, the error message for GenerationError.exceededContextWindowSize prints out a token count as soon as the token length of your content trips over the 4096 limit. So sometimes it might report that your content is 4090 or maybe 4100 tokens... that number is where the error was triggered; the actual limit is still always 4096.
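For example, a minimal sketch of catching that error and looking at the reported description, assuming the GenerationError type and case names below match the SDK you're building against:

```swift
import FoundationModels

// Sketch: the count surfaced here is where the error fired, not the 4096 limit itself.
func respondOrReport(_ session: LanguageModelSession, to prompt: String) async {
    do {
        let response = try await session.respond(to: prompt)
        print(response.content)
    } catch let error as LanguageModelSession.GenerationError {
        if case .exceededContextWindowSize = error {
            // The description may mention a token count slightly past 4096.
            print("Exceeded context window: \(error.localizedDescription)")
        }
    } catch {
        print("Other error: \(error)")
    }
}
```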

I am wondering how you are observing the same input hitting different token limits in different testing environments. If you can share more details, ideally the transcript of your language model session, I may be able to explain.
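If it helps, one quick way to capture that is to dump the session's transcript after a run; this sketch assumes your SDK exposes a transcript property on LanguageModelSession:

```swift
// Sketch: dump the accumulated transcript after a run so it can be shared.
// Assumes LanguageModelSession exposes a `transcript` property in your SDK.
func logTranscript(of session: LanguageModelSession) {
    dump(session.transcript)
}
```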

The on-device foundation model currently has a context window of 4096 tokens per language model session, and all of the input and the response in the generation process contribute tokens to the context window, as mentioned in the tech note above.
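As a minimal sketch of what that means in practice, assuming the session and error names below match your SDK: every prompt and every response stays in the session, so a long-lived session accumulates tokens across turns, and a brand-new session starts with an empty context window.

```swift
import FoundationModels

// Sketch: process many pages without letting one session's history grow unbounded.
// Starting a fresh LanguageModelSession resets the 4096-token context window.
func summarizeAll(_ pages: [String]) async throws -> [String] {
    var session = LanguageModelSession()
    var summaries: [String] = []
    for page in pages {
        do {
            summaries.append(try await session.respond(to: page).content)
        } catch LanguageModelSession.GenerationError.exceededContextWindowSize {
            // The accumulated transcript plus this prompt no longer fits:
            // retry the same page in a brand-new session.
            session = LanguageModelSession()
            summaries.append(try await session.respond(to: page).content)
        }
    }
    return summaries
}
```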

That being said, the token limit shouldn't change, but because an LLM doesn't always generate the same response for the same input, you may or may not reach the limit depending on the response.

Best,
——
Ziqiao Chen
Worldwide Developer Relations.

I have realized that the input token counts are variable. Most of the variation appears when the sampling strategy is the default, and it is only around 100 tokens, so I don't think it is enough to trigger the exceeded-context-capacity crash, but it is curious.

Another thing I realized is that when the request defines explicit values such as the threshold, temperature, and so on, the variability decreases, and it disappears entirely when using "greedy" sampling or a seed.
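For concreteness, this is the kind of options setup I mean. It is only a sketch: the GenerationOptions parameters and sampling-mode names below are my reading of the API and may differ in your SDK.

```swift
import FoundationModels

// Sketch: pinning the sampling strategy so repeated runs are comparable.
let greedyOptions = GenerationOptions(sampling: .greedy)
let seededOptions = GenerationOptions(
    sampling: .random(top: 50, seed: 42),   // fixed seed -> repeatable sampling
    temperature: 0.7
)

func run(_ prompt: String) async throws -> String {
    let session = LanguageModelSession()
    // With greedy sampling (or a fixed seed) the measured token counts stop varying.
    return try await session.respond(to: prompt, options: greedyOptions).content
}
```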

I uploaded a video to YouTube with several runs of the same request, measured with Instruments:

🎦 https://youtu.be/eSMA-e1j4ps

Another thing I don't understand is why the input tokens are reported only when the response is completed; my feeling is that some checks are done before the Foundation Models API returns a response. Could you confirm? Thank you in advance.

‼️ I am also using this post to point out an issue: when the model is in "stream" response mode ➡️ it seems the output tokens are not measured; the count appears to stop after the first chunk of output tokens is received, and it is always 2. Check the 30-second mark of the video.
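For reference, by "stream" response mode I mean the streamResponse path, roughly like this sketch (the exact element type the stream yields depends on the SDK version):

```swift
import FoundationModels

// Sketch: consume the streaming variant of the API. Output-token accounting in
// Instruments appears to stop after the first chunk when using this path.
func streamSummary(of pageText: String) async throws {
    let session = LanguageModelSession()
    for try await partial in session.streamResponse(to: "Summarize: \(pageText)") {
        // Each element is a cumulative partial response.
        print(partial)
    }
}
```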
