I've noticed that the reported input token count varies between runs. Most of the variation occurs when the sampling strategy is the default, and it's around 100 tokens. I don't think that amount is enough to cause a crash from exceeding the context window capacity, but it is curious.
Another thing I noticed: when the request defines explicit values such as the probability threshold and temperature, the variability decreases. And it disappears entirely when sampling is greedy or a fixed seed is used. A sketch of the configurations I'm comparing is below.
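For reference, this is roughly how I'm exercising the three configurations (a minimal sketch; the prompt text and function name are placeholders, and the `GenerationOptions` calls reflect the sampling API as I understand it):

```swift
import FoundationModels

func compareSamplingStrategies() async throws {
    let session = LanguageModelSession()
    let prompt = "Summarize the benefits of on-device inference." // placeholder

    // Default sampling strategy: this is where I see the ~100-token
    // variation in the reported input token count.
    _ = try await session.respond(to: prompt)

    // Explicit probability threshold and temperature: variability decreases.
    let tuned = GenerationOptions(
        sampling: .random(probabilityThreshold: 0.9),
        temperature: 0.7
    )
    _ = try await session.respond(to: prompt, options: tuned)

    // Greedy sampling, or random sampling with a fixed seed:
    // the variability disappears completely.
    let greedy = GenerationOptions(sampling: .greedy)
    _ = try await session.respond(to: prompt, options: greedy)

    let seeded = GenerationOptions(sampling: .random(probabilityThreshold: 0.9, seed: 42))
    _ = try await session.respond(to: prompt, options: seeded)
}
```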
I uploaded a video to YouTube with several runs of the same request, measured with Instruments:
🎦 https://youtu.be/eSMA-e1j4ps
Another thing I don't understand is why the input tokens are reported only once the response has completed. My feeling is that some checks are performed before the Foundation Models API returns a response. Could you confirm? Thank you in advance.
‼️ I'm also using this post to report an issue: when the model is in "stream" response mode ➡️ the output tokens don't seem to be measured. The count appears to stop after the first chunk of output tokens is received, and it is always 2. See the video at the 0:30 mark. A sketch of the streaming setup follows.
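For completeness, this is approximately how the streaming run in the video is set up (a sketch, assuming a plain `LanguageModelSession`; the prompt text is a placeholder):

```swift
import FoundationModels

func streamExample() async throws {
    let session = LanguageModelSession()
    // While this loop runs, Instruments shows the output token count
    // stuck at 2 after the first chunk arrives.
    let stream = session.streamResponse(to: "Write a short story about a lighthouse.")
    for try await partial in stream {
        // Each element is a cumulative snapshot of the response so far.
        print(partial)
    }
}
```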