While response.usage provides token counts and reasoning signals, are there plans to expose per-token logprobs or confidence scores to help developers build more robust 'evaluators' for non-deterministic outputs?
The coreai-models repo lets you run standard benchmarks such as MMLU or GSM8K against LLMs exported through the model catalog:
https://github.com/apple/coreai-models
% uv run coreai.llm.eval --help
usage: coreai.llm.eval [-h] --model MODEL [--tasks TASKS [TASKS ...]]

Evaluate a Core AI LLM against standard benchmarks

options:
  -h, --help            show this help message and exit
  --model MODEL         HuggingFace model ID or path to model bundle
  --tasks TASKS [TASKS ...]
                        Evaluation tasks (e.g. tinyMMLU tinyGSM8k)
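If you want to script these runs, the flags in the help text above can be assembled programmatically. The following is a minimal sketch, not part of the repo: `build_eval_command` and `run_eval` are hypothetical helper names, and `my-org/my-model` is a placeholder rather than a real catalog entry. It only relies on the `--model` and `--tasks` options shown in the help output.

```python
import shlex
import subprocess
from typing import List, Optional, Sequence


def build_eval_command(model: str, tasks: Optional[Sequence[str]] = None) -> List[str]:
    """Build the argv for `uv run coreai.llm.eval` (hypothetical helper).

    Only the two flags listed in the --help output are used.
    """
    if not model:
        raise ValueError("--model is required")
    cmd = ["uv", "run", "coreai.llm.eval", "--model", model]
    if tasks:
        # --tasks accepts one or more task names after a single flag.
        cmd += ["--tasks", *tasks]
    return cmd


def run_eval(model: str, tasks: Optional[Sequence[str]] = None) -> int:
    """Run the evaluation as a subprocess and return its exit code.

    Assumes the current directory is a checkout of the repo and `uv` is installed.
    """
    return subprocess.run(build_eval_command(model, tasks)).returncode


if __name__ == "__main__":
    # Placeholder model ID: substitute one reported by the model registry.
    cmd = build_eval_command("my-org/my-model", ["tinyMMLU", "tinyGSM8k"])
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting issues when a model path contains spaces.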
To list the supported models, run:
% uv run coreai.model.registry --list-models --type llm
For tighter integration, please see the Swift runner source code: https://github.com/apple/coreai-models/tree/main/swift/Sources/Tools/llm-runner