On Model Control & Metadata

While response.usage provides token counts and reasoning signals, are there plans to expose per-token logprobs or confidence scores to help developers build more robust 'evaluators' for non-deterministic outputs?

Answered by Developer Tools Engineer in 891694022

The coreai-models repo can run standard benchmarks such as MMLU or GSM8K against LLMs exported using the model catalog:

https://github.com/apple/coreai-models

% uv run coreai.llm.eval --help
usage: coreai.llm.eval [-h] --model MODEL [--tasks TASKS [TASKS ...]]

Evaluate a Core AI LLM against standard benchmarks

options:
  -h, --help            show this help message and exit
  --model MODEL         HuggingFace model ID or path to model bundle
  --tasks TASKS [TASKS ...]
                        Evaluation tasks (e.g. tinyMMLU tinyGSM8k)

To list the supported models, run:

% uv run coreai.model.registry --list-models --type llm
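
Putting the two commands together, an evaluation run might look like the sketch below. It uses only the `--model` and `--tasks` flags shown in the help output above, with the two task names from the help text. The `run_eval` wrapper, the `DRY_RUN` switch, and the `path/to/model-bundle` value are hypothetical placeholders for illustration and are not part of the repo; substitute a model ID from the registry listing or the path to your own exported bundle.

```shell
# Hypothetical helper (not part of coreai-models): wraps the eval CLI so the
# command can be previewed before it is run.
# Usage: run_eval MODEL TASK [TASK ...]
run_eval() {
  model="$1"
  shift
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only print the command that would be executed.
    echo "uv run coreai.llm.eval --model $model --tasks $*"
  else
    # DRY_RUN=0: actually invoke the evaluator (requires the repo and uv).
    uv run coreai.llm.eval --model "$model" --tasks "$@"
  fi
}

# Placeholder model path; task names are the examples from --help.
run_eval "path/to/model-bundle" tinyMMLU tinyGSM8k
```

With `DRY_RUN` left unset, the script only echoes the command; setting `DRY_RUN=0` in the environment runs it from a checkout of the repo.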

For tighter integration, please see the Swift runner source code: https://github.com/apple/coreai-models/tree/main/swift/Sources/Tools/llm-runner
