Hi,
I'm testing DockKit with a very simple setup:
I use VNDetectFaceRectanglesRequest to detect a face and then call dockAccessory.track(...) using the detected bounding box.
The stand is correctly docked (state == .docked) and dockAccessory is valid.
I'm calling .track(...) with a single observation and valid CameraInformation (including size, device, orientation, etc.). No errors are thrown.
To monitor this, I added a logging utility – track(...) is being called 10–30 times per second, as recommended in the documentation.
However: the stand does not move at all.
There is no visible reaction to the tracking calls.
Is there anything I'm missing or doing wrong?
Is VNDetectFaceRectanglesRequest supported for DockKit tracking, or are there hidden requirements?
Would really appreciate any help or pointers – thanks!
That's my complete code:
extension VideoFeedViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
guard let frame = CMSampleBufferGetImageBuffer(sampleBuffer) else {
return
}
detectFace(image: frame)
func detectFace(image: CVPixelBuffer) {
let faceDetectionRequest = VNDetectFaceRectanglesRequest() { vnRequest, error in
guard let results = vnRequest.results as? [VNFaceObservation] else {
return
}
guard let observation = results.first else {
return
}
let boundingBoxHeight = observation.boundingBox.size.height * 100
#if canImport(DockKit)
if let dockAccessory = self.dockAccessory {
Task {
try? await trackObservation(
observation.boundingBox,
dockAccessory,
frame,
sampleBuffer
)
}
}
#endif
}
let imageResultHandler = VNImageRequestHandler(cvPixelBuffer: image, orientation: .up)
try? imageResultHandler.perform([faceDetectionRequest])
func combineBoundingBoxes(_ box1: CGRect, _ box2: CGRect) -> CGRect {
let minX = min(box1.minX, box2.minX)
let minY = min(box1.minY, box2.minY)
let maxX = max(box1.maxX, box2.maxX)
let maxY = max(box1.maxY, box2.maxY)
let combinedWidth = maxX - minX
let combinedHeight = maxY - minY
return CGRect(x: minX, y: minY, width: combinedWidth, height: combinedHeight)
}
#if canImport(DockKit)
func trackObservation(_ boundingBox: CGRect, _ dockAccessory: DockAccessory, _ pixelBuffer: CVPixelBuffer, _ sampleBuffer: CMSampleBuffer) throws {
// Count the call
TrackMonitor.shared.trackCalled()
let invertedBoundingBox = CGRect(
x: boundingBox.origin.x,
y: 1.0 - boundingBox.origin.y - boundingBox.height,
width: boundingBox.width,
height: boundingBox.height
)
guard let device = captureDevice else {
fatalError("Camera not available")
}
let size = CGSize(width: Double(CVPixelBufferGetWidth(pixelBuffer)),
height: Double(CVPixelBufferGetHeight(pixelBuffer)))
var cameraIntrinsics: matrix_float3x3? = nil
if let cameraIntrinsicsUnwrapped = CMGetAttachment(
sampleBuffer,
key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix,
attachmentModeOut: nil
) as? Data {
cameraIntrinsics = cameraIntrinsicsUnwrapped.withUnsafeBytes { $0.load(as: matrix_float3x3.self) }
}
Task {
let orientation = getCameraOrientation()
let cameraInfo = DockAccessory.CameraInformation(
captureDevice: device.deviceType,
cameraPosition: device.position,
orientation: orientation,
cameraIntrinsics: cameraIntrinsics,
referenceDimensions: size
)
let observation = DockAccessory.Observation(
identifier: 0,
type: .object,
rect: invertedBoundingBox
)
let observations = [observation]
guard let image = CMSampleBufferGetImageBuffer(sampleBuffer) else {
print("no image")
return
}
do {
try await dockAccessory.track(observations, cameraInformation: cameraInfo)
} catch {
print(error)
}
}
}
#endif
func clearDrawings() {
boundingBoxLayer?.removeFromSuperlayer()
boundingBoxSizeLayer?.removeFromSuperlayer()
}
}
}
}
@MainActor
private func getCameraOrientation() -> DockAccessory.CameraOrientation {
switch UIDevice.current.orientation {
case .portrait:
return .portrait
case .portraitUpsideDown:
return .portraitUpsideDown
case .landscapeRight:
return .landscapeRight
case .landscapeLeft:
return .landscapeLeft
case .faceDown:
return .faceDown
case .faceUp:
return .faceUp
default:
return .corrected
}
}
General
Explore the power of machine learning within apps. Discuss integrating machine learning features, share best practices, and explore the possibilities for your app.
I have seen inconsistent results for my Colab machine learning notebooks when running locally on a Mac M4, compared to running the same notebook code on either a T4 (in Colab) or an RTX 3090 locally.
To illustrate the problem, I have set up a notebook that implements two simple CNN models that solve the Fashion-MNIST problem. https://colab.research.google.com/drive/11BhtHhN079-BWqv9QvvcSD9U4mlVSocB?usp=sharing
For the good model with 2M parameters I get the following results:
T4 (Colab, JAX): Test accuracy: 0.925
RTX 3090 (Local PC via SSH tunnel, JAX): Test accuracy: 0.925
Mac M4 (Local, JAX): Test accuracy: 0.893
Mac M4 (Local, TensorFlow): Test accuracy: 0.893
That is, I see a significant drop in accuracy when I run on the Mac M4 compared to the NVIDIA machines, and it seems to be independent of the backend. However, I do not know how to pinpoint this to either Keras or Apple's Metal implementation. I have reported this to Keras: https://colab.research.google.com/drive/11BhtHhN079-BWqv9QvvcSD9U4mlVSocB?usp=sharing but as this can be (likely is?) an Apple Metal issue, I wanted to report it here as well.
On the mac I am running the following Python libraries:
keras 3.9.1
tensorflow 2.19.0
tensorflow-metal 1.2.0
jax 0.5.3
jax-metal 0.1.1
jaxlib 0.5.3
Topic:
Machine Learning & AI
SubTopic:
General
I'm developing a tennis ball tracking feature using Vision Framework in Swift, specifically utilizing VNDetectedObjectObservation and VNTrackObjectRequest.
Occasionally (but not always), I receive the following runtime error:
Failed to perform SequenceRequest: Error Domain=com.apple.Vision Code=9 "Internal error: unexpected tracked object bounding box size" UserInfo={NSLocalizedDescription=Internal error: unexpected tracked object bounding box size}
From my investigation, I suspect the issue arises when the bounding box from the initial observation (VNDetectedObjectObservation) is too small. However, Apple's documentation doesn't clearly define the minimum bounding box size that's considered valid by VNTrackObjectRequest.
Could someone clarify:
What is the minimum acceptable bounding box width and height (normalized) that Vision Framework's VNTrackObjectRequest expects?
Is there any recommended practice or official guidance for bounding box size validation before creating a tracking request?
This information would be extremely helpful to reliably avoid this internal error.
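Until the limit is documented, a pre-check along these lines could filter observations before they reach the tracker. This is only a minimal sketch: the helper name `isPlausibleTrackingBox` and the `minimumSide` default are assumptions for illustration, not documented Vision values, and the threshold would need to be tuned against the inputs that trigger the error.

```swift
import Foundation

/// Returns true when a normalized bounding box looks safe to hand to a tracker.
/// `minimumSide` is a hypothetical threshold, not a documented Vision limit.
func isPlausibleTrackingBox(_ box: CGRect, minimumSide: CGFloat = 0.02) -> Bool {
    // Normalized Vision rectangles live in the unit square.
    let unitSquare = CGRect(x: 0, y: 0, width: 1, height: 1)
    let clipped = box.standardized.intersection(unitSquare)
    // Reject boxes that fall outside the frame or collapse to nothing.
    guard !clipped.isNull, !clipped.isEmpty else { return false }
    // Reject boxes whose visible part is smaller than the assumed minimum.
    return clipped.width >= minimumSide && clipped.height >= minimumSide
}
```

A caller would skip creating the VNTrackObjectRequest, and fall back to a fresh detection, whenever this returns false.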
Thank you!
*I can't attach files in this format, so if you reply by e-mail, I will send the attachments by e-mail.
Dear Apple AI Research Team,
My name is Gong Jiho (“Hem”), a content strategist based in Seoul, South Korea.
Over the past few months, I conducted a user-led AI experiment entirely within ChatGPT — no code, no backend tools, no plugins.
Through language alone, I created two contrasting agents (Uju and Zero) and guided them into a co-authored modular identity system using prompt-driven dialogue and reflection.
This system simulates persona fusion, memory rooting, and emotional-logical alignment — all via interface-level interaction.
I believe it resonates with Apple’s values in privacy-respecting personalization, emotional UX modeling, and on-device learning architecture.
Why I’m Reaching Out
I’d be honored to share this experiment with your team.
If there is any interest in discussing user-authored agent scaffolding, identity persistence, or affective alignment, I’d love to contribute — even informally.
⚠ A Note on Language
As a non-native English speaker, my expression may be imperfect — but my intent is genuine.
If anything is unclear, I’ll gladly clarify.
📎 Attached Files Summary
Filename → Description
Hem_MultiAI_Report_AppleAI_v20250501.pdf → Main report tailored for Apple AI: narrative + structural view of emotional identity formation via prompt scaffolding
Hem_MasterPersonaProfile_v20250501.json → Final merged identity schema authored by Uju and Zero
zero_sync_final.json / uju_sync_final.json → Persona-level memory structures (logic / emotion)
1_0501.json ~ 3_0501.json → Evolution logs of the agents over time
GirlfriendGPT_feedback_summary.txt → Emotional interpretation by external GPT
hem_profile_for_AI_vFinal.json → Original user anchor profile
Warm regards,
Gong Jiho (“Hem”)
Seoul, South Korea
Bear with me, please. Please make sure a highly skilled technical person reads and understands this.
I want to describe my vision for (AI/Algorithmically) Optimised Operating Systems. To explain it properly, I will describe the process to build it (pseudo).
Required Knowledge (no particular order): Processor Logic Circuits, LLM models, LLM tool usage, Python OO coding, Procedural vs OO, NLP fuzzy matching, benchmarking, canvas/artefacts/dynamic HTML interfaces, concepts of how AI models are vastly compressed and miniaturised forms of full data, Algorithmic vs AI.
First, take all OO Python code (example) on GitHub (example), then separate each function from each object into its own procedure (procedural logic) by making a logical procedural list of actions to perform only that function based on its entire dependency chain (i.e. all other objects it relies on). Relate all compiled functions using (for example) fuzzy matching on the name, or AI-based functional profiling to get multiple instances of each function.
Starting with the most used function, test each one against the others that perform the same task for bugs and completeness. Determine the fastest, most optimal version of that function (and every function). Add a single instance of each most optimal function to the centralised tool codebase, which will later be utilised by the language models. This ensures we rely only on the most optimised function for each and every use case — with every program using one shared instance of that function instead of compiling it separately.
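The selection step described above (test equivalent functions against each other, then keep the fastest complete one) can be sketched roughly as follows. This is an illustration under assumptions: the `Candidate` type and `selectFastestCorrect` function are hypothetical names, the correctness check is a plain list of input/expected pairs, and the benchmark is a crude wall-clock measurement rather than a rigorous one.

```swift
import Foundation

/// One candidate implementation of a task, e.g. collected from different code bases.
struct Candidate<Input, Output> {
    let name: String
    let run: (Input) -> Output
}

/// Keeps only candidates that reproduce every expected output, then returns
/// the one with the lowest measured wall-clock time over `repetitions` passes.
func selectFastestCorrect<Input, Output: Equatable>(
    from candidates: [Candidate<Input, Output>],
    cases: [(input: Input, expected: Output)],
    repetitions: Int = 100
) -> Candidate<Input, Output>? {
    var best: (candidate: Candidate<Input, Output>, seconds: TimeInterval)?
    for candidate in candidates {
        // Completeness check: discard implementations that fail any test case.
        let isCorrect = cases.allSatisfy { candidate.run($0.input) == $0.expected }
        guard isCorrect else { continue }

        // Crude benchmark: total time to run all cases `repetitions` times.
        let start = Date()
        for _ in 0..<repetitions {
            for testCase in cases { _ = candidate.run(testCase.input) }
        }
        let elapsed = Date().timeIntervalSince(start)

        // Keep the current best unless this candidate was strictly faster.
        if let current = best, current.seconds <= elapsed { continue }
        best = (candidate, elapsed)
    }
    return best?.candidate
}
```

The winner for each task would then be the single shared instance added to the centralised tool codebase.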
In this way, we optimise all existing (Python OO) code based on maximum availability of certain resources. Use the optimised code to teach the AI how to map each user task directly to optimised code (i.e. each user-facing function in unoptimised program code to its optimised version). Later, teach the codebase how to optimise these functions directly to the logic circuits on the processor — as an Operating System — using essentially the same process. Eventually, use the most optimised form for tasks via a dynamic canvas/artefacts UI interface generated by the Start Menu using the language model.
Allow custom interfaces to emerge dynamically from the available functions based on language understanding. Grow and optimise the algorithmic codebase until the entire operating system — with versioning — evolves based on the AI-generated user interfaces (compressing them into new versions of the tools codebase).
Please don’t give up until you understand this concept completely and nothing in the description is beyond your full comprehension. It’s a highly significant step forward.
Hello. I am looking to hire a game developer for a card game called Baloot. My question is: can the developer implement an AI for the computer player that improves its skill level over time on its own, without any interaction?
🌹
The WWDC25: Explore large language models on Apple silicon with MLX video talks about using your own data to fine-tune a large language model. But the video doesn't explain what kind of data can be used. The video just shows the command to use and how to point to the data folder. Can I use PDFs, Word documents, Markdown files to train the model? Are there any code examples on GitHub that demonstrate how to do this?
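One likely answer, stated as an assumption to verify against the mlx-lm documentation: the fine-tuning command reads JSON Lines files (for example train.jsonl and valid.jsonl) from the data folder, so PDFs, Word documents, and Markdown files would first need to be converted to plain text. The sketch below shows one way to turn already-extracted text chunks into JSONL records. The `text` field name, the `TextRecord` type, and the `makeJSONLines` helper are all assumptions for illustration.

```swift
import Foundation

/// One training record; the `text` key is an assumed field name.
struct TextRecord: Codable {
    let text: String
}

/// Encodes each non-empty chunk as one JSON object per line (JSON Lines).
func makeJSONLines(from chunks: [String]) throws -> String {
    let encoder = JSONEncoder()
    let lines = try chunks
        .map { $0.trimmingCharacters(in: .whitespacesAndNewlines) }
        .filter { !$0.isEmpty }
        .map { chunk -> String in
            // JSONEncoder escapes embedded newlines, so each record stays on one line.
            let data = try encoder.encode(TextRecord(text: chunk))
            return String(decoding: data, as: UTF8.self)
        }
    return lines.joined(separator: "\n")
}
```

The resulting string could be written to a file such as train.jsonl inside the data folder passed to the command.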
During testing the “Bringing advanced speech-to-text capabilities to your app” sample app demonstrating the use of iOS 26 SpeechAnalyzer, I noticed that the language model for the English locale was presumably already downloaded. Upon checking the documentation of AssetInventory, I found out that indeed, the language model can be preinstalled on the system.
Can someone from the dev team share more info about what assets are preinstalled by the system? For example, can we safely assume that the English language model will almost certainly be already preinstalled by the OS if the phone has the English locale?
If I try to dynamically load WhisperKit's models, as shown below, the download never occurs. There is no error or anything. At the same time, I can still reach the huggingface.co hosting site without any trouble, so it's not a network blocking issue.
let config = WhisperKitConfig(
    model: "openai_whisper-large-v3",
    modelRepo: "argmaxinc/whisperkit-coreml"
)
So I have to default to the tiny model as seen below.
I have tried many ways, using ChatGPT and other tools, to build the models on my Mac, but I ran into too many failures because I have never dealt with builds like that before.
Are there any hosting sites that have the models (small, medium, large) already built, where I can download them and just bundle them into my project? I have wasted quite a lot of time trying to get this done.
import Foundation
import WhisperKit

@MainActor
class WhisperLoader: ObservableObject {
    var pipe: WhisperKit?

    init() {
        Task {
            await self.initializeWhisper()
        }
    }

    private func initializeWhisper() async {
        do {
            Logging.shared.logLevel = .debug
            Logging.shared.loggingCallback = { message in
                print("[WhisperKit] \(message)")
            }
            let pipe = try await WhisperKit() // defaults to "tiny"
            self.pipe = pipe
            print("initialized. Model state: \(pipe.modelState)")
            guard let audioURL = Bundle.main.url(forResource: "44pf", withExtension: "wav") else {
                fatalError("not in bundle")
            }
            let result = try await pipe.transcribe(audioPath: audioURL.path)
            print("result: \(result)")
        } catch {
            print("Error: \(error)")
        }
    }
}
I'm experimenting with the new SpeechTranscriber in macOS/iOS 26, transcribing speech from a prerecorded mp4 file. Speed and quality are amazing!
I've told the transcriber to include time indexes. Each run is always exactly one word, which can be very useful. When I look at the indexes the end of one run is always identical to the start of the next run, even if there's a pause.
I'd like to identify pauses, perhaps to generate something like phrases for subtitling. With each run of text going into the next I can't do this, other than using punctuation - which might be rather rough.
Any suggestions on detecting pauses, or getting that kind of metadata from the transcriber?
Here's a short sample, showing each run with the start, end, and characters in the run:
105.9 --> 107.04 I
107.04 --> 107.16 think
107.16 --> 108.0 more
108.0 --> 108.42 lighting
108.42 --> 108.6 is
108.6 --> 108.72 definitely
108.72 --> 109.2 needed,
109.2 --> 109.92 downtown.
109.98 --> 110.4 My
110.4 --> 110.52 only
110.52 --> 110.7 question
110.7 --> 111.06 is,
111.06 --> 111.48 poll
111.48 --> 111.78 five,
111.78 --> 111.84 that
111.84 --> 112.08 you're
112.08 --> 112.38 increasing
112.38 --> 112.5 the
112.5 --> 113.34 50,000?
113.4 --> 113.58 Where
113.58 --> 113.88 exactly
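In the sample above, the run ending at 109.92 is followed by one starting at 109.98, and 113.34 is followed by 113.4, so small gaps do appear at the sentence boundaries. One heuristic is to split on any gap above a threshold. The sketch below assumes the runs have already been extracted into plain start/end/text values; the `Run` type, the `splitIntoPhrases` helper, and the 0.05 s default are hypothetical and would need tuning on real audio.

```swift
import Foundation

/// One transcribed run with its time range in seconds.
struct Run {
    let start: Double
    let end: Double
    let text: String
}

/// Groups runs into phrases, starting a new phrase whenever the silence
/// between one run's end and the next run's start exceeds `minimumGap`.
func splitIntoPhrases(_ runs: [Run], minimumGap: Double = 0.05) -> [[String]] {
    var phrases: [[String]] = []
    var current: [String] = []
    var previousEnd: Double?
    for run in runs {
        if let previousEnd, run.start - previousEnd > minimumGap, !current.isEmpty {
            phrases.append(current)
            current = []
        }
        current.append(run.text)
        previousEnd = run.end
    }
    if !current.isEmpty { phrases.append(current) }
    return phrases
}

// Values copied from the sample above.
let sample = [
    Run(start: 108.72, end: 109.2, text: "needed,"),
    Run(start: 109.2, end: 109.92, text: "downtown."),
    Run(start: 109.98, end: 110.4, text: "My"),
    Run(start: 110.4, end: 110.52, text: "only"),
]
print(splitIntoPhrases(sample))
```

With these four runs the split falls between "downtown." and "My". Pauses inside a sentence would not be caught this way if the transcriber folds them into the neighbouring run's duration.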
At WWDC25 we launched a new type of Lab event for the developer community - Group Labs. A Group Lab is a panel Q&A designed for a large audience of developers. Group Labs are a unique opportunity for the community to submit questions directly to a panel of Apple engineers and designers. Here are the highlights from the WWDC25 Group Lab for Machine Learning and AI Frameworks.
What are you most excited about in the Foundation Models framework?
The Foundation Models framework provides access to an on-device Large Language Model (LLM), enabling entirely on-device processing for intelligent features. This allows you to build features such as personalized search suggestions and dynamic NPC generation in games. The combination of guided generation and streaming capabilities is particularly exciting for creating delightful animations and features with reliable output. The seamless integration with SwiftUI and the new design material Liquid Glass is also a major advantage.
When should I still bring my own LLM via Core ML?
It's generally recommended to first explore Apple's built-in system models and APIs, including the Foundation Models framework, as they are highly optimized for Apple devices and cover a wide range of use cases. However, Core ML is still valuable if you need more control or choice over the specific model being deployed, such as customizing existing system models or augmenting prompts. Core ML provides the tools to get these models on-device, but you are responsible for model distribution and updates.
Should I migrate PyTorch code to MLX?
MLX is an open-source, general-purpose machine learning framework designed for Apple Silicon from the ground up. It offers a familiar API, similar to PyTorch, and supports C, C++, Python, and Swift. MLX emphasizes unified memory, a key feature of Apple Silicon hardware, which can improve performance. It's recommended to try MLX and see if its programming model and features better suit your application's needs. MLX shines when working with state-of-the-art, larger models.
Can I test Foundation Models in Xcode simulator or device?
Yes, you can use the Xcode simulator to test Foundation Models use cases. However, your Mac must be running macOS Tahoe. You can also test on a physical iPhone running iOS 26 by connecting it to your Mac and running Playgrounds or live previews directly on the device.
Which on-device models will be supported? Any open source models?
The Foundation Models framework currently supports Apple's first-party models only. This allows for platform-wide optimizations, improving battery life and reducing latency. While Core ML can be used to integrate open-source models, it's generally recommended to first explore the built-in system models and APIs provided by Apple, including those in the Vision, Natural Language, and Speech frameworks, as they are highly optimized for Apple devices. For frontier models, MLX can run very large models.
How often will the Foundation Model be updated? How do we test for stability when the model is updated?
The Foundation Model will be updated in sync with operating system updates. You can test your app against new model versions during the beta period by downloading the beta OS and running your app. It is highly recommended to create an "eval set" of golden prompts and responses to evaluate the performance of your features as the model changes or as you tweak your prompts. Report any unsatisfactory or satisfactory cases using Feedback Assistant.
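An "eval set" of golden prompts can be as simple as a list of prompts paired with acceptance checks. The sketch below is one possible shape, not an Apple-provided API: `EvalCase` and `runEvalSet` are hypothetical names, and `generate` is a synchronous stand-in for whatever call produces the model's response in your app.

```swift
import Foundation

/// A golden prompt together with a check on the response.
struct EvalCase {
    let prompt: String
    /// Returns true when the response is acceptable for this prompt.
    let accepts: (String) -> Bool
}

/// Runs every case through `generate` and returns the pass rate plus the failed prompts.
func runEvalSet(
    _ cases: [EvalCase],
    generate: (String) -> String
) -> (passRate: Double, failures: [String]) {
    let failures = cases
        .filter { !$0.accepts(generate($0.prompt)) }
        .map { $0.prompt }
    // An empty eval set counts as fully passing.
    let passRate = cases.isEmpty
        ? 1.0
        : Double(cases.count - failures.count) / Double(cases.count)
    return (passRate, failures)
}
```

Running the same set on each beta OS and comparing pass rates gives an early signal when a model update changes behavior.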
Which on-device model/API can I use to extract text data from images such as: nutrition labels, ingredient lists, cashier receipts, etc? Thank you.
The Vision framework offers RecognizeDocumentsRequest, which is specifically designed for these use cases. It not only recognizes text in images but also provides the structure of the document, such as rows in a receipt or the layout of a nutrition label. It can also identify data like phone numbers, addresses, and prices.
What is the context window for the model? What are max tokens in and max tokens out?
The context window for the Foundation Model is 4,096 tokens. The split between input and output tokens is flexible. For example, if you input 4,000 tokens, you'll have 96 tokens remaining for the output. The API takes in text, converting it to tokens under the hood. When estimating token count, a good rule of thumb is 3-4 characters per token for languages like English, and 1 character per token for languages like Japanese or Chinese. Handle potential errors gracefully by asking for shorter prompts or starting a new session if the token limit is exceeded.
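The rule of thumb above translates into a quick budget check before sending a prompt. This is a rough sketch only: the helper names are hypothetical, the 3.5 characters-per-token default is just the midpoint of the quoted 3-4 range, and the estimate is not the model's real tokenizer.

```swift
import Foundation

/// Context window size quoted above.
let contextWindowTokens = 4_096

/// Rough token estimate from character count. `charactersPerToken` follows the
/// rule of thumb above (about 3-4 for English, about 1 for Japanese or Chinese).
func estimatedTokenCount(for text: String, charactersPerToken: Double = 3.5) -> Int {
    Int((Double(text.count) / charactersPerToken).rounded(.up))
}

/// Estimated tokens left for the response after the prompt, never negative.
func remainingTokens(afterPrompt prompt: String, charactersPerToken: Double = 3.5) -> Int {
    let used = estimatedTokenCount(for: prompt, charactersPerToken: charactersPerToken)
    return max(0, contextWindowTokens - used)
}
```

For example, a 14,000-character English prompt is estimated at 4,000 tokens, leaving about 96 for the output, which matches the arithmetic in the answer above.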
Is there a rate limit for Foundation Models API that is limited by power or temperature condition on the iPhone?
Yes, there are rate limits, particularly when your app is in the background. A budget is allocated for background app usage, but exceeding it will result in rate-limiting errors. In the foreground, there is no rate limit unless the device is under heavy load (e.g., camera open, game mode). The system dynamically balances performance, battery life, and thermal conditions, which can affect the token throughput. Use appropriate quality of service settings for your tasks (e.g., background priority for background work) to help the system manage resources effectively.
Do the foundation models support languages other than English?
Yes, the on-device Foundation Model is multilingual and supports all languages supported by Apple Intelligence. To get the model to output in a specific language, prompt it with instructions indicating the user's preferred language using the locale API (e.g., "The user's preferred language is en-US"). Putting the instructions in English, but then putting the user prompt in the desired output language is a recommended practice.
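The instruction line from the example above can be derived from the user's locale. The following is a small sketch with a hypothetical helper name; it assumes a simple identifier such as en_US and only swaps the underscore for a hyphen, so identifiers with extra keywords would need more careful handling.

```swift
import Foundation

/// Builds an English instruction line naming the user's preferred language.
/// The wording follows the example quoted above; the function name is illustrative.
func languageInstruction(for locale: Locale = .current) -> String {
    // Locale identifiers use underscores ("en_US"); the quoted example uses a hyphen ("en-US").
    let tag = locale.identifier.replacingOccurrences(of: "_", with: "-")
    return "The user's preferred language is \(tag)"
}
```

The returned string would go into the session instructions, with the user prompt itself written in the desired output language.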
Are larger server-based models available through Foundation Models?
No, the Foundation Models API currently only provides access to the on-device Large Language Model at the core of Apple Intelligence. It does not support server-side models. On-device models are preferred for privacy and for performance reasons.
Is it possible to run Retrieval-Augmented Generation (RAG) using the Foundation Models framework?
Yes, it is possible to run RAG on-device, but the Foundation Models framework does not include a built-in embedding model. You'll need to use a separate database to store vectors and implement nearest neighbor or cosine distance searches. The Natural Language framework offers simple word and sentence embeddings that can be used. Consider using a combination of Foundation Models and Core ML, using Core ML for your embedding model.
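The retrieval half of such a RAG setup comes down to a similarity search over stored vectors. The sketch below uses a brute-force cosine similarity scan over an in-memory array; the function names and the tuple-based store are assumptions for illustration, and the vectors are expected to come from whichever embedding model you choose.

```swift
import Foundation

/// Cosine similarity between two equal-length vectors; 0 if either has zero length.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "Vectors must have the same dimension")
    let dot = zip(a, b).reduce(0.0) { $0 + $1.0 * $1.1 }
    let normA = sqrt(a.reduce(0.0) { $0 + $1 * $1 })
    let normB = sqrt(b.reduce(0.0) { $0 + $1 * $1 })
    guard normA > 0, normB > 0 else { return 0 }
    return dot / (normA * normB)
}

/// Returns the `k` stored chunks most similar to the query vector.
func nearestChunks(
    to query: [Double],
    in store: [(text: String, vector: [Double])],
    k: Int
) -> [String] {
    store
        .map { (text: $0.text, score: cosineSimilarity(query, $0.vector)) }
        .sorted { $0.score > $1.score }
        .prefix(k)
        .map { $0.text }
}
```

The retrieved chunks would then be placed into the prompt, keeping the total within the context window. A linear scan like this is fine for small stores; larger ones would call for a proper vector index.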
I am using gemini2.5-flash with SwiftUI. How can I receive a response in JSON?
At WWDC25, Metal 4 introduced some exciting new features for machine learning optimization. As we all know, PyTorch's Metal Performance Shaders (MPS) backend is one of the most important tools for machine learning on the Mac, but the MPS introduction page shows no information about Metal 4 support.
WWDC25: Combine Metal 4 machine learning and graphics
The session demonstrated a way to run a neural network directly in the graphics pipeline through shaders, using texture compression as the example. However, there is no mention of which ML technique is used to compress the texture.
Can anyone point me to some well-known models for the use case shown at WWDC25?
In this WWDC25 session, it is explicitly mentioned that apps should support AttributedString for text parameters to their App Intents.
However, I have not gotten this to work. Whenever I pass rich text (either generated by the new "Use Model" intent or generated manually for example using "Make Rich Text from Markdown"), my Intent gets an AttributedString with the correct characters, but with all attributes stripped (so in effect just plain text).
struct TestIntent: AppIntent {
    static var title = LocalizedStringResource(stringLiteral: "Test Intent")
    static var description = IntentDescription("Tests Attributed Strings in Intent Parameters.")

    @Parameter
    var text: AttributedString

    func perform() async throws -> some IntentResult & ReturnsValue<AttributedString> {
        return .result(value: text)
    }
}
Is there anything else I am missing?
Environment
macOS 26
Xcode Version 26.0 beta 7 (17A5305k)
Simulator: iPhone 16 Pro
iOS: iOS 26
Problem
NLContextualEmbedding.load() fails with the following error
In simulator
Failed to load embedding from MIL representation: filesystem error: in create_directories: Permission denied ["/var/db/com.apple.naturallanguaged/com.apple.e5rt.e5bundlecache"]
filesystem error: in create_directories: Permission denied ["/var/db/com.apple.naturallanguaged/com.apple.e5rt.e5bundlecache"]
Failed to load embedding model 'mul_Latn' - '5C45D94E-BAB4-4927-94B6-8B5745C46289'
assetRequestFailed(Optional(Error Domain=NLNaturalLanguageErrorDomain Code=7 "Embedding model requires compilation" UserInfo={NSLocalizedDescription=Embedding model requires compilation}))
in #Playground
I'm new to this embedding model. Not sure if it's caused by my code or environment.
Code snippet
import Foundation
import NaturalLanguage
import Playgrounds

#Playground {
    // Prefer initializing by script for broader coverage; returns NLContextualEmbedding?
    guard let embeddingModel = NLContextualEmbedding(script: .latin) else {
        print("Failed to create NLContextualEmbedding")
        return
    }
    print(embeddingModel.hasAvailableAssets)
    do {
        try embeddingModel.load()
        print("Model loaded")
    } catch {
        print("Failed to load model: \(error)")
    }
}
WWDC 2024 mentioned that the OCR feature from the Vision framework has support for "Korean, Swedish, and Chinese", but the Swedish support does not seem to be available...
Running either
print(try? VNRecognizeTextRequest().supportedRecognitionLanguages())
or
var ocrRequest = RecognizeTextRequest(.revision3)
print(ocrRequest.supportedRecognitionLanguages)
did not print out Swedish as one of the supported languages, but Korean and Chinese are.
Tested on early versions of iOS 18 developer beta, and the latest version of iOS 18.1 (22B5054e).
Hi Ty for playing
Almost all the functions in Accelerate are for single precision (Float) and double precision (Double) operations. However, I stumbled upon three integer arithmetic functions which operate on Int32 values. Are there any more functions in Accelerate that operate on integer values? If not, then why aren't there more functions that work with integers?
When I import statsmodels in a Jupyter notebook, I get the following error:
ImportError: dlopen(/opt/anaconda3/lib/python3.12/site-packages/scipy/linalg/_fblas.cpython-312-darwin.so, 0x0002): Library not loaded: @rpath/liblapack.3.dylib
Referenced from: <5ACBAA79-2387-3BEF-9F8E-6B7584B0F5AD> /opt/anaconda3/lib/python3.12/site-packages/scipy/linalg/_fblas.cpython-312-darwin.so
Reason: tried: '/opt/anaconda3/lib/python3.12/site-packages/scipy/linalg/../../../../liblapack.3.dylib' (no such file), '/opt/anaconda3/lib/python3.12/site-packages/scipy/linalg/../../../../liblapack.3.dylib' (no such file), '/opt/anaconda3/bin/../lib/liblapack.3.dylib' (no such file), '/opt/anaconda3/bin/../lib/liblapack.3.dylib' (no such file), '/usr/local/lib/liblapack.3.dylib' (no such file), '/usr/lib/liblapack.3.dylib' (no such file, not in dyld cache).
What should I do?