Hello everyone,
I’m working on a project to automate software controls inside non-standard macOS applications—specifically custom-drawn audio plugins (like the Roland TR-909 VST).
The Challenge:
These VST interfaces do not expose their buttons, knobs, or dials through the standard macOS Accessibility tree (NSAccessibility / AXUIElement). Because the controls are custom-rendered, standard automation tools cannot see them.
My Current Hybrid Approach:
I am combining two of Apple's local machine learning technologies to solve this without sending data to the cloud:
- Step 1: Text-Based Layout Mapping (Vision Framework). I capture a screenshot of the targeted window using Quartz Window Services and run a local VNRecognizeTextRequest to extract coordinates for all text labels. This works exceptionally well for text buttons like "OPTION" or "ABOUT".
- Step 2: Contextual & Non-Text Element Interpretation (FoundationModels Framework). For controls that lack text labels (such as blank step sequencer buttons, parameter knobs, or toggle light states), I pass the screenshot as an Attachment into the new local LanguageModelSession. I ask the model to ground coordinates relative to the text landmarks mapped in Step 1.
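A minimal sketch of Step 1 might look like the following. It assumes the screenshot is already available as a CGImage. TextLandmark, pixelRect(fromNormalized:imageWidth:imageHeight:), and recognizeLandmarks(in:) are hypothetical names chosen for illustration. The one detail worth calling out is that Vision reports bounding boxes in normalized coordinates with the origin at the bottom-left, so they need flipping before they line up with top-left screenshot pixels.

```swift
import Foundation
#if canImport(Vision)
import Vision
#endif

/// A recognized text label with its box in top-left-origin pixel coordinates.
/// (Hypothetical helper type, for illustration only.)
struct TextLandmark {
    let text: String
    let pixelRect: CGRect
}

/// Converts a normalized rect (0...1, origin at the bottom-left, as Vision
/// reports it) into pixel coordinates with the origin at the top-left.
func pixelRect(fromNormalized rect: CGRect, imageWidth: Int, imageHeight: Int) -> CGRect {
    let w = CGFloat(imageWidth)
    let h = CGFloat(imageHeight)
    return CGRect(
        x: rect.origin.x * w,
        // Flip the Y axis: distance from the top is 1 - (bottom edge + height).
        y: (1 - rect.origin.y - rect.size.height) * h,
        width: rect.size.width * w,
        height: rect.size.height * h
    )
}

#if canImport(Vision)
/// Runs on-device text recognition and returns each label with its pixel rect.
func recognizeLandmarks(in cgImage: CGImage) throws -> [TextLandmark] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try handler.perform([request])

    return (request.results ?? []).compactMap { observation in
        // Keep only the highest-confidence candidate string per observation.
        guard let best = observation.topCandidates(1).first else { return nil }
        return TextLandmark(
            text: best.string,
            pixelRect: pixelRect(fromNormalized: observation.boundingBox,
                                 imageWidth: cgImage.width,
                                 imageHeight: cgImage.height)
        )
    }
}
#endif
```

The resulting pixel rects are in the screenshot's pixel space, so on a Retina display they still need dividing by the backing scale factor before being used as click targets.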
Here is a simplified snippet of how I am feeding the visual context into the local model:
    import Foundation
    import FoundationModels
    import Cocoa

    func analyzePluginInterface(cgImage: CGImage) async {
        // Bail out early if the on-device model is not ready.
        guard SystemLanguageModel.default.isAvailable else {
            print("Local model not downloaded or available.")
            return
        }

        let instructions = """
            You are a screen-aware assistant. Your job is to locate GUI controls
            on a custom 1024x802 VST window.
            """
        let session = LanguageModelSession(instructions: instructions)

        do {
            // Build the prompt from text segments plus the screenshot.
            let response = try await session.respond {
                "Look at this screenshot of the VST window."
                Attachment(cgImage)
                "Locate the blank step-sequencer buttons located below the instrument channel labels."
                "What are the center coordinates (X, Y) for the first active step?"
            }
            print("Model Grounding Output: \(response.content)")
        } catch {
            print("Inference failed: \(error)")
        }
    }
My Questions for the Community:
- Performance & Latency: The local LanguageModelSession.respond call takes several seconds to run on device. For real-time DAW automation, this is a bottleneck. Has anyone experimented with using a custom LoRA adapter or a smaller model profile to speed up spatial coordinate inference?
- Coordinate Stability: Multimodal models can sometimes hallucinate coordinates (bounding box values). What strategies are you using to constrain the model output to precise pixel boundaries on varying display scaling configurations (Retina vs non-Retina)?
- Alternative Solutions: Are there newer on-device vision APIs (perhaps in CoreML or Vision) that are better suited for bounding-box grounding of abstract graphics (like dials/knobs) than a general language model session?
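To make the coordinate-stability question concrete, here is a rough sketch of the kind of guard that could sit between the model and the click logic. It assumes the model answers in free text containing an "(X, Y)" pair in screenshot pixel space, and that the scale comes from something like NSScreen's backingScaleFactor (2 on Retina, 1 otherwise). parsePoint(from:) and windowPoint(fromModelOutput:imageWidth:imageHeight:scale:) are hypothetical helpers, not part of any framework.

```swift
import Foundation

/// Extracts the first "(X, Y)" pair from free-form model output.
/// Returns nil when no such pair is present.
func parsePoint(from text: String) -> CGPoint? {
    let pattern = #"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"#
    guard let regex = try? NSRegularExpression(pattern: pattern),
          let match = regex.firstMatch(in: text, range: NSRange(text.startIndex..., in: text)),
          let xRange = Range(match.range(at: 1), in: text),
          let yRange = Range(match.range(at: 2), in: text),
          let x = Double(text[xRange]),
          let y = Double(text[yRange])
    else { return nil }
    return CGPoint(x: CGFloat(x), y: CGFloat(y))
}

/// Validates a model-reported pixel coordinate against the screenshot size and
/// converts it to window points. Out-of-bounds answers are rejected (nil)
/// rather than clamped, so a hallucinated value never turns into a click.
func windowPoint(fromModelOutput text: String,
                 imageWidth: Int,
                 imageHeight: Int,
                 scale: CGFloat) -> CGPoint? {
    guard scale > 0, let pixel = parsePoint(from: text) else { return nil }

    let inBounds = pixel.x >= 0 && pixel.x < CGFloat(imageWidth)
        && pixel.y >= 0 && pixel.y < CGFloat(imageHeight)
    guard inBounds else { return nil }

    // Screenshot pixels -> points: divide by the display scale factor.
    return CGPoint(x: pixel.x / scale, y: pixel.y / scale)
}
```

For a 1024x802-point window captured on a Retina display, the screenshot would be 2048x1604 pixels, so the call would pass imageWidth: 2048, imageHeight: 1604, scale: 2. This only catches answers that are obviously out of range; it does nothing for plausible but wrong coordinates, which is the harder part of the question.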
Would love to hear how others are approaching screen-aware GUI interpretation with these new frameworks!
Thanks!