Approaching Custom VST GUI Automation: Combining local Vision OCR with the new FoundationModels framework for screen-grounding

Hello everyone,

I’m working on a project to automate software controls inside non-standard macOS applications—specifically custom-drawn audio plugins (like the Roland TR-909 VST).

The Challenge:

These VST interfaces do not expose their buttons, knobs, or dials through the standard macOS Accessibility tree (NSAccessibility / AXUIElement). Because the controls are custom-rendered, standard automation tools cannot see them.

My Current Hybrid Approach:

I am combining two of Apple's local machine learning technologies to solve this without sending data to the cloud:

  1. Text-Based Layout Mapping (Vision Framework): I capture a screenshot of the target window using Quartz Window Services and run a local VNRecognizeTextRequest to extract coordinates for all text labels. This works very well for text buttons like "OPTION" or "ABOUT".

  2. Contextual & Non-Text Element Interpretation (FoundationModels Framework): For controls that lack text labels (such as blank step sequencer buttons, parameter knobs, or toggle light states), I pass the screenshot as an Attachment into a local LanguageModelSession and ask the model to ground coordinates relative to the text landmarks mapped in Step 1.
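
One detail in Step 1 that is easy to get wrong: Vision reports each text observation's boundingBox in normalized coordinates (0 to 1) with the origin at the lower-left corner of the image, while screenshots and click targets usually use a top-left origin in pixels. Below is a minimal sketch of that conversion in plain Swift. The NormalizedBox and Landmark types and the landmark(...) helper are hypothetical names for illustration, not part of Vision, and the sample box values are made up.

```swift
import Foundation

/// A rectangle in Vision's convention: values in 0...1,
/// origin at the lower-left corner of the image. (Hypothetical stand-in type.)
struct NormalizedBox {
    var x: Double
    var y: Double
    var width: Double
    var height: Double
}

/// A text landmark expressed in top-left-origin pixel coordinates.
struct Landmark {
    var text: String
    var centerX: Double
    var centerY: Double
}

/// Converts a normalized, lower-left-origin box into the pixel center of the
/// label in a top-left-origin image of the given size.
func landmark(text: String, box: NormalizedBox,
              imageWidth: Double, imageHeight: Double) -> Landmark {
    let centerX = (box.x + box.width / 2) * imageWidth
    // The normalized box is measured up from the bottom edge, so flip the Y axis.
    let centerYFromBottom = (box.y + box.height / 2) * imageHeight
    return Landmark(text: text, centerX: centerX,
                    centerY: imageHeight - centerYFromBottom)
}

// Illustrative values only: a label occupying the upper-middle of a 1024x802 capture.
let option = landmark(text: "OPTION",
                      box: NormalizedBox(x: 0.25, y: 0.75, width: 0.5, height: 0.125),
                      imageWidth: 1024, imageHeight: 802)
print(option)
```

Storing landmarks in one consistent pixel space up front means the prompt for Step 2 and any later click logic can share the same numbers.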

Here is a simplified snippet of how I am feeding the visual context into the local model:

import Foundation
import FoundationModels
import Cocoa

func analyzePluginInterface(cgImage: CGImage) async {
    guard SystemLanguageModel.default.isAvailable else {
        print("Local model not downloaded or available.")
        return
    }
    
    let instructions = """
    You are a screen-aware assistant. Your job is to locate GUI controls 
    on a custom 1024x802 VST window.
    """
    
    let session = LanguageModelSession(instructions: instructions)
    
    do {
        let response = try await session.respond {
            "Look at this screenshot of the VST window."
            Attachment(cgImage)
            "Locate the blank step-sequencer buttons located below the instrument channel labels."
            "What are the center coordinates (X, Y) for the first active step?"
        }
        print("Model Grounding Output: \(response.content)")
    } catch {
        print("Inference failed: \(error)")
    }
}
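
Since the snippet above only prints free-form text, the reply still has to be turned into numbers before anything can be clicked. Here is a rough sketch of one way to do that defensively: pull the first "(x, y)" integer pair out of the reply and drop it if it falls outside the window. PixelPoint and parsePoint are hypothetical helpers, and the assumption that the model answers in a "(412, 655)" style is mine, not something the framework guarantees.

```swift
import Foundation

struct PixelPoint: Equatable {
    var x: Int
    var y: Int
}

/// Returns the first parenthesized integer pair found in `reply`,
/// or nil if none parses or the pair lies outside the window bounds.
func parsePoint(from reply: String, windowWidth: Int, windowHeight: Int) -> PixelPoint? {
    var searchStart = reply.startIndex
    while let open = reply[searchStart...].firstIndex(of: "("),
          let close = reply[open...].firstIndex(of: ")") {
        let inside = reply[reply.index(after: open)..<close]
        let parts = inside.split(separator: ",")
            .map { $0.trimmingCharacters(in: .whitespaces) }
        if parts.count == 2, let x = Int(parts[0]), let y = Int(parts[1]) {
            // Treat an out-of-window answer as a hallucination rather than clamping it.
            guard (0..<windowWidth).contains(x), (0..<windowHeight).contains(y) else {
                return nil
            }
            return PixelPoint(x: x, y: y)
        }
        // Not a numeric pair (e.g. "(X, Y)"); keep scanning after this group.
        searchStart = reply.index(after: close)
    }
    return nil
}

let parsed = parsePoint(from: "Center (X, Y) = (412, 655)",
                        windowWidth: 1024, windowHeight: 802)
print(parsed as Any)
```

This does not handle decimals or nested parentheses. The framework's guided generation (a @Generable output type) may be a sturdier way to get structured values than parsing text, though I have not compared the two here.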

My Questions for the Community:

  1. Performance & Latency: The local LanguageModelSession.respond call takes several seconds to run on device. For real-time DAW automation, this is a bottleneck. Has anyone experimented with using a custom LoRA adapter or a smaller model profile to speed up spatial coordinate inference?
  2. Coordinate Stability: Multimodal models can sometimes hallucinate coordinates (bounding box values). What strategies are you using to constrain the model output to precise pixel boundaries on varying display scaling configurations (Retina vs non-Retina)?
  3. Alternative Solutions: Are there newer on-device vision APIs (perhaps in Core ML or Vision) that are better suited to bounding-box grounding of abstract graphics (like dials/knobs) than a general language model session?
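
To make question 2 concrete, this is roughly the Retina handling I have in mind, as a sketch in plain Swift. It assumes the screenshot is captured at the display's backing scale (so a 1024x802 point window becomes 2048x1604 pixels at 2x) and that clicks are posted in global points with a top-left origin. Point2D and screenPoint are hypothetical names, and the window origin used below is made up.

```swift
import Foundation

struct Point2D: Equatable {
    var x: Double
    var y: Double
}

/// Maps a pixel in the captured screenshot to a global screen position in points.
/// - scale: backing scale factor of the display (2.0 on typical Retina displays, 1.0 otherwise).
/// - windowOrigin: top-left corner of the window in global points.
/// - windowSizeInPoints: window width (x) and height (y) in points.
func screenPoint(forPixel pixel: Point2D, scale: Double,
                 windowOrigin: Point2D, windowSizeInPoints: Point2D) -> Point2D {
    // Screenshot pixels to window-local points.
    let localX = pixel.x / scale
    let localY = pixel.y / scale
    // Clamp into the window so a stray coordinate can never land outside it.
    let clampedX = min(max(localX, 0), windowSizeInPoints.x)
    let clampedY = min(max(localY, 0), windowSizeInPoints.y)
    // Window-local points to global points.
    return Point2D(x: windowOrigin.x + clampedX, y: windowOrigin.y + clampedY)
}

let windowOrigin = Point2D(x: 100, y: 50)
let windowSize = Point2D(x: 1024, y: 802)

// The same control seen in a 2x capture and in a 1x capture.
let retinaTarget = screenPoint(forPixel: Point2D(x: 824, y: 1310), scale: 2,
                               windowOrigin: windowOrigin, windowSizeInPoints: windowSize)
let standardTarget = screenPoint(forPixel: Point2D(x: 412, y: 655), scale: 1,
                                 windowOrigin: windowOrigin, windowSizeInPoints: windowSize)
print(retinaTarget, standardTarget)
```

Clamping only keeps the click inside the window; it does nothing to confirm the model picked the right control, which is the part I am still unsure how to verify.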

Would love to hear how others are approaching screen-aware GUI interpretation with these new frameworks!

Thanks!
