Problems integrating Hypervisor.framework APIC and IOAPIC

Introduction

I'm trying to integrate support for the APIC implementation added to Hypervisor.framework back in macOS 12 into the open source Qemu VMM. Qemu contains a VMM-side software implementation of the APIC, but it shows up as a major performance constraint in profiling, so it'd be nice to use the in-kernel implementation. I've previously submitted a DTS TSI (case 3345863) about this and received some high-level pointers, but I'm told the forums are now the focus for DTS.

I've got things working to what feels like 95%, but I'm still tripping up on a few things. FreeBSD and macOS guests are successfully booting and running most of the time, but there are sporadic stalls which point towards undelivered interrupts. Linux fails early on. A number of key test cases are failing in the 'apic' and 'ioapic' test suites that are part of the open source 'kvm-unit-tests' project, and I've run out of ideas for workarounds.

Broadly, I'm doing this (rough code sketches for the main steps follow the list):

  1. When calling hv_vm_create, I pass the HV_VM_ACCEL_APIC flag.
  2. Each vCPU runs via the newer hv_vcpu_run_until() API. After each VM exit, I query hv_vcpu_exit_info() in case there's anything else to do.
  3. Page fault VM exits in the APIC's MMIO range are forwarded to hv_vcpu_apic_write or hv_vcpu_apic_read as appropriate. (With an hv_vcpu_exit_info check and post-processing if no_side_effect returns true.)
  4. Writes to the APICBASE MSR go through some sanity checks (injecting an exception on invalid state transitions, etc.) and update the MMIO range via hv_vmx_vcpu_set_apic_address() if necessary. HVF seems to do its own additional handling for the actual APIC state changes. (Moving the MMIO range and enabling the APIC at the same time fails: FB14021745.)
  5. Various machinery and state handling around INIT and STARTUP IPIs for bringing up the other vCPUs. This was fiddly to get working but I think I've got it now.
  6. MSIs from virtual devices are delivered via hv_vm_lapic_msi.
  7. Reads and writes for PIC and ELCR I/O ports are forwarded to the hv_vm_atpic_port_write/hv_vm_atpic_port_read APIs. (In theory, interrupt levels on the PIC are controlled via hv_vm_atpic_assert_irq/hv_vm_atpic_deassert_irq but all modern OSes disable the PIC anyway.)
  8. Page faults for the IOAPIC's MMIO range are forwarded to hv_vm_ioapic_read/hv_vm_ioapic_write. Virtual devices deliver their interrupts using hv_vm_ioapic_assert_irq/hv_vm_ioapic_deassert_irq for level-triggered interrupts and hv_vm_ioapic_pulse_irq for edge-triggered ones.
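
To make steps 1-3 concrete, here is roughly what the setup and run loop look like. This is a heavily simplified sketch, not the actual Qemu patch: decode_mmio_access() is a hypothetical stand-in for the VMM's instruction decoder, the apic_base variable stands in for the APICBASE tracking from step 4, and I'm writing the signatures of the newer APIC calls (hv_vcpu_apic_read/hv_vcpu_apic_write and the no_side_effect out-parameter) from memory, so check them against the Hypervisor framework headers.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    #include <Hypervisor/hv.h>
    #include <Hypervisor/hv_vmx.h>

    #define APIC_MMIO_SIZE 0x1000ULL

    static uint64_t apic_base = 0xfee00000ULL; /* updated on APICBASE MSR writes (step 4) */

    /* Hypothetical helper: decode the faulting instruction (direction and value). */
    extern void decode_mmio_access(hv_vcpuid_t vcpu, bool *is_write, uint64_t *value);

    /* Step 1: opt in to the kernel-side local APIC when creating the VM. */
    static void vm_init(void)
    {
        if (hv_vm_create(HV_VM_DEFAULT | HV_VM_ACCEL_APIC) != HV_SUCCESS) {
            abort();
        }
    }

    /* Steps 2-3: run a vCPU and forward APIC-range MMIO faults to the in-kernel APIC. */
    static void vcpu_loop(hv_vcpuid_t vcpu)
    {
        for (;;) {
            hv_vcpu_run_until(vcpu, HV_DEADLINE_FOREVER);

            uint64_t reason = 0, gpa = 0;
            hv_vmx_vcpu_read_vmcs(vcpu, VMCS_RO_EXIT_REASON, &reason);
            hv_vmx_vcpu_read_vmcs(vcpu, VMCS_GUEST_PHYSICAL_ADDRESS, &gpa);

            if ((reason & 0xffff) == VMX_REASON_EPT_VIOLATION &&
                gpa >= apic_base && gpa < apic_base + APIC_MMIO_SIZE) {
                bool is_write = false;
                uint64_t value = 0;
                decode_mmio_access(vcpu, &is_write, &value);

                uint32_t offset = (uint32_t)(gpa - apic_base);
                if (is_write) {
                    bool no_side_effect = false;
                    hv_vcpu_apic_write(vcpu, offset, value, &no_side_effect);
                    if (no_side_effect) {
                        /* step 3: check hv_vcpu_exit_info() and post-process */
                    }
                } else {
                    hv_vcpu_apic_read(vcpu, offset, &value);
                    /* ...write 'value' back to the destination register... */
                }
                /* ...advance RIP past the faulting instruction... */
                continue;
            }
            /* ...other exit reasons: I/O ports, MSR accesses, HLT, ... */
        }
    }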
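
And the device-side delivery paths from steps 6 and 8, equally simplified (the hv_vm_lapic_msi/hv_vm_ioapic_* signatures are again written from memory):

    /* Step 6: deliver an MSI from a virtual device. Address and data are whatever
     * the guest programmed into the device's MSI capability or MSI-X table entry. */
    static void deliver_msi(uint64_t msi_addr, uint32_t msi_data)
    {
        hv_vm_lapic_msi(msi_addr, msi_data);
    }

    /* Step 8: level-triggered lines track the device's output via assert/deassert;
     * edge-triggered interrupts use pulse. */
    static void set_level_irq(uint32_t gsi, bool raised)
    {
        if (raised) {
            hv_vm_ioapic_assert_irq(gsi);
        } else {
            hv_vm_ioapic_deassert_irq(gsi);
        }
    }

    static void send_edge_irq(uint32_t gsi)
    {
        hv_vm_ioapic_pulse_irq(gsi);
    }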

Now for the parts where I'm stuck; either I'm doing something wrong, or there are bugs in HVF's implementation:

Issues I'm running into

IOAPIC:

1. Unmasking while the interrupt level is raised (test case test_ioapic_level_mask):

  1. The guest masks a particular level-triggered interrupt line (MMIO write to the ioredtbl entry).
  2. The virtual device raises interrupt level to 1. VMM calls hv_vm_ioapic_assert_irq().
  3. No interrupt is delivered because the line is masked; so far so good.
  4. The guest unmasks the interrupt via another write to the ioredtbl entry.

At this point I would expect the interrupt to be delivered to the vCPU. This is not the case. Even another call to hv_vm_ioapic_assert_irq() after unmasking has no effect. Only if we deassert and then reassert does the guest receive anything. (This is my current workaround, sketched below, but it is rather ugly because I essentially need to maintain shadow state to detect the situation.)
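
For reference, the workaround looks roughly like this (simplified sketch; the ioredtbl decoding is elided, 24 pins assumed, and the hv_vm_ioapic_* signatures are written from memory):

    /* Shadow state per IOAPIC pin, kept only to work around the unmask problem. */
    typedef struct {
        bool level_asserted; /* what the virtual device last told us */
        bool masked;         /* mask bit from the shadowed ioredtbl entry */
    } ioapic_pin_shadow_t;

    static ioapic_pin_shadow_t pin_shadow[24];

    /* Called by virtual devices instead of hitting the HVF API directly. */
    static void device_set_irq(uint32_t pin, bool raised)
    {
        pin_shadow[pin].level_asserted = raised;
        if (raised) {
            hv_vm_ioapic_assert_irq(pin);
        } else {
            hv_vm_ioapic_deassert_irq(pin);
        }
    }

    /* Called from the IOAPIC MMIO write path after forwarding the guest's write
     * to hv_vm_ioapic_write(), with the mask bit extracted from the new entry. */
    static void ioredtbl_written(uint32_t pin, bool now_masked)
    {
        bool was_masked = pin_shadow[pin].masked;
        pin_shadow[pin].masked = now_masked;

        /* HVF doesn't deliver a pending level interrupt on unmask, so bounce the
         * line to force it to re-evaluate. */
        if (was_masked && !now_masked && pin_shadow[pin].level_asserted) {
            hv_vm_ioapic_deassert_irq(pin);
            hv_vm_ioapic_assert_irq(pin);
        }
    }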

2. Retriggering (test case test_ioapic_level_retrigger):

  1. The vCPU enters an interrupts-disabled section (cli instruction)
  2. The virtual device asserts level-triggered interrupt. VMM calls hv_vm_ioapic_assert_irq().
  3. The vCPU leaves the interrupts-disabled section (sti instruction) and starts executing other code (or halts, as in the test case)
  4. Interrupt is delivered to vCPU, runs interrupt handler.
  5. The interrupt handler signals EOI. Note that the interrupt line is still asserted.
  6. Outside the interrupt handler, the vCPU briefly disables interrupts again (cli)
  7. The vCPU once again re-enables interrupts (sti) and halts (hlt)

Here we would expect the interrupt to be delivered again, but it is not. I don't currently have a workaround for this, because none of these steps causes an hv_vcpu_run_until exit where the condition could be detected.

3. Coalescing (test case test_ioapic_level_coalesce):

  1. The virtual device asserts a level-triggered interrupt line.
  2. The vCPU enters the corresponding handler.
  3. The device de-asserts the interrupt level.
  4. The device re-asserts the interrupt.
  5. The device once again de-asserts the interrupt.
  6. The interrupt handler sets EOI and returns.

We would expect the interrupt handler to run only once in this sequence of events, but as it turns out, it runs a second time! This is less critical than the previous two unexpected behaviours, because spurious interrupts are usually only slightly detrimental to performance, whereas undelivered interrupts can cause system hangs. Still, it doesn't exactly instill confidence in the implementation.

I've submitted the above as FB14425412, since these look like bugs to me, either in the implementation or in the documentation.

APIC:

To work around the HVF IOAPIC problems mentioned above, I tried to use the HVF APIC implementation in isolation, without the ATPIC and IOAPIC implementations. Instead, I provided VMM-side software implementations of these controllers. However, the software IOAPIC needs to receive end-of-interrupt notifications from the APIC. This is what I understood the HV_APIC_CTRL_IOAPIC_EOI flag to be responsible for, so I passed it to hv_vcpu_apic_ctrl() during vCPU initialisation. The software IOAPIC implementation receives all the MMIO writes, maintains IOAPIC state, and calls hv_vm_send_ioapic_intr() whenever interrupts should be delivered to the VM.
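
Concretely, the split setup boils down to something like this (sketch; I'm writing the hv_vcpu_apic_ctrl and hv_vm_send_ioapic_intr signatures from memory, and the real delivery call presumably carries more of the redirection entry than just the vector):

    /* vCPU init: ask HVF's local APIC to report EOIs so the software IOAPIC can
     * clear its Remote IRR and re-evaluate level-triggered lines. */
    static void vcpu_apic_init(hv_vcpuid_t vcpu)
    {
        hv_vcpu_apic_ctrl(vcpu, HV_APIC_CTRL_IOAPIC_EOI);
    }

    /* Software IOAPIC: when a redirection entry should fire, inject the interrupt
     * straight into the kernel APIC, bypassing HVF's IOAPIC entirely. */
    static void sw_ioapic_deliver(uint8_t vector)
    {
        hv_vm_send_ioapic_intr(vector);
    }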

However, I have found that hv_vcpu_exit_info() never returns HV_VM_EXITINFO_IOAPIC_EOI. When the HVF APIC is in xAPIC mode, I can detect writes to offset 0xb0 in the MMIO write handler and query hv_vcpu_exit_ioapic_eoi() for the vector whose handler has run. But once the APIC is in x2APIC mode, there are no exits for the x2APIC MSR accesses, so I can't see how I might get those EOI notifications.
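
For completeness, the xAPIC-only workaround (again a sketch; the hv_vcpu_exit_ioapic_eoi signature is from memory, and sw_ioapic_handle_eoi() is my software IOAPIC's own EOI hook):

    #define APIC_REG_EOI 0xb0u

    extern void sw_ioapic_handle_eoi(uint32_t vector); /* clears Remote IRR, re-delivers if still asserted */

    /* Hook in the APIC MMIO write forwarding path. Only works while the guest APIC
     * is in xAPIC mode; in x2APIC mode the EOI becomes an MSR write that never
     * exits, so this is never reached. */
    static void maybe_forward_eoi(hv_vcpuid_t vcpu, uint32_t offset)
    {
        if (offset != APIC_REG_EOI) {
            return;
        }
        uint32_t vector = 0;
        hv_vcpu_exit_ioapic_eoi(vcpu, &vector); /* assumed: vector returned via out-parameter */
        sw_ioapic_handle_eoi(vector);
    }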

  • Am I interpreting the purpose of HV_APIC_CTRL_IOAPIC_EOI correctly? Do I need to do anything other than hv_vcpu_apic_ctrl to make it work?
  • How should I be receiving the EOI notifications? I was expecting vCPU run exits, but that does not appear to be the case.

Again, either a crucial step is missing from the documentation, or there's a bug in the implementation. I've submitted this as FB14425590.

My Questions:

  1. Has anyone got the HVF APIC/IOAPIC working for the general-purpose case, i.e. guest-OS-agnostic, with all edge cases handled?
  2. Are the issues I've run into bugs in HVF, or do I need extra support code/workarounds to make the edge cases work?
  3. Is using the HVF APIC without its IOAPIC an intended, supported use case, or am I wasting my time on this "split" setup?