Swift performance with Accelerate when evaluating random binary trees

I am trying to move a program I currently have implemented in Python to Swift. Performance is critical as it is about a randomness-based search algorithm. What I need to do is to evaluate random binary trees where each non-leaf node represents a basic arithmetic or logical operation and each leaf-node represents a (large) vector of numbers.

I managed to get Accelerate's BNNS functions to do the calculations, but they are still slower than my (much simpler, Pandas-based) Python approach, which on average takes less than half the time under similar conditions.

It would be great if someone could review my code and tell me whether there is any further potential for optimisation and/or a better approach. In the code below I only cover the add (addition) operation, but the others are very similar in structure. I have also left out the part that generates the trees (and ensures they are "legit" in terms of consecutive operations), as I don't think it is particularly relevant here. Happy to add it as well, in case you think it would help.

import Accelerate

enum NodeValue {
  case add // Addition
  case sub // Substraction
  case mul // Multiplication
  case div // Division
  case sml // Smaller
  case lrg // Larger
  case met(columnIndex: Int) // This is a index of return_data and used for the leafs of the hierarchy tree
}

final class Node {
  var value: NodeValue
  var lhs: Node?
  var rhs: Node?
   
  init(value: NodeValue, lhs: Node?, rhs: Node?) {
    self.value = value
    self.lhs = lhs
    self.rhs = rhs
  }
   
  func evaluate_signal(return_data: [Int: [Float16]]) -> [Any] {
    // Determine output
    switch self.value {
    case .add:
      let eval_left = lhs!.evaluate_signal(return_data: return_data) as! [Float16]
      let eval_right = rhs!.evaluate_signal(return_data: return_data) as! [Float16]
      let leftDescriptor = BNNSNDArrayDescriptor.allocate(initializingFrom: eval_left,
                                shape: .vector(eval_left.count))
      let rightDescriptor = BNNSNDArrayDescriptor.allocate(initializingFrom: eval_right,
                                 shape: .vector(eval_left.count))
      let resultDescriptor = BNNSNDArrayDescriptor.allocateUninitialized(scalarType: Float16.self,
                                        shape: .vector(eval_left.count))
      let layer = BNNS.BinaryArithmeticLayer(inputA: leftDescriptor,
                          inputADescriptorType: BNNS.DescriptorType.sample,
                          inputB: rightDescriptor,
                          inputBDescriptorType: BNNS.DescriptorType.sample,
                          output: resultDescriptor,
                          outputDescriptorType: BNNS.DescriptorType.sample,
                          function: BNNS.ArithmeticBinaryFunction.add)
      try! layer!.apply(batchSize: 1,
               inputA: leftDescriptor,
               inputB: rightDescriptor,
               output: resultDescriptor)
      let resultVector: [Float16] = resultDescriptor.makeArray(of: Float16.self)!
      leftDescriptor.deallocate()
      rightDescriptor.deallocate()
      resultDescriptor.deallocate()
      return resultVector
    case .sub, .mul, .div, .sml, .lrg:
      // Similar code for subtraction, multiplication, division,
      // and the smaller/larger comparisons (omitted for brevity)
      fatalError("omitted for brevity")
    case .met(let columnIndex):
      return return_data[columnIndex]!
    }
  }
}

BNNS arithmetic layer functions are memory bound for simple operations like these, and as such will not be as performant as code that is able to keep operands in CPU registers. In addition, layer creation and destruction add overhead; where possible, we generally recommend structuring code so that the layer is created only once. To obtain the best performance, you may wish to investigate Swift's simd functionality rather than invoking BNNS, and aim to keep all operands in registers rather than writing them to/from memory at each node.
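For comparison, the same add node can be computed with Accelerate's vDSP API, which performs the vectorised operation directly with no layer setup or teardown per call. This is only a minimal sketch, and it assumes the leaf vectors are stored as `[Float]`, since vDSP does not provide a `Float16` path:

```swift
import Accelerate

// Sketch: evaluating an add node with vDSP instead of a BNNS layer.
// There is no per-call layer creation/destruction; vDSP.add allocates
// the result array once and runs a single vectorised loop.
func evaluateAdd(_ lhs: [Float], _ rhs: [Float]) -> [Float] {
    vDSP.add(lhs, rhs)
}

let a: [Float] = [1, 2, 3]
let b: [Float] = [10, 20, 30]
let sum = evaluateAdd(a, b) // [11.0, 22.0, 33.0]
```

vDSP also provides `subtract`, `multiply`, and `divide` in the same style, so the other arithmetic cases of the switch could follow the same pattern.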

Thanks, but the simd functionality (as far as I understand it) wouldn't solve my problem, as it can only handle small vectors. In my case, the vectors of real numbers in the leaf nodes have a length of around 20k. But maybe I am missing something here?

On the other item about keeping operands in registers: would you maybe have an example? Sorry, I come from a maths/statistics background and am still getting familiar with proper programming.
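To illustrate what "keeping operands in registers" might look like here, the following is a hypothetical sketch: rather than materialising a fresh array at every tree node, the 20k-element vectors are walked in `SIMD8<Float>` chunks and several operations are fused per chunk, so intermediates stay in registers. The function name and the fused expression `(a + b) * c` are made up for illustration, and the sketch assumes the element count is a multiple of 8:

```swift
import simd

// Hypothetical sketch: fuse two tree operations, (a + b) * c, into one
// pass over the data using 8-lane SIMD chunks. Intermediate results stay
// in registers instead of being written to an array between nodes.
// Assumes a.count == b.count == c.count and is a multiple of 8.
func fusedAddMul(_ a: [Float], _ b: [Float], _ c: [Float]) -> [Float] {
    var out = [Float](repeating: 0, count: a.count)
    for i in stride(from: 0, to: a.count, by: 8) {
        // Load one 8-lane chunk of each operand.
        let va = SIMD8<Float>(a[i..<i+8])
        let vb = SIMD8<Float>(b[i..<i+8])
        let vc = SIMD8<Float>(c[i..<i+8])
        // Both operations run before anything is written back to memory.
        let r = (va + vb) * vc
        for lane in 0..<8 { out[i + lane] = r[lane] }
    }
    return out
}
```

The same chunked-loop idea extends to arbitrarily long vectors (20k elements is fine), because the SIMD types only bound the chunk width, not the total length.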
