Accelerate


Make large-scale mathematical computations and image calculations, optimized for high performance and low energy consumption, using Accelerate.

Accelerate Documentation

Posts under Accelerate tag

24 Posts
Post not yet marked as solved
0 Answers
52 Views
Hi there, I'm writing some audio plug-ins that use biquad filtering of incoming audio. The audio is supplied to me as vectors of doubles. I am using the Accelerate functions vDSP_biquad_CreateSetupD, vDSP_biquad_DestroySetupD, and vDSP_biquadD on a vDSP_biquad_SetupD object. When the user changes the filter parameters, I want to update the coefficients of the biquad filter. I assumed that I would be able to use the new vDSP_biquad_SetCoefficientsDouble function, but that requires a vDSP_biquad_Setup rather than a vDSP_biquad_SetupD — i.e. a single-precision setup, rather than the double-precision setup that I would have thought it would require. Is that an error? How do I update the coefficients of a double-precision object? Thanks in advance, Michael
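For reference, here is a minimal sketch (not from the post) of the double-precision biquad workflow described above; the coefficient values are placeholders rather than a designed filter, and each section is five coefficients in the order b0, b1, b2, a1, a2:

import Accelerate

// One biquad section: b0, b1, b2, a1, a2 (placeholder values).
let coefficients: [Double] = [0.2, 0.4, 0.2, -0.5, 0.25]
let sectionCount = vDSP_Length(1)

guard let setup = vDSP_biquad_CreateSetupD(coefficients, sectionCount) else {
    fatalError("Unable to create double-precision biquad setup")
}
defer { vDSP_biquad_DestroySetupD(setup) }

// The delay (state) buffer needs 2 * M + 2 zeros and must persist across calls.
var delay = [Double](repeating: 0, count: 2 * Int(sectionCount) + 2)

let input = (0..<64).map(Double.init)
var output = [Double](repeating: 0, count: input.count)

vDSP_biquadD(setup, &delay, input, 1, &output, 1, vDSP_Length(input.count))

In the absence of an answer, one workaround is to rebuild the setup with vDSP_biquad_CreateSetupD whenever the parameters change, carrying the delay buffer across if the filter state should be preserved; whether vDSP_biquad_SetCoefficientsDouble is supposed to accept a vDSP_biquad_SetupD remains the open question above.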
Posted by mjsnorris. Last updated.
Post not yet marked as solved
1 Answer
182 Views
I'm using an M1 Pro and have successfully installed NumPy with Accelerate, and it really speeds up my programs. I also ran np.test() to check correctness and every test passed. However, I can't install SciPy with Accelerate, since the official documentation says Accelerate's LAPACK is too old a version. I can't even find a SciPy build that passes scipy.test(). I tried the commands below:

conda install numpy 'libblas=*=*accelerate'
conda install scipy
np.test() fails, and sp.test() can't even finish.

conda install numpy 'libblas=*=*openblas'
conda install scipy
Both np.test() and sp.test() finish, but with many failures. I believe the bugs are due to conda.

pip install --no-binary :all: --no-use-pep517 numpy
pip install scipy
np.test() has no failures and runs fast; sp.test() uses OpenBLAS and has 3 failures. This is the best version I have found.

So my question is: can we find a reliable version of SciPy on M1? Considering the popularity of SciPy, I don't think that's an unreasonable expectation. And a question for Apple: is there really a plan to upgrade the LAPACK in Accelerate?
Posted by billysrh. Last updated.
Post marked as solved
2 Answers
198 Views
I tried to read in the ldoor matrix and attempted the LLT factorization, but it gives me: "parseLdoor[55178:5595352] Factored does not hold a completed matrix factorization. (lldb)". Because the ldoor matrix is large I have not been able to discover the issue. I am unsure if the matrix data was converted correctly via the SparseConvertFromCoordinate function. On the other hand, I was able to use the same code to get the correct answers for the simple 4x4 example used in the Sparse Solvers documentation. Any help would be appreciated. Here is my code ... without the ldoor matrix
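For comparison, here is a minimal sketch in the spirit of the documented 4x4 example (a small placeholder matrix, not the ldoor data) that checks the factorization status before using it; any status other than SparseStatusOK will produce exactly the "does not hold a completed matrix factorization" complaint when the factorization is used later:

import Accelerate

// Lower triangle of a small symmetric positive-definite matrix in coordinate form.
let rowIndices: [Int32]    = [0, 1, 3, 1, 2, 3, 2, 3]
let columnIndices: [Int32] = [0, 0, 0, 1, 1, 1, 2, 3]
let values: [Double]       = [10, 1, 2.5, 12, -0.3, 1.1, 9.5, 6]

var attributes = SparseAttributes_t()
attributes.triangle = SparseLowerTriangle
attributes.kind = SparseSymmetric

let A = SparseConvertFromCoordinate(4, 4, values.count, 1,
                                    attributes,
                                    rowIndices, columnIndices, values)
defer { SparseCleanup(A) }

// Cholesky (LLT) factorization; always check the status before solving.
let llt = SparseFactor(SparseFactorizationCholesky, A)
defer { SparseCleanup(llt) }

guard llt.status == SparseStatusOK else {
    fatalError("Factorization failed with status \(llt.status)")
}

With a matrix as large as ldoor, a failed factorization can also come from the SparseAttributes_t triangle/kind settings not matching the way the coordinate data was exported, so that is worth double-checking alongside the SparseConvertFromCoordinate call.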
Posted. Last updated.
Post not yet marked as solved
4 Answers
871 Views
I have written a simple test C++ program (below) that takes the square root of a negative number and then tries to print it out. I would like to trap the floating-point exception caused by taking the square root of a negative number (e.g., I'd like the program to halt with an error after the floating-point exception). On Intel Macs, I know how to do this. Is this possible on an Apple Silicon Mac?

#include <cmath>
#include <iostream>

int main() {
    const double x = -1.0;
    double y = x;
    y = sqrt(y); // floating point exception...possible to build program so it terminates here?
    std::cout << y << "\n";
    return 0;
}
Posted. Last updated.
Post not yet marked as solved
2 Answers
216 Views
Reading a solution given in a book for adding the elements of an input array of doubles, an example is given with Accelerate as:

func challenge52c(numbers: [Double]) -> Double {
    var result: Double = 0.0
    vDSP_sveD(numbers, 1, &result, vDSP_Length(numbers.count))
    return result
}

I can understand why Accelerate APIs don't adhere to the Swift API design guidelines, but why is it that they don't seem to follow the Cocoa guidelines either? Are there other conventions or precedents that I'm missing?
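For what it's worth, the terse name comes from the long-standing C vDSP interface (vDSP_sveD computes the sum of vector elements in double precision); the newer Swift overlay wraps the same operation behind a more conventional spelling. A sketch of the equivalent (macOS 10.15 / iOS 13 and later):

import Accelerate

func challenge52cOverlay(numbers: [Double]) -> Double {
    // vDSP.sum is the Swift-overlay counterpart of vDSP_sveD.
    return vDSP.sum(numbers)
}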
Posted by Curiosity. Last updated.
Post not yet marked as solved
2 Answers
616 Views
My clients are medical researchers investigating methods for characterizing patients' gait from raw accelerometry by matching the data stream against a set of "templates" for the various characteristics. Their (labyrinthine) pseudocode appears to slide a snippet (a "template"; a kernel?) across a data stream looking for goodness of fit by correlation coefficient. This is done for each of several templates, so performance is at a premium. As I read the name and the 13-word description, vDSP.correlate(_:withKernel:) does this — in some way. However, the numbers that emerge from my playground experiments don't make sense: identical segments are scored 0.0 (should be 1.0, right?). Merely similar matches show values barely distinguishable from the rest, and are often well outside the range -1.0 ... 1.0. Clearly I'm doing it wrong. Web searches don't tell me anything, but I'm naïve on the subject. Am I mistaken in hoping this vDSP function does what I want? "Yes, you're mistaken" is an acceptable answer. Bonus if you can point me to a correct solution. If I'm on the right track, how can I generate inputs so I can interpret the output as fits my needs? Note: Both streams are normalized to µ = 0.0 and σ = 1.0, by vDSP, and validated by all the unit tests I've done so far.
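vDSP.correlate(_:withKernel:) appears to return raw sliding dot products rather than normalized correlation coefficients, which would explain values well outside -1.0 ... 1.0. As a point of comparison, here is a minimal, unoptimized sketch (a hypothetical helper, not the clients' algorithm) that computes a Pearson coefficient for one window from vDSP building blocks; sliding it along the stream is then an ordinary loop, and identical windows score 1.0:

import Accelerate

// Pearson correlation coefficient of two equal-length windows.
func pearson(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count && !a.isEmpty)
    let n = Double(a.count)
    let da = vDSP.add(-vDSP.mean(a), a)   // a minus its mean
    let db = vDSP.add(-vDSP.mean(b), b)   // b minus its mean
    let covariance = vDSP.sum(vDSP.multiply(da, db)) / n
    let sigmaA = (vDSP.sum(vDSP.multiply(da, da)) / n).squareRoot()
    let sigmaB = (vDSP.sum(vDSP.multiply(db, db)) / n).squareRoot()
    return covariance / (sigmaA * sigmaB)
}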
Posted by fritza. Last updated.
Post not yet marked as solved
0 Answers
427 Views
While the above three frameworks (viz. vImage, CoreImage, and MetalPerformanceShaders) serve different overall purposes, what are the strengths and weaknesses of each of the three in terms of performance with respect to image processing? It seems that any of the three frameworks is highly performant, but where does each framework shine?
Posted. Last updated.
Post not yet marked as solved
3 Answers
1.9k Views
I bought a new MacBook Air with the M1 chip last week. It is running Big Sur 11.2.3. My code in RStudio is extremely slow; it takes around 7 minutes on this new laptop. I have tried using R directly (rather than RStudio), and the same thing happens. I've checked it on my sister's Air (macOS Mojave 10.14.6), and it takes only seconds to run the same code. What would be the reason that my one-week-old laptop is so slow to run the R code? And what would be the solutions? Any help is much appreciated!
Posted by bngzdmr. Last updated.
Post not yet marked as solved
1 Answer
279 Views
Hi, Xcode fails to build the very simple code shown below ONLY IF the build configuration is Debug, and produces:

Undefined symbols for architecture arm64: "(extension in Accelerate):Accelerate.AccelerateMutableBuffer< where A: Swift.MutableCollection>.withUnsafeMutableBufferPointer((inout Swift.UnsafeMutableBufferPointer<A.Accelerate.AccelerateBuffer.Element>) throws -> A1) throws -> A1", referenced

If the build configuration is Release, the build succeeds and it runs just fine. Note: The code below is just a sample to reproduce the issue easily. I don't want to use the Accelerate framework or those pointer functions in viewDidLoad() for the actual project.

import UIKit
import Accelerate

class ViewController: UIViewController {
    override func viewDidLoad() {
        super.viewDidLoad()
        // Do any additional setup after loading the view.
        let b = UnsafeMutablePointer<Float>.allocate(capacity: 10)
        var c = UnsafeMutableBufferPointer(start: b, count: 10)
        c.withUnsafeMutableBufferPointer { buf in
            let base = buf.baseAddress
            print("test ", base!)
        }
    }
}

Xcode version is 13.2.1; the target iOS version is 15.2. The only workaround I know for now is to build for Release, but I really need the debugger for the real project I'm working on. Any help, advice, or comments would be appreciated. Best Regards, Hikaru
Posted by hiktsu. Last updated.
Post not yet marked as solved
3 Answers
301 Views
I've implemented an FFT using the Accelerate framework and I'm not sure I've done it correctly. For starters, I don't like that the imaginary array is filled with zeroes; this seems to be a waste of memory. However, the more serious issue is that I get intermittent crashes (malloc errors) when using the vDSP API. I've read the online docs and tried to follow several online samples. Could somebody more knowledgeable with these APIs have a look? See attached file: FFT.swift
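The attached FFT.swift is not reproduced here, but for context, a minimal forward DFT with the vDSP_DFT API looks roughly like the sketch below. The zero-filled imaginary array is simply how the complex-input (zop) variant expects a real signal; the real-to-complex (zrop) variant packs a real signal into both arrays instead of wasting one on zeros. Malloc-style crashes with the split-complex APIs usually come from pointers to temporaries outliving their scope or from buffers sized to the wrong length, but without the attachment that is only a guess.

import Accelerate

let n = 8
let realIn: [Float] = [1, 2, 3, 4, 4, 3, 2, 1]
let imagIn = [Float](repeating: 0, count: n)   // zero imaginary part for a real signal
var realOut = [Float](repeating: 0, count: n)
var imagOut = [Float](repeating: 0, count: n)

// Complex-to-complex forward DFT; the length must be f * 2^n (f in 1, 3, 5, 15) and at least 8.
guard let setup = vDSP_DFT_zop_CreateSetup(nil, vDSP_Length(n), .FORWARD) else {
    fatalError("Unable to create DFT setup")
}
defer { vDSP_DFT_DestroySetup(setup) }

vDSP_DFT_Execute(setup, realIn, imagIn, &realOut, &imagOut)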
Posted by pnadeau. Last updated.
Post marked as solved
2 Answers
412 Views
Hi, I am a Julia/C++ developer and a total novice with the Apple ecosystem. I'm looking for examples and documentation for the SparseSolve C++ API from the Accelerate framework. So far, I have only found Swift and Objective-C documentation. Any hints from the community?
Posted. Last updated.
Post not yet marked as solved
3 Answers
288 Views
I am using the Accelerate framework to convert YUV data to ARGB data for a video call app. The framework works great. However, when I hold calls I use a placeholder image sent from the server. That image sometimes causes issues because of its size: Accelerate tells me that its region of interest is larger than the input buffer (roiLargerThanInputBuffer). I am not sure exactly how to address this issue. Any thoughts or suggestions would be greatly appreciated.

The problem was that my video stream's pixel buffer width and height changed on the server side. That being said, all that needed to be done was to check for when it changes, free the current vImage_Buffer, and reinitialize a new one with the correct size. Is it proper to tell the Accelerate framework to change the vImage_Buffer width and height this way? It seems to work well.

if myBuffer.height != destinationBuffer.height {
    free(destinationBuffer.data)
    error = vImageBuffer_Init(&destinationBuffer,
                              vImagePixelCount(myBuffer.height),
                              vImagePixelCount(myBuffer.width),   // vImageBuffer_Init takes height, then width
                              cgImageFormat.bitsPerPixel,
                              vImage_Flags(kvImageNoFlags))
    guard error == kvImageNoError else {
        return nil
    }
}

Thanks
Posted by Cartisim. Last updated.
Post not yet marked as solved
1 Answer
429 Views
I am really lost and very much a newbie to programming. I am trying to use the BLAS and LAPACK libraries. I am programming in VS Code, and when I try to build my code at the terminal I use this command: g++ MV_mult_sequential.cpp -I/usr/local/include -L/usr/local/lib -llapack -lblas. I think there is some way I could be using Accelerate to do this, but again I have no clue. I have no idea if I'm doing anything correctly. I could use some guidance.
Posted by yeshlurn. Last updated.
Post not yet marked as solved
1 Answer
581 Views
Please help me, this is really urgent. The compatibility of the M1 Max chip has cost me hundreds of hours.

1. Please show me how to speed up source code downloaded from GitHub, such as numpy, pandas, or any other package, by fully using the CPU and GPU (Python 3.8 and 3.9). Can I do it just like this?
Step 1: download the source from GitHub
Step 2: create a file named "site.cfg" in the source directory and add the content:
[accelerate]
libraries = Metal, Accelerate, vecLib
Step 3: in Terminal: NPY_LAPACK_Order=accelerate python3 setup.py build
Step 4: pip3 install . or python3 setup.py install? (I am not sure which method to apply)

2. How is the compatibility of Accelerate and Metal? Can they work with most packages? Any tips? For example https://github.com/microsoft/qlib

3. Which gcc should I install? Show me the commands. When I do it, some errors happen: gcc (version 4.2.1, installed by brew) cannot compile some packages, such as "ecos". Moreover, I cannot compile many packages directly with python3 setup.py install (without Accelerate). How do I configure gcc, and which version should I use on the M1 Max?

4. Sometimes I can compile a package with brew, but it is extremely inconvenient, because I need to install packages into a virtual environment (e.g. a conda env) rather than into the base path. What should I do? Can I install brew in a virtual environment? Or should I just use brew to build the source and then install it with pip in the virtual env? Or can I configure brew to install only into the virtual environment? Just show me the commands.

5. To compile, do I also need to install g++? Which version? Show me the commands.

6. Show me how to speed up a Python program with the GPU and parallel computing on Accelerate.
Posted by jefftang. Last updated.
Post not yet marked as solved
0 Answers
417 Views
The project is based on Python 3.8 and 3.9 and contains some C and C++ source. How can I do parallel computing on the CPU and GPU of the M1 Max? Indeed, I bought the M1 Max Mac for the strong GPU to do quantitative finance, for which speed is extremely important. Unfortunately, CUDA is not compatible with Mac. Show me how to do it, thanks.

1. Can Accelerate (for the CPU) and Metal (for the GPU) speed up any source by building like this?
Step 1: download the source from GitHub
Step 2: create a file named "site.cfg" in the source directory and add the content:
[accelerate]
libraries = Metal, Accelerate, vecLib
Step 3: in Terminal: NPY_LAPACK_Order=accelerate python3 setup.py build
Step 4: pip3 install . or python3 setup.py install? (I am not sure which method to apply)

2. How is the compatibility of such a method? I need to speed up numpy, pandas, and even an open source project such as https://github.com/microsoft/qlib

3. Just show me the commands.

4. When compiling C and C++ source, a lot of errors were reported. Which gcc and g++ should I choose? The default gcc installed by brew is 4.2.1, which cannot work, and I even tried to download gcc from the official ARM website, and it still cannot work. Give me a hint. Thanks so much, urgent.
Posted by jefftang. Last updated.
Post not yet marked as solved
0 Answers
281 Views
Hi all, I've spent some time experimenting with the BNNS (Accelerate) LSTM-related APIs lately, and despite a distinct lack of documentation (even though the headers have quite a bit), I got most things to a point where I think I know what's going on and I get the expected results. However, one thing I have not been able to do is to get this working when inputSize != hiddenSize. I am currently only concerned with a simple unidirectional LSTM with a single layer, but none of my permutations of gate "iw_desc" matrices with various 2D layouts and reorderings of input-size/hidden-size made any difference; ultimately BNNSDirectApplyLSTMBatchTrainingCaching always returns -1 as an indication of error. Any help would be greatly appreciated.

PS: The bnns.h framework header claims that "When a parameter is invalid or an internal error occurs, an error message will be logged. Some combinations of parameters may not be supported. In that case, an info message will be logged.", and yet I've not been able to find any such messages logged to NSLog(), stderr, or the Console. Is there a magic environment variable that I need to set to get more verbose logging?
Posted by andi. Last updated.
Post not yet marked as solved
0 Answers
462 Views
Hi, I am porting some applications to M1 that make extensive use of vDSP. I found in many cases there to be a minimal speed-up, which I put down to Rosetta doing a good job translating SSE instructions into equivalent Neon instructions in the vDSP library. To try and understand this more I started profiling various areas of code and have found situations where translated code runs faster than native code. Often native code speed is similar or faster as expected, but there is a notable number of cases where it is not. This is not what I expected. I include a sample below to show a somewhat contrived and trivial routine exhibiting the effect. I have built it using Xcode 12.5.1 in Release with an 11.3 deployment target. The Mac is running macOS 11.6. On my M1 Mac mini the Rosetta build takes around 900-1000 µs to run to completion; switching to native code it takes around 1500-1600 µs. I can make various adjustments to the data size or types of vDSP operations used to find scenarios where native builds are faster, that is not difficult, but it shouldn't be necessary. I can understand why vDSP could perhaps perform similarly across native vs translated runs, but surely it should never be the case that translated code beats native code by a margin like this. What is going on, and is it expected? Thanks, Matt

#include <iostream>
#include <cstdlib>
#include <cstring>
#include <cerrno>
#include <ctime>
#include <sys/types.h>
#include <sys/sysctl.h>
#include <Accelerate/Accelerate.h>

// determine if the process is running through Rosetta translation
int processIsTranslated()
{
    int ret = 0;
    size_t size = sizeof(ret);
    if (sysctlbyname("sysctl.proc_translated", &ret, &size, NULL, 0) == -1)
    {
        if (errno == ENOENT)
            return 0;
        return -1;
    }
    return ret;
}

int main(int argc, const char * argv[])
{
    // print translation status
    if (processIsTranslated() == 1)
        std::cout << "Rosetta" << std::endl;
    else
        std::cout << "Native" << std::endl;

    // size of test
    vDSP_Length array_len = 512;
    const int iterations = 10000;

    // allocate and clear memory
    float* buf1_ptr = (float*)malloc(array_len * sizeof(float));
    float* buf2_ptr = (float*)malloc(array_len * sizeof(float));
    float* buf3_ptr = (float*)malloc(array_len * sizeof(float));
    float* buf4_ptr = (float*)malloc(array_len * sizeof(float));
    if (!buf1_ptr) return EXIT_FAILURE;
    if (!buf2_ptr) return EXIT_FAILURE;
    if (!buf3_ptr) return EXIT_FAILURE;
    if (!buf4_ptr) return EXIT_FAILURE;
    memset(buf1_ptr, 0, array_len * sizeof(float));
    memset(buf2_ptr, 0, array_len * sizeof(float));
    memset(buf3_ptr, 0, array_len * sizeof(float));
    memset(buf4_ptr, 0, array_len * sizeof(float));

    // start timer
    __uint64_t start_ns = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);

    // scalar constants
    const float scalar1 = 10;
    const float scalar2 = 11;

    // loop test
    for (int i = 0; i < iterations; i++)
    {
        vDSP_vsadd(buf1_ptr, 1, &scalar1, buf2_ptr, 1, array_len);
        vDSP_vsadd(buf1_ptr, 1, &scalar2, buf3_ptr, 1, array_len);
        vDSP_vadd(buf2_ptr, 1, buf3_ptr, 1, buf4_ptr, 1, array_len);
    }

    // report test time
    __uint64_t end_ns = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
    double time_us = (end_ns - start_ns) / 1000.f;
    std::cout << time_us << " us" << std::endl;

    // clean up
    if (buf1_ptr) free(buf1_ptr);
    if (buf2_ptr) free(buf2_ptr);
    if (buf3_ptr) free(buf3_ptr);
    if (buf4_ptr) free(buf4_ptr);

    return 0;
}
Posted. Last updated.
Post not yet marked as solved
1 Answer
436 Views
vDSP.convolve() reverses the kernel before applying it. For example, the following uses a kernel of 10 elements where the first element is 1.0 and the rest of the elements are 0.0. Applying this kernel to a vector should return the same vector.

let values = (0 ..< 30).map { Double($0) }
var kernel = Array.init(repeating: 0.0, count: 10)
kernel[0] = 1.0
let result = vDSP.convolve(values, withKernel: kernel)
print("kernel: \(kernel)")
print("values: \(values)")
print("result: \(result)")

Applied to a values array containing elements 0.0, 1.0, 2.0, etc., the first results should be 0.0, 1.0, 2.0, etc., but instead the results start at 9.0 and increase from there:

kernel: [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
values: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0]
result: [9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0]

If instead the kernel is reversed, placing the 1.0 at the end of the kernel:

let values = (0 ..< 30).map { Double($0) }
var kernel = Array.init(repeating: 0.0, count: 10)
kernel[9] = 1.0
let result = vDSP.convolve(values, withKernel: kernel)
print("kernel: \(kernel)")
print("values: \(values)")
print("result: \(result)")

The results are now correct:

kernel: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
values: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0]
result: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]
Posted by jolonf. Last updated.
Post not yet marked as solved
1 Answer
385 Views
vDSP.convolve() returns an array with length:

values.count - kernel.count

But shouldn't the result array have length:

values.count - kernel.count + 1

I ran the following, which prints out the size of the results array with various combinations of values and kernel lengths:

for i in 0 ..< 10 {
    let values = Array.init(repeating: 1.0, count: 1000 + i)
    for j in 0 ..< 10 {
        let kernel = Array.init(repeating: 1.0, count: 100 + j)
        let result = vDSP.convolve(values, withKernel: kernel)
        print("values[\(values.count)], kernel[\(kernel.count)], result[\(result.count)], result[\(result.count - 1)] = \(result[result.count - 1])")
    }
}

As you can see, the results array always has length values.count - kernel.count:

values[1000], kernel[100], result[900], result[899] = 100.0
values[1000], kernel[101], result[899], result[898] = 101.0
values[1000], kernel[102], result[898], result[897] = 102.0
values[1000], kernel[103], result[897], result[896] = 103.0
values[1000], kernel[104], result[896], result[895] = 104.0
values[1000], kernel[105], result[895], result[894] = 105.0
values[1000], kernel[106], result[894], result[893] = 106.0
values[1000], kernel[107], result[893], result[892] = 107.0
values[1000], kernel[108], result[892], result[891] = 108.0
values[1000], kernel[109], result[891], result[890] = 109.0
values[1001], kernel[100], result[901], result[900] = 100.0
values[1001], kernel[101], result[900], result[899] = 101.0
values[1001], kernel[102], result[899], result[898] = 102.0
values[1001], kernel[103], result[898], result[897] = 103.0
values[1001], kernel[104], result[897], result[896] = 104.0
values[1001], kernel[105], result[896], result[895] = 105.0
...

However, the result array should have length values.count - kernel.count + 1. For example, if instead of using the returned result array, a result array with length values.count - kernel.count + 1 is passed to vDSP.convolve, the last value has a valid result:

for i in 0 ..< 10 {
    let values = Array.init(repeating: 1.0, count: 1000 + i)
    for j in 0 ..< 10 {
        let kernel = Array.init(repeating: 1.0, count: 100 + j)
        var result = Array.init(repeating: 0.0, count: values.count - kernel.count + 1)
        vDSP.convolve(values, withKernel: kernel, result: &result)
        print("values[\(values.count)], kernel[\(kernel.count)], result[\(result.count)], result[\(result.count - 1)] = \(result[result.count - 1])")
    }
}

values[1000], kernel[100], result[901], result[900] = 100.0
values[1000], kernel[101], result[900], result[899] = 101.0
values[1000], kernel[102], result[899], result[898] = 102.0
values[1000], kernel[103], result[898], result[897] = 103.0
values[1000], kernel[104], result[897], result[896] = 104.0
values[1000], kernel[105], result[896], result[895] = 105.0
values[1000], kernel[106], result[895], result[894] = 106.0
values[1000], kernel[107], result[894], result[893] = 107.0
values[1000], kernel[108], result[893], result[892] = 108.0
values[1000], kernel[109], result[892], result[891] = 109.0
values[1001], kernel[100], result[902], result[901] = 100.0
values[1001], kernel[101], result[901], result[900] = 101.0
values[1001], kernel[102], result[900], result[899] = 102.0
values[1001], kernel[103], result[899], result[898] = 103.0
values[1001], kernel[104], result[898], result[897] = 104.0
values[1001], kernel[105], result[897], result[896] = 105.0

If the result array is created with length values.count - kernel.count + 2 then we get the following runtime error:

error: Execution was interrupted, reason: EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0). The process has been left at the point where it was interrupted, use "thread return -x" to return to the state before expression evaluation.

This indicates that the extra element in the result array is valid and that vDSP.convolve() is returning a result array which is one element too short.
Posted by jolonf. Last updated.