Accelerate


Perform large-scale mathematical computations and image calculations with high performance and low energy consumption using Accelerate.

Accelerate Documentation

Posts under Accelerate tag

33 results found
Post not yet marked as solved
45 Views

vImage vs CoreImage vs MetalPerformanceShaders strengths and weaknesses

While the three frameworks above (viz. vImage, Core Image, and Metal Performance Shaders) serve different overall purposes, what are the strengths and weaknesses of each of them in terms of image-processing performance? It seems that any of the three is highly performant, but where does each framework shine?
Asked Last updated
.
Post not yet marked as solved
115 Views

How to speed up open-source packages on the M1 Max with Accelerate and Metal

Please help me, this is really urgent. Compatibility issues with the M1 Max chip have cost me hundreds of hours.
1. Please show me how to speed up source downloaded from GitHub, such as numpy, pandas, or any other project, by fully using the CPU and GPU (Python 3.8 and 3.9). Can I do it like this? Step 1: download the source from GitHub. Step 2: create a file named "site.cfg" in the source directory and add the content: [accelerate] libraries = Metal, Accelerate, vecLib. Step 3: in Terminal, run NPY_LAPACK_ORDER=accelerate python3 setup.py build. Step 4: pip3 install . or python3 setup.py install? (I am not sure which method to apply.)
2. How good is the compatibility of Accelerate and Metal? Do they work with most source packages? Any tips, for example for https://github.com/microsoft/qlib?
3. Which gcc should I install? Show me the commands. When I try, errors happen; gcc 4.2.1 (installed by brew) cannot compile some packages, such as "ecos". Moreover, I cannot compile many packages directly with python3 setup.py install (without Accelerate). How should gcc be configured, and which version should I use on the M1 Max?
4. Sometimes I can build a package with brew, but that is extremely inconvenient because I need to install packages into a virtual environment (e.g. a conda env) rather than into the base environment. What should I do? Can I install brew inside a virtual environment? Or should I just use brew to build the source and then install it with pip in the virtual environment? Or can I configure brew to install only into the virtual environment? Just show me the commands.
5. To compile, do I also need to install g++? Which version? Show me the commands.
6. Show me how to speed up a Python program with the GPU and parallel computing via Accelerate.
Asked
by jefftang.
Last updated
.
Post not yet marked as solved
144 Views

How to fully apply parallel computing on the CPU and GPU of the M1 Max

The project is based on Python 3.8 and 3.9 and contains some C and C++ source. How can I do parallel computing on the CPU and GPU of the M1 Max? In fact, I bought the M1 Max Mac for its strong GPU to do quantitative finance, where speed is extremely important. Unfortunately, CUDA is not compatible with the Mac. Please show me how to do it, thanks. Can Accelerate (for the CPU) and Metal (for the GPU) speed up any source package by building like this? Step 1: download the source from GitHub. Step 2: create a file named "site.cfg" in the source directory and add the content: [accelerate] libraries = Metal, Accelerate, vecLib. Step 3: in Terminal, run NPY_LAPACK_ORDER=accelerate python3 setup.py build. Step 4: pip3 install . or python3 setup.py install? (I am not sure which method to apply.)
2. How good is the compatibility of such a method? I need to speed up numpy, pandas, and even an open-source project such as https://github.com/microsoft/qlib.
3. Just show me the code.
4. When compiling C and C++ source, a lot of errors are reported. Which gcc and g++ should I choose? The default gcc installed by brew is 4.2.1, which does not work, and I even tried downloading gcc from the official ARM website, which still does not work. Please give me a hint. Thanks so much, this is urgent.
Asked
by jefftang.
Last updated
.
Post not yet marked as solved
129 Views

BNNSLayerParametersLSTM with hiddenSize != inputSize

Hi all, I've spent some time experimenting with the BNNS (Accelerate) LSTM-related APIs lately, and despite a distinct lack of documentation (even though the headers have quite a few comments) I got most things to a point where I think I know what's going on and I get the expected results. However, one thing I have not been able to do is get this working if inputSize != hiddenSize. I am currently only concerned with a simple unidirectional LSTM with a single layer, but none of my permutations of the gate "iw_desc" matrices with various 2D layouts and input-size/hidden-size orderings made any difference; ultimately BNNSDirectApplyLSTMBatchTrainingCaching always returns -1 as an indication of error. Any help would be greatly appreciated. PS: The bnns.h framework header claims that "When a parameter is invalid or an internal error occurs, an error message will be logged. Some combinations of parameters may not be supported. In that case, an info message will be logged.", and yet I have not been able to find any such messages logged via NSLog() or to stderr or Console. Is there a magic environment variable that I need to set to get more verbose logging?
Asked
by andi.
Last updated
.
Post not yet marked as solved
1.1k Views

R painfully slow on Air M1 - Big Sur

I bought a new Air with the M1 chip last week. It is on Big Sur version 11.2.3. My code in RStudio is extremely slow; it takes around 7 minutes on this new laptop. I have tried using R directly (rather than RStudio), and the same thing happens. I checked with my sister's Air (macOS Mojave 10.14.6), and it takes only seconds to run the same code. What could be the reason that my one-week-old laptop is so slow running this R code, and what would the solutions be? Any help is greatly appreciated!
Asked
by bngzdmr.
Last updated
.
Post not yet marked as solved
280 Views

Why does the execution of vDSP operations sometimes take longer in M1 native code than through Rosetta translation?

Hi, I am porting some applications to M1 that make extensive use of vDSP. In many cases I found only a minimal speed-up, which I put down to Rosetta doing a good job of translating SSE instructions into equivalent Neon instructions in the vDSP library. To understand this better I started profiling various areas of code and have found situations where translated code runs faster than native code. Often native code is similar in speed or faster, as expected, but there is a notable number of cases where it is not. This is not what I expected. I include a sample below showing a somewhat contrived and trivial routine that exhibits the effect. I built it with Xcode 12.5.1 in Release with an 11.3 deployment target. The Mac is running macOS 11.6. On my M1 Mac mini the Rosetta build takes around 900-1000 µs to run to completion; switching to native code it takes around 1500-1600 µs. I can make various adjustments to the data size or the types of vDSP operations used to find scenarios where native builds are faster, that is not difficult, but it shouldn't be necessary. I can understand why vDSP might perform similarly across native and translated runs, but surely translated code should never beat native code by a margin like this. What is going on, and is it expected? Thanks, Matt

#include <iostream>
#include <cstdlib>
#include <cstring>
#include <cerrno>
#include <ctime>
#include <sys/types.h>
#include <sys/sysctl.h>
#include <Accelerate/Accelerate.h>

// determine if the process is running through Rosetta translation
int processIsTranslated() {
  int ret = 0;
  size_t size = sizeof(ret);
  if (sysctlbyname("sysctl.proc_translated", &ret, &size, NULL, 0) == -1) {
    if (errno == ENOENT)
      return 0;
    return -1;
  }
  return ret;
}

int main(int argc, const char * argv[]) {
  // print translation status
  if (processIsTranslated() == 1)
    std::cout << "Rosetta" << std::endl;
  else
    std::cout << "Native" << std::endl;

  // size of test
  vDSP_Length array_len = 512;
  const int iterations = 10000;

  // allocate and clear memory
  float* buf1_ptr = (float*)malloc(array_len * sizeof(float));
  float* buf2_ptr = (float*)malloc(array_len * sizeof(float));
  float* buf3_ptr = (float*)malloc(array_len * sizeof(float));
  float* buf4_ptr = (float*)malloc(array_len * sizeof(float));
  if (!buf1_ptr) return EXIT_FAILURE;
  if (!buf2_ptr) return EXIT_FAILURE;
  if (!buf3_ptr) return EXIT_FAILURE;
  if (!buf4_ptr) return EXIT_FAILURE;
  memset(buf1_ptr, 0, array_len * sizeof(float));
  memset(buf2_ptr, 0, array_len * sizeof(float));
  memset(buf3_ptr, 0, array_len * sizeof(float));
  memset(buf4_ptr, 0, array_len * sizeof(float));

  // start timer
  __uint64_t start_ns = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);

  // scalar constants
  const float scalar1 = 10;
  const float scalar2 = 11;

  // loop test: two vector-scalar adds and one vector add per iteration
  for (int i = 0; i < iterations; i++) {
    vDSP_vsadd(buf1_ptr, 1, &scalar1, buf2_ptr, 1, array_len);
    vDSP_vsadd(buf1_ptr, 1, &scalar2, buf3_ptr, 1, array_len);
    vDSP_vadd(buf2_ptr, 1, buf3_ptr, 1, buf4_ptr, 1, array_len);
  }

  // report test time
  __uint64_t end_ns = clock_gettime_nsec_np(CLOCK_UPTIME_RAW);
  double time_us = (end_ns - start_ns) / 1000.f;
  std::cout << time_us << " us" << std::endl;

  // clean up
  if (buf1_ptr) free(buf1_ptr);
  if (buf2_ptr) free(buf2_ptr);
  if (buf3_ptr) free(buf3_ptr);
  if (buf4_ptr) free(buf4_ptr);

  return 0;
}
Asked Last updated
.
Post not yet marked as solved
322 Views

Floating point exception trapping on M1

I have written a simple C++ test program (below) that takes the square root of a negative number and then tries to print it. I would like to trap the floating-point exception caused by taking the square root of a negative number (e.g., I'd like the program to halt with an error after the floating-point exception). On Intel Macs I know how to do this. Is this possible on an Apple Silicon Mac?

#include <cmath>
#include <iostream>

int main() {
  const double x = -1.0;
  double y = x;
  y = sqrt(y); // floating point exception...possible to build the program so it terminates here?
  std::cout << y << "\n";
  return 0;
}
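A possible direction, sketched below: on Apple Silicon the C library does not provide feenableexcept(), but the FPCR trap-enable bits can be set through the environment returned by fegetenv(). This is a minimal sketch, assuming the __fpcr field and the __fpcr_trap_invalid macro exposed by macOS's arm64 <fenv.h>; with the trap enabled, the process should be terminated by a signal at the invalid operation instead of continuing with a NaN.

#include <fenv.h>
#include <math.h>
#include <stdio.h>

/* Enable trapping of invalid floating-point operations on Apple Silicon.
   Assumes macOS's arm64 fenv_t exposes a __fpcr field and that <fenv.h>
   defines __fpcr_trap_invalid; both are assumptions to verify locally. */
static void enable_invalid_trap(void) {
#if defined(__arm64__)
    fenv_t env;
    fegetenv(&env);
    env.__fpcr |= __fpcr_trap_invalid;  /* trap instead of producing a quiet NaN */
    fesetenv(&env);
#endif
}

int main(void) {
    enable_invalid_trap();
    double y = sqrt(-1.0);  /* with the trap enabled, the program should stop here */
    printf("%f\n", y);
    return 0;
}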
Asked Last updated
.
Post not yet marked as solved
855 Views

Tensorflow acceleration on macOS

Would it be possible to use GPU acceleration when training a TensorFlow model on macOS? And how is the performance when training the same model on an Apple silicon platform?
Asked Last updated
.
Post not yet marked as solved
274 Views

vDSP.convolve incorrectly reverses kernel?

vDSP.convolve() reverses the kernel before applying it. For example, the following uses a kernel of 10 elements where the first element is 1.0 and the rest of the elements are 0.0. Applying this kernel to a vector should return the same vector.

let values = (0 ..< 30).map { Double($0) }
var kernel = Array.init(repeating: 0.0, count: 10)
kernel[0] = 1.0
let result = vDSP.convolve(values, withKernel: kernel)
print("kernel: \(kernel)")
print("values: \(values)")
print("result: \(result)")

Applied to a values array containing elements 0.0, 1.0, 2.0, etc., the first results should be 0.0, 1.0, 2.0, etc., but instead the results start at 9.0 and increase from there:

kernel: [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
values: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0]
result: [9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0]

If instead the kernel is reversed, placing the 1.0 at the end of the kernel:

let values = (0 ..< 30).map { Double($0) }
var kernel = Array.init(repeating: 0.0, count: 10)
kernel[9] = 1.0
let result = vDSP.convolve(values, withKernel: kernel)
print("kernel: \(kernel)")
print("values: \(values)")
print("result: \(result)")

The results are now correct:

kernel: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
values: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0]
result: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]
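For comparison, the same distinction is documented for the C-level routine vDSP_conv, which the Swift vDSP.convolve appears to wrap: a positive filter stride performs correlation (kernel applied as given), while a negative filter stride, with the filter pointer aimed at its last element, performs convolution (kernel reversed). A minimal sketch, using the same 1.0-followed-by-zeros kernel as above:

#include <Accelerate/Accelerate.h>
#include <stdio.h>

enum { P = 10, N = 20 };   /* kernel length and output length */

int main(void) {
    float signal[N + P - 1];          /* input must hold N + P - 1 samples */
    float kernel[P] = { 1.0f };       /* 1.0 followed by nine zeros */
    float correlated[N], convolved[N];

    for (int i = 0; i < N + P - 1; i++) signal[i] = (float)i;

    /* Correlation: positive filter stride, kernel used in the order given. */
    vDSP_conv(signal, 1, kernel, 1, correlated, 1, N, P);

    /* Convolution: negative filter stride, pointer at the kernel's last element. */
    vDSP_conv(signal, 1, kernel + P - 1, -1, convolved, 1, N, P);

    /* With this kernel, correlated[0] should be 0.0 and convolved[0] should be
       9.0, matching the offset observed with vDSP.convolve above. */
    printf("correlated[0] = %f, convolved[0] = %f\n", correlated[0], convolved[0]);
    return 0;
}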
Asked
by jolonf.
Last updated
.
Post not yet marked as solved
257 Views

vDSP.convolve returns wrong sized array?

vDSP.convolve() returns an array with length values.count - kernel.count, but shouldn't the result array have length values.count - kernel.count + 1? I ran the following, which prints the size of the results array for various combinations of values and kernel lengths:

for i in 0 ..< 10 {
  let values = Array.init(repeating: 1.0, count: 1000 + i)
  for j in 0 ..< 10 {
    let kernel = Array.init(repeating: 1.0, count: 100 + j)
    let result = vDSP.convolve(values, withKernel: kernel)
    print("values[\(values.count)], kernel[\(kernel.count)], result[\(result.count)], result[\(result.count - 1)] = \(result[result.count - 1])")
  }
}

As you can see, the results array always has length values.count - kernel.count:

values[1000], kernel[100], result[900], result[899] = 100.0
values[1000], kernel[101], result[899], result[898] = 101.0
values[1000], kernel[102], result[898], result[897] = 102.0
values[1000], kernel[103], result[897], result[896] = 103.0
values[1000], kernel[104], result[896], result[895] = 104.0
values[1000], kernel[105], result[895], result[894] = 105.0
values[1000], kernel[106], result[894], result[893] = 106.0
values[1000], kernel[107], result[893], result[892] = 107.0
values[1000], kernel[108], result[892], result[891] = 108.0
values[1000], kernel[109], result[891], result[890] = 109.0
values[1001], kernel[100], result[901], result[900] = 100.0
values[1001], kernel[101], result[900], result[899] = 101.0
values[1001], kernel[102], result[899], result[898] = 102.0
values[1001], kernel[103], result[898], result[897] = 103.0
values[1001], kernel[104], result[897], result[896] = 104.0
values[1001], kernel[105], result[896], result[895] = 105.0
...

However, the result array should have length values.count - kernel.count + 1. For example, if instead of using the returned result array, a result array of length values.count - kernel.count + 1 is passed to vDSP.convolve, the last value holds a valid result:

for i in 0 ..< 10 {
  let values = Array.init(repeating: 1.0, count: 1000 + i)
  for j in 0 ..< 10 {
    let kernel = Array.init(repeating: 1.0, count: 100 + j)
    var result = Array.init(repeating: 0.0, count: values.count - kernel.count + 1)
    vDSP.convolve(values, withKernel: kernel, result: &result)
    print("values[\(values.count)], kernel[\(kernel.count)], result[\(result.count)], result[\(result.count - 1)] = \(result[result.count - 1])")
  }
}

values[1000], kernel[100], result[901], result[900] = 100.0
values[1000], kernel[101], result[900], result[899] = 101.0
values[1000], kernel[102], result[899], result[898] = 102.0
values[1000], kernel[103], result[898], result[897] = 103.0
values[1000], kernel[104], result[897], result[896] = 104.0
values[1000], kernel[105], result[896], result[895] = 105.0
values[1000], kernel[106], result[895], result[894] = 106.0
values[1000], kernel[107], result[894], result[893] = 107.0
values[1000], kernel[108], result[893], result[892] = 108.0
values[1000], kernel[109], result[892], result[891] = 109.0
values[1001], kernel[100], result[902], result[901] = 100.0
values[1001], kernel[101], result[901], result[900] = 101.0
values[1001], kernel[102], result[900], result[899] = 102.0
values[1001], kernel[103], result[899], result[898] = 103.0
values[1001], kernel[104], result[898], result[897] = 104.0
values[1001], kernel[105], result[897], result[896] = 105.0

If the result array is created with length values.count - kernel.count + 2, we get the following runtime error:

error: Execution was interrupted, reason: EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0).
The process has been left at the point where it was interrupted, use "thread return -x" to return to the state before expression evaluation.

This indicates that the extra element in the result array is valid and that vDSP.convolve() is returning a result array which is one element too short.
Asked
by jolonf.
Last updated
.
Post not yet marked as solved
401 Views

Xcode SSE problem

I was doing SSE performance work on an Intel Mac. I found that the SSE4.1 build's performance under Xcode 12.4 is not as good as under Xcode 10.1, so I checked the assembly generated from my code. A single _mm_mul_epi32() call was translated into three pmuludq instructions (pmuludq is an SSE2 instruction). This was normal when compiling with Xcode 10.1, where _mm_mul_epi32() was translated into pmuldq. Does anyone know how to fix this issue?
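For context, a minimal standalone snippet of the intrinsic in question (assuming the post refers to _mm_mul_epi32 from <smmintrin.h>, since that is the intrinsic that normally lowers to pmuldq); compiling it with SSE4.1 enabled, e.g. clang -O2 -msse4.1, and inspecting the assembly is a quick way to check which instruction a given compiler emits:

#include <smmintrin.h>   /* SSE4.1 intrinsics; Intel Macs only */
#include <stdio.h>

int main(void) {
    /* _mm_mul_epi32 multiplies the low signed 32-bit element of each 64-bit
       lane, producing two signed 64-bit products; with SSE4.1 enabled it is
       expected to compile to a single pmuldq. */
    __m128i a = _mm_set_epi32(0, -3, 0, 7);
    __m128i b = _mm_set_epi32(0,  5, 0, 9);
    __m128i p = _mm_mul_epi32(a, b);

    long long out[2];
    _mm_storeu_si128((__m128i *)out, p);
    printf("%lld %lld\n", out[0], out[1]);   /* expected: 63 -15 */
    return 0;
}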
Asked
by wadewang.
Last updated
.
Post not yet marked as solved
1.3k Views

Why? eGPU support and Big Sur

If you are going to sell your customers on eGPUs by Blackmagic in store, why would you not plan on integrating said eGPU processors in your update to Big Sur? I spent over $1000 on what is now a useless brick of a processing unit. The only solution that Blackmagic and Apple have to offer is to revert my system to Catalina and Studio 16. Big Sur has been out how long now, and still no support for items you are schlepping in store, Apple? Shame on you. Get the fix quick. Your business practices are showing, and I find them offensively grotesque. Keep nickel-and-diming your customers into oblivion as your quality degrades. I haven't considered going back to Windows in decades, but now I am busting out the user manuals. Abandon ship. Apple, what a letdown you have become. You used to be the pride and joy of design. Now you are the bane and boon.
Asked Last updated
.
Post not yet marked as solved
603 Views

How can I perform audio noise reduction like the Voice Memos app?

Recently, the Voice Memos app from Apple got a new feature: a magic wand that performs noise reduction. This noise reduction seems to process live while the recorded audio is playing, since it doesn't pause the played audio. In the Apple documentation there is a single reference to noise reduction (https://developer.apple.com/documentation/accelerate/signal_extraction_from_noise), which works by performing a discrete cosine transform (https://en.wikipedia.org/wiki/Discrete_cosine_transform), removing the unwanted frequencies below a threshold, and then performing the inverse transform. My question is: is this a viable approach for live processing? If yes, how can I perform it? By calling installTap, or maybe by creating a custom AudioUnit?
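For reference, a minimal offline sketch of the technique the linked article describes: forward DCT, zero the coefficients whose magnitude falls below a threshold, then inverse DCT. The transform length, threshold, and final rescaling here are illustrative assumptions, and this sketch says nothing about whether the approach keeps up in real time; that would need to be measured inside an installTap block or a custom AudioUnit.

#include <Accelerate/Accelerate.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024   /* vDSP DCT lengths must be of the form f * 2^n (f = 1, 3, 5, or 15) */

int main(void) {
    static float noisy[N], coeffs[N], cleaned[N];

    /* Synthetic input: a sine wave plus small pseudo-random noise. */
    for (int i = 0; i < N; i++)
        noisy[i] = sinf(2.0f * (float)M_PI * 8.0f * i / N)
                 + 0.05f * ((float)arc4random_uniform(1000) / 500.0f - 1.0f);

    vDSP_DFT_Setup forward = vDSP_DCT_CreateSetup(NULL, N, vDSP_DCT_II);
    vDSP_DFT_Setup inverse = vDSP_DCT_CreateSetup(NULL, N, vDSP_DCT_III);
    if (!forward || !inverse) return 1;

    /* Forward DCT, then zero low-magnitude coefficients (assumed to be noise). */
    vDSP_DCT_Execute(forward, noisy, coeffs);
    const float threshold = 4.0f;
    for (int i = 0; i < N; i++)
        if (fabsf(coeffs[i]) < threshold) coeffs[i] = 0.0f;

    /* Inverse DCT and rescale; 2/N matches the usual unnormalized DCT-II/III
       pair, but the exact factor should be checked against the vDSP docs. */
    vDSP_DCT_Execute(inverse, coeffs, cleaned);
    float scale = 2.0f / N;
    vDSP_vsmul(cleaned, 1, &scale, cleaned, 1, N);

    printf("cleaned[0] = %f\n", cleaned[0]);
    vDSP_DFT_DestroySetup(forward);
    vDSP_DFT_DestroySetup(inverse);
    return 0;
}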
Asked
by miguelfs.
Last updated
.
Post not yet marked as solved
830 Views

Accelerate framework uses only one core on Mac M1

The following C program (dgesv_ex.c)

#include <stdlib.h>
#include <stdio.h>

/* DGESV prototype */
extern void dgesv( int* n, int* nrhs, double* a, int* lda, int* ipiv,
                   double* b, int* ldb, int* info );

/* Main program */
int main() {
    /* Locals */
    int n = 10000, info;
    /* Local arrays */
    /* Initialization */
    double *a = malloc(n*n*sizeof(double));
    double *b = malloc(n*n*sizeof(double));
    int *ipiv = malloc(n*sizeof(int));
    for (int i = 0; i < n*n; i++) {
        a[i] = ((double) rand()) / ((double) RAND_MAX) - 0.5;
    }
    for (int i = 0; i < n*n; i++) {
        b[i] = ((double) rand()) / ((double) RAND_MAX) - 0.5;
    }
    /* Solve the equations A*X = B */
    dgesv( &n, &n, a, &n, ipiv, b, &n, &info );
    free(a);
    free(b);
    free(ipiv);
    exit( 0 );
}
/* End of DGESV Example */

compiled on a Mac mini M1 with the command

clang -o dgesv_ex dgesv_ex.c -framework accelerate

uses only one core of the processor (as also shown by Activity Monitor):

me@macmini-M1 ~ % time ./dgesv_ex
./dgesv_ex  35,54s user 0,27s system 100% cpu 35,758 total

I checked that the binary is of the right type:

me@macmini-M1 ~ % lipo -info dgesv
Non-fat file: dgesv is architecture: arm64

As a comparison, on my Intel MacBook Pro I get the following output:

me@macbook-intel ~ % time ./dgesv_ex
./dgesv_ex  142.69s user 0,51s system 718% cpu 19.925 total

Is this a known problem? Maybe a compilation flag or something else?
Asked
by mottelet.
Last updated
.
Post not yet marked as solved
1.2k Views

DFT vs FFT in Accelerate Framework vDSP

Hi all, I'm implementing an FFT using the Accelerate framework's vDSP functions. I noticed the comment on the FFT that says "DFT should be used instead where possible". Does anyone know the reasoning for this? Traditionally, the FFT is the fast implementation of the DFT. Just wondering whether I should use the DFT functions if they are faster than the FFT. Cheers!
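For reference, a minimal sketch of the DFT interface the vDSP headers point to (vDSP_DFT_zop_CreateSetup / vDSP_DFT_Execute); the length and input used here are illustrative. The recommendation in the headers is about the newer, more flexible setup API rather than the underlying math: the DFT routines are free to pick a fast FFT-class algorithm internally where one exists.

#include <Accelerate/Accelerate.h>
#include <math.h>
#include <stdio.h>

#define N 1024   /* supported lengths are f * 2^n for small odd f */

int main(void) {
    static float inReal[N], inImag[N], outReal[N], outImag[N];

    /* Simple test input: a single real-valued cosine at bin 16. */
    for (int i = 0; i < N; i++) {
        inReal[i] = cosf(2.0f * (float)M_PI * 16.0f * i / N);
        inImag[i] = 0.0f;
    }

    /* Create a forward complex-to-complex DFT setup and execute it. */
    vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(NULL, N, vDSP_DFT_FORWARD);
    if (!setup) return 1;
    vDSP_DFT_Execute(setup, inReal, inImag, outReal, outImag);

    /* Bin 16 should dominate the spectrum for the cosine above. */
    printf("bin 16 magnitude: %f\n", hypotf(outReal[16], outImag[16]));

    vDSP_DFT_DestroySetup(setup);
    return 0;
}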
Asked
by vesap.
Last updated
.