Metal Performance Shaders


Optimize graphics and compute performance with kernels that are fine-tuned for the unique characteristics of each Metal GPU family using Metal Performance Shaders.

Metal Performance Shaders Documentation

Posts under Metal Performance Shaders tag

46 Posts
Post not yet marked as solved
1 Reply
150 Views
Hello, I am continuing to test Metal and at the moment I get "internal error 0000000e". I applied the recommendations from the WWDC20 session "Debug GPU-side errors in Metal", but that neither changed the error nor gave me a more detailed description.

I use a compute command encoder to calculate 8-bit Pearson hashes over a UInt array with 256^3 elements. The Metal code does the following:
(1) Calculate a Pearson hash over the three given (sorted) UInts as hashA (for example, 0,0,0 is the first input, 0,0,1 is the second, and so on).
(2) Calculate a new Pearson hash over these three UInts and hashA, giving hashB.
(3) Calculate another Pearson hash over the three UInts, hashA, and hashB to get my final hash.
(4) In three nested loops, repeat steps 1 to 3 for all candidate values to look for hash collisions, and add 1 to a counter whenever a collision is found that is not my three UInts from step 1. This counter is the result I want.

The algorithm is not elegant, but its Swift implementation works both single-core and multi-core. Metal, however, leaves the function in the middle of the work, and I see some side effects. For example, the following two pieces of source code differ only in the type of the loop variables and in the hard-coded array length (second snippet).

The first snippet starts with the internal error 0000000e message and begins writing 99 into the result float array of length 256^3. Somewhere during the run it crashes. When I run the program several times, the crash is not at the same position in the loop. On the first run today, the entry at offset 596891 was not set, the next one was set, the one after that was not, and later entries were not set. On the second run, 99 was written up to offset 14700544 but not from offset 14700545 to the end. On the third run, the last entry written was at offset 708541; everything after it, and some entries before it, were not set.

    for (int maybe0 = 0; maybe0 <= 255 && runAgain; maybe0++) {
        for (int maybe1 = 0; maybe1 <= 255 && runAgain; maybe1++) {
            for (int maybe2 = 0; maybe2 <= 255 && runAgain; maybe2++) {
                const uchar baseForHashA[] = {
                    (uchar) maybe0,
                    (uchar) maybe1,
                    (uchar) maybe2
                };
                const int anzahlElementeMaybeA = *(&baseForHashA+1)-baseForHashA;
                uchar maybeHashA = 0;
                for (int i=0; i<anzahlElementeMaybeA; i++)  {
                    int locationA = (maybeHashA ^ baseForHashA[i]);
                    maybeHashA = WIKIPEDIA_EN_TABLE[locationA];
                }
                if (maybeHashA == hashA) {
                    const uchar baseForHashB[] = {
                        (uchar) maybe0,
                        maybeHashA,
                        (uchar) maybe1,
                        (uchar) maybe2
                    };
                    const int anzahlElementeMaybeB = *(&baseForHashB+1)-baseForHashB;
                    uchar maybeHashB = 0;
                    for (int i=0; i<anzahlElementeMaybeB; i++)  {
                        int locationB = (maybeHashB ^ baseForHashB[i]);
                        maybeHashB = WIKIPEDIA_EN_TABLE[locationB];
                    }
                    const int left = maybeHashB;
                    const int right = hashB;
                    if (left == right) {
                        result [index] = offset +99;
                        return;
                        offset += 1;
                        if (
                            (uchar) arr1[index] == maybe0 &&
                            (uchar) arr2[index] == maybe1 &&
                            (uchar) arr3[index] == maybe2
                            ){
                                result[index] = offset;
                                runAgain = false;
                        }
                    }
                }
            }
        }
    }

The second snippet writes 99 to the complete result float array of length 256^3:

    for (uchar maybe0 = 0; maybe0 <= 255 && runAgain; maybe0++) {
        for (uchar maybe1 = 0; maybe1 <= 255 && runAgain; maybe1++) {
            for (uchar maybe2 = 0; maybe2 <= 255 && runAgain; maybe2++) {
                const uchar baseForHashA[] = {
                    maybe0,
                    maybe1,
                    maybe2
                };
                uchar maybeHashA = 0;
                for (int i=0; i<3; i++)  {
                    int locationA = (maybeHashA ^ baseForHashA[i]);
                    maybeHashA = WIKIPEDIA_EN_TABLE[locationA];
                }
                if (maybeHashA == hashA) {
                    const uchar baseForHashB[] = {
                        maybe0,
                        maybeHashA,
                        maybe1,
                        maybe2
                    };
                    uchar maybeHashB = 0;
                    for (int i=0; i<4; i++)  {
                        int locationB = (maybeHashB ^ baseForHashB[i]);
                        maybeHashB = WIKIPEDIA_EN_TABLE[locationB];
                    }
                    const int left = maybeHashB;
                    const int right = hashB;
                    if (left == right) {
                        result [index] = offset +99;
                        return;
                        offset += 1;
                        if (
                            arr1[index] == maybe0 &&
                            arr2[index] == maybe1 &&
                            arr3[index] == maybe2
                            ){
                                result[index] = offset;
                                runAgain = false;
                        }
                    }
                }
            }
        }
    }

I only need 6 more lines of running Metal code. Help! Thanks.
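One detail in the second snippet may be worth checking: a `uchar` counter can never exceed 255, so a condition like `maybe0 <= 255` is always true and the counter wraps back to 0 after 255. The sketch below is a CPU-side illustration in Python, not Metal code. It models the 8-bit counter with `& 0xFF` and includes a small reference Pearson hash that could be used to cross-check GPU results. `TABLE` is a hypothetical seeded permutation standing in for `WIKIPEDIA_EN_TABLE`, which the post does not show, and all function names are made up for illustration.

```python
import random

# Hypothetical stand-in for WIKIPEDIA_EN_TABLE (not shown in the post):
# any permutation of 0..255 can serve as a Pearson lookup table.
_rng = random.Random(0)
TABLE = list(range(256))
_rng.shuffle(TABLE)


def pearson_hash(data, table=TABLE):
    """8-bit Pearson hash: h = table[h ^ byte] for every input byte."""
    h = 0
    for byte in data:
        h = table[h ^ (byte & 0xFF)]
    return h


def chained_hashes(b0, b1, b2):
    """Mirror the post's first two steps: hashA over 3 bytes, hashB over 4."""
    hash_a = pearson_hash([b0, b1, b2])
    hash_b = pearson_hash([b0, hash_a, b1, b2])
    return hash_a, hash_b


def count_iterations_8bit(limit=1000):
    """Simulate `for (uchar i = 0; i <= 255; i++)` with an iteration cap.

    The counter wraps from 255 back to 0, so the condition never fails;
    only the cap (a safety net for this demo) stops the loop.
    """
    i = 0
    iterations = 0
    while i <= 255 and iterations < limit:
        iterations += 1
        i = (i + 1) & 0xFF  # 8-bit wraparound, like uchar arithmetic
    return iterations


def count_iterations_int():
    """Same loop with a wider counter: stops after exactly 256 passes."""
    i = 0
    iterations = 0
    while i <= 255:
        iterations += 1
        i += 1
    return iterations


if __name__ == "__main__":
    print(chained_hashes(0, 0, 1))
    print(count_iterations_int())       # 256
    print(count_iterations_8bit(1000))  # 1000: only the cap ends the loop
```

With `uchar` counters, the loops in the second snippet can therefore end only through `runAgain` or the early `return`. This may be unrelated to the internal error itself, since the post reports the `int` variant as the one that fails.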
Posted by Bastie.
Post not yet marked as solved
2 Replies
291 Views
I'm training a basic model using an M1 MBA with tensorflow-metal 0.7.0 and tensorflow-macos 2.11 installed, using Python 3.10 on macOS 13.2.1. CPU-based training runs as expected with about 10 s/epoch on this model. However, GPU-based training is orders of magnitude slower and doesn't learn.

Here's a model to generate Irish poetry, based upon the example https://github.com/susanli2016/Natural-Language-Processing-in-TensorFlow/blob/master/Irish%20Lyrics%20generated%20poetry.ipynb. CPU training on this dataset takes 10 s/epoch. The ETA with GPU training with a batch size of 32 is over 2.5 hours, and many minutes for a batch size of 2048, and 20 s for a batch size of the length of the training data. Furthermore, GPU training does not work: there is no increase in accuracy.

    import numpy as np
    import os
    import platform
    import subprocess
    import tensorflow as tf
    from textwrap import wrap
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.optimizers import Adam
    from tensorflow.keras import regularizers
    import tensorflow.keras.utils as ku
    from tensorflow.python.framework.ops import disable_eager_execution, enable_eager_execution

    # use the GPU
    disable_eager_execution()

    irish_lyrics_file = '/tmp/irish-lyrics-eof.txt'
    irish_lyrics_url = 'https://raw.githubusercontent.com/AliAkbarBadri/nlp-tf/master/irish-lyrics-eof.txt'
    if not os.path.isfile(irish_lyrics_file):
        subprocess.run(["curl", "-L", irish_lyrics_url, "-o", irish_lyrics_file])
    with open(irish_lyrics_file, 'r') as fd:
        data = fd.read()
    corpus = data.lower().split('\n')

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1

    # create input sequences using list of tokens
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[: i+1]
            input_sequences.append(n_gram_sequence)

    # pad sequences
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen = max_sequence_len, padding='pre'))

    # Create predictors and label
    xs, labels = input_sequences[:, :-1], input_sequences[:,-1]
    ys = ku.to_categorical(labels, num_classes=total_words)
    xs = tf.convert_to_tensor(xs)
    ys = tf.convert_to_tensor(ys)

    model = Sequential()
    model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
    model.add(Bidirectional(LSTM(150)))
    model.add(Dense(total_words, activation='softmax'))
    adam = Adam(learning_rate=0.01)
    model.compile(loss='categorical_crossentropy', optimizer=adam, metrics = ['accuracy'])

    batch_size = 32
    steps_per_epoch = int(np.ceil(xs.shape[0]/batch_size))
    history = model.fit(xs, ys, epochs=100, batch_size=batch_size, steps_per_epoch=steps_per_epoch, verbose=1)
    ku.plot_model(model, show_shapes=True)
    model.summary()

    import matplotlib.pyplot as plt

    def plot_graphs(history, string):
        plt.plot(history.history[string])
        plt.xlabel('Epochs')
        plt.ylabel(string)
        plt.show()

    plot_graphs(history, 'accuracy');

    index_word_dict = {index: word for word, index in tokenizer.word_index.items()}
    seed_text = 'A poor emigrants daughter'
    next_words = 100
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1).item()
        if predicted in index_word_dict:
            seed_text += ' ' + index_word_dict[predicted]
    print('\n'.join(wrap(seed_text)))
Posted by essandess.
Post not yet marked as solved
2 Replies
308 Views
Xcode GPU Frame Capture shows that the "PreZ Test Fails" percentage is zero, and I can't understand what is wrong. I drew opaque primitives with depth testing enabled, no alpha test, and no alpha blending. My assumption is that Hidden Surface Removal removes the hidden surfaces, so no fragments are left to be killed by "PreZ Test Kill". But I couldn't find a column for Hidden Surface Removal; it looks like Xcode GPU Frame Capture doesn't show any data about it. I tested on an iPhone 13 mini (iOS 16.3) and on an M1 Mac running Ventura.
Post marked as solved
2 Replies
557 Views
I've got the following code that attempts to use the MPSImageScale shader to flip and convert a texture:

    /// mtlDevice and mtlCommandBuffer are obtained earlier
    /// srcTex and dstTex are valid and existing MTLTexture objects with the same descriptors
    MPSScaleTransform scale{};
    scale.scaleX = 1;
    scale.scaleY = -1;
    auto scaleShader = [[MPSImageScale alloc] initWithDevice:mtlDevice];
    if ( scaleShader == nil ) {
        return ErrorType::OUT_OF_MEMORY;
    }
    scaleShader.scaleTransform = &scale;
    [scaleShader encodeToCommandBuffer:mtlCommandBuffer sourceTexture:srcTex destinationTexture:dstTex];

No matter what I do, I keep getting EXC_BAD_ACCESS on the last line, with the assembly stopping just before endEncoding:

       0x7ff81492d804 <+1078>: callq 0x7ff81492ce5b ; ___lldb_unnamed_symbol373$$MPSImage
    -> 0x7ff81492d809 <+1083>: movq -0x98(%rbp), %rdi
       0x7ff81492d810 <+1090>: movq 0x3756f991(%rip), %rsi ; "endEncoding"

All Metal objects are valid, and I did everything I could to make sure they are not the culprits, including making sure that the pixel format of both textures is the same, even though this is not required for MPS shaders. What am I missing?
Posted by BartW.
Post not yet marked as solved
0 Replies
499 Views
I'm working on an audio processing app and am creating an AVAudioUnit extension as a part of it. I need to train a small neural network in the app and use it to process audio in real-time in the AudioUnit. The network is mostly convolutions and is ideal for running on the GPU but it should run in real-time on the CPU. The problem that I'm currently facing is that none of the ML frameworks seem to be safe to use for inference within custom AVAudioUnit kernels. My understanding is that only C and C++ should be used in these kernels (in addition to the other rules of real-time computing). Objective-C and Swift are discouraged per the documentation. My background is primarily in ML so I'm newer to Apple development and especially new to real-time development in this ecosystem. I've investigated CoreML, MPS, BNNS/Accelerate, and MLCompute so far but I'm not certain that any of them are safe to use. Any feedback would be greatly appreciated!
Posted by zimmerk4.
Post not yet marked as solved
0 Replies
509 Views
Hi everyone, I get this error report when I try to render a particular type of project in Blender. I have a MacBook Pro M1 Pro (10 CPU / 16 GPU cores) running macOS Ventura, and I have used Blender 2.9, 3.3, 3.4, and 3.5. I use adaptive subdivision on many meshes, together with 8K textures, displacement, normal maps, and so on. The render fails only when I render on the GPU with Metal (experimental); CPU rendering is OK, but takes much longer. I tried to fix it by installing the latest version of macOS, but that did not fix it. I can crash the render with just one mesh: adaptive subdivision, then render on the Metal GPU. Can anyone help me? (Sorry for my bad English.)
Posted by Goliath.
Post not yet marked as solved
2 Replies
394 Views
Hello! I'm a long-standing user of the MPSCNN framework. It usually works fine, but while implementing one of my recent networks I started to get this error:

    2023-01-06 00:17:46.017908+0600 -[44642:879994] [GPUDebug] Invalid device load executing kernel function "cnnConvWinograd_8x8_3x3_32x32_256" encoder: "", dispatch: 0, at offset 120

Also, strangely, the network produces different results for the same inputs across multiple runs. I assume some race condition inside is causing that. Is it possible to somehow force MPSCNN to use a different implementation of convolution? Or am I stuck with it forever?
Posted by s1ddok.
Post not yet marked as solved
4 Replies
931 Views
Hello all, I have code in CUDA where I can create several CUDA streams, run my kernels in parallel, and get a performance boost for my task. I then rewrote the code for Metal and am trying to parallelize the task in the same way.

(Image: CUDA Streams)

Metal device: Mac Studio with M1 Ultra (the code is written with metal-cpp). I create several MTLCommandBuffers in one MTLCommandQueue, or several MTLCommandQueues with more MTLCommandBuffers. Regarding Metal resources, there are two options:

1. The buffers (MTLBuffer) are created with the option MTLResourceStorageModeShared. In the profiler, all command buffers are executed sequentially on the Compute timeline.
2. The buffers (MTLBuffer) are created with the options "MTLResourceStorageModeShared | MTLResourceHazardTrackingModeUntracked". In the profiler, I do see parallelism, but the maximum number of threads in the Compute timeline is never more than 2 (see pictures), which is also odd. The compute commands do not depend on each other.

(Image: METAL Compute timeline)

About performance: [1] In the first variant, the performance is the same for different numbers of MTLCommandQueues and MTLCommandBuffers. [2] In the second variant, the performance with one MTLCommandBuffer is better than with 2 or more.

Question: why is this happening? How can I parallelize the compute kernels to get a performance increase?

Additional information: the CUDA code has also been rewritten in OpenCL, and it parallelizes perfectly on Windows (NVIDIA/AMD/Intel) when several OpenCL queues are running. The same code on the M1 Ultra performs the same with one or with many OpenCL queues. Metal, in turn, is faster than OpenCL, so I am trying to figure out Metal specifically and make the kernels run in parallel on Metal.
Posted by abdyla_v.
Post marked as solved
2 Replies
604 Views
Hello all, I have code in CUDA where I can create several CUDA streams, run my kernels in parallel, and get a performance boost for my task. I then rewrote the code for Metal and am trying to parallelize the task in the same way. But I ran into a problem: for some reason, all the kernels on the Compute timeline are always executed sequentially. I tried creating several MTLCommandBuffers in one MTLCommandQueue. I also created several MTLCommandQueues with more MTLCommandBuffers, and I tried using several CPU threads. The result is always the same: in the profiler, I always see the command buffers executing in order. Screenshots from the profilers for CUDA and Metal are below.

(Images: CUDA Profiler, Metal Profiler)

I even created a simple kernel that sums some numbers and ran it with dispatchThreads((1,1,1),(1,1,1)), and I still cannot get these kernels to run in parallel. Can anyone help me? Is there a solution, or is this specific to how Metal works on M1?
Posted by abdyla_v.
Post not yet marked as solved
0 Replies
399 Views
I don't know if I'm going to get an answer to this, but basically I get different mean values when running the MPSImageStatisticsMeanAndVariance shader than when I use CUDA nppiMeanStdDev, custom OpenCL and shader code, or CPU code. The difference for the mean is significant. For a sample image I get:

    { 0.36, 0.30, 0.22 } // MPS
    { 0.55, 0.43, 0.21 } // Any other method

The deviation/variance is slightly different as well (though much less so), but that might be due to the difference in the mean value. My first guess is that MTLTexture transforms the underlying data somehow (sRGB -> linear?) and the mean is calculated from the transformed data instead of the original data. But maybe there's something else going on that I'm missing? How can I achieve parity between Metal and the other methods? Any assistance would be appreciated.
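One way to probe the sRGB guess offline is to decode the stored channel values with the standard sRGB transfer function and compare the two means. The sketch below does that in plain Python. The sample pixels are made up (they are not the image from the post), and the helper names are hypothetical. Whether MPS actually linearizes the data is an assumption here; it would depend, among other things, on whether the texture uses an sRGB pixel format.

```python
def srgb_to_linear(c):
    """Decode one sRGB-encoded channel value in [0, 1] to linear light."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def channel_means(pixels):
    """Per-channel mean of a list of (r, g, b) tuples."""
    return tuple(mean(channel) for channel in zip(*pixels))


# Made-up sample pixels (sRGB-encoded, range 0..1); not the post's image.
SAMPLE = [
    (0.90, 0.70, 0.30),
    (0.50, 0.40, 0.20),
    (0.25, 0.20, 0.10),
    (0.55, 0.42, 0.24),
]

if __name__ == "__main__":
    encoded = channel_means(SAMPLE)
    decoded = channel_means(
        [tuple(srgb_to_linear(c) for c in px) for px in SAMPLE]
    )
    print("mean of stored values :", encoded)
    print("mean of decoded values:", decoded)
```

Decoding lowers every value strictly between 0 and 1, so a linearized mean comes out smaller in each channel. The blue channel in the post is slightly higher with MPS (0.22 versus 0.21), so an sRGB conversion alone may not account for the whole difference.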
Posted by BartW.
Post not yet marked as solved
1 Replies
520 Views
I am drawing stuff onto an off-screen MTLTexture (using a Skia canvas). At a later point, I want to render this MTLTexture into a CAMetalLayer to display it on the screen. Since I was using Skia for the off-screen drawing operations, my code is quite simple and I don't have the typical Metal setup (no MTLLibrary, MTLRenderPipelineDescriptor, MTLRenderPassDescriptor, MTLRenderEncoder, etc.). I now simply want to draw that MTLTexture into a CAMetalLayer, but haven't figured out how to do so simply.

This is where I draw my stuff to the MTLTexture _texture (Skia code):

    - (void)renderNewFrameToCanvas:(Frame)frame {
      if (_skContext == nullptr) {
        GrContextOptions grContextOptions;
        _skContext = GrDirectContext::MakeMetal((__bridge void*)_device,
                                                // TODO: Use separate command queue for this context?
                                                (__bridge void*)_commandQueue,
                                                grContextOptions);
      }
      @autoreleasepool {
        // Lock Mutex to block the runLoop from overwriting the _texture
        std::lock_guard lockGuard(_textureMutex);
        auto texture = _texture;

        // Get & Lock the writeable Texture from the Metal Drawable
        GrMtlTextureInfo fbInfo;
        fbInfo.fTexture.retain((__bridge void*)texture);
        GrBackendRenderTarget backendRT(texture.width, texture.height, 1, fbInfo);

        // Create a Skia Surface from the writable Texture
        auto skSurface = SkSurface::MakeFromBackendRenderTarget(_skContext.get(),
                                                                backendRT,
                                                                kTopLeft_GrSurfaceOrigin,
                                                                kBGRA_8888_SkColorType,
                                                                nullptr,
                                                                nullptr);
        auto canvas = skSurface->getCanvas();
        auto surface = canvas->getSurface();

        // Clear anything that's currently on the Texture
        canvas->clear(SkColors::kBlack);

        // Converts the Frame to an SkImage - RGB.
        auto image = SkImageHelpers::convertFrameToSkImage(_skContext.get(), frame);
        canvas->drawImage(image, 0, 0);

        // Flush all appended operations on the canvas and commit it to the SkSurface
        canvas->flush();

        // TODO: Do I need to commit?
        /*
        id<MTLCommandBuffer> commandBuffer([_commandQueue commandBuffer]);
        [commandBuffer commit];
        */
      }
    }

Now, since I have the MTLTexture _texture in memory, I want to draw it to the CAMetalLayer _layer. This is what I have so far:

    - (void)setup {
      // I set up a runLoop that calls render() 60 times a second.
      // [removed to simplify]

      _renderPassDescriptor = [[MTLRenderPassDescriptor alloc] init];

      // Load the compiled Metal shader (PassThrough.metal)
      auto baseBundle = [NSBundle mainBundle];
      auto resourceBundleUrl = [baseBundle URLForResource:@"VisionCamera" withExtension:@"bundle"];
      auto resourceBundle = [[NSBundle alloc] initWithURL:resourceBundleUrl];
      auto shaderLibraryUrl = [resourceBundle URLForResource:@"PassThrough" withExtension:@"metallib"];
      id<MTLLibrary> defaultLibrary = [_device newLibraryWithURL:shaderLibraryUrl error:nil];
      id<MTLFunction> vertexFunction = [defaultLibrary newFunctionWithName:@"vertexPassThrough"];
      id<MTLFunction> fragmentFunction = [defaultLibrary newFunctionWithName:@"fragmentPassThrough"];

      // Create a Pipeline Descriptor that connects the CPU draw operations to the GPU Metal context
      auto pipelineDescriptor = [[MTLRenderPipelineDescriptor alloc] init];
      pipelineDescriptor.label = @"VisionCamera: Frame Texture -> Layer Pipeline";
      pipelineDescriptor.vertexFunction = vertexFunction;
      pipelineDescriptor.fragmentFunction = fragmentFunction;
      pipelineDescriptor.colorAttachments[0].pixelFormat = MTLPixelFormatBGRA8Unorm;
      _pipelineState = [_device newRenderPipelineStateWithDescriptor:pipelineDescriptor error:nil];
    }

    - (void)render {
      @autoreleasepool {
        // Blocks until the next Frame is ready (16ms at 60 FPS)
        auto drawable = [_layer nextDrawable];

        std::unique_lock lock(_textureMutex);
        auto texture = _texture;

        MTLRenderPassDescriptor* renderPassDescriptor = [[MTLRenderPassDescriptor alloc] init];
        renderPassDescriptor.colorAttachments[0].texture = drawable.texture;
        renderPassDescriptor.colorAttachments[0].loadAction = MTLLoadActionClear;
        renderPassDescriptor.colorAttachments[0].clearColor = MTLClearColor();

        id<MTLCommandBuffer> commandBuffer([_commandQueue commandBuffer]);
        auto renderEncoder = [commandBuffer renderCommandEncoderWithDescriptor:renderPassDescriptor];
        [renderEncoder setLabel:@"VisionCamera: PreviewView Texture -> Layer"];
        [renderEncoder setRenderPipelineState:_pipelineState];
        [renderEncoder setFragmentTexture:texture atIndex:0];
        [renderEncoder endEncoding];

        [commandBuffer presentDrawable:drawable];
        [commandBuffer commit];

        lock.unlock();
      }
    }

Along with that, I have created the PassThrough.metal shader, which just passes a texture through:

    #include <metal_stdlib>
    using namespace metal;

    // Vertex input/output structure for passing results from vertex shader to fragment shader
    struct VertexIO {
      float4 position [[position]];
      float2 textureCoord [[user(texturecoord)]];
    };

    // Vertex shader for a textured quad
    vertex VertexIO vertexPassThrough(const device packed_float4 *pPosition [[ buffer(0) ]],
                                      const device packed_float2 *pTexCoords [[ buffer(1) ]],
                                      uint vid [[ vertex_id ]]) {
      VertexIO outVertex;
      outVertex.position = pPosition[vid];
      outVertex.textureCoord = pTexCoords[vid];
      return outVertex;
    }

    // Fragment shader for a textured quad
    fragment half4 fragmentPassThrough(VertexIO inputFragment [[ stage_in ]],
                                       texture2d<half> inputTexture [[ texture(0) ]],
                                       sampler samplr [[ sampler(0) ]]) {
      return inputTexture.sample(samplr, inputFragment.textureCoord);
    }

Running this crashes the app with the following exception:

    validateRenderPassDescriptor:782: failed assertion `RenderPass Descriptor Validation Texture at colorAttachment[0] has usage (0x01) which doesn't specify MTLTextureUsageRenderTarget (0x04)

This now raises three questions for me:

1. Do I have to do all of that Metal setup, shipping the PassThrough.metal shader, render pass stuff, etc. just to draw the MTLTexture to the CAMetalLayer? Is there no simpler way?
2. Why is the code above failing?
3. When is the drawing from Skia actually committed to the MTLTexture? Do I need to commit the command buffer (as seen in my TODO)?
Posted by mrousavy.
Post not yet marked as solved
2 Replies
488 Views
My code:

    guard let drawable = view.currentDrawable else { return }
    let renderPassDescriptor = MTLRenderPassDescriptor()
    renderPassDescriptor.colorAttachments[0].texture = drawable.texture // The crash happens here
    renderPassDescriptor.colorAttachments[0].loadAction = .clear
    renderPassDescriptor.colorAttachments[0].clearColor = MTLClearColor(red: 0, green: 0, blue: 0, alpha: 0)

Error info:

    MTLTextureDescriptor has height (9964) greater than the maximum allowed size of 8192.
    > validateTextureDimensions
    > validateTextureDimensions:1075: failed assertion `MTLTextureDescriptor has height (9964) greater than the maximum allowed size of 8192.'

Please help: I am trying to put a canvas on top of an image. As you can see, the canvas is so large that it exceeds the GPU's texture size limit. Is there a better way to solve this problem without changing the size?
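One possible direction, offered only as a sketch, is to keep the logical canvas size but back it with several smaller textures, each within the limit, and render tile by tile. The Python below shows just the tiling arithmetic. The 8192 limit and the 9964 height come from the error message; the 4000 width and the function name `split_into_tiles` are made up for illustration. The real limit depends on the GPU, so it should be treated as a parameter.

```python
def split_into_tiles(width, height, max_dim=8192):
    """Split a width x height canvas into tiles no larger than max_dim.

    Returns a list of (x, y, tile_width, tile_height) rectangles that
    cover the canvas exactly once, row by row.
    """
    if width <= 0 or height <= 0 or max_dim <= 0:
        raise ValueError("width, height and max_dim must be positive")
    tiles = []
    for y in range(0, height, max_dim):
        tile_h = min(max_dim, height - y)
        for x in range(0, width, max_dim):
            tile_w = min(max_dim, width - x)
            tiles.append((x, y, tile_w, tile_h))
    return tiles


if __name__ == "__main__":
    # 4000 is a made-up width; 9964 is the height from the error message.
    for tile in split_into_tiles(4000, 9964):
        print(tile)
    # (0, 0, 4000, 8192)
    # (0, 8192, 4000, 1772)
```

Each rectangle could then get its own texture or view, with drawing coordinates offset by the tile origin. Whether that fits the app's drawing model is a separate question.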
Post marked as solved
8 Replies
1.1k Views
Hello. I'm curious and wanted to ask a question. I'm currently doing some research on the Metal API and came across some really good demos that use Metal. However, I stumbled upon a product called MoltenVK. From what I've heard, it uses Metal 2 and 3 to provide Vulkan portability by translating Vulkan code to Metal code (if this is even true). I wanted your opinion on MoltenVK, given that it actually uses Metal. Is Apple OK with this, even though it uses the Metal API? Let me know. Any opinion is welcome.
Post not yet marked as solved
0 Replies
1k Views
I tried training my model on my M1 Pro using TensorFlow's mixed precision, hoping it would boost performance, but I got an error:

    .../mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:289:0: error: 'mps.select' op failed to verify that all of {true_value, false_value, result} have same element type
    .../mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:289:0: note: see current operation: %5 = "mps.select"(%4, %3, %2) : (tensor<1xi1>, tensor<1xf16>, tensor<1xf32>) -> tensor<1xf16>
Post not yet marked as solved
0 Replies
550 Views
Hi guys. With the new Metal 3 API, will developers make more games for us on the Mac as well? Good-quality gaming on the Mac is really possible now with the current Apple silicon.
Posted by Boztik.
Post not yet marked as solved
0 Replies
471 Views
As part of my automated testing on real devices I would like to get the pipeline statistics from my compute kernels (ALU, Memory, Control Flow, Occupancy, etc). I'm able to generate a GPU trace in code without issue but that requires manually opening up the trace to find the values I'm interested in. What I would like is an API to read that trace or some counter set which provides the statistics I'm interested in. Is this possible?
Posted by mrbauer1.