Does __syncthreads() synchronize all threads in the grid?

The __syncthreads() command is a block-level synchronization barrier. That means it is safe to use when all threads in a block reach the barrier. It is also possible to use __syncthreads() in conditional code, but only when all threads in the block evaluate the condition identically; otherwise execution is likely to hang or produce unintended side effects [4]. Example of using __syncthreads(): (source) …
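The linked example is not reproduced in this excerpt. A typical illustration of the block-level barrier is a shared-memory reduction; the sketch below is mine, not the cited source's, and assumes a block size of 256:

```cuda
// Block-wide sum: __syncthreads() guarantees every thread's write to
// shared memory is visible before any thread reads it.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float buf[256];          // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                    // all loads into buf complete

    // Tree reduction. Note the barrier sits OUTSIDE the conditional,
    // so every thread in the block reaches it on every iteration --
    // placing it inside the if would violate the rule above.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];
}
```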

When to call cudaDeviceSynchronize?

Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is the default behavior) are executed sequentially. So in your example, there is no need for cudaDeviceSynchronize. However, it might be useful for debugging to detect which of your kernels has caused an error (if there is any). cudaDeviceSynchronize may …
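A minimal sketch of that debugging pattern (kernelA and kernelB are placeholders for your own kernels): synchronizing after each launch surfaces an asynchronous error at the launch that caused it, rather than at some later API call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelA() { /* ... */ }
__global__ void kernelB() { /* ... */ }

int main()
{
    kernelA<<<1, 1>>>();
    // cudaDeviceSynchronize blocks until the device is idle and
    // returns any asynchronous error raised so far.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernelA failed: %s\n", cudaGetErrorString(err));

    kernelB<<<1, 1>>>();
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernelB failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```

In production code you would drop these extra synchronizations, since they serialize the host with the device.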

Is it possible to run CUDA on AMD GPUs?

Nope, you can’t use CUDA for that. CUDA is limited to NVIDIA hardware. OpenCL would be the best alternative. Khronos itself has a list of resources, as does the StreamComputing.eu website. For AMD-specific resources, you might want to have a look at AMD’s APP SDK page. Note that at this time there are several initiatives to translate/cross-compile CUDA …

How to verify CuDNN installation?

Installing CuDNN just involves placing the files in the CUDA directory. If you have specified the paths and the CuDNN option correctly while installing Caffe, it will be compiled with CuDNN. You can check that using cmake: create a directory caffe/build and run cmake .. from there. If the configuration is correct you will see these lines: If everything is …
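The cmake check described above can be sketched as follows (paths assume the standard Caffe source layout; the exact wording of the output line varies by Caffe version):

```shell
cd caffe
mkdir -p build
cd build
cmake ..
# Then look in the configuration summary for a line indicating that
# CuDNN was found and enabled, e.g. something like "USE_CUDNN : ON".
```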

How do I select which GPU to run a job on?

The problem was caused by not setting the CUDA_VISIBLE_DEVICES variable within the shell correctly. To specify CUDA device 1, for example, you would set CUDA_VISIBLE_DEVICES either by exporting it in the shell or by prefixing the command with the assignment. The former sets the variable for the life of the current shell, the latter only for the lifespan of that particular executable invocation. If you want to specify more than one device, use …
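The commands themselves are elided from this excerpt; reconstructed from the surrounding description, the two forms look like this ("./my_app" is a placeholder for your executable):

```shell
# For the life of the current shell:
export CUDA_VISIBLE_DEVICES=1
echo "$CUDA_VISIBLE_DEVICES"

# Only for the lifespan of one invocation (shown as a comment, since
# ./my_app is hypothetical):
#   CUDA_VISIBLE_DEVICES=1 ./my_app

# More than one device: a comma-separated list.
export CUDA_VISIBLE_DEVICES=0,1
echo "$CUDA_VISIBLE_DEVICES"
```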

NVIDIA NVML Driver/library version mismatch

Surprise surprise, rebooting solved the issue (I thought I had already tried that). The solution Robert Crovella mentioned in the comments may also be useful to someone else, since it’s pretty similar to what I did to solve the issue the first time I had it.

cudaMemcpy function usage

It’s not trivial to handle a doubly-subscripted C array when copying data between host and device. For the most part, cudaMemcpy (including cudaMemcpy2D) expects an ordinary pointer for source and destination, not a pointer-to-pointer. The simplest approach (I think) is to “flatten” the 2D arrays, both on host and device, and use index arithmetic to simulate 2D coordinates: …
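A sketch of the flattening approach, with illustrative sizes of my choosing: one contiguous allocation on each side, and a[i][j] simulated as a[i * cols + j].

```cuda
#include <cuda_runtime.h>

#define ROWS 4
#define COLS 5

// Each thread increments one element, locating it via index arithmetic.
__global__ void addOne(float *d_a, int rows, int cols)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows && j < cols)
        d_a[i * cols + j] += 1.0f;   // "d_a[i][j]" on the flat array
}

int main()
{
    float h_a[ROWS * COLS];          // flattened host array
    for (int i = 0; i < ROWS * COLS; ++i) h_a[i] = (float)i;

    float *d_a;
    cudaMalloc(&d_a, sizeof(h_a));
    // Ordinary pointers on both sides -- no pointer-to-pointer needed.
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((COLS + 15) / 16, (ROWS + 15) / 16);
    addOne<<<grid, block>>>(d_a, ROWS, COLS);

    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    return 0;
}
```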

How to get the CUDA version?

As Jared mentions in a comment, from the command line, nvcc --version (or /usr/local/cuda/bin/nvcc --version) gives the CUDA compiler version (which matches the toolkit version). From application code, you can query the runtime API version with cudaRuntimeGetVersion() or the driver API version with cudaDriverGetVersion(). As Daniel points out, deviceQuery is an SDK sample app that queries the above, along with …
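A minimal sketch of those two in-application queries (the version is encoded as 1000 * major + 10 * minor, e.g. 11020 for CUDA 11.2):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int runtimeVer = 0, driverVer = 0;
    cudaRuntimeGetVersion(&runtimeVer);  // version the app was built against
    cudaDriverGetVersion(&driverVer);    // version the installed driver supports
    printf("Runtime API: %d, Driver API: %d\n", runtimeVer, driverVer);
    return 0;
}
```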
