In week 1, we explored setting up PyCUDA on Google Colab and discussed some CUDA concepts like threads, blocks, and grids. We also wrote some basic CUDA C++ code and Python code to invoke it on a GPU.
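As a quick refresher on the thread/block/grid hierarchy: every thread computes a unique global index from its block and thread coordinates. A small plain-Python sketch of that mapping (no GPU required; the toy block size of 4 is just for illustration):

```python
# Sketch of how CUDA assigns a unique global index per thread:
# each block holds block_dim threads, so the thread at (block_idx, thread_idx)
# gets global index block_idx * block_dim + thread_idx.
block_dim = 4  # threads per block (toy value; real blocks are often 256-1024)

def global_index(block_idx, thread_idx):
    return block_idx * block_dim + thread_idx

# Two blocks of four threads cover indices 0..7 with no gaps or overlaps:
indices = [global_index(b, t) for b in range(2) for t in range(4)]
print(indices)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

This is exactly the expression `blockIdx.x * blockDim.x + threadIdx.x` used in the kernel below.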
The code for week 1 can be found in this Google Colab notebook; you can also open your own (click "new notebook"). Make sure you are using the "T4 GPU" runtime when running the cells (you will find it under Runtime --> Change runtime type). This code compares the runtime of multiplying N pairs of numbers with and without the GPU. N is set to 1000; to make things interesting, try larger values (1000000 or even 100000000).
!pip install pycuda
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
from time import perf_counter
# Compile the CUDA kernel: each thread multiplies one pair of elements.
# The kernel takes the array length n and guards against out-of-bounds
# accesses, since the grid may launch more threads than there are elements.
module = SourceModule('''
__global__ void multiply(float *dest, float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dest[i] = a[i] * b[i];
    }
}
''')
multiply = module.get_function("multiply")
N = 1000
a = np.random.randn(N).astype(np.float32)
b = np.random.randn(N).astype(np.float32)
dest = np.zeros_like(a)
t0 = perf_counter()
# 1024 threads per block; enough blocks to cover all N elements (ceiling division).
# Note that cuda.In/cuda.Out host-device transfers are included in the timing.
multiply(cuda.Out(dest), cuda.In(a), cuda.In(b), np.int32(N),
         block=(1024, 1, 1), grid=((N + 1023) // 1024, 1, 1))
t1 = perf_counter()
print(f"Multiplied {N} pairs of numbers")
print(f"GPU computation time: {t1 - t0}")
dest2 = np.zeros_like(a)
t0 = perf_counter()
for i in range(N):
    dest2[i] = a[i] * b[i]
t1 = perf_counter()
print(f"CPU computation time: {t1 - t0}")
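The element-by-element Python loop above is a worst-case CPU baseline, since each iteration goes through the interpreter. For a fairer comparison, you could also time NumPy's vectorized multiply, which runs the whole loop in compiled C. A minimal sketch (standalone, with its own arrays):

```python
import numpy as np
from time import perf_counter

N = 1000
a = np.random.randn(N).astype(np.float32)
b = np.random.randn(N).astype(np.float32)

t0 = perf_counter()
dest3 = a * b  # elementwise multiply, executed in compiled C inside NumPy
t1 = perf_counter()
print(f"NumPy (vectorized CPU) computation time: {t1 - t0}")
```

At small N the vectorized CPU version will typically beat the GPU, because the GPU timing above includes copying the arrays to and from the device.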