Easily access your GPU with Pytorch

If you’re lucky enough to have access to a system with an NVIDIA graphics processing unit (GPU), did you know there is a ridiculously easy way to tap into its power using Python libraries intended primarily for machine learning (ML) applications?
Don’t worry if you aren’t up to speed on the ins and outs of ML, as we won’t be using it in this article. Instead, I’ll show you how to access and use your GPU through the Pytorch library. We’ll compare the running times of Python programs using the popular numerical library Numpy, running on the CPU, against equivalent code using Pytorch on the GPU.
Before we proceed, let’s quickly review what a GPU and Pytorch actually are.
What is a GPU?
A GPU is a specialised electronic chip initially designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Its utility as a fast image-manipulation device is based on its ability to perform many calculations simultaneously, and it is still used for that purpose.
However, GPUs have recently become invaluable in machine learning and in the training and development of large language models. Their inherent ability to perform massively parallel computations makes them ideal for running the complex mathematical models and simulations used in those fields.
What is Pytorch?
Pytorch is an open-source machine learning library developed by the Facebook AI Research lab (FAIR). It is used extensively for natural language processing and computer vision applications. There are two main reasons Pytorch can be used for GPU operations:
- One of Pytorch’s core data structures is the tensor. Tensors are similar to arrays and matrices in other programming languages but are optimised for running on a GPU.
- Pytorch has CUDA support. Pytorch seamlessly integrates with CUDA, a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on its GPUs. This allows Pytorch to access the GPU hardware directly and accelerate numerical computations. CUDA enables developers to use Pytorch to write software that takes full advantage of GPU acceleration. (A short sketch illustrating both points follows this list.)
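Here’s that sketch: a minimal example (my own, not taken from any particular tutorial) of creating a tensor and moving it onto the GPU when one is available, falling back to the CPU when not.
import torch
# Use the GPU if Pytorch can see one; otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Create a tensor in CPU memory, then move it to the chosen device
t = torch.tensor([1.0, 2.0, 3.0]).to(device)
print(t.device)  # prints cuda:0 when a GPU is available
The same .to(device) call works for any tensor, which is most of the "GPU programming" a numerical script ever needs.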
All in all, Pytorch’s support for GPU operations through CUDA and its efficient tensor manipulation make it an excellent tool for developing GPU-accelerated Python functions with high computational demands.
As we will show later, you don’t have to be developing machine learning models or training large language models to use Pytorch.
In the rest of this article, we’ll set up our development environment, install Pytorch, and run through a few examples where we compare some compute-heavy Pytorch implementations with equivalent Numpy implementations and see what performance differences we find.
Prerequisites
You need an NVIDIA GPU on your system. To check your GPU, issue the following command at your system prompt. I’m using the Windows Subsystem for Linux (WSL).
$ nvidia-smi
>>
(base) PS C:\Users\thoma> nvidia-smi
Fri Mar 22 11:41:34 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.61                 Driver Version: 551.61         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 Ti   WDDM  |   00000000:01:00.0  On |                  N/A |
| 32%   24C    P8               9W / 285W |    843MiB /  12282MiB  |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1268    C+G   ...tility\HPSystemEventUtilityHost.exe      N/A      |
|    0   N/A  N/A      2204    C+G   ...ekyb3d8bbwe\PhoneExperienceHost.exe      N/A      |
|    0   N/A  N/A      3904    C+G   ...cal\Microsoft\OneDrive\OneDrive.exe      N/A      |
|    0   N/A  N/A      7068    C+G   ...CBS_cw5n
etc ..
If the command isn’t recognised and you’re sure you have a GPU, it probably means you’re missing the NVIDIA drivers. Just follow the rest of the instructions in this article, and you’ll install them as part of the process.
Although Pytorch’s installation packages can include the CUDA libraries, your system must still have the appropriate NVIDIA GPU drivers installed. These drivers are what allow your operating system to communicate with the graphics processing unit (GPU) hardware. The CUDA toolkit includes drivers, but if you’re using Pytorch’s bundled CUDA, you only need to make sure your GPU drivers are up to date.
Go to the NVIDIA website and install the latest drivers compatible with your system and GPU specifications.
Setting up our development environment
As a best practice, we should create a separate development environment for each project. I use conda, but please use any method that works for you.
If you haven’t used Conda before and want to go down that route, you must first install Miniconda (recommended) or Anaconda.
Please note that, at the time of writing, Pytorch only officially supports Python versions 3.8 to 3.11.
#create our test environment
(base) $ conda create -n pytorch_test python=3.11 -y
Now activate your new environment.
(base) $ conda activate pytorch_test
Now we need to get the appropriate conda installation command for Pytorch. This will depend on your operating system, chosen programming language, preferred package manager, and CUDA version.
Fortunately, Pytorch provides a useful web interface that makes this easy to set up. To get started, visit the Pytorch website and click the Get Started link near the top of the screen. From there, scroll down a little until you see the installation selection matrix.
Click the appropriate box for each of your system’s specifications. As you do, the text in the Run this Command output field changes dynamically. Once you’ve finished making your selections, copy the final command text displayed and type it into your command window prompt.
For me, this was:
(pytorch_test) $ conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
Next, we’ll install Jupyter, Pandas, and Matplotlib so that we can run our example code in a notebook.
(pytorch_test) $ conda install pandas matplotlib jupyter -y
Now type jupyter notebook at your command prompt. You should see a Jupyter notebook open in your browser. If that doesn’t happen automatically, look at the output of the jupyter notebook command; near the bottom, there will be a URL that you should copy and paste into your browser to start the Jupyter notebook.
Your URL will be different from mine, but it should look something like this: http://localhost:8888/tree?token=<long-token-string>
Test our setup
The first thing we’ll do is test our setup. Enter the following into a Jupyter cell and run it.
import torch
x = torch.rand(5, 3)
print(x)
You should see output similar to the following.
tensor([[0.3715, 0.5503, 0.5783],
[0.8638, 0.5206, 0.8439],
[0.4664, 0.0557, 0.6280],
[0.5704, 0.0322, 0.6053],
[0.3416, 0.4090, 0.6366]])
Additionally, to check that your GPU drivers and CUDA are enabled and accessible to Pytorch, run the following:
import torch
torch.cuda.is_available()
This should output True if everything is OK.
If everything works fine, we can proceed to the examples. If not, go back and check your installation process.
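If you want a little more detail about what Pytorch can see, this minimal sketch (my own, assuming a single GPU at index 0) prints the CUDA version your Pytorch build was compiled against, plus the device name and total memory:
import torch
if torch.cuda.is_available():
    # CUDA version this Pytorch build was compiled against
    print("CUDA version :", torch.version.cuda)
    # Name of the first GPU (index 0)
    print("Device name  :", torch.cuda.get_device_name(0))
    # Total memory on the device, in GiB
    props = torch.cuda.get_device_properties(0)
    print("Total memory :", round(props.total_memory / 1024**3, 1), "GiB")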
NB For the timings that follow, I ran the Numpy and Pytorch processes several times in succession and took the best time for each. This does slightly favour the Pytorch runs, as there is a small overhead on the first invocation of each Pytorch operation, but overall, I think it’s a fairer comparison.
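For reference, here’s a minimal sketch of the kind of best-of-N timing harness I mean (the best_time helper is my own, not part of any library); the torch.cuda.synchronize() call makes sure all queued GPU work has finished before the clock stops:
from timeit import default_timer as timer
import torch
def best_time(fn, n_runs=5):
    # Run fn several times and keep the fastest wall-clock time
    times = []
    for _ in range(n_runs):
        start = timer()
        fn()
        if torch.cuda.is_available():
            # don't stop the clock until queued GPU work has finished
            torch.cuda.synchronize()
        times.append(timer() - start)
    return min(times)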
Example 1 – Simple array math operation.
In this example, we’ll set up two identical one-dimensional arrays and perform a simple addition on each array element.
import numpy as np
import torch as pt
from timeit import default_timer as timer
# func1 will run on the CPU
def func1(a):
    a += 1

# func2 will run on the GPU
def func2(a):
    a += 2

if __name__ == "__main__":
    n1 = 300000000
    a1 = np.ones(n1, dtype=np.float64)

    n2 = 300000000
    a2 = pt.ones(n2, dtype=pt.float64)

    start = timer()
    func1(a1)
    print("Timing with CPU:numpy", timer() - start)

    start = timer()
    func2(a2)
    # wait for all calcs on the GPU to complete
    pt.cuda.synchronize()
    print("Timing with GPU:pytorch", timer() - start)

    print()
    print("a1 = ", a1)
    print("a2 = ", a2)
Timing with CPU:numpy 0.1334826999955112
Timing with GPU:pytorch 0.10177790001034737
a1 = [2. 2. 2. ... 2. 2. 2.]
a2 = tensor([3., 3., 3., ..., 3., 3., 3.], dtype=torch.float64)
We see a slight improvement with Pytorch over Numpy, but we’ve missed a key point. We aren’t actually using the GPU, because our Pytorch tensor data is still in CPU memory.
To move the data to GPU memory, we need to add the device='cuda' directive when creating the tensor. Let’s see if that makes a difference.
# Same code as above except
# to get the array data onto the GPU memory
# we changed
a2 = pt.ones(n2,dtype=pt.float64)
# to
a2 = pt.ones(n2,dtype=pt.float64,device='cuda')
After rerunning with those changes, we got:
Timing with CPU:numpy 0.12852740001108032
Timing with GPU:pytorch 0.011292399998637848
a1 = [2. 2. 2. ... 2. 2. 2.]
a2 = tensor([3., 3., 3., ..., 3., 3., 3.], device='cuda:0', dtype=torch.float64)
That’s more like it, with a speed-up of more than 10x.
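As an aside, creating the tensor directly on the GPU isn’t the only way to get data there. You can also create it in CPU memory and move it with .to('cuda'), at the cost of an extra host-to-device copy. A minimal sketch of both approaches (assuming a CUDA device is available):
import torch as pt
# Option 1: allocate the tensor directly in GPU memory (no copy needed)
a = pt.ones(1000, dtype=pt.float64, device='cuda')
# Option 2: allocate in CPU memory, then copy it across to the GPU
b = pt.ones(1000, dtype=pt.float64).to('cuda')
print(a.device, b.device)  # both report cuda:0
For large arrays, option 1 avoids a transfer over the PCIe bus, which is why we used it above.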
Example 2 – A slightly more complex array operation.
In this example, we’ll multiply one large multidimensional matrix by another using the built-in matmul operation available in both the Pytorch and Numpy libraries. Each array will be 10000 x 10000 and contain random floating-point numbers between 1 and 100.
# NUMPY first
import numpy as np
from timeit import default_timer as timer
# Set the seed for reproducibility
np.random.seed(0)
# Generate two 10000x10000 arrays of random floating point numbers between 1 and 100
A = np.random.uniform(low=1.0, high=100.0, size=(10000, 10000)).astype(np.float32)
B = np.random.uniform(low=1.0, high=100.0, size=(10000, 10000)).astype(np.float32)
# Perform matrix multiplication
start = timer()
C = np.matmul(A, B)
# Due to the large size of the matrices, it's not practical to print them entirely.
# Instead, we print a small portion to verify.
print("A small portion of the result matrix:n", C[:5, :5])
print("Without GPU:", timer()-start)
A small portion of the result matrix:
[[25461280. 25168352. 25212526. 25303304. 25277884.]
[25114760. 25197558. 25340074. 25341850. 25373122.]
[25381820. 25326522. 25438612. 25596932. 25538602.]
[25317282. 25223540. 25272242. 25551428. 25467986.]
[25327290. 25527838. 25499606. 25657218. 25527856.]]
Without GPU: 1.4450852000009036
Now for the Pytorch version. (Note that the result values differ slightly from the Numpy run because Numpy and Pytorch use different random number generators, so the input matrices aren’t identical.)
import torch
from timeit import default_timer as timer
# Set the seed for reproducibility
torch.manual_seed(0)
# Use the GPU
device = 'cuda'
# Generate two 10000x10000 tensors of random floating point
# numbers between 1 and 100 and move them to the GPU
#
A = torch.FloatTensor(10000, 10000).uniform_(1, 100).to(device)
B = torch.FloatTensor(10000, 10000).uniform_(1, 100).to(device)
# Perform matrix multiplication
start = timer()
C = torch.matmul(A, B)
# Wait for all current GPU operations to complete (synchronize)
torch.cuda.synchronize()
# Due to the large size of the matrices, it's not practical to print them entirely.
# Instead, we print a small portion to verify.
print("A small portion of the result matrix:n", C[:5, :5])
print("With GPU:", timer() - start)
A small portion of the result matrix:
[[25145748. 25495480. 25376196. 25446946. 25646938.]
[25357524. 25678558. 25675806. 25459324. 25619908.]
[25533988. 25632858. 25657696. 25616978. 25901294.]
[25159630. 25230138. 25450480. 25221246. 25589418.]
[24800246. 25145700. 25103040. 25012414. 25465890.]]
With GPU: 0.07081239999388345
This time, the Pytorch run was around 20 times faster than the Numpy run. Great stuff.
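As a rough sanity check on those numbers, multiplying two n x n matrices takes about 2n³ floating-point operations, so for n = 10000 that’s roughly 2 x 10¹² operations. A back-of-envelope sketch using my timings above (yours will differ):
# Approximate throughput implied by the timings above
n = 10000
flops = 2 * n**3          # ~2e12 floating-point operations for an n x n matmul
cpu_time = 1.4450852      # seconds, from the Numpy run
gpu_time = 0.0708124      # seconds, from the Pytorch run
print(f"CPU: {flops / cpu_time / 1e12:.1f} TFLOPS")   # roughly 1.4 TFLOPS
print(f"GPU: {flops / gpu_time / 1e12:.1f} TFLOPS")   # roughly 28 TFLOPS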
Example 3 – Combining CPU and GPU code.
Sometimes, not all of your processing can be done on a GPU. An everyday use case for this is plotting data. Sure, you can manipulate your data using the GPU, but often the next step is to see what the final dataset looks like in a graph.
You can’t plot data if it resides in GPU memory, so you have to move it back to CPU memory before calling your plotting functions. Is it worth the overhead of moving large chunks of data from the GPU to the CPU? Let’s find out.
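The move itself is a one-liner. Here’s a minimal sketch of the pattern (assuming a CUDA device is available; Matplotlib needs the data in host memory as a NumPy array):
import torch
gpu_data = torch.rand(1_000_000, device='cuda')
# .cpu() copies the tensor back into host memory;
# .numpy() then exposes it as a NumPy array ready for plotting
host_data = gpu_data.cpu().numpy()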
In this example, we’ll solve the polar equation r = 1 + (3/4)·sin(3θ) for values of θ between 0 and 2π, convert the solutions into (x, y) coordinates, and then plot the resulting graph.
Don’t get too hung up on the maths. It’s just an equation that happens to look nice when solved and plotted in the (x, y) coordinate system.
Even with millions of x and y values, Numpy can solve this in milliseconds, so to make it a bit more interesting, we’ll use 100 million (x, y) coordinates.
First is the Numpy code.
%%time
import numpy as np
import matplotlib.pyplot as plt
from time import time as timer
start = timer()
# create an array of 100M thetas between 0 and 2pi
theta = np.linspace(0, 2*np.pi, 100000000)
# our original polar formula
r = 1 + 3/4 * np.sin(3*theta)
# calculate the equivalent x and y's coordinates
# for each theta
x = r * np.cos(theta)
y = r * np.sin(theta)
# see how long the calc part took
print("Finished with calcs ", timer()-start)
# Now plot out the data
start = timer()
plt.plot(x,y)
# see how long the plotting part took
print("Finished with plot ", timer()-start)
This is the output. Would you have guessed beforehand that it would look like this? I certainly wouldn’t have!

Now, let’s see what the equivalent Pytorch implementation looks like and how much of a speed-up we get.
%%time
import torch as pt
import matplotlib.pyplot as plt
from time import time as timer
# Make sure PyTorch is using the GPU
device = 'cuda'
# Start the timer
start = timer()
# Creating the theta tensor on the GPU
theta = pt.linspace(0, 2 * pt.pi, 100000000, device=device)
# Calculating r, x, and y using PyTorch operations on the GPU
r = 1 + 3/4 * pt.sin(3 * theta)
x = r * pt.cos(theta)
y = r * pt.sin(theta)
# Moving the result back to CPU for plotting
x_cpu = x.cpu().numpy()
y_cpu = y.cpu().numpy()
pt.cuda.synchronize()
print("Finished with calcs", timer() - start)
# Plotting
start = timer()
plt.plot(x_cpu, y_cpu)
plt.show()
print("Finished with plot", timer() - start)
And our output.

The calculation part ran about 10 times faster than the Numpy version. The plotting step took roughly the same time in both the Pytorch and Numpy versions, which was expected, since by then the data was back in CPU memory and the GPU played no further part in the processing.
However, overall, we shaved about 40% off the total run time, which is excellent.
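If you’re curious how much of that time was the device-to-host copy itself, you can time the transfer in isolation. A minimal sketch (the size matches the example above; your timings will vary with your hardware):
import torch
from time import time as timer
x = torch.rand(100_000_000, device='cuda')
torch.cuda.synchronize()   # make sure the tensor creation has finished
start = timer()
x_cpu = x.cpu()            # copy 100M float32 values (~400 MB) to host memory
print("Transfer time:", timer() - start)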
Summary
This article showed how to harness the power of an NVIDIA GPU for non-ML numerical Python code using Pytorch, a machine learning library usually reserved for AI applications. It compared standard Numpy (CPU-based) implementations with GPU-accelerated Pytorch equivalents to show the performance advantages of running tensor-based operations on a GPU.
You don’t need to be doing machine learning to benefit from Pytorch. If you have access to an NVIDIA GPU, Pytorch provides an easy and effective way to significantly speed up computationally intensive numerical operations, even in general-purpose Python code.