Reinforcement Learning Made Simple: Building a Q-Learning Agent in Python

In 2016, Go world champion Lee Sedol faced an opponent made not of flesh and blood, but of lines of code.
It soon became clear that the human was outmatched.
In the end, AlphaGo won the match 4:1.
Last week, I rewatched the documentary AlphaGo and found it just as fascinating as the first time.
The eerie part? AlphaGo did not learn its playing style from databases, rules, or strategy books.
Instead, it played against itself millions of times and learned how to win in the process.
Move 37 in the second game was the moment the world understood: this AI does not play like a human – it plays better.
AlphaGo combines supervised learning, reinforcement learning, and search. The fascinating part is that much of its strategy emerged from playing against itself – improving over time through reinforcement learning.
Today, we use reinforcement learning not only in games, but also in robotics (e.g., gripper arms or household robots), in energy optimization (e.g., reducing energy consumption in data centers), and in traffic control (e.g., optimizing traffic lights).
And in modern systems, large language models are combined with reinforcement learning (e.g., reinforcement learning from human feedback, RLHF) to make the responses of ChatGPT, Claude, or Gemini more helpful and human-like.
In this article, I will show you exactly how this works and how we can understand the mechanics using a simple game: Tic Tac Toe.
What is reinforcement learning?
When we watch a baby learning to walk, we see it stand up, fall, and try again – until at some point it takes its first steps.
No teacher shows the baby how to do it. Instead, the baby tries out different movements through trial and error.
Standing up or taking a few steps is the baby’s reward – after all, its goal is to walk. If it falls, there is no reward.
This cycle of trying, failing, and being rewarded is the basic idea behind reinforcement learning (RL).
Reinforcement learning is a learning method in which an agent learns through interaction with its environment and receives rewards as feedback.
Its goal is to collect as much reward as possible in the long run.
- In contrast to supervised learning, there is no “correct answer” or label. The agent must find out on its own which decisions are good.
- In contrast to unsupervised learning, the goal is not to find hidden patterns in the data, but to perform the actions that maximize reward.
How RL agents think, decide and learn
To learn, an RL agent needs to know four things: where it currently stands (state), what it can do (actions), what it wants to achieve (reward), and how well past decisions have worked (values).
The agent acts, receives feedback, and gets better.
For this, four components are needed:
1) Policy/Strategy
The policy is the rule or strategy by which the agent decides which action to take in a given state. In simple cases, this is a lookup table; in more complex applications, it is a function such as a neural network.
2) Reward signal
The reward is the feedback from the environment. For example, it could be +1 for a win, 0 for a draw, and -1 for a loss. The agent’s goal is to collect as much reward as possible over many steps.
3) Value function
The value function estimates the expected future reward of a state. While the reward signal only tells us whether a single action was “good” or “bad”, the value function estimates how good a state is in the long run – taking into account not just the immediate reward, but also the future rewards the agent can expect starting from that state.
4) Environment model
A model tells the agent: “If I take action A in state S, I will probably end up in state S' and receive reward R.”
In model-free methods such as Q-learning, however, this component is not needed – the agent learns directly from experience, as the small sketch below illustrates.
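To make these terms concrete, here is a minimal sketch of the generic agent-environment loop. It is a toy example I made up purely for illustration – the function toy_step and its tiny one-dimensional world are not part of the Tic Tac Toe code that follows later:

import random

# Toy environment (hypothetical): positions 0..2 in a row, reaching position 2 gives +1
def toy_step(state, action):               # action: -1 = step left, +1 = step right
    next_state = max(0, min(2, state + action))
    reward = 1 if next_state == 2 else 0   # reward signal from the environment
    done = next_state == 2
    return next_state, reward, done

state = 0                                   # initial state
total_reward = 0
while True:
    action = random.choice([-1, 1])         # policy: purely random here, no learning yet
    state, reward, done = toy_step(state, action)
    total_reward += reward
    if done:                                # episode ends when the goal state is reached
        break
print("Collected reward:", total_reward)

The loop always has the same shape: observe the state, choose an action, receive a reward and a new state – exactly the cycle our Tic Tac Toe agent will go through.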
Exploitation and exploration: Move 37 – and what we can learn from it
You may remember Move 37 in the second game between AlphaGo and Lee Sedol:
It was a move that seemed strange to us humans, but was later hailed as a stroke of genius.
Why did the algorithm play it?
The program had tried something new. This is called exploration.
Reinforcement learning needs both: the agent must find a balance between exploitation and exploration.
- Exploitation means the agent uses the best action it already knows.
- Exploration, on the other hand, means the agent tries out new actions – because they might turn out to be better than the ones it already knows.
Through trial and error, the agent tries to find the best strategy – the short ε-greedy sketch below shows how this balance is typically implemented.
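Here is that minimal sketch of the ε-greedy rule, which our agent will also use later. The actions and q_values are made-up placeholders, not values from the actual game:

import random

epsilon = 0.1
actions = ["A", "B", "C"]
q_values = {"A": 0.2, "B": 0.7, "C": 0.1}    # hypothetical estimates of how good each action is

if random.random() < epsilon:
    action = random.choice(actions)                      # exploration: try something new
else:
    action = max(actions, key=lambda a: q_values[a])     # exploitation: best-known action
print("Chosen action:", action)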
Tic Tac Toe with reinforcement learning
Let’s look at reinforcement learning using a game almost everyone knows.
You have probably played Tic Tac Toe since childhood.

The game is perfect as an introductory example: it doesn’t require a neural network, the rules are clear, and we can implement it with just a little Python:
- Our agent starts with zero knowledge of the game – like a human watching it for the very first time.
- The agent gradually learns to evaluate each game situation: a value of 0.5 means “I don’t know yet whether I will win from here”, while a value of 1.0 means “this situation will almost certainly lead to a win”.
- By playing many games, the agent observes what works and adapts its strategy.
The goal? In each turn, the agent should choose the action that brings the highest long-term reward.
In this section, we build such an RL system step by step in the file tictactoerl.py.
→ You can find all the code in this GitHub repository.
1. Create a game environment
In reinforcement learning, the agent learns through interaction with the environment. The environment determines what a state is (e.g., the current board), which actions are allowed (e.g., where a mark can be placed), and what feedback follows an action (e.g., +1 for a win).
In theory, this setup is called a Markov decision process: a model consisting of states, actions, and rewards.
First, we create a class TicTacToe. It manages the game board, which we represent as a 3×3 NumPy array, and the game logic:
- The function reset(self) starts a new game.
- The function available_actions() returns all free fields.
- The function step(self, action, player) executes a move. It returns the new state, the reward (1 = win, 0.5 = draw, -10 = invalid move), and whether the game is over. In this example, we penalize invalid moves with -10 so that the agent learns to avoid them quickly – a common technique in small RL environments.
- The function check_winner() checks whether a player has three X’s or O’s in a row and has therefore won.
- With render_gui(), we display the current board with Matplotlib, drawing X’s and O’s.
import numpy as np
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
import random
from collections import defaultdict

# Tic Tac Toe game environment
class TicTacToe:
    def __init__(self):
        self.board = np.zeros((3, 3), dtype=int)   # 0 = empty, 1 = X (agent), -1 = O (opponent)
        self.done = False
        self.winner = None

    def reset(self):
        # Start a new game and return the initial state
        self.board[:] = 0
        self.done = False
        self.winner = None
        return self.get_state()

    def get_state(self):
        # The state is the flattened board as a hashable tuple
        return tuple(self.board.flatten())

    def available_actions(self):
        # All free fields as (row, column) pairs
        return [(i, j) for i in range(3) for j in range(3) if self.board[i, j] == 0]

    def step(self, action, player):
        # Execute a move and return (new state, reward, done)
        if self.done:
            raise ValueError("Game is over")
        i, j = action
        if self.board[i, j] != 0:
            return self.get_state(), -10, True   # invalid move is penalized
        self.board[i, j] = player
        if self.check_winner(player):
            self.done = True
            self.winner = player
            return self.get_state(), 1, True     # win
        elif not self.available_actions():
            self.done = True
            return self.get_state(), 0.5, True   # draw
        return self.get_state(), 0, False        # game continues

    def check_winner(self, player):
        # Three in a row, column, or diagonal
        for i in range(3):
            if all(self.board[i, :] == player) or all(self.board[:, i] == player):
                return True
        if all(np.diag(self.board) == player) or all(np.diag(np.fliplr(self.board)) == player):
            return True
        return False

    def render_gui(self):
        # Draw the current board with Matplotlib: X as blue crosses, O as red circles
        fig, ax = plt.subplots()
        ax.set_xticks([0.5, 1.5], minor=False)
        ax.set_yticks([0.5, 1.5], minor=False)
        ax.set_xticks([], minor=True)
        ax.set_yticks([], minor=True)
        ax.set_xlim(-0.5, 2.5)
        ax.set_ylim(-0.5, 2.5)
        ax.grid(True, which='major', color='black', linewidth=2)
        for i in range(3):
            for j in range(3):
                value = self.board[i, j]
                if value == 1:
                    ax.plot(j, 2 - i, 'x', markersize=20, markeredgewidth=2, color='blue')
                elif value == -1:
                    circle = plt.Circle((j, 2 - i), 0.3, fill=False, color='red', linewidth=2)
                    ax.add_patch(circle)
        ax.set_aspect('equal')
        plt.axis('off')
        plt.show()
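Before adding the agent, we can quickly check that the environment behaves as described. This short snippet is just a sanity check (it assumes the TicTacToe class above and is not part of the final script):

# Quick sanity check of the environment (assumes the TicTacToe class defined above)
env = TicTacToe()
state = env.reset()
print(len(env.available_actions()))                # 9 – all fields are free at the start
state, reward, done = env.step((0, 0), player=1)   # X plays the top-left corner
print(reward, done)                                # 0 False – game continues
state, reward, done = env.step((0, 0), player=-1)  # O tries the same, occupied field
print(reward, done)                                # -10 True – invalid move is penalized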
2. Programming the Q-learning agent
Next, we define the learning part: our agent.
It decides which action to take in a given state in order to collect as much reward as possible.
The agent uses the classic RL method Q-learning: for each combination of state and action, it stores a Q-value – the estimated long-term benefit of that action.
The most important methods are:
- With the function choose_action(self, state, actions), the agent decides in each game situation whether to pick an action it already knows to be good (exploitation) or to try an action it has not fully tested yet (exploration). This decision follows the so-called ε-greedy method: with probability ε = 0.1 the agent chooses a random action (exploration), and with probability 1 − ε = 90% it chooses the currently best-known action from its Q-table (exploitation).
- With the function update(state, action, reward, next_state, next_actions), we adjust the Q-value based on how good a move was and what happened afterwards. This is the agent’s central learning step.
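For reference, this is the standard Q-learning update rule that update() implements, with learning rate α and discount factor γ:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$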
# Q-learning agent
class QLearningAgent:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q_table = defaultdict(float)   # Q-values, default 0.0 for unknown (state, action) pairs
        self.alpha = alpha                  # learning rate
        self.gamma = gamma                  # discount factor for future rewards
        self.epsilon = epsilon              # exploration rate

    def get_q(self, state, action):
        return self.q_table[(state, action)]

    def choose_action(self, state, actions):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit the best-known action
        if random.random() < self.epsilon:
            return random.choice(actions)
        q_values = [self.get_q(state, a) for a in actions]
        max_q = max(q_values)
        best_actions = [a for a, q in zip(actions, q_values) if q == max_q]
        return random.choice(best_actions)

    def update(self, state, action, reward, next_state, next_actions):
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        max_next_q = max([self.get_q(next_state, a) for a in next_actions], default=0.0)
        old_q = self.get_q(state, action)
        self.q_table[(state, action)] = old_q + self.alpha * (reward + self.gamma * max_next_q - old_q)
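As a quick illustration (assuming the two classes above; this is not part of the training loop), here is how a single decision and a single learning step look:

# Illustration only: one decision and one learning step with an untrained agent
env = TicTacToe()
agent = QLearningAgent()
state = env.reset()
action = agent.choose_action(state, env.available_actions())   # epsilon-greedy choice
next_state, reward, done = env.step(action, player=1)
agent.update(state, action, reward, next_state, env.available_actions())
print(action, agent.get_q(state, action))   # Q-value of this move after one neutral update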
3. Training the agent
The actual learning process begins in this step. During training, the agent learns through trial and error: it plays many games, remembers which moves worked well, and adapts its strategy.
During training, the agent learns which rewards its actions yield, how its actions affect later states, and which strategies pay off in the long run.
- With the function train(agent, episodes=10000), we let the agent play 10,000 games against a simple random opponent. In each episode, the agent (player 1) makes a move, followed by the random opponent (player -1 in the code). After each move, the agent learns via update().
- Every 1,000 games, we record how many wins, draws, and losses occurred.
- Finally, we plot the learning curve with Matplotlib. It shows how the agent improves over time.
# Training with learning curve
def train(agent, episodes=10000):
    env = TicTacToe()
    results = {"win": 0, "draw": 0, "loss": 0}
    win_rates = []
    draw_rates = []
    loss_rates = []
    for episode in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Agent (player 1) chooses and executes a move
            actions = env.available_actions()
            action = agent.choose_action(state, actions)
            next_state, reward, done = env.step(action, player=1)
            if done:
                agent.update(state, action, reward, next_state, [])
                if reward == 1:
                    results["win"] += 1
                elif reward == 0.5:
                    results["draw"] += 1
                else:
                    results["loss"] += 1
                break
            # Random opponent (player -1) responds
            opp_actions = env.available_actions()
            opp_action = random.choice(opp_actions)
            next_state2, reward2, done = env.step(opp_action, player=-1)
            if done:
                # If the opponent ends the game, the agent learns from the negated reward
                agent.update(state, action, -1 * reward2, next_state2, [])
                if reward2 == 1:
                    results["loss"] += 1
                elif reward2 == 0.5:
                    results["draw"] += 1
                else:
                    results["win"] += 1
                break
            next_actions = env.available_actions()
            agent.update(state, action, reward, next_state2, next_actions)
            state = next_state2
        if (episode + 1) % 1000 == 0:
            total = sum(results.values())
            win_rates.append(results["win"] / total)
            draw_rates.append(results["draw"] / total)
            loss_rates.append(results["loss"] / total)
            print(f"Episode {episode+1}: Wins {results['win']}, Draws {results['draw']}, Losses {results['loss']}")
            results = {"win": 0, "draw": 0, "loss": 0}
    x = [i * 1000 for i in range(1, len(win_rates) + 1)]
    plt.plot(x, win_rates, label="Win Rate")
    plt.plot(x, draw_rates, label="Draw Rate")
    plt.plot(x, loss_rates, label="Loss Rate")
    plt.xlabel("Episodes")
    plt.ylabel("Rate")
    plt.title("Learning Curve of the Q-Learning Agent")
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()
4. Visualizing the board
With the main block if __name__ == "__main__": we define the entry point of the program. It ensures that the agent’s training runs automatically when we execute the script. Afterwards, we use render_gui() to display a Tic Tac Toe board as a plot.
# Main program
if __name__ == "__main__":
    agent = QLearningAgent()
    train(agent, episodes=10000)

    # Visualization of an example board
    env = TicTacToe()
    env.board[0, 0] = 1
    env.board[1, 1] = -1
    env.render_gui()
Terminal execution
We save the code in the file tictactoerl.py.
In the terminal, we navigate to the directory containing tictactoerl.py and run the file with the command “python tictactoerl.py”.
In the terminal, we can then see after every 1,000 episodes how many games our agent has won:

In the visualization, we see the learning curve:

Final thoughts
With Tic Tac Toe, we used a simple game and a bit of Python, yet we could clearly see how reinforcement learning works:
- The agent begins without any prior knowledge.
- It develops strategies through feedback and experience.
- The results gradually improve not because it knows the rules, but because it learns.
In our example, the opponent is a random agent. As a next step, we could let our Q-learning agent play against another learning agent – or against ourselves.
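As a teaser for that next step, here is a rough, hypothetical sketch of how playing against the trained agent could look. The function play_against_human and its minimal input handling are my own additions, not part of the code above:

# Hypothetical extension: play against the trained agent (X = agent, O = you)
def play_against_human(agent):
    env = TicTacToe()
    state = env.reset()
    done = False
    while not done:
        # Agent moves first, using its learned Q-values (still epsilon-greedy)
        action = agent.choose_action(state, env.available_actions())
        state, reward, done = env.step(action, player=1)
        env.render_gui()                     # close the plot window to continue
        if done:
            break
        # Human enters a move as "row col", e.g. "1 2" (no input validation in this sketch)
        i, j = map(int, input("Your move (row col, 0-2): ").split())
        state, reward, done = env.step((i, j), player=-1)
    print("Game over. Winner:", env.winner)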
Reinforcement learning shows us that machine intelligence is created not only through knowledge or information, but through experience, feedback, and adaptation.