Reinforcement Learning - I

I will be writing this series of blogs specifically for people who have anywhere from zero to decent knowledge of Machine Learning or Deep Learning. I am writing it after going through the lectures of the Google DeepMind course on RL.
Reinforcement Learning is a branch of Machine Learning in which we try to build something ( usually an AI model ) that learns on its own to achieve a specific goal, without any human telling it the most optimal way of achieving that goal.
Going by the formal definition, reinforcement is the action or process of strengthening, so RL is quite literally the strengthening of learning.
Introduction
The first question that comes to your mind would be, “What the fuck is even RL ?”. Let’s answer that first ( I will be using RL as shorthand for Reinforcement Learning ). The first step towards this technology was unknowingly taken by Alan Turing in the paper that gave us the Imitation Game. The idea is this: if we want a program that works exactly like an adult human mind, we would have to build a model that already has all the information needed to make exactly the right decision every time. But there is another way. For an adult mind, we know these three things about it:
- The initial state of mind, say at birth
- The education to which it has been subjected
- Other experiences, which may or may not be described as education
If we have access to all of these, then wouldn’t it be simpler to make a program that can think like a one-year-old baby ( or maybe even younger ), and then put that program through the same set of stages an adult mind goes through, ending up with something close to the adult mind? The assumption in this way of doing things is that there is so little mechanism in the child-brain that something like it can be easily programmed.
Basically, RL is nothing but training a model to achieve a specific goal; but rather than straight up teaching it the best way to do something, we make a model that can learn, let it interact with the environment, and have it figure out which actions result in a reward and which do not.
The Interaction Loop
The entire flow of an RL agent learning involves a few components; we will go through them one by one:
- The Agent
- The Observation
- The Environment
- The Action
- The Goal
The Agent
This is the main program ( the model ) that we are trying to train for achieving a specific goal.
The Observation
This is the input given to the agent once it has taken an action in the environment.
The Environment
This is the place the agent lives in; it is what the agent interacts with and gets feedback from.
The Action
This is what the agent does after receiving the observation and the reward from the environment.
What does RL require us to think about?
- Long-term consequences of taking a specific action, meaning we not only care about the action’s immediate consequences but also about how it will affect the environment in upcoming observations.
- We actively have to gather information about all the actions taken by the agent and their respective consequences. Using this information helps us predict the future, which in turn helps us take into account the long-term consequences of an action.
- The last one is uncertainty, and it is a major thing we have to deal with. We are not training a model merely to perform a specific task; we are training it to learn. So we also have to account for the parts of the environment the agent is not aware of or has not yet been exposed to.
How does an agent interact with the environment?
At each time step t, the agent
- Receives an observation Ot and possibly a reward Rt; the reward can be an internal function of the observation, or it can be given as a separate input.
- Executes an action At, given the observation Ot from the environment
and the environment
- Receives the action At
- On the basis of this action, returns the next observation Ot+1 and a reward Rt+1
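To make the loop concrete, here is a minimal sketch in Python. The Environment and Agent classes and the “left”/“right” actions are made up for illustration; they are not from any RL library.

```python
import random

class Environment:
    """A toy environment: rewards the action "right", ignores everything else."""
    def reset(self):
        return random.random()                      # initial observation O0

    def step(self, action):
        next_observation = random.random()          # O(t+1)
        reward = 1.0 if action == "right" else 0.0  # R(t+1)
        return next_observation, reward

class Agent:
    """A toy agent that picks actions at random (it hasn't learned anything yet)."""
    def act(self, observation):
        return random.choice(["left", "right"])     # A(t) given O(t)

env, agent = Environment(), Agent()
observation = env.reset()
total_reward = 0.0
for t in range(10):
    action = agent.act(observation)             # agent executes A(t)
    observation, reward = env.step(action)      # environment returns O(t+1), R(t+1)
    total_reward += reward
print("Cumulative reward over 10 steps:", total_reward)
```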
The reward Rt is a scalar feedback signal; it represents how well the agent is doing, and the agent’s job is to maximize the cumulative reward.
Gt = Rt+1 + Rt+2 + Rt+3 + …
We call this Gt the return.
The entire goal is to select actions which maximize this return. Some actions may not be rewarding at the current time step but pay off in the long term. For example, deciding whether to refuel a helicopter right now is such a scenario: refuelling it right now may reduce the reward since it adds time, but in the long run it saves time because the helicopter will not get stuck somewhere, having been refuelled earlier.
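As a tiny worked example ( the reward values are made up ), the return is just the sum of the rewards that follow time step t:

```python
# Rewards received after time step t: R(t+1), R(t+2), R(t+3), R(t+4)
rewards = [0.0, 0.0, 1.0, 0.5]

# Undiscounted return G(t) = R(t+1) + R(t+2) + R(t+3) + ...
G_t = sum(rewards)
print(G_t)   # 1.5
```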
Core components of an agent
- State
- Policy
- Value function
- Model
State ( St )
The history is the full sequence of observations, actions, and rewards. We can use this history ( Ht ) to construct the state of the agent from scratch.
The agent’s state is not exactly equal to the history; the agent may not see the full environment state, as some observations are visible to it while others are hidden.
St = Ot = environment state
When the observation is equal to the environment state, the agent has full observability.
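One rough way to picture this ( purely illustrative, not from the lectures ): when the agent cannot see everything, it can build its state as a compressed summary of the recent history, for example a sliding window of the last few observations.

```python
from collections import deque

class AgentState:
    """Keeps only the last few (observation, action, reward) tuples as the state S(t),
    i.e. a lossy compression of the full history H(t)."""
    def __init__(self, window=3):
        self.recent = deque(maxlen=window)

    def update(self, observation, action, reward):
        self.recent.append((observation, action, reward))
        return tuple(self.recent)   # the agent's state after this time step

state = AgentState()
print(state.update(0.7, "left", 0.0))   # ((0.7, 'left', 0.0),)
```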
Markov Decision Processes ( MDPs )
These are useful mathematical frameworks which provide a structured way to define the dynamics of the interaction between the agent and the environment, particularly focusing on the state.
A decision process is Markov if
p( Rt+1, St+1 | St, At ) = p( Rt+1, St+1 | Ht, At )
This means that the state of the agent contains everything we need to know from the history, but it doesn’t mean the state contains everything from the history. We can say that the agent’s state St is a compression of Ht.
Policy ( π )
The policy of the agent describes its behaviour; it is a map from the agent’s state to an action. It basically defines what action the agent should take given a certain state.
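For intuition, here are two toy policies ( the states and actions are invented for this example ): a deterministic one that maps each state to a single action, and a stochastic one that maps each state to a distribution over actions.

```python
import random

# Deterministic policy: pi(s) -> a
deterministic_policy = {"low_fuel": "refuel", "full_fuel": "keep_flying"}

# Stochastic policy: pi(a | s), a probability for each action in each state
stochastic_policy = {"low_fuel": {"refuel": 0.9, "keep_flying": 0.1}}

def sample_action(policy, state):
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return random.choices(actions, weights=probs)[0]

print(deterministic_policy["low_fuel"])              # refuel
print(sample_action(stochastic_policy, "low_fuel"))  # usually refuel
```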
Value Function
This tells you how good it is to be in the agent’s current state; more precisely, how much future reward we can expect from being in this state, i.e. the expected return v(s) = E[ Gt | St = s ].
Discount Factor γ∈[0,1]
If γ = 0, then we only care about the current reward.
If γ is closer to 1, then we care more about future rewards.
This factor helps to maintain the balance between the short-term and the long-term reward.
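Putting the discount factor together with the return, here is a quick sketch ( reward values invented ) of the discounted return Gt = Rt+1 + γ·Rt+2 + γ²·Rt+3 + …, showing how γ shifts the balance:

```python
def discounted_return(rewards, gamma):
    """G(t) = R(t+1) + gamma*R(t+2) + gamma^2*R(t+3) + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [0.0, 0.0, 1.0, 0.5]
print(discounted_return(rewards, gamma=0.0))   # 0.0    -> only the immediate reward counts
print(discounted_return(rewards, gamma=0.9))   # 1.1745 -> future rewards still matter
```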
Model
A model is the agent’s internal representation of the environment. It tries to predict how the environment will change when the agent takes a certain action.
It basically has two components:
- Transition Model ( P ): predicts the probability of reaching state S’ from state S after taking action A
- Reward Model ( R ): predicts the expected immediate reward when taking action A in state S
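Here is a tiny tabular sketch of what such a model could look like, reusing the helicopter refuelling scenario from earlier ( all states, actions, and numbers are invented ):

```python
# Transition model P: P(S' | S, A) as a dictionary of next-state distributions
P = {
    ("low_fuel", "refuel"):      {"full_fuel": 0.95, "low_fuel": 0.05},
    ("low_fuel", "keep_flying"): {"stranded": 0.6,  "low_fuel": 0.4},
}

# Reward model R: expected immediate reward for taking action A in state S
R = {
    ("low_fuel", "refuel"):      -1.0,   # small cost for the detour
    ("low_fuel", "keep_flying"): -10.0,  # big expected cost of getting stuck
}

print(P[("low_fuel", "refuel")])        # predicted next-state probabilities
print(R[("low_fuel", "keep_flying")])   # predicted immediate reward
```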
If you have read till here, you have got an introduction to RL and you are ready for the next blog. In the meantime, you can try to read more about the topics you were introduced to in this one.
Saksham
Software engineer