Reinforcement Learning

Reinforcement learning is an interesting concept. In simple words, supervised learning is a kind of micromanagement: at each step, the machine is corrected by measuring how wrong it is. Reinforcement learning instead rewards the right and penalizes the wrong based on the final outcome rather than on each minor step. That can produce far more capable agents. Sounds great, doesn't it? If only our management understood that!

Note that most of the early RL literature talks about playing computer games. That is because games are the easiest to relate to and to model. Rest assured that reinforcement learning has a lot more to offer. Many of the AI breakthroughs we see today have an element of RL in them.

Elements of Reinforcement Learning

This is how an RL application works. An agent interacts with the environment and tries to build a model of the environment based on the rewards that it gets. Here are the details of the different entities involved in reinforcement learning.


The environment is the subset of the world involved in our experiment. It defines the world for the RL agent that we create. The environment interacts with the agent in a continuous loop:

  • Accept actions from the agent
  • Produce the new state for the agent
  • Reward the agent where appropriate


The agent is what we train. It acts based on the state of the environment and what it has learnt so far, then processes the new state and any rewards. In doing so, it learns to achieve the defined goals. It interacts with the environment in a loop:

  • Sense the state of the environment and any rewards offered
  • Select an action based on inputs from the environment
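
The two loops above fit together as one interaction loop. Here is a minimal sketch of it in Python. Everything in it is made up for illustration: `CoinFlipEnv`, `RandomAgent`, and the `reset` / `step` / `act` / `observe` method names are assumptions, not part of any particular library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin."""

    def __init__(self, steps=10, seed=0):
        self.steps = steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is simply the current time step

    def step(self, action):
        # Accept an action, produce the new state, reward where appropriate.
        coin = 1 if self.rng.random() < 0.7 else 0
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.steps
        return self.t, reward, done


class RandomAgent:
    """Hypothetical agent that ignores the state and never learns."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def act(self, state):
        return self.rng.choice([0, 1])

    def observe(self, state, reward):
        pass  # a learning agent would update its knowledge here


def run_episode(env, agent):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)               # select an action
        state, reward, done = env.step(action)  # environment responds
        agent.observe(state, reward)            # sense new state and reward
        total_reward += reward
    return total_reward
```

Calling `run_episode(CoinFlipEnv(), RandomAgent())` plays one episode of ten guesses and returns the total reward collected.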


State represents the situation in the environment that the agent uses to make decisions. The state may not directly point to a win or a loss. It is just a situation that the agent should interpret in order to make further decisions. It is usually denoted by St for the state at time t.


Reward is a scalar value that the environment may return when the agent selects an action. It could be positive or negative. Not every action earns a reward; in that case the reward is simply zero. The reward is defined based on one or more goals to be achieved. The value and details of the reward are defined by the designer. This design is one of the important factors that determine whether and how quickly the training converges.

It is usually denoted by rt for the reward at time t.


Action is what the agent does. An action could be discrete - choosing one of N possible alternatives. Or it could be continuous - choosing from a range of values. The agent chooses the action based on the current state. It is usually denoted by At for the action at time t.


Policy is the mapping from state to action that the agent defines for itself. It could be deterministic or stochastic. A deterministic policy implies that the agent will always take the same action for a given state. A stochastic policy, on the other hand, adds some randomness to this choice. It is denoted by π.
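
To make the difference concrete, here is a small sketch of both kinds of policy. The integer state, the two actions, and the probabilities are all invented for illustration.

```python
import random

# Hypothetical action set for illustration.
ACTIONS = ["left", "right"]


def deterministic_policy(state):
    """Always returns the same action for a given state."""
    return "right" if state >= 0 else "left"


def stochastic_policy(state, rng):
    """Samples an action from a state-dependent probability distribution."""
    p_right = 0.9 if state >= 0 else 0.1
    return "right" if rng.random() < p_right else "left"
```

For the same state, `deterministic_policy` gives the same answer every time, while `stochastic_policy(state, random.Random())` usually agrees with it but sometimes picks the other action.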

Value Function

This is used to assign a value to each state. Over time, the agent uses this function to decide the action with the expectation of a better resulting state. Formally, value is the expected long-term accumulation of reward, starting from state s and following the policy π.

There are two types of value functions:

  • State-Value Function - goodness of a given state s when following the policy π. It is denoted by Vπ(s)
  • Action-Value Function - goodness of taking action a in state s and following the policy π afterwards. It is denoted by Qπ(s, a)
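
Written out, the two functions are commonly defined as expected discounted sums of future rewards. The discount factor γ (between 0 and 1) and the convention that the reward following time t is indexed t+1 are assumptions taken from the standard textbook formulation, not from this article.

```latex
\[
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1} \,\middle|\, S_t = s \right]
\]
\[
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1} \,\middle|\, S_t = s,\ A_t = a \right]
\]
```

The only difference between the two is the extra condition on the first action: Q fixes the action taken at time t, and only after that does the agent follow π.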


Based on all this, the agent tries to develop a model of the environment as a mapping of state, action and reward. It can look like this:

State | Action | Next State | Reward | Probability
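
A model with the columns State, Action, Next State, Reward and Probability can be held as a simple lookup table. The sketch below is one possible way to do it. The state names, the numbers, and the helper `expected_reward` are all hypothetical.

```python
# Each key is (state, action); each value lists the possible outcomes
# as (next_state, reward, probability). All entries are made up.
model = {
    ("s0", "go"): [("s1", 0.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "go"): [("goal", 1.0, 1.0)],
}


def expected_reward(model, state, action):
    """Average immediate reward of taking `action` in `state`."""
    return sum(prob * reward for _, reward, prob in model[(state, action)])
```

The probabilities for each (state, action) pair should add up to 1, since exactly one of the listed outcomes happens.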

Process of Learning

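The agent can follow two main paths to learning: model-free learning and model-based learning. Model-free learning is based purely on trial and error and builds no explicit model of the environment. It can work well on small problems. For complex problems, model-based learning, where the agent also learns or is given a model of the environment and uses it to plan ahead, often becomes attractive.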

Learning could be episodic or continuous. Episodic learning involves learning over several iterations of a task with a well-defined goal. An example is learning to play Tic-Tac-Toe. The game is finite and has a well-defined end. The agent can learn it over multiple episodes of the game.

On the other hand, continuous learning involves a never-ending process, for example playing Pokemon. There is no end to the game. We just gather more and more points based on what we do.

Exploration & Exploitation

This is a very important concept in reinforcement learning. There are people who follow the beaten path, and are too scared of doing something that they are not sure about. And then there are some who want to try newer and newer stuff with the hope that they can learn something more. Neither extreme is good. The first is called exploitation - because the policy just exploits what it has already learnt. The other is called exploration - because the policy tries to explore things that it has not yet seen.

A healthy balance between exploration and exploitation is important to make sure the agent learns well. How the agent strikes this balance largely determines its ability to learn.
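
One common way to strike this balance is the epsilon-greedy rule: explore with a small probability, exploit otherwise. The article does not name a specific method, so treat this as one illustrative option. The function name and the list of per-action value estimates `q_values` are assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: try anything
    # exploit: index of the highest estimated value
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` the agent only follows the beaten path, and with `epsilon=1.0` it only tries new things. Values in between, often lowered gradually during training, give a mix of the two.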

Approaches to Learning

There are two main approaches to implementing reinforcement learning:

  • Value Function Based - Define a model for the value function and create a policy based on that. Over the process of learning, improve this value function in order to perfect the policy
  • Direct Policy Search - Here, we model the policy itself. Then adjust the model parameters to get the highest reward.
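
As an illustration of the value function based approach, here is a small sketch of tabular Q-learning, one well-known algorithm of that family. The article does not prescribe it. The five-state chain world, the reward of 1 at the goal, and all parameter values are assumptions chosen to keep the example tiny.

```python
import random

N_STATES = 5        # states 0..4; state 4 is the goal (hypothetical chain world)
ACTIONS = [-1, +1]  # move left or move right


def step(state, action):
    """Deterministic chain: reward 1.0 only when the goal is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # the value function model
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action choice (ties go to the first action).
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda i: q[state][i])
            next_state, reward, done = step(state, ACTIONS[a])
            # Move the estimate towards reward plus discounted best next value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


def greedy_policy(q):
    """Derive the policy from the value function: best action per state."""
    return [ACTIONS[max((0, 1), key=lambda i: row[i])] for row in q]
```

After `q = train()`, `greedy_policy(q)` should point right (+1) in every non-goal state. The policy is never stored on its own here. It is read off the improved value function, which is the idea behind the first approach. A direct policy search would instead adjust the parameters of the policy itself.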


Things are easier said than done. We have many open questions here:

  • Representation - How do we represent large state spaces and action spaces?
  • Generalization - How do we generalize across states? We cannot train the agent for every possible state. The policy should generalize over multiple states.
  • Exploration / Exploitation dilemma - This is a perennial dilemma that mankind itself has faced. How and when should the agent explore?
  • Temporal credit assignment - This is perhaps the most important problem in reinforcement learning. How do we assign the reward to a previous step? For example, the agent wins or loses at the 100th step. How do we calculate the reward for what it did at the 10th step? Perhaps that was the only right thing it did. Or it could have been the beginning of the failure. How should we decide that?
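
One simple and widely used answer to the credit assignment question is to pass the final reward backwards through the episode, shrinking it by a discount factor at every step. The sketch below shows that idea. The function name and the default `gamma` are assumptions, and this is only one of several ways to assign credit.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return, for every step, the discounted sum of rewards from that step on."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step inherits a discounted share of later rewards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, `discounted_returns([0.0, 0.0, 1.0], gamma=0.5)` gives `[0.25, 0.5, 1.0]`: the step that won gets full credit, and earlier steps get progressively less.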

These are the major problems that researchers have worked on over the years to bring reinforcement learning to the state we see today.