Reinforcement learning is an interesting concept. In simple words, supervised learning is a kind of micromanagement: at each step, the machine is corrected by measuring how wrong it is. Reinforcement learning, in contrast, rewards the right and penalizes the wrong based on the final outcome rather than on each minor step. That produces far more capable agents that can work wonders. Sure, that sounds great. Wish our management understood that!
Note that most of the early RL literature talks about playing computer games, because they are the easiest to relate to and to model. Rest assured, reinforcement learning has a lot more to offer. Most of the AI miracles we see today have an element of RL in them.
This is how an RL application works: an agent interacts with the environment and tries to build a model of it based on the rewards it gets. Here are the details of the different entities involved in reinforcement learning.
The environment is the subset of the world involved in our experiment. It defines the world for the RL agent that we create. The interaction between the environment and the agent forms a continuous loop.
The agent is what we try to train. It acts based on the state of the environment and what it has learnt so far, then processes the new state and reward. In doing so, it learns to achieve the defined goals. It interacts with the environment in this loop.
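The loop described above can be sketched in a few lines of Python. This is a minimal sketch with a made-up toy environment; the `reset()`/`step()` method names mirror common RL library conventions but are not any specific API.

```python
import random

class CoinFlipEnv:
    """Toy environment: the agent guesses a coin flip; reward +1 if correct."""

    def reset(self):
        self.coin = random.choice([0, 1])
        return 0  # a single dummy state

    def step(self, action):
        reward = 1 if action == self.coin else 0
        self.coin = random.choice([0, 1])  # the world moves on
        return 0, reward, False            # new state, reward, done flag

# The agent-environment loop: act, observe the new state and reward, repeat.
env = CoinFlipEnv()
state = env.reset()
total_reward = 0
for t in range(10):
    action = random.choice([0, 1])          # agent selects an action
    state, reward, done = env.step(action)  # environment responds
    total_reward += reward                  # agent accumulates reward
print(total_reward)
```

A real agent would use the observed states and rewards to improve its choice of action over time; here it acts at random purely to show the shape of the loop.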
The state represents the situation in the environment that the agent uses to make decisions. The state may not directly indicate a win or a loss; it is just a situation that the agent must interpret in order to make further decisions. It is usually denoted by St for the state at time t.
The reward is a scalar value that the environment may return when the agent selects an action. It can be positive or negative, and not every action carries a reward (zero reward). The reward is defined based on one or more goals to be achieved; its value and details are set by the designer, and this is one of the important factors that determines whether the training converges. It is usually denoted by rt for the reward at time t.
An action is what the agent does. It can be discrete (choosing one of N possible alternatives) or continuous (choosing from a range of values). The action is computed based on the current state. It is usually denoted by At for the action at time t.
A policy is the mapping from states to actions that the agent defines for itself. It can be deterministic or stochastic: a deterministic policy always takes the same action for a given state, whereas a stochastic policy adds some randomness to the choice. It is denoted by π.
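The two flavors of policy can be contrasted with a small sketch. The states ("cold", "hot") and actions ("heat", "cool") here are made up for illustration.

```python
import random

def deterministic_policy(state):
    # Always maps a given state to one fixed action.
    return {"cold": "heat", "hot": "cool"}[state]

def stochastic_policy(state):
    # Maps a state to a probability distribution over actions,
    # then samples an action from it.
    probs = {"cold": {"heat": 0.9, "cool": 0.1},
             "hot":  {"heat": 0.1, "cool": 0.9}}[state]
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights)[0]

print(deterministic_policy("cold"))  # always "heat"
print(stochastic_policy("cold"))     # usually "heat", sometimes "cool"
```

The randomness in a stochastic policy is one natural way for the agent to keep trying actions it would not otherwise pick.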
The value function assigns a value to each state. Over time, the agent uses this function to choose the action that it expects to lead to a better state. Formally, the value of a state s is the expected long-term accumulation of reward, starting from s and following the policy π.
There are two types of value functions: the state-value function V(s), which scores how good it is to be in state s, and the action-value function Q(s, a), which scores how good it is to take action a in state s.
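The "expected long-term accumulation of reward" can be made concrete on a tiny example. This sketch computes V(s) on a made-up deterministic chain s0 → s1 → s2 (terminal), where only reaching s2 pays a reward, with a discount factor gamma weighting near-term reward over distant reward.

```python
gamma = 0.9                              # discount factor
rewards = {"s0": 0, "s1": 0, "s2": 1}    # reward received in each state
next_state = {"s0": "s1", "s1": "s2"}    # s2 is terminal

def V(s):
    # Value = reward now + discounted value of the state that follows.
    if s not in next_state:              # terminal state: no future reward
        return rewards[s]
    return rewards[s] + gamma * V(next_state[s])

print(V("s2"))  # 1.0
print(V("s1"))  # 0 + 0.9 * 1.0  = 0.9
print(V("s0"))  # 0 + 0.9 * 0.9  = 0.81
```

Notice how the value of the reward "leaks backwards" through the chain: states that merely lead toward the goal become valuable too, which is exactly what lets the agent pick good actions long before any reward arrives.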
Based on all this, the agent tries to develop a model of the environment as a mapping between states, actions, and rewards.
The agent can follow two main paths to learning: model-free learning and model-based learning. Model-free learning is purely trial and error: the agent learns values or a policy directly from experience, without building an explicit model of how the environment behaves. Model-based learning builds such a model and plans with it, which can make learning more sample-efficient on complex problems.
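Model-free trial and error can be sketched with tabular learning on a made-up two-armed bandit (one state, two actions with hidden payout probabilities). The agent never models the environment; it only nudges its action-value estimates toward each sampled reward.

```python
import random

random.seed(0)
TRUE_REWARD = {"left": 0.2, "right": 0.8}  # hidden from the agent
Q = {"left": 0.0, "right": 0.0}            # the agent's value estimates
alpha = 0.1                                # learning rate

for step in range(1000):
    action = random.choice(list(Q))        # try an action at random
    # Environment returns reward 1 with the arm's hidden probability.
    reward = 1 if random.random() < TRUE_REWARD[action] else 0
    # Trial-and-error update: move the estimate toward the observed reward.
    Q[action] += alpha * (reward - Q[action])

print(max(Q, key=Q.get))  # the agent discovers "right" pays more
```

After enough trials the estimates settle near the true payout rates, even though the agent never learned (or needed) the underlying probabilities themselves.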
Learning can be episodic or continuing. Episodic learning happens over several iterations of a well-defined task. An example is learning to play Tic-Tac-Toe: each game is finite and has a well-defined end, so the agent can learn over multiple episodes of the game.
Continuing learning, on the other hand, is a never-ending process. Think of playing Pokemon: there is no end to the game; we just gather more and more points based on what we do.
This is a very important concept in reinforcement learning. Some people follow the beaten path and are too scared to try anything they are not sure about. Others keep trying newer and newer things in the hope of learning something more. Neither extreme is good. The first is called exploitation, because the policy just exploits what it has already learnt. The second is called exploration, because the policy tries things it has not yet seen.
A healthy balance between exploration and exploitation is important to make sure the agent learns well; this balance largely defines the agent's ability to learn.
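The most common way to strike this balance is epsilon-greedy selection: with a small probability epsilon the agent explores a random action, otherwise it exploits the best action it currently knows. The Q-values below are made up for illustration.

```python
import random

def epsilon_greedy(Q, epsilon):
    if random.random() < epsilon:
        return random.choice(list(Q))  # explore: pick any action
    return max(Q, key=Q.get)           # exploit: pick the best-known action

Q = {"left": 0.2, "right": 0.8}        # the agent's current estimates
random.seed(1)
picks = [epsilon_greedy(Q, epsilon=0.1) for _ in range(1000)]
print(picks.count("right") / len(picks))  # roughly 0.95: mostly exploits
```

With epsilon = 0.1 the agent exploits about 90% of the time while still sampling every action occasionally; in practice epsilon is often decayed over training, exploring heavily at first and exploiting more as the estimates become trustworthy.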
There are two main approaches to implementing Reinforcement Learning.
Things are easier said than done. We have many open questions here:
These are the major problems that researchers have answered over the years to bring Reinforcement Learning to the state we see today.