About Machine Learning ( Part 10: Reinforcement Learning )
Reinforcement Learning (RL) is an exciting field of machine learning where an agent learns how to behave in an environment in order to maximize cumulative rewards. The foundation of RL lies in Markov Decision Processes (MDPs), which provide a mathematical framework for modeling decision-making scenarios. In this blog, we will explore the core concepts of RL and MDPs, and how they interconnect to help an agent learn optimal behavior.
What is Reinforcement Learning?
Reinforcement Learning is a paradigm where an agent interacts with an environment and learns how to take actions that maximize a certain notion of cumulative reward. The agent doesn’t know the optimal actions at the beginning and instead learns through trial and error. Here are the core components of RL:
- Agent: The learner or decision maker.
- Environment: Everything the agent interacts with.
- State ($s$): A description of the current situation in the environment.
- Action ($a$): A choice made by the agent.
- Reward ($r$): A feedback signal from the environment after an action is taken.
- Policy ($\pi$): A strategy that defines the agent’s behavior, mapping states to actions.
- Value Function ($V(s)$): A function that estimates how good it is for the agent to be in state $s$.
In RL, the agent’s objective is to learn a policy that maximizes the total cumulative reward over time.
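To make these components concrete, below is a minimal sketch of the agent–environment interaction loop in plain Python. The `ToyEnvironment` class and `random_policy` function are made-up placeholders (not from any RL library); they only illustrate how states, actions, and rewards flow between the agent and the environment.

```python
import random

# A made-up toy environment: the agent moves along positions 0..4
# and receives a reward of 1 when it reaches position 4.
class ToyEnvironment:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action is -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = (self.state == 4)
        return self.state, reward, done

def random_policy(state):
    # A policy maps a state to an action; this one simply acts at random.
    return random.choice([-1, +1])

env = ToyEnvironment()
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random_policy(state)            # the agent chooses an action
    state, reward, done = env.step(action)   # the environment responds
    total_reward += reward                   # accumulate the reward signal
print("cumulative reward:", total_reward)
```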
Exploration vs. Exploitation
One of the key challenges in RL is balancing exploration and exploitation. Exploration means trying out new actions to discover their outcomes, while exploitation means using the known best actions to maximize reward. Striking the right balance is crucial for learning an optimal policy.
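A simple and widely used way to strike that balance is ε-greedy action selection: with probability ε the agent explores a random action, and otherwise it exploits the action it currently estimates to be best. The sketch below assumes a tabular estimate `q_values[(state, action)]`; the names and numbers are illustrative, not taken from any library or real experiment.

```python
import random

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the
    action with the highest estimated value in this state."""
    if random.random() < epsilon:
        return random.choice(actions)                         # explore
    return max(actions, key=lambda a: q_values[(state, a)])   # exploit

# Illustrative estimates: "right" currently looks better in state "s0".
q_values = {("s0", "left"): 0.2, ("s0", "right"): 0.8}
action = epsilon_greedy(q_values, "s0", ["left", "right"], epsilon=0.1)
print(action)  # "right" roughly 95% of the time
```

A common refinement is to decay ε over training, so the agent explores more at the start and exploits more once its estimates become reliable.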
Markov Decision Process (MDP)
The formal model that defines an RL problem is the Markov Decision Process (MDP). An MDP provides a mathematical framework for modeling decision-making in situations where outcomes are partially random and partially under the control of the agent. It is defined by the following components:
- States ($S$): A set of all possible states in the environment. Each state $s \in S$ represents the environment’s configuration at a specific time.
- Actions ($A$): A set of all possible actions the agent can take. Actions $a \in A$ influence the state of the environment.
- State Transition Probability ($P(s'|s, a)$): The probability of transitioning from state $s$ to state $s'$ by taking action $a$.
- Reward Function ($R(s, a)$): The reward the agent receives after taking action $a$ in state $s$.
- Discount Factor ($\gamma$): A value between 0 and 1 that represents the importance of future rewards relative to immediate rewards.
An agent’s goal is to learn a policy $\pi(s)$ that maps states to actions in such a way that it maximizes the total expected reward.
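These five components can be written down directly as data. The snippet below encodes a small, hand-made two-state MDP (its transition probabilities and rewards are invented for illustration, not taken from any benchmark); the same structure reappears in the value-iteration sketch at the end of the post.

```python
# A tiny hand-made MDP with two states and two actions.
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9  # discount factor

# P[(s, a)] maps each possible next state s' to P(s'|s, a).
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 0.0,
    ("s1", "stay"): 1.0,
    ("s1", "move"): 0.0,
}
```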
The Markov Property
One key feature of MDPs is the Markov Property, which means that the future state depends only on the current state and action, and not on the history of past states. In other words, the process has no memory, and the future is independent of the past given the present state.
This property simplifies the decision-making process for the agent because it only needs to consider the current state when making decisions.
Value Function and Q-Function
To find an optimal policy, we use the value function and Q-function, which estimate the long-term reward the agent can expect.
Value Function
The value function $V(s)$ estimates the expected return (cumulative discounted reward) the agent can achieve starting from a given state $s$ and following its policy. Mathematically, it is defined as:
$$
V(s) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \Big| s_0 = s \right]
$$
Here, $r_t$ represents the reward at time step $t$, and $\gamma$ is the discount factor that determines the importance of future rewards.
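As a quick numerical check of this definition, the snippet below computes the discounted return $\sum_t \gamma^t r_t$ for a single sampled reward sequence; averaging such returns over many episodes that start in $s$ is the Monte Carlo way to estimate $V(s)$. The reward sequence here is made up purely for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one episode's reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# One sampled episode with rewards r_0, r_1, r_2.
rewards = [0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # 0.81, i.e. 0.9**2 * 1.0
```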
Q-Function
The Q-function $Q(s, a)$, or action-value function, estimates the expected return from a state-action pair. It is defined as:
$$
Q(s, a) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \Big| s_0 = s, a_0 = a \right]
$$
In this case, the agent evaluates the expected return of taking action $a$ in state $s$ and then following the optimal policy thereafter.
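One practical consequence of having $Q(s, a)$ is that a policy can be read off from it directly by acting greedily: $\pi(s) = \arg\max_a Q(s, a)$. The tabular values below are invented purely to illustrate the lookup, not learned by any algorithm.

```python
# An invented tabular Q-function: Q[(state, action)] = estimated return.
Q = {
    ("s0", "stay"): 6.0, ("s0", "move"): 8.8,
    ("s1", "stay"): 10.0, ("s1", "move"): 7.9,
}

def greedy_policy(Q, state, actions):
    """pi(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

for s in ["s0", "s1"]:
    print(s, "->", greedy_policy(Q, s, ["stay", "move"]))
# s0 -> move, s1 -> stay
```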
Bellman Equation
The Bellman equation provides a recursive relationship for the value function and the Q-function: it expresses the value of a state in terms of the immediate reward and the discounted value of the states that follow. Written with a max over actions, it becomes the Bellman optimality equation, which characterizes the best value achievable.
For the value function, the Bellman optimality equation is:
$$
V(s) = \max_a \left( R(s, a) + \gamma \sum_{s'} P(s'|s, a) V(s') \right)
$$
For the Q-function, the Bellman optimality equation is:
$$
Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s'|s, a) \max_{a'} Q(s', a')
$$
These recursive equations are fundamental for solving RL problems and finding the optimal policy.
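Turning the Bellman optimality equation for $V(s)$ into an algorithm is straightforward: apply it repeatedly as an update rule until the values stop changing, which is the classic value iteration method. The sketch below runs it on the same tiny hand-made MDP introduced earlier (redefined here so the snippet runs on its own); it is a minimal illustration under those assumptions, not a production implementation.

```python
# The tiny two-state MDP from earlier, redefined so this snippet is standalone.
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}
R = {("s0", "stay"): 0.0, ("s0", "move"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "move"): 0.0}

def backup(s, a, V):
    """One Bellman backup: R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

# Value iteration: V(s) <- max_a backup(s, a, V), repeated until convergence.
V = {s: 0.0 for s in states}
for _ in range(1000):
    new_V = {s: max(backup(s, a, V) for a in actions) for s in states}
    if max(abs(new_V[s] - V[s]) for s in states) < 1e-8:
        V = new_V
        break
    V = new_V

# Extract the greedy policy from the converged value function.
policy = {s: max(actions, key=lambda a: backup(s, a, V)) for s in states}
print(V)       # approximately {'s0': 8.78, 's1': 10.0}
print(policy)  # {'s0': 'move', 's1': 'stay'}
```

In this toy MDP the agent learns to move toward `s1` and then stay there, because `s1` is the only state that pays a reward and the discount factor makes reaching it quickly worthwhile.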