
  • Monte Carlo Methods for Reinforcement Learning

    The Monte Carlo method for reinforcement learning learns directly from episodes of experience gained through interaction with the environment, without any prior knowledge of Markov Decision Process (MDP) transitions.

    What are Monte Carlo Methods?

    In reinforcement learning, Monte Carlo methods are a family of algorithms used to estimate the value of states, actions, or state-action combinations derived from real experiences or sampled trajectories. The main concept of Monte Carlo methods is to utilize repeated random sampling to calculate numerical estimates for values that are difficult to determine analytically.

    Key Concepts in Monte Carlo Methods

    Some of the key terminologies used in Monte Carlo methods are defined below −

    • Episode − The sequence of states, actions, and rewards from the starting state to the terminal state (or until a time limit is reached).
    • Return (G_t) − The total accumulated reward from time step t onward in an episode.
    • Value Function (V) − A function that predicts the expected return for a specific state or state-action pair.

    Monte Carlo Policy Evaluation

    Monte Carlo methods calculate the value of a state or action by averaging the returns observed across multiple episodes. The fundamental process involves simulating one or more episodes and using the results to update the value function.

    For a given state s, the Monte Carlo estimate of the state value V(s) is given by −

    V(s) = (1/N) Σ_{i=1}^{N} G_i

    Where,

    • i is the episode index.
    • s is the state being evaluated.
    • N is the number of episodes in which the state s is visited.
    • Gi is the sum of discounted rewards observed from the i-th episode where state s is visited.

    For every episode, there is a sequence of states and rewards. From these rewards, we can calculate the return, which by definition is the (discounted) sum of all future rewards.

    Step-by-step Process for Estimation

    Following is the description on the step-by-step process for the Monte Carlo method −

    • Create an episode − The agent engages with the environment according to its policy, producing a series of states, actions, and rewards.
    • Determine the return − For every state (or state-action pair), compute the return (total discounted reward) from that point onward.
    • Revise the value estimate − Update the value function by averaging the recorded returns for each state (a minimal code sketch follows this list).
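
    The following is a minimal Python sketch of first-visit Monte Carlo prediction based on these steps. The episode format (a list of (state, reward) pairs, where each reward is the one received after leaving that state) and the discount factor are illustrative assumptions rather than part of the original text.

    from collections import defaultdict

    def first_visit_mc_prediction(episodes, gamma=0.9):
        """Estimate V(s) by averaging first-visit returns over sampled episodes.

        `episodes` is assumed to be an iterable of episodes, each a list of
        (state, reward) pairs in the order they were observed.
        """
        returns_sum = defaultdict(float)
        returns_count = defaultdict(int)
        V = defaultdict(float)

        for episode in episodes:
            G = 0.0
            # Walk the episode backwards, accumulating the discounted return.
            for t in reversed(range(len(episode))):
                state, reward = episode[t]
                G = reward + gamma * G
                # First-visit: only record the return for the earliest visit to `state`.
                if all(s != state for s, _ in episode[:t]):
                    returns_sum[state] += G
                    returns_count[state] += 1
                    V[state] = returns_sum[state] / returns_count[state]
        return V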

    Off-Policy and On-Policy Methods

    In Monte Carlo methods, we distinguish between on-policy and off-policy methods by considering whether the policy used to generate episodes is the same as the policy being improved.

    On-policy methods

    The policy employed to create episodes is identical to the policy that is currently being assessed. This shows that the agent is acquiring knowledge from the experiences produced by its own actions according to the existing policy.

    For example, in First-Visit Monte Carlo, only the first visit to a state within an episode (and the return that follows it) is used to update that state's value estimate.

    Off-Policy methods

    The policy used to generate episodes can be different from the policy being improved. This allows the agent to learn from trajectories generated by any policy, not necessarily the one it is trying to optimize.

    For example, importance sampling can be used to adjust the updates to the value function when episodes are generated under a behavior policy that is different from the target policy.

    Monte Carlo Control

    Monte Carlo control algorithms aim to estimate the value function and, additionally, improve the policy iteratively. This is primarily done using methods like −

    • Monte Carlo Exploration − One of the common challenges in reinforcement learning is to maintain the right balance between exploration and exploitation. Monte Carlo techniques employ an exploration approach like epsilon-greedy or SoftMax to promote exploration during the process of learning from the gathered experience.
    • Monte Carlo Control − The main idea is to improve the policy by improving the action-value function Q(s, a), which represents the expected reward from state s following action a.

    Monte Carlo Control Algorithm

    Following is the algorithm of Monte Carlo control −

    • Initialize Q(s,a) values for all state-action pairs and an initial policy π(s).
    • For each episode, follow policy π to generate a state-action-reward sequence.
    • Calculate the return Gt for each state-action pair (s,a) in the episode.
    • Update Q(s,a) using the return Gt for each state-action pair: Q(s,a) = Q(s,a) + α(Gt − Q(s,a))
    • Improve the policy π(s) by choosing, in each state, the action a that maximizes Q(s,a).

    The process is repeated until the policy converges to an optimal policy.
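
    Below is a minimal sketch of this control loop with an epsilon-greedy policy. The generate_episode helper, the action list, and the step size alpha are hypothetical placeholders; the update matches the incremental rule Q(s,a) = Q(s,a) + α(Gt − Q(s,a)) given above.

    import random
    from collections import defaultdict

    def mc_control(generate_episode, actions, n_episodes=5000,
                   gamma=0.9, alpha=0.1, epsilon=0.1):
        """On-policy Monte Carlo control with an epsilon-greedy policy."""
        Q = defaultdict(float)  # Q[(state, action)] -> estimated return

        def policy(state):
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(state, a)])

        for _ in range(n_episodes):
            episode = generate_episode(policy)  # [(state, action, reward), ...]
            G = 0.0
            for state, action, reward in reversed(episode):
                G = reward + gamma * G
                # Incremental update of Q(s, a) toward the observed return.
                Q[(state, action)] += alpha * (G - Q[(state, action)])
        return Q

    Because the policy always reads the latest Q-values, policy improvement happens implicitly after every episode.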

    Applications of Monte Carlo Methods

    Monte Carlo methods are widely used in various reinforcement learning scenarios, especially in environments whose dynamics are unknown and that require the agent to rely on experience instead of a model. Some of the applications include −

    • Games − Monte Carlo techniques can be used for designing board games like chess, card games, and various other games that require strategic decision-making.
    • Robotics − Monte Carlo techniques assist agents in robotics to develop navigation, manipulation, and other task policies by exploring their surroundings and gaining insights from real-world interactions.
    • Financial Modeling − Monte Carlo techniques can be used to simulate stock prices, determine option values, and optimize portfolios, especially when conventional methods struggle because of the complexity of financial markets.

    Limitations of Monte Carlo Methods

    Certain limitations of Monte Carlo methods that have to be addressed are −

    • High Variance − It can exhibit high variance in estimates since outcomes from different episodes might vary, especially with fewer episodes.
    • Inefficiency with long episodes − It is less efficient with long episodes or postponed rewards, as it needs to wait until an episode concludes to adjust values.
    • Lack of Bootstrap − Unlike alternative techniques, it does not bootstrap (revise estimates using other estimates), which slows down the learning process in extensive state spaces.
  • Actor-Critic Reinforcement Learning Method

    What is Actor-Critic Method?

    The actor-critic algorithm is a reinforcement learning method that integrates policy-based techniques with value-based approaches. The combination aims to overcome the limitations of using each technique individually.

    In the actor-critic framework, an agent (the actor) learns a strategy for making decisions, while a value function (the critic) evaluates the actions the actor takes by assessing their quality and value. This dual role allows the method to balance exploration and exploitation by drawing on the benefits of both policy and value functions.

    Working of Actor-Critic Method

    Actor-critic methods are a combination of policy-based and value-based techniques aimed at learning a policy that maximizes the expected cumulative reward. The two main components are −

    • Actor − This component is responsible for selecting actions based on the current policy. It is usually denoted as πθ(a|s), which represents the probability of taking action a in state s.
    • Critic − The critic assesses the actions taken by the actor by estimating the value function, denoted by V(s), which estimates the expected return.

    Step-by-step Working of Actor-Critic method

    The main idea of actor-critic methods is that the actor chooses an action (the policy), while the critic assesses the quality of that action (the value function), and this feedback is used to improve both the actor's policy and the critic's value estimate. Following is the pseudo-algorithm for actor-critic methods; a minimal code sketch follows the list −

    • Begin by initializing the actor's policy parameters, the critic's value function, the environment, and an initial state s0.
    • Sample {s_t, a_t} using the policy πθ from the actor network.
    • Evaluate the advantage function, often taken to be the TD error δ. In the actor-critic algorithm, the advantage estimate is produced by the critic network.
    • Evaluate the policy gradient.
    • Update the policy parameters (θ).
    • Adjust the critic's weights using a value-based update, with δt serving as the advantage estimate.
    • Repeat the above steps in sequence to find the optimal policy.
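
    Below is a minimal sketch of a single actor-critic update using a linear critic and a softmax actor, with the TD error serving as the advantage estimate. The features function, parameter shapes, and learning rates are illustrative assumptions, not a reference implementation.

    import numpy as np

    def actor_critic_step(theta, w, features, state, action, reward,
                          next_state, done, gamma=0.99,
                          alpha_actor=0.01, alpha_critic=0.1):
        """One actor-critic update with a softmax actor and a linear critic.

        `features(state)` is assumed to return a feature vector; `theta` has
        shape (n_actions, n_features) and `w` has shape (n_features,).
        """
        x, x_next = features(state), features(next_state)

        # Critic: one-step TD error, used here as the advantage estimate.
        v = w @ x
        v_next = 0.0 if done else w @ x_next
        td_error = reward + gamma * v_next - v
        w += alpha_critic * td_error * x

        # Actor: gradient of the log softmax policy, scaled by the TD error.
        prefs = theta @ x
        probs = np.exp(prefs - prefs.max())
        probs /= probs.sum()
        grad_log = -np.outer(probs, x)
        grad_log[action] += x
        theta += alpha_actor * td_error * grad_log
        return td_error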

    Advantages of Actor-Critic Method

    The actor-critic method offers various advantages −

    • Enhanced Sample Efficiency − The integrated approach of actor-critic algorithms results in better sample efficiency, requiring fewer interactions with the environment to reach optimal performance.
    • Faster Convergence − The ability to update the policy and value function simultaneously leads to quicker convergence during training, allowing faster adaptation to the learning task.
    • Flexibility in Action Spaces − Actor-critic models can effectively handle both discrete and continuous action spaces, providing adaptability for various reinforcement learning challenges.
    • Off-Policy Learning − Off-policy variants (such as DDPG and SAC) can learn from previously collected experience, even when it was not generated by the current policy.

    Challenges of Actor-Critic Methods

    Some of the key challenges of actor-critic methods that should be addressed are −

    • High variance − Even with the advantage function, actor-critic methods can still suffer from high variance when estimating the policy gradient. This challenge can be addressed with methods such as Generalized Advantage Estimation (GAE).
    • Training Stability − Simultaneous training of the actor and critic can lead to instability, particularly when there is a poor alignment between the actor’s policy and the critic’s value function. This challenge can be addressed by using techniques like Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO).
    • Bias-Variance Tradeoff − The balance between bias and variance in calculating the policy gradient can occasionally result in slower convergence, making it quite challenging in the field of reinforcement learning.

    Variants of Actor-Critic Methods

    Some of the key variants of actor-critic method include −

    • Advantage Actor-Critic (A2C) − A2C is a variant of the actor-critic algorithm that incorporates the idea of the advantage function.
    • Asynchronous Advantage Actor-Critic (A3C) − This approach employs several agents operating in parallel to improve a collective policy and value function. It also helps in stabilizing training and enhancing efficiency.
    • Soft Actor-Critic (SAC) − SAC is an off-policy approach that integrates entropy regularization to promote exploration. Its objective is to optimize both the expected return and the uncertainty of the policy. The key feature in SAC includes balancing exploration and exploitation by adding an entropy term to the reward.
    • Deep Deterministic Policy Gradient (DDPG) − DDPG is designed for environments that involve continuous action spaces. It merges the actor-critic method with the deterministic policy gradient. The key feature of DDPG includes using a deterministic policy and a target network to stabilize training.
    • Q-Prop − Q-Prop is another Actor-critic approach. In the previous methods, temporal difference learning is implemented to decrease variance allowing bias to increase. Q-Prop decreases the variance in gradient computation without introducing bias by using the idea of control variate.

    Advantage Actor-Critic (A2C)

    A2C (Advantage Actor-Critic) is a variant of the actor-critic algorithm that incorporates the idea of the advantage function. The advantage function evaluates how much better an action is compared to the average action in a given state. By including this information, A2C directs the learning process towards actions that are more valuable than the typical action taken in that state.

    Algorithm of A2C

    The steps involved in the algorithm of A2C are −

    • Initialize the policy parameters, the value function parameters, and the environment.
    • The agent interacts with the environment by taking actions according to the current policy and receives rewards in return.
    • Calculate the Advantage Function A(s,a) based on the current policy and value estimates.
    • Simultaneously update the actor's parameters using the policy gradient and the critic's parameters using a value-based update (a small sketch of the advantage computation follows this list).
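
    As a small illustration, the advantage can be estimated from the critic's value estimates with a one-step TD target; the V callable and variable names below are assumptions made for the sketch.

    def one_step_advantage(V, state, reward, next_state, done, gamma=0.99):
        """A(s, a) ≈ r + γ V(s') − V(s): how much better the sampled action
        turned out than the critic's baseline estimate for the state."""
        bootstrap = 0.0 if done else gamma * V(next_state)
        return reward + bootstrap - V(state)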

    Asynchronous Advantage Actor-Critic (A3C)

    The Asynchronous Advantage Actor-Critic (A3C) algorithm was introduced by Volodymyr Mnih and colleagues in 2016. This is primarily designed for employing asynchronous updates from various parallel agents, which helps in overcoming stability and sample efficiency problems found in traditional reinforcement learning algorithms.

    Algorithm of A3C

    The step-by-step breakdown of the A3C algorithm (a structural sketch of the asynchronous updates follows the list) −

    • Initialize the global network.
    • Launch concurrent workers, each equipped with its individual local network. These workers interact with the environment to collect experiences (state, action, reward, next state).
    • At every step throughout an episode, the worker observes the state, chooses an action based on the current policy, and receives the reward and the next state. Additionally, the worker calculates the advantage function to measure the difference between the predicted value and the actual reward that is expected.
    • Update the critic (value function) and actor (policy).
    • As one worker refreshes its local model, the gradients from several workers are combined asynchronously to modify the global model. This will allow the updates of each worker to be independent, which reduces correlation and leads to more stable and efficient training.
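
    Below is a structural sketch of the asynchronous update pattern only: several worker threads copy a shared global parameter vector, compute gradients locally, and push them back under a lock. The rollout and gradient computation are stubbed out with a random placeholder, so this is not a working A3C agent.

    import threading
    import numpy as np

    global_params = np.zeros(10)   # shared actor-critic parameters
    lock = threading.Lock()        # guards updates to the global model

    def compute_gradient(local_params):
        """Placeholder: run a short rollout with the local copy of the
        parameters and return a gradient estimate."""
        return np.random.randn(*local_params.shape) * 0.01

    def worker(n_updates=100, lr=0.1):
        for _ in range(n_updates):
            # Each worker copies the global parameters into its local network ...
            local_params = global_params.copy()
            grad = compute_gradient(local_params)
            # ... and asynchronously pushes its gradients back to the global model.
            with lock:
                global_params[:] -= lr * grad

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()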

    Advantage Actor-Critic (A2C) Vs. Asynchronous Advantage Actor-Critic (A3C)

    The table below demonstrates the key differences between A2C (Advantage Actor-Critic) and A3C (Asynchronous Advantage Actor-Critic) −

    | Feature | A2C (Advantage Actor-Critic) | A3C (Asynchronous Advantage Actor-Critic) |
    | --- | --- | --- |
    | Parallelism | Uses a single worker (agent) to update the model; hence it is called single-threaded. | Uses multiple workers in parallel to explore the environment; hence it is called multi-threaded. |
    | Model Updates | Updates are performed synchronously, using gradients from the workers. | Updates occur asynchronously among various workers, each independently updating the global model. |
    | Rate of Learning | Standard gradient descent is applied, and the model is updated after every step. | Asynchronous updates enable more frequent and distributed model modifications, which may enhance stability and accelerate convergence. |
    | Stability | Less stable, since synchronous updates can cause the model to converge too fast. | Comparatively more stable as a result of asynchronous updates from various workers, which decrease the correlation among updates. |
    | Efficiency | Less sample-efficient, since only a single worker explores the environment. | More sample-efficient, since multiple workers explore the environment in parallel. |
    | Implementation | Easy to implement. | Relatively complex, since it has to manage multiple agents. |
    | Convergence Speed | Slower convergence, since only one agent is learning from experience at a time. | Faster convergence, due to parallel agents exploring different parts of the environment simultaneously. |
    | Computation Cost | Lower computational cost. | Higher computational cost. |
    | Use Case | Suitable for simpler environments with fewer computational resources. | Suitable for more complex environments where parallelism and more robust exploration are necessary. |
  • SARSA Reinforcement Learning

    SARSA stands for State-Action-Reward-State-Action. It is a modified version of the Q-learning algorithm in which the target policy is the same as the behavior policy. Two consecutive state-action pairs and the immediate reward received by the agent while transitioning from the first state to the next determine the updated Q-value, which is why the method is called SARSA.

    What is SARSA?

    State-Action-Reward-State-Action (SARSA) is a reinforcement learning algorithm named after the sequence of events it learns from. It is an effective 'on-policy' learning technique that lets agents make the right choices in various situations. The main idea behind SARSA is trial and error: the agent takes an action in a situation, observes the consequence, and modifies its plan based on the result.

    For example, assume you are teaching a robot how to walk through a maze. The robot starts at a particular position, which is the ‘state’, and your goal is to find the best route to the end of the maze. The robot has the option to move in various directions during each step, referred to as ‘action’. The robot is given feedback in the form of incentives, either positive or negative, to indicate how well it is performing.

    The update rule used by the SARSA algorithm is as follows −

    Q(s,a) ← Q(s,a) + α (r + γ Q(s′,a′) − Q(s,a))

    Components of SARSA

    Some of the core components of SARSA algorithm include −

    • State(S) − A state is a reflection of the environment, containing all details about the agent’s present situation.
    • Action(A) − An action represents the decision made by the agent based on its present state. The chosen action causes a transition from the current state to the next state; this is how the agent engages with its environment to produce desired results.
    • Reward(R) − Reward is a variable provided by the environment in response to the agent’s action within a specific state. This feedback signal shows the instant outcome of the agent’s choice. Rewards help the agent learn by showing which actions are desirable in certain situations.
    • Next State(S’) − When the agent acts in a specific state, it causes a shift to a different situation called the “next state.” This new state (s’) is the agent’s updated environment.

    Working of SARSA Algorithm

    The SARSA reinforcement learning algorithm allows agents to learn and make decisions in an environment by maximizing cumulative rewards over time using the State-Action-Reward-State-Action sequence. It involves an iterative cycle of engaging with the environment, learning from past events, and improving the decision-making strategy. Let's analyze the working of the SARSA algorithm; a minimal code sketch follows these steps −

    • Q-Table Initialization − SARSA begins by initializing the Q(s,a) values for all state-action pairs to arbitrary values. The starting state (S) is then determined, and the initial action (A) is chosen using an epsilon-greedy policy applied to the current Q-values.
    • Exploration Vs. Exploitation − Exploitation involves using already known values that were estimated previously to improve the chance of receiving rewards in the learning process. On the other hand, exploration involves selecting actions that may result in short-term benefits but could help discover better actions and rewards in the future.
    • Action execution and Feedback − Once the chosen action (A) is executed, it results in a reward (R) and a transition to the next state (S’).
    • Q-Value Update − The Q-value of the current state-action pair is updated based on the received reward and the new state. The next action (A′) is then selected using the updated values in the Q-table.
    • Iteration and Learning − The above steps are repeated until the state terminates. Throughout the process, SARSA updates its Q-values continuously by considering the transitions of state-action-reward. These improvements enhance the algorithm’s capacity to anticipate future rewards for state-action pairs, directing the agent toward making improved decisions in the long run.
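
    Below is a minimal sketch of tabular SARSA following these steps. The env interface (reset() returning a state and step(action) returning next_state, reward, done) and the hyperparameters are illustrative assumptions.

    import random
    from collections import defaultdict

    def sarsa(env, actions, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        """Tabular SARSA with an epsilon-greedy behavior policy."""
        Q = defaultdict(float)

        def epsilon_greedy(state):
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(state, a)])

        for _ in range(n_episodes):
            state = env.reset()
            action = epsilon_greedy(state)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                next_action = epsilon_greedy(next_state)
                # On-policy target: uses the action actually selected next.
                target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state, action = next_state, next_action
        return Q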

    SARSA Vs Q-Learning

    SARSA and Q-learning are two value-based reinforcement learning algorithms. SARSA follows the current policy, whereas Q-learning does not. This difference affects how each algorithm updates its action-value function. Some differences are tabulated below −

    | Feature | SARSA | Q-Learning |
    | --- | --- | --- |
    | Policy Type | On-policy | Off-policy |
    | Update Rule | Q(s,a) ← Q(s,a) + α(r + γ Q(s′,a′) − Q(s,a)) | Q(s,a) ← Q(s,a) + α(r + γ max_a′ Q(s′,a′) − Q(s,a)) |
    | Convergence | Slower convergence to the optimal policy. | Typically faster convergence to the optimal policy. |
    | Exploration vs. Exploitation | Exploration directly influences learning updates. | Exploration policy can differ from the learning policy. |
    | Policy Update | Updates the action-value function based on the action actually taken. | Updates the action-value function assuming the best possible action is always taken. |
    | Use Case | Suitable for environments where stability is important. | Suitable for environments where efficiency is important. |
    | Example | Healthcare, traffic management, personalized learning. | Gaming, robotics, financial trading. |
  • REINFORCE Algorithm

    What is REINFORCE Algorithm?

    The REINFORCE algorithm is a type of policy gradient algorithm in reinforcement learning that is based on Monte Carlo methods. The simple way to implement this algorithm is by employing gradient ascent to enhance a policy by directly increasing the expected cumulative reward. This algorithm does not require a model of the environment and is thus categorized as a model-free method.

    Key Concepts of REINFORCE Algorithm

    Some key concepts that are related to the REINFORCE algorithm are briefly described below −

    • Policy Gradient Methods − The REINFORCE algorithm is a type of policy gradient method, which are algorithms that enhance a policy by following the gradient of the expected cumulative reward.
    • Monte Carlo Methods − The Reinforce Algorithm represents a form of the Monte Carlo method, as it utilizes sampling to evaluate desired quantities.

    How does REINFORCE Algorithm Work?

    The Reinforce Algorithm was introduced by Ronald J. Williams in 1992. The main goal of this algorithm is to maximize the expected cumulative rewards by adjusting the policy parameters. This algorithm trains the agents to make sequential decisions in an environment. The step-by-step breakdown of the Reinforce Algorithm is −

    Episode Sampling

    The algorithm begins by sampling a complete episode of interaction with the environment, where the agent follows its current policy. An episode consists of a sequence of states, actions, and rewards until the state terminates.

    Trajectory of states, actions, and rewards

    The agent records the trajectory of interactions (s1, a1, r1, …, sT, aT, rT), where s represents the states, a the actions taken, and r the rewards received at each step.

    Return Calculations

    The return Gt represents the cumulative discounted reward the agent expects to receive from time t onwards −

    Gt = rt + γ rt+1 + γ² rt+2 + …

    Calculate the Policy Gradient

    Compute the gradient of the expected return with respect to the policy's parameters. To achieve this, it is necessary to calculate the gradient of the log-likelihood of the selected actions.

    Update the policy

    After computing the gradient of the expected cumulative reward, the policy parameters are updated in the direction that increases the expected reward.

    Repeat the above steps over many episodes until the policy converges. Unlike temporal difference methods (Q-learning and SARSA), which update from immediate rewards, REINFORCE enables the agent to learn from the full sequence of states, actions, and rewards.
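
    Below is a minimal sketch of one REINFORCE update with a linear softmax policy. The features and generate_episode helpers are hypothetical, and no baseline or other variance-reduction trick is included.

    import numpy as np

    def reinforce_update(theta, features, n_actions, generate_episode,
                         gamma=0.99, lr=0.01):
        """Run one episode with the current policy and apply the REINFORCE update.

        `features(state)` returns a feature vector; `generate_episode(policy)`
        returns a list of (state, action, reward) tuples (assumed helpers).
        """
        def policy(state):
            prefs = theta @ features(state)
            probs = np.exp(prefs - prefs.max())
            probs /= probs.sum()
            return np.random.choice(n_actions, p=probs)

        episode = generate_episode(policy)

        # Compute returns G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ...
        G, returns = 0.0, []
        for _, _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()

        # Gradient ascent: θ ← θ + lr * G_t * ∇ log π(a_t | s_t).
        for (state, action, _), G_t in zip(episode, returns):
            x = features(state)
            prefs = theta @ x
            probs = np.exp(prefs - prefs.max())
            probs /= probs.sum()
            grad_log = -np.outer(probs, x)
            grad_log[action] += x
            theta += lr * G_t * grad_log
        return theta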

    Advantages of REINFORCE Algorithm

    Some of the advantages of the REINFORCE algorithm are −

    • Model-free − The REINFORCE algorithm doesn’t require a model of the environment, making it appropriate for situations where the environment is not known or hard to model.
    • Simple and intuitive − The algorithm is easy to understand and implement.
    • Able to handle high-dimensional action spaces − In contrast to value-based methods, the REINFORCE algorithm can handle continuous and high-dimensional action spaces.

    Disadvantages of REINFORCE Algorithm

    Some of the disadvantages of REINFORCE algorithm are −

    • High Variance − The REINFORCE Algorithm may experience significant variance in its gradient estimates, which can slow down the learning process and make it unstable.
    • Inefficient sample use − The algorithm needs a fresh set of samples for each gradient calculation, which may be less efficient than techniques that utilize samples multiple times.
  • Q-Learning

    Q-learning is a value-based reinforcement learning algorithm that enables an agent to iteratively learn and improve its behavior over time by taking the correct actions. Correct actions are reinforced with rewards, while bad actions incur penalties.

    What is Q-Learning in Reinforcement Learning?

    Reinforcement learning is a machine learning approach in which a learning agent learns over time to make the right decisions in a certain environment by interacting with it continuously. In the process of learning, the agent experiences various situations in the environment, which are called "states." While in a particular state, the agent performs an action picked from the set of available actions, which fetches rewards or penalties. Over time, the learning agent learns to maximize these rewards so as to behave correctly in any state. Q-learning is one such algorithm, which uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.

    Key Components of Q-Learning

    Q-learning model functions through an iterative process with several components working together to train a model. The iterative process consists of the agent learning through exploration of the environment and continuously updating the model. Q-learning consists of the following components −

    • Agents − The agent is the entity that functions and performs tasks in a given environment.
    • States − The state is a variable that specifies an agent’s current situation within an environment.
    • Actions − The agent’s behavior in a particular state.
    • Rewards − The positive or negative feedback given to the agent in response to its actions.
    • Episodes − An episode ends when the agent reaches a terminal state and can take no further actions.
    • Q-values − The Q-value is the measurement used to assess an action in a specific state.

    How does Q-Learning Work?

    Q-Learning works through trial-and-error experiences to learn the outcome of a particular action carried out by an agent in an environment. The Q-learning process involves modeling optimal behavior by learning an optimal action value function called Q-function. There are two methods to determine the Q-values −

    Temporal Difference

    The temporal difference update determines the Q-value by comparing the agent's current estimate with a target built from the observed reward and the estimated value of the next state-action pair.

    The Temporal Difference can be represented as −

    Q(s,a) ← Q(s,a) + α (r + γ max_a′ Q(s′,a′) − Q(s,a))

    Where,

    • s represents the current state of the agent.
    • a represents the current action picked according to the Q-table.
    • s′ represents the next state the agent moves to.
    • a′ represents the best next action, selected using the current Q-value estimates.
    • r represents the reward observed from the environment in response to the current action.
    • γ (0 < γ ≤ 1) is the discount factor for future rewards.
    • α is the learning rate (step size) used to update the estimate of Q(s,a).

    Bellman Equation

    Mathematician Richard Bellman developed this equation in 1957 as a way to make optimal decisions using recursion. In the context of Q-learning, Bellman’s equation is utilized to determine the value of a specific state and evaluate its relative placement. The optimal state is determined by the state with the highest value.

    The Bellman’s equation can be represented as −

    Q(s,a) = r(s,a) + γ max_a′ Q(s′,a′)

    Where,

    • Q(s,a) represents the expected return for taking action a in state s.
    • r(s,a) represents the reward earned when action a is carried out in state s.
    • γ is the discount factor, which denotes the significance of future rewards.
    • max_a′ Q(s′,a′) represents the maximum Q-value over all possible actions in the next state s′.

    Q-Learning Algorithm

    The Q-learning algorithm involves the agent learning through exploring the environment and updating the Q-table based on the received rewards. Q-table is a repository that stores rewards associated with optimal actions for each state in a given environment. The steps involved in the Q-learning algorithm process include −


    The following are the steps in the Q-learning algorithm; a minimal code sketch follows the list −

    • Initialization of Q-table − The first step involves initializing Q-table to monitor the progress related to actions taken in different states.
    • Observation − The agent observes the present state of the environment.
    • Action − The agent decides to take an action within the environment. After the action completes, the agent observes the reward and whether the action was helpful.
    • Update − After the action is completed, update the Q-table with the results.
    • Repeat − Repeat steps 2-4 until the agent reaches a terminal state, and repeat across episodes until the Q-values converge.
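
    Below is a minimal sketch of tabular Q-learning following these steps. The env interface (reset() returning a state and step(action) returning next_state, reward, done) and the hyperparameters are illustrative assumptions.

    import random
    from collections import defaultdict

    def q_learning(env, actions, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        """Tabular Q-learning with an epsilon-greedy behavior policy."""
        Q = defaultdict(float)

        for _ in range(n_episodes):
            state = env.reset()
            done = False
            while not done:
                # Epsilon-greedy action selection (explore vs. exploit).
                if random.random() < epsilon:
                    action = random.choice(actions)
                else:
                    action = max(actions, key=lambda a: Q[(state, a)])

                next_state, reward, done = env.step(action)
                # Off-policy target: bootstrap from the best action in the next state.
                best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q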

    Advantages of Q-Learning

    The Q-learning approach in reinforcement learning offers various benefits such as −

    • This trial-and-error learning approach resembles how people learn, which makes it intuitive to apply.
    • Because the approach is off-policy, it does not have to follow the behavior policy it uses to explore, which lets it optimize toward the best possible result.
    • This model-free, off-policy approach provides the flexibility to work in environments whose dynamics cannot be explicitly modeled.
    • The agent can correct mistakes during training, and a corrected mistake is unlikely to be repeated.

    Disadvantages of Q-Learning

    The Q-learning approach in reinforcement learning also has some disadvantages such as −

    • It is quite challenging for this approach to find the right balance between trying new actions and sticking with what’s already known.
    • The Q-learning model sometimes exhibits excessive optimism and overestimates how good a particular action or strategy is.
    • Sometimes, it is time-consuming for a Q-learning model to determine the optimal strategy when faced with multiple problem-solving options.

    Applications of Q-Learning

    The Q-learning models can improve processes in various scenarios. Some of the fields include −

    • Gaming − Q-learning algorithms can teach gaming systems to reach expert levels of skill in various games by learning the best strategy to progress.
    • Recommendation Systems − Q-learning algorithms can be utilized to improve recommendation systems, like advertising platforms.
    • Robotics − Q-learning algorithms enable robots to learn how to perform different tasks like manipulating objects, avoiding obstacles, and transporting items.
    • Autonomous Vehicles − Q-learning algorithms are used to train self-driving cars to make driving choices like changing lanes or coming to a halt.
    • Supply Chain − Q-learning models can enhance the efficiency of supply chains by optimizing the path for products to market.
  • Exploitation and Exploration in Machine Learning

    In machine learning, exploration is the act of letting an agent discover new things about the environment, while exploitation is making the agent stick to the knowledge it has already gained. If the agent only exploits past experience, it is likely to get stuck in a suboptimal policy; if it only keeps exploring, it might never settle on a good policy. This is the exploration-exploitation dilemma.

    Exploitation in Machine Learning

    Exploitation is a strategy in which the agent uses its existing knowledge to choose, in a given state, the action expected to maximize the reward. The goal of exploitation is to use what is already known about the environment to achieve the best outcome.

    Key Aspects of Exploitation

    The key aspects of exploitation include −

    • Maximizing reward − The main objective of exploitation is maximizing the expected reward based on the current understanding of the environment. This involves choosing an action based on learned values and rewards that would yield the highest outcome.
    • Improving the efficiency of decision − Exploitation helps in making efficient decisions, especially by focusing on high-reward actions, which reduce the computational cost of performing exploration.
    • Risk Management − Exploitation inherently has a low level of risk as it focuses more on tried and tested actions, reducing the uncertainty associated with less familiar choices.

    Exploration in Machine Learning

    Exploration is an action that enables agents to gain knowledge about the environment or model. The exploration process chooses actions with unpredictable results to collect information about the states and rewards that the performed actions will result in.

    Key Aspects of Exploration

    The key aspects of exploration include −

    • Gaining information − The main objective of exploration is to allow an agent to gather information by performing new actions in a state that can improve understanding of the model or environment.
    • Reduction of Uncertainty − By trying actions whose outcomes are uncertain, exploration reduces the agent's uncertainty about the environment's dynamics and rewards.
    • State space coverage − In specific models that include extensive or continuous state spaces, exploration ensures that a sufficient variety of regions in the state space are visited to prevent learning that is biased towards a small number of experiences.

    Action Selection

    The objective of reinforcement learning is to teach the agent how to behave under various states. The agent learns what actions to perform during the training process using various approaches like greedy action selection, epsilon-greedy action selection, upper confidence bound action selection, etc.

    Exploration Vs. Exploitation Tradeoff

    The idea of using the agent’s existing knowledge versus trying a random action is called the exploitation-exploration trade-off. When the agent explores, it can enhance its existing knowledge and achieve improvement over time. In the other case, if it uses the existing knowledge, it receives a greater reward right away. Since the agent cannot perform both tasks simultaneously, there is a compromise.

    The distribution of resources should depend on the requirements of both streams, alternating based on the current state and the complexity of the learning task.

    Techniques for Balancing Exploration and Exploitation

    The following are some techniques for balancing exploration and exploitation in reinforcement learning −

    Epsilon-Greedy Action Selection

    In reinforcement learning, the agent usually selects an action based on its reward. The agent always chooses the optimal action to generate the maximum reward possible for the given state. In Epsilon-Greedy action selection, the agent uses both exploitation to gain insights from the prior knowledge and exploration to look for new options.


    The epsilon-greedy method usually chooses the action with the highest expected reward. The goal is to achieve a balance between exploration and exploitation: with a small probability ε, the agent explores instead of exploiting what it has learned so far.
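
    A minimal sketch of epsilon-greedy selection over a list of estimated action values; the q_values list and the epsilon value are illustrative.

    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        """Return an action index: explore uniformly with probability epsilon,
        otherwise exploit the action with the highest estimated value."""
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        return max(range(len(q_values)), key=lambda a: q_values[a])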

    Multi-Armed Bandit Frameworks

    The multi-armed bandit framework provides a formal basis for managing the balance between exploration and exploitation in sequential decision-making problems. It offers algorithms that analyze the trade-off between exploration and exploitation under various reward structures and circumstances.

    Upper Confidence Bound

    The Upper Confidence Bound (UCB) is a popular algorithm for balancing exploration and exploitation in reinforcement learning. This algorithm is based on the principle of optimism in the face of uncertainty. It chooses actions that optimize the upper confidence limit of the expected reward. This indicates that it takes into account both the mean reward of an action and the uncertainty or variability in that reward.
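
    A minimal sketch of UCB selection for a bandit-style setting; the counts/values bookkeeping and the exploration constant c are illustrative assumptions.

    import math

    def ucb_action(counts, values, t, c=2.0):
        """Upper Confidence Bound selection: mean reward plus an exploration
        bonus that shrinks as an action is tried more often.

        `counts[a]` is how many times action a has been taken, `values[a]` its
        mean observed reward, and `t` the current time step (t >= 1).
        """
        # Try every action at least once before applying the UCB formula.
        for a, n in enumerate(counts):
            if n == 0:
                return a
        scores = [values[a] + c * math.sqrt(math.log(t) / counts[a])
                  for a in range(len(counts))]
        return max(range(len(scores)), key=lambda a: scores[a])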

  • Reinforcement Learning Algorithms

    Reinforcement learning algorithms are a type of machine learning algorithm used to train agents to make optimal decisions in an environment. Algorithms like Q-learning, policy gradient methods, and Monte Carlo methods are commonly used in reinforcement learning. The goal is to maximize the agent’s cumulative reward over time.

    What is Reinforcement Learning (RL)?

    Reinforcement Learning is a machine learning approach where an agent (software entity) is trained to interpret the environment by performing actions and monitoring the results. For every good action, the agent gets positive feedback, and for every bad action, the agent gets negative feedback. It’s inspired by how animals learn from their experiences, making decisions based on the consequences of their actions.

    Types of Reinforcement Learning Algorithms

    Reinforcement learning algorithms can be categorized into two main types: model-based and model-free. The distinction lies in how they identify the optimal policy π −

    • Model-Based Reinforcement Learning Algorithms − The agent develops a model of the environment and predicts the outcome of actions in various states. After the model is acquired, the agent uses it to strategize and predict future outcomes without directly engaging with the environment. This method will improve the efficiency of decision-making since it doesn’t completely depend on trial and error.
    • Model-Free Reinforcement Learning Algorithms − The agent does not maintain a model of the environment. Instead, it learns a policy or value function directly through interactions with the environment.

    Model-Based Reinforcement Learning Algorithms

    Following are some essential model-based optimization and control algorithms −

    1. Dynamic Programming

    Dynamic programming is a mathematical framework developed to solve complex problems especially in decision making and control scenarios. It has a set of algorithms that can be used to determine optimal policies when the agent knows everything about the environment, i.e., the agent has a perfect model of the surroundings. Some of the algorithms of dynamic programming in reinforcement learning are −

    Value Iteration

    Value Iteration is a dynamic programming algorithm used to compute the optimal policy. It calculates the value of each state under the assumption that the agent will follow the optimal policy. The update rule is based on the Bellman optimality equation −

    V(s) = max_a Σ_{s′,r} P(s′, r | s, a) [ R(s, a, s′) + γ V(s′) ]
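
    Below is a minimal sketch of value iteration on a small, fully known MDP. The representation of the transition model P (lists of (probability, next_state) pairs) and of the reward function R are assumptions made for the sketch.

    def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
        """Iterate the Bellman optimality backup until the values stop changing.

        `P[s][a]` is assumed to be a list of (prob, next_state) pairs and
        `R[s][a][s_next]` the corresponding reward.
        """
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                # Bellman optimality backup: best expected one-step return.
                best = max(
                    sum(p * (R[s][a][s_next] + gamma * V[s_next])
                        for p, s_next in P[s][a])
                    for a in actions
                )
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                return V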

    Policy Iteration

    Policy iteration is a two-step optimization procedure that simultaneously finds an optimal value function VΠ and the corresponding optimal policy Π. The steps involved are −

    • Policy Evaluation − For a given policy, calculate the value function for every state using the Bellman equation.
    • Policy Improvement − Using the current value functions, improve the policy by choosing an action that maximizes the expected return.

    This process alternates between evaluation and improvement until the policy reaches the optimal policy.

    2. Monte Carlo Tree Search (MCTS)

    Monte Carlo Tree Search is a heuristic search algorithm. It uses a tree structure to explore possible actions and states. This makes MCTS particularly useful for decision-making in complex environments.

    Model-Free Reinforcement Learning Algorithms

    Following are the list of some essential model-free algorithms −

    1. Monte Carlo Learning

    Monte Carlo learning is a technique in reinforcement learning that focuses on estimating value functions and improving policies based on real experience instead of depending on a model of the environment's dynamics. Monte Carlo techniques usually average over multiple episodes of interaction with the environment to compute estimates of the expected return.

    2. Temporal Difference Learning

    Temporal difference (TD) learning is a model-free reinforcement learning technique that aims to estimate the value function of a policy using the experience the agent collects during its interactions with the environment. Unlike Monte Carlo methods, which update value estimates only after an entire episode completes, TD learning updates incrementally after each action and reward, making it well suited to online decision-making.

    3. SARSA

    SARSA is an on-policy, model-free reinforcement learning algorithm used for learning the action-value function Q(s,a). It stands for State-Action-Reward-State-Action, and it updates its action-value estimates based on the actions the agent actually takes during its interactions with the environment.

    4. Q-Learning

    Q-learning is a model-free, off-policy reinforcement learning technique used to learn the optimal action-value function Q*(s,a), which gives the maximum expected reward for any state-action pair. The main objective of Q-learning is to discover the best policy by evaluating the optimal action-value function, which represents the maximum expected reward from state s when performing an action a and thereafter following the optimal policy.

    5. Policy Gradient Optimization

    Policy gradient optimization is a class of reinforcement learning algorithms that focuses on directly optimizing the policy instead of learning a value function. These techniques modify the parameters of a parametric policy to optimize the anticipated return. The REINFORCE algorithm is a type of policy gradient algorithm in reinforcement learning that is based on Monte Carlo methods.

    Model-based RL vs Model-free RL

    The key differences between Model-Based and Model-Free Reinforcement Learning algorithms are −

    | Feature | Model-Based RL | Model-Free RL |
    | --- | --- | --- |
    | Learning Process | First learns a model of the environment's dynamics and uses it to predict the outcomes of future actions. | Relies entirely on trial and error, learning policies or value functions directly from observed transitions and rewards. |
    | Efficiency | Can achieve greater sample efficiency, since it can simulate many interactions using the learned model. | Requires more real-world interactions to discover an optimal policy. |
    | Complexity | More complex, since it requires learning and maintaining an accurate model of the environment. | Comparatively simpler, since no model has to be trained. |
    | Utilizing the Environment | Actively develops a model of the environment to predict outcomes and plan further actions. | Does not develop any model of the environment and depends directly on past experience. |
    | Adaptability | Can adapt to changing dynamics in the environment. | Might take longer to adapt, as it relies on previous experience. |
    | Computational Requirements | Typically requires more computational resources due to the complexity of model development and learning. | Typically less computationally demanding, focusing on learning directly from experience. |
  • Machine Learning – Principal Component Analysis

    Principal Component Analysis (PCA) is a popular unsupervised dimensionality reduction technique in machine learning used to transform high-dimensional data into a lower-dimensional representation. PCA is used to identify patterns and structure in data by discovering the underlying relationships between variables. It is commonly used in applications such as image processing, data compression, and data visualization.

    PCA works by identifying the principal components (PCs) of the data, which are linear combinations of the original variables that capture the most variation in the data. The first principal component accounts for the most variance in the data, followed by the second principal component, and so on. By reducing the dimensionality of the data to only the most significant PCs, PCA can simplify the problem and improve the computational efficiency of downstream machine learning algorithms.

    The steps involved in PCA are as follows −

    • Standardize the data − PCA requires that the data be standardized to have zero mean and unit variance.
    • Compute the covariance matrix − PCA computes the covariance matrix of the standardized data.
    • Compute the eigenvectors and eigenvalues of the covariance matrix − PCA then computes the eigenvectors and eigenvalues of the covariance matrix.
    • Select the principal components − PCA selects the principal components based on their corresponding eigenvalues, which indicate the amount of variation in the data explained by each component.
    • Project the data onto the new feature space − PCA projects the data onto the new feature space defined by the selected principal components.

    Example

    Here is an example of how you can implement PCA in Python using the scikit-learn library −

    # Import the necessary libraries
    import numpy as np
    from sklearn.decomposition import PCA

    # Load the iris dataset
    from sklearn.datasets import load_iris
    iris = load_iris()

    # Define the predictor variables (X) and the target variable (y)
    X = iris.data
    y = iris.target

    # Standardize the data
    X_standardized = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

    # Create a PCA object and fit the data
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_standardized)

    # Print the explained variance ratio of the selected components
    print('Explained variance ratio:', pca.explained_variance_ratio_)

    # Plot the transformed data
    import matplotlib.pyplot as plt
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.show()

    In this example, we load the iris dataset, standardize the data, and create a PCA object with two components. We then fit the PCA object to the standardized data and transform the data onto the two principal components. We print the explained variance ratio of the selected components and plot the transformed data using the first two principal components as the x and y axes.

    Output

    When you execute this code, it will produce the following plot as the output −

    (Scatter plot of the iris data projected onto PC1 and PC2, colored by target class)
    Explained variance ratio: [0.72962445 0.22850762]
    

    Advantages of PCA

    Following are the advantages of using Principal Component Analysis −

    • Reduces dimensionality − PCA is particularly useful for high-dimensional datasets because it can reduce the number of features while retaining most of the original variability in the data.
    • Removes correlated features − PCA can identify and remove correlated features, which can help improve the performance of machine learning models.
    • Improves interpretability − The reduced number of features can make it easier to interpret and understand the data.
    • Reduces overfitting − By reducing the dimensionality of the data, PCA can reduce overfitting and improve the generalizability of machine learning models.
    • Speeds up computation − With fewer features, the computation required to train machine learning models is faster.

    Disadvantages of PCA

    Following are the disadvantages of using Principal Component Analysis −

    • Information loss − PCA reduces the dimensionality of the data by projecting it onto a lower-dimensional space, which may lead to some loss of information.
    • Can be sensitive to outliers − PCA can be sensitive to outliers, which can have a significant impact on the resulting principal components.
    • Interpretability may be reduced − Although PCA can improve interpretability by reducing the number of features, the resulting principal components may be more difficult to interpret than the original features.
    • Assumes linearity − PCA assumes that the relationships between the features are linear, which may not always be the case.
    • Requires standardization − PCA requires that the data be standardized, which may not always be possible or appropriate.
  • Machine Learning – Missing Values Ratio

    Missing Values Ratio is a feature selection technique used in machine learning to identify and remove features from the dataset that have a high percentage of missing values. This technique is used to improve the performance of the model by reducing the number of features used for training the model and to avoid the problem of bias caused by missing values.

    The Missing Values Ratio works by computing the percentage of missing values for each feature in the dataset and removing the features that have a missing value percentage above a certain threshold. This is done because features with a high percentage of missing values may not be useful for predicting the target variable and can introduce bias into the model.

    The steps involved in implementing Missing Values Ratio are as follows −

    • Compute the percentage of missing values for each feature in the dataset.
    • Set a threshold for the percentage of missing values for the features.
    • Remove the features that have a missing value percentage above the threshold.
    • Use the remaining features for training the machine learning model.

    Example

    Here is an example of how you can implement Missing Values Ratio in Python −

    # Importing the necessary libraries
    import numpy as np

    # Load the diabetes dataset
    diabetes = np.genfromtxt(r'C:\Users\Leekha\Desktop\diabetes.csv', delimiter=',')

    # Define the predictor variables (X) and the target variable (y)
    X = diabetes[:, :-1]
    y = diabetes[:, -1]

    # Compute the percentage of missing values for each feature
    missing_percentages = np.isnan(X).mean(axis=0)

    # Set the threshold for the percentage of missing values for the features
    threshold = 0.5

    # Find the indices of the features with a missing value percentage
    # above the threshold
    high_missing_indices = [i for i, percentage in enumerate(missing_percentages)
                            if percentage > threshold]

    # Remove the high missing value features from the dataset
    X_filtered = np.delete(X, high_missing_indices, axis=1)

    # Print the shape of the filtered dataset
    print('Shape of the filtered dataset:', X_filtered.shape)

    The above code performs Missing Values Ratio on the diabetes dataset and removes the features that have a missing value percentage above the threshold.

    Output

    When you execute this code, it will produce the following output −

    Shape of the filtered dataset: (769, 8)
    

    Advantages of Missing Value Ratio

    Following are the advantages of using Missing Value Ratio −

    • Saves computational resources − With fewer features, the computational resources required to train machine learning models are reduced.
    • Improves model performance − By removing features with a high percentage of missing values, the Missing Value Ratio can improve the performance of machine learning models.
    • Simplifies the model − With fewer features, the model can be easier to interpret and understand.
    • Reduces bias − By removing features with a high percentage of missing values, the Missing Value Ratio can reduce bias in the model.

    Disadvantages of Missing Value Ratio

    Following are the disadvantages of using Missing Value Ratio −

    • Information loss − The Missing Value Ratio can lead to information loss because it removes features that may contain important information.
    • Affects non-missing data − Removing a feature with many missing values also discards its observed (non-missing) values, which may themselves carry useful information.
    • Impact on the dependent variable − Removing features with a high percentage of missing values can sometimes have a negative impact on the dependent variable, particularly if the features are important for predicting the dependent variable.
    • Selection bias − The Missing Value Ratio may introduce selection bias if it removes features that are important for predicting the dependent variable.
  • Machine Learning – Low Variance Filter

    Low Variance Filter is a feature selection technique used in machine learning to identify and remove low variance features from the dataset. This technique is used to improve the performance of the model by reducing the number of features used for training the model and to remove the features that have little or no discriminatory power.

    The Low Variance Filter works by computing the variance of each feature in the dataset and removing the features that have a variance below a certain threshold. This is done because features with low variance have little or no discriminatory power and are unlikely to be useful for predicting the target variable.

    The steps involved in implementing Low Variance Filter are as follows −

    • Compute the variance of each feature in the dataset.
    • Set a threshold for the variance of the features.
    • Remove the features that have a variance below the threshold.
    • Use the remaining features for training the machine learning model.

    Example

    Here is an example to implement Low Variance Filter in Python −

    # Importing the necessary libraries
    import pandas as pd
    import numpy as np

    # Load the diabetes dataset
    diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\diabetes.csv')

    # Define the predictor variables (X) and the target variable (y)
    X = diabetes.iloc[:, :-1].values
    y = diabetes.iloc[:, -1].values

    # Compute the variance of each feature
    variances = np.var(X, axis=0)

    # Set the threshold for the variance of the features
    threshold = 0.1

    # Find the indices of the low variance features
    low_var_indices = np.where(variances < threshold)

    # Remove the low variance features from the dataset
    X_filtered = np.delete(X, low_var_indices, axis=1)

    # Print the shape of the filtered dataset
    print('Shape of the filtered dataset:', X_filtered.shape)

    Output

    When you execute this code, it will produce the following output −

    Shape of the filtered dataset: (768, 8)
    

    Advantages of Low Variance Filter

    Following are the advantages of using Low Variance Filter −

    • Reduces overfitting − The Low Variance Filter can help reduce overfitting by removing features that do not contribute much to the prediction of the target variable.
    • Saves computational resources − With fewer features, the computational resources required to train machine learning models are reduced.
    • Improves model performance − By removing low variance features, the Low Variance Filter can improve the performance of machine learning models.
    • Simplifies the model − With fewer features, the model can be easier to interpret and understand.

    Disadvantages of Low Variance Filter

    Following are the disadvantages of using Low Variance Filter −

    • Information loss − The Low Variance Filter can lead to information loss because it removes features that may contain important information.
    • Ignores relationships with the target − The Low Variance Filter evaluates each feature in isolation; a low-variance feature can still be strongly related to the target variable, so informative predictors may be removed.
    • Impact on the dependent variable − Removing low variance features can sometimes have a negative impact on the dependent variable, particularly if the features are important for predicting the dependent variable.
    • Selection bias − The Low Variance Filter may introduce selection bias if it removes features that are important for predicting the dependent variable.