  • Machine Learning – Scatter Matrix Plot

    Scatter Matrix Plot is a graphical representation of the relationship between multiple variables. It is a useful tool in machine learning for visualizing the correlation between features in a dataset. This plot is also known as a Pair Plot, and it is used to identify the correlation between two or more variables in a dataset.

    A Scatter Matrix Plot displays a scatter plot for each pair of features in a dataset. Each scatter plot represents the relationship between two variables. The diagonal of the matrix typically shows the distribution of each individual variable instead.

    Python Implementation of Scatter Matrix Plot

    Here, we will implement the Scatter Matrix Plot in Python. For our example given below, we will be using the well-known Iris dataset, loaded through Seaborn.

    The Iris dataset is a classic dataset in machine learning. It contains four features: Sepal Length, Sepal Width, Petal Length, and Petal Width. The dataset has 150 samples, and each sample is labeled as one of three species: Setosa, Versicolor, or Virginica.

    We will use the Seaborn library to implement the Scatter Matrix Plot. Seaborn is a Python data visualization library that is built on top of the Matplotlib library.

    Example

    Below is the Python code to implement the Scatter Matrix Plot −

    import seaborn as sns
    import pandas as pd
    import matplotlib.pyplot as plt
    
    # load iris dataset
    iris = sns.load_dataset('iris')
    
    # create scatter matrix plot
    sns.pairplot(iris, hue='species')
    
    # show plot
    plt.show()

    In this code, we first import the necessary libraries: Seaborn, Pandas, and Matplotlib's pyplot. Then, we load the Iris dataset using the sns.load_dataset() function, which fetches the Iris dataset bundled with the Seaborn library.

    Next, we create the Scatter Matrix Plot using the sns.pairplot() function. The hue parameter is used to specify the column in the dataset that should be used for color encoding. In this case, we use the species column to color the points according to the species of each sample.

    Finally, we use the plt.show() function to display the plot.

    Output

    The output of this code will be a Scatter Matrix Plot that shows the scatter plots of each pair of features in the Iris dataset.

    (Figure: scatter matrix plot of the Iris dataset)

    Notice that each scatter plot is color-coded according to the species of each sample.

  • Machine Learning – Correlation Matrix Plot

    A correlation matrix plot is a graphical representation of the pairwise correlation between variables in a dataset. The plot consists of a matrix of scatterplots and correlation coefficients, where each scatterplot represents the relationship between two variables, and the correlation coefficient indicates the strength of the relationship. The diagonal of the matrix usually shows the distribution of each variable.

    The correlation coefficient is a measure of the linear relationship between two variables and ranges from -1 to 1. A coefficient of 1 indicates a perfect positive correlation, where an increase in one variable is associated with an increase in the other variable. A coefficient of -1 indicates a perfect negative correlation, where an increase in one variable is associated with a decrease in the other variable. A coefficient of 0 indicates no correlation between the variables.

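    To see these values concretely, here is a minimal hedged sketch that computes Pearson correlation coefficients with NumPy; the arrays below are invented purely for illustration −

    import numpy as np
    
    x = np.array([1, 2, 3, 4, 5])
    y_pos = np.array([2, 4, 6, 8, 10])   # increases perfectly with x
    y_neg = np.array([10, 8, 6, 4, 2])   # decreases perfectly with x
    
    # np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is the pairwise coefficient
    print(np.corrcoef(x, y_pos)[0, 1])   # 1.0  -> perfect positive correlation
    print(np.corrcoef(x, y_neg)[0, 1])   # -1.0 -> perfect negative correlation
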
    Python Implementation of Correlation Matrix Plots

    Now that we have a basic understanding of correlation matrix plots, let’s implement them in Python. For our example, we will be using the Iris flower dataset from Sklearn, which contains measurements of the sepal length, sepal width, petal length, and petal width of 150 iris flowers, belonging to three different species – Setosa, Versicolor, and Virginica.

    Example

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    
    iris = load_iris()
    data = pd.DataFrame(iris.data, columns=iris.feature_names)
    target = iris.target
    
    corr = data.corr()
    sns.set(style='white')
    
    # mask the upper triangle so each coefficient is drawn only once
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    
    f, ax = plt.subplots(figsize=(11, 9))
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0, annot=True,
       square=True, linewidths=.5, cbar_kws={"shrink": .5})
    plt.show()

    Output

    This code will produce a correlation matrix plot of the Iris dataset, with each square representing the correlation coefficient between two variables.

    (Figure: correlation matrix plot of the Iris dataset)

    From this plot, we can see that the variables ‘sepal width (cm)’ and ‘petal length (cm)’ have a moderate negative correlation (-0.37), while the variables ‘petal length (cm)’ and ‘petal width (cm)’ have a strong positive correlation (0.96). We can also see that the variable ‘sepal length (cm)’ has a strong positive correlation (0.87) with the variable ‘petal length (cm)’.

  • Machine Learning – Box and Whisker Plots

    A boxplot is a graphical representation of a dataset that displays the five-number summary of the data – the minimum value, the first quartile, the median, the third quartile, and the maximum value.

    The boxplot consists of a box with whiskers extending from the top and bottom of the box.

    • The box represents the interquartile range (IQR) of the data, which is the range between the first and third quartiles.
    • The whiskers extend from the top and bottom of the box to the highest and lowest values that are within 1.5 times the IQR.

    Any values that fall outside this range are considered outliers and are represented as points beyond the whiskers.

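    The five-number summary and the whisker limits can also be computed directly before plotting. The following is a minimal sketch using NumPy; the sample values are invented −

    import numpy as np
    
    data = np.array([10, 15, 18, 20, 22, 25, 28, 30, 32, 35, 60])
    
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    
    # whiskers extend to the most extreme points within 1.5 * IQR of the box
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    outliers = data[(data < lower_limit) | (data > upper_limit)]
    
    print(q1, median, q3, iqr)   # quartiles and interquartile range
    print(outliers)              # here, 60 falls outside the upper whisker
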
    Python Implementation of Box and Whisker Plots

    Now that we have a basic understanding of boxplots, let’s implement them in Python. For our example, we will be using the Iris dataset from Sklearn, which contains measurements of the sepal length, sepal width, petal length, and petal width of 150 iris flowers, belonging to three different species – Setosa, Versicolor, and Virginica.

    To start, we need to import the necessary libraries and load the dataset.

    Example

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.datasets import load_iris
    iris = load_iris()
    data = iris.data
    target = iris.target
    

    Next, we can create a boxplot of the sepal length for each of the three iris species using the Seaborn library.

    plt.figure(figsize=(7.5,3.5))
    sns.boxplot(x=target, y=data[:,0])
    plt.xlabel('Species')
    plt.ylabel('Sepal Length (cm)')
    plt.show()

    Output

    This code will produce a boxplot of the sepal length for each of the three iris species, with the x-axis representing the species and the y-axis representing the sepal length in centimeters.

    (Figure: boxplots of sepal length by species)

    From this boxplot, we can see that the setosa species has a shorter sepal length compared to the versicolor and virginica species, which have a similar median and range of sepal lengths. Additionally, we can see that there are no outliers in the setosa species, but there are a few outliers in the versicolor and virginica species.

  • Machine Learning – Density Plots

    A density plot is a type of plot that shows the probability density function of a continuous variable. It is similar to a histogram, but instead of using bars to represent the frequency of each value, it uses a smooth curve to represent the probability density function. The x-axis represents the range of values of the variable, and the y-axis represents the probability density.

    Density plots are useful for identifying patterns in data, such as skewness, modality, and outliers. Skewness refers to the degree of asymmetry in the distribution of the variable. Modality refers to the number of peaks in the distribution. Outliers are data points that fall outside of the range of typical values for the variable.

    Python Implementation of Density Plots

    Python provides several libraries for data visualization, such as Matplotlib, Seaborn, Plotly, and Bokeh. For our example given below, we will use Seaborn to implement density plots.

    We will use the breast cancer dataset from the Sklearn library for this example. The breast cancer dataset contains information about the characteristics of breast cancer cells and whether they are malignant or benign. The dataset has 30 features and 569 samples.

    Example

    Let’s start by importing the necessary libraries and loading the dataset −

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.datasets import load_breast_cancer
    data = load_breast_cancer()

    Next, we will create a density plot of the mean radius feature of the dataset −

    plt.figure(figsize=(7.2,3.5))
    sns.kdeplot(data.data[:,0], fill=True)
    plt.xlabel('Mean Radius')
    plt.ylabel('Density')
    plt.show()

    In this code, we have used the kdeplot() function from Seaborn to create a density plot of the mean radius feature of the dataset. We have set the fill parameter to True to shade the area under the curve. We have also added labels to the x and y axes using the xlabel() and ylabel() functions.

    Output

    The resulting density plot shows the probability density function of mean radius values in the dataset. We can see that the data is roughly normally distributed, with a peak around 12-14.

    (Figure: density plot of mean radius created with the kdeplot() function)

    Density Plot with Multiple Data Sets

    We can also create a density plot with multiple data sets to compare their probability density functions. Let’s create density plots of the mean radius feature for both the malignant and benign samples −

    Example

    plt.figure(figsize=(7.5,3.5))
    sns.kdeplot(data.data[data.target==0,0], fill=True, label='Malignant')
    sns.kdeplot(data.data[data.target==1,0], fill=True, label='Benign')
    plt.xlabel('Mean Radius')
    plt.ylabel('Density')
    plt.legend()
    plt.show()

    In this code, we have used the kdeplot() function twice to create two density plots of the mean radius feature, one for the malignant samples and one for the benign samples. We have set the fill parameter to True to shade the area under the curve, and we have added labels to the plots using the label parameter. We have also added a legend to the plot using the legend() function.

    Output

    On executing this code, you will get the following plot as the output −

    (Figure: density plots of mean radius for malignant and benign samples)

    The resulting density plot shows the probability density functions of mean radius values for both the malignant and benign samples. We can see that the probability density function for the malignant samples is shifted to the right, indicating a higher mean radius value.

  • Machine Learning – Histograms

    A histogram is a bar graph-like representation of the distribution of a variable. It shows the frequency of occurrences of each value of the variable. The x-axis represents the range of values of the variable, and the y-axis represents the frequency or count of each value. The height of each bar represents the number of data points that fall within that value range.

    Histograms are useful for identifying patterns in data, such as skewness, modality, and outliers. Skewness refers to the degree of asymmetry in the distribution of the variable. Modality refers to the number of peaks in the distribution. Outliers are data points that fall outside of the range of typical values for the variable.

    Python Implementation of Histograms

    Python provides several libraries for data visualization, such as Matplotlib, Seaborn, Plotly, and Bokeh. For the example given below, we will use Matplotlib to implement histograms.

    We will use the breast cancer dataset from the Sklearn library for this example. The breast cancer dataset contains information about the characteristics of breast cancer cells and whether they are malignant or benign. The dataset has 30 features and 569 samples.

    Example

    Let’s start by importing the necessary libraries and loading the dataset −

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    data = load_breast_cancer()

    Next, we will create a histogram of the mean radius feature of the dataset −

    plt.figure(figsize=(7.2,3.5))
    plt.hist(data.data[:,0], bins=20)
    plt.xlabel('Mean Radius')
    plt.ylabel('Frequency')
    plt.show()

    In this code, we have used the hist() function from Matplotlib to create a histogram of the mean radius feature of the dataset. We have set the number of bins to 20 to divide the data range into 20 intervals. We have also added labels to the x and y axes using the xlabel() and ylabel() functions.

    Output

    The resulting histogram shows the distribution of mean radius values in the dataset. We can see that the data is roughly normally distributed, with a peak around 12-14.

    (Figure: histogram of mean radius)

    Histogram with Multiple Data Sets

    We can also create a histogram with multiple data sets to compare their distributions. Let’s create histograms of the mean radius feature for both the malignant and benign samples −

    Example

    plt.figure(figsize=(7.2,3.5))
    plt.hist(data.data[data.target==0,0], bins=20, alpha=0.5, label='Malignant')
    plt.hist(data.data[data.target==1,0], bins=20, alpha=0.5, label='Benign')
    plt.xlabel('Mean Radius')
    plt.ylabel('Frequency')
    plt.legend()
    plt.show()

    In this code, we have used the hist() function twice to create two histograms of the mean radius feature, one for the malignant samples and one for the benign samples. We have set the transparency of the bars to 0.5 using the alpha parameter so that they don’t overlap completely. We have also added a legend to the plot using the legend() function.

    Output

    On executing this code, you will get the following plot as the output −

    (Figure: histograms of mean radius for malignant and benign samples)

    The resulting histogram shows the distribution of mean radius values for both the malignant and benign samples. We can see that the distributions are different, with the malignant samples having a higher frequency of higher mean radius values.

  • Data Visualization in Machine Learning

    Data visualization is an important aspect of machine learning (ML) as it helps to analyze and communicate patterns, trends, and insights in the data. Data visualization involves creating graphical representations of the data, which can help to identify patterns and relationships that may not be apparent from the raw data.

    What is Data Visualization?

    Data visualization is a graphical representation of data and information. With the help of data visualization, we can see what the data looks like and what kind of correlation is held by the attributes of the data. It is the fastest way to see if the features correspond to the output.

    Importance of Data Visualization in Machine Learning

    Data visualization plays a significant role in machine learning and can be used in many ways. Here are some of the ways data visualization is used in machine learning −

    • Exploring Data − Data visualization is an essential tool for exploring and understanding data. Visualization can help to identify patterns, correlations, and outliers and can also help to detect data quality issues such as missing values and inconsistencies.
    • Feature Selection − Data visualization can help to select relevant features for the ML model. By visualizing the data and its relationship with the target variable, you can identify features that are strongly correlated with the target variable and exclude irrelevant features that have little predictive power.
    • Model Evaluation − Data visualization can be used to evaluate the performance of the ML model. Visualization techniques such as ROC curves, precision-recall curves, and confusion matrices can help to understand the accuracy, precision, recall, and F1 score of the model (see the sketch after this list).
    • Communicating Insights − Data visualization is an effective way to communicate insights and results to stakeholders who may not have a technical background. Visualizations such as scatter plots, line charts, and bar charts can help to convey complex information in an easily understandable format.

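    As an illustration of the model-evaluation point above, the following hedged sketch draws a confusion matrix with scikit-learn; the label vectors are made up purely for demonstration −

    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay
    
    # made-up true and predicted labels for a binary classifier
    y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0, 0, 1, 1, 0]
    
    # build and draw the confusion matrix from the two label vectors
    disp = ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
    disp.ax_.set_title('Confusion Matrix')
    plt.show()
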
    Popular Python Libraries for Data Visualization

    Following are the most popular Python libraries for data visualization in Machine learning. These libraries provide a wide range of visualization techniques and customization options to suit different needs and preferences.

    1. Matplotlib

    Matplotlib is one of the most popular Python packages used for data visualization. It is a cross-platform library for making 2D plots from data in arrays. It provides an object-oriented API that helps in embedding plots in applications using Python GUI toolkits such as PyQt, WxPython, or Tkinter. It can also be used in Python and IPython shells, Jupyter notebooks, and web application servers.

    2. Seaborn

    Seaborn is an open source, BSD-licensed Python library that provides a high-level API for data visualization in Python.

    3. Plotly

    Plotly is a Montreal-based technical computing company involved in the development of data analytics and visualization tools such as Dash and Chart Studio. It has also developed open source graphing Application Programming Interface (API) libraries for Python, R, MATLAB, JavaScript, and other programming languages.

    4. Bokeh

    Bokeh is a data visualization library for Python. Unlike Matplotlib and Seaborn, which are also Python packages for data visualization, Bokeh renders its plots using HTML and JavaScript. Hence, it proves to be extremely useful for developing web-based dashboards.

    Types of Data Visualization

    Data visualization for machine learning data can be classified into two different categories as follows –

    • Univariate Plots
    • Multivariate Plots

    (Figure: data visualization techniques)

    Let’s understand each of the above two types of data visualization plots in detail.

    Univariate Plots: Understanding Attributes Independently

    The simplest type of visualization is single-variable or univariate visualization. With the help of univariate visualization, we can understand each attribute of our dataset independently. The following are some techniques in Python to implement univariate visualization −

    • Histograms
    • Density Plots
    • Box and Whisker Plots

    We will learn the above techniques in detail in their respective chapters. Let’s look at these techniques in brief.

    Histograms

    Histograms group the data into bins and are the fastest way to get an idea about the distribution of each attribute in the dataset. The following are some of the characteristics of histograms −

    • They provide us with a count of the number of observations in each bin created for visualization.
    • From the shape of the bins, we can easily observe the distribution, i.e., whether it is Gaussian, skewed, or exponential.
    • Histograms also help us to spot possible outliers.

    Example

    The code below is an example of a Python script that creates a histogram. Here, we will use Matplotlib’s hist() function on a NumPy array to generate the histogram and plot it.

    import matplotlib.pyplot as plt
    import numpy as np
    
    # Generate some random data
    data = np.random.randn(1000)
    
    # Create the histogram
    plt.hist(data, bins=30, color='skyblue', edgecolor='black')
    plt.xlabel('Values')
    plt.ylabel('Frequency')
    plt.title('Histogram Example')
    plt.show()

    Output

    (Figure: histogram of 1000 randomly generated values)

    Because of random number generation, you may notice a slight difference between the outputs when you execute the above program.

    Density Plots

    A Density Plot is another quick and easy technique for getting a view of each attribute’s distribution. It is like a histogram, but with a smooth curve drawn through the top of each bin. We can think of density plots as abstracted histograms.

    Example

    In the following example, the Python script will generate a Density Plot for the distribution of the sepal length attribute of the iris dataset.

    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # Load a sample dataset
    df = sns.load_dataset("iris")
    
    # Create the density plot
    sns.kdeplot(data=df, x="sepal_length", fill=True)
    
    # Add labels and title
    plt.xlabel("Sepal Length")
    plt.ylabel("Density")
    plt.title("Density Plot of Sepal Length")
    
    # Show the plot
    plt.show()

    Output

    (Figure: density plot of sepal length)

    From the above output, the difference between Density plots and Histograms can be easily understood.

    Box and Whisker Plots

    Box and Whisker Plots, also called boxplots for short, are another useful technique to review the distribution of each attribute. The following are the characteristics of this technique −

    • It is univariate in nature and summarizes the distribution of each attribute.
    • It draws a line for the middle value, i.e., the median.
    • It draws a box between the 25th and 75th percentiles.
    • It also draws whiskers, which give us an idea about the spread of the data.
    • The dots outside the whiskers signify outlier values − points that lie more than 1.5 times the IQR beyond the edges of the box.

    Example

    In the following example, the Python script will generate a Box and Whisker Plot for a small sample dataset.

    import matplotlib.pyplot as plt
    
    # Sample data
    data = [10, 15, 18, 20, 22, 25, 28, 30, 32, 35]
    
    # Create a figure and axes
    fig, ax = plt.subplots()
    
    # Create the boxplot
    ax.boxplot(data)
    
    # Set the title
    ax.set_title('Box and Whisker Plot')
    
    # Show the plot
    plt.show()

    Output

    (Figure: box and whisker plot)

    Multivariate Plots: Interaction Among Multiple Variables

    Another type of visualization is multi-variable or multivariate visualization. With the help of multivariate visualization, we can understand the interaction between multiple attributes of our dataset. The following are some techniques in Python to implement multivariate visualization −

    • Correlation Matrix Plot
    • Scatter Matrix Plot

    Correlation Matrix Plot

    Correlation is an indication of how changes in one variable relate to changes in another. We can plot a correlation matrix to show which variables have a high or low correlation with respect to the other variables.

    Example

    In the following example, the Python script will generate a correlation matrix plot. It can be generated with the help of the corr() function on a Pandas DataFrame and plotted as a heatmap with Seaborn.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # Sample data
    data = {'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1], 'C': [2, 3, 1, 4, 5]}
    df = pd.DataFrame(data)
    
    # Calculate the correlation matrix
    c_matrix = df.corr()
    
    # Create a heatmap
    sns.heatmap(c_matrix, annot=True, cmap='coolwarm')
    plt.title("Correlation Matrix")
    plt.show()

    Output

    (Figure: correlation matrix heatmap)

    From the above output of the correlation matrix, we can see that it is symmetrical, i.e., the bottom left is the same as the top right.

    Scatter Matrix Plot

    A Scatter matrix plot shows how much one variable is affected by another, or the relationship between them, with the help of dots in two dimensions. Scatter plots are very much like line graphs in that they use horizontal and vertical axes to plot data points.

    Example

    In the following example, the Python script will generate and plot the Scatter matrix for the Iris dataset. It can be generated with the help of the pandas.plotting.scatter_matrix() function on a Pandas DataFrame and plotted with pyplot.

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn import datasets
    
    # Load the iris dataset
    iris = datasets.load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    
    # Create the scatter matrix plot
    pd.plotting.scatter_matrix(df, diagonal='hist', figsize=(8, 7))
    plt.show()

    Output

    (Figure: scatter matrix plot of the Iris dataset)

    In the next few chapters, we will look at some of the popular and widely used visualization techniques available in machine learning.

  • Supervised vs. Unsupervised Learning

    Supervised learning and Unsupervised learning are two popular approaches in Machine Learning. The simplest way to distinguish between supervised and unsupervised learning is the type of training dataset and the way the models are trained. However, there are other differences, which are further discussed in the chapter.

    What is Supervised Learning?

    Supervised Learning is a machine learning approach that uses labeled datasets to train the model, making it ideal for tasks like classifying data or predicting output. Supervised learning is categorized into two types −

    1. Classification

    Classification uses algorithms to predict categorical values, such as determining whether an email is spam or not, or whether a statement is true or false. The algorithm learns to map each input to its corresponding output label. Some common algorithms include K-Nearest Neighbors, Random Forests, and Decision Trees.

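    As a brief illustration of classification, the minimal sketch below trains a K-Nearest Neighbors classifier on the Iris dataset using scikit-learn; the split ratio and the choice of k are arbitrary assumptions −

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # learn a mapping from the four flower measurements to the species label
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    print(knn.score(X_test, y_test))   # accuracy on held-out samples
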
    2. Regression

    Regression is a statistical approach to analyze the relationship between data points. It can be used to forecast house prices based on features like location and size, or to estimate future sales. Some common algorithms include linear regression and polynomial regression. (Logistic regression, despite its name, is generally used for classification.)

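    Below is a minimal regression sketch with scikit-learn; the size-versus-price data is invented purely for illustration −

    import numpy as np
    from sklearn.linear_model import LinearRegression
    
    # invented data: house size in square feet vs. price in $1000s
    sizes = np.array([[800], [1000], [1200], [1500], [1800]])
    prices = np.array([150, 180, 210, 260, 300])
    
    # fit a straight line through the size-price points
    model = LinearRegression()
    model.fit(sizes, prices)
    
    # predict the price of an unseen 1,300 sq ft house
    print(model.predict([[1300]]))
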
    What is Unsupervised Learning?

    Unsupervised Learning is a machine learning approach used to train models on raw and unlabeled data. This approach is often used to identify patterns in the data without human supervision. Unsupervised learning models are used for the following tasks −

    1. Clustering

    This task uses unsupervised learning models to group data points into clusters based on their similarities. A popular algorithm for this task is K-means clustering.

    2. Association

    This is another type of unsupervised learning that uses pre-defined rules to group data points into a cluster. It is commonly used in Market Basket Analysis, and the main algorithm behind this task is the Apriori algorithm.

    3. Dimensionality Reduction

    This method of unsupervised learning is used to reduce the size of a dataset by removing features that are not necessary, without compromising the integrity of the original data.

    Differences between Supervised and Unsupervised Learning

    The table below shows some key differences between supervised and unsupervised machine learning −

    Basis | Supervised Learning | Unsupervised Learning
    Definition | Algorithms are trained on data where every input has a corresponding output. | Algorithms find patterns in data that has no predefined labels.
    Goal | To predict or classify based on input features. | To discover hidden patterns, structures, and relationships.
    Input Data | Labeled: input data with corresponding output labels. | Unlabeled: input data is raw and unlabeled.
    Human Supervision | Needs human supervision to train the model. | Does not need any kind of supervision to train the model.
    Tasks | Regression, Classification | Clustering, Association, and Dimensionality Reduction
    Complexity | Methods are computationally simpler. | Methods are computationally more complex.
    Algorithms | Linear Regression, K-Nearest Neighbors, Decision Trees, Naive Bayes, SVM | K-Means Clustering, DBSCAN, Autoencoders
    Accuracy | Generally highly accurate, since labels guide the training. | Generally less accurate, since there are no labels to verify against.
    Applications | Image classification, Sentiment Analysis, Recommendation systems | Customer Segmentation, Anomaly Detection, Recommendation Engines, NLP

    Supervised or Unsupervised Learning – Which to Choose?

    Choosing the right approach is crucial and will also determine the efficiency of the outcome. To decide on which learning approach is best, the following things should be considered −

    • Dataset − Evaluate the data, whether it is labeled or unlabeled. You will also need to assess whether you have the time, resources and expertise to support labeling.
    • Goals − It is also important to define the problem you are trying to solve and the kind of solution you need. It might be classification, discovering new patterns or insights in the data, or creating a predictive model.
    • Algorithm − Review the algorithm by making sure that it matches the required dimensions, such as the attributes and number of features. Also, evaluate whether the algorithm can support the volume of the data.

    Semi-supervised Learning

    Semi-supervised learning is the safest medium if you are in a dilemma about choosing between supervised and unsupervised learning. This learning approach is a combination of both supervised and unsupervised learning, where a minor part of the dataset used is labeled and the major part is unlabeled. This is ideal when you have a high volume of data that makes it difficult to identify relevant features.

  • Reinforcement Learning

    What is Reinforcement Learning?

    Reinforcement learning is a machine learning approach where an agent (software entity) is trained to interpret the environment by performing actions and monitoring the results. For every good action, the agent gets positive feedback and for every bad action the agent gets negative feedback. It’s inspired by how animals learn from their experiences, making decisions based on the consequences of their actions.

    The following diagram shows a typical reinforcement learning model −

    (Figure: reinforcement learning model)

    In the above diagram, the agent is represented in a particular state. The agent takes action in an environment to achieve a particular task. As a result of the performed task, the agent receives feedback as a reward or punishment.

    How Does Reinforcement Learning Work?

    In reinforcement learning, there would be an agent that we want to train over a period of time so that it can interact with a specific environment. The agent will follow a set of strategies for interacting with the environment and then after observing the environment it will take actions regarding the current state of the environment. The agent learns how to make decisions by receiving rewards or penalties based on its actions.

    The working of reinforcement learning can be understood by the approach of a master chess player.

    • Exploration − Just like a chess player considers various possible moves and their outcomes, the agent also explores different actions to understand their effects and learns which actions lead to better results.
    • Exploitation − The chess player also uses intuition based on past experiences to make decisions that seem right. Similarly, the agent uses knowledge gained from previous experiences to make the best choices.

    Key Elements of Reinforcement Learning

    Beyond the agent and the environment, one can identify four main sub-elements of a reinforcement learning system −

    • Policy − It defines the learning agent’s way of behaving at a given time. A policy is a mapping from perceived states of the environment to actions to be taken when in those states.
    • Reward Signal − It defines the goal of a reinforcement learning problem. It is a numerical score sent to the agent by the environment. This reward signal defines what the good and bad events are for the agent.
    • Value function − It specifies what is good in the long run. The value is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
    • Model − Models are used for planning, which means deciding on a course of action by considering possible future situations before they are actually experienced.

    Markov Decision Processes (MDP) provide a mathematical framework for modeling decision-making in an environment with states, actions, rewards, and transition probabilities. Reinforcement learning uses MDP to understand how an agent should act to maximize rewards and to find the best strategies for decision making.

    Markov Decision Processes (MDP)

    Reinforcement learning uses the mathematical framework of Markov Decision Processes (MDP) to define the interaction between the learning agent and the environment. Some important concepts and components of MDP are −

    • States (S) − Represent all the situations in which an agent can find itself.
    • Actions (A) − The choices available to the agent in the given states.
    • Transition Probabilities (P) − The likelihood of moving from one state to another as a result of a specific action.
    • Rewards (R) − Feedback received after transitioning to a new state due to an action, indicating the outcome’s desirability.
    • Policy (π) − A strategy that defines the action to take in each state to achieve a reward.

    Steps in Reinforcement Learning Process

    Here are the major steps involved in reinforcement learning methods −

    • Step 1 − First, we need to prepare an agent with some initial set of strategies.
    • Step 2 − Then observe the environment and its current state.
    • Step 3 − Next, select the optimal policy with regard to the current state of the environment and perform the appropriate action.
    • Step 4 − Now, the agent receives a corresponding reward or penalty in accordance with the action taken in the previous step.
    • Step 5 − Now, we can update the strategies if required.
    • Step 6 − At last, repeat steps 2-5 until the agent learns and adopts the optimal policy. A minimal sketch of this loop is given below.

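    The following is a minimal, self-contained tabular Q-learning sketch of this loop. The toy corridor environment and all hyperparameters are our own illustrative assumptions, not part of any standard library −

    import numpy as np
    
    # toy 1-D corridor: states 0..4; reaching state 4 yields reward 1 and ends the episode
    n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
    Q = np.zeros((n_states, n_actions))   # step 1: initial strategy (all-zero Q-table)
    alpha, gamma, epsilon = 0.1, 0.9, 0.2
    
    rng = np.random.default_rng(0)
    for episode in range(500):
        state = 0                                     # step 2: observe the initial state
        while state != n_states - 1:
            if rng.random() < epsilon:                # explore a random action
                action = int(rng.integers(n_actions))
            else:                                     # step 3: exploit the current policy
                action = int(np.argmax(Q[state]))
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == n_states - 1 else 0.0   # step 4: feedback
            # step 5: update the strategy with the Q-learning rule
            Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
            state = next_state                        # step 6: repeat until learned
    
    print(Q)   # the learned table should prefer action 1 (right) in every state
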
    Types of Reinforcement Learning

    There are two types of Reinforcement learning:

    • Positive Reinforcement − When an agent performs an action that is desirable or leads to a good outcome, it receives a reward, which increases the likelihood of that action being repeated.
    • Negative Reinforcement − When an agent performs an action to avoid a negative outcome, the negative stimulus is removed. For example, if a robot is programmed to avoid an obstacle and successfully navigates away from it, the threat associated with the obstacle is removed, and the robot is more likely to repeat that avoidance behavior in the future.

    Types of Reinforcement Learning Algorithms

    There are various algorithms used in reinforcement learning, such as Q-learning, policy gradient methods, Monte Carlo methods, and many more. All these algorithms can be classified into two broad categories −

    • Model-free Reinforcement Learning − It is a category of reinforcement learning algorithms that learns to make decisions by interacting with the environment directly, without creating a model of the environment’s dynamics. The agent performs different actions multiple times to learn the outcomes and creates a strategy (policy) that optimizes its reward points. This is ideal for changing, large or complex environments.
    • Model-based Reinforcement Learning − This category of reinforcement learning algorithms involves creating a model of the environment’s dynamics to make decisions and improve performance. This approach is ideal when the environment is static and well-defined, and where real-world environment testing is difficult.

    Advantages of Reinforcement Learning

    Some of the advantages of reinforcement learning are −

    • Reinforcement learning doesn’t require pre-defined instructions and human intervention.
    • Reinforcement learning models can adapt to a wide range of environments, including static and dynamic ones.
    • Reinforcement learning can be used to solve a wide range of problems, including decision making, prediction, and optimization.
    • A reinforcement learning model gets better as it gains experience and fine-tunes itself.

    Disadvantages of Reinforcement Learning

    Some of the disadvantages of reinforcement learning are −

    • Reinforcement learning depends on the quality of the reward function; if it is poorly designed, the model may never improve its performance.
    • The design and tuning of reinforcement learning systems can be complex and require expertise.

    Applications of Reinforcement Learning

    Reinforcement learning has a wide range of applications across various fields. Some major applications are −

    1. Robotics

    Reinforcement learning is generally concerned with decision-making in unpredictable environments. It is the most widely used approach for complicated tasks, such as replicating human behavior, manipulation, navigation, and locomotion. This approach also allows robots to adapt to new environments through trial and error.

    2. Natural Language Processing (NLP)

    In Natural Language Processing (NLP), reinforcement learning is used to enhance the performance of chatbots by managing complex dialogues and improving user interactions. Additionally, this learning approach is also used to train models for tasks like summarization.

    Reinforcement Learning Vs. Supervised learning

    Supervised learning and reinforcement learning are two distinct approaches in machine learning. In supervised learning, a model is trained on a dataset that consists of inputs and their corresponding outputs for predictive analysis. In reinforcement learning, an agent interacts with an environment, learning to make decisions by receiving feedback in the form of rewards or penalties, aiming to maximize cumulative rewards. Another difference between the two approaches lies in the tasks they are ideal for: supervised learning suits tasks with clear, structured outputs, whereas reinforcement learning suits complex decision-making tasks where an optimal strategy must be learned.

  • Semi-Supervised Learning

    Semi-supervised learning is a type of machine learning that is neither fully supervised nor fully unsupervised. The semi-supervised learning algorithms basically fall between supervised and unsupervised learning methods.

    In semi-supervised learning, machine learning algorithms are trained on datasets that contain both labeled and unlabeled data. Semi-supervised learning is generally used when we have a huge set of unlabeled data available. In any supervised learning algorithm, the available data has to be manually labeled, which can be quite an expensive process. In contrast, the unlabeled data used in unsupervised learning has limited applications. Hence, semi-supervised learning algorithms were developed, as they can provide a perfect balance between the two.

    What is Semi-Supervised Learning?

    Semi-supervised learning is a machine learning approach or technique that works as a combination of supervised and unsupervised learning. In semi-supervised learning, the machine learning algorithms are trained on a small amount of labeled data and a large amount of unlabeled data.

    The goal of semi-supervised learning is to develop an algorithm that divides the entire data into clusters, such that data points closer to each other most likely share the same output label, and then classifies each cluster into a predefined category.

    We can summarize semi-supervised learning as −

    • a machine learning approach or technique that
    • combines supervised learning and unsupervised learning
    • to train ML models by using labeled and unlabeled data
    • to perform classification and regression related tasks.

    Semi-supervised Learning Vs. Supervised Learning

    The primary difference between supervised and semi-supervised learning is the dataset used to train the model. In supervised learning, the model is trained on a dataset in which every input is paired with a predefined label, i.e., the features and their corresponding target labels are provided. This allows for more accurate prediction or classification. In semi-supervised learning, the dataset consists of a minor amount of labeled data and a major amount of unlabeled data. The model is initially trained on the labeled data and then uses these insights to process the unlabeled data and discover additional patterns.

    Semi-supervised Learning Vs. Unsupervised Learning

    Unsupervised learning trains a model only on an unlabeled dataset, aiming to identify groups with common features within the dataset. In contrast, semi-supervised learning uses a mix of labeled data (small amount) and unlabeled data (large amount). In unsupervised learning, the data points are grouped into clusters based on common features, whereas semi-supervised learning can be more efficient because each cluster can be assigned a predefined label, since the model trains on labeled data along with unlabeled data.

    When to Choose Semi-Supervised Learning?

    Semi-supervised learning is the right choice in situations where obtaining a sufficient amount of labeled data is difficult and expensive, but gathering unlabeled data is much easier. In such scenarios, neither fully supervised nor fully unsupervised learning methods will provide accurate outcomes, and this is where semi-supervised learning methods can be implemented.

    How Does Semi-Supervised Learning Work?

    Semi-supervised learning generally uses a small supervised learning component, i.e., a small amount of pre-labeled, annotated data, and a large unsupervised learning component, i.e., lots of unlabeled data, for training.

    In machine learning, we can follow any of the following approaches for implementing semi-supervised learning methods −

    • The first and simpler approach is to build a supervised model on the small amount of labeled and annotated data, and then apply it to the large amount of unlabeled data to get more labeled samples. Now, train the model on them and repeat the process.
    • The second approach needs some extra effort. In this approach, we first use unsupervised methods to cluster similar data samples, annotate these groups, and then use a combination of this information to train the model.

    In semi-supervised learning, the unlabeled data used should be relevant to the task the model is trained to perform. In mathematical terms, the input data’s distribution p(x) must contain information about the posterior distribution p(y|x), which represents the probability of a given data point (x) belonging to a certain class (y).

    There are certain assumptions that underpin the working of semi-supervised learning, such as −

    • Smoothness Assumption
    • Cluster Assumption
    • Low Density Separation
    • Manifold Assumptions

    Let us have a brief understanding about the above listed assumptions.

    Smoothness Assumption

    This assumption states that if two data points x1 and x2 in a high-density region (belonging to the same cluster) are close, then the corresponding output labels y1 and y2 should also be close. On the other hand, if the data points are in a low-density region, their outputs need not be close.

    Cluster Assumption

    Cluster assumption states that when data points are in the same cluster, they are likely to be of the same class. Unlabeled data should aid in finding the boundary of each cluster more accurately using clustering algorithms. Additionally, the labeled data points should be used to assign a class for each cluster.

    Low Density Separation

    The Low Density Separation assumption states that the decision boundary should lie in a low-density region. Consider digit recognition, for instance: one wants to distinguish a handwritten digit 0 from a digit 1. A sample point taken exactly from the decision boundary would be between a 0 and a 1, most likely a digit looking like a very elongated zero. But the probability that someone wrote this weird digit is very small.

    Manifold Assumptions

    This assumption forms the basis of several semi-supervised learning methods. It states that in a higher-dimensional input space, there are several lower-dimensional manifolds on which all data points lie, and data points with the same label are located on the same manifold.

    Semi-supervised Learning Techniques

    Semi-supervised learning uses several techniques to bring out the best from both labeled and unlabeled data for accurate outcomes. Some popular techniques include −

    Self-training

    Self-training is a process in which any supervised method, like classification or regression, can be modified to work in a semi-supervised manner, taking insights from both labeled and unlabeled data.

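    A hedged sketch of self-training with scikit-learn's SelfTrainingClassifier follows; scikit-learn marks unlabeled samples with -1 in the target vector, and the fraction of labels hidden here is an arbitrary choice for illustration −

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC
    
    X, y = load_iris(return_X_y=True)
    
    # hide roughly 70% of the labels to simulate a mostly unlabeled dataset
    rng = np.random.default_rng(42)
    y_partial = y.copy()
    y_partial[rng.random(len(y)) < 0.7] = -1   # -1 marks a sample as unlabeled
    
    # the wrapped classifier is retrained as confident pseudo-labels are added
    model = SelfTrainingClassifier(SVC(probability=True))
    model.fit(X, y_partial)
    print(accuracy_score(y, model.predict(X)))   # checked against the full ground truth
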
    Co-training

    This approach is an improved version of the self-training approach, where the idea was to make use of different “views” on the data that is to be classified. It is ideally used for web content classification, where a web page can be represented both by the text on the page and by the hyperlinks referring to it. Unlike the typical process, the co-training approach trains two individual classifiers on the two views of the data to improve learning performance.

    Graph based label propagation

    One of the most efficient ways to run semi-supervised learning, this technique models data as a graph where nodes represent data points and edges represent similarities between them; the label propagation algorithm is then applied. Labeled data points propagate their labels through the graph, influencing the neighboring nodes. The labels are iteratively updated, allowing the model to assign labels to unlabeled nodes.

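    Scikit-learn ships a graph-based implementation of this idea. The sketch below is a minimal example, again using -1 to mark unlabeled points and hiding an arbitrary share of the Iris labels −

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.semi_supervised import LabelPropagation
    
    X, y = load_iris(return_X_y=True)
    
    # keep roughly 10% of the labels; -1 marks the rest as unlabeled
    rng = np.random.default_rng(0)
    y_partial = np.where(rng.random(len(y)) < 0.9, -1, y)
    
    # labels spread along the similarity graph to neighboring unlabeled points
    model = LabelPropagation()
    model.fit(X, y_partial)
    print((model.transduction_ == y).mean())   # fraction of points labeled correctly
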
    Challenges of Semi-supervised Learning

    Semi-supervised learning requires only a small amount of labeled data alongside a large set of unlabeled data, reducing the cost and need for manual labeling. However, there are a few challenges that have to be addressed −

    • Quality of data − The efficiency of semi-supervised learning depends on the quality of the unlabeled data. If the unlabeled data is noisy or irrelevant, it might lead to incorrect predictions and poor performance.
    • Variation in the data − Semi-supervised learning models are prone to distribution shifts between the labeled and unlabeled data. For example, if a model is trained on a labeled dataset of clear, high-quality images while the unlabeled data contains images captured by surveillance cameras, it will be difficult to generalize from the labeled to the unlabeled images, impacting the outcomes.

    Applications of Semi-supervised Learning

    Semi-supervised machine learning finds its application in text classification, image classification, speech analysis, anomaly detection, etc., where the general goal is to classify an entity into a predefined category. Semi-supervised algorithms assume that the data can be divided into discrete clusters and that data points closer to each other are more likely to share the same output label.

    Some popular applications of semi-supervised learning are −

    • Speech Recognition − Labeling audio data is a time-consuming task. Semi-supervised techniques improve speech models by combining unlabeled audio data with a limited amount of transcribed speech, which enhances accuracy in recognizing spoken language.
    • Web Content Classification − With billions of websites, manually labeling content is impractical. Semi-supervised learning helps classify web content efficiently, helping search engines like Google rank pages and produce relevant content for user queries.
    • Text Document Classification − Semi-supervised learning is used to classify text by training on a small set of labeled documents and a large corpus of unlabeled text. The model first learns from the labeled data and then uses those insights to classify the unlabeled text. This learning method helps improve the accuracy of classification without the need for extensive labeled datasets.

  • Unsupervised Machine Learning

    What is Unsupervised Machine Learning?

    Unsupervised learning, also known as unsupervised machine learning, is a type of machine learning that learns patterns and structures within the data without human supervision. Unsupervised learning uses machine learning algorithms to analyze the data and discover underlying patterns within unlabeled data sets.

    Unlike supervised machine learning, unsupervised machine learning models are trained on unlabeled datasets. Unsupervised learning algorithms are handy in scenarios where, unlike in supervised learning, we do not have pre-labeled training data and we want to extract useful patterns from the input data.

    We can summarize unsupervised learning as −

    • a machine learning approach or type that
    • uses machine learning algorithms
    • to find hidden patterns or structures
    • within the data without human supervision.

    There are many approaches that are used in unsupervised machine learning. Some of the approaches are association, clustering, and dimensionality reduction. Some examples of unsupervised machine learning algorithms include K-means clustering, DBSCAN, and the Apriori algorithm.

    In regression, we train the machine to predict a future value. In classification, we train the machine to classify an unknown object into one of the categories we define. In short, we have been training machines so that they can predict Y for our data X. Given a huge data set without predefined categories, it would be difficult for us to train the machine using supervised learning. What if the machine could look up and analyze big data running into several Gigabytes and Terabytes and tell us that this data contains so many distinct categories?

    As an example, consider the voters’ data. By considering some inputs from each voter (these are called features in AI terminology), let the machine predict that there are so many voters who would vote for X political party and so many would vote for Y, and so on. Thus, in general, we are asking the machine: given a huge set of data points X, “What can you tell me about X?”, or “What are the five best groups we can make out of X?”, or “What three features occur together most frequently in X?”.

    This is exactly what Unsupervised Learning is all about.

    How does Unsupervised Learning Work?

    In unsupervised learning, machine learning algorithms (called self-learning algorithms) are trained on unlabeled data sets, i.e., the input data is not categorized. Based on the task or machine learning problem, such as clustering or association, and on the data sets, suitable algorithms are chosen for the training.

    In the training process, the algorithms learn and infer their own rules on the basis of the similarities, patterns, and differences of data points. The algorithms learn without any labels (target values) or pre-training.

    The outcome of this training process is a machine learning model. As the data sets are unlabeled (no target values, no human supervision), the model is an unsupervised machine learning model.

    Now the model is ready to perform the unsupervised learning tasks such as clustering, association, or dimensionality reduction.

    Unsupervised learning models are suitable for complex tasks, like organizing large datasets into clusters.

    Unsupervised Machine Learning Methods

    Unsupervised learning methods or approaches are broadly categorized into three categories − clustering, association, and dimensionality reduction. Let us discuss these methods briefly and list some related algorithms −

    1. Clustering

    Clustering is a technique used to group a set of objects or data points into clusters based on their similarities. The goal of this technique is to ensure that data points within the same cluster are more similar to each other than to those in other clusters.

    Clustering is sometimes called unsupervised classification because it produces the same result as classification does but without having predefined classes.

    Clustering is one of the popular unsupervised learning approaches. There are several unsupervised learning algorithms used for clustering like −

    • K-Means Clustering − This algorithm assigns data points to one of K clusters based on their distance from the cluster centers. After assigning each data point to a cluster, new centroids are recalculated. This process iterates until the centroids no longer change, which shows that the algorithm has converged and the clusters are stable. (A minimal sketch follows this list.)
    • Mean Shift Algorithm − It is a clustering technique that identifies clusters by finding areas of high data density. It is an iterative process in which each data point is shifted towards the densest area of the data.
    • Gaussian Mixture Models − It is a probabilistic model that is a combination of multiple Gaussian distributions. These models are used to determine which distribution a given data point belongs to.

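    As referenced above, here is a minimal K-Means sketch with scikit-learn on the Iris features; choosing three clusters is an assumption based on the dataset having three species −

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris
    
    X, _ = load_iris(return_X_y=True)   # labels are ignored: this is unsupervised
    
    # assign each flower to one of 3 clusters and iteratively refine the centroids
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    
    print(labels[:10])              # cluster index of the first ten samples
    print(kmeans.cluster_centers_)  # final, stable centroids
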
    2. Association Rule Mining

    This is a rule-based technique used to discover associations between variables in large datasets. It is popularly used for Market Basket Analysis, helping companies make decisions and build recommendation engines. One of the main algorithms used for Association Rule Mining is the Apriori algorithm.

    Apriori Algorithm

    The Apriori algorithm is a technique used in unsupervised learning to identify itemsets that frequently occur together and to discover association rules within transactional data.

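    Scikit-learn does not implement Apriori, so the hedged sketch below assumes the third-party mlxtend library is installed; the one-hot basket table is invented for illustration −

    import pandas as pd
    from mlxtend.frequent_patterns import apriori
    
    # invented one-hot encoded market baskets: each row is one transaction
    baskets = pd.DataFrame({
        'bread':  [True, True, False, True, True],
        'butter': [True, True, False, False, True],
        'milk':   [False, True, True, True, True],
    })
    
    # find itemsets that appear in at least 40% of the transactions
    frequent = apriori(baskets, min_support=0.4, use_colnames=True)
    print(frequent)   # e.g., {bread}, {milk}, {bread, butter}, ...
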
    3. Dimensionality Reduction

    As the name suggests, dimensionality reduction is used to reduce the number of feature variables for each data sample by selecting a set of principal or representative features.

    A question that arises here is: why do we need to reduce the dimensionality? The reason is the problem of feature space complexity, which arises when we start analyzing and extracting millions of features from data samples. This problem is generally referred to as the “curse of dimensionality”. Some popular unsupervised learning algorithms used for dimensionality reduction are listed below, followed by a minimal PCA sketch −

    • Principal Component Analysis
    • Missing Value Ratio
    • Singular Value Decomposition
    • Autoencoders

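    As mentioned above, here is a minimal PCA sketch with scikit-learn; reducing the Iris features from four dimensions to two is an arbitrary choice for illustration −

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    
    X, _ = load_iris(return_X_y=True)   # 150 samples x 4 features
    
    # project the data onto the 2 directions of maximum variance
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)
    
    print(X_reduced.shape)                # (150, 2)
    print(pca.explained_variance_ratio_)  # variance kept by each component
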
    Algorithms for Unsupervised Learning

    Algorithms are a very important part of machine learning model training. A machine learning algorithm is a set of instructions that a program follows to analyze data and produce outcomes. For specific tasks, suitable machine learning algorithms are selected and trained on the data.

    Algorithms used in unsupervised learning generally fall under one of the three categories − clustering, association, or dimensionality reduction. The following are the most used unsupervised learning algorithms −

    • K-Means Clustering
    • Hierarchical Clustering
    • Mean-shift Clustering
    • DBSCAN Clustering
    • HDBSCAN Clustering
    • BIRCH Clustering
    • Affinity Propagation
    • Agglomerative Clustering
    • Apriori Algorithm
    • Eclat algorithm
    • FP-growth algorithm
    • Principal Component Analysis(PCA)
    • Autoencoders
    • Singular value decomposition (SVD)

    Advantages of Unsupervised Learning

    Unsupervised learning has many advantages that make it particularly useful for various tasks −

    • No labeled data required − Unsupervised learning doesn’t require a labeled dataset for training, which makes it easier and cheaper to use.
    • Discovers hidden patterns − It helps in recognizing patterns and relationships in large data, which can lead to gaining insights and efficient decision-making.
    • Suitable for complex tasks − It is efficiently used for various complex tasks like clustering, anomaly detection, and dimensionality reduction.

    Disadvantages of Unsupervised Learning

    While unsupervised learning has many advantages, some challenges can also occur when training models without human intervention. Some of the disadvantages of unsupervised learning are:

    • Difficult to evaluate − Without labeled data and predefined targets, it would be difficult to evaluate the performance of unsupervised learning algorithms.
    • Inaccurate outcomes − The outcome of an unsupervised learning algorithm might be less accurate, especially if the input data is noisy; moreover, since the data is not labeled, the algorithms do not know the exact expected output.

    Applications of Unsupervised Learning

    Unsupervised learning provides a path for businesses to identify patterns in large volumes of data. Some real-world applications of unsupervised learning are:

    • Customer Segmentation − In business and retail analysis, unsupervised learning is used to group customers into segments based on their purchases, past activity, or preferences.
    • Anomaly Detection − Unsupervised learning algorithms are used in anomaly detection to identify unusual patterns, which is crucial for fraud detection in financial transactions and network security.
    • Recommendation Engines − Unsupervised learning algorithms help to analyze large customer data to gain valuable insights and understand patterns. This can help in target marketing and personalization.
    • Natural Language Processing − Unsupervised learning algorithms are used for various applications. For example, Google uses them to categorize articles in its news section.

    What is Anomaly Detection?

    This unsupervised ML method is used to find occurrences of rare events or observations that generally do not occur. By using the learned knowledge, anomaly detection methods are able to differentiate between anomalous and normal data points.

    Some of the unsupervised algorithms, like clustering and KNN-based methods, can detect anomalies based on the data and its features. A minimal sketch is given below.

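    One common choice (not the only one) is scikit-learn's IsolationForest. The sketch below flags outliers in invented 2-D data; the contamination rate is an assumption −

    import numpy as np
    from sklearn.ensemble import IsolationForest
    
    rng = np.random.default_rng(0)
    # invented data: a dense normal cluster plus a few far-away points
    normal = rng.normal(0, 1, size=(200, 2))
    anomalies = rng.uniform(6, 8, size=(5, 2))
    X = np.vstack([normal, anomalies])
    
    # assume roughly 3% of the points are anomalous
    model = IsolationForest(contamination=0.03, random_state=0)
    labels = model.fit_predict(X)   # -1 = anomaly, 1 = normal
    
    print(np.where(labels == -1)[0])   # indices flagged as anomalies
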
    Supervised Vs. Unsupervised Learning

    Supervised learning algorithms are trained using labeled data. But there might be cases where the data is not labeled, so how do you gain insights from data that is unlabeled and messy? Well, to solve these types of cases, unsupervised learning is used. We have provided a detailed comparison of supervised and unsupervised learning in the supervised vs. unsupervised learning chapter.