  • Machine Learning – Mean, Median, Mode

    Mean, Median, and Mode are statistical measures used to describe the central tendency of a dataset. In machine learning, these measures are used to understand the distribution of data and identify outliers. Here, we will explore the concepts of Mean, Median, and Mode and their implementation in Python.

    Mean

    The “mean” is the average value of a dataset. It is calculated by adding up all the values in the dataset and dividing by the number of observations. Although the mean is a widely used measure of central tendency, it is sensitive to outliers, meaning that extreme values can significantly affect it. For example, a single very large salary can pull the mean salary well above what most employees earn.

    In Python, we can calculate the mean using the NumPy library, which provides a function called mean().

    Median

    The “median” is the middle value in a dataset. It is calculated by arranging the values in the dataset in order and finding the value that lies in the middle. If there are an even number of values in the dataset, the median is the average of the two middle values.

    The median is a useful measure of central tendency because it is not affected by outliers, meaning that extreme values do not significantly affect the value of the median.

    In Python, we can calculate the median using the NumPy library, which provides a function called median().

    Mode

    The “mode” is the most common value in a dataset. It is calculated by finding the value that occurs most frequently in the dataset. If multiple values occur with the same highest frequency, the dataset is said to be bimodal, trimodal, or multimodal.

    The mode is a useful measure of central tendency because it can identify the most common value in a dataset. However, it is not a good measure of central tendency for datasets with a wide range of values or datasets with no repeating values.

    In Python, we can calculate the mode using the SciPy library, which provides a function called mode().
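
    Since the text above mentions SciPy's mode() function, here is a minimal sketch of it; the sample values are made up, and the keepdims argument assumes SciPy 1.9 or later −

    from scipy import stats

    # hypothetical sample data for illustration
    values = [45000, 55000, 55000, 70000, 70000, 70000]

    # stats.mode returns the most frequent value and how often it occurs;
    # keepdims=False (SciPy 1.9+) returns plain scalars
    result = stats.mode(values, keepdims=False)
    print(result.mode, result.count)   # expected: 70000 3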

    Python Implementation

    Let’s see an example of calculating mean, median, and mode for a salary table in Python using NumPy and Pandas −

    import numpy as np
    import pandas as pd
    # create a sample salary table
    salary = pd.DataFrame({
       'employee_id': ['001','002','003','004','005','006','007','008','009','010'],
       'salary': [50000,65000,55000,45000,70000,60000,55000,45000,80000,70000]
    })

    # calculate mean
    mean_salary = np.mean(salary['salary'])
    print('Mean salary:', mean_salary)

    # calculate median
    median_salary = np.median(salary['salary'])
    print('Median salary:', median_salary)

    # calculate mode
    mode_salary = salary['salary'].mode()[0]
    print('Mode salary:', mode_salary)

    Output

    On executing this code, you will get the following output −

    Mean salary: 59500.0
    Median salary: 57500.0
    Mode salary: 45000

  • Statistics for Machine Learning

    Statistics is a crucial tool in machine learning because it helps us understand the underlying patterns in the data. It provides us with methods to describe, summarize, and analyze data. Let’s see some of the basics of statistics for machine learning.

    What is Statistics?

    Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, and presentation of data. It provides us with different types of methods and techniques to analyze data and draw conclusions from it.

    Statistics is the foundation for machine learning as it helps us to analyze and visualize data to find hidden patterns. Statistics is used in machine learning in many ways, including model validation, data cleaning, model selection, evaluating model performance, etc.

    Basic Statistics Concepts for Machine Learning

    Following are some of the important statistics concepts essential for machine learning; a short code sketch after the list illustrates several of them −

    • Mean, Median, Mode − These statistical measures are used to describe the central tendency of a dataset.
    • Standard deviation, Variance − Standard deviation is a measure of the amount of variation or dispersion of a set of data values around their mean; variance is the square of the standard deviation.
    • Percentiles − A percentile is a measure that indicates the value below which a given percentage of observations in a group of observations falls.
    • Data Distribution − It refers to the way in which data points are distributed or spread out across a dataset.
    • Skewness and Kurtosis − Skewness refers to the degree of asymmetry of a distribution and kurtosis refers to the degree of peakedness of a distribution.
    • Bias and Variance − They describe the sources of error in a model’s predictions.
    • Hypothesis − It is a proposed explanation or solution for a problem.
    • Linear Regression − It is used to predict the value of a variable based on the value of another variable.
    • Logistic Regression − It estimates the probability of an event occurring.
    • Principal Component Analysis − It is a dimensionality reduction method used to reduce the dimensionality of large datasets.
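
    As a quick illustration of several of the measures listed above, here is a small sketch using NumPy and SciPy; the data values are made up −

    import numpy as np
    from scipy import stats

    data = np.array([12, 15, 17, 18, 21, 22, 24, 30, 45])   # made-up sample

    print('Std deviation:', np.std(data))       # spread around the mean
    print('Variance:', np.var(data))            # square of the standard deviation
    print('75th percentile:', np.percentile(data, 75))
    print('Skewness:', stats.skew(data))        # asymmetry of the distribution
    print('Kurtosis:', stats.kurtosis(data))    # peakedness of the distribution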

    Types of Statistics

    There are two types of statistics – descriptive and inferential statistics.

    • Descriptive Statistics − a set of rules or methods used to describe or summarize the features of a dataset.
    • Inferential Statistics − deals with making predictions and inferences about a population based on a sample of data.

    Let’s understand these two types of statistics in detail.

    Descriptive Statistics

    Descriptive statistics is a branch of statistics that deals with the summary and analysis of data. It includes measures such as mean, median, mode, variance, and standard deviation. These measures help us understand the central tendency, variability, and distribution of the data.

    Applications in Machine Learning

    In machine learning, descriptive statistics can be used to summarize the data, identify outliers, and detect patterns. For example, we can use the mean and standard deviation to describe the distribution of a dataset.

    Example

    In Python, we can calculate descriptive statistics using libraries such as NumPy and Pandas. Below is an example −

    import numpy as np
    import pandas as pd
    
    data = np.array([1,2,3,4,5])
    df = pd.DataFrame(data, columns=["Values"])
    print(df.describe())

    Output

    This will output a summary of the dataset, including the count, mean, standard deviation, minimum, and maximum values as follows −

             Values
    count    5.000000
    mean     3.000000
    std      1.581139
    min      1.000000
    25%      2.000000
    50%      3.000000
    75%      4.000000
    max      5.000000
    

    Inferential Statistics

    Inferential statistics is a branch of statistics that deals with making predictions and inferences about a population based on a sample of data. It involves using hypothesis testing, confidence intervals, and regression analysis to draw conclusions about the data.
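
    Before the regression example below, here is a small sketch of hypothesis testing, one of the tools mentioned above, using a two-sample t-test from SciPy; the two samples are synthetic −

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    group_a = rng.normal(loc=5.0, scale=1.0, size=50)   # synthetic sample A
    group_b = rng.normal(loc=5.5, scale=1.0, size=50)   # synthetic sample B

    # two-sample t-test: are the two group means significantly different?
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print('t-statistic:', t_stat)
    print('p-value:', p_value)   # a small p-value suggests the means differ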

    Applications in Machine Learning

    In machine learning, inferential statistics can be used to make predictions about new data based on existing data. For example, we can use regression analysis to predict the price of a house based on its features, such as the number of bedrooms and bathrooms.

    Example

    In Python, we can perform inferential statistics using libraries such as Scikit-Learn and StatsModels. Below is an example −

    import statsmodels.api as sm
    import numpy as np
    
    X = np.array([1,2,3,4,5])
    y = np.array([2,4,6,8,10])
    
    X = sm.add_constant(X)
    model = sm.OLS(y, X).fit()
    print(model.summary())

    Output

    This will output a summary of the regression model, including the coefficients, standard errors, t-statistics, and p-values as follows −

    (Figure: OLS regression results summary table)

    In the next chapter, we will discuss various descriptive and inferential statistics measures, which are commonly used in machine learning, in detail, along with Python implementation examples.

  • Machine Learning – Scatter Matrix Plot

    Scatter Matrix Plot is a graphical representation of the relationship between multiple variables. It is a useful tool in machine learning for visualizing the correlation between features in a dataset. This plot is also known as a Pair Plot, and it is used to identify the correlation between two or more variables in a dataset.

    A Scatter Matrix Plot displays the scatter plot of each pair of features in a dataset. Each scatter plot represents the relationship between two variables. It is also possible to add a diagonal line to the plot that shows the distribution of each variable.

    Python Implementation of Scatter Matrix Plot

    Here, we will implement the Scatter Matrix Plot in Python. For our example given below, we will be using the well-known Iris dataset, loaded through Seaborn.

    The Iris dataset is a classic dataset in machine learning. It contains four features: Sepal Length, Sepal Width, Petal Length, and Petal Width. The dataset has 150 samples, and each sample is labeled as one of three species: Setosa, Versicolor, or Virginica.

    We will use the Seaborn library to implement the Scatter Matrix Plot. Seaborn is a Python data visualization library that is built on top of the Matplotlib library.

    Example

    Below is the Python code to implement the Scatter Matrix Plot −

    import seaborn as sns
    import pandas as pd
    import matplotlib.pyplot as plt

    # load iris dataset
    iris = sns.load_dataset('iris')

    # create scatter matrix plot
    sns.pairplot(iris, hue='species')

    # show plot
    plt.show()

    In this code, we first import the necessary libraries: Seaborn, Pandas, and Matplotlib. Then, we load the Iris dataset using the sns.load_dataset() function, which ships with the Seaborn library.

    Next, we create the Scatter Matrix Plot using the sns.pairplot() function. The hue parameter is used to specify the column in the dataset that should be used for color encoding. In this case, we use the species column to color the points according to the species of each sample.

    Finally, we use the plt.show() function to display the plot.

    Output

    The output of this code will be a Scatter Matrix Plot that shows the scatter plots of each pair of features in the Iris dataset.

    (Figure: scatter matrix plot of the Iris dataset)

    Notice that each scatter plot is color-coded according to the species of each sample.

  • Machine Learning – Correlation Matrix Plot

    A correlation matrix plot is a graphical representation of the pairwise correlation between variables in a dataset. The plot consists of a matrix of scatterplots and correlation coefficients, where each scatterplot represents the relationship between two variables, and the correlation coefficient indicates the strength of the relationship. The diagonal of the matrix usually shows the distribution of each variable.

    The correlation coefficient is a measure of the linear relationship between two variables and ranges from -1 to 1. A coefficient of 1 indicates a perfect positive correlation, where an increase in one variable is associated with an increase in the other variable. A coefficient of -1 indicates a perfect negative correlation, where an increase in one variable is associated with a decrease in the other variable. A coefficient of 0 indicates no correlation between the variables.
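
    As a quick check of these definitions, we can compute a Pearson correlation coefficient directly with NumPy; the two arrays below are invented for illustration −

    import numpy as np

    x = np.array([1, 2, 3, 4, 5])
    y = np.array([2, 4, 5, 4, 5])

    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
    # entry is the Pearson correlation between x and y
    r = np.corrcoef(x, y)[0, 1]
    print(r)   # a value between -1 and 1; positive here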

    Python Implementation of Correlation Matrix Plots

    Now that we have a basic understanding of correlation matrix plots, let’s implement them in Python. For our example, we will be using the Iris flower dataset from Sklearn, which contains measurements of the sepal length, sepal width, petal length, and petal width of 150 iris flowers, belonging to three different species – Setosa, Versicolor, and Virginica.

    Example

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    iris = load_iris()
    data = pd.DataFrame(iris.data, columns=iris.feature_names)
    target = iris.target

    # compute the pairwise correlation matrix
    corr = data.corr()
    sns.set(style='white')

    # mask the upper triangle, which duplicates the lower triangle
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    f, ax = plt.subplots(figsize=(11, 9))
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0, annot=True,
       square=True, linewidths=.5, cbar_kws={"shrink": .5})
    plt.show()

    Output

    This code will produce a correlation matrix plot of the Iris dataset, with each square representing the correlation coefficient between two variables.

    (Figure: correlation matrix heatmap of the Iris features)

    From this plot, we can see that the variables ‘sepal width (cm)’ and ‘petal length (cm)’ have a moderate negative correlation (-0.37), while the variables ‘petal length (cm)’ and ‘petal width (cm)’ have a strong positive correlation (0.96). We can also see that the variable ‘sepal length (cm)’ has a strong positive correlation (0.87) with the variable ‘petal length (cm)’.

  • Machine Learning – Box and Whisker Plots

    A boxplot is a graphical representation of a dataset that displays the five-number summary of the data – the minimum value, the first quartile, the median, the third quartile, and the maximum value.

    The boxplot consists of a box with whiskers extending from the top and bottom of the box.

    • The box represents the interquartile range (IQR) of the data, which is the range between the first and third quartiles.
    • The whiskers extend from the top and bottom of the box to the highest and lowest values that are within 1.5 times the IQR.

    Any values that fall outside this range are considered outliers and are represented as points beyond the whiskers.
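
    To make the five-number summary and the 1.5 × IQR rule concrete, here is a small sketch computing them directly with NumPy; the data values are made up −

    import numpy as np

    data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 95])   # made-up sample

    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1

    # whiskers extend to the most extreme points within 1.5 * IQR of the box
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    print('Five-number summary:', data.min(), q1, median, q3, data.max())
    print('Outliers:', data[(data < lower_fence) | (data > upper_fence)])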

    Python Implementation of Box and Whisker Plots

    Now that we have a basic understanding of boxplots, let’s implement them in Python. For our example, we will be using the Iris dataset from Sklearn, which contains measurements of the sepal length, sepal width, petal length, and petal width of 150 iris flowers, belonging to three different species – Setosa, Versicolor, and Virginica.

    To start, we need to import the necessary libraries and load the dataset.

    Example

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.datasets import load_iris
    iris = load_iris()
    data = iris.data
    target = iris.target
    

    Next, we can create a boxplot of the sepal length for each of the three iris species using the Seaborn library.

    plt.figure(figsize=(7.5,3.5))
    sns.boxplot(x=target, y=data[:,0])
    plt.xlabel('Species')
    plt.ylabel('Sepal Length (cm)')
    plt.show()

    Output

    This code will produce a boxplot of the sepal length for each of the three iris species, with the x-axis representing the species and the y-axis representing the sepal length in centimeters.

    (Figure: boxplot of sepal length by species)

    From this boxplot, we can see that the setosa species has a shorter sepal length compared to the versicolor and virginica species, whose sepal length ranges overlap more with each other. Additionally, we can see that there are no outliers in the setosa species, but there are a few outliers in the versicolor and virginica species.

  • Machine Learning – Density Plots

    A density plot is a type of plot that shows the probability density function of a continuous variable. It is similar to a histogram, but instead of using bars to represent the frequency of each value, it uses a smooth curve to represent the probability density function. The x-axis represents the range of values of the variable, and the y-axis represents the probability density.

    Density plots are useful for identifying patterns in data, such as skewness, modality, and outliers. Skewness refers to the degree of asymmetry in the distribution of the variable. Modality refers to the number of peaks in the distribution. Outliers are data points that fall outside of the range of typical values for the variable.

    Python Implementation of Density Plots

    Python provides several libraries for data visualization, such as Matplotlib, Seaborn, Plotly, and Bokeh. For our example given below, we will use Seaborn to implement density plots.

    We will use the breast cancer dataset from the Sklearn library for this example. The breast cancer dataset contains information about the characteristics of breast cancer cells and whether they are malignant or benign. The dataset has 30 features and 569 samples.

    Example

    Let’s start by importing the necessary libraries and loading the dataset −

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.datasets import load_breast_cancer
    data = load_breast_cancer()

    Next, we will create a density plot of the mean radius feature of the dataset −

    plt.figure(figsize=(7.2,3.5))
    sns.kdeplot(data.data[:,0], fill=True)
    plt.xlabel('Mean Radius')
    plt.ylabel('Density')
    plt.show()

    In this code, we have used the kdeplot() function from Seaborn to create a density plot of the mean radius feature of the dataset. We have set the fill parameter to True to shade the area under the curve (older Seaborn versions used shade=True, which is now deprecated). We have also added labels to the x and y axes using the xlabel() and ylabel() functions.

    Output

    The resulting density plot shows the probability density function of mean radius values in the dataset. We can see that the data is roughly normally distributed, with a peak around 12-14.

    (Figure: density plot of the mean radius feature)

    Density Plot with Multiple Data Sets

    We can also create a density plot with multiple data sets to compare their probability density functions. Let’s create density plots of the mean radius feature for both the malignant and benign samples −

    Example

    plt.figure(figsize=(7.5,3.5))
    sns.kdeplot(data.data[data.target==0,0], fill=True, label='Malignant')
    sns.kdeplot(data.data[data.target==1,0], fill=True, label='Benign')
    plt.xlabel('Mean Radius')
    plt.ylabel('Density')
    plt.legend()
    plt.show()

    In this code, we have used the kdeplot() function twice to create two density plots of the mean radius feature, one for the malignant samples and one for the benign samples. We have set the fill parameter to True to shade the area under the curve, and we have added labels to the plots using the label parameter. We have also added a legend to the plot using the legend() function.

    Output

    On executing this code, you will get the following plot as the output −

    (Figure: density plots of mean radius for malignant and benign samples)

    The resulting density plot shows the probability density functions of mean radius values for both the malignant and benign samples. We can see that the probability density function for the malignant samples is shifted to the right, indicating a higher mean radius value.

  • Machine Learning – Histograms

    A histogram is a bar graph-like representation of the distribution of a variable. It shows the frequency of occurrences of each value of the variable. The x-axis represents the range of values of the variable, and the y-axis represents the frequency or count of each value. The height of each bar represents the number of data points that fall within that value range.

    Histograms are useful for identifying patterns in data, such as skewness, modality, and outliers. Skewness refers to the degree of asymmetry in the distribution of the variable. Modality refers to the number of peaks in the distribution. Outliers are data points that fall outside of the range of typical values for the variable.

    Python Implementation of Histograms

    Python provides several libraries for data visualization, such as Matplotlib, Seaborn, Plotly, and Bokeh. For the example given below, we will use Matplotlib to implement histograms.

    We will use the breast cancer dataset from the Sklearn library for this example. The breast cancer dataset contains information about the characteristics of breast cancer cells and whether they are malignant or benign. The dataset has 30 features and 569 samples.

    Example

    Let’s start by importing the necessary libraries and loading the dataset −

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    data = load_breast_cancer()

    Next, we will create a histogram of the mean radius feature of the dataset −

    plt.figure(figsize=(7.2,3.5))
    plt.hist(data.data[:,0], bins=20)
    plt.xlabel('Mean Radius')
    plt.ylabel('Frequency')
    plt.show()

    In this code, we have used the hist() function from Matplotlib to create a histogram of the mean radius feature of the dataset. We have set the number of bins to 20 to divide the data range into 20 intervals. We have also added labels to the x and y axes using the xlabel() and ylabel() functions.

    Output

    The resulting histogram shows the distribution of mean radius values in the dataset. We can see that the data is roughly normally distributed, with a peak around 12-14.

    (Figure: histogram of the mean radius feature)

    Histogram with Multiple Data Sets

    We can also create a histogram with multiple data sets to compare their distributions. Let’s create histograms of the mean radius feature for both the malignant and benign samples −

    Example

    plt.figure(figsize=(7.2,3.5))
    plt.hist(data.data[data.target==0,0], bins=20, alpha=0.5, label='Malignant')
    plt.hist(data.data[data.target==1,0], bins=20, alpha=0.5, label='Benign')
    plt.xlabel('Mean Radius')
    plt.ylabel('Frequency')
    plt.legend()
    plt.show()

    In this code, we have used the hist() function twice to create two histograms of the mean radius feature, one for the malignant samples and one for the benign samples. We have set the transparency of the bars to 0.5 using the alpha parameter so that they don’t overlap completely. We have also added a legend to the plot using the legend() function.

    Output

    On executing this code, you will get the following plot as the output −

    (Figure: overlaid histograms of mean radius for malignant and benign samples)

    The resulting histogram shows the distribution of mean radius values for both the malignant and benign samples. We can see that the distributions are different, with the malignant samples having a higher frequency of higher mean radius values.

  • Data Visualization in Machine Learning

    Data visualization is an important aspect of machine learning (ML) as it helps to analyze and communicate patterns, trends, and insights in the data. Data visualization involves creating graphical representations of the data, which can help to identify patterns and relationships that may not be apparent from the raw data.

    What is Data Visualization?

    Data visualization is a graphical representation of data and information. With the help of data visualization, we can see what the data looks like and what kind of correlation the attributes of the data hold. It is the fastest way to see whether the features correspond to the output.

    Importance of Data Visualization in Machine Learning

    Data visualization plays a significant role in machine learning, and we can use it in many ways. Here are some of them −

    • Exploring Data − Data visualization is an essential tool for exploring and understanding data. Visualization can help to identify patterns, correlations, and outliers and can also help to detect data quality issues such as missing values and inconsistencies.
    • Feature Selection − Data visualization can help to select relevant features for the ML model. By visualizing the data and its relationship with the target variable, you can identify features that are strongly correlated with the target variable and exclude irrelevant features that have little predictive power.
    • Model Evaluation − Data visualization can be used to evaluate the performance of the ML model. Visualization techniques such as ROC curves, precision-recall curves, and confusion matrices can help to understand the accuracy, precision, recall, and F1 score of the model (see the sketch after this list).
    • Communicating Insights − Data visualization is an effective way to communicate insights and results to stakeholders who may not have a technical background. Visualizations such as scatter plots, line charts, and bar charts can help to convey complex information in an easily understandable format.
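
    As an illustration of the model-evaluation point above, here is a minimal sketch using Scikit-Learn's metrics; the true labels and predictions are invented for illustration −

    import matplotlib.pyplot as plt
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

    # hypothetical true labels and model predictions
    y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
    y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

    cm = confusion_matrix(y_true, y_pred)
    print(cm)

    # visualize the confusion matrix as a plot
    ConfusionMatrixDisplay(cm).plot()
    plt.show()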

    Popular Python Libraries for Data Visualization

    Following are the most popular Python libraries for data visualization in Machine learning. These libraries provide a wide range of visualization techniques and customization options to suit different needs and preferences.

    1. Matplotlib

    Matplotlib is one of the most popular Python packages used for data visualization. It is a cross-platform library for making 2D plots from data in arrays. It provides an object-oriented API that helps in embedding plots in applications using Python GUI toolkits such as PyQt, WxPython, or Tkinter. It can be used in Python and IPython shells, Jupyter notebooks, and web application servers.

    2. Seaborn

    Seaborn is an open-source, BSD-licensed Python library providing a high-level API for visualizing data using the Python programming language.

    3. Plotly

    Plotly is a Montreal-based technical computing company involved in the development of data analytics and visualization tools such as Dash and Chart Studio. It has also developed open-source graphing Application Programming Interface (API) libraries for Python, R, MATLAB, JavaScript, and other programming languages.

    4. Bokeh

    Bokeh is a data visualization library for Python. Unlike Matplotlib and Seaborn, which render static images, Bokeh renders its plots using HTML and JavaScript. Hence, it proves to be extremely useful for developing web-based dashboards.

    Types of Data Visualization

    Data visualization for machine learning data can be classified into two different categories as follows –

    • Univariate Plots
    • Multivariate Plots

    Let’s understand each of these two types of data visualization plots in detail.

    Univariate Plots: Understanding Attributes Independently

    The simplest type of visualization is single-variable or univariate visualization. With the help of univariate visualization, we can understand each attribute of our dataset independently. The following are some techniques in Python to implement univariate visualization −

    • Histograms
    • Density Plots
    • Box and Whisker Plots

    We will learn the above techniques in detail in their respective chapters. Let’s look at these techniques in brief.

    Histograms

    Histograms group the data into bins and are the fastest way to get an idea about the distribution of each attribute in the dataset. The following are some of the characteristics of histograms −

    • They provide a count of the number of observations in each bin created for the visualization.
    • From the shape of the bins, we can easily observe the distribution, i.e., whether it is Gaussian, skewed, or exponential.
    • Histograms also help us to spot possible outliers.

    Example

    The code below is an example of a Python script creating a histogram. Here, we use Matplotlib's hist() function on a NumPy array to generate the histogram and plot it.

    import matplotlib.pyplot as plt
    import numpy as np

    # Generate some random data
    data = np.random.randn(1000)

    # Create the histogram
    plt.hist(data, bins=30, color='skyblue', edgecolor='black')
    plt.xlabel('Values')
    plt.ylabel('Frequency')
    plt.title('Histogram Example')
    plt.show()

    Output

    (Figure: histogram of 1000 random values drawn from a normal distribution)

    Because of random number generation, you may notice a slight difference between the outputs when you execute the above program.

    Density Plots

    A density plot is another quick and easy technique for getting a feel for each attribute's distribution. It is like a histogram, but with a smooth curve drawn through the top of each bin. We can think of density plots as abstracted histograms.

    Example

    In the following example, the Python script will generate a density plot for the sepal length attribute of the iris dataset.

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Load a sample dataset
    df = sns.load_dataset("iris")

    # Create the density plot
    sns.kdeplot(data=df, x="sepal_length", fill=True)

    # Add labels and title
    plt.xlabel("Sepal Length")
    plt.ylabel("Density")
    plt.title("Density Plot of Sepal Length")

    # Show the plot
    plt.show()

    Output

    (Figure: density plot of sepal length)

    From the above output, the difference between Density plots and Histograms can be easily understood.

    Box and Whisker Plots

    Box and Whisker Plots, also called boxplots for short, are another useful technique for reviewing the distribution of each attribute. The following are the characteristics of this technique −

    • It is univariate in nature and summarizes the distribution of each attribute.
    • It draws a line for the middle value, i.e., the median.
    • It draws a box around the 25th and 75th percentiles.
    • It also draws whiskers, which give us an idea about the spread of the data.
    • The dots outside the whiskers signify outlier values, i.e., values lying more than 1.5 times the IQR beyond the box.

    Example

    In the following example, the Python script will generate a Box and Whisker Plot for a small sample dataset.

    import matplotlib.pyplot as plt

    # Sample data
    data = [10, 15, 18, 20, 22, 25, 28, 30, 32, 35]

    # Create a figure and axes
    fig, ax = plt.subplots()

    # Create the boxplot
    ax.boxplot(data)

    # Set the title
    ax.set_title('Box and Whisker Plot')

    # Show the plot
    plt.show()

    Output

    (Figure: box and whisker plot of the sample data)

    Multivariate Plots: Interaction Among Multiple Variables

    Another type of visualization is multi-variable or multivariate visualization. With the help of multivariate visualization, we can understand the interaction between multiple attributes of our dataset. The following are some techniques in Python to implement multivariate visualization −

    • Correlation Matrix Plot
    • Scatter Matrix Plot

    Correlation Matrix Plot

    Correlation is an indication of how changes in one variable relate to changes in another. We can plot a correlation matrix to show which variables have a high or low correlation with respect to other variables.

    Example

    In the following example, the Python script will generate a correlation matrix plot. The matrix is computed with the corr() function on a Pandas DataFrame and plotted as a Seaborn heatmap.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Sample data
    data = {'A': [1,2,3,4,5], 'B': [5,4,3,2,1], 'C': [2,3,1,4,5]}
    df = pd.DataFrame(data)

    # Calculate the correlation matrix
    c_matrix = df.corr()

    # Create a heatmap
    sns.heatmap(c_matrix, annot=True, cmap='coolwarm')
    plt.title("Correlation Matrix")
    plt.show()

    Output

    (Figure: correlation matrix heatmap of the sample data)

    From the above output of the correlation matrix, we can see that it is symmetrical, i.e., the bottom left is the same as the top right.

    Scatter Matrix Plot

    A scatter matrix plot shows how much one variable is affected by another, or the relationship between them, with the help of dots in two dimensions. Scatter plots are very much like line graphs in that they use horizontal and vertical axes to plot data points.

    Example

    In the following example, the Python script will generate and plot the scatter matrix for the Iris dataset. It can be generated with the help of the scatter_matrix() function on a Pandas DataFrame and plotted with the help of pyplot.

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn import datasets

    # Load the iris dataset
    iris = datasets.load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)

    # Create the scatter matrix plot
    pd.plotting.scatter_matrix(df, diagonal='hist', figsize=(8,7))
    plt.show()

    Output

    (Figure: scatter matrix plot of the Iris dataset)

    In the next few chapters, we will look at some of the popular and widely used visualization techniques available in machine learning.

  • Supervised vs. Unsupervised Learning

    Supervised learning and Unsupervised learning are two popular approaches in Machine Learning. The simplest way to distinguish between supervised and unsupervised learning is the type of training dataset and the way the models are trained. However, there are other differences, which are further discussed in the chapter.

    What is Supervised Learning?

    Supervised Learning is a machine learning approach that uses labeled datasets to train the model, making it ideal for tasks like classifying data or predicting output. Supervised learning is categorized into two types −

    1. Classification

    Classification uses algorithms to predict categorical values, such as determining whether an email is spam or whether a statement is true or false. The algorithm learns to map each input to its corresponding output label. Some common algorithms include K-Nearest Neighbors, random forests, and decision trees.
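
    Below is a minimal classification sketch using K-Nearest Neighbors on the Iris dataset; the choice of n_neighbors=3 and the train/test split are arbitrary −

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # train a K-Nearest Neighbors classifier on the labeled data
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)
    print('Test accuracy:', knn.score(X_test, y_test))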

    2. Regression

    Regression is a statistical approach to analyzing the relationship between data points. It can be used to forecast house prices based on features like location and size, or to estimate future sales. Some common algorithms include linear regression and polynomial regression. (Logistic regression, despite its name, is generally used for classification.)
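
    Below is a minimal regression sketch with Scikit-Learn; the house sizes and prices are invented for illustration −

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # hypothetical house sizes (sq ft) and prices
    sizes = np.array([[1000], [1500], [2000], [2500], [3000]])
    prices = np.array([200000, 270000, 340000, 410000, 480000])

    # fit a line to the labeled examples, then predict an unseen input
    model = LinearRegression().fit(sizes, prices)
    print(model.predict([[1800]]))   # expected: about 312000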

    What is Unsupervised Learning?

    Unsupervised Learning is a machine learning approach used to train models on raw, unlabeled data. This approach is often used to identify patterns in the data without human supervision. Unsupervised learning models are used for the tasks below −

    1. Clustering

    This task uses unsupervised learning models to group data points into clusters based on their similarities. A popular algorithm is K-means clustering, sketched below.
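
    Below is a minimal clustering sketch with Scikit-Learn's KMeans; the 2-D points are invented so that they form two obvious groups −

    import numpy as np
    from sklearn.cluster import KMeans

    # synthetic 2-D points forming two rough groups
    X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)            # cluster assignment for each point
    print(kmeans.cluster_centers_)   # coordinates of the two centroids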

    2. Association

    This is another type of unsupervised learning that discovers rules describing relationships between items in the data. It is commonly used in Market Basket Analysis, and the main algorithm behind this task is the Apriori algorithm.

    3. Dimensionality Reduction

    This method of unsupervised learning is used to reduce the size of a dataset by removing features that are not necessary, without compromising the essential information in the data.
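
    Below is a minimal dimensionality reduction sketch using Principal Component Analysis (PCA) from Scikit-Learn on the Iris dataset −

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)

    # reduce the four iris features to two principal components
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)                  # (150, 2)
    print(pca.explained_variance_ratio_)    # variance retained per component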

    Differences between Supervised and Unsupervised Learning

    The table below shows some key differences between supervised and unsupervised machine learning −

    Basis | Supervised Learning | Unsupervised Learning
    Definition | Algorithms train on data where every input has a corresponding output. | Algorithms find patterns in data that has no predefined labels.
    Goal | Predict or classify based on input features. | Discover hidden patterns, structures, and relationships.
    Input Data | Labeled: input data with corresponding output labels. | Unlabeled: input data is raw and unlabeled.
    Human Supervision | Needs human supervision (labeling) to train the model. | Does not need supervision to train the model.
    Tasks | Regression, Classification | Clustering, Association, and Dimensionality Reduction
    Complexity | Methods are computationally simpler. | Methods are computationally more complex.
    Algorithms | Linear Regression, K-Nearest Neighbors, Decision Trees, Naive Bayes, SVM | K-Means clustering, DBSCAN, Autoencoders
    Accuracy | Methods are generally highly accurate. | Methods are generally less accurate.
    Applications | Image classification, Sentiment Analysis, Recommendation systems | Customer Segmentation, Anomaly Detection, Recommendation Engines, NLP

    Supervised or Unsupervised Learning – Which to Choose?

    Choosing the right approach is crucial and will also determine the efficiency of the outcome. To decide on which learning approach is best, the following things should be considered −

    • Dataset − Evaluate the data, whether it is labeled or unlabeled. You will also need to assess whether you have the time, resources and expertise to support labeling.
    • Goals − It is also important to define the problem you are trying to solve and the kind of solution you need. It might be classification, discovering new patterns or insights in the data, or creating a predictive model.
    • Algorithm − Review the algorithm to make sure it matches the shape of your data, such as the attributes and number of features, and evaluate whether it can handle the volume of the data.

    Semi-supervised Learning

    Semi-supervised learning is a safe middle ground if you are in a dilemma about choosing between supervised and unsupervised learning. This approach combines both: a minor part of the dataset is labeled and the major part is unlabeled. It is ideal when you have a high volume of data that makes labeling every sample impractical.

  • Reinforcement Learning

    What is Reinforcement Learning?

    Reinforcement learning is a machine learning approach where an agent (software entity) is trained to interpret the environment by performing actions and monitoring the results. For every good action, the agent gets positive feedback and for every bad action the agent gets negative feedback. It’s inspired by how animals learn from their experiences, making decisions based on the consequences of their actions.

    The following diagram shows a typical reinforcement learning model −

    (Figure: agent-environment interaction loop in reinforcement learning)

    In the above diagram, the agent is represented in a particular state. The agent takes action in an environment to achieve a particular task. As a result of the performed task, the agent receives feedback as a reward or punishment.

    How Does Reinforcement Learning Work?

    In reinforcement learning, there is an agent that we train over a period of time to interact with a specific environment. The agent follows a set of strategies for interacting with the environment, observes the environment's current state, and takes actions accordingly. The agent learns how to make decisions by receiving rewards or penalties based on its actions.

    The working of reinforcement learning can be understood by the approach of a master chess player.

    • Exploration − Just as a chess player considers various possible moves and their outcomes, the agent explores different actions to understand their effects and learns which actions lead to better results.
    • Exploitation − The chess player also uses intuition, based on past experiences, to make decisions that seem right. Similarly, the agent uses knowledge gained from previous experiences to make the best choices.

    Key Elements of Reinforcement Learning

    Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system −

    • Policy − It defines the learning agent’s way of behaving at a given time. A policy is a mapping from perceived states of the environment to actions to be taken when in those states.
    • Reward Signal − It defines the goal of a reinforcement learning problem. It is a numerical score sent to the agent by the environment, defining what the good and bad events are for the agent.
    • Value function − It specifies what is good in the long run. The value is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
    • Model − Models are used for planning, which means deciding on a course of action by considering possible future situations before they are actually experienced.

    Markov Decision Processes (MDP) provide a mathematical framework for modeling decision-making in an environment with states, actions, rewards, and transition probabilities. Reinforcement learning uses MDPs to understand how an agent should act to maximize rewards and to find the best strategies for decision making.

    Markov Decision Processes (MDP)

    Reinforcement learning uses the mathematical framework of Markov decision processes (MDP) to define the interaction between the learning agent and the environment. Some important concepts and components of MDP are −

    • States (S) − Represent all the situations in which an agent can find itself.
    • Actions (A) − The choices available to the agent in the given states.
    • Transition Probabilities (P) − The likelihood of moving from one state to another as a result of a specific action.
    • Rewards (R) − Feedback received after transitioning to a new state due to an action, indicating the outcome's desirability.
    • Policy (π) − A strategy that defines the action to take in each state to maximize reward.

    Steps in Reinforcement Learning Process

    Here are the major steps involved in reinforcement learning methods −

    • Step 1 − First, we need to prepare an agent with some initial set of strategies.
    • Step 2 − Then observe the environment and its current state.
    • Step 3 − Next, select the optimal policy for the current state of the environment and perform the appropriate action.
    • Step 4 − Now, the agent gets a corresponding reward or penalty in accordance with the action taken in the previous step.
    • Step 5 − Now, we can update the strategies if required.
    • Step 6 − At last, repeat steps 2-5 until the agent learns and adopts the optimal policy. A minimal code sketch of this loop follows.
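
    Below is a minimal Q-learning sketch of this loop on a toy one-dimensional grid world; the environment, rewards, and hyperparameters are all invented for illustration −

    import numpy as np

    # Toy environment: 5 states in a row; reaching state 4 yields reward 1.
    n_states, n_actions = 5, 2               # actions: 0 = left, 1 = right
    q_table = np.zeros((n_states, n_actions))
    alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount, exploration
    rng = np.random.default_rng(0)

    for episode in range(200):
        state = 0
        while state != 4:
            # Step 3: epsilon-greedy action selection (explore vs. exploit);
            # ties are broken randomly so unexplored states are not biased.
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                best = np.flatnonzero(q_table[state] == q_table[state].max())
                action = int(rng.choice(best))
            next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
            # Step 4: reward or penalty for the resulting state.
            reward = 1.0 if next_state == 4 else 0.0
            # Step 5: update the strategy (here, the Q-table).
            q_table[state, action] += alpha * (
                reward + gamma * q_table[next_state].max() - q_table[state, action]
            )
            state = next_state

    print(q_table)   # the learned values favor moving right in every state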

    Types of Reinforcement Learning

    There are two types of Reinforcement learning:

    • Positive Reinforcement − When an agent performs an action that is desirable or leads to a good outcome, it receives a reward, which increases the likelihood of that action being repeated.
    • Negative Reinforcement − When an agent performs an action to avoid a negative outcome, the negative stimulus is removed. For example, if a robot is programmed to avoid an obstacle and successfully navigates away from it, the threat associated with the obstacle is removed, and the robot is more likely to repeat that avoidance behavior in the future.

    Types of Reinforcement Learning Algorithms

    There are various algorithms used in reinforcement learning such as Q-learning, policy gradient methods, Monte Carlo method and many more. All these algorithms can be classified into two broad categories −

    • Model-free Reinforcement Learning − It is a category of reinforcement learning algorithms that learns to make decisions by interacting with the environment directly, without creating a model of the environment’s dynamics. The agent performs different actions multiple times to learn the outcomes and creates a strategy (policy) that optimizes its reward points. This is ideal for changing, large or complex environments.
    • Model-based Reinforcement Learning − This category of reinforcement learning algorithms involves creating a model of the environment’s dynamics to make decisions and improve performance. This approach is ideal when the environment is static and well-defined, or when testing in the real environment is difficult.

    Advantages of Reinforcement Learning

    Some of the advantages of reinforcement learning are −

    • Reinforcement learning doesn’t require pre-defined instructions or constant human intervention.
    • Reinforcement learning models can adapt to a wide range of environments, both static and dynamic.
    • Reinforcement learning can be used to solve a wide range of problems, including decision making, prediction, and optimization.
    • A reinforcement learning model gets better as it gains experience and fine-tunes its policy.

    Disadvantages of Reinforcement Learning

    Some of the disadvantages of reinforcement learning are −

    • Reinforcement learning depends on the quality of the reward function; if it is poorly designed, the model may never improve its performance.
    • Designing and tuning reinforcement learning systems can be complex and requires expertise.

    Applications of Reinforcement Learning

    Reinforcement learning has a wide range of applications across various fields. Some major applications are −

    1. Robotics

    Reinforcement learning is generally concerned with decision-making in unpredictable environments. It is a widely used approach for complicated robotics tasks, such as replicating human behavior, manipulation, navigation, and locomotion. It also allows robots to adapt to new environments through trial and error.

    2. Natural Language Processing (NLP)

    In Natural Language Processing (NLP), reinforcement learning is used to enhance the performance of chatbots by managing complex dialogues and improving user interactions. Additionally, this learning approach is also used to train models for tasks like summarization.

    Reinforcement Learning vs. Supervised Learning

    Supervised learning and Reinforcement learning are two distinct approaches in machine learning. In supervised learning, a model is trained on a dataset that consists of both input and its corresponding outputs for predictive analysis. Whereas, in reinforcement learning an agent interacts with an environment, learning to make decisions by receiving feedback in the form of rewards or penalties, aiming to maximize cumulative rewards. Another difference between these two approaches is the tasks that they are ideal for. While supervised learning is used for tasks that are often with clear, structured output, reinforcement learning is used for complex decision making tasks with optimal strategies.