
  • Random Forest Algorithm in Machine Learning

    Random Forest is a machine learning algorithm that uses an ensemble of decision trees to make predictions. The algorithm was first introduced by Leo Breiman in 2001. The key idea behind the algorithm is to create a large number of decision trees, each of which is trained on a different subset of the data. The predictions of these individual trees are then combined to produce a final prediction.

    Working of Random Forest Algorithm

    We can understand the working of Random Forest algorithm with the help of following steps −

    • Step 1 − First, start with the selection of random samples from a given dataset.
    • Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the prediction result from every decision tree.
    • Step 3 − In this step, voting will be performed for every predicted result.
    • Step 4 − At last, select the most voted prediction result as the final prediction result.

    The following diagram illustrates how the Random Forest Algorithm works −

    Random Forest Algorithm

    Random Forest is a flexible algorithm that can be used for both classification and regression tasks. In classification tasks, the algorithm uses the mode of the predictions of the individual trees to make the final prediction. In regression tasks, the algorithm uses the mean of the predictions of the individual trees.
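For instance, the following is a minimal sketch (using scikit-learn and a tiny made-up dataset, purely for illustration) of how a classifier aggregates the trees' votes by majority while a regressor averages their outputs −

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Tiny made-up dataset: two features, four samples
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y_class = [0, 1, 0, 1]          # class labels: final prediction is the mode of the trees
y_reg = [0.2, 0.9, 0.4, 0.7]    # continuous targets: final prediction is the mean of the trees

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_class)
reg = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y_reg)

print(clf.predict([[0.9, 0.8]]))   # majority vote over the individual trees
print(reg.predict([[0.9, 0.8]]))   # average over the individual trees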

    Advantages of Random Forest Algorithm

    Random Forest algorithm has several advantages over other machine learning algorithms. Some of the key advantages are −

    • Robustness to Overfitting − Random Forest algorithm is known for its robustness to overfitting. This is because the algorithm uses an ensemble of decision trees, which helps to reduce the impact of outliers and noise in the data.
    • High Accuracy − Random Forest algorithm is known for its high accuracy. This is because the algorithm combines the predictions of multiple decision trees, which helps to reduce the impact of individual decision trees that may be biased or inaccurate.
    • Handles Missing Data − Random Forest algorithm can handle missing data without the need for imputation. This is because the algorithm only considers the features that are available for each data point and does not require all features to be present for all data points.
    • Non-Linear Relationships − Random Forest algorithm can handle non-linear relationships between the features and the target variable. This is because the algorithm uses decision trees, which can model non-linear relationships.
    • Feature Importance − Random Forest algorithm can provide information about the importance of each feature in the model. This information can be used to identify the most important features in the data and can be used for feature selection and feature engineering.
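To illustrate the last point, here is a minimal sketch (using scikit-learn's built-in iris loader rather than the CSV file used in the implementation below) that reads the feature_importances_ attribute of a fitted forest −

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Fit a forest on the iris data and inspect which features matter most
data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(data.data, data.target)
importances = pd.Series(forest.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False))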

    Implementation of Random Forest Algorithm in Python

    Let’s take a look at the implementation of Random Forest Algorithm in Python. We will be using the scikit-learn library to implement the algorithm. The scikit-learn library is a popular machine learning library that provides a wide range of algorithms and tools for machine learning.

    Step 1 − Importing the Libraries

    We will begin by importing the necessary libraries. We will be using the pandas library for data manipulation, and the scikit-learn library for implementing the Random Forest algorithm.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    

    Step 2 − Loading the Data

    Next, we will load the data into a pandas dataframe. For this tutorial, we will be using the famous Iris dataset, which is a classic dataset for classification tasks.

    # Loading the iris dataset
    
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
    
    iris.columns =['sepal_length','sepal_width','petal_length','petal_width','species']

    Step 3 − Data Preprocessing

    Before we can use the data to train our model, we need to preprocess it. This involves separating the features and the target variable and splitting the data into training and testing sets.

    # Separating the features and target variable
    X = iris.iloc[:,:-1]
y = iris.iloc[:, -1]

# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)

    Step 4 − Training the Model

    Next, we will train our Random Forest classifier on the training data.

    # Creating the Random Forest classifier object
rfc = RandomForestClassifier(n_estimators=100)

# Training the model on the training data
    rfc.fit(X_train, y_train)

    Step 5 − Making Predictions

    Once we have trained our model, we can use it to make predictions on the test data.

    # Making predictions on the test data
    y_pred = rfc.predict(X_test)

    Step 6 − Evaluating the Model

    Finally, we will evaluate the performance of our model using various metrics such as accuracy, precision, recall, and F1-score.

# Importing the metrics library
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    
    # Calculating the accuracy, precision, recall, and F1-score
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

    Complete Implementation Example

Below is the complete implementation example of the Random Forest algorithm in Python using the iris dataset −

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
# Loading the iris dataset
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Separating the features and target variable
X = iris.iloc[:, :-1]
y = iris.iloc[:, -1]

# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)

# Creating the Random Forest classifier object
rfc = RandomForestClassifier(n_estimators=100)

# Training the model on the training data
rfc.fit(X_train, y_train)

# Making predictions on the test data
y_pred = rfc.predict(X_test)

# Importing the metrics library
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculating the accuracy, precision, recall, and F1-score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

    Output

    This will give us the performance metrics of our Random Forest classifier as follows −

    Accuracy: 0.9811320754716981
    Precision: 0.9821802935010483
    Recall: 0.9811320754716981
    F1-score: 0.9811157396063056
    

    Pros and Cons of Random Forest

    Pros

    The following are the advantages of Random Forest algorithm −

• It overcomes the problem of overfitting by averaging or combining the results of different decision trees.
• Random forests work well for a larger range of data items than a single decision tree does.
• A random forest has less variance than a single decision tree.
• Random forests are very flexible and possess very high accuracy.
• Scaling of data is not required in the random forest algorithm. It maintains good accuracy even when data is provided without scaling.

    Cons

    The following are the disadvantages of Random Forest algorithm −

• Complexity is the main disadvantage of Random Forest algorithms.
• Construction of random forests is much harder and more time-consuming than that of decision trees.
• More computational resources are required to implement the Random Forest algorithm.
• It is less intuitive when we have a large collection of decision trees.
• The prediction process using random forests is very time-consuming in comparison with other algorithms.
  • Support Vector Machine (SVM) in Machine Learning

    What is Support Vector Machine (SVM)

Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms used for both classification and regression, though they are generally applied to classification problems. SVMs were first introduced in the 1960s and were later refined in the 1990s. They have a unique way of implementation compared to other machine learning algorithms. Nowadays, they are extremely popular because of their ability to handle multiple continuous and categorical variables.

    Working of SVM

    The goal of SVM is to find a hyperplane that separates the data points into different classes. A hyperplane is a line in 2D space, a plane in 3D space, or a higher-dimensional surface in n-dimensional space. The hyperplane is chosen in such a way that it maximizes the margin, which is the distance between the hyperplane and the closest data points of each class. The closest data points are called the support vectors.

    The distance between the hyperplane and a data point “x” can be calculated using the formula −

    distance =(w . x + b)/||w||

    where “w” is the weight vector, “b” is the bias term, and “||w||” is the Euclidean norm of the weight vector. The weight vector “w” is perpendicular to the hyperplane and determines its orientation, while the bias term “b” determines its position.

    The optimal hyperplane is found by solving an optimization problem, which is to maximize the margin subject to the constraint that all data points are correctly classified. In other words, we want to find the hyperplane that maximizes the margin between the two classes while ensuring that no data point is misclassified. This is a convex optimization problem that can be solved using quadratic programming.

If the data points are not linearly separable, we can use a technique called the kernel trick to map the data points into a higher-dimensional space where they become separable. The kernel function computes the inner product between the mapped data points without computing the mapping itself. This allows us to work with the data points in the higher-dimensional space without incurring the computational cost of mapping them.
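To make the kernel trick concrete, the following small sketch (with made-up points, purely illustrative) compares an explicit quadratic feature map with the equivalent degree-2 polynomial kernel computed directly in the original space −

import numpy as np

def phi(v):
   # Explicit quadratic feature map for a 2-D point (x1, x2)
   x1, x2 = v
   return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

explicit = np.dot(phi(a), phi(b))   # inner product in the mapped space
kernel = np.dot(a, b) ** 2          # kernel K(a, b) = (a . b)^2, no explicit mapping needed

print(explicit, kernel)             # both print 16.0

Both computations give the same value, which is why the kernel lets us work in the higher-dimensional space without ever constructing it.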

    Let’s understand it in detail with the help of following diagram −

    Working Of Svm

    Given below are the important concepts in SVM −

• Support Vectors − Data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.
• Hyperplane − As we can see in the above diagram, it is a decision plane or space that divides a set of objects having different classes.
• Margin − It may be defined as the gap between two lines drawn through the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.

    Implementing SVM Using Python

    For implementing SVM in Python we will start with the standard libraries import as follows −

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats
    import seaborn as sns; sns.set()

Next, we create a sample dataset having linearly separable data, using the make_blobs function from sklearn.datasets, for classification using SVM −

    from sklearn.datasets import make_blobs
    X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.50)
    plt.scatter(X[:,0], X[:,1], c=y, s=50, cmap='summer');

    The following would be the output after generating sample dataset having 100 samples and 2 clusters −

    SVM Plotting blobs of datapoints

We know that SVM supports discriminative classification: it divides the classes from each other by simply finding a line in the case of two dimensions, or a manifold in the case of multiple dimensions. It is implemented on the above dataset as follows −

xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
plt.plot([0.6], [2.1], 'x', color='black', markeredgewidth=4, markersize=12)

for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:
   plt.plot(xfit, m * xfit + b, '-k')
plt.xlim(-1, 3.5);

    The output is as follows −

    SVM plotting line/ hyperplane

    We can see from the above output that there are three different separators that perfectly discriminate the above samples.

As discussed, the main goal of SVM is to divide the datasets into classes to find a maximum marginal hyperplane (MMH). Hence, rather than drawing a zero-width line between the classes, we can draw around each line a margin of some width up to the nearest point. It can be done as follows −

xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')

for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
   yfit = m * xfit + b
   plt.plot(xfit, yfit, '-k')
   plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
         color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5);

Plotting Maximum Marginal Hyperplane

    From the above image in output, we can easily observe the “margins” within the discriminative classifiers. SVM will choose the line that maximizes the margin.

    Next, we will use Scikit-Learn’s support vector classifier to train an SVM model on this data. Here, we are using linear kernel to fit SVM as follows −

    from sklearn.svm import SVC # "Support vector classifier"
    model = SVC(kernel='linear', C=1E10)
    model.fit(X, y)

    The output is as follows −

    SVC(C=10000000000.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

    Now, for a better understanding, the following will plot the decision functions for 2D SVC −

def decision_function(model, ax=None, plot_support=True):
   if ax is None:
      ax = plt.gca()
   xlim = ax.get_xlim()
   ylim = ax.get_ylim()

For evaluating the model, we need to create a grid as follows −

   x = np.linspace(xlim[0], xlim[1], 30)
   y = np.linspace(ylim[0], ylim[1], 30)
   Y, X = np.meshgrid(y, x)
   xy = np.vstack([X.ravel(), Y.ravel()]).T
   P = model.decision_function(xy).reshape(X.shape)

    Next, we need to plot decision boundaries and margins as follows −

   ax.contour(X, Y, P, colors='k',
      levels=[-1, 0, 1], alpha=0.5,
      linestyles=['--', '-', '--'])

    Now, similarly plot the support vectors as follows −

   if plot_support:
      ax.scatter(model.support_vectors_[:, 0],
         model.support_vectors_[:, 1],
         s=300, linewidth=1, facecolors='none');
   ax.set_xlim(xlim)
   ax.set_ylim(ylim)

Now, use this function to plot the decision boundaries of our fitted model as follows −

    plt.scatter(X[:,0], X[:,1], c=y, s=50, cmap='summer')
    decision_function(model);
    SVM Best Fit Hyperplane

We can observe from the above output that an SVM classifier is fit to the data, with margins (the dashed lines) and with the support vectors, the pivotal elements of this fit, touching the dashed lines. These support vector points are stored in the support_vectors_ attribute of the classifier, as follows −

    model.support_vectors_
    

    The output is as follows −

    array([[0.5323772 , 3.31338909],
       [2.11114739, 3.57660449],
       [1.46870582, 1.86947425]])
    

    SVM Kernels

In practice, the SVM algorithm is implemented with a kernel that transforms the input data space into the required form. SVM uses a technique called the kernel trick, in which the kernel takes a low-dimensional input space and transforms it into a higher-dimensional space. In simple words, the kernel converts non-separable problems into separable problems by adding more dimensions. It makes SVM more powerful, flexible and accurate. The following are some of the types of kernels used by SVM −

    Linear Kernel

    It can be used as a dot product between any two observations. The formula of linear kernel is as below −

    k(x,xi) = sum(x*xi)

From the above formula, we can see that the product between two vectors, say x and xi, is the sum of the multiplication of each pair of input values.

    Polynomial Kernel

It is a more generalized form of the linear kernel and can distinguish curved or nonlinear input spaces. Following is the formula for the polynomial kernel −

    K(x, xi) = 1 + sum(x * xi)^d

    Here d is the degree of polynomial, which we need to specify manually in the learning algorithm.

    Radial Basis Function (RBF) Kernel

The RBF kernel, mostly used in SVM classification, maps the input space into an indefinite-dimensional space. The following formula explains it mathematically −

K(x, xi) = exp(-gamma * sum((x − xi)^2))

    Here, gamma ranges from 0 to 1. We need to manually specify it in the learning algorithm. A good default value of gamma is 0.1.

Having implemented SVM for linearly separable data, we can also implement it in Python for data that is not linearly separable. This can be done by using kernels.

    Example

    The following is an example for creating an SVM classifier by using kernels. We will be using iris dataset from scikit-learn −

    We will start by importing following packages −

    import pandas as pd
    import numpy as np
    from sklearn import svm, datasets
    import matplotlib.pyplot as plt
    

    Now, we need to load the input data −

    iris = datasets.load_iris()

    From this dataset, we are taking first two features as follows −

    X = iris.data[:,:2]
    y = iris.target
    

    Next, we will plot the SVM boundaries with original data as follows −

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max - x_min) / 100
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
   np.arange(y_min, y_max, h))
X_plot = np.c_[xx.ravel(), yy.ravel()]

    Now, we need to provide the value of regularization parameter as follows −

    C =1.0

    Next, SVM classifier object can be created as follows −

svc_classifier = svm.SVC(kernel='linear', C=C).fit(X, y)

    Z = svc_classifier.predict(X_plot)
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(15,5))
    plt.subplot(121)
    plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
    plt.scatter(X[:,0], X[:,1], c=y, cmap=plt.cm.Set1)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.title('Support Vector Classifier with linear kernel')

    Output

    Text(0.5, 1.0, 'Support Vector Classifier with linear kernel')
    
    Curve

    For creating SVM classifier with rbf kernel, we can change the kernel to rbf as follows −

svc_classifier = svm.SVC(kernel='rbf', gamma='auto', C=C).fit(X, y)
    Z = svc_classifier.predict(X_plot)
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(15,5))
    plt.subplot(121)
    plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
    plt.scatter(X[:,0], X[:,1], c=y, cmap=plt.cm.Set1)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.title('Support Vector Classifier with rbf kernel')

    Output

    Text(0.5, 1.0, 'Support Vector Classifier with rbf kernel')
    
    Classifier

We set the value of gamma to 'auto', but you can also provide a value between 0 and 1.

    Tuning SVM Parameters

    In practice, SVMs often require tuning of their parameters to achieve optimal performance. The most important parameters to tune are the kernel, the regularization parameter C, and the kernel-specific parameters.

    The kernel parameter determines the type of kernel to use. The most common kernel types are linear, polynomial, radial basis function (RBF), and sigmoid. The linear kernel is used for linearly separable data, while the other kernels are used for non-linearly separable data.

    The regularization parameter C controls the trade-off between maximizing the margin and minimizing the classification error. A higher value of C means that the classifier will try to minimize the classification error at the expense of a smaller margin, while a lower value of C means that the classifier will try to maximize the margin even if it means more misclassifications.

    The kernel-specific parameters depend on the type of kernel being used. For example, the polynomial kernel has parameters for the degree of the polynomial and the coefficient of the polynomial, while the RBF kernel has a parameter for the width of the Gaussian function.

    We can use cross-validation to tune the parameters of the SVM. Cross-validation involves splitting the data into several subsets and training the classifier on each subset while using the remaining subsets for testing. This allows us to evaluate the performance of the classifier on different subsets of the data and choose the best set of parameters.

    Example

from sklearn.model_selection import GridSearchCV

# define the parameter grid
param_grid = {'C': [0.1, 1, 10, 100],
   'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
   'degree': [2, 3, 4],
   'coef0': [0.0, 0.1, 0.5],
   'gamma': ['scale', 'auto']}

# create an SVM classifier
svm = SVC()

# perform grid search to find the best set of parameters
grid_search = GridSearchCV(svm, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# print the best set of parameters and their accuracy
print("Best parameters:", grid_search.best_params_)
print("Best accuracy:", grid_search.best_score_)

    We start by importing the GridSearchCV module from scikit-learn, which is a tool for performing grid search on a set of parameters. We define a parameter grid that contains the possible values for each parameter we want to tune.

    We create an SVM classifier using SVC() and then pass it to GridSearchCV along with the parameter grid and the number of cross-validation folds (cv=5). We then call grid_search.fit(X_train, y_train) to perform the grid search.

    Once the grid search is complete, we print the best set of parameters and their accuracy using grid_search.best_params_ and grid_search.best_score_, respectively.

    Output

    On executing this program, you will get the following output −

    Best parameters: {'C': 0.1, 'coef0': 0.5, 'degree': 3, 'gamma': 'scale', 'kernel': 'poly'}
    Best accuracy: 0.975
    

    This means that the best set of parameters found by the grid search are: C=0.1, coef0=0.5, degree=3, gamma=scale, and kernel=poly. The accuracy achieved by this set of parameters on the training set is 97.5%.

    You can now use these parameters to create a new SVM classifier and test its performance on the testing set.
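As a brief hedged sketch of that final step (the iris split below is only a stand-in assumption, since the grid-search snippet above does not show how its training data was prepared), one might do −

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Stand-in train/test split; reuse your own X_train, X_test, y_train, y_test in practice
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build a classifier with the parameters reported by the grid search above
best_svc = SVC(C=0.1, kernel='poly', degree=3, coef0=0.5, gamma='scale')
best_svc.fit(X_train, y_train)

# Evaluate the tuned model on the held-out test set
y_pred = best_svc.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))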

    Pros and Cons of SVM Classifiers

    Pros of SVM classifiers

SVM classifiers offer great accuracy and work well in high-dimensional spaces. SVM classifiers basically use a subset of the training points (the support vectors), so they use very little memory.

    Cons of SVM classifiers

They have a high training time, hence in practice they are not suitable for large datasets. Another disadvantage is that SVM classifiers do not work well with overlapping classes.

  • Decision Trees Algorithm in Machine Learning

    Decision Tree Algorithm

    The decision tree algorithm is a hierarchical tree-based algorithm that is used to classify or predict outcomes based on a set of rules. It works by splitting the data into subsets based on the values of the input features. The algorithm recursively splits the data until it reaches a point where the data in each subset belongs to the same class or has the same value for the target variable. The resulting tree is a set of decision rules that can be used to make predictions or classify new data.

    The Decision Tree algorithm works by selecting the best feature to split the data at each node. The best feature is the one that provides the most information gain or the most reduction in entropy. Information gain is a measure of the amount of information gained by splitting the data at a particular feature, while entropy is a measure of the randomness or disorder in the data. The algorithm uses these measures to determine the best feature to split the data at each node.
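As a rough illustration of these two measures (a minimal sketch with made-up class labels, not part of the scikit-learn implementation shown later), entropy and information gain for a candidate split can be computed as follows −

import numpy as np

def entropy(labels):
   # Shannon entropy of a list of class labels
   _, counts = np.unique(labels, return_counts=True)
   p = counts / counts.sum()
   return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
   # Reduction in entropy obtained by splitting the parent node into left and right
   n = len(parent)
   weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
   return entropy(parent) - weighted

parent = [0, 0, 0, 1, 1, 1]          # mixed node
left, right = [0, 0, 0], [1, 1, 1]   # a perfect split
print(information_gain(parent, left, right))   # 1.0 bit for this toy example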

An example of a binary tree for predicting whether a person is fit or unfit, given various information like age, eating habits and exercise habits, is given below −

    Decision Tree Algorithm

In the above decision tree, the questions are decision nodes and the final outcomes are leaves.

    Types of Decision Tree Algorithm

    There are two main types of Decision Tree algorithm −

    • Classification Tree − A classification tree is used to classify data into different classes or categories. It works by splitting the data into subsets based on the values of the input features and assigning each subset to a different class.
    • Regression Tree − A regression tree is used to predict numerical values or continuous variables. It works by splitting the data into subsets based on the values of the input features and assigning each subset a numerical value.

    Implementing Decision Tree Algorithm

    Gini Index

It is the name of the cost function that is used to evaluate the binary splits in the dataset and works with the categorical target variable "Success" or "Failure".

The lower the value of the Gini index, the higher the homogeneity. A perfect Gini index value is 0 and the worst is 0.5 (for a 2-class problem). The Gini index for a split can be calculated with the help of the following steps −

• First, calculate the Gini index for the sub-nodes by using the formula 1 − (p^2 + q^2), where p^2 + q^2 is the sum of the squared probabilities of success and failure.
    • Next, calculate Gini index for split using weighted Gini score of each node of that split.

    Classification and Regression Tree (CART) algorithm uses Gini method to generate binary splits.
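The following minimal sketch (illustrative only, with made-up group labels) shows how the weighted Gini index of a binary split described in the steps above can be computed −

def gini_index(groups, classes):
   # groups: the two lists of class labels produced by a candidate split
   n_instances = sum(len(group) for group in groups)
   gini = 0.0
   for group in groups:
      size = len(group)
      if size == 0:
         continue
      # 1 minus the sum of squared class proportions, weighted by group size
      score = sum((group.count(c) / size) ** 2 for c in classes)
      gini += (1.0 - score) * (size / n_instances)
   return gini

# A pure split scores 0.0; a completely mixed split scores 0.5
print(gini_index([[0, 0], [1, 1]], [0, 1]))   # 0.0
print(gini_index([[0, 1], [0, 1]], [0, 1]))   # 0.5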

    Split Creation

    A split is basically including an attribute in the dataset and a value. We can create a split in dataset with the help of following three parts −

    • Part1: Calculating Gini Score − We have just discussed this part in the previous section.
    • Part2: Splitting a dataset − It may be defined as separating a dataset into two lists of rows having index of an attribute and a split value of that attribute. After getting the two groups – right and left, from the dataset, we can calculate the value of split by using Gini score calculated in first part. Split value will decide in which group the attribute will reside.
    • Part3: Evaluating all splits − Next part after finding Gini score and splitting dataset is the evaluation of all splits. For this purpose, first, we must check every value associated with each attribute as a candidate split. Then we need to find the best possible split by evaluating the cost of the split. The best split will be used as a node in the decision tree.

    Building a Tree

    As we know that a tree has root node and terminal nodes. After creating the root node, we can build the tree by following two parts −

    Part1: Terminal node creation

    While creating terminal nodes of decision tree, one important point is to decide when to stop growing tree or creating further terminal nodes. It can be done by using two criteria namely maximum tree depth and minimum node records as follows −

• Maximum Tree Depth − As the name suggests, this is the maximum depth of the tree, i.e. the maximum number of node levels below the root node. We must stop adding terminal nodes once a tree has reached this maximum depth.
• Minimum Node Records − It may be defined as the minimum number of training patterns that a given node is responsible for. We must stop adding terminal nodes once the tree reaches this minimum number of node records or falls below it.

    Terminal node is used to make a final prediction.

    Part2: Recursive Splitting

    As we understood about when to create terminal nodes, now we can start building our tree. Recursive splitting is a method to build the tree. In this method, once a node is created, we can create the child nodes (nodes added to an existing node) recursively on each group of data, generated by splitting the dataset, by calling the same function again and again.

    Prediction

After building a decision tree, we need to use it to make predictions. Basically, prediction involves navigating the decision tree with the specifically provided row of data.

We can make a prediction with the help of a recursive function, as above. The same prediction routine is called again with the left or the right child nodes.
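A minimal sketch of such a recursive prediction routine (assuming a tree represented as nested dictionaries with 'index', 'value', 'left' and 'right' keys, one common from-scratch representation, not the scikit-learn one) is shown below −

def predict(node, row):
   # Navigate left or right depending on the node's split feature and split value
   if row[node['index']] < node['value']:
      if isinstance(node['left'], dict):
         return predict(node['left'], row)
      return node['left']    # terminal node: return its class
   else:
      if isinstance(node['right'], dict):
         return predict(node['right'], row)
      return node['right']

# Toy tree: split on feature 0 at value 2.5, with terminal classes 0 (left) and 1 (right)
tree = {'index': 0, 'value': 2.5, 'left': 0, 'right': 1}
print(predict(tree, [1.8]))   # 0
print(predict(tree, [3.4]))   # 1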

    Assumptions

    The following are some of the assumptions we make while creating decision tree −

• While preparing decision trees, the training set acts as the root node.
• The decision tree classifier prefers the feature values to be categorical. If you want to use continuous values, they must be discretized prior to model building.
• Based on the attribute values, the records are recursively distributed.
• A statistical approach is used to place attributes at any node position, i.e. as root node or internal node.

    Implementation in Python

    Let’s implement the Decision Tree algorithm in Python using a popular dataset for classification tasks named Iris dataset. It contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The flowers belong to three classes: setosa, versicolor, and virginica.

    First, we will import the necessary libraries and load the dataset −

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    
    # Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data,
   iris.target, test_size=0.3, random_state=0)

    We then create an instance of the Decision Tree classifier and train it on the training set −

    # Create a Decision Tree classifier
dtc = DecisionTreeClassifier()

# Fit the classifier to the training data
dtc.fit(X_train, y_train)

    We can now use the trained classifier to make predictions on the testing set −

    # Make predictions on the testing data
    y_pred = dtc.predict(X_test)

    We can evaluate the performance of the classifier by calculating its accuracy −

    # Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

    We can visualize the Decision Tree using Matplotlib library −

    import matplotlib.pyplot as plt
    from sklearn.tree import plot_tree
    
    # Visualize the Decision Tree using Matplotlib
    plt.figure(figsize=(20,10))
    plot_tree(dtc, filled=True, feature_names=iris.feature_names,
    class_names=iris.target_names)
    plt.show()

    The plot_tree function from the sklearn.tree module can be used to plot the Decision Tree. We can pass in the trained Decision Tree classifier, the filled argument to fill the nodes with color, the feature_names argument to label the features, and the class_names argument to label the target classes. We also specify the figsize argument to set the size of the figure and call the show function to display the plot.

    Complete Implementation Example

Given below is the complete implementation example of the Decision Tree classification algorithm in Python using the iris dataset −

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    
    # Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

# Create a Decision Tree classifier
dtc = DecisionTreeClassifier()

# Fit the classifier to the training data
dtc.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = dtc.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

# Visualize the Decision Tree using Matplotlib
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
plot_tree(dtc, filled=True, feature_names=iris.feature_names,
   class_names=iris.target_names)
plt.show()

    Output

    This will create a plot of the Decision Tree that looks like this −

    Plot Of Decision Tree
    Accuracy: 0.9777777777777777
    

    As you can see, the plot shows the structure of the Decision Tree, with each node representing a decision based on the value of a feature, and each leaf node representing a class or numerical value. The color of each node indicates the majority class or value of the samples in that node, and the numbers at the bottom indicate the number of samples that reach that node.

• Naïve Bayes Algorithm in Machine Learning

What is Naïve Bayes Algorithm?

The Naïve Bayes algorithm is a classification algorithm based on Bayes' theorem. The algorithm assumes that the features are independent of each other, which is why it is called "naive." It calculates the probability of a sample belonging to a particular class based on the probabilities of its features. For example, a phone may be considered smart if it has a touch screen, internet facility, a good camera, etc. Even if these features depend on each other, each of them independently contributes to the probability that the phone is a smart phone.

    In Bayesian classification, the main interest is to find the posterior probabilities i.e. the probability of a label given some observed features, P(L | features). With the help of Bayes theorem, we can express this in quantitative form as follows −

P(L|features) = P(L) P(features|L) / P(features)

    Here,

    • P(L|features) is the posterior probability of class.
    • P(L) is the prior probability of class.
    • P(features|L) is the likelihood which is the probability of predictor given class.
    • P(features) is the prior probability of predictor.

    In the Naive Bayes algorithm, we use Bayes’ theorem to calculate the probability of a sample belonging to a particular class. We calculate the probability of each feature of the sample given the class and multiply them to get the likelihood of the sample belonging to the class. We then multiply the likelihood with the prior probability of the class to get the posterior probability of the sample belonging to the class. We repeat this process for each class and choose the class with the highest probability as the class of the sample.
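To make this calculation concrete, here is a from-scratch sketch (with made-up Gaussian parameters and a single continuous feature, purely illustrative) of choosing the class with the highest posterior −

import numpy as np

def gaussian_likelihood(x, mean, var):
   # Probability density of x under a Gaussian with the given mean and variance
   return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Made-up per-class statistics for one feature, plus class priors
classes = {
   'A': {'prior': 0.6, 'mean': 2.0, 'var': 1.0},
   'B': {'prior': 0.4, 'mean': 5.0, 'var': 1.5},
}

x = 3.0   # new sample
# The shared denominator P(features) is the same for every class, so it can be ignored
posteriors = {c: p['prior'] * gaussian_likelihood(x, p['mean'], p['var'])
              for c, p in classes.items()}
print(max(posteriors, key=posteriors.get))   # class with the highest posterior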

    Types of Naive Bayes Algorithm

    There are many types of Naive Bayes Algorithm. Here we discuss the following three types −

Gaussian Naïve Bayes

Gaussian Naïve Bayes is the simplest Naïve Bayes classifier, with the assumption that the data from each label is drawn from a simple Gaussian distribution. It is used when the features are continuous variables that follow a normal distribution.

Multinomial Naïve Bayes

Another useful Naïve Bayes classifier is Multinomial Naïve Bayes, in which the features are assumed to be drawn from a simple multinomial distribution. This kind of Naïve Bayes is most appropriate for features that represent discrete counts. It is commonly used in text classification tasks where the features are the frequencies of words in a document.

Bernoulli Naïve Bayes

Another important model is Bernoulli Naïve Bayes, in which the features are assumed to be binary (0s and 1s). Text classification with the 'bag of words' model can be an application of Bernoulli Naïve Bayes.
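As a quick hedged sketch (tiny made-up arrays, for illustration only), scikit-learn exposes the three variants through the same fit/predict interface −

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X_cont = np.array([[1.2, 0.7], [0.3, 2.1], [2.5, 0.1], [0.2, 1.8]])   # continuous features
X_counts = np.array([[3, 0, 1], [0, 2, 4], [5, 1, 0], [0, 3, 2]])     # word counts
X_binary = (X_counts > 0).astype(int)                                 # word presence/absence
y = np.array([0, 1, 0, 1])

print(GaussianNB().fit(X_cont, y).predict([[1.0, 0.5]]))
print(MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]]))
print(BernoulliNB().fit(X_binary, y).predict([[1, 0, 1]]))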

Implementation of Naïve Bayes Algorithm in Python

Depending on our data set, we can choose any of the Naïve Bayes models explained above. Here, we are implementing the Gaussian Naïve Bayes model in Python −

    We will start with required imports as follows −

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()

Now, by using the make_blobs() function of Scikit-learn, we can generate blobs of points with a Gaussian distribution as follows −

    from sklearn.datasets import make_blobs
    X, y = make_blobs(300,2, centers=2, random_state=2, cluster_std=1.5)
    plt.scatter(X[:,0], X[:,1], c=y, s=50, cmap='summer');
    Blobs of Points with Gaussian Distribution

    Next, for using GaussianNB model, we need to import and make its object as follows −

    from sklearn.naive_bayes import GaussianNB
    model_GNB = GaussianNB()
    model_GNB.fit(X, y);

    Now, we have to do prediction. It can be done after generating some new data as follows −

    rng = np.random.RandomState(0)
    Xnew =[-6,-14]+[14,18]* rng.rand(2000,2)
    ynew = model_GNB.predict(Xnew)

    Next, we are plotting new data to find its boundaries −

    plt.scatter(X[:,0], X[:,1], c=y, s=50, cmap='summer')
    lim = plt.axis()
    plt.scatter(Xnew[:,0], Xnew[:,1], c=ynew, s=20, cmap='summer', alpha=0.1)
    plt.axis(lim);
    Plotting the prediction with new data

    Now, with the help of following line of codes, we can find the posterior probabilities of first and second label −

    yprob = model_GNB.predict_proba(Xnew)
    yprob[-10:].round(3)

    Output

    array([[0.998, 0.002],
       [1.   , 0.   ],
       [0.987, 0.013],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [0.   , 1.   ],
       [0.986, 0.014]]
    )
    

Pros & Cons of Naïve Bayes classification

    Let’s discuss some of the advantages and limitations of Naive Bayes classification algorithm.

    Pros

The following are some pros of using Naïve Bayes classifiers −

• Naïve Bayes classification is easy to implement and fast.
• It will converge faster than discriminative models like logistic regression.
• It requires less training data.
• It is highly scalable in nature; it scales linearly with the number of predictors and data points.
• It can make probabilistic predictions and can handle continuous as well as discrete data.
• The Naïve Bayes classification algorithm can be used for both binary and multi-class classification problems.

    Cons

The following are some cons of using Naïve Bayes classifiers −

• One of the most important cons of Naïve Bayes classification is its strong feature-independence assumption, because in real life it is almost impossible to have a set of features that are completely independent of each other.
• Another issue with Naïve Bayes classification is the 'zero frequency' problem: if a categorical variable has a category that was not observed in the training data set, then the Naïve Bayes model will assign a zero probability to it and will be unable to make a prediction.

Applications of Naïve Bayes classification

The following are some common applications of Naïve Bayes classification −

Real-time prediction − Due to its ease of implementation and fast computation, it can be used to do prediction in real time.

Multi-class prediction − The Naïve Bayes classification algorithm can be used to predict the posterior probability of multiple classes of the target variable.

Text classification − Due to its multi-class prediction capability, Naïve Bayes classification algorithms are well suited for text classification. That is why they are also used to solve problems like spam filtering and sentiment analysis.

Recommendation system − Along with algorithms like collaborative filtering, Naïve Bayes can be used to build a recommendation system that filters unseen information and predicts whether a user would like a given resource or not.

  • K-Nearest Neighbors (KNN) in Machine Learning

    K-Nearest Neighbors (KNN) Algorithm

    K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used for both classification as well as regression predictive problems. However, it is mainly used for classification predictive problems in industry. The main idea behind KNN is to find the k-nearest data points to a given test data point and use these nearest neighbors to make a prediction. The value of k is a hyperparameter that needs to be tuned, and it represents the number of neighbors to consider.

    For classification problems, the KNN algorithm assigns the test data point to the class that appears most frequently among the k-nearest neighbors. In other words, the class with the highest number of neighbors is the predicted class.

    For regression problems, the KNN algorithm assigns the test data point the average of the k-nearest neighbors’ values.

    The distance metric used to measure the similarity between two data points is an essential factor that affects the KNN algorithm’s performance. The most commonly used distance metrics are Euclidean distance, Manhattan distance, and Minkowski distance.
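A minimal sketch of these three metrics in plain NumPy (made-up points, illustrative only) −

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.5])

euclidean = np.sqrt(np.sum((a - b) ** 2))           # straight-line distance
manhattan = np.sum(np.abs(a - b))                   # sum of absolute differences
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)   # generalizes both (p=2 Euclidean, p=1 Manhattan)

print(euclidean, manhattan, minkowski)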

    The following two properties would define KNN well −

• Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase and uses all of the data for training during classification.
    • Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it doesn’t assume anything about the underlying data.

    How Does K-Nearest Neighbors Algorithm Work?

    K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of new datapoints which further means that the new data point will be assigned a value based on how closely it matches the points in the training set. We can understand its working with the help of following steps −

    • Step 1 − For implementing any algorithm, we need dataset. So during the first step of KNN, we must load the training as well as test data.
    • Step 2 − Next, we need to choose the value of K i.e. the nearest data points. K can be any integer.
• Step 3 − For each point in the test data do the following −
  • 3.1 − Calculate the distance between the test data and each row of the training data with the help of any of the methods, namely Euclidean, Manhattan or Hamming distance. The most commonly used method to calculate distance is Euclidean.
  • 3.2 − Now, based on the distance value, sort them in ascending order.
  • 3.3 − Next, choose the top K rows from the sorted array.
  • 3.4 − Now, assign a class to the test point based on the most frequent class of these rows.
    • Step 4 − End
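These steps can be condensed into a short from-scratch sketch (a toy illustration with made-up points, not the scikit-learn implementation used later) −

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
   # Step 3.1: Euclidean distance from the test point to every training row
   distances = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
   # Steps 3.2 and 3.3: indices of the k nearest neighbors
   nearest = np.argsort(distances)[:k]
   # Step 3.4: majority vote among those neighbors
   return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [2, 2], [8, 8], [9, 9]])
y_train = np.array(['blue', 'blue', 'red', 'red'])
print(knn_predict(X_train, y_train, np.array([8.5, 8.0]), k=3))   # 'red'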

    Example

    The following is an example to understand the concept of K and working of KNN algorithm −

    Suppose we have a dataset which can be plotted as follows −

    Violate

    Now, we need to classify new data point with black dot (at point 60,60) into blue or red class. We are assuming K = 3 i.e. it would find three nearest data points. It is shown in the next diagram −

    Circle

We can see in the above diagram the three nearest neighbors of the data point with the black dot. Among those three, two of them lie in the red class, hence the black dot will also be assigned to the red class.

    Building a K Nearest Neighbors Model

    We can follow the below steps to build a KNN model −

    • Load the data − The first step is to load the dataset into memory. This can be done using various libraries such as pandas or numpy.
    • Split the data − The next step is to split the data into training and test sets. The training set is used to train the KNN algorithm, while the test set is used to evaluate its performance.
    • Normalize the data − Before training the KNN algorithm, it is essential to normalize the data to ensure that each feature contributes equally to the distance metric calculation.
    • Calculate distances − Once the data is normalized, the KNN algorithm calculates the distances between the test data point and each data point in the training set.
    • Select k-nearest neighbors − The KNN algorithm selects the k-nearest neighbors based on the distances calculated in the previous step.
    • Make a prediction − For classification problems, the KNN algorithm assigns the test data point to the class that appears most frequently among the k-nearest neighbors. For regression problems, the KNN algorithm assigns the test data point the average of the k-nearest neighbors’ values.
    • Evaluate performance − Finally, the KNN algorithm’s performance is evaluated using various metrics such as accuracy, precision, recall, and F1-score.

    Implementation of KNN Algorithm in Python

    As we know K-nearest neighbors (KNN) algorithm can be used for both classification as well as regression. The following are the recipes in Python to use KNN as classifier as well as regressor −

    KNN as Classifier

    First, start with importing necessary python packages −

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    

    Next, download the iris dataset from its weblink as follows −

    path ="https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

    Next, we need to assign column names to the dataset as follows −

    headernames =['sepal-length','sepal-width','petal-length','petal-width','Class']

    Now, we need to read dataset to pandas dataframe as follows −

    dataset = pd.read_csv(path, names=headernames)
    dataset.head()
slno.   sepal-length   sepal-width   petal-length   petal-width   Class
0       5.1            3.5           1.4            0.2           Iris-setosa
1       4.9            3.0           1.4            0.2           Iris-setosa
2       4.7            3.2           1.3            0.2           Iris-setosa
3       4.6            3.1           1.5            0.2           Iris-setosa
4       5.0            3.6           1.4            0.2           Iris-setosa

    Data Preprocessing will be done with the help of following script lines −

    X = dataset.iloc[:,:-1].values
    y = dataset.iloc[:,4].values
    

Next, we will divide the data into train and test splits. The following code will split the dataset into 60% training data and 40% testing data −

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)

    Next, data scaling will be done as follows −

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    Next, train the model with the help of KNeighborsClassifier class of sklearn as follows −

    from sklearn.neighbors import KNeighborsClassifier
    classifier = KNeighborsClassifier(n_neighbors=8)
    classifier.fit(X_train, y_train)

    At last we need to make prediction. It can be done with the help of following script −

    y_pred = classifier.predict(X_test)

    Next, print the results as follows −

    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)

result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)

result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)

    Output

    Confusion Matrix:
    [[21 0 0]
    [ 0 16 0]
    [ 0 7 16]]
    Classification Report:
                precision      recall       f1-score       support
    Iris-setosa       1.00        1.00         1.00          21
    Iris-versicolor   0.70        1.00         0.82          16
    Iris-virginica    1.00        0.70         0.82          23
    micro avg         0.88        0.88         0.88          60
    macro avg         0.90        0.90         0.88          60
    weighted avg      0.92        0.88         0.88          60
    
    
    Accuracy: 0.8833333333333333
    

    KNN as Regressor

    First, start with importing necessary Python packages −

    import numpy as np
    import pandas as pd
    

    Next, download the iris dataset from its weblink as follows −

    path ="https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

    Next, we need to assign column names to the dataset as follows −

    headernames =['sepal-length','sepal-width','petal-length','petal-width','Class']

    Now, we need to read dataset to pandas dataframe as follows −

data = pd.read_csv(path, names=headernames)
array = data.values
X = array[:, :2]
y = array[:, 2]
data.shape

Output: (150, 5)

    Next, import KNeighborsRegressor from sklearn to fit the model −

    from sklearn.neighbors import KNeighborsRegressor
    knnr = KNeighborsRegressor(n_neighbors=10)
    knnr.fit(X, y)

    At last, we can find the MSE as follows −

    print("The MSE is:",format(np.power(y-knnr.predict(X),2).mean()))

    Output

    The MSE is: 0.12226666666666669
    

    Pros and Cons of KNN

    Pros

    • It is very simple algorithm to understand and interpret.
    • It is very useful for nonlinear data because there is no assumption about data in this algorithm.
    • It is a versatile algorithm as we can use it for classification as well as regression.
    • It has relatively high accuracy but there are much better supervised learning models than KNN.

    Cons

• It is a computationally somewhat expensive algorithm because it stores all the training data.
• High memory storage is required as compared to other supervised learning algorithms.
• Prediction is slow in the case of a large N.
• It is very sensitive to the scale of the data as well as to irrelevant features.

    Applications of KNN

    The following are some of the areas in which KNN can be applied successfully −

    Banking System

KNN can be used in a banking system to predict whether an individual is fit for loan approval, i.e. whether that individual has characteristics similar to those of defaulters.

    Calculating Credit Ratings

    KNN algorithms can be used to find an individual’s credit rating by comparing with the persons having similar traits.

    Politics

With the help of KNN algorithms, we can classify a potential voter into various classes like "Will Vote", "Will Not Vote", "Will Vote for Party 'Congress'", or "Will Vote for Party 'BJP'".

    Other areas in which KNN algorithm can be used are Speech Recognition, Handwriting Detection, Image Recognition and Video Recognition.

  • Logistic Regression in Machine Learning

    Introduction to Logistic Regression

    Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. The nature of target or dependent variable is dichotomous, which means there would be only two possible classes.

    In simple words, the dependent variable is binary in nature having data coded as either 1 (stands for success/yes) or 0 (stands for failure/no).

    Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms that can be used for various classification problems such as spam detection, Diabetes prediction, cancer detection etc.

    Types of Logistic Regression

Generally, logistic regression means binary logistic regression having binary target variables, but there can also be target variables with more categories that can be predicted by it. Based on the number of categories, logistic regression can be divided into the following types −

    Binary or Binomial

In such a kind of classification, a dependent variable will have only two possible types, either 1 or 0. For example, these variables may represent success or failure, yes or no, win or loss, etc.

    Multinomial

    In such a kind of classification, dependent variable can have 3 or more possible unordered types or the types having no quantitative significance. For example, these variables may represent “Type A” or “Type B” or “Type C”.

    Ordinal

    In such a kind of classification, dependent variable can have 3 or more possible ordered types or the types having a quantitative significance. For example, these variables may represent “poor” or “good”, “very good”, “Excellent” and each category can have the scores like 0,1,2,3.

    Logistic Regression Assumptions

    Before diving into the implementation of logistic regression, we must be aware of the following assumptions about the same −

    • In case of binary logistic regression, the target variables must be binary always and the desired outcome is represented by the factor level 1.
    • There should not be any multi-collinearity in the model, which means the independent variables must be independent of each other .
    • We must include meaningful variables in our model.
    • We should choose a large sample size for logistic regression.

    Binary Logistic Regression Model

    The simplest form of logistic regression is binary or binomial logistic regression in which the target or dependent variable can have only 2 possible types either 1 or 0. It allows us to model a relationship between multiple predictor variables and a binary/binomial target variable. In case of logistic regression, the linear function is basically used as an input to another function such as in the following relation −

hθ(x) = g(θ^T x), where 0 ≤ hθ(x) ≤ 1

Here, g is the logistic or sigmoid function, which can be given as follows −

g(z) = 1 / (1 + e^(−z)), where z = θ^T x

The sigmoid curve can be represented with the help of the following graph. We can see that the values on the y-axis lie between 0 and 1 and that the curve crosses the axis at 0.5.

sigmoid curve

The classes can be divided into positive or negative. The output gives the probability of the positive class and lies between 0 and 1. For our implementation, we are interpreting the output of the hypothesis function as positive if it is ≥ 0.5, and negative otherwise.

We also need to define a loss function to measure how well the algorithm performs using the weights, represented by theta, as follows −

h = g(Xθ)

J(θ) = (1/m)·(−y^T log(h) − (1 − y)^T log(1 − h))

    Now, after defining the loss function our prime goal is to minimize the loss function. It can be done with the help of fitting the weights which means by increasing or decreasing the weights. With the help of derivatives of the loss function w.r.t each weight, we would be able to know what parameters should have high weight and what should have smaller weight.

    The following gradient descent equation tells us how loss would change if we modified the parameters −

∂J(θ)/∂θj = (1/m)·X^T (g(Xθ) − y)

    Implementation of Binary Logistic Regression Model in Python

    Now we will implement the above concept of binomial logistic regression in Python. For this purpose, we are using a multivariate flower dataset named iris which have 3 classes of 50 instances each, but we will be using the first two feature columns. Every class represents a type of iris flower.

    First, we need to import the necessary libraries as follows −

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn import datasets
    

    Next, load the iris dataset as follows −

iris = datasets.load_iris()
X = iris.data[:, :2]
y = (iris.target != 0) * 1

We can plot our training data as follows −

    plt.figure(figsize=(6,6))
    plt.scatter(X[y ==0][:,0], X[y ==0][:,1], color='g', label='0')
    plt.scatter(X[y ==1][:,0], X[y ==1][:,1], color='y', label='1')
    plt.legend();

    Iris Training Data

Next, we will define the sigmoid function, loss function and gradient descent as follows −

class LogisticRegression:
   def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, verbose=False):
      self.lr = lr
      self.num_iter = num_iter
      self.fit_intercept = fit_intercept
      self.verbose = verbose

   def __add_intercept(self, X):
      intercept = np.ones((X.shape[0], 1))
      return np.concatenate((intercept, X), axis=1)

   def __sigmoid(self, z):
      return 1 / (1 + np.exp(-z))

   def __loss(self, h, y):
      return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

   def fit(self, X, y):
      if self.fit_intercept:
         X = self.__add_intercept(X)

    Now, initialize the weights as follows −

      self.theta = np.zeros(X.shape[1])
      for i in range(self.num_iter):
         z = np.dot(X, self.theta)
         h = self.__sigmoid(z)
         gradient = np.dot(X.T, (h - y)) / y.size
         self.theta -= self.lr * gradient

         z = np.dot(X, self.theta)
         h = self.__sigmoid(z)
         loss = self.__loss(h, y)
         if (self.verbose == True and i % 10000 == 0):
            print(f'loss: {loss} \t')

    With the help of the following script, we can predict the output probabilities −

    def predict_prob(self, X):
       if self.fit_intercept:
          X = self.__add_intercept(X)
       return self.__sigmoid(np.dot(X, self.theta))
    def predict(self, X):
       return self.predict_prob(X).round()

    Next, we can evaluate the model and plot it as follows −

    model = LogisticRegression(lr=0.1, num_iter=300000)
    model.fit(X, y)   # train the classifier on the data before predicting
    preds = model.predict(X)
    (preds == y).mean()
    
    plt.figure(figsize=(10, 6))
    plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='g', label='0')
    plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='y', label='1')
    plt.legend()
    x1_min, x1_max = X[:, 0].min(), X[:, 0].max()
    x2_min, x2_max = X[:, 1].min(), X[:, 1].max()
    xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
    grid = np.c_[xx1.ravel(), xx2.ravel()]
    probs = model.predict_prob(grid).reshape(xx1.shape)
    plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='red');

    Model Evaluation

    Multinomial Logistic Regression Model

    Another useful form of logistic regression is multinomial logistic regression in which the target or dependent variable can have 3 or more possible unordered types i.e. the types having no quantitative significance.

    Implementation of Multinomial Logistic Regression Model in Python

    Now we will implement the above concept of multinomial logistic regression in Python. For this purpose, we are using a dataset from sklearn named digits.

    First, we need to import the necessary libraries as follows −

    import sklearn
    from sklearn import datasets
    from sklearn import linear_model
    from sklearn import metrics
    from sklearn.model_selection import train_test_split
    

    Next, we need to load digit dataset −

    digits = datasets.load_digits()

    Now, define the feature matrix (X) and response vector (y) as follows −

    X = digits.data
    y = digits.target
    

    With the help of the next line of code, we can split X and y into training and testing sets −

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

    Now create an object of logistic regression as follows −

    digreg = linear_model.LogisticRegression()
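
    Depending on the scikit-learn version, fitting this model on the digits data may raise a convergence warning. A common optional tweak, shown here only as a sketch, is to allow the solver more iterations via the max_iter parameter −

    # Optional: allow more solver iterations if a convergence warning appears
    digreg = linear_model.LogisticRegression(max_iter=1000)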

    Now, we need to train the model by using the training sets as follows −

    digreg.fit(X_train, y_train)

    Next, make the predictions on testing set as follows −

    y_pred = digreg.predict(X_test)

    Next print the accuracy of the model as follows −

    print("Accuracy of Logistic Regression model is:",
    metrics.accuracy_score(y_test, y_pred)*100)

    Output

    Accuracy of Logistic Regression model is: 95.6884561891516
    

    From the above output we can see the accuracy of our model is around 96 percent.

  • Classification Algorithms in Machine Learning

    Classification in Machine Learning

    Classification may be defined as the process of predicting a class or category from observed values or given data points. The categorized output can take a form such as "Black" or "White", or "spam" or "no spam".

    Classification in machine learning is a supervised learning technique where an algorithm is trained with labeled data to predict the category of new data.

    Mathematically, classification is the task of approximating a mapping function (f) from input variables (X) to output variables (Y). It basically belongs to supervised machine learning, in which targets are provided along with the input data set.

    An example of a classification problem is spam detection in emails. There can be only two categories of output, "spam" and "no spam"; hence this is a binary classification.

    To implement this classification, we first need to train the classifier. For this example, "spam" and "no spam" emails would be used as the training data. After successfully training the classifier, it can be used to detect an unknown email.

    Types of Learners in Classification

    We have two types of learners with respect to classification problems −

    • Lazy Learners − As the name suggests, such learners wait for the testing data to appear after storing the training data. Classification is done only after getting the testing data. They spend less time on training but more time on predicting. Examples of lazy learners are K-nearest neighbors and case-based reasoning.
    • Eager Learners − As opposed to lazy learners, eager learners construct a classification model without waiting for the testing data to appear. They spend more time on training but less time on predicting. Examples of eager learners are Decision Trees, Naive Bayes and Artificial Neural Networks (ANN).

    Classification Algorithms in Machine Learning

    The classification algorithm is a type of supervised learning technique that involves predicting a categorical target variable based on a set of input features. It is commonly used to solve problems such as spam detection, fraud detection, image recognition, sentiment analysis, and many others.

    The goal of a classification model is to learn a mapping function (f) between the input features (X) and the target variable (Y). This mapping function is often represented as a decision boundary, which separates different classes in the input feature space. Once the model is trained, it can be used to predict the class of new, unseen examples.

    The followings are some important ML classification algorithms −

    • Logistic Regression
    • K-Nearest Neighbors (KNN)
    • Support Vector Machine (SVM)
    • Decision Tree
    • Naive Bayes
    • Random Forest

    We will be discussing all these classification algorithms in detail in further chapters. However let’s discuss these algorithms in brief as follows −

    Logistic Regression

    Logistic Regression is a popular algorithm used for binary classification problems, where the target variable is categorical with two classes. It models the probability of the target variable given the input features and predicts the class with the highest probability.

    Logistic regression is a type of generalized linear model, where the target variable follows a Bernoulli distribution. The model consists of a linear function of the input features, which is transformed using the logistic function to produce a probability value between 0 and 1.

    K-Nearest Neighbors (KNN)

    K-Nearest Neighbors (KNN) is a supervised learning algorithm that can be used for both classification and regression problems. The main idea behind KNN is to find the k-nearest data points to a given test data point and use these nearest neighbors to make a prediction. The value of k is a hyperparameter that needs to be tuned, and it represents the number of neighbors to consider.

    For classification problems, the KNN algorithm assigns the test data point to the class that appears most frequently among the k-nearest neighbors. In other words, the class with the highest number of neighbors is the predicted class.

    For regression problems, the KNN algorithm assigns the test data point the average of the k-nearest neighbors’ values.
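
    As a quick, illustrative sketch of this idea (not part of the original example), the following code trains scikit-learn's KNeighborsClassifier on the iris dataset with an arbitrarily chosen k = 3 and predicts the class of one new measurement −

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    iris = load_iris()
    # k = 3 nearest neighbors; the value is an arbitrary choice for illustration
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(iris.data, iris.target)

    # The predicted class is the majority class among the 3 nearest training points
    print(knn.predict([[5.1, 3.5, 1.4, 0.2]]))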

    Support Vector Machine (SVM)

    Support Vector Machines (SVMs) are powerful yet flexible supervised machine learning algorithms used for both classification and regression. However, they are generally used for classification problems. SVMs were first introduced in the 1960s and later refined in the 1990s. SVMs have a unique way of implementation compared to other machine learning algorithms. Nowadays, they are extremely popular because of their ability to handle multiple continuous and categorical variables.

    Decision Tree

    The Decision Tree algorithm is a hierarchical tree-based algorithm that is used to classify or predict outcomes based on a set of rules. It works by splitting the data into subsets based on the values of the input features. The algorithm recursively splits the data until it reaches a point where the data in each subset belongs to the same class or has the same value for the target variable. The resulting tree is a set of decision rules that can be used to make predictions or classify new data.
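
    The following minimal sketch (assuming the iris dataset and an arbitrary depth limit of 2) shows how a decision tree learns such splitting rules and how they can be printed as text with scikit-learn −

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    # Limit the depth so the learned rules stay small and readable
    tree = DecisionTreeClassifier(max_depth=2)
    tree.fit(iris.data, iris.target)

    # Print the if/else splitting rules learned from the data
    print(export_text(tree, feature_names=iris.feature_names))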

    Naive Bayes

    The Naive Bayes algorithm is a classification algorithm based on Bayes' theorem. The algorithm assumes that the features are independent of each other, which is why it is called "naive." It calculates the probability of a sample belonging to a particular class based on the probabilities of its features. For example, a phone may be considered smart if it has a touch screen, an internet facility, a good camera, etc. Even though these features may depend on each other, each of them independently contributes to the probability that the phone is a smartphone.

    Random Forest

    Random Forest is a machine learning algorithm that uses an ensemble of decision trees to make predictions. The algorithm was first introduced by Leo Breiman in 2001. The key idea behind the algorithm is to create a large number of decision trees, each of which is trained on a different subset of the data. The predictions of these individual trees are then combined to produce a final prediction.

    Applications of Classification in Machine Learning

    Some of the most important applications of classification algorithms are as follows −

    • Speech Recognition
    • Handwriting Recognition
    • Biometric Identification
    • Document Classification
    • Image Classification
    • Spam Filtering
    • Fraud Detection
    • Facial Recognition

    Building a Classification Model in Machine Learning

    Let us now take a look at the steps involved in building a classification model −

    1. Data Preparation

    The first step is to collect and preprocess the data. This involves cleaning the data, handling missing values, and converting categorical variables to numerical values.
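
    As a small, hypothetical illustration of these preprocessing tasks (the column names and values below are made up), missing values can be imputed and categorical variables converted to numerical values with pandas −

    import pandas as pd

    # A tiny made-up frame with one missing value and one categorical column
    df = pd.DataFrame({'age': [25, None, 40],
                       'city': ['Delhi', 'Pune', 'Delhi'],
                       'label': [0, 1, 0]})

    df['age'] = df['age'].fillna(df['age'].mean())   # impute the missing value with the mean
    df = pd.get_dummies(df, columns=['city'])        # one-hot encode the categorical column
    print(df)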

    2. Feature Extraction/Selection

    The next step is to extract or select relevant features from the data. This is an important step because the quality of the features can greatly impact the performance of the model. Some common feature selection techniques include correlation analysis, feature importance ranking, and principal component analysis.
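
    As one small, illustrative sketch of feature selection (using scikit-learn's SelectKBest on the breast cancer dataset, with k = 5 chosen arbitrarily), features can be ranked by a univariate score and only the best ones kept −

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_breast_cancer(return_X_y=True)

    # Keep the 5 features with the highest ANOVA F-scores
    selector = SelectKBest(score_func=f_classif, k=5)
    X_reduced = selector.fit_transform(X, y)
    print(X_reduced.shape)   # (569, 5)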

    3. Model Selection

    Once the features are selected, the next step is to choose an appropriate classification algorithm. There are many different algorithms to choose from, each with its own strengths and weaknesses. Some popular algorithms include logistic regression, decision trees, random forests, support vector machines, and neural networks

    4. Model Training

    After selecting a suitable algorithm, the next step is to train the model on the labeled training data. During training, the model learns the mapping function between the input features and the target variable. The model parameters are adjusted iteratively to minimize the difference between the predicted outputs and the actual outputs.

    5. Model Evaluation

    Once the model is trained, the next step is to evaluate its performance on a separate set of validation data. This is done to estimate the model’s accuracy and generalization performance. Common evaluation metrics include accuracy, precision, recall, F1-score, and area under the receiver operating characteristic (ROC) curve.

    6. Hyperparameter Tuning

    In many cases, the performance of the model can be further improved by tuning its hyperparameters. Hyperparameters are settings that are chosen before training the model and control aspects such as the learning rate, regularization strength, and the number of hidden layers in a neural network. Grid search, random search, and Bayesian optimization are some common techniques used for hyperparameter tuning.
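
    The following rough sketch illustrates grid search with scikit-learn's GridSearchCV; the Random Forest hyperparameter grid and the 5-fold cross-validation are arbitrary choices made only for demonstration −

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_breast_cancer(return_X_y=True)

    # A small, arbitrary grid of candidate hyperparameter values
    param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, None]}

    search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_)   # the combination with the best cross-validated score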

    7. Model Deployment

    Once the model has been trained and evaluated, the final step is to deploy it in a production environment. This involves integrating the model into a larger system, testing it on real-world data, and monitoring its performance over time.

    Building a Classification Model with Python

    Scikit-learn, a Python library for machine learning, can be used to build a classifier in Python. The steps for building a classifier in Python are as follows −

    Step 1: Importing the necessary Python package

    For building a classifier using scikit-learn, we need to import it. We can import it by using the following script −

    import sklearn
    

    Step 2: Importing dataset

    After importing the necessary package, we need a dataset to build the classification prediction model. We can import it from the sklearn datasets or use another one as per our requirement. We are going to use sklearn's Breast Cancer Wisconsin Diagnostic dataset. We can import it with the help of the following script −

    from sklearn.datasets import load_breast_cancer
    

    The following script will load the dataset;

    data = load_breast_cancer()

    We also need to organize the data and it can be done with the help of following scripts −

    label_names = data['target_names']
    labels = data['target']
    feature_names = data['feature_names']
    features = data['data']

    The following command will print the names of the labels, ‘malignant’ and ‘benign’, in the case of our dataset.

    print(label_names)

    The output of the above command is the names of the labels −

    ['malignant' 'benign']

    These labels are mapped to binary values 0 and 1. Malignant cancer is represented by 0 and Benign cancer is represented by 1.

    The feature names and feature values of these labels can be seen with the help of following commands −

    print(feature_names[0])

    The output of the above command is the name of the first feature −

    mean radius
    

    Similarly, the name of the second feature can be produced as follows −

    print(feature_names[1])

    The output of the above command is the name of the second feature −

    mean texture
    

    We can print the feature values for the first data instance with the help of the following command −

    print(features[0])

    This will give the following output −

    [1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
     1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
     6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
     1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
     4.601e-01 1.189e-01]
    

    Similarly, we can print the feature values for the second data instance with the help of the following command −

    print(features[1])

    This will give the following output −

    [2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
    7.017e-02  1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
    5.225e-03  1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
    2.341e+01  1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
    2.750e-01  8.902e-02]
    

    Step 3: Organizing data into training & testing sets

    As we need to test our model on unseen data, we will divide our dataset into two parts: a training set and a test set. We can use train_test_split() function of sklearn python package to split the data into sets. The following command will import the function −

    from sklearn.model_selection import train_test_split
    

    Now, the next command will split the data into training and testing data. In this example, we are taking 40 percent of the data for testing and 60 percent for training −

    train, test, train_labels, test_labels = train_test_split(
       features, labels, test_size=0.40, random_state=42)

    Step 4: Building and training the model

    After dividing the data into training and testing sets, we need to build the model. We will be using the Naive Bayes algorithm for this purpose. The following commands will import the GaussianNB module −

    from sklearn.naive_bayes import GaussianNB
    

    Now, initialize the model as follows −

    gnb = GaussianNB()

    Next, with the help of following command we can train the model −

    model = gnb.fit(train, train_labels)

    Now, for evaluation purpose we need to make predictions. It can be done by using predict() function as follows −

    preds = gnb.predict(test)
    print(preds)

    This will give the following output −

    [1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
     1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
     1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
     1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
     1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0
     0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1
     0 0 1 1 0 1]
    

    The above series of 0s and 1s in output are the predicted values for the Malignant and Benign tumor classes.

    Step 5: Finding accuracy

    We can find the accuracy of the model built in the previous step by comparing the two arrays, namely test_labels and preds. We will be using the accuracy_score() function to determine the accuracy.

    from sklearn.metrics import accuracy_score
    print(accuracy_score(test_labels, preds))

    Output

    0.951754385965

    The above output shows that the Naive Bayes classifier is 95.17% accurate.

    Evaluation Metrics for Classification Model

    The job is not done even after you have finished the implementation of your machine learning application or model. We must find out how effective our model is. There are different evaluation/performance metrics, and we must choose them carefully because the choice of metrics influences how the performance of a machine learning algorithm is measured and compared.

    The following are some of the important classification evaluation metrics among which you can choose based upon your dataset and kind of problem −

    Confusion Matrix

    The confusion matrix is the easiest way to measure the performance of a classification problem where the output can belong to two or more classes. A confusion matrix is simply a table with two dimensions, "Actual" and "Predicted", whose cells count the "True Positives (TP)", "True Negatives (TN)", "False Positives (FP)" and "False Negatives (FN)", as shown below −

    Confusion Matrix

    The explanation of the terms associated with confusion matrix are as follows −

    • True Positives (TP) − It is the case when both actual class & predicted class of data point is 1.
    • True Negatives (TN) − It is the case when both actual class & predicted class of data point is 0.
    • False Positives (FP) − It is the case when actual class of data point is 0 & predicted class of data point is 1.
    • False Negatives (FN) − It is the case when actual class of data point is 1 & predicted class of data point is 0.

    We can find the confusion matrix with the help of confusion_matrix() function of sklearn. With the help of the following script, we can find the confusion matrix of above built binary classifier −

    from sklearn.metrics import confusion_matrix
    preds = gnb.predict(test)
    cm = confusion_matrix(test_labels, preds)
    print(cm)
    

    Output

    [[ 73   7]
     [  4 144]]
    

    Accuracy

    It may be defined as the number of correct predictions made by our ML model. We can easily calculate it by confusion matrix with the help of following formula −

    Accuracy = (TP + TN) / (TP + FP + FN + TN)

    For the above built binary classifier, the confusion matrix gives TN = 73, FP = 7, FN = 4 and TP = 144 (scikit-learn lists the classes in sorted order, so the first row and column correspond to class 0, i.e. malignant). Thus TP + TN = 144 + 73 = 217 and TP + FP + FN + TN = 228.

    Hence, Accuracy = 217/228 = 0.951754385965, which is the same value we calculated after creating our binary classifier.

    Precision

    Precision, often used in document retrieval, may be defined as the proportion of returned documents that are actually relevant; for a classifier, it is the proportion of predicted positive cases that are truly positive. We can easily calculate it from the confusion matrix with the help of the following formula −

    Precision = TP / (TP + FP)

    For the above built binary classifier, TP = 144 and TP + FP = 144 + 7 = 151.

    Hence, Precision = 144/151 = 0.95364

    Recall or Sensitivity

    Recall may be defined as the proportion of actual positives that are correctly returned by our ML model. We can easily calculate it from the confusion matrix with the help of the following formula −

    Recall = TP / (TP + FN)

    For the above built binary classifier, TP = 144 and TP + FN = 144 + 4 = 148.

    Hence, Recall = 144/148 = 0.97297

    Specificity

    Specificity, in contrast to recall, may be defined as the proportion of actual negatives that are correctly identified by our ML model. We can easily calculate it from the confusion matrix with the help of the following formula −

    Specificity = TN / (TN + FP)

    For the above built binary classifier, TN = 73 and TN + FP = 73 + 7 = 80.

    Hence, Specificity = 73/80 = 0.9125

    In the subsequent chapters, we will discuss some of the most popular classification algorithms in machine learning in detail.

  • Polynomial Regression in Machine Learning

    What is Polynomial Regression?

    Polynomial regression is a type of regression analysis in which the relationship between the independent variable and the dependent variable is modeled as an n-th degree polynomial function. Polynomial regression allows a more complex relationship between the variables to be captured than the straight-line relationship of simple linear regression and multiple linear regression.

    Why Polynomial Regression?

    In machine learning (ML) and data science, choosing between linear regression and polynomial regression depends upon the characteristics of the dataset. A non-linear dataset can't be fitted well with linear regression. If we apply linear regression to a non-linear dataset, it will not be able to capture the non-linear patterns in the data.

    Look at the below diagram to understand why we need polynomial regression for non-linear data.

    Simple Linear Regression vs. Polynomial Regression

    The above diagram shows the simple linear model hardly fits the data points whereas the polynomial model fits most of the data points.

    Equation of Polynomial Regression Model

    In machine learning, the general formula for polynomial regression of degree n is as follows −

    y = w0 + w1x + w2x^2 + w3x^3 + … + wnx^n + ϵ

    Where

    • y is the dependent variable (output).
    • x is the independent variable (input).
    • w0,w1,w2,…,wn are the coefficients (parameters) of the model.
    • n is the degree of the polynomial (the highest power of x).
    • ϵ is the error term or residual, representing the difference between the observed value and the model’s prediction.

    For a quadratic (second-degree) polynomial regression, the formula would be:

    y = w0 + w1x + w2x^2 + ϵ

    This would fit a parabolic curve to the data points.

    How does Polynomial Regression Work?

    In machine learning, polynomial regression works in a similar way to linear regression: it is modeled as a multiple linear regression. The input feature is transformed into polynomial features of higher degrees (x^2, x^3, …, x^n). These features are then treated as separate independent variables, as in multiple linear regression, and a multiple linear regressor is trained on the transformed polynomial features.

    Polynomial regression is therefore a special case of multiple linear regression, with one difference: in ordinary multiple linear regression the input features are separate variables, whereas in polynomial regression the transformed polynomial features are all derived from the same original input feature, as the short sketch below illustrates.
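
    A tiny sketch of this transformation (with made-up numbers): PolynomialFeatures turns a single column x into the columns [1, x, x²], which an ordinary linear model can then fit −

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    x = np.array([[2.0], [3.0]])           # one input feature, two samples
    poly = PolynomialFeatures(degree=2)
    print(poly.fit_transform(x))           # each row becomes [1, x, x**2]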

    Implementation of Polynomial Regression using Python

    Let’s implement polynomial regression using Python. We will use a well known machine learning Python library, Scikit-learn for building a regression model.

    Step 1: Data Preparation

    In machine learning model building, data preparation is a very important step. Let's prepare our data first. We will be using a dataset named ice_cream_selling_data.csv. It contains 49 data examples. It has one input feature/independent variable (Temperature (C)) and one target/dependent variable (Ice Cream Sales (units)).

    The following table represents the data in ice_cream_selling_data.csv file.

    ice_cream_selling_data.csv

    Temperature (C)     Ice Cream Sales (units)
    -4.662262677        41.84298632
    -4.316559447        34.66111954
    -4.213984765        39.38300088
    -3.949661089        37.53984488
    -3.578553716        32.28453119
    -3.455711698        30.00113848
    -3.108440121        22.63540128
    -3.081303324        25.36502221
    -2.672460827        19.22697005
    -2.652286793        20.27967918
    -2.651498033        13.2758285
    -2.288263998        18.12399121
    -2.11186969         11.21829447
    -1.818937609        10.01286785
    -1.66034773         12.61518115
    -1.326378983        10.95773134
    -1.173123268        6.68912264
    -0.773330043        9.392968661
    -0.673752802        5.210162615
    -0.149634867        4.673642541
    -0.036156498        0.328625517
    -0.033895286        0.897603187
    0.008607699         3.165600008
    0.149244574         1.931416029
    0.688780908         2.576782245
    0.693598873         4.625689458
    0.874905029         0.789973651
    1.024180814         2.313806358
    1.240711619         1.292360811
    1.359812674         0.953115312
    1.740000012         3.782570136
    1.850551926         4.857987801
    1.999310369         8.943823209
    2.075100597         8.170734936
    2.31859124          7.412094028
    2.471945997         10.33663062
    2.784836463         15.99661997
    2.831760211         12.56823739
    2.959932091         21.34291574
    3.020874314         20.11441346
    3.211366144         22.8394055
    3.270044068         16.98327874
    3.316072519         25.14208223
    3.335932412         26.10474041
    3.610778478         28.91218793
    3.704057438         17.84395652
    4.130867961         34.53074274
    4.133533788         27.69838335
    4.899031514         41.51482194

    Note − Create a CSV file with the above data and save it as ice_cream_selling_data.csv.

    Import Python libraries and packages for data preparation

    Let's first import the libraries and packages required in the data preparation step. We use pandas for reading the CSV file and NumPy to convert the pandas DataFrame columns to NumPy arrays (the input and output features are NumPy arrays). We use the preprocessing package from the Scikit-learn library for preprocessing-related tasks such as transforming the input feature into polynomial features.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import PolynomialFeatures
    

    Load the dataset

    Load the ice_cream_selling_data.csv as a pandas DataFrame.

    data = pd.read_csv('ice_cream_selling_data.csv')
    data.head()

    Output

    	Temperature (C)	Ice Cream Sales (units)
    0	-4.662263	41.842986
    1	-4.316559	34.661120
    2	-4.213985	39.383001
    3	-3.949661	37.539845
    4	-3.578554	32.284531
    

    Let’s create independent variable (X) and the dependent variable (y).

    X = data.iloc[:,0].values.reshape(-1,1)
    y = data.iloc[:,1].values
    

    Visualize the original datapoints

    Let’s visualize the original data points to get some insight.

    # Visualize the original data points
    plt.scatter(X, y, color="green")
    plt.title("Original Data")
    plt.xlabel("Temperature (C)")
    plt.ylabel("Ice Cream Sales (units)")
    plt.show()

    Output

    scatter plot - original data

    The above graph shows that the data points follow a parabolic pattern, so a polynomial curve of degree 2 should fit them well.

    So the relationship between the dependent variable (“Ice Cream Sales (units)”) and independent variable (“Temperature (C)”) can be modeled using polynomial regression of degree 2.

    Create a polynomial features object

    Now, let’s create a polynomial feature object with degree 2. We will use PolynomialFeatures class from sklearn.preprocessing module to create the feature object.

    degree = 2  # Degree of the polynomial
    poly_features = PolynomialFeatures(degree=degree)

    Let’s now transform the input data to include polynomial features

    X_poly = poly_features.fit_transform(X)

    Here, X_poly contains the transformed polynomial features of the original input feature (X): a constant term, the original value, and its square. The transformed data has shape (49, 3).

    Step 2: Model Training

    We have created the polynomial features. Now, let's build our model. We use the LinearRegression class from the sklearn.linear_model module. As we already discussed, polynomial regression is a special type of linear regression.

    Let’s create a linear regression object lr_model and train (fit) the model with data.

    from sklearn.linear_model import LinearRegression
    lr_model = LinearRegression()
    # Now, fit the model (linear regression object) on the data
    lr_model.fit(X_poly, y)

    So far, we have trained our regression model lr_model.

    Step 3: Model Prediction and Testing

    Now, we can use our model to predict the output. Before going to predict for new data, let’s predict for the existing data.

    # Generate predictions
    y_pred = lr_model.predict(X_poly)
    
    df = pd.DataFrame({'Actual Values': y, 'Predicted Values': y_pred})
    print(df)

    Output

        Actual Values  Predicted Values
    0       41.842986         46.564507
    1       34.661120         40.600548
    2       39.383001         38.915089
    3       37.539845         34.749272
    4       32.284531         29.331940
    5       30.001138         27.649735
    6       22.635401         23.192862
    7       25.365022         22.863178
    8       19.226970         18.222266
    9       20.279679         18.009098
    10      13.275828         18.000794
    11      18.123991         14.418541
    12      11.218294         12.853070
    13      10.012868         10.504868
    14      12.615181          9.364587
    15      10.957731          7.264266
    16       6.689123          6.437055
    17       9.392969          4.683654
    18       5.210163          4.337906
    19       4.673643          3.116139
    20       0.328626          2.983983
    21       0.897603          2.981829
    22       3.165600          2.944811
    23       1.931416          2.869446
    24       2.576782          3.251711
    25       4.625689          3.259923
    26       0.789974          3.630683
    27       2.313806          4.026226
    28       1.292361          4.744891
    29       0.953115          5.213321
    30       3.782570          7.055902
    31       4.857988          7.690948
    32       8.943823          8.616039
    33       8.170735          9.118494
    34       7.412094         10.874961
    35      10.336631         12.092557
    36      15.996620         14.843721
    37      12.568237         15.287199
    38      21.342916         16.539614
    39      20.114413         17.156188
    40      22.839406         19.171090
    41      16.983279         19.818497
    42      25.142082         20.335157
    43      26.104740         20.560474
    44      28.912188         23.826884
    45      17.843957         24.998282
    46      34.530743         30.764287
    47      27.698383         30.802396
    48      41.514822         42.821195
    

    You can compare the predicted values with actual values.

    Step 4: Evaluating Model Performance

    To evaluate the model performance, the best metric is the R-squared score (Coefficient of determination). It measures the proportion of the variance in the dependent variable that is predictable from the independent variables.

    from sklearn.metrics import r2_score

    # get the predicted values for the data
    y_pred = lr_model.predict(X_poly)
    r2 = r2_score(y, y_pred)
    print(r2)

    Output

    0.9321137090423877
    

    The r2_score is the most common metric used to evaluate a regression model. A high score indicates a better fit of the model to the data; 1 represents a perfect fit and 0 represents no relation between the predicted values and the actual values.

    Result Explanation − You can examine the above metric. Our model shows an R-squared score of around 0.932, which means that approximately 93% of the variation in the output variable is explained by the input variable.

    Step 5: Visualize the polynomial regression results

    Let’s visualize the regression results for better understanding. We use the pyplot module from the Matplotlib library to plot the graph.

    import matplotlib.pyplot as plt
    
    # Visualize the polynomial regression results
    plt.scatter(X, y, color="green")
    plt.plot(X, y_pred, color='red', label=f'Polynomial Regression (degree={degree})')
    plt.xlabel("Temperature (C)")
    plt.ylabel("Ice Cream Sales (units)")
    plt.legend()
    plt.title('Polynomial Regression')
    plt.show()

    Output

    ML Polynomial Regression Results

    The above graph shows that the polynomial regression with degree 2 fits well with the original data. The polynomial curve (parabola), in red color, represents the best-fit regression curve. This regression curve is used to predict the value. The graph also shows that the predicted values are close to the actual values.

    Step 6: Model Prediction for New Data

    Up to now, we have predicted the values in the dataset. Let’s use our regression model to predict new, unseen data.

    Let’s take the Temperature (C) as 1.9929C and predict the units of Ice Cream Sales.

    # Predict a new value
    X_new = np.array([[1.9929]])  # Example value to predict
    X_new_poly = poly_features.transform(X_new)
    y_new_pred = lr_model.predict(X_new_poly)
    print(y_new_pred)

    Output

    [8.57450466]
    

    The above result shows that the predicted value of Ice cream sales is 8.57450466.

  • Multiple Linear Regression in Machine Learning

    Multiple linear regression in machine learning is a supervised algorithm that models the relationship between a dependent variable and multiple independent variables. This relationship is used to predict the outcome of the dependent variable.

    Multiple linear regression is a type of linear regression in machine learning. There are mainly two types of linear regression algorithms −

    • Simple linear regression − deals with two variables (one dependent variable and one independent variable).
    • Multiple linear regression − deals with more than two variables (one dependent variable and more than one independent variable).

    Let’s discuss multiple linear regression in detail −

    What is Multiple Linear Regression?

    In machine learning, multiple linear regression (MLR) is a statistical technique that is used to predict the outcome of a dependent variable based on the values of multiple independent variables. The multiple linear regression algorithm is trained on data to learn a relationship (known as a regression line) that best fits the data. This relation describes how various factors affect the result. This relation is used to forecast the value of dependent variable based on the values of independent variables.

    In linear regression (simple and multiple), the dependent variable is continuous (a numeric value) and the independent variables can be continuous or discrete (numeric values). Independent variables can also be categorical (e.g., gender, occupation), but they need to be converted to numerical values first.

    Multiple linear regression is basically the extension of simple linear regression that predicts a response using two or more features. Mathematically we can represent the multiple linear regression as follows −

    Consider a dataset having n observations and p features, i.e., independent variables, with y as the response, i.e., the dependent variable. The regression line for p features can be calculated as follows −

    h(xi)=w0+w1xi1+w2xi2+⋅⋅⋅+wpxip

    Here, h(xi) is the predicted response value and w0,w1,w2….wp are the regression coefficients.

    Multiple linear regression models always account for the errors in the data, known as residual errors, which change the calculation as follows −

    yi=w0+w1xi1+w2xi2+⋅⋅⋅+wpxip+ei

    We can also write the above equation as follows −

    yi = h(xi) + ei  or  ei = yi − h(xi)

    Assumptions of Multiple Linear Regression

    The following are some assumptions about the dataset that are made by the multiple linear regression model −

    1. Linearity

    The relationship between the dependent variable (target) and independent (predictor) variables is linear.

    2. Independence

    Each observation is independent of others. The value of the dependent variable for one observation is independent of the value of another.

    3. Homoscedasticity

    For all observations, the variance of the residual errors is similar across the value of each independent variable.

    4. Normality of Errors

    The residuals (errors) are normally distributed. The residuals are differences between the actual and predicted values.

    5. No Multicollinearity

    The independent variables are not highly correlated with each other. Linear regression models assume that there is very little or no multi-collinearity in the data.

    6. No Autocorrelation

    There is no correlation between residuals. This ensures that the residuals (errors) are independent of each other.

    7. Fixed Independent Variables

    The values of independent variables are fixed in all repeated samples.

    Violations of these assumptions can lead to biased or inefficient estimates. It is essential to validate these assumptions to ensure model accuracy.
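
    As one quick, illustrative check of the multicollinearity assumption (the data below is made up purely for demonstration), the pairwise correlations between predictors can be inspected with pandas −

    import pandas as pd

    # Made-up predictor values purely for illustration
    X = pd.DataFrame({'x1': [1, 2, 3, 4, 5],
                      'x2': [2, 4, 6, 8, 10],   # perfectly correlated with x1
                      'x3': [5, 3, 6, 2, 7]})

    # Absolute correlations close to 1 between predictors signal multicollinearity
    print(X.corr())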

    Implementing Multiple Linear Regression in Python

    To implement multiple linear regression in Python using Scikit-Learn, we can use the same LinearRegression class as in simple linear regression, but this time we need to provide multiple independent variables as input.

    Step 1: Data Preparation

    We use the dataset named data.csv with 50 examples. It contains four predictor (independent) variables and a target (dependent) variable. The following table represents the data in data.csv file.

    data.csv

    R&D Spend    Administration    Marketing Spend    State         Profit
    165349.2     136897.8          471784.1           New York      192261.8
    162597.7     151377.6          443898.5           California    191792.1
    153441.5     101145.6          407934.5           Florida       191050.4
    144372.4     118671.9          383199.6           New York      182902
    142107.3     91391.77          366168.4           Florida       166187.9
    131876.9     99814.71          362861.4           New York      156991.1
    134615.5     147198.9          127716.8           California    156122.5
    130298.1     145530.1          323876.7           Florida       155752.6
    120542.5     148719            311613.3           New York      152211.8
    123334.9     108679.2          304981.6           California    149760
    101913.1     110594.1          229161             Florida       146122
    100672       91790.61          249744.6           California    144259.4
    93863.75     127320.4          249839.4           Florida       141585.5
    91992.39     135495.1          252664.9           California    134307.4
    119943.2     156547.4          256512.9           Florida       132602.7
    114523.6     122616.8          261776.2           New York      129917
    78013.11     121597.6          264346.1           California    126992.9
    94657.16     145077.6          282574.3           New York      125370.4
    91749.16     114175.8          294919.6           Florida       124266.9
    86419.7      153514.1          0                  New York      122776.9
    76253.86     113867.3          298664.5           California    118474
    78389.47     153773.4          299737.3           New York      111313
    73994.56     122782.8          303319.3           Florida       110352.3
    67532.53     105751            304768.7           Florida       108734
    77044.01     99281.34          140574.8           New York      108552
    64664.71     139553.2          137962.6           California    107404.3
    75328.87     144136            134050.1           Florida       105733.5
    72107.6      127864.6          353183.8           New York      105008.3
    66051.52     182645.6          118148.2           Florida       103282.4
    65605.48     153032.1          107138.4           New York      101004.6
    61994.48     115641.3          91131.24           Florida       99937.59
    61136.38     152701.9          88218.23           New York      97483.56
    63408.86     129219.6          46085.25           California    97427.84
    55493.95     103057.5          214634.8           Florida       96778.92
    46426.07     157693.9          210797.7           California    96712.8
    46014.02     85047.44          205517.6           New York      96479.51
    28663.76     127056.2          201126.8           Florida       90708.19
    44069.95     51283.14          197029.4           California    89949.14
    20229.59     65947.93          185265.1           New York      81229.06
    38558.51     82982.09          174999.3           California    81005.76
    28754.33     118546.1          172795.7           California    78239.91
    27892.92     84710.77          164470.7           Florida       77798.83
    23640.93     96189.63          148001.1           California    71498.49
    15505.73     127382.3          35534.17           New York      69758.98
    22177.74     154806.1          28334.72           California    65200.33
    1000.23      124153            1903.93            New York      64926.08
    1315.46      115816.2          297114.5           Florida       49490.75
    0            135426.9          0                  California    42559.73
    542.05       51743.15          0                  New York      35673.41
    0            116983.8          45173.06           California    14681.4

    You can create a CSV file and store the above data points in it.

    We have our dataset as data.csv file. We will use it to understand the implementation of the multiple linear regression in Python.

    We need to import libraries before loading the dataset.

    # import libraries
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    

    Load the dataset

    We load our dataset as a pandas DataFrame named dataset. Now let's create a list of independent values (predictors) and put them in a variable called X.

    The independent values are ‘R&D Spend’, ‘Administration’ and ‘Marketing Spend’. We are not using the independent variable ‘State’ for the sake of simplicity.

    We put the dependent variable values into a variable y.

    # load dataset
    dataset = pd.read_csv('data.csv')
    X = dataset[['R&D Spend','Administration','Marketing Spend']]
    y = dataset['Profit']

    Let’s check first five examples (rows) of input features (X) and target (y) −

    X.head()

    Output

    	R&D Spend	Administration	Marketing Spend
    0	165349.20	136897.80	471784.10
    1	162597.70	151377.59	443898.53
    2	153441.51	101145.55	407934.54
    3	144372.41	118671.85	383199.62
    4	142107.34	91391.77	366168.42
    
    y.head()

    Output

    	Profit
    0	192261.83
    1	191792.06
    2	191050.39
    3	182901.99
    4	166187.94
    

    Split the dataset into training and test sets

    Now, we split the dataset into a training set and a test set. Both X (independent values) and y (dependent values) are divided into two sets: training and test. We will use 20% of the data for the test set, so out of 50 feature vectors (observations/examples), there will be 40 feature vectors in the training set and 10 in the test set.

    # Split the dataset into training and test sets
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    Here, X_train and X_test represent the input features in the training and test sets, while y_train and y_test represent the target values (output) in the training and test sets.

    Step 2: Model Training

    The next step is to fit our model with the training data. We will use the LinearRegression class from the sklearn.linear_model module. We call the LinearRegression() constructor to create a linear regression object, which we name regressor.

    # Fit Multiple Linear Regression to the Training set
    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

    The regressor object has a fit() method. The fit() method is used to fit the linear regression object, regressor, to the training data. The model learns the relation between the predictor variables (X_train) and the target variable (y_train).

    Step 3: Model Testing

    Now our model is ready to use for prediction. Let’s test our regressor model on test data.

    We use the predict() method to predict the results for the test set. It takes the input features (X_test) and returns the predicted values.

    y_pred = regressor.predict(X_test)
    df = pd.DataFrame({'Real Values': y_test, 'Predicted Values': y_pred})
    print(df)

    Output

    	Real Values	Predicted Values
    23	108733.99	110159.827849
    43	69758.98	59787.885207
    26	105733.54	110545.686823
    34	96712.80	88204.710014
    24	108552.04	114094.816702
    39	81005.76	84152.640761
    44	65200.33	63862.256006
    18	124266.90	129379.514419
    47	42559.73	45832.902722
    17	125370.37	130086.829016
    

    You can compare the actual values and predicted values.

    Step 4: Model Evaluation

    We now evaluate our model to check how accurate it is. We will use mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), and R2-score (Coefficient of determination).

    from sklearn.metrics import mean_squared_error, root_mean_squared_error, mean_absolute_error, r2_score

    # Assuming you have your true y values (y_test) and predicted y values (y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = root_mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print("Mean Squared Error (MSE):", mse)
    print("Root Mean Squared Error (RMSE):", rmse)
    print("Mean Absolute Error (MAE):", mae)
    print("R-squared (R2):", r2)

    Output

    Mean Squared Error (MSE): 72684687.6336162
    Root Mean Squared Error (RMSE): 8525.531516193943
    Mean Absolute Error (MAE): 6425.118502810154
    R-squared (R2): 0.9588459519573707
    

    You can examine the above metrics. Our model shows an R-squared score of around 0.96, which means that approximately 96% of the variation in the output variable is explained by the input variables.

    Step 5: Model Prediction for New Data

    Let’s use our regressor model to predict profit values based on R&D Spend, Administration and Marketing Spend.

    ['R&D Spend', 'Administration', 'Marketing Spend'] = [166343.2, 136787.8, 461724.1]

    # Predict profit when R&D Spend is 166343.2, Administration is 136787.8 and Marketing Spend is 461724.1
    new_data = [[166343.2, 136787.8, 461724.1]]
    profit = regressor.predict(new_data)
    print(profit)

    Output

    [193053.61874652]
    

    The model predicts that the profit value is approximately 193053.62 for the above three values.

    Model Parameters (Coefficients and Intercept)

    The model parameters (intercept and coefficients) describe the relation between a dependent variable and the independent variables.

    Our regression model for the above use case,

    Y = w0 + w1X1 + w2X2 + w3X3

    w0 is intercept and w1,w2,w3 are coefficients of X1,X2,X3 respectively.

    Here,

    • X1 represents R&D Spend,
    • X2 represents Administration, and
    • X3 represents Marketing Spend.

    Let’s first compute the intercept and coefficients.

    print("coefficients: ", regressor.coef_)print("intercept: ", regressor.intercept_)

    Output

    coefficients: [ 0.81129358 -0.06184074  0.02515044]
    intercept: 54946.94052163202
    

    The above output shows the following –

    • w0 = 54946.94052163202
    • w1 = 0.81129358
    • w2 = -0.06184074
    • w3 = 0.02515044

    Result Explanation

    We have calculated intercept (w0) and coefficients (w1, w2, w3).

    The coefficients are as follows –

    • R&D Spend: 0.81129358
    • Administration: -0.06184074
    • Marketing Spend: 0.02515044

    This shows that if R&D Spend is increased by 1 USD, the Profit will increase by about 0.8113 USD.

    The result shows that when Administration spend is increased by 1 USD, the Profit will decrease by about 0.0618 USD.

    And when Marketing Spend increases by 1 USD, the Profit increases by about 0.0252 USD.

    Let’s verify the result,

    In step 5, we have predicted Profit for new data as 193053.61874652

    Here,

    new_data = [[166343.2, 136787.8, 461724.1]]
    Profit = 54946.94052163202 + 0.81129358*166343.2 - 0.06184074*136787.8 + 0.02515044*461724.1
    Profit = 193053.616257

    This is approximately the same as the model prediction. Why only approximately? Because the coefficients printed above are rounded; the tiny gap between the hand calculation and the model's full-precision prediction is just this rounding difference.

    difference = 193053.61874652 - 193053.616257
    difference = 0.00248952
    

    Applications of Multiple Linear Regression

    The following are some commonly used applications of multiple linear regression −

    • Finance − Predicting stock prices, forecasting exchange rates, assessing credit risk.
    • Marketing − Predicting sales, customer churn, and marketing campaign effectiveness.
    • Real Estate − Predicting house prices based on factors like size, location, and number of bedrooms.
    • Healthcare − Predicting patient outcomes, analyzing the impact of treatments, and identifying risk factors for diseases.
    • Economics − Forecasting economic growth, analyzing the impact of policies, and predicting inflation rates.
    • Social Sciences − Modeling social phenomena, predicting election outcomes, and understanding human behavior.

    Challenges of Multiple Linear Regression

    The following are some common challenges faced by multiple linear regression in machine learning −

    • Multicollinearity − High correlation between independent variables, leading to unstable model coefficients and difficulty in interpreting the impact of individual variables.
    • Overfitting − The model fits the training data too closely, leading to poor performance on new, unseen data.
    • Underfitting − The model fails to capture the underlying patterns in the data, resulting in poor performance on both training and test data.
    • Non-linearity − Multiple linear regression assumes a linear relationship between the independent and dependent variables. Non-linear relationships can lead to inaccurate predictions.
    • Outliers − Outliers can significantly impact the model's performance, especially in small datasets.
    • Missing Data − Missing data can lead to biased and inaccurate results.

    Difference Between Simple and Multiple Linear Regression

    The following table highlights the major differences between simple and multiple linear regression −

    • Independent Variables − Simple linear regression uses one; multiple linear regression uses two or more.
    • Model Equation − Simple: y = w1x + w0; Multiple: y = w0 + w1x1 + w2x2 + … + wpxp.
    • Complexity − Simple linear regression is less complex; multiple linear regression is more complex due to multiple variables.
    • Real-world Applications − Simple: predicting house prices based on square footage, or sales based on advertising expenditure. Multiple: predicting sales based on advertising expenditure, price, and competitor activity, or student performance based on study hours, attendance, and IQ.
    • Model Interpretation − Simple linear regression coefficients are easier to interpret; multiple linear regression is more complex to interpret due to multiple variables.

  • Simple Linear Regression in Machine Learning

    What is Simple Linear Regression?

    Simple linear regression is a statistical and supervised learning method in which a single independent variable (also known as a predictor variable) is used to predict the dependent variable. In other words, it models the linear relationship between the dependent variable and a single independent variable.

    Simple linear regression in machine learning is a type of linear regression. When the linear regression algorithm deals with a single independent variable, it is known as simple linear regression. When there is more than one independent variable (feature variables), it is known as multiple linear regression.

    Independent Variable

    The feature inputs in the dataset are termed as the independent variables. There is only a single independent variable in simple linear regression. An independent variable is also known as a predictor variable as it is used to predict the target value. It is plotted on a horizontal axis.

    Dependent Variable

    The target value in the dataset is termed as the dependent variable. It is also known as a response variable or predicted variable. It is plotted on a vertical axis.

    Line of Regression

    In simple linear regression, a line of regression is a straight line that best fits the data points and is used to show the relationship between a dependent variable and an independent variable.

    Graphical Representation

    The following graph depicts the simple linear regression model −

    ML Simple Linear Regression

    In the above image, the straight line represents the simple linear regression line, where Ŷ is the predicted value, Y is the dependent variable (target) and X is the independent variable (input).

    Simple Linear Regression Model

    A simple linear regression model in machine learning can be represented as the following mathematical equation −

    Y=w0+w1X+ϵ

    Where

    • Y is the dependent variable (target).
    • X is the independent variable (feature).
    • w0 is the y-intercept of the line.
    • w1 is the slope of the line, representing the effect of X on Y.
    • ε is the error term, capturing the variability in Y not explained by X.

    How Simple Linear Regression Works?

    The main aim of simple linear regression is to find the best fit line (a straight line) through the data points that minimizes the difference between the actual values and the predicted values.

    Defining Hypothesis Function

    In simple linear regression, the hypothesis is that there is a linear relation between the dependent variable (output/ target) and the independent variable (input). This linear relation can be represented using a linear equation −

    Ŷ = w0 + w1X

    With different values of parameters w0 and w1 there are multiple linear equations (straight lines). The set of all such linear equations (all straight lines) is termed hypothesis space.

    Now, the main aim of the simple linear regression model is to find the best-fit line in Hypothesis space (set of all straight lines).

    Finding the Best Fit Line

    Now the task is to find the best fit line (line of regression). To do this, we define a cost function or loss function that measures the difference between the actual values and the predicted values.

    To find the best fit line, the simple linear regression model initializes (with default values) the parameters of the regression line. This regression line (with initialized parameters) is used to find the predicted values for the given input values.

    Loss Function for Simple Linear Regression

    Now using the input and predicted values, we compute the loss function. The loss function is used to find the optimal values of the parameters.

    The loss function measures the difference between the actual values and the predicted values. There are different loss functions, such as mean squared error (MSE), mean absolute error (MAE), R-squared, etc., used in simple linear regression. The most commonly used loss function is mean squared error.

    The loss function for simple linear regression in terms of mean squared error is as follows −

    J(w0, w1) = (1/(2n)) * Σ (Yi − Ŷi)^2, where the sum runs over i = 1 to n
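
    As a small illustration with made-up numbers, this loss can be computed directly with NumPy; the 1/(2n) factor is just the averaging convention used in the formula above −

    import numpy as np

    y_actual = np.array([3.0, 5.0, 7.0])   # made-up target values
    y_hat = np.array([2.5, 5.5, 6.0])      # predictions from some candidate line
    n = len(y_actual)

    loss = np.sum((y_actual - y_hat) ** 2) / (2 * n)
    print(loss)   # 0.25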

    Optimization

    The optimal values of parameters are those values that minimize the cost function. Finding the optimal values is an iterative process in which the parameters are updated iteratively.

    There are many optimization techniques applied in simple linear regression. Gradient Descent is a simple and most common optimization technique used in simple linear regression.

    A linear equation with the optimal parameter values is the best fit line (regression line), and it is the final solution of a simple linear regression problem. This line is used to predict new and unseen data. A minimal sketch of this optimization loop follows.
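
    The following minimal sketch (with made-up data and an arbitrary learning rate) shows the kind of iterative update that gradient descent performs on w0 and w1 to minimize the loss defined above −

    import numpy as np

    # Made-up data that roughly follows y = 2x + 1
    X = np.array([1.0, 2.0, 3.0, 4.0])
    Y = np.array([3.1, 4.9, 7.2, 8.8])

    w0, w1 = 0.0, 0.0     # initial parameter values
    lr = 0.01             # learning rate (arbitrary choice)

    for _ in range(5000):
        Y_hat = w0 + w1 * X
        error = Y_hat - Y
        # Gradients of the loss J(w0, w1) with respect to w0 and w1
        w0 -= lr * error.mean()
        w1 -= lr * (error * X).mean()

    print(w0, w1)   # approaches the least-squares fit (about 1.15 and 1.94)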

    Assumptions of Simple Linear Regression

    There are some assumptions about the dataset that are made by the simple linear regression model. The following are some assumptions −

    • Linearity − This assumption assumes that the relationship between the dependent and independent variables is linear. That means the dependent variable changes linearly as the independent variable changes. A scatter plot will show the linearity in the dataset.
    • Homoscedasticity − For all observations, the variance of the residuals is the same. This assumption relates to the squared residuals.
    • Independence − The examples (observations, or X and Y pairs) are independent. There is no collinearity in the data, so the residuals will not be correlated. To check this, we examine the scatter plot of residuals vs. fitted values.
    • Normality − Model Residuals are normally distributed. Residuals are the differences between the actual and predicted values. To check for the normality, we examine the histogram of residuals. The histogram should be approximately normally distributed.

    Implementation of Simple Linear Regression Algorithm using Python

    To implement the simple linear regression algorithm, we are taking a dataset with two variables: YearsExperience (independent variable) and Salary (dependent variable).

    Here, we are using the following dataset. The dataset contains 30 examples of data points. You can create a CSV file and store these data points in it.

    Salary_Data.csv

    Years of Experience    Salary
    1.1                    39343
    1.3                    46205
    1.5                    37731
    2                      43525
    2.2                    39891
    2.9                    56642
    3                      60150
    3.2                    54445
    3.2                    64445
    3.7                    57189
    3.9                    63218
    4                      55794
    4                      56957
    4.1                    57081
    4.5                    61111
    4.9                    67938
    5.1                    66029
    5.3                    83088
    5.9                    81363
    6                      93940
    6.8                    91738
    7.1                    98273
    7.9                    101302
    8.2                    113812
    8.7                    109431
    9                      105582
    9.5                    116969
    9.6                    112635
    10.3                   122391
    10.5                   121872

    What is the purpose of this implementation?

    The purpose of building this simple linear regression model is to determine which line best represents the relationship between the two variables.

    The following are the steps to implement the simple linear regression model in Python −

    Step 1: Data Preparation

    Data preparation or pre-processing is the initial step. We have our dataset as a CSV file named “Salary_Data.csv,” as discussed above.

    We need to import the required Python libraries before importing the dataset and building the simple linear regression model.

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    

    Load the dataset

    dataset = pd.read_csv('Salary_Data.csv')

    The independent variable (X) and the dependent variable (y) must then be extracted from the provided dataset. Years of experience (YearsExperience) is the independent variable, and Salary is the dependent variable.

    X = dataset.iloc[:,:-1].values
    y = dataset.iloc[:,-1].values
    

    Let’s check the first five examples of the dataset.

    print(dataset.head())

    Output

       YearsExperience   Salary
    0              1.1  39343.0
    1              1.3  46205.0
    2              1.5  37731.0
    3              2.0  43525.0
    4              2.2  39891.0
    

    Let's check whether the dataset is linear or not −

    plt.scatter(X, y, color="green")
    plt.title("Salary vs Experience")
    plt.xlabel("Years of Experience")
    plt.ylabel("Salary (INR)")
    plt.show()

    Output

    Linear Relation Between Dependent and Independent Variables

    The above graph shows that the dependent and independent variables are linearly related. So we can apply simple linear regression to the dataset to find the best relation between these variables.

    Split the dataset into training and testing sets

    The dataset will then be divided into two groups: a training set and a test set. Out of the total 30 observations, we will use 80% for the training set and 20% for the test set, so there will be 24 observations in the training set and 6 in the test set. We divide our dataset this way so that we can use one set to train the model and the other to test it.

    # Split the dataset into training and testing sets
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    Here, X_train represents the input feature of the training data and y_train represents the output variable (target variable).

    Step 2: Model Training (Fitting the Simple Linear Regression to Training Set)

    The next step is fitting our model with the training dataset. We will use scikit-learn’s LinearRegression class to train a simple linear regression model on the training data. The code for this is as follows −

    from sklearn.linear_model import LinearRegression
    
    # Create a linear regression object
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

    The fit() method is used to fit the linear regression object (regressor) to the training data. The model learns the relation between the predictor variable (X_train) and the target variable (y_train).
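
    Optionally, you can inspect the parameters the model has learned. intercept_ and coef_ are standard attributes of scikit-learn's LinearRegression; the exact values you see depend on the random train/test split.

    # Inspect the learned regression line: Y_hat = intercept_ + coef_[0] * X
    print("Intercept (w0):", regressor.intercept_)
    print("Slope (w1):", regressor.coef_[0])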

    Step 3: Model Testing

    Once the model is trained, we can use it to make predictions on the test data. The code for this is as follows −

    y_pred = regressor.predict(X_test)
    df = pd.DataFrame({'Actual Values': y_test, 'Predicted Values': y_pred})
    print(df)

    Output

       Actual Values  Predicted Values
    0        60150.0      54093.648425
    1        93940.0      82416.119864
    2        57081.0      64478.554619
    3       116969.0     115459.003211
    4        56957.0      63534.472238
    5       121872.0     124899.827024
    

    The above output shows actual values and predicted values of Salary in the test set.

    Here, X_test represents the input feature of the test data and y_pred represents the predicted output variable (target variable).

    Similarly, you can test the model with training data.

    y_pred = regressor.predict(X_train)
    df = pd.DataFrame({'Real Values': y_train, 'Predicted Values': y_pred})
    print(df)

    Output

        Real Values  Predicted Values
    0       57189.0      60702.225094
    1       64445.0      55981.813188
    2       63218.0      62590.389857
    3      122391.0     123011.662261
    4       91738.0      89968.778915
    5       43525.0      44652.824612
    6       61111.0      68254.884145
    7       56642.0      53149.566044
    8       66029.0      73919.378433
    9       83088.0      75807.543195
    10      46205.0      38044.247943
    11     109431.0     107906.344160
    12      98273.0      92801.026059
    13      37731.0      39932.412705
    14      54445.0      55981.813188
    15      39891.0      46540.989374
    16     101302.0     100353.685109
    17      55794.0      63534.472238
    18      81363.0      81472.037483
    19      39343.0      36156.083180
    20     113812.0     103185.932253
    21      67938.0      72031.213670
    22     112635.0     116403.085592
    23     105582.0     110738.591304
    

    Step 4: Model Evaluation

    We need to evaluate the performance of the model to determine its accuracy. We will use the mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination (R^2) as evaluation metrics. The code for this is as follows −

    from sklearn.metrics import mean_squared_error
    from sklearn.metrics import mean_absolute_error
    from sklearn.metrics import r2_score
    
    # get the predicted values for test data
    y_pred = regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print("mse: ", mse)
    rmse = mean_squared_error(y_test, y_pred, squared=False)
    print("rmse: ", rmse)
    mae = mean_absolute_error(y_test, y_pred)
    print("mae: ", mae)
    r2 = r2_score(y_test, y_pred)
    print("r2: ", r2)

    Output

    mse:  46485664.99327367
    rmse:  6818.0396737826095
    mae:  6015.513730219523
    r2:  0.9399326805390613
    

    Here, y_test represents the actual output variable of the test data.
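
    As an optional cross-check (a sketch, not part of the original workflow), the same metrics can be computed directly from their definitions with NumPy. The results should match the scikit-learn values above.

    import numpy as np

    errors = y_test - y_pred
    mse_manual = np.mean(errors ** 2)          # mean squared error
    rmse_manual = np.sqrt(mse_manual)          # root mean squared error
    mae_manual = np.mean(np.abs(errors))       # mean absolute error
    r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_test - np.mean(y_test)) ** 2)

    print(mse_manual, rmse_manual, mae_manual, r2_manual)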

    Step 5: Visualize Training Set Results (with Regression Line)

    Now, let’s visualize the results on the training set and the regression line.

    We use a scatter plot to show the actual values (inputs and target values) of the training set. We also plot the regression line, i.e., a straight line through the predicted values for the training inputs.

    y_pred = regressor.predict(X_train)
    plt.scatter(X_train, y_train, color="green", label="training data points (actual)")
    plt.scatter(X_train, y_pred, color="blue",label="training data points (predicted)")
    plt.plot(X_train, y_pred, color="red")
    plt.title("Salary vs Experience (Training Dataset)")
    plt.xlabel("Years of Experience")
    plt.ylabel("Salary(In Rupees)")
    plt.legend()
    plt.show()

    Output

    Visualizing training set results

    The above graph shows the line of regression (straight line in red color), actual values (in green color), and predicted values (in blue color) for the training set.

    Step 6: Visualize the Test Set Results (with Regression Line)

    Now, let’s visualize the results on the test set and the regression line.

    We use a scatter plot to show the actual values (inputs and target values) of the test set. We also plot the regression line, i.e., a straight line through the predicted values for the test inputs.

    y_pred = regressor.predict(X_test)
    plt.scatter(X_test, y_test, color="green", label="test data points (actual)")
    plt.scatter(X_test, y_pred, color="blue",label="test data points (predicted)")
    plt.plot(X_test, y_pred, color="red")
    plt.title("Salary vs Experience (Test Dataset)")
    plt.xlabel("Years of Experience")
    plt.ylabel("Salary(In Rupees)")
    plt.legend()
    plt.show()

    Output

    Visualizing test set results

    The above graph shows the line of regression (straight line in red color), actual values (in green color), and predicted values (in blue color) for the test set.