Category: Regression Analysis In ML


  • Polynomial Regression in Machine Learning

    What is Polynomial Regression?

Polynomial regression is a type of regression analysis in which the relationship between the independent variable and the dependent variable is modeled as an n-th degree polynomial function. Polynomial regression can capture more complex relationships between the variables than the straight-line relationship assumed in simple and multiple linear regression.

    Why Polynomial Regression?

In machine learning (ML) and data science, choosing between linear regression and polynomial regression depends upon the characteristics of the dataset. A non-linear dataset cannot be fitted well by a linear regression model; if we apply linear regression to such a dataset, it will not capture the non-linear patterns in the data.

    Look at the below diagram to understand why we need polynomial regression for non-linear data.

    Simple Linear Regression vs. Polynomial Regression

The above diagram shows that the simple linear model hardly fits the data points, whereas the polynomial model fits most of them.

    Equation of Polynomial Regression Model

    In machine learning, the general formula for polynomial regression of degree n is as follows −

y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + … + w_n x^n + ε

    Where

    • y is the dependent variable (output).
    • x is the independent variable (input).
• w_0, w_1, w_2, …, w_n are the coefficients (parameters) of the model.
• n is the degree of the polynomial (the highest power of x).
• ε is the error term or residual, representing the difference between the observed value and the model’s prediction.

    For a quadratic (second-degree) polynomial regression, the formula would be:

y = w_0 + w_1 x + w_2 x^2 + ε

    This would fit a parabolic curve to the data points.

    How does Polynomial Regression Work?

In machine learning, polynomial regression works in a similar way to linear regression; it is modeled as a multiple linear regression. The input feature is transformed into polynomial features of higher degrees (x^2, x^3, …, x^n). These transformed features are then treated as separate independent variables, as in multiple linear regression, and a multiple linear regressor is trained on them.

Polynomial regression is thus a special case of multiple linear regression. The difference is that in multiple linear regression the input features are separate variables, whereas in polynomial regression the transformed polynomial features are all derived from the same original input feature, as the sketch below illustrates.
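To make this idea concrete, here is a minimal sketch (the toy numbers are made up for illustration and are not from the dataset used below) of how a degree-2 polynomial fit reduces to ordinary multiple linear regression on the columns x and x^2 −

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y roughly follows a parabola in x (values are made up for illustration)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([4.1, 1.2, 0.1, 0.9, 4.2, 8.8])

# Build polynomial features by hand: each power of x becomes its own column
X_poly = np.column_stack([x, x**2])  # columns: [x, x^2]

# An ordinary multiple linear regressor fitted on [x, x^2] is a degree-2 polynomial model
model = LinearRegression().fit(X_poly, y)
print("intercept (w0):", model.intercept_)
print("coefficients (w1, w2):", model.coef_)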

    Implementation of Polynomial Regression using Python

Let’s implement polynomial regression using Python. We will use Scikit-learn, a well-known machine learning library in Python, to build the regression model.

    Step 1: Data Preparation

In machine learning model building, data preparation is a very important step. Let’s prepare our data first. We will be using a dataset named ice_cream_selling_data.csv. It contains 49 data examples, with one input feature/independent variable (Temperature (C)) and one target/dependent variable (Ice Cream Sales (units)).

    The following table represents the data in ice_cream_selling_data.csv file.

    ice_cream_selling_data.csv

Temperature (C),Ice Cream Sales (units)
-4.662262677,41.84298632
-4.316559447,34.66111954
-4.213984765,39.38300088
-3.949661089,37.53984488
-3.578553716,32.28453119
-3.455711698,30.00113848
-3.108440121,22.63540128
-3.081303324,25.36502221
-2.672460827,19.22697005
-2.652286793,20.27967918
-2.651498033,13.2758285
-2.288263998,18.12399121
-2.11186969,11.21829447
-1.818937609,10.01286785
-1.66034773,12.61518115
-1.326378983,10.95773134
-1.173123268,6.68912264
-0.773330043,9.392968661
-0.673752802,5.210162615
-0.149634867,4.673642541
-0.036156498,0.328625517
-0.033895286,0.897603187
0.008607699,3.165600008
0.149244574,1.931416029
0.688780908,2.576782245
0.693598873,4.625689458
0.874905029,0.789973651
1.024180814,2.313806358
1.240711619,1.292360811
1.359812674,0.953115312
1.740000012,3.782570136
1.850551926,4.857987801
1.999310369,8.943823209
2.075100597,8.170734936
2.31859124,7.412094028
2.471945997,10.33663062
2.784836463,15.99661997
2.831760211,12.56823739
2.959932091,21.34291574
3.020874314,20.11441346
3.211366144,22.8394055
3.270044068,16.98327874
3.316072519,25.14208223
3.335932412,26.10474041
3.610778478,28.91218793
3.704057438,17.84395652
4.130867961,34.53074274
4.133533788,27.69838335
4.899031514,41.51482194

    Note − Create a CSV file with the above data and save it as ice_cream_selling_data.csv.

    Import Python libraries and packages for data preparation

Let’s first import the libraries and packages required in the data preparation step. We use pandas for reading the CSV file and NumPy to convert the pandas DataFrame columns to NumPy arrays (the input and output features are NumPy arrays). We use the preprocessing package from the Scikit-learn library for preprocessing tasks such as transforming the input feature into polynomial features.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import PolynomialFeatures
    

    Load the dataset

Load ice_cream_selling_data.csv as a pandas DataFrame.

data = pd.read_csv('ice_cream_selling_data.csv')
    data.head()

    Output

    	Temperature (C)	Ice Cream Sales (units)
    0	-4.662263	41.842986
    1	-4.316559	34.661120
    2	-4.213985	39.383001
    3	-3.949661	37.539845
    4	-3.578554	32.284531
    

Let’s create the independent variable (X) and the dependent variable (y).

    X = data.iloc[:,0].values.reshape(-1,1)
    y = data.iloc[:,1].values
    

    Visualize the original datapoints

    Let’s visualize the original data points to get some insight.

    # Visualize the original data points
    plt.scatter(X, y, color="green")
    plt.title("Original Data")
    plt.xlabel("Temperature (C)")
    plt.ylabel("Ice Cream Sales (units)")
    plt.show()

    Output

    scatter plot - original data

The above graph shows that the data points follow a parabolic (degree-2) pattern, so a polynomial curve of degree 2 should fit them well.

    So the relationship between the dependent variable (“Ice Cream Sales (units)”) and independent variable (“Temperature (C)”) can be modeled using polynomial regression of degree 2.

    Create a polynomial features object

Now, let’s create a polynomial features object with degree 2. We will use the PolynomialFeatures class from the sklearn.preprocessing module to create it.

degree = 2  # Degree of the polynomial
    poly_features = PolynomialFeatures(degree=degree)

    Let’s now transform the input data to include polynomial features

    X_poly = poly_features.fit_transform(X)

Here, X_poly contains the transformed polynomial features of the original input feature (X). The transformed data has shape (49, 3): a bias column of ones, x, and x^2.
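If you want to inspect what the transformation produced, you can print the generated feature names and the first few rows. This is just an optional check, and it assumes a reasonably recent Scikit-learn version that provides get_feature_names_out; the exact numbers depend on your CSV −

# Each row of X_poly is [1, x, x^2] for the corresponding temperature value
print(poly_features.get_feature_names_out())  # e.g. ['1' 'x0' 'x0^2']
print(X_poly[:3])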

    Step 2: Model Training

We have created the polynomial features. Now, let’s build the model. We use the LinearRegression class from the sklearn.linear_model module; as discussed above, polynomial regression is a special case of linear regression.

    Let’s create a linear regression object lr_model and train (fit) the model with data.

    from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()

# Now, fit the model (linear regression object) on the data
lr_model.fit(X_poly, y)

So far, we have trained our regression model lr_model.

    Step 3: Model Prediction and Testing

    Now, we can use our model to predict the output. Before going to predict for new data, let’s predict for the existing data.

    # Generate predictions
    y_pred = lr_model.predict(X_poly)
    
df = pd.DataFrame({'Actual Values': y, 'Predicted Values': y_pred})
print(df)

    Output

        Actual Values  Predicted Values
    0       41.842986         46.564507
    1       34.661120         40.600548
    2       39.383001         38.915089
    3       37.539845         34.749272
    4       32.284531         29.331940
    5       30.001138         27.649735
    6       22.635401         23.192862
    7       25.365022         22.863178
    8       19.226970         18.222266
    9       20.279679         18.009098
    10      13.275828         18.000794
    11      18.123991         14.418541
    12      11.218294         12.853070
    13      10.012868         10.504868
    14      12.615181          9.364587
    15      10.957731          7.264266
    16       6.689123          6.437055
    17       9.392969          4.683654
    18       5.210163          4.337906
    19       4.673643          3.116139
    20       0.328626          2.983983
    21       0.897603          2.981829
    22       3.165600          2.944811
    23       1.931416          2.869446
    24       2.576782          3.251711
    25       4.625689          3.259923
    26       0.789974          3.630683
    27       2.313806          4.026226
    28       1.292361          4.744891
    29       0.953115          5.213321
    30       3.782570          7.055902
    31       4.857988          7.690948
    32       8.943823          8.616039
    33       8.170735          9.118494
    34       7.412094         10.874961
    35      10.336631         12.092557
    36      15.996620         14.843721
    37      12.568237         15.287199
    38      21.342916         16.539614
    39      20.114413         17.156188
    40      22.839406         19.171090
    41      16.983279         19.818497
    42      25.142082         20.335157
    43      26.104740         20.560474
    44      28.912188         23.826884
    45      17.843957         24.998282
    46      34.530743         30.764287
    47      27.698383         30.802396
    48      41.514822         42.821195
    

    You can compare the predicted values with actual values.

    Step 4: Evaluating Model Performance

To evaluate the model performance, a widely used metric is the R-squared score (coefficient of determination). It measures the proportion of the variance in the dependent variable that is predictable from the independent variables.

    from sklearn.metrics import r2_score
    
# get the predicted values for the data
y_pred = lr_model.predict(X_poly)
r2 = r2_score(y, y_pred)
print(r2)

Output

    0.9321137090423877
    

The r2_score is the most common metric used to evaluate a regression model. A higher score indicates a better fit: 1 represents a perfect fit, and 0 means the model explains none of the variation in the target.

Result Explanation − You can examine the above metric. Our model shows an R-squared score of around 0.932, which means that approximately 93% of the variation in the output variable is explained by the input variable.

    Step 5: Visualize the polynomial regression results

    Let’s visualize the regression results for better understanding. We use the pyplot module from the Matplotlib library to plot the graph.

    import matplotlib.pyplot as plt
    
    # Visualize the polynomial regression results
    plt.scatter(X, y, color="green")
    plt.plot(X, y_pred, color='red', label=f'Polynomial Regression (degree={degree})')
    plt.xlabel("Temperature (C)")
    plt.ylabel("Ice Cream Sales (units)")
    plt.legend()
    plt.title('Polynomial Regression')
    plt.show()

    Output

    ML Polynomial Regression Results

    The above graph shows that the polynomial regression with degree 2 fits well with the original data. The polynomial curve (parabola), in red color, represents the best-fit regression curve. This regression curve is used to predict the value. The graph also shows that the predicted values are close to the actual values.

Step 6: Model Prediction for New Data

    Up to now, we have predicted the values in the dataset. Let’s use our regression model to predict new, unseen data.

Let’s take the temperature as 1.9929 °C and predict the units of ice cream sales.

# Predict a new value
X_new = np.array([[1.9929]])  # Example value to predict
X_new_poly = poly_features.transform(X_new)
y_new_pred = lr_model.predict(X_new_poly)
print(y_new_pred)

    Output

    [8.57450466]
    

The above result shows that the predicted value of ice cream sales is approximately 8.57 units.

  • Multiple Linear Regression in Machine Learning

    Multiple linear regression in machine learning is a supervised algorithm that models the relationship between a dependent variable and multiple independent variables. This relationship is used to predict the outcome of the dependent variable.

    Multiple linear regression is a type of linear regression in machine learning. There are mainly two types of linear regression algorithms −

• Simple linear regression − deals with one independent variable and one dependent variable.
• Multiple linear regression − deals with one dependent variable and two or more independent variables.

    Let’s discuss multiple linear regression in detail −

    What is Multiple Linear Regression?

    In machine learning, multiple linear regression (MLR) is a statistical technique that is used to predict the outcome of a dependent variable based on the values of multiple independent variables. The multiple linear regression algorithm is trained on data to learn a relationship (known as a regression line) that best fits the data. This relation describes how various factors affect the result. This relation is used to forecast the value of dependent variable based on the values of independent variables.

In linear regression (simple and multiple), the dependent variable is continuous (a numeric value), and the independent variables can be continuous or discrete (numeric values). Independent variables can also be categorical (gender, occupation), but they need to be converted to numerical values first, as shown in the sketch below.
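As an illustration of that conversion (a hedged sketch with a made-up column, not part of the dataset used later in this chapter), a categorical feature can be one-hot encoded with pandas before fitting the regression model −

import pandas as pd

# Hypothetical categorical feature
df = pd.DataFrame({"State": ["New York", "California", "Florida", "New York"]})

# One-hot encode the categorical column into numeric 0/1 columns
encoded = pd.get_dummies(df, columns=["State"], drop_first=True)
print(encoded)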

    Multiple linear regression is basically the extension of simple linear regression that predicts a response using two or more features. Mathematically we can represent the multiple linear regression as follows −

Consider a dataset having n observations and p features (independent variables), with y as the response (dependent variable). The regression line for p features can then be written as follows −

h(x_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + ⋯ + w_p x_{ip}

Here, h(x_i) is the predicted response value and w_0, w_1, w_2, …, w_p are the regression coefficients.

Multiple linear regression models always include an error term, known as the residual error, which changes the calculation as follows −

y_i = w_0 + w_1 x_{i1} + w_2 x_{i2} + ⋯ + w_p x_{ip} + e_i

    We can also write the above equation as follows −

y_i = h(x_i) + e_i, or equivalently, e_i = y_i − h(x_i)

    Assumptions of Multiple Linear Regression

    The following are some assumptions about the dataset that are made by the multiple linear regression model −

    1. Linearity

    The relationship between the dependent variable (target) and independent (predictor) variables is linear.

    2. Independence

    Each observation is independent of others. The value of the dependent variable for one observation is independent of the value of another.

    3. Homoscedasticity

For all observations, the variance of the residual errors is constant across the values of the independent variables.

    4. Normality of Errors

    The residuals (errors) are normally distributed. The residuals are differences between the actual and predicted values.

    5. No Multicollinearity

    The independent variables are not highly correlated with each other. Linear regression models assume that there is very little or no multi-collinearity in the data.

    6. No Autocorrelation

    There is no correlation between residuals. This ensures that the residuals (errors) are independent of each other.

    7. Fixed Independent Variables

    The values of independent variables are fixed in all repeated samples.

    Violations of these assumptions can lead to biased or inefficient estimates. It is essential to validate these assumptions to ensure model accuracy.
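As a quick, informal check of the no-multicollinearity assumption, you can look at the pairwise correlations between the predictors. The sketch below uses made-up values purely for illustration −

import pandas as pd

# Toy predictor matrix (illustrative values only)
X_demo = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8],  # roughly 2 * x1, so strongly correlated with x1
    "x3": [5.0, 3.0, 6.0, 2.0, 7.0],
})

# Off-diagonal correlations close to +1 or -1 signal multicollinearity
print(X_demo.corr())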

    Implementing Multiple Linear Regression in Python

    To implement multiple linear regression in Python using Scikit-Learn, we can use the same LinearRegression class as in simple linear regression, but this time we need to provide multiple independent variables as input.

    Step 1: Data Preparation

    We use the dataset named data.csv with 50 examples. It contains four predictor (independent) variables and a target (dependent) variable. The following table represents the data in data.csv file.

    data.csv

R&D Spend,Administration,Marketing Spend,State,Profit
165349.2,136897.8,471784.1,New York,192261.8
162597.7,151377.6,443898.5,California,191792.1
153441.5,101145.6,407934.5,Florida,191050.4
144372.4,118671.9,383199.6,New York,182902
142107.3,91391.77,366168.4,Florida,166187.9
131876.9,99814.71,362861.4,New York,156991.1
134615.5,147198.9,127716.8,California,156122.5
130298.1,145530.1,323876.7,Florida,155752.6
120542.5,148719,311613.3,New York,152211.8
123334.9,108679.2,304981.6,California,149760
101913.1,110594.1,229161,Florida,146122
100672,91790.61,249744.6,California,144259.4
93863.75,127320.4,249839.4,Florida,141585.5
91992.39,135495.1,252664.9,California,134307.4
119943.2,156547.4,256512.9,Florida,132602.7
114523.6,122616.8,261776.2,New York,129917
78013.11,121597.6,264346.1,California,126992.9
94657.16,145077.6,282574.3,New York,125370.4
91749.16,114175.8,294919.6,Florida,124266.9
86419.7,153514.1,0,New York,122776.9
76253.86,113867.3,298664.5,California,118474
78389.47,153773.4,299737.3,New York,111313
73994.56,122782.8,303319.3,Florida,110352.3
67532.53,105751,304768.7,Florida,108734
77044.01,99281.34,140574.8,New York,108552
64664.71,139553.2,137962.6,California,107404.3
75328.87,144136,134050.1,Florida,105733.5
72107.6,127864.6,353183.8,New York,105008.3
66051.52,182645.6,118148.2,Florida,103282.4
65605.48,153032.1,107138.4,New York,101004.6
61994.48,115641.3,91131.24,Florida,99937.59
61136.38,152701.9,88218.23,New York,97483.56
63408.86,129219.6,46085.25,California,97427.84
55493.95,103057.5,214634.8,Florida,96778.92
46426.07,157693.9,210797.7,California,96712.8
46014.02,85047.44,205517.6,New York,96479.51
28663.76,127056.2,201126.8,Florida,90708.19
44069.95,51283.14,197029.4,California,89949.14
20229.59,65947.93,185265.1,New York,81229.06
38558.51,82982.09,174999.3,California,81005.76
28754.33,118546.1,172795.7,California,78239.91
27892.92,84710.77,164470.7,Florida,77798.83
23640.93,96189.63,148001.1,California,71498.49
15505.73,127382.3,35534.17,New York,69758.98
22177.74,154806.1,28334.72,California,65200.33
1000.23,124153,1903.93,New York,64926.08
1315.46,115816.2,297114.5,Florida,49490.75
0,135426.9,0,California,42559.73
542.05,51743.15,0,New York,35673.41
0,116983.8,45173.06,California,14681.4

    You can create a CSV file and store the above data points in it.

    We have our dataset as data.csv file. We will use it to understand the implementation of the multiple linear regression in Python.

    We need to import libraries before loading the dataset.

# import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
    

    Load the dataset

We load our dataset as a pandas DataFrame named dataset. Now let’s create a list of independent values (predictors) and put them in a variable called X.

The independent values are 'R&D Spend', 'Administration', and 'Marketing Spend'. We are not using the variable 'State', for the sake of simplicity.

    We put the dependent variable values to a variable y.

    # load dataset
    dataset = pd.read_csv('data.csv')
    X = dataset[['R&D Spend','Administration','Marketing Spend']]
    y = dataset['Profit']

    Let’s check first five examples (rows) of input features (X) and target (y) −

    X.head()

    Output

    	R&D Spend	Administration	Marketing Spend
    0	165349.20	136897.80	471784.10
    1	162597.70	151377.59	443898.53
    2	153441.51	101145.55	407934.54
    3	144372.41	118671.85	383199.62
    4	142107.34	91391.77	366168.42
    
    y.head()

    Output

    	Profit
    0	192261.83
    1	191792.06
    2	191050.39
    3	182901.99
    4	166187.94
    

    Split the dataset into training and test sets

Now, we split the dataset into a training set and a test set. Both X (independent values) and y (dependent values) are divided into two sets: training and test. We will use 20% of the data for the test set, so out of the 50 observations there will be 40 in the training set and 10 in the test set.

# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Here, X_train and X_test represent the input features in the training and test sets, while y_train and y_test represent the target values (output) in the training and test sets.
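To confirm the 40/10 split described above, a quick check (assuming the split code above has just been run) is to print the shapes −

# Expect (40, 3), (10, 3), (40,), (10,) for this 50-row, 3-feature dataset
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)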

    Step 2: Model Training

The next step is to fit our model to the training data. We will use the LinearRegression class from the sklearn.linear_model module to create a linear regression object, here named regressor.

# Fit Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

The regressor object has a fit() method, which fits the linear regression object regressor to the training data. The model learns the relationship between the predictor variables (X_train) and the target variable (y_train).

    Step 3: Model Testing

    Now our model is ready to use for prediction. Let’s test our regressor model on test data.

We use the predict() method to predict the results for the test set. It takes the input features (X_test) and returns the predicted values.

    y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Real Values': y_test, 'Predicted Values': y_pred})
print(df)

    Output

    	Real Values	Predicted Values
    23	108733.99	110159.827849
    43	69758.98	59787.885207
    26	105733.54	110545.686823
    34	96712.80	88204.710014
    24	108552.04	114094.816702
    39	81005.76	84152.640761
    44	65200.33	63862.256006
    18	124266.90	129379.514419
    47	42559.73	45832.902722
    17	125370.37	130086.829016
    

    You can compare the actual values and predicted values.

    Step 4: Model Evaluation

    We now evaluate our model to check how accurate it is. We will use mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), and R2-score (Coefficient of determination).

    from sklearn.metrics import mean_squared_error, root_mean_squared_error, mean_absolute_error, r2_score
    # Assuming you have your true y values (y_test) and predicted y values (y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = root_mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Absolute Error (MAE):", mae)
print("R-squared (R2):", r2)

    Output

    Mean Squared Error (MSE): 72684687.6336162
    Root Mean Squared Error (RMSE): 8525.531516193943
    Mean Absolute Error (MAE): 6425.118502810154
    R-squared (R2): 0.9588459519573707
    

You can examine the above metrics. Our model shows an R-squared score of around 0.96, which means that about 96% of the variation in the output variable is explained by the input variables.

    Step 5: Model Prediction for New Data

Let’s use our regressor model to predict the profit when R&D Spend is 166343.2, Administration is 136787.8, and Marketing Spend is 461724.1.

['R&D Spend', 'Administration', 'Marketing Spend'] = [166343.2, 136787.8, 461724.1]

# predict profit when R&D Spend is 166343.2, Administration is 136787.8 and Marketing Spend is 461724.1
new_data = [[166343.2, 136787.8, 461724.1]]
profit = regressor.predict(new_data)
print(profit)

    Output

    [193053.61874652]
    

The model predicts a profit of approximately 193053.62 for the above three values.

    Model Parameters (Coefficients and Intercept)

    The model parameters (intercept and coefficients) describe the relation between a dependent variable and the independent variables.

    Our regression model for the above use case,

Y = w_0 + w_1 X_1 + w_2 X_2 + w_3 X_3

w_0 is the intercept and w_1, w_2, w_3 are the coefficients of X_1, X_2, X_3, respectively.

    Here,

    • X1 represents R&D Spend,
    • X2 represents Administration, and
    • X3 represents Marketing Spend.

    Let’s first compute the intercept and coefficients.

    print("coefficients: ", regressor.coef_)print("intercept: ", regressor.intercept_)

    Output

    coefficients: [ 0.81129358 -0.06184074  0.02515044]
    intercept: 54946.94052163202
    

    The above output shows the following –

    • w0 = 54946.94052163202
    • w1 = 0.81129358
    • w2 = -0.06184074
    • w3 = 0.02515044

    Result Explanation

    We have calculated intercept (w0) and coefficients (w1, w2, w3).

    The coefficients are as follows –

    • R&D Spend: 0.81129358
    • Administration: -0.06184074
    • Marketing Spend: 0.02515044

This shows that if R&D Spend is increased by 1 USD, the Profit will increase by about 0.8113 USD.

The result shows that when Administration spend is increased by 1 USD, the Profit will decrease by about 0.0618 USD.

And when Marketing Spend increases by 1 USD, the Profit increases by about 0.0252 USD.

    Let’s verify the result,

    In step 5, we have predicted Profit for new data as 193053.61874652

    Here,

new_data = [[166343.2, 136787.8, 461724.1]]
Profit = 54946.94052163202 + 0.81129358*166343.2 - 0.06184074*136787.8 + 0.02515044*461724.1
Profit = 193053.616257

This is approximately the same as the model prediction. Why only approximately? Because the coefficients printed above are rounded, whereas the model prediction uses their full precision.

difference = 193053.61874652 - 193053.616257
difference = 0.00248952
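The same check can be repeated in code with the unrounded coefficients stored in the fitted model; this is a sanity-check sketch that assumes the regressor trained above, not part of the original walkthrough −

# Recompute the prediction directly from intercept_ and coef_ (no rounding)
manual_prediction = regressor.intercept_ + np.dot(regressor.coef_, [166343.2, 136787.8, 461724.1])
print(manual_prediction)  # matches regressor.predict(new_data) up to floating-point precision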
    

    Applications of Multiple Linear Regression

    The following are some commonly used applications of multiple linear regression −

Application | Description
Finance | Predicting stock prices, forecasting exchange rates, assessing credit risk.
Marketing | Predicting sales, customer churn, and marketing campaign effectiveness.
Real Estate | Predicting house prices based on factors like size, location, and number of bedrooms.
Healthcare | Predicting patient outcomes, analyzing the impact of treatments, and identifying risk factors for diseases.
Economics | Forecasting economic growth, analyzing the impact of policies, and predicting inflation rates.
Social Sciences | Modeling social phenomena, predicting election outcomes, and understanding human behavior.

    Challenges of Multiple Linear Regression

    The following are some common challenges faced by multiple linear regression in machine learning −

Challenge | Description
Multicollinearity | High correlation between independent variables, leading to unstable model coefficients and difficulty in interpreting the impact of individual variables.
Overfitting | The model fits the training data too closely, leading to poor performance on new, unseen data.
Underfitting | The model fails to capture the underlying patterns in the data, resulting in poor performance on both training and test data.
Non-linearity | Multiple linear regression assumes a linear relationship between the independent and dependent variables. Non-linear relationships can lead to inaccurate predictions.
Outliers | Outliers can significantly impact the model’s performance, especially in small datasets.
Missing Data | Missing data can lead to biased and inaccurate results.

    Difference Between Simple and Multiple Linear Regression

    The following table highlights the major differences between simple and multiple linear regression −

Feature | Simple Linear Regression | Multiple Linear Regression
Independent Variables | One | Two or more
Model Equation | y = w_0 + w_1 x | y = w_0 + w_1 x_1 + w_2 x_2 + … + w_p x_p
Complexity | Less complex | More complex due to multiple variables
Real-world Applications | Predicting house prices based on square footage, predicting sales based on advertising expenditure | Predicting sales based on advertising expenditure, price, and competitor activity; predicting student performance based on study hours, attendance, and IQ
Model Interpretation | Easier to interpret coefficients | More complex to interpret due to multiple variables

  • Simple Linear Regression in Machine Learning

    What is Simple Linear Regression?

    Simple linear regression is a statistical and supervised learning method in which a single independent variable (also known as a predictor variable) is used to predict the dependent variable. In other words, it models the linear relationship between the dependent variable and a single independent variable.

    Simple linear regression in machine learning is a type of linear regression. When the linear regression algorithm deals with a single independent variable, it is known as simple linear regression. When there is more than one independent variable (feature variables), it is known as multiple linear regression.

    Independent Variable

    The feature inputs in the dataset are termed as the independent variables. There is only a single independent variable in simple linear regression. An independent variable is also known as a predictor variable as it is used to predict the target value. It is plotted on a horizontal axis.

    Dependent Variable

    The target value in the dataset is termed as the dependent variable. It is also known as a response variable or predicted variable. It is plotted on a vertical axis.

    Line of Regression

    In simple linear regression, a line of regression is a straight line that best fits the data points and is used to show the relationship between a dependent variable and an independent variable.

    Graphical Representation

    The following graph depicts the simple linear regression model −

    ML Simple Linear Regression

In the above image, the straight line represents the simple linear regression line, where Ŷ is the predicted value, Y is the dependent variable (target), and X is the independent variable (input).

    Simple Linear Regression Model

    A simple linear regression model in machine learning can be represented as the following mathematical equation −

Y = w_0 + w_1 X + ε

    Where

    • Y is the dependent variable (target).
    • X is the independent variable (feature).
    • w0 is the y-intercept of the line.
    • w1 is the slope of the line, representing the effect of X on Y.
    • ε is the error term, capturing the variability in Y not explained by X.

How Does Simple Linear Regression Work?

The main aim of simple linear regression is to find the best-fit line (a straight line) through the data points that minimizes the difference between the actual values and the predicted values.

    Defining Hypothesis Function

    In simple linear regression, the hypothesis is that there is a linear relation between the dependent variable (output/ target) and the independent variable (input). This linear relation can be represented using a linear equation −

Ŷ = w_0 + w_1 X

With different values of the parameters w_0 and w_1, there are multiple linear equations (straight lines). The set of all such linear equations (all straight lines) is termed the hypothesis space.

    Now, the main aim of the simple linear regression model is to find the best-fit line in Hypothesis space (set of all straight lines).

    Finding the Best Fit Line

Now the task is to find the best-fit line (line of regression). To do this, we define a cost function or loss function that measures the difference between the actual values and the predicted values.

    To find the best fit line, the simple linear regression model initializes (with default values) the parameters of the regression line. This regression line (with initialized parameters) is used to find the predicted values for the given input values.

    Loss Function for Simple Linear Regression

    Now using the input and predicted values, we compute the loss function. The loss function is used to find the optimal values of the parameters.

The loss function measures the difference between the actual values and the predicted values. There are different loss functions used in simple linear regression, such as mean squared error (MSE), mean absolute error (MAE), and R-squared. The most commonly used loss function is the mean squared error.

    The loss function for simple linear regression in terms of mean squared error is as follows −

J(w_0, w_1) = (1/2n) Σ_{i=1}^{n} (Y_i − Ŷ_i)^2

    Optimization

    The optimal values of parameters are those values that minimize the cost function. Finding the optimal values is an iterative process in which the parameters are updated iteratively.

    There are many optimization techniques applied in simple linear regression. Gradient Descent is a simple and most common optimization technique used in simple linear regression.

    A linear equation with optimal parameter values is the best fit line(regression line) and it is the final solution for a simple linear regression problem. This line is used to predict new and unseen data.

    Assumptions of Simple Linear Regression

    There are some assumptions about the dataset that are made by the simple linear regression model. The following are some assumptions −

• Linearity − This assumption assumes that the relationship between the dependent and independent variables is linear. That means the dependent variable changes linearly as the independent variable changes. A scatter plot will show the linearity in the dataset.
• Homoskedasticity − For all observations, the variance of the residuals is the same. This assumption relates to the squared residuals.
• Independence − The examples (observations, or X and Y pairs) are independent. There is no collinearity in the data, so the residuals will not be correlated. To check this, we examine the scatter plot of residuals vs. fitted values.
• Normality − Model residuals are normally distributed. Residuals are the differences between the actual and predicted values. To check for normality, we examine the histogram of residuals, which should be approximately bell-shaped; a sketch of these two checks follows this list.
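The last two checks can be sketched in a few lines; the arrays below are illustrative placeholders, and in practice you would use the actual and predicted values produced by the model built in the implementation that follows −

import numpy as np
import matplotlib.pyplot as plt

# Illustrative placeholders; replace with your actual targets and model predictions
y_actual = np.array([39343, 46205, 37731, 43525, 39891, 56642], dtype=float)
y_predicted = np.array([38000, 45000, 40000, 44000, 41000, 55000], dtype=float)
residuals = y_actual - y_predicted

# Residuals vs. fitted values: should show no obvious pattern
plt.scatter(y_predicted, residuals, color="blue")
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Histogram of residuals: should be roughly bell-shaped
plt.hist(residuals, bins=5, color="green")
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.show()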

    Implementation of Simple Linear Regression Algorithm using Python

    To implement the simple linear regression algorithm, we are taking a dataset with two variables: YearsExperience (independent variable) and Salary (dependent variable).

    Here, we are using the following dataset. The dataset contains 30 examples of data points. You can create a CSV file and store these data points in it.

    Salary_Data.csv

Years of Experience,Salary
1.1,39343
1.3,46205
1.5,37731
2,43525
2.2,39891
2.9,56642
3,60150
3.2,54445
3.2,64445
3.7,57189
3.9,63218
4,55794
4,56957
4.1,57081
4.5,61111
4.9,67938
5.1,66029
5.3,83088
5.9,81363
6,93940
6.8,91738
7.1,98273
7.9,101302
8.2,113812
8.7,109431
9,105582
9.5,116969
9.6,112635
10.3,122391
10.5,121872

    What is the purpose of this implementation?

    The purpose of building this simple linear regression model is to determine which line best represents the relationship between the two variables.

    The following are the steps to implement the simple linear regression model in Python −

    Step 1: Data Preparation

    Data preparation or pre-processing is the initial step. We have our dataset as a CSV file named “Salary_Data.csv,” as discussed above.

We need to import the Python libraries prior to importing the dataset and building the simple linear regression model.

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    

    Load the dataset

    dataset = pd.read_csv('Salary_Data.csv')

The independent variable (X) and the dependent variable (y) must then be extracted from the provided dataset. Years of experience (YearsExperience) is the independent variable, and Salary is the dependent variable.

    X = dataset.iloc[:,:-1].values
    y = dataset.iloc[:,-1].values
    

    Let’s check the first five examples of the dataset.

    print(dataset.head())

    Output

    0	1.1	39343.0
    1	1.3	46205.0
    2	1.5	37731.0
    3	2.0	43525.0
    4	2.2	39891.0
    

Let’s check whether the relationship in the dataset is linear.

    plt.scatter(X, y, color="green")
    plt.title("Salary vs Experience")
    plt.xlabel("Years of Experience")
    plt.ylabel("Salary (INR)")
    plt.show()

    Output

    Linear Relation Between Dependent and Independent Variables

The above graph shows that the dependent and independent variables are linearly related. So we can apply simple linear regression to the dataset to find the best relationship between these variables.

    Split the dataset into training and testing sets

The dataset will then be divided into a training set and a test set. We will use 80% of the observations for the training set and 20% for the test set, so out of the 30 observations there will be 24 in the training set and 6 in the test set. We divide the dataset into training and test sets so that we can use one set to train and the other to test our model.

# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    Here, X_train represents the input feature of the training data and y_train represents the output variable (target variable).

    Step 2: Model Training (Fitting the Simple Linear Regression to Training Set)

    The next step is fitting our model with the training dataset. We will use scikit-learn’s LinearRegression class to train a simple linear regression model on the training data. The code for this is as follows −

    from sklearn.linear_model import LinearRegression
    
    # Create a linear regression object
    regressor= LinearRegression()
    regressor.fit(X_train, y_train)

    The fit() method is used to fit the linear regression object (regressor) to the training data. The model learns the relation between the predictor variable (X_train), and the target variable (y_train).

    Step 3: Model Testing

    Once the model is trained, we can use it to make predictions on the test data. The code for this is as follows −

    y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual Values': y_test, 'Predicted Values': y_pred})
print(df)

    Output

       Actual Values  Predicted Values
    0        60150.0      54093.648425
    1        93940.0      82416.119864
    2        57081.0      64478.554619
    3       116969.0     115459.003211
    4        56957.0      63534.472238
    5       121872.0     124899.827024
    

    The above output shows actual values and predicted values of Salary in the test set.

    Here, X_test represents the input feature of the test data and y_pred represents the predicted output variable (target variable).

    Similarly, you can test the model with training data.

y_pred = regressor.predict(X_train)
df = pd.DataFrame({'Real Values': y_train, 'Predicted Values': y_pred})
print(df)

    Output

        Real Values  Predicted Values
    0       57189.0      60702.225094
    1       64445.0      55981.813188
    2       63218.0      62590.389857
    3      122391.0     123011.662261
    4       91738.0      89968.778915
    5       43525.0      44652.824612
    6       61111.0      68254.884145
    7       56642.0      53149.566044
    8       66029.0      73919.378433
    9       83088.0      75807.543195
    10      46205.0      38044.247943
    11     109431.0     107906.344160
    12      98273.0      92801.026059
    13      37731.0      39932.412705
    14      54445.0      55981.813188
    15      39891.0      46540.989374
    16     101302.0     100353.685109
    17      55794.0      63534.472238
    18      81363.0      81472.037483
    19      39343.0      36156.083180
    20     113812.0     103185.932253
    21      67938.0      72031.213670
    22     112635.0     116403.085592
    23     105582.0     110738.591304
    

    Step 4: Model Evaluation

We need to evaluate the performance of the model to determine its accuracy. We will use the mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination (R^2) as evaluation metrics. The code for this is as follows −

    from sklearn.metrics import mean_squared_error
    from sklearn.metrics import mean_absolute_error
    from sklearn.metrics import r2_score
    
# get the predicted values for the test data
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("mse: ", mse)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
print("rmse: ", rmse)
mae = mean_absolute_error(y_test, y_pred)
print("mae: ", mae)
r2 = r2_score(y_test, y_pred)
print("r2: ", r2)

    Output

    mse:  46485664.99327367
rmse:  6818.0396737826095
    mae:  6015.513730219523
    r2:  0.9399326805390613
    

    Here, y_test represents the actual output variable of the test data.

    Step 5: Visualize Training Set Results (with Regression Line)

    Now, let’s visualize the results on the training set and the regression line.

    We use the scatter plot to plot the actual values (input and target values) in the training set. We also plot a straight line (regression line) for actual values (input) and predicted values of the training set.

    y_pred = regressor.predict(X_train)
    plt.scatter(X_train, y_train, color="green", label="training data points (actual)")
    plt.scatter(X_train, y_pred, color="blue",label="training data points (predicted)")
    plt.plot(X_train, y_pred, color="red")
    plt.title("Salary vs Experience (Training Dataset)")
    plt.xlabel("Years of Experience")
    plt.ylabel("Salary(In Rupees)")
    plt.legend()
    plt.show()

    Output

    Visualizing training set results

    The above graph shows the line of regression (straight line in red color), actual values (in green color), and predicted values (in blue color) for the training set.

    Step 6: Visualize the Test Set Results (with Regression Line)

    Now, let’s visualize the results on the test set and the regression line.

    We use the scatter plot to plot the actual values (input and target values) in the test set. We also plot a straight line (regression line) for actual values (input) and predicted values of the test set.

    y_pred = regressor.predict(X_test)
    plt.scatter(X_test, y_test, color="green", label="test data points (actual)")
    plt.scatter(X_test, y_pred, color="blue",label="test data points (predicted)")
    plt.plot(X_test, y_pred, color="red")
    plt.title("Salary vs Experience (Test Dataset)")
    plt.xlabel("Years of Experience")
    plt.ylabel("Salary(In Rupees)")
    plt.legend()
    plt.show()

    Output

    Visualizing test set results

    The above graph shows the line of regression (straight line in red color), actual values (in green color), and predicted values (in blue color) for the test set.

  • Linear Regression in Machine Learning

    Linear regression in machine learning is defined as a statistical model that analyzes the linear relationship between a dependent variable and a given set of independent variables. The linear relationship between variables means that when the value of one or more independent variables will change (increase or decrease), the value of the dependent variable will also change accordingly (increase or decrease).

    In machine learning, linear regression is used for predicting continuous numeric values based on learned linear relation for new and unseen data. It is used in predictive modeling, financial forecasting, risk assessment, etc.

    In this chapter, we will discuss the following topics in detail −

    • What is Linear Regression?
    • Types of Linear Regression
    • How Does Linear Regression Work?
    • Hypothesis Function For Linear Regression
    • Finding the Best Fit Line
    • Loss Function For Linear Regression
    • Gradient Descent for Optimization
    • Assumptions of Linear Regression
    • Evaluation Metrics for Linear Regression
    • Applications of Linear Regression
    • Advantages of Linear Regression
    • Common Challenges with Linear Regression

    What is Linear Regression?

    Linear regression is a statistical technique that estimates the linear relationship between a dependent and one or more independent variables. In machine learning, linear regression is implemented as a supervised learning approach. In machine learning, labeled datasets contain input data (features) and output labels (target values). For linear regression in machine learning, we represent features as independent variables and target values as the dependent variable.

For simplicity, take the following data (a single feature and a single target) −

Square Feet (X) | House Price (Y)
1300 | 240
1500 | 320
1700 | 330
1830 | 295
1550 | 256
2350 | 409
1450 | 319

In the above data, the feature Square Feet is the independent variable, represented by X, and the target House Price is the dependent variable, represented by Y. The input feature (X) is used to predict the target label (Y). So, the independent variables are also known as predictor variables, and the dependent variable is known as the response variable.

So let’s define linear regression in machine learning as follows:

In machine learning, linear regression uses a linear equation to model the relationship between a dependent variable (Y) and one or more independent variables (X).

    The main goal of the linear regression model is to find the best-fitting straight line (often called a regression line) through a set of data points.
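As a small illustration of finding such a line, here is a minimal sketch that fits the table above with Scikit-learn; the printed slope and intercept are whatever the fit produces, and nothing below is quoted from the original text −

import numpy as np
from sklearn.linear_model import LinearRegression

# Data from the table above: square feet (X) and house price (Y)
X = np.array([[1300], [1500], [1700], [1830], [1550], [2350], [1450]])
y = np.array([240, 320, 330, 295, 256, 409, 319])

model = LinearRegression().fit(X, y)
print("slope (w1):", model.coef_[0])
print("intercept (w0):", model.intercept_)
print("predicted price for 2000 sq ft:", model.predict([[2000]])[0])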

    Line of Regression

    A straight line that shows a relation between the dependent variable and independent variables is known as the line of regression or regression line.

    ML Regression Line

    Furthermore, the linear relationship can be positive or negative in nature as explained below −

    1. Positive Linear Relationship

A linear relationship is called positive if the dependent variable increases as the independent variable increases. It can be understood with the help of the following graph −

    Positive Linear Relationship

    2. Negative Linear Relationship

A linear relationship is called negative if the dependent variable decreases as the independent variable increases. It can be understood with the help of the following graph −

    Negative Linear Relationship

    Linear regression is of two types, “simple linear regression” and “multiple linear regression”, which we are going to discuss in the next two chapters of this tutorial.

    Types of Linear Regression

    Linear regression is of the following two types −

    • Simple Linear Regression
    • Multiple Linear Regression

    1. Simple Linear Regression

    Simple linear regression is a type of regression analysis in which a single independent variable (also known as a predictor variable) is used to predict the dependent variable. In other words, it models the linear relationship between the dependent variable and a single independent variable.

    ML Simple Linear Regression

In the above image, the straight line represents the simple linear regression line, where Ŷ is the predicted value and X is the input value.

    Mathematically, the relationship can be modeled as a linear equation −

Y = w_0 + w_1 X + ε

    Where

    • Y is the dependent variable (target).
    • X is the independent variable (feature).
    • w0 is the y-intercept of the line.
    • w1 is the slope of the line, representing the effect of X on Y.
    • ε is the error term, capturing the variability in Y not explained by X.

    2. Multiple Linear Regression

    Multiple linear regression is basically the extension of simple linear regression that predicts a response using two or more features.

When dealing with more than one independent variable, we extend simple linear regression to multiple linear regression. The model is expressed as:

Y = w_0 + w_1 X_1 + w_2 X_2 + ⋯ + w_p X_p + ε

    Where

    • X1, X2, …, Xp are the independent variables (features).
    • w0, w1, …, wp are the coefficients for these variables.
    • ε is the error term.

    How Does Linear Regression Work?

The main goal of linear regression is to find the best-fit line through a set of data points that minimizes the difference between the actual values and the predicted values. How is this done? By estimating the parameters w0, w1, and so on.

    The working of linear regression in machine learning can be broken down into many steps as follows −

    • Hypothesis − We assume that there is a linear relation between input and output.
• Cost Function − Define a loss or cost function. The cost function quantifies the model’s prediction error: it takes the model’s predicted values and actual values and returns a single scalar value that represents the cost of the model’s prediction.
    • Optimization − Optimize (minimize) the model’s cost function by updating the model’s parameters.

    It continues updating the model’s parameters until the cost or error of the model’s prediction is optimized (minimized).

    Let’s discuss the above three steps in more detail −

    Hypothesis Function For Linear Regression

In linear regression problems, we assume that there is a linear relationship between the input features (X) and the predicted value (Ŷ).

The hypothesis function returns the predicted value for a given input value. Generally, we represent a hypothesis by h_w(X), and it is equal to Ŷ.

    Hypothesis function for simple linear regression −

Ŷ = w_0 + w_1 X

    Hypothesis function for multiple linear regression −

Ŷ = w_0 + w_1 X_1 + w_2 X_2 + ⋯ + w_p X_p

For different values of the parameters (weights), we get different regression lines. The main goal is to find the best-fit line. Let’s discuss this below −

    Finding the Best Fit Line

We discussed above that different sets of parameters produce different regression lines. However, not every regression line represents the optimal relationship between the input and output values. The main goal is to find the best-fit line.

    A regression line is said to be the best fit if the error between actual and predicted values is minimal.

The below image shows a regression line with an error (ε) at input data point X. The error is calculated for all data points, and our goal is to minimize the average error/loss. We can use different types of loss functions, such as mean squared error (MSE), mean absolute error (MAE), L1 loss, and L2 loss.

    ML Best Fit Line Representation

    So, how can we minimize the error between the actual and predicted values? Let’s discuss the important concept, which is cost function or loss function.

    Loss Function for Linear Regression

The error between actual and predicted values can be quantified using a loss function or cost function. The cost function takes the model’s predicted values and actual values and returns a single scalar value that represents the cost of the model’s prediction. Our main goal is to minimize the cost function.

    The most commonly used cost function is the mean squared error function.

J(w_0, w_1) = (1/2n) Σ_{i=1}^{n} (Y_i − Ŷ_i)^2

    Where,

• n is the number of data points.
• Y_i is the observed value for the i-th data point.
• Ŷ_i = w_0 + w_1 X_i is the predicted value for the i-th data point.

    Gradient Descent for Optimization

    Now we have defined our loss function. The next step is to minimize it and find the optimized values of the parameters or weights. This process of finding optimal values of parameters such that the loss or error is minimal is known as model optimization.

    Gradient Descent is one of the most used optimization techniques for linear regression.

    To find the optimal values of parameters, gradient descent is often used, especially in cases with large datasets. Gradient descent iteratively adjusts the parameters in the direction of the steepest descent of the cost function.

    The parameter updates are given by

w_0 = w_0 − α ∂J/∂w_0

w_1 = w_1 − α ∂J/∂w_1

Where α is the learning rate, and the partial derivatives are:

∂J/∂w_0 = −(1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i)

∂J/∂w_1 = −(1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i) X_i

    These gradients are used to update the parameters until convergence is reached (i.e., when the changes in w0 and w1 become negligible).
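The update rule above can be implemented directly. Below is a minimal NumPy sketch for simple linear regression; the toy data, learning rate, and iteration count are illustrative choices, not values from the text −

import numpy as np

# Toy data that roughly follows y = 2x + 1
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

w0, w1 = 0.0, 0.0        # initial parameters
alpha, n = 0.01, len(X)  # learning rate and number of data points

for _ in range(5000):
    Y_hat = w0 + w1 * X                       # current predictions
    dw0 = -(1 / n) * np.sum(Y - Y_hat)        # dJ/dw0
    dw1 = -(1 / n) * np.sum((Y - Y_hat) * X)  # dJ/dw1
    w0 -= alpha * dw0                         # gradient descent updates
    w1 -= alpha * dw1

print("w0:", w0, "w1:", w1)  # should approach the best-fit intercept and slope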

    Assumptions of Linear Regression

    The following are some assumptions about the dataset that are made by the Linear Regression model −

Multi-collinearity − The linear regression model assumes that there is very little or no multi-collinearity in the data. Multi-collinearity occurs when the independent variables (features) are dependent on one another.

Auto-correlation − Another assumption the linear regression model makes is that there is very little or no auto-correlation in the data. Auto-correlation occurs when there is dependency between the residual errors.

    Relationship between variables − Linear regression model assumes that the relationship between response and feature variables must be linear.

    Violations of these assumptions can lead to biased or inefficient estimates. It is essential to validate these assumptions to ensure model accuracy.

    Evaluation Metrics for Linear Regression

    To assess the performance of a linear regression model, several evaluation metrics are used −

    R-squared (R2) − It measures the proportion of the variance in the dependent variable that is predictable from the independent variables.

R² = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)²

    Mean Squared Error (MSE) − It measures an average of the sum of the squared difference between the predicted values and the actual values.

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²

    Root Mean Squared Error (RMSE) − It measures the square root of the MSE.

RMSE = √MSE

    Mean Absolute Error (MAE) − It measures the average of the sum of the absolute values of the difference between the predicted values and the actual values.

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|
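All four metrics can be computed with Scikit-learn (or directly with NumPy); the sketch below uses made-up values purely for illustration −

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))  # RMSE is just the square root of MSE
print("MAE :", mean_absolute_error(y_true, y_pred))
print("R2  :", r2_score(y_true, y_pred))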

    Applications of Linear Regression

    1. Predictive Modeling

    Linear regression is widely used for predictive modeling. For instance, in real estate, predicting house prices based on features such as size, location, and number of bedrooms can help buyers, sellers, and real estate agents make informed decisions.

    2. Feature Selection

    In multiple linear regression, analyzing the coefficients can help in feature selection. Features with small or zero coefficients might be considered less important and can be dropped to simplify the model.

    3. Financial Forecasting

    In finance, linear regression models predict stock prices, economic indicators, and market trends. Accurate forecasts can guide investment strategies and financial planning.

    4. Risk Management

    Linear regression helps in risk assessment by modeling the relationship between risk factors and financial metrics. For example, in insurance, it can model the relationship between policyholder characteristics and claim amounts.

    Advantages of Linear Regression

    • Interpretability − Linear regression is easy to understand, which is useful when explaining how a model makes decisions.
    • Speed − Linear regression is faster to train than many other machine learning algorithms.
    • Predictive analytics − Linear regression is a fundamental building block for predictive analytics.
    • Linear relationships − Linear regression is a powerful statistical method for finding linear relationships between variables.
    • Simplicity − Linear regression is simple to implement and interpret.
    • Efficiency − Linear regression is efficient to compute.

    Common Challenges with Linear Regression

    1. Overfitting

    Overfitting occurs when the regression model performs well on training data but lacks generalization on test data. Overfitting leads to poor prediction on new, unseen data.

    2. Multicollinearity

    When the independent variables (predictor or feature variables) are correlated with each other, the situation is known as multicollinearity. In this case, the estimates of the parameters (coefficients) can be unstable.

    3. Outliers and Their Impact

    Outliers can cause the regression line to be a poor fit for the majority of data points.

    Polynomial Regression: An Alternate to Linear Regression

    Polynomial Linear Regression is a type of regression analysis in which the relationship between the independent variable and the dependent variable is modeled as an n-th degree polynomial function. Polynomial regression allows for a more complex relationship between the variables to be captured beyond the linear relationship in Simple and Multiple Linear Regression.

  • Regression Analysis in Machine Learning

    What is Regression Analysis?

    In machine learning, regression analysis is a statistical technique that predicts continuous numeric values based on the relationship between independent and dependent variables. The main goal of regression analysis is to plot a line or curve that best fits the data and to estimate how one variable affects another.

    Regression analysis is a fundamental concept in machine learning and it is used in many applications such as forecasting, predictive analytics, etc.

    In machine learning, regression is a type of supervised learning. The key objective of regression-based tasks is to predict output labels or responses, which are continuous numeric values, for the given input data. The output will be based on what the model has learned in the training phase.

    Regression models use the input data features (independent variables) and their corresponding continuous numeric output values (dependent or outcome variables) to learn specific associations between inputs and corresponding outputs.

    Terminologies Used In Regression Analysis

    Let us understand some basic terminologies used in regression analysis before going into further detail. The following are some important terminologies −

    • Independent Variables − These variables are used to predict the value of the dependent variable. They are also called predictors. In a dataset, these are represented as features.
    • Dependent Variables − These are the variables whose values we want to predict. They are the main factors in regression analysis. In a dataset, these are represented as target variables.
    • Regression line − It is a straight line or curve that a regressor plots to fit the data points best.
    • Overfitting and underfitting − Overfitting is when the regression model works well with the training dataset but not with the testing dataset. It’s also referred to as the problem of high variance. Underfitting is when the model doesn’t work well even with the training dataset. It’s also referred to as the problem of high bias.
    • Outliers − These are data points that don’t fit the pattern of the rest of the data. They are the extremely high or extremely low values in the data set.
    • Multicollinearity − Multicollinearity occurs when the independent variables (features) are dependent on each other.

    How Does Regression Work?

    Regression in machine learning is a supervised learning technique. Basically, regression is a statistical technique that finds a relationship between dependent and independent variables. To implement regression in machine learning, a regression algorithm is trained with a labeled dataset. The dataset contains features (independent variables) and target values (dependent variable).

    During the training phase, the regression algorithm learns the relation between independent variables (predictors) and dependent variables (target).

    The regression models predict new values based on the learned relation between predictors and targets during the training.

    Types of Regression in Machine Learning

    Generally, regression methods are classified based on three criteria − the number of independent variables, the type of dependent variable, and the shape of the regression line.

    There are numerous regression techniques used in machine learning. However, the following are commonly used types of regression −

    • Linear Regression
    • Logistic Regression
    • Polynomial Regression
    • Lasso Regression
    • Ridge Regression
    • Decision Tree Regression
    • Random Forest Regression
    • Support Vector Regression

    Let’s discuss each type of regression in machine learning in detail.

    1. Linear Regression

    Linear regression is the most commonly used regression model in machine learning. It may be defined as the statistical model that analyzes the linear relationship between a dependent variable and a given set of independent variables. A linear relationship between variables means that when the value of one or more independent variables changes (increases or decreases), the value of the dependent variable will also change accordingly (increase or decrease).

    Linear regression is further divided into two subcategories: simple linear regression and multiple linear regression (also known as multivariate linear regression).

    In simple linear regression, a single independent variable (or predictor) is used to predict the dependent variable.

    Mathematically, the simple linear regression can be represented as follows −

    Y=mX+b

    Where,

    • Y is the dependent variable we are trying to predict.
    • X is the independent variable we are using to make predictions.
    • m is the slope of the regression line, which represents the effect X has on Y.
    • b is a constant known as the Y-intercept. If X=0, Y would be equal to b.

    In multiple linear regression, multiple independent variables are used to predict the dependent variable.

    We will learn linear regression in more detail in upcoming chapters.
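    As a quick, hypothetical illustration of the Y = mX + b form, NumPy's polyfit with degree 1 estimates the slope m and the intercept b from a handful of made-up points −

    import numpy as np

    # Hypothetical data that roughly follows Y = 2X + 1
    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

    # polyfit with degree 1 returns the slope m and the intercept b
    m, b = np.polyfit(X, Y, 1)
    print("slope m:", round(m, 2), "intercept b:", round(b, 2))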

    2. Logistic Regression

    Logistic regression is a popular machine learning algorithm used for predicting the probability of an event occurring.

    Logistic regression is a generalized linear model where the target variable follows a Bernoulli distribution. Logistic regression uses a logistic function or logit function to learn a relationship between the independent variables (predictors) and dependent variables (target).

    It models the dependent variable as a sigmoid function of the independent variables. The sigmoid function produces a probability between 0 and 1. The probability value is used to estimate the dependent variable’s value.

    It is mostly used in binary classification problems, where the target variable is categorical with two classes. It models the probability of the target variable given the input features and predicts the class with the highest probability.
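    A minimal sketch of this behaviour with Scikit-learn's LogisticRegression, using a made-up one-feature binary dataset, is shown below −

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical binary-classification data: one feature, two classes
    X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba([[2.0]]))   # sigmoid-based probabilities for both classes
    print(clf.predict([[2.0]]))         # class with the highest probability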

    3. Polynomial Regression

    Polynomial Linear Regression is a type of regression analysis in which the relationship between the independent variable and the dependent variable is modeled as an n-th degree polynomial function. Polynomial regression allows for a more complex relationship between the variables to be captured, beyond the linear relationship in Simple and Multiple Linear Regression.

    Polynomial regression is one of the most widely used non-linear regression techniques. It is very useful because it can model non-linear relationships between predictors and targets, although it is more sensitive to outliers than linear regression.
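    A common way to implement this in Scikit-learn, as a rough sketch, is to transform the input feature with PolynomialFeatures and then train a LinearRegression model on the transformed features. The quadratic data below is synthetic and used only for illustration.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical non-linear data: y roughly follows x^2
    X = np.linspace(-3, 3, 20).reshape(-1, 1)
    y = X.ravel() ** 2 + np.random.default_rng(0).normal(scale=0.5, size=20)

    # Transform x into polynomial features, then fit a linear regressor on them
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(X, y)
    print(model.predict([[2.0]]))   # should be close to 4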

    4. Lasso Regression

    Lasso regression is a regularization technique that uses a penalty to prevent overfitting and improve the accuracy of regression models. It performs L1 regularization. It modifies the loss function by adding the penalty (shrinkage quantity) equivalent to the summation of the absolute value of coefficients.

    Lasso regression is often used to handle high dimensional and high correlation data.
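    The sketch below applies Scikit-learn's Lasso to synthetic data in which only two of five features actually matter; the alpha value is an illustrative choice for the strength of the L1 penalty.

    import numpy as np
    from sklearn.linear_model import Lasso

    # Hypothetical data with five features; only the first two are relevant
    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 5))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

    # alpha controls the strength of the L1 penalty
    lasso = Lasso(alpha=0.1).fit(X, y)
    print(np.round(lasso.coef_, 3))   # irrelevant features are shrunk towards exactly zero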

    5. Ridge Regression

    Ridge regression is a statistical technique used in machine learning to prevent overfitting in linear regression models. It is used as a regularization technique that performs L2 regularization. It modifies the loss or cost function by adding the penalty (shrinkage quantity) equivalent to the square of the magnitude of coefficients.

    Ridge regression helps to reduce model complexity and improve prediction accuracy. It is useful when a model has many parameters with large weights. It is also well suited to datasets that have more feature variables than observations.

    It also corrects the multicollinearity in regression analysis. Multicollinearity occurs when independent variables are dependent on each other.
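    A minimal sketch with Scikit-learn's Ridge on synthetic data containing two nearly identical (collinear) features is shown below; alpha is an illustrative choice for the strength of the L2 penalty.

    import numpy as np
    from sklearn.linear_model import Ridge

    # Hypothetical data with two highly correlated (collinear) features
    rng = np.random.default_rng(3)
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.01, size=100)   # nearly identical to x1
    X = np.column_stack([x1, x2])
    y = 5.0 * x1 + rng.normal(scale=0.1, size=100)

    # alpha controls the strength of the L2 penalty
    ridge = Ridge(alpha=1.0).fit(X, y)
    print(np.round(ridge.coef_, 3))   # weights are shrunk and shared across the correlated features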

    6. Decision Tree Regression

    Decision tree regression uses the decision tree algorithm to predict numerical values. The decision tree algorithm is a supervised machine learning algorithm that can be used for both classification and regression.

    It is used to predict numerical values or continuous variables. It works by splitting the data into smaller subsets based on the values of the input features and assigning each subset a numerical value. In this way, it incrementally builds a decision tree.

    The tree approximates the target with piecewise constant predictions, and each leaf represents a numeric value. The algorithm tries to reduce the mean squared error at each split, which measures how much the predictions deviate from the original targets.

    The decision tree regression can be used in predicting stock prices or customer behavior etc.
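    A minimal sketch with Scikit-learn's DecisionTreeRegressor on synthetic data is shown below; max_depth is an illustrative choice that limits how finely the data is split.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Hypothetical one-feature dataset
    X = np.arange(0, 10, 0.5).reshape(-1, 1)
    y = np.sin(X).ravel()

    # max_depth limits the number of splits and hence the tree's complexity
    tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
    print(tree.predict([[2.5]]))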

    7. Random Forest Regression

    Random forest regression is a supervised machine learning algorithm that uses an ensemble of decision trees to predict continuous target variables. It uses a bagging technique that involves randomly selecting subsets of training data to build smaller decision trees. These smaller models are combined to form a random forest model that outputs a single prediction value.

    The technique helps improve accuracy and reduce variance by combining the predictions from multiple decision trees.
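    A minimal sketch with Scikit-learn's RandomForestRegressor on the same kind of synthetic data is shown below; n_estimators is an illustrative choice for the number of bagged trees.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Hypothetical one-feature dataset
    X = np.arange(0, 10, 0.5).reshape(-1, 1)
    y = np.sin(X).ravel()

    # n_estimators is the number of decision trees in the ensemble
    forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    print(forest.predict([[2.5]]))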

    8. Support Vector Regression

    Support vector regression (SVR) is a machine learning algorithm that uses support vector machines (SVMs) to solve regression problems. It can learn non-linear relationships between the input data (feature variables) and output data (target values).

    Support vector regression has many advantages. It can handle linear as well as non-linear relationships in datasets. It is resistant to outliers. It has high prediction accuracy.
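    A minimal sketch with Scikit-learn's SVR on synthetic data is shown below; the RBF kernel and the C and epsilon values are illustrative choices rather than recommended settings.

    import numpy as np
    from sklearn.svm import SVR

    # Hypothetical non-linear data
    X = np.linspace(0, 5, 40).reshape(-1, 1)
    y = np.sin(X).ravel()

    # An RBF kernel lets SVR capture non-linear relationships
    svr = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(X, y)
    print(svr.predict([[1.5]]))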

    Types of Regression Models

    Regression models are of the following two types −

    Simple regression model − This is the most basic regression model in which predictions are formed from a single, univariate feature of the data.

    Multiple regression model − As the name implies, in this regression model, the predictions are formed from multiple features of the data.


    How to Select the Best Regression Model?

    You can consider factors like performance metrics, model complexity, interpretability, etc., to select the best regression model. Evaluate the model performance using metrics such as Mean Squared Error (MSE), Mean absolute error (MAE), R-squared, etc. Compare the performance of different models, such as linear regression, decision trees, random forests, etc., and choose a model that has the highest performance metrics, the lowest complexity, and the best interpretability.

    Evaluation Metrics for Regression

    Common evaluation/performance metrics for regression models are as follows −

    • Mean Absolute Error (MAE) − It is the average of the absolute difference between predicted values and true values.
    • Mean Squared Error (MSE) − It is the average of the square of the difference between actual and estimated values.
    • Median Absolute Error − It is the median value of the absolute difference between predicted values and true values.
    • Root Mean Squared Error (RMSE) − It is the square root value of the mean squared error (MSE).
    • R2 (coefficient of determination) Score − The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse).
    • Mean Absolute Percentage Error (MAPE) − It is the percentage equivalent of mean absolute error (MAE).

    Applications of Regression in Machine Learning

    The applications of ML regression algorithms are as follows −

    Forecasting or Predictive analysis − One of the important uses of regression is forecasting or predictive analysis. For example, we can forecast GDP, oil prices, or, in simple words, the quantitative data that changes with the passage of time.

    Optimization − We can optimize business processes with the help of regression. For example, a store manager can create a statistical model to understand the peak times at which customers visit the store.

    Error correction − In business, making correct decisions is equally important as optimizing the business process. Regression can help us make correct decisions as well as correct decisions that have already been implemented.

    Economics − It is the most used tool in economics. We can use regression to predict supply, demand, consumption, inventory investment, etc.

    Finance − A financial company is always interested in minimizing its portfolio risk and wants to know the factors that affect its customers. All of these can be predicted with the help of a regression model.

    Building a Regressor in Python

    A regressor model can be constructed from scratch in Python. Scikit-learn, a Python library for machine learning, can also be used to build a regressor.

    In the following example, we will be building a basic regression model that will fit a line to the data, i.e., linear regressor. The necessary steps for building a regressor in Python are as follows −

    Step 1: Importing necessary python package

    For building a regressor using scikit-learn, we need to import it along with other necessary packages. We can import them by using the following script −

    import numpy as np
    from sklearn import linear_model
    import sklearn.metrics as sm
    import matplotlib.pyplot as plt
    

    Step 2: Importing dataset

    After importing the necessary packages, we need a dataset to build the regression prediction model. We can import it from the sklearn datasets or use another one as per our requirement. We are going to use our saved input data. We can import it with the help of the following script −

    input_file = r'C:\linear.txt'

    Next, we need to load this data. We are using the np.loadtxt function to load it.

    input_data = np.loadtxt(input_file, delimiter=',')
    X, y = input_data[:, :-1], input_data[:, -1]

    Step 3: Organizing data into training & testing sets

    As we need to test our model on unseen data, we will divide our dataset into two parts: a training set and a test set. The following commands will perform the split −

    training_samples = int(0.6 * len(X))
    testing_samples = len(X) - training_samples
    X_train, y_train = X[:training_samples], y[:training_samples]
    X_test, y_test = X[training_samples:], y[training_samples:]

    Step 4: Model training & prediction

    After dividing the data into training and testing sets, we need to build the model. We will be using the LinearRegression() class of Scikit-learn for this purpose. The following command will create a linear regressor object.

    reg_linear = linear_model.LinearRegression()

    Next, train this model with the training samples as follows −

    reg_linear.fit(X_train, y_train)

    Now, at last, we need to make predictions on the testing data.

    y_test_pred = reg_linear.predict(X_test)

    Step 5: Plot & visualization

    After prediction, we can plot and visualize it with the help of the following script −

    plt.scatter(X_test, y_test, color='red')
    plt.plot(X_test, y_test_pred, color='black', linewidth=2)
    plt.xticks(())
    plt.yticks(())
    plt.show()

    Output

    Plot Visualization

    In the above output, we can see the regression line between the data points.

    Step 6: Performance computation

    We can also compute the performance of our regression model with the help of various performance metrics as follows.

    print("Regressor model performance:")print("Mean absolute error(MAE) =",round(sm.mean_absolute_error(y_test, y_test_pred),2))print("Mean squared error(MSE) =",round(sm.mean_squared_error(y_test, y_test_pred),2))print("Median absolute error =",round(sm.median_absolute_error(y_test, y_test_pred),2))print("Explain variance score =",round(sm.explained_variance_score(y_test, y_test_pred),2))print("R2 score =",round(sm.r2_score(y_test, y_test_pred),2))

    Output

    Regressor model performance:
    Mean absolute error(MAE) = 1.78
    Mean squared error(MSE) = 3.89
    Median absolute error = 2.01
    Explain variance score = -0.09
    R2 score = -0.09