February 20, 2023

In this tutorial, we’ll learn about linear regression and how to implement it in Python. First, we’ll explore a sample machine learning problem, and then we’ll develop a model to make predictions. (This tutorial assumes some familiarity with Python syntax and data cleaning.)

## The Problem

The dataset that we’ll be analyzing is the Automobile Data Set from the UCI Machine Learning Repository. This dataset contains information on various car characteristics, including vehicle type and engine type, among many others.

Imagine that we’re taking on the role of a data analyst at an auto insurance company. We’ve been tasked with ranking cars in terms of their “riskiness,” a measure of how likely a car is to get into an accident and therefore require the driver to use their insurance. Riskiness isn’t something we know about a car just by looking at it, so we need to use other qualities that we can see and measure.

To solve our problem, we’ll turn to a machine learning model that can convert our data into useful predictions. There are several machine learning models we could use, but we’ll focus our attention on linear regression.

## The Linear Regression Model

Before we begin the analysis, we’ll examine the linear regression model to understand how it can help solve our problem. A linear regression model with a single feature looks like the following:

$$
Y = \beta_0 + \beta_1 X_1 + \epsilon
$$

$Y$ represents the outcome that we want to predict. In our example, it’s car riskiness. $X_1$ here is a “feature” or “predictor,” which represents a car attribute that we want to use to predict the outcome. $X$ and $Y$ are things we observe and collect data on. Below, we show a visualization of the linear regression above:

$\beta_1$ represents the “slope,” or how the outcome $Y$ changes when the feature $X$ changes. $\beta_0$ represents the “intercept,” which would be the average value of the outcome when the feature is 0. $\epsilon$ represents the “error” left over that isn’t explained by the feature $X$, visualized by the red lines. These values, $\beta_0$, $\beta_1$, and $\epsilon$, are called **parameters**, and we need to calculate them from the data.
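To make these symbols concrete, here is a toy calculation with made-up parameter values (not estimated from the cars data):

```
# Made-up parameters, for illustration only
beta_0 = 1.0  # intercept: predicted outcome when the feature is 0
beta_1 = 0.5  # slope: change in the outcome per unit change in the feature

x = 4.0           # an observed feature value
y_observed = 3.2  # the outcome we actually measured

y_predicted = beta_0 + beta_1 * x  # 1.0 + 0.5 * 4.0 = 3.0
error = y_observed - y_predicted   # the leftover epsilon
print(y_predicted, error)
```

The error is whatever the line’s prediction leaves unexplained, which is exactly the role $\epsilon$ plays in the equation above.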

We could also add more predictors to the model by adding another parameter $\beta_2$ to be associated with the additional feature. For example, adding a second feature would result in a model that looks like this:

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon
$$

We could calculate these parameters by hand, but it would be more efficient to use Python to create our linear regression model.
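As a sketch of what calculating parameters “by hand” might look like, here is an ordinary least-squares estimate on a few made-up points (not the cars data), using NumPy’s `lstsq`:

```
import numpy as np

# Made-up data roughly following y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Stack an intercept column next to the feature, then solve least squares
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]  # [intercept, slope]
print(beta)
```

The estimated intercept and slope land close to the values used to generate the data, which is all a least-squares fit is doing under the hood.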

## Checking the Data

The first step in creating a machine learning model is to examine the data! We’ll load in the `pandas` library so that we can read in the Automobile Data Set, which is stored as a `.csv` file.

```
import pandas as pd
cars = pd.read_csv("cars.csv")
print(cars.columns)
```

`[1] Index(['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration', 'num_of_doors', 'body_style', 'drive_wheels', 'engine_location', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type', 'num_of_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 'price'], dtype='object')`
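Before modeling, it’s worth a quick sanity check on the columns we plan to use, such as summary statistics and missing-value counts. This is sketched on a small stand-in DataFrame with made-up values, since the real file may need its own cleaning:

```
import pandas as pd

# Stand-in for the cars data; values are invented for illustration
cars_demo = pd.DataFrame({
    "engine_size": [130.0, 152.0, 109.0, None],
    "horsepower": [111, 154, 102, 115],
    "symboling": [3, 1, 2, 0],
})

print(cars_demo.describe())    # summary statistics per numeric column
print(cars_demo.isna().sum())  # count of missing values per column
```

A missing value in a feature column would need to be handled (dropped or imputed) before fitting, since scikit-learn estimators won’t accept NaNs by default.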

For this tutorial, we’ll use the `engine_size`

and `horsepower`

columns for our options within the linear regression mannequin. Our instinct right here is that as engine dimension will increase, the automobile turns into extra highly effective and able to greater speeds. These greater speeds may result in extra accidents, which result in greater “riskiness”.

The column that captures this “riskiness” is the `symboling` column. The `symboling` column ranges from -3 to 3, where the higher the value, the riskier the car.

Realistically, the process of choosing features for a linear regression model is done more by trial and error. We’ve picked engine size based on intuition, but it would be better to try to improve our predictions based on this initial model.

## The Solution

We can quickly create linear regressions using the `scikit-learn` Python library. Linear regressions are contained in the `LinearRegression` class, so we’ll import everything we need below:


```
from sklearn.linear_model import LinearRegression
model = LinearRegression()
```

We’ve imported the `LinearRegression` class and stored an instance of it in the `model` variable. The next step is to divide the data into a **training set** and a **test set**. We’ll use the training set to estimate the parameters of the linear regression, and we’ll use the test set to check how well the model predicts the riskiness of cars it hasn’t seen before.

```
import math
# Calculate how many rows 80% of the data would be
nrows = math.floor(cars.shape[0] * 0.8)
# Divide the data using this calculation; iloc slicing is end-exclusive,
# so no row lands in both sets
training = cars.iloc[:nrows]
test = cars.iloc[nrows:]
```
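Randomized splits are also common. As an aside, scikit-learn provides a `train_test_split` helper that shuffles the rows before dividing them, shown here on a small made-up DataFrame:

```
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 10 rows
df = pd.DataFrame({"engine_size": range(10), "symboling": range(10)})

# 80/20 shuffled split; random_state makes the shuffle reproducible
training_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(len(training_df), len(test_df))
```

Shuffling guards against any ordering in the file (for example, rows sorted by manufacturer) leaking into the split.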

This split dedicates 80% of the data to the training set and the remaining 20% to the test set. Now that we have a training set, we can give the features and outcome to our `model` object to estimate the parameters of the linear regression. This is also known as **model fitting**.

```
X = training[["engine_size", "horsepower"]]
Y = training["symboling"]
model.fit(X, Y)
```

The `match()`

methodology takes within the options and the end result and makes use of them to estimate the mannequin parameters. After these parameters are estimated, we’ve got a usable mannequin!
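After fitting, a `LinearRegression` object exposes the estimated parameters as the `coef_` and `intercept_` attributes. A small self-contained sketch on made-up, perfectly linear data:

```
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from y = 2*x + 3 with no noise
X = np.array([[1.0], [2.0], [3.0], [4.0]])
Y = 2.0 * X[:, 0] + 3.0

model = LinearRegression()
model.fit(X, Y)
print(model.coef_, model.intercept_)  # slope near 2.0, intercept near 3.0
```

On noiseless data like this, the fit recovers the generating slope and intercept almost exactly; on real data the estimates carry error.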

## Model Performance

We can try to predict the values of the `symboling` column in the `test` set and see how the model performs.

```
import numpy as np
predictions = model.predict(test[["engine_size", "horsepower"]])
# Mean absolute error: the average absolute gap between actual and predicted
mae = np.mean(np.abs(test["symboling"] - predictions))
```

After running the `fit()` method on the training data, we can call the `predict()` method on new data containing the same columns. Using these `predictions`, we can calculate the **mean absolute error** (MAE). The MAE describes how far the model predictions are from the actual `symboling` values, on average.

`print(mae)`

`[1] 1.7894647963388066`

The model has an average test error of about `1.79`. This is a solid start, but we might be able to improve the error by including more features or using a different model.
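The MAE can also be computed with scikit-learn’s `mean_absolute_error` helper; the two approaches agree, as this toy check with made-up values shows:

```
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted symboling values
actual = np.array([2.0, -1.0, 0.0, 3.0])
predicted = np.array([1.5, -2.0, 1.0, 2.0])

mae_manual = np.mean(np.abs(actual - predicted))
mae_sklearn = mean_absolute_error(actual, predicted)
print(mae_manual, mae_sklearn)  # both 0.875
```

Using the library helper avoids the easy mistake of squaring the errors, which would give the mean squared error instead.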

## Next Steps

In this tutorial, we learned about the linear regression model, and we used it to predict car riskiness based on engine size and horsepower. Linear regression is one of the most commonly used data science tools because it matches human intuition well: we can see how changes in the predictors produce proportional changes in the outcome. We examined the data, built a model in Python, and used this model to generate predictions. This process is at the core of the machine learning workflow and is essential knowledge for any data scientist.

If you’d like to learn more about linear regression and add it to your machine learning skill set, Dataquest has a full course covering the topic in our Data Scientist in Python Career Path.