Basic Regression

Train a neural network to predict a continuous value.

In a regression problem, the aim is to predict a continuous value, like a price or a probability. Contrast this with a classification problem, where the aim is to select a class from a list of classes (for example, recognizing whether a picture contains an apple or an orange).

This tutorial uses the classic Auto MPG dataset and demonstrates how to build models that predict the fuel efficiency of late-1970s and early-1980s automobiles. To do this, you will provide the models with a description of many automobiles from that time period. This description includes attributes like cylinders, displacement, horsepower, and weight.

This example uses the Keras API. (Visit the Keras tutorials and guides to learn more.)

library(tensorflow)
library(keras)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.5     ✔ rsample      1.2.0
✔ dials        1.2.0     ✔ tune         1.1.2
✔ infer        1.0.4     ✔ workflows    1.1.3
✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
✔ parsnip      1.1.1     ✔ yardstick    1.2.0
✔ recipes      1.0.8     
── Conflicts ───────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard()        masks purrr::discard()
✖ dplyr::filter()          masks stats::filter()
✖ recipes::fixed()         masks stringr::fixed()
✖ yardstick::get_weights() masks keras::get_weights()
✖ dplyr::lag()             masks stats::lag()
✖ yardstick::spec()        masks readr::spec()
✖ recipes::step()          masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.

The Auto MPG dataset

The dataset is available from the UCI Machine Learning Repository.

Get the data

First download and import the dataset:

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
col_names <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","model_year", "origin","car_name")

raw_dataset <- read.table(
  url,
  header = T,
  col.names = col_names,
  na.strings = "?"
)
dataset <- raw_dataset %>% select(-car_name)
tail(dataset)
# A tibble: 6 × 8
    mpg cylinders displacement horsepower weight acceleration model_year
  <dbl>     <int>        <dbl>      <dbl>  <dbl>        <dbl>      <int>
1    27         4          151         90   2950         17.3         82
2    27         4          140         86   2790         15.6         82
3    44         4           97         52   2130         24.6         82
4    32         4          135         84   2295         11.6         82
5    28         4          120         79   2625         18.6         82
6    31         4          119         82   2720         19.4         82
# ℹ 1 more variable: origin <int>

Clean the data

The dataset contains a few unknown values:

lapply(dataset, function(x) sum(is.na(x))) %>% str()
List of 8
 $ mpg         : int 0
 $ cylinders   : int 0
 $ displacement: int 0
 $ horsepower  : int 6
 $ weight      : int 0
 $ acceleration: int 0
 $ model_year  : int 0
 $ origin      : int 0

Drop those rows to keep this initial tutorial simple:

dataset <- na.omit(dataset)

The "origin" column is categorical, not numeric. So the next step is to one-hot encode the values in the column with the recipes package.

Note: You can set up the Keras model to do this kind of transformation for you, but that's beyond the scope of this tutorial. Check out the Classify structured data using Keras preprocessing layers or Load CSV data tutorials for examples.

library(recipes)
dataset <- recipe(mpg ~ ., dataset) %>%
  step_num2factor(origin, levels = c("USA", "Europe", "Japan")) %>%
  step_dummy(origin, one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL)
glimpse(dataset)
Rows: 391
Columns: 10
$ cylinders     <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4…
$ displacement  <dbl> 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
$ horsepower    <dbl> 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
$ weight        <dbl> 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 3850…
$ acceleration  <dbl> 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, 1…
$ model_year    <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, …
$ mpg           <dbl> 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, …
$ origin_USA    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0…
$ origin_Europe <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ origin_Japan  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1…

Split the data into training and test sets

Now, split the dataset into a training set and a test set. You will use the test set in the final evaluation of your models.

split <- initial_split(dataset, 0.8)
train_dataset <- training(split)
test_dataset <- testing(split)

Inspect the data

Review the joint distribution of a few pairs of columns from the training set.

The top row suggests that the fuel efficiency (MPG) is a function of all the other parameters. The other rows suggest that those parameters are, in turn, correlated with one another.

train_dataset %>%
  select(mpg, cylinders, displacement, weight) %>%
  GGally::ggpairs()
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Let’s also check the overall statistics. Note how each feature covers a very different range:

skimr::skim(train_dataset)
Data summary
Name                     train_dataset
Number of rows           312
Number of columns        10
_______________________
Column type frequency:
  numeric                10
________________________
Group variables          None

Variable type: numeric

skim_variable   n_missing  complete_rate     mean      sd    p0      p25      p50     p75    p100  hist
cylinders               0              1     5.50    1.72     3     4.00     4.00     8.0     8.0  ▇▁▃▁▅
displacement            0              1   196.99  105.81    68   105.00   151.00   302.0   455.0  ▇▂▂▃▁
horsepower              0              1   104.93   38.85    46    76.75    92.00   126.0   230.0  ▆▇▃▂▁
weight                  0              1  2995.04  844.26  1613  2232.50  2849.00  3622.5  4997.0  ▇▇▆▅▂
acceleration            0              1    15.57    2.83     8    13.88    15.50    17.2    24.8  ▁▆▇▂▁
model_year              0              1    76.06    3.71    70    73.00    76.00    79.0    82.0  ▇▆▇▆▇
mpg                     0              1    23.48    7.91     9    17.00    22.15    29.0    46.6  ▆▇▆▃▁
origin_USA              0              1     0.62    0.48     0     0.00     1.00     1.0     1.0  ▅▁▁▁▇
origin_Europe           0              1     0.17    0.38     0     0.00     0.00     0.0     1.0  ▇▁▁▁▂
origin_Japan            0              1     0.21    0.40     0     0.00     0.00     0.0     1.0  ▇▁▁▁▂

Split features from labels

Separate the target value—the “label”—from the features. This label is the value that you will train the model to predict.

train_features <- train_dataset %>% select(-mpg)
test_features <- test_dataset %>% select(-mpg)

train_labels <- train_dataset %>% select(mpg)
test_labels <- test_dataset %>% select(mpg)

Normalization

In the table of statistics it’s easy to see how different the ranges of each feature are:

my_skim <- skimr::skim_with(numeric = skimr::sfl(mean, sd))
train_dataset %>%
  select(where(~is.numeric(.x))) %>%
  pivot_longer(
    cols = everything(), names_to = "variable", values_to = "values") %>%
  group_by(variable) %>%
  summarise(mean = mean(values), sd = sd(values))
# A tibble: 10 × 3
   variable          mean      sd
   <chr>            <dbl>   <dbl>
 1 acceleration    15.6     2.83 
 2 cylinders        5.5     1.72 
 3 displacement   197.    106.   
 4 horsepower     105.     38.9  
 5 model_year      76.1     3.71 
 6 mpg             23.5     7.91 
 7 origin_Europe    0.170   0.376
 8 origin_Japan     0.205   0.404
 9 origin_USA       0.625   0.485
10 weight        2995.    844.   

It is good practice to normalize features that use different scales and ranges.

One reason this is important is that the features are multiplied by the model weights. So, the scale of the outputs and the scale of the gradients are affected by the scale of the inputs.

Although a model might converge without feature normalization, normalization makes training much more stable.

Note: There is no advantage to normalizing the one-hot features—it is done here for simplicity. For more details on how to use the preprocessing layers, refer to the Working with preprocessing layers guide and the Classify structured data using Keras preprocessing layers tutorial.
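
For intuition, here is what that standardization amounts to in plain R, using the training-set statistics (a quick sketch outside the Keras pipeline; the helper objects below are illustrative and are not used later in the tutorial):

# Standardize each feature by hand: (x - mean) / sd, using training-set statistics.
# layer_normalization(), introduced next, performs the same transformation inside the model.
feature_means <- sapply(train_features, mean)
feature_sds   <- sapply(train_features, sd)
train_scaled <- scale(as.matrix(train_features), center = feature_means, scale = feature_sds)
test_scaled  <- scale(as.matrix(test_features),  center = feature_means, scale = feature_sds)
round(colMeans(train_scaled), 3)  # every scaled training feature now has mean ~0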

The Normalization layer

layer_normalization() is a clean and simple way to add feature normalization to your model.

The first step is to create the layer:

normalizer <- layer_normalization(axis = -1L)

Then, fit the state of the preprocessing layer to the data by calling adapt():

normalizer %>% adapt(as.matrix(train_features))

Calling adapt() computes the mean and variance of each feature and stores them in the layer:

print(normalizer$mean)
tf.Tensor(
[[5.5000000e+00 1.9699361e+02 1.0493270e+02 2.9950388e+03 1.5573077e+01
  7.6060890e+01 6.2500000e-01 1.6987181e-01 2.0512822e-01]], shape=(1, 9), dtype=float32)

When the layer is called, it returns the input data, with each feature independently normalized.

first <- as.matrix(train_features[1,])

cat('First example:', first)
First example: 8 318 150 4077 14 72 1 0 0
cat('Normalized:', as.matrix(normalizer(first)))
Normalized: 1.452718 1.145492 1.161751 1.283603 -0.5567052 -1.095416 0.7745967 -0.4523641 -0.5080006
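
You can reproduce that result from the layer's adapted statistics (a quick check; it assumes the layer exposes a variance attribute alongside mean, as in the Python API, and ignores the small epsilon Keras adds for numerical stability):

# Recompute (x - mean) / sqrt(variance) from the statistics stored by adapt()
(first - as.array(normalizer$mean)) / sqrt(as.array(normalizer$variance))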

Linear regression

Before building a deep neural network model, start with linear regression using one and several variables.

Linear regression with one variable

Begin with a single-variable linear regression to predict 'mpg' from 'horsepower'.

Training a model with Keras typically starts by defining the model architecture. Use a Sequential model, which represents a sequence of steps.

There are two steps in your single-variable linear regression model:

  • Normalize the 'horsepower' input features using the normalization preprocessing layer.
  • Apply a linear transformation (\(y = mx+b\)) to produce 1 output using a linear layer (dense).

The number of inputs can either be set by the input_shape argument or inferred automatically when the model is run for the first time.

First, create a matrix of the 'horsepower' features. Then, instantiate layer_normalization() and fit its state to the horsepower data:

horsepower <- matrix(train_features$horsepower)
horsepower_normalizer <- layer_normalization(input_shape = shape(1), axis = NULL)
horsepower_normalizer %>% adapt(horsepower)

Build the Keras Sequential model:

horsepower_model <- keras_model_sequential() %>%
  horsepower_normalizer() %>%
  layer_dense(units = 1)

summary(horsepower_model)
Model: "sequential"
____________________________________________________________________________
 Layer (type)                Output Shape              Param #   Trainable  
============================================================================
 normalization_1 (Normaliza  (None, 1)                 3         Y          
 tion)                                                                      
 dense (Dense)               (None, 1)                 2         Y          
============================================================================
Total params: 5 (24.00 Byte)
Trainable params: 2 (8.00 Byte)
Non-trainable params: 3 (16.00 Byte)
____________________________________________________________________________

This model will predict 'mpg' from 'horsepower'.

Run the untrained model on the first 10 ‘horsepower’ values. The output won’t be good, but notice that it has the expected shape of (10, 1):

predict(horsepower_model, horsepower[1:10,])
1/1 - 0s - 64ms/epoch - 64ms/step
              [,1]
 [1,]  1.146472692
 [2,]  1.146472692
 [3,] -0.227240101
 [4,] -0.278118342
 [5,] -0.761461735
 [6,] -0.201800972
 [7,]  0.001712025
 [8,]  0.128907651
 [9,]  0.128907651
[10,] -0.557948709

Once the model is built, configure the training procedure using the Keras compile() method. The most important arguments to compile are the loss and the optimizer, since these define what will be optimized (mean_absolute_error) and how (using optimizer_adam()).

horsepower_model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.1),
  loss = 'mean_absolute_error'
)

Use Keras fit() to execute the training for 100 epochs:

history <- horsepower_model %>% fit(
  as.matrix(train_features$horsepower),
  as.matrix(train_labels),
  epochs = 100,
  # Suppress logging.
  verbose = 0,
  # Calculate validation results on 20% of the training data.
  validation_split = 0.2
)

Visualize the model’s training progress using the stats stored in the history object:

plot(history)
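
If you prefer the raw numbers, or want to build a custom plot, the history object can also be converted to a data frame (a minimal sketch, assuming keras's as.data.frame() method for training history objects):

# Per-epoch training and validation metrics as a plain data frame
history_df <- as.data.frame(history)
tail(history_df)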

Collect the results on the test set for later:

test_results <- list()
test_results[["horsepower_model"]] <- horsepower_model %>% evaluate(
  as.matrix(test_features$horsepower),
  as.matrix(test_labels),
  verbose = 0
)

Since this is a single variable regression, it’s easy to view the model’s predictions as a function of the input:

x <- seq(0, 250, length.out = 251)
y <- predict(horsepower_model, x)
8/8 - 0s - 33ms/epoch - 4ms/step
ggplot(train_dataset) +
  geom_point(aes(x = horsepower, y = mpg, color = "data")) +
  geom_line(data = data.frame(x, y), aes(x = x, y = y, color = "prediction"))

Linear regression with multiple inputs

You can use an almost identical setup to make predictions based on multiple inputs. This model still does the same \(y = mx+b\) except that \(m\) is a matrix and \(b\) is a vector.

Create a two-step Keras Sequential model again, with the first layer being the normalizer (layer_normalization(axis = -1)) you defined earlier and adapted to the whole dataset:

linear_model <- keras_model_sequential() %>%
  normalizer() %>%
  layer_dense(units = 1)

When you call predict() on a batch of inputs, it produces one output (units = 1) for each example:

predict(linear_model, as.matrix(train_features[1:10, ]))
1/1 - 0s - 29ms/epoch - 29ms/step
             [,1]
 [1,]  0.27293643
 [2,] -0.24607792
 [3,]  0.34286350
 [4,]  1.56522334
 [5,]  0.50951248
 [6,]  1.19462740
 [7,]  0.34053078
 [8,] -0.09993154
 [9,]  0.99074090
[10,] -0.15395613

When you call the model, its weight matrices will be built—check that the kernel weights (the \(m\) in \(y = mx+b\)) have a shape of (9, 1):

linear_model$layers[[2]]$kernel
<tf.Variable 'dense_1/kernel:0' shape=(9, 1) dtype=float32, numpy=
array([[ 0.07652861],
       [ 0.5407312 ],
       [-0.7088285 ],
       [ 0.4323746 ],
       [ 0.49774444],
       [-0.4361304 ],
       [-0.5177331 ],
       [-0.47860345],
       [ 0.40407896]], dtype=float32)>

Configure the model with Keras compile() and train with fit() for 100 epochs:

linear_model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.1),
  loss = 'mean_absolute_error'
)
history <- linear_model %>% fit(
  as.matrix(train_features),
  as.matrix(train_labels),
  epochs = 100,
  # Suppress logging.
  verbose = 0,
  # Calculate validation results on 20% of the training data.
  validation_split = 0.2
)

Using all the inputs in this regression model achieves a much lower training and validation error than the horsepower_model, which had one input:

plot(history)

Collect the results on the test set for later:

test_results[['linear_model']] <- linear_model %>%
  evaluate(
    as.matrix(test_features),
    as.matrix(test_labels),
    verbose = 0
  )

Regression with a deep neural network (DNN)

In the previous section, you implemented two linear models for single and multiple inputs.

Here, you will implement single-input and multiple-input DNN models.

The code is basically the same except the model is expanded to include some “hidden” non-linear layers. The name “hidden” here just means not directly connected to the inputs or outputs.

These models will contain a few more layers than the linear model:

  • The normalization layer, as before (with horsepower_normalizer for a single-input model and normalizer for a multiple-input model).
  • Two hidden, non-linear, Dense layers with the ReLU (relu) activation function.
  • A linear Dense single-output layer.

Both models will use the same training procedure, so the compile() step is included in the build_and_compile_model() function below.

build_and_compile_model <- function(norm) {
  model <- keras_model_sequential() %>%
    norm() %>%
    layer_dense(64, activation = 'relu') %>%
    layer_dense(64, activation = 'relu') %>%
    layer_dense(1)

  model %>% compile(
    loss = 'mean_absolute_error',
    optimizer = optimizer_adam(0.001)
  )

  model
}

Regression using a DNN and a single input

Create a DNN model with only 'horsepower' as input and horsepower_normalizer (defined earlier) as the normalization layer:

dnn_horsepower_model <- build_and_compile_model(horsepower_normalizer)

This model has quite a few more trainable parameters than the linear models:

summary(dnn_horsepower_model)
Model: "sequential_2"
____________________________________________________________________________
 Layer (type)                Output Shape              Param #   Trainable  
============================================================================
 normalization_1 (Normaliza  (None, 1)                 3         Y          
 tion)                                                                      
 dense_4 (Dense)             (None, 64)                128       Y          
 dense_3 (Dense)             (None, 64)                4160      Y          
 dense_2 (Dense)             (None, 1)                 65        Y          
============================================================================
Total params: 4356 (17.02 KB)
Trainable params: 4353 (17.00 KB)
Non-trainable params: 3 (16.00 Byte)
____________________________________________________________________________
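
As a quick check, these counts follow from the usual dense-layer arithmetic of inputs × units weights plus units biases; the 3 non-trainable parameters are the normalization layer's adapted statistics:

# Dense layer parameters: inputs * units (weights) + units (biases)
c(hidden_1 = 1 * 64 + 64,   # 128
  hidden_2 = 64 * 64 + 64,  # 4160
  output   = 64 * 1 + 1)    # 65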

Train the model with Keras fit():

history <- dnn_horsepower_model %>% fit(
  as.matrix(train_features$horsepower),
  as.matrix(train_labels),
  validation_split = 0.2,
  verbose = 0,
  epochs = 100
)

This model does slightly better than the linear single-input horsepower_model:

plot(history)

If you plot the predictions as a function of 'horsepower', you should notice how this model takes advantage of the nonlinearity provided by the hidden layers:

x <- seq(0.0, 250, length.out = 251)
y <- predict(dnn_horsepower_model, x)
8/8 - 0s - 45ms/epoch - 6ms/step
ggplot(train_dataset) +
  geom_point(aes(x = horsepower, y = mpg, color = "data")) +
  geom_line(data = data.frame(x, y), aes(x = x, y = y, color = "prediction"))

Collect the results on the test set for later:

test_results[['dnn_horsepower_model']] <- dnn_horsepower_model %>% evaluate(
  as.matrix(test_features$horsepower),
  as.matrix(test_labels),
  verbose = 0
)

Regression using a DNN and multiple inputs

Repeat the previous process using all the inputs. The model’s performance slightly improves on the validation dataset.

dnn_model <- build_and_compile_model(normalizer)
summary(dnn_model)
Model: "sequential_3"
____________________________________________________________________________
 Layer (type)                Output Shape              Param #   Trainable  
============================================================================
 normalization (Normalizati  (None, 9)                 19        Y          
 on)                                                                        
 dense_7 (Dense)             (None, 64)                640       Y          
 dense_6 (Dense)             (None, 64)                4160      Y          
 dense_5 (Dense)             (None, 1)                 65        Y          
============================================================================
Total params: 4884 (19.08 KB)
Trainable params: 4865 (19.00 KB)
Non-trainable params: 19 (80.00 Byte)
____________________________________________________________________________
history <- dnn_model %>% fit(
  as.matrix(train_features),
  as.matrix(train_labels),
  validation_split = 0.2,
  verbose = 0,
  epochs = 100
)
plot(history)

Collect the results on the test set:

test_results[['dnn_model']] <- dnn_model %>% evaluate(
  as.matrix(test_features),
  as.matrix(test_labels),
  verbose = 0
)

Performance

Since all models have been trained, you can review their test set performance:

sapply(test_results, function(x) x)
    horsepower_model.loss         linear_model.loss 
                 3.572925                  2.461306 
dnn_horsepower_model.loss            dnn_model.loss 
                 3.139431                  1.787824 

These results match the validation error observed during training.
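
If you prefer a tidier table, the same results can be reshaped with tibble and dplyr, which are already attached (purely illustrative; the column names are arbitrary):

# Reshape the named list of test-set losses (MAE, in mpg) into a sorted tibble
tibble::enframe(unlist(test_results), name = "model", value = "mae") %>%
  arrange(mae)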

Make predictions

You can now make predictions with the dnn_model on the test set using Keras predict() and review the loss:

test_predictions <- predict(dnn_model, as.matrix(test_features))
3/3 - 0s - 38ms/epoch - 13ms/step
ggplot(data.frame(pred = as.numeric(test_predictions), mpg = test_labels$mpg)) +
  geom_point(aes(x = pred, y = mpg)) +
  geom_abline(intercept = 0, slope = 1, color = "blue")

It appears that the model predicts reasonably well.
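
Since yardstick is attached via tidymodels, you can also summarise these predictions with standard regression metrics (a sketch; the particular metrics chosen here are illustrative):

# Compare predictions with the true mpg values using yardstick
pred_df <- tibble(truth = test_labels$mpg, estimate = as.numeric(test_predictions))
metric_set(rmse, mae, rsq)(pred_df, truth = truth, estimate = estimate)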

Now, check the error distribution:

error <- as.numeric(test_predictions) - test_labels$mpg

ggplot(data.frame(error = error)) +
  geom_density(aes(x = error))

If you're happy with the model, save it for later use with save_model_tf():

save_model_tf(dnn_model, 'dnn_model')

If you reload the model, it gives identical output:

reloaded <- load_model_tf('dnn_model')

test_results[['reloaded']] <- reloaded %>% evaluate(
  as.matrix(test_features),
  as.matrix(test_labels),
  verbose = 0
)
sapply(test_results, function(x) x)
    horsepower_model.loss         linear_model.loss 
                 3.572925                  2.461306 
dnn_horsepower_model.loss            dnn_model.loss 
                 3.139431                  1.787824 
            reloaded.loss 
                 1.787824 

Conclusion

This notebook introduced a few techniques to handle a regression problem. Here are a few more tips that may help:

  • Mean squared error (MSE) (loss_mean_squared_error()) and mean absolute error (MAE) (loss_mean_absolute_error()) are common loss functions used for regression problems. MAE is less sensitive to outliers. Different loss functions are used for classification problems (see the sketch after this list for pairing an MSE loss with an MAE metric).
  • Similarly, the evaluation metrics used for regression differ from those used for classification.
  • When numeric input data features have values with different ranges, each feature should be scaled independently to the same range.
  • Overfitting is a common problem for DNN models, though it wasn’t a problem for this tutorial. Visit the Overfit and underfit tutorial for more help with this.
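
For instance, a model like the ones above could be compiled to optimize MSE while still reporting MAE during training. A minimal sketch, reusing the adapted normalizer; mse_model is a hypothetical name and is not trained in this tutorial:

# Optimize mean squared error, but also track mean absolute error as a metric
mse_model <- keras_model_sequential() %>%
  normalizer() %>%
  layer_dense(64, activation = "relu") %>%
  layer_dense(64, activation = "relu") %>%
  layer_dense(1)

mse_model %>% compile(
  optimizer = optimizer_adam(0.001),
  loss = loss_mean_squared_error(),
  metrics = list("mean_absolute_error")
)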