# Mixed Precision Training

## Overview

Mixed precision is the use of both 16-bit and 32-bit floating-point types in a model during training to make it run faster and use less memory. By keeping certain parts of the model in 32-bit types for numeric stability, the model will have a lower step time and train equally well in terms of evaluation metrics such as accuracy. This guide describes how to use the experimental Keras mixed precision API to speed up your models. Using this API can improve performance by more than 3 times on modern GPUs and 60% on TPUs.

*Note: The Keras mixed precision API is currently experimental and may change.*

Today, most models use the `float32` dtype, which takes 32 bits of memory. However, there are two lower-precision dtypes, `float16` and `bfloat16`, each of which takes 16 bits of memory instead. Modern accelerators can run operations faster in the 16-bit dtypes, as they have specialized hardware to run 16-bit computations and 16-bit dtypes can be read from memory faster.
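To make the memory difference concrete, the following base R sketch computes the storage needed for a single weight matrix in each dtype. The helper `tensor_bytes` and the 4096 x 4096 shape are illustrative assumptions, not part of any Keras API.

```r
# Bytes per element for each floating-point dtype discussed above.
bytes_per_element <- c(float32 = 4, float16 = 2, bfloat16 = 2)

# Hypothetical helper: storage in bytes for a dense tensor of a given shape.
tensor_bytes <- function(shape, dtype) {
  prod(shape) * bytes_per_element[[dtype]]
}

shape <- c(4096, 4096)
mib <- sapply(names(bytes_per_element),
              function(dtype) tensor_bytes(shape, dtype) / 2^20)
print(mib)  # float32 needs 64 MiB; each 16-bit dtype needs 32 MiB
```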

NVIDIA GPUs can run operations in `float16` faster than in `float32`, and TPUs can run operations in `bfloat16` faster than in `float32`. Therefore, these lower-precision dtypes should be used whenever possible on those devices. However, variables and a few computations should still be in `float32` for numeric reasons so that the model trains to the same quality. The Keras mixed precision API allows you to use a mix of either `float16` or `bfloat16` with `float32`, to get the performance benefits from `float16`/`bfloat16` and the numeric stability benefits from `float32`.

*Note: In this guide, the term “numeric stability” refers to how a model’s quality is affected by the use of a lower-precision dtype instead of a higher precision dtype. We say an operation is “numerically unstable” in float16 or bfloat16 if running it in one of those dtypes causes the model to have worse evaluation accuracy or other metrics compared to running the operation in float32.*

## Setup

The Keras mixed precision API is available in TensorFlow 2.1.

```
library(keras)
library(tensorflow)
mixed_precision <- tf$keras$mixed_precision$experimental
```

## Supported hardware

While mixed precision will run on most hardware, it will only speed up models on recent NVIDIA GPUs and Cloud TPUs. NVIDIA GPUs support using a mix of `float16` and `float32`, while TPUs support a mix of `bfloat16` and `float32`.

Among NVIDIA GPUs, those with compute capability 7.0 or higher will see the greatest performance benefit from mixed precision because they have special hardware units, called Tensor Cores, to accelerate `float16` matrix multiplications and convolutions. Older GPUs offer no math performance benefit for using mixed precision; however, memory and bandwidth savings can enable some speedups. You can look up the compute capability for your GPU at NVIDIA’s CUDA GPU web page. Examples of GPUs that will benefit most from mixed precision include RTX GPUs, the Titan V, and the V100.

You can check your GPU type with the following. The `nvidia-smi` command only exists if the NVIDIA drivers are installed, so the check will fail otherwise.
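A minimal sketch of such a check from R, assuming the `nvidia-smi` command-line tool that ships with the NVIDIA drivers. The helper name `list_nvidia_gpus` is hypothetical; it returns an empty vector instead of failing when the tool is missing.

```r
# Hypothetical helper: list GPUs via `nvidia-smi -L`, or return an empty
# character vector when the command (and so the NVIDIA driver) is not installed.
list_nvidia_gpus <- function() {
  if (!nzchar(Sys.which("nvidia-smi"))) {
    message("nvidia-smi not found; NVIDIA drivers do not appear to be installed")
    return(character(0))
  }
  system2("nvidia-smi", "-L", stdout = TRUE)
}

gpus <- list_nvidia_gpus()
print(gpus)
```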

All Cloud TPUs support `bfloat16`.

Even on CPUs and older GPUs, where no speedup is expected, mixed precision APIs can still be used for unit testing, debugging, or just to try out the API.

## Setting the `dtype` policy

To use mixed precision in Keras, you need to create a `tf$keras$mixed_precision$experimental$Policy`, typically referred to as a `dtype` policy. `dtype` policies specify the dtypes layers will run in. In this guide, you will construct a policy from the string `'mixed_float16'` and set it as the global policy. This will cause subsequently created layers to use mixed precision with a mix of `float16` and `float32`.
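The following lines sketch that step, assuming TensorFlow 2.1 and the experimental API path from the Setup section. The same two calls appear in the Summary of this guide.

```r
library(tensorflow)

# Assumes TensorFlow 2.1, where the API lives under `experimental`.
mixed_precision <- tf$keras$mixed_precision$experimental

# Build the policy from its name and install it as the global default.
policy <- mixed_precision$Policy('mixed_float16')
mixed_precision$set_policy(policy)
```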

The policy specifies two important aspects of a layer: the `dtype` the layer’s computations are done in, and the `dtype` of a layer’s variables. Above, you created a `mixed_float16` policy. With this policy, layers use `float16` computations and `float32` variables. Computations are done in `float16` for performance, but variables must be kept in `float32` for numeric stability. You can directly query these properties of the policy.
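A minimal sketch of that query, assuming TensorFlow 2.1 and the experimental API used throughout this guide. The policy is rebuilt here only so that the snippet runs on its own.

```r
library(tensorflow)

policy <- tf$keras$mixed_precision$experimental$Policy('mixed_float16')

# The dtype used for computations and the dtype used for variables.
cat('Compute dtype:', policy$compute_dtype, '\n')
cat('Variable dtype:', policy$variable_dtype, '\n')
```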

As mentioned before, the `mixed_float16` policy will most significantly improve performance on NVIDIA GPUs with compute capability of at least 7.0. The policy will run on other GPUs and CPUs but may not improve performance. For TPUs, the `mixed_bfloat16` policy should be used instead.

## Building the model

Next, let’s start building a simple model. Very small toy models typically do not benefit from mixed precision, because overhead from the TensorFlow runtime typically dominates the execution time, making any performance improvement on the GPU negligible. Therefore, let’s build two large Dense layers with 4096 units each if a GPU is used.

```
inputs <- layer_input(shape = 784, name = 'digits')
if (length(tf$config$list_physical_devices('GPU')) > 0) {
  print('The model will run with 4096 units on a GPU')
  num_units <- 4096
} else {
  # Use fewer units on CPUs so the model finishes in a reasonable amount of time
  print('The model will run with 64 units on a CPU')
  num_units <- 64
}
dense1 <- layer_dense(units = num_units, activation = 'relu', name = 'dense_1')
dense2 <- layer_dense(units = num_units, activation = 'relu', name = 'dense_2')
x <- inputs %>% dense1() %>% dense2()
```

Each layer has a policy and uses the global policy by default. Each of the dense layers therefore has the `mixed_float16` policy because you set the global policy to `mixed_float16` previously. This will cause the dense layers to do `float16` computations and have `float32` variables. They cast their inputs to `float16` in order to do `float16` computations, which causes their outputs to be `float16` as a result. Their variables are `float32` and will be cast to `float16` when the layers are called to avoid errors from `dtype` mismatches.

Next, create the output predictions. Normally, you can create the output predictions as follows, but this is not always numerically stable with float16.

```
# INCORRECT: softmax and model output will be float16, when it should be float32
outputs <- x %>% layer_dense(units = 10, activation = 'softmax', name = 'predictions')
outputs$dtype
```

A softmax activation at the end of the model should be `float32`. Because the `dtype` policy is `mixed_float16`, the softmax activation would normally have a `float16` compute `dtype` and output `float16` tensors.

This can be fixed by separating the Dense and softmax layers, and by passing `dtype='float32'` to the softmax layer:

```
# CORRECT: softmax and model output are float32
x <- x %>% layer_dense(10, name = 'dense_logits')
outputs <- x %>% layer_activation('softmax', dtype = 'float32', name = 'predictions')
outputs$dtype
```

Passing `dtype='float32'` to the softmax layer constructor overrides the layer’s `dtype` policy to be the `float32` policy, which does computations and keeps variables in `float32`. Equivalently, we could have instead passed `dtype = mixed_precision$Policy('float32')`; layers always convert the dtype argument to a policy. Because the `Activation` layer has no variables, the policy’s variable dtype is ignored, but the policy’s compute dtype of `float32` causes softmax and the model output to be `float32`.

Adding a `float16` softmax in the middle of a model is fine, but a softmax at the end of the model should be in `float32`. The reason is that if the intermediate tensor flowing from the softmax to the loss is `float16` or `bfloat16`, numeric issues may occur.

You can override the `dtype` of any layer to be `float32` by passing `dtype='float32'` if you think it will not be numerically stable with `float16` computations. But typically, this is only necessary on the last layer of the model, as most layers have sufficient precision with `mixed_float16` and `mixed_bfloat16`.

If the model does not end in a softmax, the outputs should still be `float32`. While unnecessary for this model, the model outputs can be cast to `float32` with the following:

```
# The linear activation is an identity function. So this simply casts 'outputs'
# to float32. In this particular case, 'outputs' is already float32 so this is a
# no-op.
outputs <- outputs %>% layer_activation('linear', dtype = 'float32')
```

Next, finish and compile the model, and generate input data.

```
model <- keras_model(inputs = inputs, outputs = outputs)
model %>% compile(
  loss = 'sparse_categorical_crossentropy',
  optimizer = 'rmsprop',
  metrics = 'accuracy')
mnist <- dataset_mnist()
x_train <- mnist$train$x/255
x_test <- mnist$test$x/255
x_train <- array_reshape(x_train, c(nrow(x_train), 784), order = "F")
x_test <- array_reshape(x_test, c(nrow(x_test), 784), order = "F")
y_train <- mnist$train$y
y_test <- mnist$test$y
```

Here, `mnist$train$x` and `mnist$test$x` are R `array`s, and will be converted to NumPy `float64` by `reticulate`. In TensorFlow, they then end up as `tf$float64`. The first layer of the model will then cast the inputs to `float16`, as each layer casts floating-point inputs to its compute `dtype`.

Next, the initial weights of the model are retrieved. This will allow training from scratch again by loading the weights.
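A minimal sketch of this step, assuming `model` is the compiled model from the code block above. The variable name `initial_weights` matches the one restored later in this guide.

```r
library(keras)

# `model` is the compiled model from the previous code block.
# Keep a copy of its freshly initialized weights for later retraining.
initial_weights <- get_weights(model)
```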

## Training the model

Next, train the model.

```
history <- model %>% fit(
  x_train,
  y_train,
  batch_size = 8192,
  epochs = 5,
  validation_split = 0.2
)
test_scores <- model %>%
  evaluate(
    x_test,
    y_test,
    verbose = 2
  )
# print() treats a second positional argument as `digits`, so use cat() instead
cat('Test loss:', test_scores[['loss']], '\n')
cat('Test accuracy:', test_scores[['accuracy']], '\n')
```

Notice the model prints the time per sample in the logs: for example, “4us/sample”. The first epoch may be slower as TensorFlow spends some time optimizing the model, but afterwards the time per sample should stabilize.

You can compare the performance of mixed precision with `float32`. To do so, change the policy from `mixed_float16` to `float32` in the “Setting the dtype policy” section, then rerun all the cells up to this point. On GPUs with at least compute capability 7.0, you should see the time per sample significantly increase, indicating mixed precision sped up the model. For example, with a Titan V GPU, the per-sample time increases from 4us to 12us. Make sure to change the policy back to `mixed_float16` and rerun the cells before continuing with the guide.

For many real-world models, mixed precision also allows you to double the batch size without running out of memory, as `float16` tensors take half the memory. This does not apply to this toy model, however, as you can likely run the model in any `dtype` where each batch consists of the entire MNIST dataset of 60,000 images.

If running mixed precision on a TPU, you will not see as much of a performance gain compared to running mixed precision on GPUs. This is because TPUs already do certain ops in `bfloat16` under the hood even with the default `dtype` policy of `float32`. TPU hardware does not support `float32` for certain ops which are numerically stable in `bfloat16`, such as `matmul`. For such ops the TPU backend will silently use `bfloat16` internally instead. As a consequence, passing `dtype='float32'` to layers which use such ops may have no numerical effect; however, it is unlikely that running such layers with `bfloat16` computations will be harmful.

## Loss scaling

Loss scaling is a technique which `fit` automatically performs with the `mixed_float16` policy to avoid numeric underflow. This section describes loss scaling and how to customize its behavior.

### Underflow and Overflow

`float16` has a narrow dynamic range compared to `float32`. This means values above 65504 will overflow to infinity and values below about 6.0e-8 will underflow to zero. `float32` and `bfloat16` have a much higher dynamic range so that overflow and underflow are not a problem.

For example:

```
x = tf$constant(256, dtype='float16')
as.numeric(x^2) # Overflow
```

```
x = tf$constant(1e-5, dtype='float16')
as.numeric(x^2) # Underflow
```
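The `float16` range can be derived from the IEEE 754 half-precision layout, which has 5 exponent bits and 10 fraction bits. The following base R sketch computes the largest finite value and the smallest positive values; the variable names are illustrative.

```r
fraction_bits <- 10
max_exponent  <- 15    # largest unbiased exponent of a normal float16
min_exponent  <- -14   # smallest unbiased exponent of a normal float16

# Largest finite value: all fraction bits set at the largest exponent.
float16_max           <- (2 - 2^-fraction_bits) * 2^max_exponent
# Smallest positive normal and subnormal values.
float16_min_normal    <- 2^min_exponent
float16_min_subnormal <- 2^(min_exponent - fraction_bits)

cat('Largest finite value:       ', float16_max, '\n')
cat('Smallest positive normal:   ', float16_min_normal, '\n')
cat('Smallest positive subnormal:', float16_min_subnormal, '\n')
```

Squaring 256 gives 65536, just past the largest finite value, and squaring 1e-5 gives 1e-10, far below the smallest subnormal, which is why the two examples above overflow and underflow.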

In practice, overflow with float16 rarely occurs. Additionally, underflow also rarely occurs during the forward pass. However, during the backward pass, gradients can underflow to zero. Loss scaling is a technique to prevent this underflow.

### Loss scaling background

The basic concept of loss scaling is simple: multiply the loss by some large number, say 1024. We call this number the loss scale. This will cause the gradients to be scaled by 1024 as well, greatly reducing the chance of underflow. Once the final gradients are computed, divide them by 1024 to bring them back to their correct values.

The pseudocode for this process is:

```
loss_scale <- 1024
loss <- model(inputs)
loss <- loss * loss_scale
# We assume `grads` are float32. We do not want to divide float16 gradients
grads <- compute_gradient(loss, model$trainable_variables)
grads <- grads / loss_scale
```

Choosing a loss scale can be tricky. If the loss scale is too low, gradients may still underflow to zero. If it is too high, the opposite problem occurs: the gradients may overflow to infinity.
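The trade-off can be illustrated without TensorFlow. Base R has no `float16` type, so the sketch below uses a hypothetical helper, `round_to_float16`, that approximates half-precision rounding. The gradient value and the three loss scales are arbitrary choices for illustration.

```r
# Approximate float16 rounding in base R (illustrative, not a TensorFlow API).
round_to_float16 <- function(x) {
  if (x == 0 || !is.finite(x)) return(x)
  # Subnormals share the exponent of the smallest normal value, 2^-14.
  exponent <- max(floor(log2(abs(x))), -14)
  spacing  <- 2^(exponent - 10)          # gap between adjacent float16 values
  rounded  <- round(x / spacing) * spacing
  if (abs(rounded) > 65504) sign(x) * Inf else rounded
}

true_gradient <- 1e-8   # too small for float16 on its own

for (loss_scale in c(1, 1024, 2^43)) {
  scaled   <- round_to_float16(true_gradient * loss_scale)  # float16 backward pass
  unscaled <- scaled / loss_scale                           # unscale in higher precision
  cat('loss scale', loss_scale, '-> scaled gradient', scaled,
      ', unscaled', unscaled, '\n')
}
```

With a loss scale of 1 the gradient underflows to zero, with 1024 it survives and unscales to roughly its true value, and with 2^43 it overflows to infinity.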

To solve this, TensorFlow dynamically determines the loss scale so you do not have to choose one manually. If you use Keras `fit`, loss scaling is done for you so you do not have to do any extra work. This is explained further in the next section.

### Choosing the loss scale

Each `dtype` policy optionally has an associated `tf$mixed_precision$experimental$LossScale` object, which represents a fixed or dynamic loss scale. By default, the loss scale for the `mixed_float16` policy is a `tf$mixed_precision$experimental$DynamicLossScale`, which dynamically determines the loss scale value. Other policies do not have a loss scale by default, as it is only necessary when `float16` is used. You can query the loss scale of the policy:
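A minimal sketch of that query, assuming TensorFlow 2.1 and the experimental API used in this guide. The policy is rebuilt here only so that the snippet runs on its own.

```r
library(tensorflow)

policy <- tf$keras$mixed_precision$experimental$Policy('mixed_float16')

# The loss scale attached to the policy (dynamic by default for mixed_float16).
loss_scale <- policy$loss_scale
print(loss_scale)
```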

The loss scale prints a lot of internal state, but you can ignore it. The most important part is the `current_loss_scale` part, which shows the loss scale’s current value.

You can instead use a static loss scale by passing a number when constructing a `dtype` policy.
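A sketch of a policy with a static loss scale, assuming TensorFlow 2.1 and the experimental `Policy` constructor with its `loss_scale` argument. The value 1024 is an arbitrary example.

```r
library(tensorflow)

# A number passed as `loss_scale` is treated as a fixed loss scale.
new_policy <- tf$keras$mixed_precision$experimental$Policy(
  'mixed_float16', loss_scale = 1024)
print(new_policy$loss_scale)
```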

The `dtype` policy constructor always converts the loss scale to a `LossScale` object. In this case, it’s converted to a `tf$mixed_precision$experimental$FixedLossScale`, the only `LossScale` subclass other than `DynamicLossScale`.

*Note: Using anything other than a dynamic loss scale is not recommended. Choosing a fixed loss scale can be difficult, as making it too low will cause the model to not train as well, and making it too high will cause Infs or NaNs to appear in the gradients. A dynamic loss scale is typically near the optimal loss scale, so you do not have to do any work. Currently, dynamic loss scales are a bit slower than fixed loss scales, but the performance will be improved in the future.*

Models, like layers, each have a `dtype` policy. If present, a model uses its policy’s loss scale to apply loss scaling in the `fit` method. This means if `fit` is used, you do not have to worry about loss scaling at all: the `mixed_float16` policy will have a dynamic loss scale by default, and `fit` will apply it.

With custom training loops, the model will ignore the policy’s loss scale, and you will have to apply it manually. This is explained in the next section.

## Training the model with a custom training loop

So far, you trained a Keras model with mixed precision using `fit`. Next, you will use mixed precision with a custom training loop.

Running a custom training loop with mixed precision requires two changes over running it in `float32`:

1. Build the model with mixed precision (you already did this).
2. Explicitly use loss scaling if `mixed_float16` is used.

For step (2), you will use the `tf$keras$mixed_precision$experimental$LossScaleOptimizer` class, which wraps an optimizer and applies loss scaling. It takes two arguments: the optimizer and the loss scale. Construct one as follows to use a dynamic loss scale:

```
optimizer = tf$keras$optimizers$RMSprop()
optimizer = mixed_precision$LossScaleOptimizer(optimizer, loss_scale = 'dynamic')
```

Passing `'dynamic'` is equivalent to passing `tf$mixed_precision$experimental$DynamicLossScale()`.

Next, define the loss object and the datasets:

```
loss_object <- tf$keras$losses$SparseCategoricalCrossentropy()
library(tfdatasets)
train_dataset <- tensor_slices_dataset(list(x_train, y_train)) %>%
  dataset_shuffle(10000) %>%
  dataset_batch(8192)
# Evaluate on the held-out test split, not on the training data
test_dataset <- tensor_slices_dataset(list(x_test, y_test)) %>%
  dataset_batch(8192)
```

Next, define the training step function. Two new methods from the loss scale optimizer are used in order to scale the loss and unscale the gradients:

- `get_scaled_loss(loss)`: Multiplies the loss by the loss scale
- `get_unscaled_gradients(gradients)`: Takes in a list of scaled gradients as inputs, and divides each one by the loss scale to unscale them

These functions must be used in order to prevent underflow in the gradients. `LossScaleOptimizer$apply_gradients` will then apply gradients if none of them have `Inf`s or `NaN`s. It will also update the loss scale, halving it if the gradients had `Inf`s or `NaN`s and potentially increasing it otherwise.

```
train_step <- function(x, y) {
  with(tf$GradientTape() %as% tape, {
    predictions <- model(x)
    loss <- loss_object(y, predictions)
    scaled_loss <- optimizer$get_scaled_loss(loss)
  })
  scaled_gradients <- tape$gradient(scaled_loss, model$trainable_variables)
  gradients <- optimizer$get_unscaled_gradients(scaled_gradients)
  optimizer$apply_gradients(purrr::transpose(list(gradients, model$trainable_variables)))
  loss
}
train_step <- tf_function(train_step)
```

The `LossScaleOptimizer` will likely skip the first few steps at the start of training. The loss scale starts out high so that the optimal loss scale can quickly be determined. After a few steps, the loss scale will stabilize and very few steps will be skipped. This process happens automatically and does not affect training quality.
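The skipping behavior can be pictured with a small base R simulation. This is a sketch of the general idea only: the function `update_loss_scale`, the starting scale, and the growth interval of 3 steps are assumptions chosen for illustration, not TensorFlow's actual implementation or defaults.

```r
# Hypothetical update rule for a dynamic loss scale (parameters are assumptions).
update_loss_scale <- function(state, grads_finite,
                              growth_interval = 3, multiplier = 2) {
  if (!grads_finite) {
    # Non-finite gradients: skip the step, halve the scale, reset the counter.
    state$loss_scale <- state$loss_scale / multiplier
    state$good_steps <- 0
    state$skipped <- state$skipped + 1
  } else {
    # Finite gradients: grow the scale after enough consecutive good steps.
    state$good_steps <- state$good_steps + 1
    if (state$good_steps >= growth_interval) {
      state$loss_scale <- state$loss_scale * multiplier
      state$good_steps <- 0
    }
  }
  state
}

state <- list(loss_scale = 2^15, good_steps = 0, skipped = 0)
# Pretend the first two steps overflow and the next three are finite.
for (finite in c(FALSE, FALSE, TRUE, TRUE, TRUE)) {
  state <- update_loss_scale(state, finite)
  cat('loss scale:', state$loss_scale, ' skipped so far:', state$skipped, '\n')
}
```

In this run the scale drops twice while two steps are skipped, then grows again once three consecutive steps produce finite gradients.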

Now define the test step.
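A minimal sketch of a test step, assuming `model` is the model built earlier in this guide. It runs the model in inference mode and returns the predictions, which is what the training loop below expects from `test_step`.

```r
library(tensorflow)

# Run the model in inference mode; `model` is the model built earlier in this guide.
test_step <- tf_function(function(x) {
  model(x, training = FALSE)
})
```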

Load the initial weights of the model, so you can retrain from scratch.

`model$set_weights(initial_weights)`

Finally, run the custom training loop.

```
library(tfautograph)
library(glue)
for (epoch in 1:5) {
  epoch_loss_avg <- tf$keras$metrics$Mean()
  test_accuracy <- tf$keras$metrics$SparseCategoricalAccuracy(
    name = 'test_accuracy')
  autograph({
    for (batch in train_dataset) {
      loss <- train_step(batch[[1]], batch[[2]])
      epoch_loss_avg(loss)
    }
    for (batch in test_dataset) {
      predictions <- test_step(batch[[1]])
      test_accuracy$update_state(batch[[2]], predictions)
    }
  })
  cat('Epoch: ', epoch, ' loss: ', as.numeric(epoch_loss_avg$result()),
      ' test accuracy: ', as.numeric(test_accuracy$result()), '\n')
}
```

## GPU performance tips

Here are some performance tips when using mixed precision on GPUs.

### Increasing your batch size

If it doesn’t affect model quality, try running with double the batch size when using mixed precision. As float16 tensors use half the memory, this often allows you to double your batch size without running out of memory. Increasing batch size typically increases training throughput, i.e. the training elements per second your model can run on.

### Ensuring GPU Tensor Cores are used

As mentioned previously, modern NVIDIA GPUs have special hardware units called Tensor Cores that can multiply `float16` matrices very quickly. However, Tensor Cores require certain dimensions of tensors to be a multiple of 8.

```
# units must be a multiple of 8
layer_dense(units = 64)
# filters must be a multiple of 8
layer_conv_2d(filters = 48, kernel_size = 7, strides = 3)
# And similarly for other convolutional layers, such as layer_conv_3d
# units must be a multiple of 8
layer_lstm(units = 64)
# And similarly for other RNNs, such as layer_gru
# batch_size must be a multiple of 8
model %>% fit(x_train, y_train, batch_size = 128)
```

You should try to use Tensor Cores when possible. If you want to learn more, the NVIDIA deep learning performance guide describes the exact requirements for using Tensor Cores as well as other Tensor Core-related performance information.
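When a dimension is a free hyperparameter, such as the width of a hidden layer, one simple way to satisfy the multiple-of-8 rule is to round it up. The helper below is a hypothetical base R sketch, not part of keras; whether a given dimension can be padded depends on the model.

```r
# Hypothetical helper: round a layer size up to the next multiple of `multiple`.
round_up_to_multiple <- function(n, multiple = 8) {
  ceiling(n / multiple) * multiple
}

hidden_units <- round_up_to_multiple(250)   # 250 is not a multiple of 8
print(hidden_units)
print(round_up_to_multiple(64))             # already a multiple of 8, unchanged
```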

### XLA

XLA is a compiler that can further increase mixed precision performance, as well as `float32` performance to a lesser extent. See the XLA guide for details.

## Cloud TPU performance tips

As on GPUs, you should try doubling your batch size, as `bfloat16` tensors use half the memory. Doubling batch size may increase training throughput.

TPUs do not require any other mixed precision-specific tuning to get optimal performance. TPUs already require the use of XLA. They benefit from having certain dimensions being multiples of 128, but this applies equally to `float32` as it does for mixed precision. See the Cloud TPU Performance Guide for general TPU performance tips, which apply to mixed precision as well as `float32`.

## Summary

You should use mixed precision if you use TPUs or NVIDIA GPUs with at least compute capability 7.0, as it will improve performance by up to 3x.

You can use mixed precision with the following lines:

```
# On TPUs, use 'mixed_bfloat16' instead
policy <- tf$keras$mixed_precision$experimental$Policy('mixed_float16')
mixed_precision$set_policy(policy)
```

- If your model ends in softmax, make sure it is `float32`. And regardless of what your model ends in, make sure the output is `float32`.
- If you use a custom training loop with `mixed_float16`, in addition to the above lines, you need to wrap your optimizer with a `tf$keras$mixed_precision$experimental$LossScaleOptimizer`. Then call `optimizer$get_scaled_loss` to scale the loss, and `optimizer$get_unscaled_gradients` to unscale the gradients.
- Double the training batch size if it does not reduce evaluation accuracy.
- On GPUs, ensure most tensor dimensions are a multiple of 8 to maximize performance.

For more examples of mixed precision using the `tf$keras$mixed_precision` API, see the official models repository. Most official models, such as ResNet and Transformer, will run using mixed precision by passing `--dtype=fp16`.