# Mixed Precision Training

## Overview

Mixed precision is the use of both 16-bit and 32-bit floating-point types in a model during training to make it run faster and use less memory. By keeping certain parts of the model in 32-bit types for numeric stability, the model will have a lower step time and train equally well in terms of evaluation metrics such as accuracy. This guide describes how to use the experimental Keras mixed precision API to speed up your models. Using this API can improve performance by more than 3 times on modern GPUs and by 60% on TPUs.

Note: The Keras mixed precision API is currently experimental and may change.

Today, most models use the float32 dtype, which takes 32 bits of memory. However, there are two lower-precision dtypes, float16 and bfloat16, each of which takes 16 bits of memory instead. Modern accelerators can run operations faster in the 16-bit dtypes, as they have specialized hardware to run 16-bit computations, and 16-bit dtypes can be read from memory faster.

NVIDIA GPUs can run operations in float16 faster than in float32, and TPUs can run operations in bfloat16 faster than in float32. Therefore, these lower-precision dtypes should be used whenever possible on those devices. However, variables and a few computations should still be in float32 for numeric reasons so that the model trains to the same quality. The Keras mixed precision API allows you to use a mix of either float16 or bfloat16 with float32, to get the performance benefits from float16/bfloat16 and the numeric stability benefits from float32.

Note: In this guide, the term “numeric stability” refers to how a model’s quality is affected by the use of a lower-precision dtype instead of a higher precision dtype. We say an operation is “numerically unstable” in float16 or bfloat16 if running it in one of those dtypes causes the model to have worse evaluation accuracy or other metrics compared to running the operation in float32.

## Setup

The Keras mixed precision API is available in TensorFlow 2.1.

library(keras)
library(tensorflow)

mixed_precision <- tf$keras$mixed_precision$experimental

## Supported hardware

While mixed precision will run on most hardware, it will only speed up models on recent NVIDIA GPUs and Cloud TPUs. NVIDIA GPUs support using a mix of float16 and float32, while TPUs support a mix of bfloat16 and float32.

Among NVIDIA GPUs, those with compute capability 7.0 or higher will see the greatest performance benefit from mixed precision because they have special hardware units, called Tensor Cores, to accelerate float16 matrix multiplications and convolutions. Older GPUs offer no math performance benefit from mixed precision; however, memory and bandwidth savings can enable some speedups. You can look up the compute capability for your GPU at NVIDIA’s CUDA GPU web page. Examples of GPUs that will benefit most from mixed precision include RTX GPUs, the Titan V, and the V100.

You can check your GPU type with the following command. It only exists if the NVIDIA drivers are installed, so it will raise an error otherwise.

nvidia-smi -L

All Cloud TPUs support bfloat16.

Even on CPUs and older GPUs, where no speedup is expected, mixed precision APIs can still be used for unit testing, debugging, or just to try out the API.

## Setting the dtype policy

To use mixed precision in Keras, you need to create a tf.keras.mixed_precision.experimental.Policy, typically referred to as a dtype policy. Dtype policies specify the dtypes layers will run in. In this guide, you will construct a policy from the string ‘mixed_float16’ and set it as the global policy. This will cause subsequently created layers to use mixed precision with a mix of float16 and float32.

policy <- mixed_precision$Policy('mixed_float16')
mixed_precision$set_policy(policy)

The policy specifies two important aspects of a layer: the dtype the layer’s computations are done in, and the dtype of a layer’s variables. Above, you created a mixed_float16 policy. With this policy, layers use float16 computations and float32 variables. Computations are done in float16 for performance, but variables must be kept in float32 for numeric stability. You can directly query these properties of the policy.

# Compute dtype
policy$compute_dtype

# Variable dtype
policy$variable_dtype

As mentioned before, the mixed_float16 policy will most significantly improve performance on NVIDIA GPUs with compute capability of at least 7.0. The policy will run on other GPUs and CPUs but may not improve performance. For TPUs, the mixed_bfloat16 policy should be used instead.

## Building the model

Next, let’s start building a simple model. Very small toy models typically do not benefit from mixed precision, because overhead from the TensorFlow runtime typically dominates the execution time, making any performance improvement on the GPU negligible. Therefore, let’s build two large Dense layers with 4096 units each if a GPU is used.

inputs <- layer_input(shape = 784, name = 'digits')
if (length(tf$config$list_physical_devices('GPU')) > 0) {
  print('The model will run with 4096 units on a GPU')
  num_units <- 4096
} else {
  # Use fewer units on CPUs so the model finishes in a reasonable amount of time
  print('The model will run with 64 units on a CPU')
  num_units <- 64
}
dense1 <- layer_dense(units = num_units, activation = 'relu', name = 'dense_1')
dense2 <- layer_dense(units = num_units, activation = 'relu', name = 'dense_2')
x <- inputs %>% dense1() %>% dense2()

Each layer has a policy and uses the global policy by default. Each of the dense layers therefore has the mixed_float16 policy because you set the global policy to mixed_float16 previously. This will cause the dense layers to do float16 computations and have float32 variables. They cast their inputs to float16 in order to do float16 computations, which causes their outputs to be float16 as a result. Their variables are float32 and will be cast to float16 when the layers are called to avoid errors from dtype mismatches.

# tensor dtype
x$dtype

# variable dtype
dense2$kernel$dtype
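One way to see why variables are kept in float32 is to look at what happens to a small weight update at each precision. The following base R sketch is only an illustration under stated assumptions: `round_to_precision` is a hypothetical helper that mimics rounding to a binary float with a given number of stored mantissa bits (10 for float16, 23 for float32), ignores overflow and subnormals, and is not how TensorFlow casts tensors. The weight and update values are made up.

```r
# Hypothetical helper: round x to a binary float with `mantissa_bits` stored
# mantissa bits. Assumes nonzero inputs in the normal range; overflow and
# subnormal handling are deliberately left out of this simplified model.
round_to_precision <- function(x, mantissa_bits) {
  exponent <- floor(log2(abs(x)))
  spacing <- 2^(exponent - mantissa_bits)  # gap between neighbouring values
  round(x / spacing) * spacing
}

weight <- 1.0
update <- 1e-4  # made-up small weight update

w16 <- round_to_precision(weight + update, mantissa_bits = 10)  # float16-like
w32 <- round_to_precision(weight + update, mantissa_bits = 23)  # float32-like

cat("float16-like result:", format(w16, digits = 10), "\n")
cat("float32-like result:", format(w32, digits = 10), "\n")
```

In this simplified model the float16-like result is exactly 1, so the update is lost, while the float32-like result stays above 1. Repeated over many training steps, lost updates of this kind are one plausible way a float16 variable could fail to train as well as a float32 one.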

Next, create the output predictions. Normally, you can create the output predictions as follows, but this is not always numerically stable with float16.

# INCORRECT: softmax and model output will be float16, when it should be float32
outputs <- x %>% layer_dense(units = 10, activation = 'softmax', name = 'predictions')
outputs$dtype

A softmax activation at the end of the model should be float32. Because the dtype policy is mixed_float16, the softmax activation would normally have a float16 compute dtype and output float16 tensors.

This can be fixed by separating the Dense and softmax layers, and by passing dtype='float32' to the softmax layer:

# CORRECT: softmax and model output are float32
x <- x %>% layer_dense(10, name = 'dense_logits')
outputs <- x %>% layer_activation('softmax', dtype = 'float32', name = 'predictions')
outputs$dtype

Passing dtype='float32' to the softmax layer constructor overrides the layer’s dtype policy to be the float32 policy, which does computations and keeps variables in float32. Equivalently, we could have instead passed dtype = mixed_precision$Policy('float32'); layers always convert the dtype argument to a policy. Because the Activation layer has no variables, the policy's variable dtype is ignored, but the policy's compute dtype of float32 causes softmax and the model output to be float32.

Adding a float16 softmax in the middle of a model is fine, but a softmax at the end of the model should be in float32. The reason is that if the intermediate tensor flowing from the softmax to the loss is float16 or bfloat16, numeric issues may occur.

You can override the dtype of any layer to be float32 by passing dtype='float32' if you think it will not be numerically stable with float16 computations. But typically, this is only necessary on the last layer of the model, as most layers have sufficient precision with mixed_float16 and mixed_bfloat16.

If the model does not end in a softmax, the outputs should still be float32. While unnecessary for this model, the model outputs can be cast to float32 with the following:

# The linear activation is an identity function. So this simply casts 'outputs'
# to float32. In this particular case, 'outputs' is already float32 so this is a
# no-op.
outputs <- outputs %>% layer_activation('linear', dtype = 'float32')

Next, finish and compile the model, and generate input data.

model <- keras_model(inputs = inputs, outputs = outputs)
model %>% compile(
  loss = 'sparse_categorical_crossentropy',
  optimizer = 'rmsprop',
  metrics = 'accuracy'
)

mnist <- dataset_mnist()
x_train <- mnist$train$x / 255
x_test <- mnist$test$x / 255
x_train <- array_reshape(x_train, c(nrow(x_train), 784), order = "F")
x_test <- array_reshape(x_test, c(nrow(x_test), 784), order = "F")
y_train <- mnist$train$y
y_test <- mnist$test$y

Here, mnist$train$x and mnist$test$x are R arrays, and will be converted to NumPy float64 by reticulate. In TensorFlow, they then end up as tf$float64. The first layer of the model will then cast the inputs to float16, as each layer casts floating-point inputs to its compute dtype.

Next, the initial weights of the model are retrieved. This will allow training from scratch again by loading the weights.

initial_weights <- model$get_weights()

## Training the model

Next, train the model.

history <- model %>% fit(
  x_train, y_train,
  batch_size = 8192,
  epochs = 5,
  validation_split = 0.2
)
test_scores <- model %>% evaluate(x_test, y_test, verbose = 2)
# cat() is used because print() does not concatenate its arguments
cat('Test loss:', test_scores[['loss']], '\n')
cat('Test accuracy:', test_scores[['accuracy']], '\n')

Notice the model prints the time per sample in the logs: for example, “4us/sample”. The first epoch may be slower as TensorFlow spends some time optimizing the model, but afterwards the time per sample should stabilize.

You can compare the performance of mixed precision with float32. To do so, change the policy from mixed_float16 to float32 in the “Setting the dtype policy” section, then rerun all the cells up to this point. On GPUs with at least compute capability 7.0, you should see the time per sample significantly increase, indicating mixed precision sped up the model. For example, with a Titan V GPU, the per-sample time increases from 4us to 12us. Make sure to change the policy back to mixed_float16 and rerun the cells before continuing with the guide.

For many real-world models, mixed precision also allows you to double the batch size without running out of memory, as float16 tensors take half the memory. This does not apply to this toy model, however, as you can likely run the model in any dtype where each batch consists of the entire MNIST dataset of 60,000 images.

If running mixed precision on a TPU, you will not see as much of a performance gain compared to running mixed precision on GPUs. This is because TPUs already do certain ops in bfloat16 under the hood even with the default dtype policy of float32. TPU hardware does not support float32 for certain ops which are numerically stable in bfloat16, such as matmul. For such ops the TPU backend will silently use bfloat16 internally instead. As a consequence, passing dtype='float32' to layers which use such ops may have no numerical effect; however, it is unlikely that running such layers with bfloat16 computations will be harmful.

## Loss scaling

Loss scaling is a technique which fit automatically performs with the mixed_float16 policy to avoid numeric underflow. This section describes loss scaling and how to customize its behavior.
### Underflow and Overflow

Float16 has a narrow dynamic range compared to float32. This means values above 65504 will overflow to infinity and values below about 6.0e-8 will underflow to zero. float32 and bfloat16 have a much higher dynamic range so that overflow and underflow are not a problem.

For example:

x <- tf$constant(256, dtype = 'float16')
as.numeric(x^2)  # Overflow

x <- tf$constant(1e-5, dtype = 'float16')
as.numeric(x^2)  # Underflow

In practice, overflow with float16 rarely occurs. Additionally, underflow also rarely occurs during the forward pass. However, during the backward pass, gradients can underflow to zero. Loss scaling is a technique to prevent this underflow.

### Loss scaling background

The basic concept of loss scaling is simple: multiply the loss by some large number, say 1024. We call this number the loss scale. This will cause the gradients to be scaled by 1024 as well, greatly reducing the chance of underflow. Once the final gradients are computed, divide them by 1024 to bring them back to their correct values.

The pseudocode for this process is:

loss_scale <- 1024
loss <- model(inputs)
loss <- loss * loss_scale
# We assume grads are float32. We do not want to divide float16 gradients
grads <- compute_gradient(loss, model$trainable_variables)
grads <- grads / loss_scale
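The arithmetic behind the pseudocode can be tried out in base R without TensorFlow. This is a rough sketch under stated assumptions: `to_float16_range` is a hypothetical helper that only imitates the float16 range limits (flush to zero below 2^-24, infinity above 65504) and ignores rounding of in-range values, and the gradient values are made up.

```r
# Hypothetical stand-in for float16 range limits (not a real float16 cast):
# magnitudes below the smallest subnormal flush to zero, magnitudes above the
# largest finite value become infinite. Rounding of in-range values is ignored.
to_float16_range <- function(x) {
  x[abs(x) < 2^-24] <- 0
  too_big <- abs(x) > 65504
  x[too_big] <- sign(x[too_big]) * Inf
  x
}

true_grads <- c(1e-3, 1e-8, 1e-9)  # made-up gradient values
loss_scale <- 1024

# Without loss scaling, the two small gradients underflow to zero.
unscaled <- to_float16_range(true_grads)

# With loss scaling, the gradients come out 1024 times larger, stay inside the
# float16 range, and are divided by the loss scale afterwards in float32.
scaled <- to_float16_range(true_grads * loss_scale)
recovered <- scaled / loss_scale

print(unscaled)
print(recovered)
```

In this sketch, `unscaled` keeps only the first gradient, while `recovered` matches `true_grads` because every scaled value stays within the representable range.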

Choosing a loss scale can be tricky. If the loss scale is too low, gradients may still underflow to zero. If it is too high, the opposite problem occurs: the gradients may overflow to infinity.
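The trade-off can be illustrated by counting how many gradients fall outside the float16 range for a few candidate loss scales. This is a base R sketch with made-up gradient magnitudes and candidate scales; only the float16 limits (2^-24 and 65504) are taken as given.

```r
# Float16 range limits used for this illustration.
f16_min <- 2^-24  # smallest positive subnormal value
f16_max <- 65504  # largest finite value

grads <- c(2.0, 1e-3, 1e-8)      # made-up float32 gradient magnitudes
loss_scales <- c(1, 2^10, 2^20)  # candidate loss scales to compare

# Count gradients that would underflow or overflow after scaling.
underflows <- sapply(loss_scales, function(s) sum(grads * s < f16_min))
overflows <- sapply(loss_scales, function(s) sum(grads * s > f16_max))

print(data.frame(loss_scale = loss_scales, underflows, overflows))
```

With these particular values, a loss scale of 1 leaves one gradient underflowing, 2^20 pushes one gradient past the float16 maximum, and 2^10 keeps all three in range.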

To solve this, TensorFlow dynamically determines the loss scale so you do not have to choose one manually. If you use Keras fit, loss scaling is done for you so you do not have to do any extra work. This is explained further in the next section.

### Choosing the loss scale

Each dtype policy optionally has an associated tf$mixed_precision$experimental$LossScale object, which represents a fixed or dynamic loss scale. By default, the loss scale for the mixed_float16 policy is a tf$mixed_precision$experimental$DynamicLossScale, which dynamically determines the loss scale value. Other policies do not have a loss scale by default, as it is only necessary when float16 is used. You can query the loss scale of the policy:

policy$loss_scale

The loss scale prints a lot of internal state, but you can ignore it. The most important part is the current_loss_scale part, which shows the loss scale’s current value.

You can instead use a static loss scale by passing a number when constructing a dtype policy.

new_policy <- mixed_precision$Policy('mixed_float16', loss_scale = 1024)
new_policy$loss_scale

The dtype policy constructor always converts the loss scale to a LossScale object. In this case, it’s converted to a tf.mixed_precision.experimental.FixedLossScale, the only LossScale subclass other than DynamicLossScale.

Note: Using anything other than a dynamic loss scale is not recommended. Choosing a fixed loss scale can be difficult, as making it too low will cause the model to not train as well, and making it too high will cause Infs or NaNs to appear in the gradients. A dynamic loss scale is typically near the optimal loss scale, so you do not have to do any work. Currently, dynamic loss scales are a bit slower than fixed loss scales, but the performance will be improved in the future.

Models, like layers, each have a dtype policy. If present, a model uses its policy’s loss scale to apply loss scaling in the fit method. This means if fit is used, you do not have to worry about loss scaling at all: the mixed_float16 policy will have a dynamic loss scale by default, and fit will apply it.

With custom training loops, the model will ignore the policy’s loss scale, and you will have to apply it manually. This is explained in the next section.

## Training the model with a custom training loop

So far, you trained a Keras model with mixed precision using fit. Next, you will use mixed precision with a custom training loop. Running a custom training loop with mixed precision requires two changes over running it in float32:

• Build the model with mixed precision (you already did this)
• Explicitly use loss scaling if mixed_float16 is used

For the second change, you will use the tf$keras$mixed_precision$experimental$LossScaleOptimizer class, which wraps an optimizer and applies loss scaling. It takes two arguments: the optimizer and the loss scale. Construct one as follows to use a dynamic loss scale:

optimizer <- tf$keras$optimizers$RMSprop()
optimizer <- mixed_precision$LossScaleOptimizer(optimizer, loss_scale = 'dynamic')

Passing ‘dynamic’ is equivalent to passing tf.mixed_precision.experimental.DynamicLossScale().

Next, define the loss object and the datasets.

loss_object <- tf$keras$losses$SparseCategoricalCrossentropy()

library(tfdatasets)
train_dataset <- tensor_slices_dataset(list(x_train, y_train)) %>%
  dataset_shuffle(10000) %>%
  dataset_batch(8192)
# The test dataset is built from the held-out test split, not the training data
test_dataset <- tensor_slices_dataset(list(x_test, y_test)) %>%
  dataset_batch(8192)

Next, define the training step function. Two new methods from the loss scale optimizer are used in order to scale the loss and unscale the gradients:

• get_scaled_loss(loss): Multiplies the loss by the loss scale
• get_unscaled_gradients(gradients): Takes in a list of scaled gradients as inputs, and divides each one by the loss scale to unscale them

These functions must be used in order to prevent underflow in the gradients. LossScaleOptimizer$apply_gradients will then apply gradients if none of them have Infs or NaNs. It will also update the loss scale, halving it if the gradients had Infs or NaNs and potentially increasing it otherwise.

train_step <- function(x, y) {
  with(tf$GradientTape() %as% tape, {
    predictions <- model(x)
    loss <- loss_object(y, predictions)
    scaled_loss <- optimizer$get_scaled_loss(loss)
  })
  scaled_gradients <- tape$gradient(scaled_loss, model$trainable_variables)
  gradients <- optimizer$get_unscaled_gradients(scaled_gradients)
  optimizer$apply_gradients(purrr::transpose(list(gradients, model$trainable_variables)))
  loss
}

train_step <- tf_function(train_step)

The LossScaleOptimizer will likely skip the first few steps at the start of training. The loss scale starts out high so that the optimal loss scale can quickly be determined. After a few steps, the loss scale will stabilize and very few steps will be skipped. This process happens automatically and does not affect training quality.
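The adjustment behaviour described above can be mimicked with a small base R simulation. This is a sketch under stated assumptions, not TensorFlow's implementation: the starting value of 2^15, the rule of doubling after a fixed number of consecutive finite steps, and the tiny `growth_interval` of 3 are all chosen here for illustration, and `update_loss_scale` is a hypothetical helper.

```r
# Assumed parameters for this demo; they are not read from TensorFlow.
initial_loss_scale <- 2^15
growth_interval <- 3  # consecutive finite steps needed before doubling

# Hypothetical helper: return the updated state after one training step.
update_loss_scale <- function(state, grads_finite) {
  if (!grads_finite) {
    # Inf/NaN gradients: the step is skipped and the loss scale is halved.
    state$loss_scale <- max(state$loss_scale / 2, 1)
    state$good_steps <- 0
    state$skipped <- state$skipped + 1
  } else {
    state$good_steps <- state$good_steps + 1
    if (state$good_steps >= growth_interval) {
      # Enough finite steps in a row: try a larger loss scale.
      state$loss_scale <- state$loss_scale * 2
      state$good_steps <- 0
    }
  }
  state
}

state <- list(loss_scale = initial_loss_scale, good_steps = 0, skipped = 0)

# Made-up sequence: the first two steps overflow, the remaining four are finite.
finite_flags <- c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
for (ok in finite_flags) {
  state <- update_loss_scale(state, ok)
  cat("loss scale:", state$loss_scale, " skipped so far:", state$skipped, "\n")
}
```

In this run the two overflowing steps are skipped while the loss scale drops from 32768 to 8192, after which three finite steps in a row raise it back to 16384.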

Now define the test step.

test_step <- function(x) {
  model(x, training = FALSE)
}

test_step <- tf_function(test_step)

Load the initial weights of the model, so you can retrain from scratch.

model$set_weights(initial_weights)

Finally, run the custom training loop.

library(tfautograph)
library(glue)

for (epoch in 1:5) {
  epoch_loss_avg <- tf$keras$metrics$Mean()
  test_accuracy <- tf$keras$metrics$SparseCategoricalAccuracy(name = 'test_accuracy')
  autograph({
    for (batch in train_dataset) {
      loss <- train_step(batch[[1]], batch[[2]])
      epoch_loss_avg(loss)
    }
    for (batch in test_dataset) {
      predictions <- test_step(batch[[1]])
      test_accuracy$update_state(batch[[2]], predictions)
    }
  })
  cat('Epoch: ', epoch, '   loss: ', as.numeric(epoch_loss_avg$result()), ' test accuracy: ', as.numeric(test_accuracy$result()), '\n')
}

## GPU performance tips

Here are some performance tips when using mixed precision on GPUs.

If it doesn’t affect model quality, try running with double the batch size when using mixed precision. As float16 tensors use half the memory, this often allows you to double your batch size without running out of memory. Increasing batch size typically increases training throughput, i.e. the number of training elements per second your model can process.
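A quick back-of-the-envelope check of the memory argument, in base R: `activation_mib` is a hypothetical helper, and the sketch only counts the output activations of a single 4096-unit layer. Real memory use also includes float32 variables, optimizer state, and other buffers, so doubling the batch size is not guaranteed to fit.

```r
# Hypothetical helper: MiB needed to store one layer's output activations.
activation_mib <- function(batch_size, units, bytes_per_element) {
  batch_size * units * bytes_per_element / 2^20
}

# float32 activations (4 bytes each) at the batch size used in this guide
float32_mib <- activation_mib(batch_size = 8192, units = 4096, bytes_per_element = 4)
# float16 activations (2 bytes each) at double the batch size
float16_mib <- activation_mib(batch_size = 16384, units = 4096, bytes_per_element = 2)

cat("float32, batch 8192: ", float32_mib, "MiB\n")
cat("float16, batch 16384:", float16_mib, "MiB\n")
```

Both configurations need the same 128 MiB for these activations, which is the intuition behind doubling the batch size.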

#### Ensuring GPU Tensor Cores are used

As mentioned previously, modern NVIDIA GPUs use special hardware units called Tensor Cores that can multiply float16 matrices very quickly. However, Tensor Cores require certain dimensions of tensors to be a multiple of 8.

# units must be multiple of 8
layer_dense(units = 64)

# filters must be multiple of 8
layer_conv_2d(filters = 48, kernel_size = 7, strides = 3)
# And similarly for other convolutional layers, such as tf.keras.layers.Conv3D

# units must be multiple of 8
layer_lstm(units = 64)
# And similarly for other RNNs, such as tf.keras.layers.GRU

# batch_size must be multiple of 8
model %>% fit(x_train, y_train, batch_size = 128)

You should try to use Tensor Cores when possible. If you want to learn more, the NVIDIA deep learning performance guide describes the exact requirements for using Tensor Cores as well as other Tensor Core-related performance information.
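When a dimension such as a layer width is not already a multiple of 8, one simple option is to round it up. The helper below is a hypothetical base R convenience, not part of keras or TensorFlow, and whether padding a dimension is acceptable depends on your model.

```r
# Hypothetical helper: round n up to the nearest multiple of `multiple`.
round_up_to_multiple <- function(n, multiple = 8) {
  as.integer(ceiling(n / multiple) * multiple)
}

round_up_to_multiple(50)         # 56
round_up_to_multiple(64)         # 64 (already a multiple of 8)
round_up_to_multiple(1000, 128)  # 1024
```

The result can then be passed as `units` or `filters` when creating a layer, for example `layer_dense(units = round_up_to_multiple(50))`.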

#### XLA

XLA is a compiler that can further increase mixed precision performance, as well as float32 performance to a lesser extent. See the XLA guide for details.

## Cloud TPU performance tips

As on GPUs, you should try doubling your batch size, as bfloat16 tensors use half the memory. Doubling batch size may increase training throughput.

TPUs do not require any other mixed precision-specific tuning to get optimal performance. TPUs already require the use of XLA. They benefit from having certain dimensions be multiples of 128, but this applies equally to float32 and mixed precision. See the Cloud TPU Performance Guide for general TPU performance tips, which apply to mixed precision as well as float32.

## Summary

• You should use mixed precision if you use TPUs or NVIDIA GPUs with at least compute capability 7.0, as it will improve performance by up to 3x.

• You can use mixed precision with the following lines:

# On TPUs, use 'mixed_bfloat16' instead
policy <- tf$keras$mixed_precision$experimental$Policy('mixed_float16')
mixed_precision$set_policy(policy)

• If your model ends in softmax, make sure it is float32. And regardless of what your model ends in, make sure the output is float32.

• If you use a custom training loop with mixed_float16, in addition to the above lines, you need to wrap your optimizer with a tf$keras$mixed_precision$experimental$LossScaleOptimizer. Then call optimizer$get_scaled_loss to scale the loss, and optimizer$get_unscaled_gradients to unscale the gradients.

• Double the training batch size if it does not reduce evaluation accuracy.

• On GPUs, ensure most tensor dimensions are a multiple of 8 to maximize performance.

• For more examples of mixed precision using the tf$keras$mixed_precision API, see the official models repository. Most official models, such as ResNet and Transformer, will run using mixed precision by passing --dtype=fp16.