# Automatic differentiation and gradient tape

In this tutorial we will cover automatic differentiation, a key technique for optimizing machine learning models.

## Setup

We will use the TensorFlow R package:

library(tensorflow)

TensorFlow provides the tf$GradientTape API for automatic differentiation: computing the gradient of a computation with respect to its input variables. TensorFlow “records” all operations executed inside the context of a tf$GradientTape onto a “tape”. TensorFlow then uses that tape and the gradients associated with each recorded operation to compute the gradients of a “recorded” computation using reverse mode differentiation.

For example:

x <- tf$ones(shape(2, 2))

with(tf$GradientTape() %as% t, {
  t$watch(x)
  y <- tf$reduce_sum(x)
  z <- tf$multiply(y, y)
})

# Derivative of z with respect to the original input tensor x
dz_dx <- t$gradient(z, x)
dz_dx
## tf.Tensor(
## [[8. 8.]
##  [8. 8.]], shape=(2, 2), dtype=float32)
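
Note that t$watch() is only needed for plain tensors like x above: a GradientTape automatically watches any trainable tf$Variable accessed inside its context. A minimal sketch of the same computation with a variable (v is our own name, not part of the tutorial):

v <- tf$Variable(tf$ones(shape(2, 2)))

with(tf$GradientTape() %as% t, {
  # No t$watch() call needed: v is a trainable variable
  y <- tf$reduce_sum(v)
  z <- tf$multiply(y, y)
})

t$gradient(z, v) # same 2x2 tensor of 8s as dz_dx above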

You can also request gradients of the output with respect to intermediate values computed during a “recorded” tf$GradientTape context.

x <- tf$ones(shape(2, 2))

with(tf$GradientTape() %as% t, {
  t$watch(x)
  y <- tf$reduce_sum(x)
  z <- tf$multiply(y, y)
})

# Use the tape to compute the derivative of z with respect to the
# intermediate value y.
dz_dy <- t$gradient(z, y)
dz_dy
## tf.Tensor(8.0, shape=(), dtype=float32)

By default, the resources held by a GradientTape are released as soon as the GradientTape$gradient() method is called. To compute multiple gradients over the same computation, create a persistent gradient tape. This allows multiple calls to the gradient() method; resources are released only when the tape object is garbage collected. For example:

x <- tf$constant(3)

with(tf$GradientTape(persistent = TRUE) %as% t, {
  t$watch(x)
  y <- x * x
  z <- y * y
})

# Use the tape to compute multiple derivatives over the same computation
dz_dx <- t$gradient(z, x) # 108.0 (4*x^3 at x = 3)
dz_dx
## tf.Tensor(108.0, shape=(), dtype=float32)
dy_dx <- t$gradient(y, x) # 6.0
dy_dx
## tf.Tensor(6.0, shape=(), dtype=float32)
rm(t) # Drop the reference to the tape

### Recording control flow

Because tapes record operations as they are executed, R control flow (using ifs and whiles, for example) is naturally handled:

f <- function(x, y) {
  output <- 1
  for (i in seq_len(y)) {
    if (i > 2 && i <= 5)
      output <- tf$multiply(output, x)
  }
  output
}

grad <- function(x, y) {
  with(tf$GradientTape() %as% t, {
    t$watch(x)
    out <- f(x, y)
  })
  t$gradient(out, x)
}

x <- tf$constant(2)
grad(x, 6)
## tf.Tensor(12.0, shape=(), dtype=float32)
grad(x, 5)
## tf.Tensor(12.0, shape=(), dtype=float32)
grad(x, 4)
## tf.Tensor(4.0, shape=(), dtype=float32)
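
A while loop is recorded the same way. A minimal sketch, using our own helper pow() (not part of the tutorial) and the same x as above:

pow <- function(x, n) {
  out <- 1
  while (n > 0) {
    out <- out * x
    n <- n - 1
  }
  out
}

with(tf$GradientTape() %as% t, {
  t$watch(x)
  out <- pow(x, 3)
})
t$gradient(out, x) # d(x^3)/dx = 3*x^2 = 12.0 at x = 2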

Operations inside the GradientTape context manager are recorded for automatic differentiation. If gradients are computed in that context, then the gradient computation is recorded as well. As a result, the exact same API works for higher-order gradients. For example:

x <- tf$Variable(1.0) # Create a TensorFlow variable initialized to 1.0

with(tf$GradientTape() %as% t, {
  with(tf$GradientTape() %as% t2, {
    y <- x * x * x
  })
  # Compute the gradient inside the 't' context manager
  # which means the gradient computation is differentiable as well.
  dy_dx <- t2$gradient(y, x)
  d2y_dx2 <- t$gradient(dy_dx, x)
})

dy_dx
## tf.Tensor(3.0, shape=(), dtype=float32)
d2y_dx2
## tf.Tensor(6.0, shape=(), dtype=float32)
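
In practice, gradients computed with a tape are used to update model parameters. A minimal sketch of one gradient descent step on the loss w^2 (w and lr are our own names, not from the tutorial):

w <- tf$Variable(5)
lr <- 0.1

with(tf$GradientTape() %as% t, {
  loss <- w * w
})
dw <- t$gradient(loss, w) # d(w^2)/dw = 2*w = 10.0 at w = 5
w$assign_sub(lr * dw)     # w <- w - lr * dw, i.e. 5 - 0.1 * 10 = 4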