Understanding masking & padding
Setup
library(tensorflow)
library(keras)
Introduction
Masking is a way to tell sequence-processing layers that certain timesteps in an input are missing, and thus should be skipped when processing the data.
Padding is a special form of masking where the masked steps are at the start or the end of a sequence. Padding comes from the need to encode sequence data into contiguous batches: in order to make all sequences in a batch fit a given standard length, it is necessary to pad or truncate some sequences.
Let’s take a close look.
Padding sequence data
When processing sequence data, it is very common for individual samples to have different lengths. Consider the following example (text tokenized as words):
list(
c("Hello", "world", "!"),
c("How", "are", "you", "doing", "today"),
c("The", "weather", "will", "be", "nice", "tomorrow")
)
[[1]]
[1] "Hello" "world" "!"
[[2]]
[1] "How" "are" "you" "doing" "today"
[[3]]
[1] "The" "weather" "will" "be" "nice" "tomorrow"
After vocabulary lookup, the data might be vectorized as integers, e.g.:
list(
c(71, 1331, 4231),
c(73, 8, 3215, 55, 927),
c(83, 91, 1, 645, 1253, 927)
)
[[1]]
[1] 71 1331 4231
[[2]]
[1] 73 8 3215 55 927
[[3]]
[1] 83 91 1 645 1253 927
The data is a nested list where individual samples have lengths 3, 5, and 6, respectively. Since the input data for a deep learning model must be a single tensor (e.g. of shape (batch_size, 6, vocab_size) in this case), samples that are shorter than the longest item need to be padded with some placeholder value (alternatively, one might also truncate long samples before padding short samples).
Keras provides a utility function to truncate and pad lists to a common length: pad_sequences().
raw_inputs <- list(
  c(711, 632, 71),
  c(73, 8, 3215, 55, 927),
  c(83, 91, 1, 645, 1253, 927)
)

# By default, this will pad using 0s; it is configurable via the
# "value" parameter.
# Note that you could use "pre" padding (at the beginning) or
# "post" padding (at the end).
# We recommend using "post" padding when working with RNN layers
# (in order to be able to use the
# CuDNN implementation of the layers).
padded_inputs <- pad_sequences(raw_inputs, padding = "post")
print(padded_inputs)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 711 632 71 0 0 0
[2,] 73 8 3215 55 927 0
[3,] 83 91 1 645 1253 927
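To make the padding step concrete, here is a minimal base R sketch of "post" padding. The helper name pad_post() is hypothetical and is not part of keras; it only illustrates the idea and does not reproduce the truncation options or the exact return type of pad_sequences().

```r
# Hypothetical helper (not a keras function): right-pad every sequence
# with `value` up to the length of the longest sequence in the list.
pad_post <- function(sequences, value = 0) {
  max_len <- max(lengths(sequences))
  padded <- lapply(sequences, function(s) {
    c(s, rep(value, max_len - length(s)))
  })
  # Stack the equal-length vectors into a (batch_size, max_len) matrix.
  do.call(rbind, padded)
}

raw_inputs <- list(
  c(711, 632, 71),
  c(73, 8, 3215, 55, 927),
  c(83, 91, 1, 645, 1253, 927)
)

padded <- pad_post(raw_inputs)
print(padded)
```

Padding at the end keeps the real tokens left-aligned, which matches the "post" layout recommended in the comments above.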
Masking
Now that all samples have a uniform length, the model must be informed that some part of the data is actually padding and should be ignored. That mechanism is masking.
There are three ways to introduce input masks in Keras models:
- Add a layer_masking() layer.
- Configure a layer_embedding() layer with mask_zero = TRUE.
- Pass a mask argument manually when calling layers that support this argument (e.g. RNN layers).
Mask-generating layers: embedding and masking
Under the hood, these layers will create a mask tensor (2D tensor with shape (batch, sequence_length)), and attach it to the tensor output returned by the layer_masking() or layer_embedding() layer.
embedding <- layer_embedding(input_dim = 5000, output_dim = 16, mask_zero = TRUE)
masked_output <- embedding(padded_inputs)

print(masked_output$`_keras_mask`)
tf.Tensor(
[[ True True True False False False]
[ True True True True True False]
[ True True True True True True]], shape=(3, 6), dtype=bool)
masking_layer <- layer_masking()
# Simulate the embedding lookup by expanding the 2D input to 3D,
# with embedding dimension of 10.
unmasked_embedding <- tf$cast(
  tf$tile(tf$expand_dims(padded_inputs, axis = -1L), c(1L, 1L, 10L)),
  tf$float32
)

masked_embedding <- masking_layer(unmasked_embedding)
print(masked_embedding$`_keras_mask`)
tf.Tensor(
[[ True True True False False False]
[ True True True True True False]
[ True True True True True True]], shape=(3, 6), dtype=bool)
As you can see from the printed result, the mask is a 2D boolean tensor with shape (batch_size, sequence_length), where each individual FALSE entry indicates that the corresponding timestep should be ignored during processing.
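For zero-padded integer input, the same mask can be reproduced in base R by comparing each entry against the padding value. This is only a sketch to make the meaning of the mask concrete: it assumes 0 is reserved for padding and does not involve Keras at all.

```r
# The padded batch from the example above, written out as a plain matrix.
padded_inputs <- matrix(
  c(711, 632,   71,   0,    0,   0,
     73,   8, 3215,  55,  927,   0,
     83,  91,    1, 645, 1253, 927),
  nrow = 3, byrow = TRUE
)

# TRUE marks a real timestep, FALSE marks padding (assumes 0 == padding).
mask <- padded_inputs != 0
print(mask)

# Counting the TRUE entries per row recovers the original sequence lengths.
sequence_lengths <- rowSums(mask)
print(sequence_lengths)
```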
Mask propagation in the Functional API and Sequential API
When using the Functional API or the Sequential API, a mask generated by a layer_embedding() or layer_masking() layer will be propagated through the network to any layer that is capable of using it (for example, RNN layers). Keras will automatically fetch the mask corresponding to an input and pass it to any layer that knows how to use it.
For instance, in the following Sequential model, the LSTM layer will automatically receive a mask, which means it will ignore padded values:
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 5000, output_dim = 16, mask_zero = TRUE) %>%
  layer_lstm(32)
This is also the case for the following Functional API model:
inputs <- layer_input(shape = shape(NULL), dtype = "int32")
outputs <- inputs %>%
  layer_embedding(input_dim = 5000, output_dim = 16, mask_zero = TRUE) %>%
  layer_lstm(units = 32)

model <- keras_model(inputs, outputs)
Passing mask tensors directly to layers
Layers that can handle masks (such as the LSTM layer) have a mask argument in their call method.
Meanwhile, layers that produce a mask (e.g. Embedding) expose a compute_mask(input, previous_mask) method which you can call.
Thus, you can pass the output of the compute_mask() method of a mask-producing layer to the call method of a mask-consuming layer, like this:
my_layer <- new_layer_class(
  "my_layer",
  initialize = function(...) {
    super()$`__init__`(...)
    self$embedding <- layer_embedding(
      input_dim = 5000,
      output_dim = 16,
      mask_zero = TRUE
    )
    self$lstm <- layer_lstm(units = 32)
  },
  call = function(inputs) {
    x <- self$embedding(inputs)
    # Note that you could also prepare a `mask` tensor manually.
    # It only needs to be a boolean tensor
    # with the right shape, i.e. (batch_size, timesteps).
    mask <- self$embedding$compute_mask(inputs)
    output <- self$lstm(x, mask = mask) # The layer will ignore the masked values
    output
  }
)

layer <- my_layer()
x <- array(as.integer(runif(32*10)*100), dim = c(32, 10))
layer(x)
tf.Tensor(
[[ 0.00661029 0.00905568 0.00409111 ... -0.00269595 -0.00302358
0.00087318]
[-0.00190009 0.00422419 -0.00334891 ... -0.00201224 0.00238065
0.0036881 ]
[-0.00398618 0.00931766 0.00278585 ... -0.00403854 0.00363951
-0.00540061]
...
[-0.00302773 -0.00809171 0.00139773 ... -0.00075406 0.00909089
-0.00136076]
[ 0.00469031 -0.00315592 -0.00237436 ... 0.00177373 0.00785404
-0.00016403]
[-0.00911869 -0.00276186 0.00483396 ... -0.00481566 0.00428451
-0.00243689]], shape=(32, 32), dtype=float32)
Supporting masking in your custom layers
Sometimes, you may need to write layers that generate a mask (like Embedding), or layers that need to modify the current mask.
For instance, any layer that produces a tensor with a different time dimension than its input, such as a Concatenate layer that concatenates on the time dimension, will need to modify the current mask so that downstream layers will be able to properly take masked timesteps into account.
To do this, your layer should implement the layer$compute_mask() method, which produces a new mask given the input and the current mask.
Here is an example of a temporal_split layer that needs to modify the current mask.
layer_temporal_split <- new_layer_class(
  "temporal_split",
  call = function(inputs) {
    # Expect the input to be 3D and mask to be 2D, split the input tensor into 2
    # subtensors along the time axis (axis 1).
    tf$split(inputs, 2L, axis = 1L)
  },
  compute_mask = function(inputs, mask = NULL) {
    # Also split the mask into 2 if it is present.
    if (is.null(mask)) return(NULL)
    tf$split(mask, 2L, axis = 1L)
  }
)

c(first_half, second_half) %<-% layer_temporal_split()(masked_embedding)
print(first_half$`_keras_mask`)
tf.Tensor(
[[ True True True]
[ True True True]
[ True True True]], shape=(3, 3), dtype=bool)
print(second_half$`_keras_mask`)
tf.Tensor(
[[False False False]
[ True True False]
[ True True True]], shape=(3, 3), dtype=bool)
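The two printed masks are simply the left and right halves of the original (3, 6) mask. The base R sketch below mimics that split on a plain logical matrix; it is an illustration of what compute_mask() has to return here, not a substitute for tf$split().

```r
# The mask attached to `masked_embedding`, written out as a logical matrix.
mask <- matrix(
  c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
    TRUE, TRUE, TRUE, TRUE,  TRUE,  FALSE,
    TRUE, TRUE, TRUE, TRUE,  TRUE,  TRUE),
  nrow = 3, byrow = TRUE
)

# Split along the time axis (columns) into two equally sized halves.
half <- ncol(mask) / 2
first_half_mask <- mask[, seq_len(half), drop = FALSE]
second_half_mask <- mask[, half + seq_len(half), drop = FALSE]

print(first_half_mask)
print(second_half_mask)
```

Each half keeps the batch dimension and only covers its own timesteps, so downstream layers see a mask whose shape matches the tensor they receive.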
Here is another example of a custom_embedding layer that is capable of generating a mask from input values:
layer_custom_embedding <- new_layer_class(
  "custom_embedding",
  initialize = function(input_dim, output_dim, mask_zero = FALSE, ...) {
    super()$`__init__`(...)
    self$input_dim <- input_dim
    self$output_dim <- output_dim
    self$mask_zero <- mask_zero
  },
  build = function(input_shape) {
    self$embeddings <- self$add_weight(
      shape = shape(self$input_dim, self$output_dim),
      initializer = "random_normal",
      dtype = "float32"
    )
  },
  call = function(inputs) {
    tf$nn$embedding_lookup(self$embeddings, inputs)
  },
  compute_mask = function(inputs, mask = NULL) {
    # Only generate a mask when the layer was created with mask_zero = TRUE.
    if (!self$mask_zero) return(NULL)
    tf$not_equal(inputs, 0L)
  }
)

layer <- layer_custom_embedding(
  input_dim = 10,
  output_dim = 32,
  mask_zero = TRUE
)
x <- array(as.integer(runif(3*10)*9), dim = c(3, 10))

y <- layer(x)
mask <- layer$compute_mask(x)

print(mask)
tf.Tensor(
[[ True True False True False False True True True True]
[ True True True False True False True True True True]
[ True True True True True True False True True True]], shape=(3, 10), dtype=bool)
Note: For more details about format limitations related to masking, see the serialization guide.
Opting-in to mask propagation on compatible layers
Most layers don’t modify the time dimension, so don’t need to modify the current mask. However, they may still want to be able to propagate the current mask, unchanged, to the next layer. This is an opt-in behavior. By default, a custom layer will destroy the current mask (since the framework has no way to tell whether propagating the mask is safe to do).
If you have a custom layer that does not modify the time dimension, and if you want it to be able to propagate the current input mask, you should set self$supports_masking = TRUE in the layer constructor. In this case, the default behavior of compute_mask() is to just pass the current mask through.
Here’s an example of a layer that is whitelisted for mask propagation:
layer_my_activation <- new_layer_class(
  "my_activation",
  initialize = function(...) {
    super()$`__init__`(...)
    # Signal that this layer is safe for mask propagation
    self$supports_masking <- TRUE
  },
  call = function(inputs) {
    tf$nn$relu(inputs)
  }
)
You can now use this custom layer in-between a mask-generating layer (like Embedding) and a mask-consuming layer (like LSTM), and it will pass the mask along so that it reaches the mask-consuming layer.
inputs <- layer_input(shape = shape(NULL), dtype = "int32")
x <- inputs %>%
  layer_embedding(input_dim = 5000, output_dim = 16, mask_zero = TRUE) %>%
  layer_my_activation() # Will pass the mask along

print(x$`_keras_mask`)
KerasTensor(type_spec=TensorSpec(shape=(None, None), dtype=tf.bool, name=None), name='Placeholder_1:0')
outputs <- layer_lstm(x, 32) # Will receive the mask
model <- keras_model(inputs, outputs)
Writing layers that need mask information
Some layers are mask consumers: they accept a mask argument in call and use it to determine whether to skip certain time steps.
To write such a layer, you can simply add a mask = NULL argument in your call signature. The mask associated with the inputs will be passed to your layer whenever it is available.
Here’s a simple example below: a layer that computes a softmax over the time dimension (axis 1) of an input sequence, while discarding masked timesteps.
layer_temporal_softmax <- new_layer_class(
  "temporal_softmax",
  call = function(inputs, mask = NULL) {
    # Expand the (batch, time) mask to (batch, time, 1) so it broadcasts
    # against the 3D inputs.
    broadcast_float_mask <- tf$expand_dims(tf$cast(mask, "float32"), -1L)
    # Masked timesteps contribute exp(.) * 0 = 0.
    inputs_exp <- tf$exp(inputs) * broadcast_float_mask
    # Normalize over the time dimension (axis 1), as described above.
    inputs_sum <- tf$reduce_sum(
      inputs_exp * broadcast_float_mask,
      axis = 1L,
      keepdims = TRUE
    )
    inputs_exp / inputs_sum
  }
)

inputs <- layer_input(shape = shape(NULL), dtype = "int32")
outputs <- inputs %>%
  layer_embedding(input_dim = 10, output_dim = 32, mask_zero = TRUE) %>%
  layer_dense(1) %>%
  layer_temporal_softmax()

model <- keras_model(inputs, outputs)
y <- model(
  array(sample.int(9, 32*100, replace = TRUE), dim = c(32, 100)),
  array(runif(32*100), dim = c(32, 100, 1))
)
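The arithmetic of a masked softmax over time can be checked on a small example in base R. The sketch below works on a 2D (batch, time) matrix of scores rather than a 3D tensor, and the function name masked_softmax() is hypothetical; it only illustrates how masked timesteps drop out of the normalization.

```r
# Hypothetical helper: softmax across each row (the time axis), ignoring
# entries where `mask` is FALSE. Assumes every row has at least one TRUE.
masked_softmax <- function(scores, mask) {
  # The logical mask is coerced to 0/1, so masked timesteps get weight 0.
  weights <- exp(scores) * mask
  # Dividing by a length-nrow vector recycles it down the columns,
  # so every row is normalized by its own sum.
  weights / rowSums(weights)
}

scores <- matrix(
  c(1.0, 2.0, 3.0, 4.0,
    0.5, 0.5, 0.5, 0.5),
  nrow = 2, byrow = TRUE
)
mask <- matrix(
  c(TRUE, TRUE, FALSE, FALSE,
    TRUE, TRUE, TRUE,  TRUE),
  nrow = 2, byrow = TRUE
)

probs <- masked_softmax(scores, mask)
print(probs)
```

In the first row only the two unmasked timesteps share the probability mass; in the second row, with nothing masked, the result is an ordinary softmax.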
Summary
That is all you need to know about padding & masking in Keras. To recap:
- “Masking” is how layers are able to know when to skip / ignore certain timesteps in sequence inputs.
- Some layers are mask-generators: Embedding can generate a mask from input values (if mask_zero = TRUE), and so can the Masking layer.
- Some layers are mask-consumers: they expose a mask argument in their __call__ method. This is the case for RNN layers.
- In the Functional API and Sequential API, mask information is propagated automatically.
- When using layers in a standalone way, you can pass the mask arguments to layers manually.
- You can easily write layers that modify the current mask, that generate a new mask, or that consume the mask associated with the inputs.
Environment Details
tensorflow::tf_config()
TensorFlow v2.11.0 (~/.virtualenvs/r-tensorflow-website/lib/python3.10/site-packages/tensorflow)
Python v3.10 (~/.virtualenvs/r-tensorflow-website/bin/python)
sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS
Matrix products: default
BLAS: /home/tomasz/opt/R-4.2.1/lib/R/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/libmkl_intel_lp64.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] keras_2.9.0.9000 tensorflow_2.9.0.9000
loaded via a namespace (and not attached):
[1] Rcpp_1.0.9 pillar_1.8.1 compiler_4.2.1
[4] base64enc_0.1-3 tools_4.2.1 zeallot_0.1.0
[7] digest_0.6.31 jsonlite_1.8.4 evaluate_0.18
[10] lifecycle_1.0.3 tibble_3.1.8 lattice_0.20-45
[13] pkgconfig_2.0.3 png_0.1-8 rlang_1.0.6
[16] Matrix_1.5-3 cli_3.4.1 yaml_2.3.6
[19] xfun_0.35 fastmap_1.1.0 stringr_1.5.0
[22] knitr_1.41 generics_0.1.3 vctrs_0.5.1
[25] htmlwidgets_1.5.4 rprojroot_2.0.3 grid_4.2.1
[28] reticulate_1.26-9000 glue_1.6.2 here_1.0.1
[31] R6_2.5.1 fansi_1.0.3 rmarkdown_2.18
[34] magrittr_2.0.3 whisker_0.4.1 htmltools_0.5.4
[37] tfruns_1.5.1 utf8_1.2.2 stringi_1.7.8
system2(reticulate::py_exe(), c("-m pip freeze"), stdout = TRUE) |> writeLines()
absl-py==1.3.0
asttokens==2.2.1
astunparse==1.6.3
backcall==0.2.0
cachetools==5.2.0
certifi==2022.12.7
charset-normalizer==2.1.1
decorator==5.1.1
dill==0.3.6
etils==0.9.0
executing==1.2.0
flatbuffers==22.12.6
gast==0.4.0
google-auth==2.15.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googleapis-common-protos==1.57.0
grpcio==1.51.1
h5py==3.7.0
idna==3.4
importlib-resources==5.10.1
ipython==8.7.0
jedi==0.18.2
kaggle==1.5.12
keras==2.11.0
keras-tuner==1.1.3
kt-legacy==1.0.4
libclang==14.0.6
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib-inline==0.1.6
numpy==1.23.5
oauthlib==3.2.2
opt-einsum==3.3.0
packaging==22.0
pandas==1.5.2
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.3.0
promise==2.3
prompt-toolkit==3.0.36
protobuf==3.19.6
ptyprocess==0.7.0
pure-eval==0.2.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pydot==1.4.2
Pygments==2.13.0
pyparsing==3.0.9
python-dateutil==2.8.2
python-slugify==7.0.0
pytz==2022.6
PyYAML==6.0
requests==2.28.1
requests-oauthlib==1.3.1
rsa==4.9
scipy==1.9.3
six==1.16.0
stack-data==0.6.2
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.11.0
tensorflow-datasets==4.7.0
tensorflow-estimator==2.11.0
tensorflow-hub==0.12.0
tensorflow-io-gcs-filesystem==0.28.0
tensorflow-metadata==1.12.0
termcolor==2.1.1
text-unidecode==1.3
toml==0.10.2
tqdm==4.64.1
traitlets==5.7.1
typing_extensions==4.4.0
urllib3==1.26.13
wcwidth==0.2.5
Werkzeug==2.2.2
wrapt==1.14.1
zipp==3.11.0
TF Devices:
- PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
- PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
CPU cores: 12
Date rendered: 2022-12-16
Page render time: 7 seconds