Text Classification with TF Hub

Use a pretrained model from TF Hub to classify text

This tutorial classifies movie reviews as positive or negative using the text of the review. This is an example of binary—or two-class—classification, an important and widely applicable kind of machine learning problem.

The tutorial demonstrates the basic application of transfer learning with TensorFlow Hub and Keras.

It uses the IMDB dataset that contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

This notebook uses keras, a high-level API for building and training models in TensorFlow, and tfhub, a library for loading trained models from TensorFlow Hub in a single line of code. For a more advanced text classification tutorial using Keras, see the MLCC Text Classification Guide.

library(tensorflow)
library(tfhub)
library(keras)

Download the IMDB dataset

The IMDB dataset is available as imdb reviews or through TensorFlow Datasets. The following code downloads the IMDB dataset to your machine:

if (dir.exists("aclImdb/"))
  unlink("aclImdb/", recursive = TRUE)
url <- "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset <- get_file(
  "aclImdb_v1",
  url,
  untar = TRUE,
  cache_dir = '.',
  cache_subdir = ''
)
unlink("aclImdb/train/unsup/", recursive = TRUE)

We can then create a TensorFlow dataset from the directory structure using the text_dataset_from_directory() function:

batch_size <- 512
seed <- 42

train_data <- text_dataset_from_directory(
  'aclImdb/train',
  batch_size = batch_size,
  validation_split = 0.2,
  subset = 'training',
  seed = seed
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.

validation_data <- text_dataset_from_directory(
  'aclImdb/train',
  batch_size = batch_size,
  validation_split = 0.2,
  subset = 'validation',
  seed = seed
)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.

test_data <- text_dataset_from_directory(
  'aclImdb/test',
  batch_size = batch_size
)

Found 25000 files belonging to 2 classes.
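
As a quick sanity check, the split sizes and the number of batches per pass can be worked out by hand. The sketch below uses only the counts printed above; the variable names are illustrative and are not part of the keras API.

```r
# Counts taken from the output above.
n_train_dir  <- 25000  # files under aclImdb/train
n_test_dir   <- 25000  # files under aclImdb/test
val_fraction <- 0.2    # same value as validation_split
per_batch    <- 512    # same value as batch_size

n_validation <- round(n_train_dir * val_fraction)  # 5000
n_training   <- n_train_dir - n_validation         # 20000

# The last batch may hold fewer than per_batch reviews, hence ceiling().
steps <- c(
  training   = ceiling(n_training / per_batch),
  validation = ceiling(n_validation / per_batch),
  test       = ceiling(n_test_dir / per_batch)
)
print(steps)
```

The training and test values (40 and 49) correspond to the 40/40 and 49/49 step counters that appear in the training and evaluation logs later in this tutorial.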

Explore the data

Let’s take a moment to understand the format of the data. Each example is a sentence representing the movie review and a corresponding label. The sentence is not preprocessed in any way. The label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review.
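
The 0 and 1 labels come from the directory layout: with inferred labels, the class subdirectories are numbered in alphanumeric order, so neg maps to 0 and pos maps to 1. The sketch below mimics that mapping on a tiny throwaway directory built in tempdir(). The toy_reviews folder and its two files are made up for illustration, and the code only reproduces the ordering rule in base R rather than calling keras.

```r
# Build a toy directory tree with the same layout as aclImdb/train.
root <- file.path(tempdir(), "toy_reviews")
dir.create(file.path(root, "neg"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "pos"), recursive = TRUE, showWarnings = FALSE)
writeLines("Dull and far too long.", file.path(root, "neg", "0.txt"))
writeLines("A wonderful film.", file.path(root, "pos", "0.txt"))

# Class names are the subdirectory names, sorted alphanumerically.
class_names <- sort(list.dirs(root, full.names = FALSE, recursive = FALSE))

# Labels are zero-based positions in that sorted list.
labels <- setNames(seq_along(class_names) - 1L, class_names)
print(labels)
```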

Let’s print the first example.

batch <- train_data %>%
  reticulate::as_iterator() %>%
  reticulate::iter_next()

batch[[1]][1]

tf.Tensor(b'Upon seeing this film once again it appeared infinitely superior to me this time than the previous times I have viewed it. The acting is stunningly wonderful. The characters are very clearly drawn. Brad Pitt is simply superb as the errant son who rebels. The other actors and actresses are equally fine in every respect. Robert Redford creates a wonderful period piece from the days of speakeasies of the 1920s. The scenery is incredibly beautiful of the mountains and streams of western Montana. All in all, this is one of the finest films made in the 1990s.<br /><br />You must see this movie!<br /><br />', shape=(), dtype=string)

Let’s also print the first 10 labels.

batch[[2]][1:10]

tf.Tensor([1 0 1 0 0 0 1 1 0 0], shape=(10), dtype=int32)

Build the model

The neural network is created by stacking layers—this requires three main architectural decisions:

How to represent the text?
How many layers to use in the model?
How many hidden units to use for each layer?

In this example, the input data consists of sentences. The labels to predict are either 0 or 1.

One way to represent the text is to convert sentences into embedding vectors. Use a pre-trained text embedding as the first layer, which has three advantages:

You don’t have to worry about text preprocessing.
You benefit from transfer learning.
The embedding has a fixed size, so it’s simpler to process.

For this example you use a pre-trained text embedding model from TensorFlow Hub called google/nnlm-en-dim50/2.

There are many other pre-trained text embeddings from TFHub that can be used in this tutorial:

google/nnlm-en-dim128/2 - trained with the same NNLM architecture on the same data as google/nnlm-en-dim50/2, but with a larger embedding dimension. Higher-dimensional embeddings can improve performance on your task, but the model may take longer to train.
google/nnlm-en-dim128-with-normalization/2 - the same as google/nnlm-en-dim128/2, but with additional text normalization such as removing punctuation. This can help if the text in your task contains additional characters or punctuation.
google/universal-sentence-encoder/4 - a much larger model yielding 512 dimensional embeddings trained with a deep averaging network (DAN) encoder.

And many more! Find more text embedding models on TFHub.

Let’s first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a couple of input examples. Note that no matter the length of the input text, the output shape of the embeddings is: (num_examples, embedding_dimension).

embedding <- "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer <- tfhub::layer_hub(handle = embedding, trainable = TRUE)
hub_layer(batch[[1]][1:2])

tf.Tensor(
[[ 0.25639078  0.38771343  0.11458009  0.46377155 -0.27114585 -0.2354866
  -0.05462556  0.05912565 -0.54671913  0.31118712 -0.16002831 -0.07053422
  -0.24705385  0.09001825 -0.04209795 -0.33874807 -0.24183154 -0.3230999
   0.10837324 -0.6382228   0.07474955 -0.47535452  0.40693292  0.31290907
  -0.15077835  0.16694838 -0.63673955  0.18927398  0.44574216 -0.24568918
  -0.46415138  0.2513454   0.14228602 -0.44085872 -0.2652811   0.09904837
   0.18815234 -0.05307332  0.26779363 -0.6057924  -0.27559575  0.05044953
  -0.48596275  0.21479745 -0.17461553 -0.6422216  -0.3165064  -0.33656728
  -0.09484116 -0.07192937]
 [ 1.2899796   0.32863247 -0.00310844  0.8232223  -0.40982845 -0.5109542
   0.08370162 -0.13269287 -1.1700541   0.55316675 -0.05031178  0.14853314
  -0.15995869  0.26997143 -0.3404822  -0.49329752 -0.25939593  0.03390278
   0.25013074 -1.417716    0.19143656 -0.2392007   1.2250862   0.41607675
  -0.66565406  0.42407426 -1.2803619   0.47229245  0.53426725 -0.84027433
  -0.7578848   0.44375423  0.57404596 -0.5191065  -0.67364585  0.6255407
   0.54375225  0.22559978  0.17738008 -1.0557281   0.03807904  0.44274876
  -0.45797464  0.17220229 -0.2047742  -0.3091375  -0.7907681  -0.723012
   0.00783113 -0.0088165 ]], shape=(2, 50), dtype=float32)
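
To see why the output shape does not depend on the length of the input, here is a toy sketch in base R. It is not the NNLM model: the five-word vocabulary, the 4-dimensional random vectors, and the choice to combine token vectors by summing and dividing by the square root of the token count are all assumptions made for illustration.

```r
set.seed(1)

# Hypothetical vocabulary with random 4-dimensional token vectors.
toy_vocab <- c("a", "great", "film", "truly", "dull")
toy_dim   <- 4
toy_table <- matrix(
  rnorm(length(toy_vocab) * toy_dim),
  nrow = length(toy_vocab),
  dimnames = list(toy_vocab, NULL)
)

embed_sentence <- function(sentence) {
  # Split on whitespace and keep only tokens found in the toy vocabulary.
  tokens <- strsplit(tolower(sentence), "\\s+")[[1]]
  tokens <- tokens[tokens %in% rownames(toy_table)]
  # Combine the token vectors into a single fixed-size vector.
  colSums(toy_table[tokens, , drop = FALSE]) / sqrt(max(1, length(tokens)))
}

sentences <- c("great film", "a truly great film a truly dull film")
embedded  <- t(sapply(sentences, embed_sentence))
dim(embedded)  # 2 rows (sentences), 4 columns (embedding dimension)
```

Both sentences end up as rows of the same width even though one has two tokens and the other has eight, which mirrors the (num_examples, embedding_dimension) shape described above.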

Let’s now build the full model:

model <- keras_model_sequential() %>%
  hub_layer() %>%
  layer_dense(16, activation = 'relu') %>%
  layer_dense(1)

summary(model)

Model: <no summary available, model was not built>

The layers are stacked sequentially to build the classifier:

The first layer is a TensorFlow Hub layer. This layer uses a pre-trained SavedModel to map a sentence into its embedding vector. The pre-trained text embedding model that you are using (google/nnlm-en-dim50/2) splits the sentence into tokens, embeds each token, and then combines the embeddings. The resulting dimensions are: (num_examples, embedding_dimension). For this NNLM model, the embedding_dimension is 50.
This fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units.
The last layer is densely connected with a single output node.
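
The two dense layers can be sketched in base R to make the shapes concrete. The weights below are random stand-ins, not the trained values, and the input rows are fake 50-dimensional embeddings. The sketch only illustrates the arithmetic of a 50 to 16 to 1 stack with a ReLU in the middle, plus the usual weights-plus-biases parameter count for each dense layer.

```r
set.seed(42)
embedding_dim <- 50
hidden_units  <- 16

# Parameters per dense layer: one weight per input-output pair plus one bias per unit.
params_hidden <- embedding_dim * hidden_units + hidden_units  # 816
params_output <- hidden_units * 1 + 1                         # 17

# Random stand-ins for the learned weights and biases.
W1 <- matrix(rnorm(embedding_dim * hidden_units, sd = 0.1), embedding_dim, hidden_units)
b1 <- rep(0, hidden_units)
W2 <- matrix(rnorm(hidden_units, sd = 0.1), hidden_units, 1)
b2 <- 0

relu <- function(x) pmax(x, 0)

forward <- function(x) {
  hidden <- relu(sweep(x %*% W1, 2, b1, "+"))  # shape: (num_examples, 16)
  drop(hidden %*% W2 + b2)                     # one logit per example
}

x      <- matrix(rnorm(2 * embedding_dim), nrow = 2)  # two fake sentence embeddings
logits <- forward(x)

# The model outputs logits; a sigmoid turns them into probabilities of the positive class.
probs <- 1 / (1 + exp(-logits))
print(probs)
```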

Let’s compile the model.

Loss function and optimizer

A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs logits (a single-unit layer with a linear activation), you’ll use the binary_crossentropy loss function with from_logits = TRUE.

This isn’t the only choice for a loss function; you could, for instance, choose mean_squared_error. Generally, though, binary_crossentropy is better for dealing with probabilities: it measures the “distance” between probability distributions, or in our case, between the ground-truth distribution and the predictions.

Later, when you are exploring regression problems (say, to predict the price of a house), you’ll see how to use another loss function called mean squared error.
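
To make the loss concrete, here is a small base R sketch of binary cross-entropy computed directly from logits, checked against the same loss computed from sigmoid probabilities. The labels and logits are made-up values, and the helper functions are illustrative rather than the Keras implementation. The from-logits form uses the standard numerically stable rearrangement max(z, 0) - z * y + log(1 + exp(-|z|)).

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Cross-entropy computed from raw logits (numerically stable form).
bce_from_logits <- function(y, z) {
  mean(pmax(z, 0) - z * y + log1p(exp(-abs(z))))
}

# Cross-entropy computed from probabilities.
bce_from_probs <- function(y, p) {
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y <- c(1, 0, 1, 0)            # made-up ground-truth labels
z <- c(2.0, -1.0, -0.5, 3.0)  # made-up logits

loss_logits <- bce_from_logits(y, z)
loss_probs  <- bce_from_probs(y, sigmoid(z))
all.equal(loss_logits, loss_probs)  # the two forms agree

# Mean squared error on the same predictions, for comparison.
mse <- mean((y - sigmoid(z))^2)
```

The last example (label 0, logit 3.0) is a confident wrong prediction. Cross-entropy penalizes it without an upper bound as the logit grows, whereas its squared error can never exceed 1.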

Now, configure the model to use an optimizer and a loss function:

model %>% compile(
  optimizer = 'adam',
  loss = loss_binary_crossentropy(from_logits = TRUE),
  metrics = 'accuracy'
)

Train the model

Train the model for 10 epochs in mini-batches of 512 samples. This is 10 iterations over all samples in the training dataset. While training, monitor the model’s loss and accuracy on the 5,000 samples from the validation set:

history <- model %>% fit(
  train_data,
  epochs = 10,
  validation_data = validation_data,
  verbose = 1
)

Epoch 1/10
40/40 - 7s - loss: 0.6572 - accuracy: 0.5390 - val_loss: 0.5898 - val_accuracy: 0.6238 - 7s/epoch - 183ms/step
Epoch 2/10
40/40 - 6s - loss: 0.5057 - accuracy: 0.7344 - val_loss: 0.4545 - val_accuracy: 0.7598 - 6s/epoch - 141ms/step
Epoch 3/10
40/40 - 5s - loss: 0.3572 - accuracy: 0.8533 - val_loss: 0.3606 - val_accuracy: 0.8518 - 5s/epoch - 136ms/step
Epoch 4/10
40/40 - 5s - loss: 0.2573 - accuracy: 0.9015 - val_loss: 0.3170 - val_accuracy: 0.8590 - 5s/epoch - 136ms/step
Epoch 5/10
40/40 - 5s - loss: 0.1915 - accuracy: 0.9293 - val_loss: 0.2995 - val_accuracy: 0.8680 - 5s/epoch - 134ms/step
Epoch 6/10
40/40 - 5s - loss: 0.1415 - accuracy: 0.9525 - val_loss: 0.2984 - val_accuracy: 0.8770 - 5s/epoch - 137ms/step
Epoch 7/10
40/40 - 5s - loss: 0.1034 - accuracy: 0.9711 - val_loss: 0.2986 - val_accuracy: 0.8752 - 5s/epoch - 135ms/step
Epoch 8/10
40/40 - 5s - loss: 0.0760 - accuracy: 0.9807 - val_loss: 0.3069 - val_accuracy: 0.8754 - 5s/epoch - 134ms/step
Epoch 9/10
40/40 - 4s - loss: 0.0546 - accuracy: 0.9885 - val_loss: 0.3195 - val_accuracy: 0.8748 - 4s/epoch - 104ms/step
Epoch 10/10
40/40 - 5s - loss: 0.0401 - accuracy: 0.9938 - val_loss: 0.3320 - val_accuracy: 0.8724 - 5s/epoch - 119ms/step
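
The log can also be read numerically. The sketch below copies the per-epoch values by hand from the output above into plain vectors and finds the epoch with the lowest validation loss. In a live session the same numbers should be available from the object returned by fit() (assumed here to be exposed as history$metrics), but the sketch does not depend on that.

```r
# Values copied from the training log above.
train_loss <- c(0.6572, 0.5057, 0.3572, 0.2573, 0.1915,
                0.1415, 0.1034, 0.0760, 0.0546, 0.0401)
val_loss   <- c(0.5898, 0.4545, 0.3606, 0.3170, 0.2995,
                0.2984, 0.2986, 0.3069, 0.3195, 0.3320)
val_acc    <- c(0.6238, 0.7598, 0.8518, 0.8590, 0.8680,
                0.8770, 0.8752, 0.8754, 0.8748, 0.8724)

best_epoch <- which.min(val_loss)  # epoch with the lowest validation loss
best_epoch
val_acc[best_epoch]

# Training loss falls every epoch, while validation loss rises after best_epoch.
all(diff(train_loss) < 0)
all(diff(val_loss[best_epoch:10]) > 0)
```

In this run, validation loss is lowest at epoch 6 and then drifts upward while training loss keeps shrinking, so the later epochs mostly fit the training set more closely without improving validation performance.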

Evaluate the model

Now let’s see how the model performs. Two values are returned: loss (a number that represents the error; lower values are better) and accuracy.

results <- model %>% evaluate(test_data, verbose = 2)

49/49 - 1s - loss: 0.3647 - accuracy: 0.8550 - 677ms/epoch - 14ms/step

results

    loss accuracy 
0.364683 0.855040
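
Since the result prints as a named numeric vector, individual metrics can be pulled out by name. The sketch below rebuilds a vector with the values shown above (so it runs without the trained model) and converts the accuracy into a percentage and a count of misclassified reviews. The name reported is illustrative.

```r
# Values copied from the evaluation output above.
reported <- c(loss = 0.364683, accuracy = 0.855040)
n_test   <- 25000  # number of reviews in the test set

accuracy_pct <- round(reported[["accuracy"]] * 100, 1)  # percentage, one decimal
n_correct    <- round(reported[["accuracy"]] * n_test)  # correctly classified reviews
n_wrong      <- n_test - n_correct                      # misclassified reviews

c(accuracy_pct = accuracy_pct, n_correct = n_correct, n_wrong = n_wrong)
```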

This fairly naive approach achieves a test accuracy of about 85.5%. With more advanced approaches, the model should get closer to 95%.