TensorFlow - NLP
Some notes on NLP and TensorFlow.
Tokenize
TensorFlow has a tokenizer:
from tensorflow.keras.preprocessing.text import Tokenizer
You can also specify a num_words param when instantiating the tokenizer; only the most frequent num_words - 1 words are then kept when converting texts to sequences.
word_index - a dict mapping each word to its token value (it always holds the full vocabulary, regardless of num_words)
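A minimal sketch of fitting the tokenizer (the sentences here are just placeholder examples):

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['I love my dog', 'I love my cat']  # example corpus

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)  # e.g. {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}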
Unknown Words
When we create the tokenizer, we can specify an Out Of Vocabulary (OOV) token - when texts_to_sequences encounters a word unknown to the tokenizer, it returns the OOV token's value instead of dropping the word. This maintains some structure and keeps the length of the output sequence the same as the number of words in the input sentence.
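A sketch of this, reusing the example corpus from above (the '<OOV>' string is just a convention; any token not in the corpus works):

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['I love my dog', 'I love my cat']  # example corpus

# oov_token gets index 1; unseen words map to it rather than being dropped
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
print(tokenizer.texts_to_sequences(['I love my goldfish']))  # e.g. [[2, 3, 4, 1]]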
Sequences
Convert samples to sequences of numbers based on the results of tokenization.
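Continuing with the OOV tokenizer sketched above, the conversion is a single call:

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)  # e.g. [[2, 3, 4, 5], [2, 3, 4, 6]] for the example corpus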
Variable Length Inputs
Ragged Tensors
Ragged tensors are used for variable length inputs - each row can have a different length, so no padding is needed.
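A minimal sketch:

import tensorflow as tf

# Each row can have a different length
ragged = tf.ragged.constant([[2, 3, 4, 5], [2, 3], [6]])
print(ragged.shape)  # (3, None) - the second dimension is ragged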
Padding
Pre- or post-pad all samples (with 0) to match the length of the longest sentence in the corpus; the padding param defaults to 'pre'.
You can also specify a maxlen; sentences longer than maxlen are cut at the start or the end according to the truncating param.
from tensorflow.keras.preprocessing.sequence import pad_sequences
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences)
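A sketch with the padding, maxlen and truncating params set explicitly (the sequences here are arbitrary examples):

from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[2, 3, 4, 5], [2, 3], [6, 7, 8, 9, 10, 11]]  # example sequences

# padding and truncating both default to 'pre'; 'post' pads/cuts at the end instead
padded = pad_sequences(sequences, padding='post', maxlen=5, truncating='post')
print(padded)
# [[ 2  3  4  5  0]
#  [ 2  3  0  0  0]
#  [ 6  7  8  9 10]]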
Recurrent Neural Networks
Takes the order of the data into account when learning, e.g. the order of the words in a sentence. The traditional sentiment approach only takes the individual word vectors into account, not the order of the words.
Consider the Fibonacci sequence:
As a computation graph:
We pass the second operand of the first computation (the number 2) on to the next stage along with the result (which is 3).
This continues, with the result of each computation being fed forward to the next.
This means that all elements in the series are part of the current value.
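A plain-Python sketch of that feed-forward idea, starting from 1 and 2 as above:

# Fibonacci as a recurrence: the state carried into each step is
# (one operand of the previous computation, its result) - analogous to an RNN's hidden state
a, b = 1, 2
for _ in range(8):
    a, b = b, a + b  # feed the previous result (and one operand) forward
    print(b)         # 3, 5, 8, 13, ...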
Recurrent Neurons
A function that gets an input, x, and produces an output, y. It also outputs a feed-forward value, F.
They can be grouped together like so:
When we pass in x0, it outputs y0 and also outputs a value to be fed into the next neuron.
The next neuron gets the output from the previous neuron and also gets x1 which it then uses to calculate y1. It also outputs a feed forward value.
Thus the sequence is encoded into the output.
The input at position 1 still has an impact on the output at position 100, even if that impact is smaller than the impact from the input at position 99.
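A minimal sketch of a recurrent layer in Keras (the vocab size, embedding dimension and unit counts are arbitrary):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 16),           # vocab size, embedding dim
    tf.keras.layers.SimpleRNN(32),                  # hidden state carries context between timesteps
    tf.keras.layers.Dense(1, activation='sigmoid')  # e.g. binary sentiment
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])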
LSTMs
LSTM - Long Short-Term Memory
Over a long distance, context becomes diluted. The LSTM architecture introduces a cell state that carries context forward across timesteps, so early words can still influence much later ones.
The LSTM can also be wrapped in a Bidirectional layer, which runs it over the sequence in both directions and allows more recent words to provide context for words at the start.
To do this in Keras, we introduce a bidirectional LSTM layer:
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))
The 64 is the number of hidden units in the LSTM layer and sets the size of its output; because the Bidirectional wrapper concatenates the forward and backward outputs, the output shape of the wrapped layer is 128.
You can also stack LSTM layers:
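A sketch of a stacked pair (the layer sizes and vocab size are arbitrary):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 64),
    # return_sequences=True so the next LSTM receives the full sequence of outputs,
    # not just the output from the final timestep
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])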
Note we've had to add return_sequences=True for all but the final LSTM layer so that each LSTM passes its full sequence of outputs on to the next one.