Bigram Based LSTM with Regularization
In the previous LSTM tutorial we used a single character at a time; the bigram approach is to predict a character using two characters at a time. This tutorial will also introduce a regularization technique known as Dropout in RNNs.
From the TensorFlow perspective, the code remains almost the same; the slight differences are explained below.
For bigrams, we will introduce an embedding vector, as we did in word2vec. We use an embedding because the number of possible bigrams is too large, and feeding them directly as one-hot encodings would lead to wasteful computations.
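To make the size difference concrete, here is a minimal back-of-the-envelope sketch, assuming the 27-character vocabulary (a-z plus space) from the previous tutorial:
vocabulary_size = 27                              # a-z plus space, as in the previous tutorial
embedding_size = 128                              # dimension of the dense embedding vector
num_bigrams = vocabulary_size * vocabulary_size   # 27 * 27 = 729 possible bigrams
# A one-hot bigram input would need 729 mostly-zero columns per example;
# the embedding lookup replaces it with a dense 128-dimensional row.
print(num_bigrams, embedding_size)                # 729 128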
The following code is a modified version of the previous code. We have changed the input dimension to the embedding size.
embedding_size = 128 # Dimension of the embedding vector.
num_nodes = 64
graph = tf.Graph()
with graph.as_default():
    # Parameters:
    vocabulary_embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size * vocabulary_size, embedding_size], -1.0, 1.0))
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.
    cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
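For reference, the lstm_cell function itself is unchanged from the previous tutorial; only its input i is now an embedding_size-wide embedding instead of a one-hot vector. Inside the same graph block it looks roughly like this:
    # Definition of a single LSTM cell step, as in the previous tutorial.
    def lstm_cell(i, o, state):
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state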
Since we are using two characters at a time, we need to shift the labels by two steps instead of one.
# Input data.
train_data = list()
for _ in range(num_unrollings + 1):
    train_data.append(tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
train_chars = train_data[:num_unrollings]
train_inputs = zip(train_chars[:-1], train_chars[1:])
train_labels = train_data[2:]  # Labels are the inputs shifted by two time steps.
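To see how the pairing and the two-step shift line up, here is a tiny stand-alone illustration in plain Python (the strings are just stand-ins for the placeholders, and num_unrollings is shrunk to 4 for readability):
chars = ['c0', 'c1', 'c2', 'c3', 'c4']      # num_unrollings + 1 = 5 placeholders
num_unrollings = 4
train_chars = chars[:num_unrollings]        # ['c0', 'c1', 'c2', 'c3']
train_inputs = list(zip(train_chars[:-1], train_chars[1:]))
train_labels = chars[2:]
print(train_inputs)   # [('c0', 'c1'), ('c1', 'c2'), ('c2', 'c3')] -- overlapping bigrams
print(train_labels)   # ['c2', 'c3', 'c4'] -- the character two steps after each bigram's start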
The embedding requires a lookup: we call tf.nn.embedding_lookup on vocabulary_embeddings with bigram_index, where the latter is the index of the bigram computed from the two one-hot characters using the argmax function.
As for dropout, we apply it only to the inputs and outputs of the cell, not to the recurrent connections; the reasoning behind this choice is explained in detail in this paper.
# Unrolled LSTM loop; output, state and the outputs list start out as in the previous tutorial.
outputs = list()
output = saved_output
state = saved_state
for i in train_inputs:
    # Combine the two one-hot characters into a single bigram index.
    bigram_index = tf.argmax(i[0], dimension=1) + vocabulary_size * tf.argmax(i[1], dimension=1)
    i_embed = tf.nn.embedding_lookup(vocabulary_embeddings, bigram_index)
    # Dropout on the cell input and output only (keep probability 0.7).
    drop_i = tf.nn.dropout(i_embed, 0.7)
    output, state = lstm_cell(drop_i, output, state)
    drop_o = tf.nn.dropout(output, 0.7)
    outputs.append(drop_o)
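The label for each bigram is still a single one-hot character, so the classifier and loss that consume outputs can stay exactly as before. As a reminder, they look roughly like this in TensorFlow 1.x style (w, b, saved_output and saved_state are the classifier weights and saved-state variables from the previous tutorial; very old TensorFlow versions use the reversed tf.concat argument order):
# State saving across unrollings, then classifier and loss over all unrolled outputs.
with tf.control_dependencies([saved_output.assign(output),
                              saved_state.assign(state)]):
    logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))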
A similar modification is made in the validation and sampling part; you will see it in the full code, which you can get from here.
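Concretely, one way to adapt the sampling graph is to turn sample_input into a list of two one-hot placeholders and build the bigram index from them before the embedding lookup; a rough sketch (w and b are again the classifier weights from the previous tutorial):
# Sampling and validation eval: batch size 1, fed two characters at a time.
sample_input = [tf.placeholder(tf.float32, shape=[1, vocabulary_size]) for _ in range(2)]
sample_bigram = (tf.argmax(sample_input[0], dimension=1) +
                 vocabulary_size * tf.argmax(sample_input[1], dimension=1))
sample_embed = tf.nn.embedding_lookup(vocabulary_embeddings, sample_bigram)
saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
reset_sample_state = tf.group(saved_sample_output.assign(tf.zeros([1, num_nodes])),
                              saved_sample_state.assign(tf.zeros([1, num_nodes])))
sample_output, sample_state = lstm_cell(sample_embed, saved_sample_output, saved_sample_state)
with tf.control_dependencies([saved_sample_output.assign(sample_output),
                              saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
Note that no dropout is applied here: dropout is only used on the training path.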
The training part has a few lines that need to be replaced; since we have two characters at a time we need to feed two inputs, so we will introduce a for loop and the following two statements.
sentence = characters(feed[0])[0] + characters(feed[1])[0]
And:
predictions = sample_prediction.eval({
    sample_input[0]: b[0],
    sample_input[1]: b[1]
})
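Putting these statements in context, the random-text generation loop from the previous tutorial might end up looking roughly like this, with the feed window sliding one character at a time (sample, random_distribution and characters are the helper functions from the earlier tutorial; the validation loop feeds b[0] and b[1] from valid_batches in the same way):
for _ in range(5):
    # Start from two random one-hot characters.
    feed = [sample(random_distribution()), sample(random_distribution())]
    sentence = characters(feed[0])[0] + characters(feed[1])[0]
    reset_sample_state.run()
    for _ in range(79):
        prediction = sample_prediction.eval({sample_input[0]: feed[0],
                                             sample_input[1]: feed[1]})
        # Slide the bigram window: keep the last character, append the sampled one.
        feed = [feed[1], sample(prediction)]
        sentence += characters(feed[1])[0]
    print(sentence)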
The bigram-based model actually increases the perplexity compared to the previous model, in which we were predicting character by character. And even after adding dropout the performance remains about the same.
Dropout does not always improve a model's performance; we have already seen with plain neural networks that dropout can also decrease accuracy, and this depends on the setting in which it is used. Dropout usually pays off late in training, and the current 10,000 iterations are far too few for it to show a benefit.
I would recommend you try it out and share your results with the community, especially if dropout helps.