Meme Text Generation with a Deep Convolutional Network in Keras & Tensorflow

The goal of this post is to describe end-to-end how to build a deep conv net for text generation, but in greater depth than some of the existing articles I’ve read. This will be a practical guide and while I suggest many best practices, I am not an expert in deep learning theory nor have I read every single relevant research paper. I’ll cover takeaways about data cleaning, training, model design, and prediction algorithms.

Step 1: Building training data

The raw dataset we’ll draw from is ~100M public meme captions by users of the Imgflip Meme Generator. To speed up training and reduce complexity of the model, we only use the 48 most popular memes and exactly 20,000 captions per meme, totaling 960,000 captions as training data. However, since we are building a generative model there will be one training example for each character in the caption, totaling ~45,000,000 training examples. Character-level generation rather than word-level was chosen here because memes tend to use spelling and grammar… uh… creatively. Also, character-level deep learning is a superset of word-level deep learning and can therefore achieve higher accuracy if you have enough data and your model design is sufficient to learn all the complexity. If you try the finished model below, you’ll also see that char-level can be more fun!

Below is what the training data looks like if the first meme caption is “make all the memes”. I’m omitting the code for reading from the database and performing initial cleaning because it’s very standard and could be done in multiple ways.

training_data = [
    ["000000061533  0  ", "m"],
    ["000000061533  0  m", "a"],
    ["000000061533  0  ma", "k"],
    ["000000061533  0  mak", "e"],
    ["000000061533  0  make", "|"],
    ["000000061533  1  make|", "a"],
    ["000000061533  1  make|a", "l"],
    ["000000061533  1  make|al", "l"],
    ["000000061533  1  make|all", " "],
    ["000000061533  1  make|all ", "t"],
    ["000000061533  1  make|all t", "h"],
    ["000000061533  1  make|all th", "e"],
    ["000000061533  1  make|all the", " "],
    ["000000061533  1  make|all the ", "m"],
    ["000000061533  1  make|all the m", "e"],
    ["000000061533  1  make|all the me", "m"],
    ["000000061533  1  make|all the mem", "e"],
    ["000000061533  1  make|all the meme", "s"],
    ["000000061533  1  make|all the memes", "|"],
    # ... 45 million more rows here ...
]
# we'll need our feature text and labels as separate arrays later
texts = [row[0] for row in training_data]
labels = [row[1] for row in training_data]

Like most things in machine learning, this is just a classification problem. We are classifying the text strings on the left into one of ~70 different buckets where the buckets are characters.

Let’s unpack the format.

  • The first 12 characters are the meme template ID. This allows the model to differentiate between the 48 distinct memes we’re feeding it. The string is left padded with zeros so all IDs are the same length.
  • The 0 or 1 is the index of the current text box being predicted, generally 0 is the top box and 1 is the bottom box, although many memes are more complex. The two spaces are just extra spacing to ensure the model can tell the box index apart from the template ID and meme text. Note: it is critical that our convolution kernel width (seen later in this post) is no wider than the 4 spaces plus the index character, aka ≤ 5.
  • After that is the text of the meme so far, with | used as the end-of-text-box character.
  • Finally, the last character by itself (the 2nd array item) is the next character in the sequence.
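To make the format concrete, here is a minimal sketch of how rows like these could be generated from a single caption. `caption_to_rows` is a hypothetical helper written for illustration; it is not the omitted database code, and it assumes the caption has already been cleaned and split into one string per text box.

```python
def caption_to_rows(template_id, boxes):
    """Expand one caption into [feature_text, next_char] training rows.

    Hypothetical helper: `boxes` is a list holding the cleaned text of each
    text box, in order.
    """
    prefix_id = str(template_id).zfill(12)  # left-pad the template ID with zeros
    rows = []
    text_so_far = ''
    for box_index, box_text in enumerate(boxes):
        # every box is terminated by the end-of-text-box character
        for char in box_text + '|':
            feature = prefix_id + '  ' + str(box_index) + '  ' + text_so_far
            rows.append([feature, char])
            text_so_far += char
    return rows


rows = caption_to_rows(61533, ['make', 'all the memes'])
print(len(rows))   # one row per character, including both '|' markers
print(rows[0])
print(rows[-1])
```

Each row's label is simply the character that extends the feature text of that row into the feature text of the next one, which is why one caption yields as many rows as it has characters.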

Several cleaning techniques were used on the data before training:

  • Trim leading and trailing whitespace and replace repeated whitespace (\s+) with a single space character.
  • Apply a minimum string length of 10 characters so we don’t generate boring one-word or one-letter memes.
  • Apply a maximum string length of 82 characters so we don’t generate super long memes and because the model will train faster. 82 is arbitrary; it just made the overall training strings about 100 characters.
  • Convert everything to lowercase to reduce the number of characters the model must learn, and because many memes are just all caps anyway.
  • Skip meme captions with non-ascii characters to reduce the complexity the model has to learn. This means that both our feature text and labels will come from a set of only ~70 characters, depending on which ascii characters the training data happens to include.
  • Skip meme captions containing the pipe character | since it’s our special end-of-text-box character.
  • Run the text through a language detection library and skip meme captions that are unlikely to be English. This improves the quality of the text we generate, since the model only has to learn one language and identical character sequences can have different meanings in different languages.
  • Skip duplicate meme captions we’ve already added to the training set to reduce the chance the model simply memorizes entire meme captions.
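Most of these steps fit in a few lines of Python. The sketch below is illustrative only: `clean_caption` is a hypothetical name, the length limits are applied to the whole caption (an assumption), and the language detection step is left out because it depends on a third-party library.

```python
import re

MIN_LENGTH = 10
MAX_LENGTH = 82


def clean_caption(raw, seen):
    """Return the cleaned caption, or None if it should be skipped.

    `seen` is a set of captions already accepted, used to drop duplicates.
    """
    # trim, collapse repeated whitespace, and lowercase
    text = re.sub(r'\s+', ' ', raw.strip()).lower()
    if not MIN_LENGTH <= len(text) <= MAX_LENGTH:
        return None
    if any(ord(char) > 127 for char in text):  # non-ASCII character
        return None
    if '|' in text:  # reserved as the end-of-text-box character
        return None
    if text in seen:  # duplicate of a caption we already kept
        return None
    seen.add(text)
    return text


seen = set()
print(clean_caption('  MAKE   all the\tMemes ', seen))
print(clean_caption('make all the memes', seen))  # skipped as a duplicate
print(clean_caption('lol', seen))                 # skipped as too short
```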

Our data is now ready to feed into a neural net!

Step 2: Data Tensorization

That may or may not be a word. Fun fact: apparently we in deep learning are heathens for calling multi-dimensional arrays tensors rather than the mathematical term “holors”, which is a generalized tensor not requiring particular transformation properties. But whatever, tensor sounds cooler than holor 😉

First, here is the python import code for everything we’ll do below:

from keras import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import ModelCheckpoint
# BatchNormalization is imported from keras.layers, which works across Keras versions
from keras.layers import Dense, Dropout, GlobalMaxPooling1D, Conv1D, MaxPooling1D, Embedding, BatchNormalization
import numpy as np
import random # used by the prediction code in Step 5
import util # util is a custom file I wrote, see github link below

Neural nets operate on tensors of numbers (vectors/matrices/multi-dimensional arrays), so we need to restructure our text accordingly. Each of our training texts will be transformed into an array of integers (a rank 1 tensor) by replacing each character with its corresponding index from the array of ~70 unique characters found in the data. The order of the character array is arbitrary, but we choose to order it by character frequency so it stays roughly consistent when changing the amount of training data. Keras has a Tokenizer class which you can use for this (with char_level=True), but I wrote my own util functions because they were much faster than the Keras tokenizer.

# output: {' ': 1, '0': 2, 'e': 3, ... }
char_to_int = util.map_char_to_int(texts)
# output: [[2, 2, 27, 11, ...], ... ]
sequences = util.texts_to_sequences(texts, char_to_int)
labels = [char_to_int[char] for char in labels]
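The `util` functions themselves are not shown in this post. As a rough idea of what they do, here is a minimal sketch under the assumptions stated in the text: indices start at 1, are ordered by character frequency, and 0 is left free for padding. The `_sketch` names are hypothetical and the real implementations may differ.

```python
from collections import Counter


def map_char_to_int_sketch(texts):
    """Map each character to an integer, most frequent character first.

    Index 0 is left unused so it can serve as the padding value.
    """
    counts = Counter(char for text in texts for char in text)
    return {char: i + 1 for i, (char, _) in enumerate(counts.most_common())}


def texts_to_sequences_sketch(texts, char_to_int):
    """Replace every character with its integer index."""
    return [[char_to_int[char] for char in text] for text in texts]


example_texts = ['aaab', 'abc']
mapping = map_char_to_int_sketch(example_texts)
print(mapping)
print(texts_to_sequences_sketch(example_texts, mapping))
```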

These are the characters our data contains in order of frequency:

[Image: the characters found in the data, ordered by frequency]

Next we will pad our sequences of integers with leading zeros so they are all the same length, since the model’s tensor math requires the shape of each training example to be identical. (Note: I could have used a length as low as 100 here because our texts are only about 100 characters, but I chose 128 so that every pooling operation later on divides the length evenly by 2.)

SEQUENCE_LENGTH = 128
data = pad_sequences(sequences, maxlen=SEQUENCE_LENGTH)
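For readers who have not used `pad_sequences`, the pure-Python sketch below mimics its default behavior, which pads on the left and, for inputs that are too long, keeps only the last `maxlen` items. `left_pad` is a hypothetical stand-in for illustration; the real function returns a NumPy array rather than nested lists.

```python
def left_pad(sequences, maxlen, value=0):
    """Pure-Python illustration of left-padding (and left-truncating) sequences."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep at most the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


print(left_pad([[2, 2, 27, 11], [5]], maxlen=6))
```

Padding on the left keeps the most recent characters at the right-hand end of every example, which is where the next character is being predicted.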

And finally we’ll shuffle our training data and split it into training and validation sets. Shuffling (randomizing the order) ensures that a particular subset of the data is not always the subset we use to validate accuracy. Splitting some data into a validation set allows us to gauge how well the model is performing on examples that we are not allowing it to use for training.

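# randomize order of training data
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
# labels is still a plain Python list, so convert it before indexing with an array
labels = np.asarray(labels)[indices]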
# validation set can be much smaller if we use a lot of data
validation_ratio = 0.2 if data.shape[0] < 1000000 else 0.02
num_validation_samples = int(validation_ratio * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]
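As a sanity check, the split rule above reproduces the sample counts that appear in the training log in Step 4. The sketch below only repeats the arithmetic; `split_sizes` is a hypothetical helper, and the total of 45,178,497 rows is obtained by adding the train and validation counts from that log.

```python
def split_sizes(num_rows):
    """Return (train_size, validation_size) using the rule from the code above."""
    validation_ratio = 0.2 if num_rows < 1000000 else 0.02
    num_validation = int(validation_ratio * num_rows)
    return num_rows - num_validation, num_validation


# 45,178,497 = 44,274,928 training rows + 903,569 validation rows
print(split_sizes(45178497))
# a smaller dataset falls back to the 20% validation ratio
print(split_sizes(500000))
```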

Step 3: Model Design

I chose to use a convolutional net because convolutions are simple and fast to train. I did briefly test a two layer LSTM, but the accuracy per time spent training was worse than the conv net, and predictions with even that small LSTM took longer than the age of the universe (okay, maybe it just felt that long). Generative Adversarial Networks (GANs) are beautiful creatures with massive potential, but using them for text generation is still in early stages and my first attempt at it was lackluster. Maybe that will be my next post…

Okay, here’s the code used to construct our conv net model in Keras:

EMBEDDING_DIM = 16

model = Sequential()
model.add(Embedding(len(char_to_int) + 1, EMBEDDING_DIM, input_length=SEQUENCE_LENGTH))
# four conv blocks: conv -> batch norm -> max pool -> dropout
for _ in range(4):
    model.add(Conv1D(1024, 5, activation='relu', padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling1D(2))
    model.add(Dropout(0.25))
# the fifth conv block ends in global max pooling instead
model.add(Conv1D(1024, 5, activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(GlobalMaxPooling1D())
model.add(Dropout(0.25))
model.add(Dense(1024, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.25))
# labels_index is not defined in this post; it needs one entry per output class (70 here)
model.add(Dense(len(labels_index), activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])

Lots of things going on there. Here’s what all that code is doing:

First the model converts each input example from an array of 128 integers (each representing one text character) into a 128×16 matrix using a Keras Embedding. An embedding is a layer that learns an optimal way to convert each of our characters from being represented as an integer to instead being represented as an array of 16 floats like [0.02, ..., -0.91]. This allows the model to learn which characters are used similarly by embedding them near one another in 16-dimensional space, and ultimately increases the accuracy of the model’s predictions.
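Mechanically, an embedding is just a lookup table with one row per character index. The sketch below shows the shape change with plain Python lists; the values are random placeholders (in the real model they are learned during training), and the vocabulary size of 71 is taken from the 1,136 embedding parameters in the model summary further down (71 × 16).

```python
import random

VOCAB_SIZE = 71        # ~70 characters plus index 0 for padding
EMBEDDING_DIM = 16
SEQUENCE_LENGTH = 128

rng = random.Random(0)
# one row of 16 floats per character index (random here, learned in the real model)
embedding_matrix = [[rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)]
                    for _ in range(VOCAB_SIZE)]


def embed(sequence):
    """Look up the embedding row for every integer in the sequence."""
    return [embedding_matrix[index] for index in sequence]


sequence = [0] * 125 + [2, 2, 27]   # a left-padded example input
embedded = embed(sequence)
print(len(embedded), len(embedded[0]))   # rows x columns of the embedded input
```

Every occurrence of the same character maps to the same row, so whatever the model learns about a character is shared across all positions in the text.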

Next we add 5 convolutional layers each with a kernel size of 5, 1024 filters, and a ReLU activation. Conceptually, the first conv layer is learning how to construct words from characters, and later layers are learning to construct longer words and chains of words (n-grams), each more abstracted than the previous.

  • padding='same' is used to ensure the output length of each layer matches its input length, since otherwise a width 5 convolution would trim 2 positions from each end of the sequence (4 in total).
  • 1024 was chosen as the number of filters because it was a good tradeoff between training speed and model accuracy, determined by trial and error. For other datasets I would recommend starting with 128 filters and then increasing/decreasing it by a factor of two several times to see what happens. More filters generally means better model accuracy, but slower training, slower runtime predicting, and larger model size. However, if you have too little data or too many filters, your model may overfit and accuracy will plummet, in which case you should decrease the filters.
  • Kernel size of 5 was chosen after testing 2, 3, 5, and 7. Kernels of 2 and 3 did worse, and 7 was similar but slower because a width 7 kernel has 7/5 as many weights to train (40% more). In my research, other people have had success using kernel sizes from 3 to 7 in various combinations, but my takeaway is that a size 5 kernel usually performs decently on text data, and you can always experiment later on to squeeze out more accuracy for your particular dataset.
  • The ReLU activation was chosen because it is fast, simple, and very good for a huge variety of use cases. My takeaway from reading a few articles and research papers was that Leaky ReLU or other variations may give a slight improvement on some datasets, but it’s not guaranteed to be better, and it’s less likely to be noticeable on larger datasets.
  • Batch normalization is added after each conv layer so that the inputs to the next layer are normalized based on the mean and variance for the given batch. This mechanism isn’t perfectly understood by deep learning engineers yet, but we know that normalizing inputs improves training speed and becomes more important with deeper networks due to vanishing/exploding gradients. The original batch normalization paper had impressive results.
  • A bit of dropout is added after each conv layer to help prevent the layer from simply memorizing the data and overfitting. Dropout(0.25) randomly sets 25% of the layer’s outputs to zero during training; it is switched off when the model is making predictions.
  • MaxPooling1D(2) is added between each conv layer to “squeeze” our sequence of 128 characters in half into sequences of 64, 32, 16, and 8 characters in the following layers. Conceptually, this allows the convolutional filters to learn more abstract patterns from the text in the deeper layers, since our width 5 kernel will span twice as many characters after the dimensionality is reduced by 2X by each max pooling operation.
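The length bookkeeping in the bullets above is easy to check with a few lines of arithmetic. This sketch assumes stride-1 convolutions and pooling windows of size 2, as in the model; the function names are made up for illustration.

```python
def conv_output_length(length, kernel_size, padding):
    """Output length of a stride-1 1D convolution."""
    if padding == 'same':
        return length
    return length - kernel_size + 1   # 'valid': no padding is added


def pooled_lengths(length, num_pools, pool_size=2):
    """Sequence length before and after each max pooling step."""
    lengths = [length]
    for _ in range(num_pools):
        length //= pool_size
        lengths.append(length)
    return lengths


print(conv_output_length(128, 5, 'valid'))   # shrinks by 4 without padding
print(conv_output_length(128, 5, 'same'))    # unchanged with padding='same'
print(pooled_lengths(128, 4))                # lengths seen by the five conv layers
print(pooled_lengths(100, 4))                # a length of 100 does not halve cleanly
```

The last line shows why a sequence length of 128 is convenient: 100 would hit an odd length after two pooling steps and start dropping positions.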

After all the conv layers we use a global max pooling layer. Unlike the normal max pooling layers, which halve the sequence length, it takes the maximum over the entire remaining sequence for each filter, collapsing the 8×1024 output into a single vector of 1024 values that the Dense layers can consume. The final layers are just standard Dense (fully-connected) layers with 1024 neurons, and finally 70 neurons because our classifier needs to output the probability for each of our 70 different labels.

The model.compile step is pretty standard. The RMSprop optimizer is a decent all around optimizer and I didn’t experiment with changing it for this neural net. loss=sparse_categorical_crossentropy tells the model we want it to optimize for choosing the best category among a set of 2 or more categories (aka labels). The “sparse” part refers to the fact that our labels are integers between 0 and 70 rather than one-hot arrays each of length 70. Using one hot arrays for the labels takes WAY more memory, more time to process, and does not affect the model accuracy. Don’t use one hot labels!
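To put a rough number on the memory difference, here is a back-of-the-envelope calculation. It assumes 4 bytes per stored value (32-bit integers or floats) and uses the approximate example count from Step 1; actual memory use depends on the dtypes chosen.

```python
NUM_EXAMPLES = 45_000_000   # approximate number of training examples
NUM_CLASSES = 70
BYTES_PER_VALUE = 4         # assumption: 32-bit integers or floats

sparse_bytes = NUM_EXAMPLES * BYTES_PER_VALUE
one_hot_bytes = NUM_EXAMPLES * NUM_CLASSES * BYTES_PER_VALUE

print(f'sparse labels:  {sparse_bytes / 1e9:.2f} GB')
print(f'one-hot labels: {one_hot_bytes / 1e9:.2f} GB')


def to_one_hot(label, num_classes=NUM_CLASSES):
    """What a single one-hot label would look like."""
    return [1 if i == label else 0 for i in range(num_classes)]


print(sum(to_one_hot(3)), len(to_one_hot(3)))   # a single 1 among 70 entries
```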

Keras has a nice model.summary() function that lets us view our model:

Layer (type) Output Shape Param #
embedding_1 (Embedding) (None, 128, 16) 1136
conv1d_1 (Conv1D) (None, 128, 1024) 82944
batch_normalization_1 (Batch (None, 128, 1024) 4096
max_pooling1d_1 (MaxPooling1 (None, 64, 1024) 0
dropout_1 (Dropout) (None, 64, 1024) 0
conv1d_2 (Conv1D) (None, 64, 1024) 5243904
batch_normalization_2 (Batch (None, 64, 1024) 4096
max_pooling1d_2 (MaxPooling1 (None, 32, 1024) 0
dropout_2 (Dropout) (None, 32, 1024) 0
conv1d_3 (Conv1D) (None, 32, 1024) 5243904
batch_normalization_3 (Batch (None, 32, 1024) 4096
max_pooling1d_3 (MaxPooling1 (None, 16, 1024) 0
dropout_3 (Dropout) (None, 16, 1024) 0
conv1d_4 (Conv1D) (None, 16, 1024) 5243904
batch_normalization_4 (Batch (None, 16, 1024) 4096
max_pooling1d_4 (MaxPooling1 (None, 8, 1024) 0
dropout_4 (Dropout) (None, 8, 1024) 0
conv1d_5 (Conv1D) (None, 8, 1024) 5243904
batch_normalization_5 (Batch (None, 8, 1024) 4096
global_max_pooling1d_1 (Glob (None, 1024) 0
dropout_5 (Dropout) (None, 1024) 0
dense_1 (Dense) (None, 1024) 1049600
batch_normalization_6 (Batch (None, 1024) 4096
dropout_6 (Dropout) (None, 1024) 0
dense_2 (Dense) (None, 70) 71750
Total params: 22,205,622
Trainable params: 22,193,334
Non-trainable params: 12,288

The parameter counts are particularly useful if you’re not addicted to doing tensor shape multiplication in your head. When adjusting the hyperparameters we discussed above, it’s useful to keep an eye on the parameter count of the model, which roughly represents the model’s total amount of learning capacity.
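If you do want to check the summary by hand, the counts follow from a few standard formulas: a Conv1D layer has kernel_size × input_channels × filters weights plus one bias per filter, a Dense layer has inputs × units weights plus one bias per unit, and a batch normalization layer stores four values per channel, two of which (the moving mean and variance) are not trained. The sketch below is just that arithmetic applied to the layer sizes in the summary.

```python
VOCAB_SIZE, EMBEDDING_DIM = 71, 16
FILTERS, KERNEL_SIZE, NUM_CLASSES = 1024, 5, 70


def conv1d_params(in_channels, filters, kernel_size):
    return kernel_size * in_channels * filters + filters   # weights + biases


def dense_params(inputs, units):
    return inputs * units + units                           # weights + biases


def batch_norm_params(channels):
    return 4 * channels   # gamma, beta, moving mean, moving variance


total = (
    VOCAB_SIZE * EMBEDDING_DIM                               # embedding
    + conv1d_params(EMBEDDING_DIM, FILTERS, KERNEL_SIZE)     # first conv layer
    + 4 * conv1d_params(FILTERS, FILTERS, KERNEL_SIZE)       # conv layers 2 to 5
    + 6 * batch_norm_params(FILTERS)                         # six batch norm layers
    + dense_params(FILTERS, FILTERS)                         # Dense(1024)
    + dense_params(FILTERS, NUM_CLASSES)                     # Dense(70)
)
non_trainable = 6 * 2 * FILTERS   # moving mean and variance of each batch norm layer
print(total, total - non_trainable, non_trainable)
```

Almost all of the capacity sits in conv layers 2 to 5, so changing the filter count or kernel size there moves the total far more than any other adjustment.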

Step 4: Training

Now we’re going to let the model train and use “checkpoints” to save the history and optimal model along the way so that we can check progress and make predictions using the latest model at any point during training.

# the path where you want to save all of this model's files
MODEL_PATH = '/home/ubuntu/imgflip/models/conv_model'
# just make this large since you can stop training at any time
NUM_EPOCHS = 48  # matches the "Epoch 1/48" in the log below
# batch size below 256 will reduce training speed since
# CPU (non-GPU) work must be done between each batch
BATCH_SIZE = 256  # assumed from the comment above; the exact value is not shown in this post

# callback to save the model whenever validation loss improves
checkpointer = ModelCheckpoint(filepath=MODEL_PATH + '/model.h5', verbose=1, save_best_only=True)
# custom callback to save history and plots after each epoch
history_checkpointer = util.SaveHistoryCheckpoint(MODEL_PATH)
# the main training function where all the magic happens!
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=NUM_EPOCHS, batch_size=BATCH_SIZE, callbacks=[checkpointer, history_checkpointer])

This is where you just sit and watch the magic number go up over a period of many hours…

Train on 44274928 samples, validate on 903569 samples
Epoch 1/48
44274928/44274928 [==============================] - 16756s 378us/step - loss: 1.5516 - acc: 0.5443 - val_loss: 1.3723 - val_acc: 0.5891
Epoch 00001: val_loss improved from inf to 1.37226, saving model to /home/ubuntu/imgflip/models/gen_2019_04_04_03_28_00/model.h5
Epoch 2/48
44274928/44274928 [==============================] - 16767s 379us/step - loss: 1.4424 - acc: 0.5748 - val_loss: 1.3416 - val_acc: 0.5979
Epoch 00002: val_loss improved from 1.37226 to 1.34157, saving model to /home/ubuntu/imgflip/models/gen_2019_04_04_03_28_00/model.h5
Epoch 3/48
44274928/44274928 [==============================] - 16798s 379us/step - loss: 1.4192 - acc: 0.5815 - val_loss: 1.3239 - val_acc: 0.6036
Epoch 00003: val_loss improved from 1.34157 to 1.32394, saving model to /home/ubuntu/imgflip/models/gen_2019_04_04_03_28_00/model.h5
Epoch 4/48
44274928/44274928 [==============================] - 16798s 379us/step - loss: 1.4015 - acc: 0.5857 - val_loss: 1.3127 - val_acc: 0.6055
Epoch 00004: val_loss improved from 1.32394 to 1.31274, saving model to /home/ubuntu/imgflip/models/gen_2019_04_04_03_28_00/model.h5
Epoch 5/48
1177344/44274928 [..............................] - ETA: 4:31:59 - loss: 1.3993 - acc: 0.5869

Fast forward and here are some shiny plots of our loss and accuracy at each epoch:

[Plots: training and validation loss and accuracy at each epoch]

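I’ve found that when training loss/accuracy is worse than validation loss/accuracy it’s a sign that the model is learning well and not overfitting. Some of that gap is expected with this architecture, because dropout is active during training but switched off when the validation set is evaluated.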

As a side note if you’re using an AWS server for training, I found the optimal instance to be p3.2xlarge. This uses their fastest GPU as of April 2019 (Tesla V100), and the instance only has one GPU since our model cannot make use of multiple GPUs very efficiently. I did try using Keras’s multi_gpu_model but it requires making the batch size way larger to actually realize speed gains, which can mess with the model’s ability to converge, and it barely got 2X faster even when using 4 GPUs. The p3.8xlarge with 4 GPUs costs 4 times more, so for me it wasn’t worth it.

Step 5: Predicting

Okay so now we have a model that can output the probabilities for which character should come next in a meme caption, but how do we use that to actually create a full meme caption from scratch?

The basic premise is that we initialize a string with whichever meme we want to generate text for, and then we call model.predict once for each character until the model outputs the end-of-text-box character | as many times as there are text boxes in the meme. For the “X All The Y” meme seen above, the default number of text boxes is 2 and our initial text would be:

"000000061533  0  "

I tried a few different approaches for choosing the next character given the model’s output of 70 probabilities:

  1. Pick the character with highest score each time. This is ultra boring because it chooses the exact same text every time for a given meme, and it uses the same words over and over across memes. It spit out “when you find out your friends are the best party” for the X All The Y meme over and over. It liked to use the words “best” and “party” a lot in other memes too.
  2. Give each character a probability of being chosen equal to the score the model gave it, but only if the score is above a certain threshold (≥ 10% of the highest score works well for this model). This means multiple characters can be chosen, but bias is given to higher scored characters. This method succeeded in adding variety, but longer phrases sometimes lacked cohesion. Here’s one from the Futurama Fry meme: “not sure if she said or just put out of my day”.
  3. Give each character an equal probability of being chosen, but only if its score is high enough (≥ 10% of the highest score works well for this model). Also, use beam search to keep a running list of N texts at any given time, and use the product of all the character scores instead of just the last character’s score. This takes up to N times longer to compute, but seems to improve sentence cohesion in some cases.
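The difference between methods 1 and 2 can be seen on a toy example. The sketch below follows the description of method 2 literally (sample in proportion to score among the characters that clear the threshold); the prediction code further down draws a random cutoff instead, so treat this as an illustration of the idea rather than of that implementation. The function names and the four scores are made up.

```python
import random


def pick_greedy(probs):
    """Method 1: always take the highest-scoring index."""
    return max(range(len(probs)), key=lambda i: probs[i])


def pick_weighted(probs, min_ratio=0.1, rng=random):
    """Method 2: sample in proportion to score among indices above the threshold."""
    threshold = min_ratio * max(probs)
    candidates = [i for i, p in enumerate(probs) if p >= threshold]
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


probs = [0.60, 0.30, 0.05, 0.05]   # made-up scores for four characters
rng = random.Random(1)
print(pick_greedy(probs))                                  # the same index every time
print([pick_weighted(probs, rng=rng) for _ in range(10)])  # a mix of the two strong candidates
```

With these scores the threshold is 0.06, so the two 0.05 characters can never be picked, while the 0.30 character still gets chosen about a third of the time.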

I’m currently using method #2 because it’s much faster than beam search and both methods give decent results. Below are some random examples:

[Images: randomly generated example memes]

You can play with the latest model yourself and generate from any of the 48 memes at

The code for making runtime predictions using method #2 is below. The full implementation on Github is a generalized beam search algorithm so beam search can be enabled simply by increasing the beam width beyond 1.

# min score as percentage of the maximum score, not absolute
MIN_SCORE = 0.1
int_to_char = {v: k for k, v in char_to_int.items()}

def predict_meme_text(template_id, num_boxes, init_text = ''):
    template_id = str(template_id).zfill(12)
    # start from any text the caller supplied
    final_text = init_text
    for char_count in range(len(init_text), SEQUENCE_LENGTH):
        box_index = str(final_text.count('|'))
        # same format as the training rows: ID, two spaces, box index, two spaces, text
        texts = [template_id + '  ' + box_index + '  ' + final_text]
        sequences = util.texts_to_sequences(texts, char_to_int)
        data = pad_sequences(sequences, maxlen=SEQUENCE_LENGTH)
        predictions_list = model.predict(data)
        predictions = []
        for j in range(0, len(predictions_list[0])):
            if j not in int_to_char:
                continue  # e.g. index 0, which is only used for padding
            predictions.append({
                'text': final_text + int_to_char[j],
                'score': predictions_list[0][j]
            })
        predictions = sorted(predictions, key=lambda p: p['score'], reverse=True)
        top_predictions = []
        top_score = predictions[0]['score']
        rand_int = random.randint(int(MIN_SCORE * 1000), 1000)
        for prediction in predictions:
            # give each char a chance of being chosen based on its score
            if prediction['score'] >= rand_int / 1000 * top_score:
                top_predictions.append(prediction)
        # pick one of the surviving candidates at random
        random.shuffle(top_predictions)
        final_text = top_predictions[0]['text']
        # stop at the length limit or once every text box has been closed with '|'
        if char_count >= SEQUENCE_LENGTH - 1 or final_text.count('|') == num_boxes:
            return final_text

You can view all the code including utility functions and a sample training data file on github.

Step 6: ???

Step 7: Profit from AI-generated memes, obviously

The end.