Predicting the Popularity of Instagram Posts

In this article, we discuss our methods and results in attempting to predict what makes an Instagram post more or less popular. First, we describe our data collection methods and the data itself. Then we describe the techniques we used for image and natural language processing. Next, we use an XGBoost regression model to get a benchmark root mean square error score that we then try to improve. The final step is to use a mixed-input neural network that takes in categorical/numerical data and image data.

All code for the topics discussed in this article can be found here: https://github.com/GuiZamorano/Instagram_Like_Predictor

Motivation

Trends and popularity are largely driven by social media in the modern age. There are over 1 billion users on Instagram today. This is a large-scale market with the potential to be optimized to increase popularity, viewership, and even revenue. We hope to capitalize on this market by creating a project that can identify the key variables that increase the ratio of a post's likes to the average number of likes its user gets per post. Using these features, we want to estimate this ratio so that posts can be optimized to gather the most exposure, for high-profile influencers as well as everyday users.

Collecting the Data

Collecting Instagram Usernames:

To predict the popularity of an Instagram post, the first thing we need is plenty of data to train our models on. Our first task was to collect Instagram usernames whose posts we could scrape. Fortunately, we found a website with a list of the top 1000 Instagram influencers (not necessarily the ones with the most followers). The website can be found here: https://hypeauditor.com/en/top-instagram/.

Using urllib and BeautifulSoup Python packages we crawled through the pages of this website and collected 1000 Instagram usernames. However, we still wanted more data so by looking at other projects involving Instagram users, we were able to increase our list to a total of 1897 usernames. All these users had a wide variety of popularity and followers.

Scraping Instagram Posts:

Now that we had an extensive list of usernames, our next challenge was to collect data from their posts. Our first idea was to use Facebook’s Instagram Graph API. By making requests to this API, you can collect most of the information from a profile and a post. Unfortunately, Instagram’s Graph API limits are becoming stricter every year. Currently the limit is 200 requests per hour. This limitation makes it very difficult and tedious to take advantage of the Instagram Graph API. Therefore, by doing some research, we created our own scraper that works as follows:

  1. Make a URL Request to https://www.instagram.com/ + {username}.
  2. Extract the Javascript metadata from the response by turning it into a JSON object.
  3. Use urllib to download the profile image of the user and each post image.
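The JSON extraction in step 2 can be sketched as follows. This is a minimal illustration, not the scraper from our repository: the JavaScript variable name in the pattern and the field names in the sample page are assumptions made for the example, and the real page layout may differ or change over time.

```python
import json
import re

# Assumption: the profile JSON is assigned to a JavaScript variable inside a
# <script> tag. The variable name used here is illustrative only.
EMBEDDED_JSON_PATTERN = re.compile(
    r"window\._sharedData\s*=\s*(\{.*?\})\s*;\s*</script>", re.DOTALL
)


def extract_profile_json(html):
    """Return the JSON object embedded in a profile page's HTML as a dict."""
    match = EMBEDDED_JSON_PATTERN.search(html)
    if match is None:
        raise ValueError("no embedded JSON found in page")
    return json.loads(match.group(1))


# Hypothetical, heavily simplified page used only to exercise the function.
sample_html = (
    "<html><body><script>window._sharedData = "
    '{"user": {"username": "example", "posts": [{"likes": 10}, {"likes": 12}]}};'
    "</script></body></html>"
)
profile = extract_profile_json(sample_html)
print(profile["user"]["username"], len(profile["user"]["posts"]))
```

In a real run, the HTML string would come from the URL request in step 1, and the image URLs found in the parsed object would be passed to urllib for step 3.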

This scraper, which directly requests a user's Instagram page, provides the twelve most recent posts from each user. We attempted to use Selenium to load the entire page; however, doing so only appends visual information to the HTML of the page and not the metadata we require. Therefore, we decided to stick with (up to) twelve posts per user, and we removed the most recent post from our dataset because it may not yet have had time to accumulate a representative number of likes. Now that we have 1897 JSON objects with profile and post information and a folder with all the images, we are ready to build a dataset.

Figure 1: JSON of user profile data
Figure 2: Last key-value pair of JSON is array of posts

Data Exploration and Visualization

Our first goal after scraping for readily available features was to see if we could find any relationships between them. This included looking for expected or hypothesized relationships as well as unexpected ones. This exploration also includes some basic features we generated ourselves in addition to the raw scraper data, which are discussed in more detail later. We first explored each feature's correlation with the feature we will eventually try to predict: a post's number of likes divided by the mean number of likes for the post's account (number_of_likes_over_mean). We homed in on the features most correlated (or inversely correlated) with number_of_likes_over_mean for further investigation. Depicted below are correlations between these features.

Figure 3: Top features correlated with number_of_likes_over_mean

Some features were correlated with others fairly significantly for explainable reasons, such as hr_of_day or the hour buckets (e.g. (16, 20], which means 4pm to 8pm) and hr_sin, because they all deal with hour intervals. The most significant correlations are listed below, although even these correlations are very small.

Figure 4: Some of the most significant feature correlations

Some relationships we found were that posts during later waking hours do a little better than posts late at night or early in the morning. As seen in Figure 5, midweek (Tuesday, Wednesday, Thursday) posts do slightly worse than an average post, and we found the same for video posts and posts with disabled comments.

Figure 5: Each day, one-hot-encoded vs. number_of_likes_over_mean

We also visualized the number of likes divided by a user's average number of likes as a distribution and as a scatterplot. The distribution has a skew of about 1.6 and a kurtosis of about 6. Its right tail is heavier than that of a normal curve.
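For readers who want to reproduce these two statistics, the sketch below computes them from their moment definitions. The sample values are made up, and the article does not state which kurtosis convention was used, so the function exposes both (Pearson kurtosis, where a normal curve scores 3, and excess kurtosis, where it scores 0).

```python
def central_moment(values, order):
    """Return the central moment of the given order (population form)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third standardized moment: positive means a longer right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values, excess=True):
    """Fourth standardized moment, optionally minus 3 (excess kurtosis)."""
    raw = central_moment(values, 4) / central_moment(values, 2) ** 2
    return raw - 3.0 if excess else raw


# Hypothetical likes-over-mean ratios with one unusually popular post.
ratios = [0.6, 0.8, 0.9, 1.0, 1.0, 1.1, 1.2, 3.4]
print(skewness(ratios), kurtosis(ratios))
```

A positive skew, as in our data, means a small number of posts perform far above their account's average while most sit near or slightly below it.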

Figure 6: Distribution of likes divided by mean
Figure 6: Scatter plot of likes divided by mean

Feature Generation

After familiarizing ourselves with our dataset, we decided to generate some features that we thought would be useful in predicting how a given post performs. These features include those extracted from the post images themselves, NLP features that describe the raw text data such as captions in a more useful way, as well as more general features that could potentially contribute to a post’s success, such as time of day and day of the week. After generating these features, we hoped to utilize them and determine their importance once we began training a final model, by comparing the performance of a base regression model both with and without these features.

General Features:

After scraping data from Instagram, we had lots of metadata, including account followers and following, business/category information, and time and day information. Although some of the metadata is related to the account and not just the post, we felt including it would help give contextual information to the post, as well as help us derive more features specific to the post itself.

So, in addition to scraping data from Instagram, we performed some basic feature engineering to augment the dataset before going in to do more complicated feature generation. We thought it might be useful to augment the scraped data with some of the following. A measure of how “active” an account is might be a proxy for how popular a post will be, so we computed the time between posts and added it as a feature to the dataset. In addition, we computed time sine and cosine features to encode the cyclical nature of time (i.e. 11:55pm should be close to 12:05am, but 23 hr is very far from 00 hr). We also grouped posts into time buckets, for example from midnight to 4 am, 4am to 8am, etc, because we thought certain daily time periods (not just times) would contribute to the popularity of a post. Finally, because it would be very difficult to measure the number of likes for a post without simply learning the user, we calculated the associated account’s mean number of comments and mean number of likes in preparation for predicting how relatively popular a post would be. Next we look at more involved features such as those found in images and text.
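The time encodings and the relative-popularity target described above can be sketched as follows. The function names and the hr_cos and hr_bucket keys are illustrative choices for this example (only hr_sin appears elsewhere in the article), and the buckets here are left-closed, which may differ from the bucket edges used in our dataset.

```python
import math
from datetime import datetime


def hour_features(posted_at):
    """Encode the posting time as cyclical features plus a 4-hour bucket."""
    hour = posted_at.hour + posted_at.minute / 60.0
    angle = 2.0 * math.pi * hour / 24.0  # map the 24-hour day onto a circle
    bucket_start = (posted_at.hour // 4) * 4
    return {
        "hr_sin": math.sin(angle),
        "hr_cos": math.cos(angle),
        "hr_bucket": f"[{bucket_start}, {bucket_start + 4})",
    }


def likes_over_mean(likes_per_post):
    """Divide each post's likes by the account's mean likes."""
    mean_likes = sum(likes_per_post) / len(likes_per_post)
    return [likes / mean_likes for likes in likes_per_post]


# 11:55pm and 12:05am end up close together in (sin, cos) space.
late = hour_features(datetime(2020, 1, 1, 23, 55))
early = hour_features(datetime(2020, 1, 2, 0, 5))
gap = math.hypot(late["hr_sin"] - early["hr_sin"], late["hr_cos"] - early["hr_cos"])
print(gap, late["hr_bucket"], likes_over_mean([50, 100, 150]))
```

Both the sine and the cosine are needed: either one alone maps two different times of day to the same value.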

Image Features:

We had many ideas relating to images and how to treat them. An initial idea was that a post performs better when there is a human subject than when there is not. Moreover, we hypothesized that images with many people in them, such as group photos, do not get as many likes as images with a single clear subject. We also wanted to test whether the presence of a smile in the image affects the post's number of likes. Therefore, we decided to create a model that determines the number of faces and the number of smiles in a given image. This model would be applied to our dataset to generate these features.

After some preliminary research, we decided to use OpenCV's existing face detection classifiers to complete this task. Using our downloaded images, we loaded pre-existing face and smile detectors from the XML files distributed with OpenCV and used them to generate .csv files containing the new features to add to our existing dataset.

Natural Language Processing Features:

There are many different features within the dataset that utilize text including biography, hashtags, location, and caption. For the purposes of our project we chose to focus on the caption because we were analyzing independent posts and wanted to use as much information about the post, not the poster, as possible.

The first step in NLP was to clean the dataset. We focused on the words themselves, and as such had to remove numbers, punctuation, emojis, and null values. Because our dataset includes popular Instagram users from all over the world, the captions were written in many different languages, which made processing a challenge. We decided to translate all of the text into English with Google Translate so that every caption could be processed in the same way. After translating the captions into English, we removed stop words such as “the” or “is” to give more weight to other words.

After we cleaned the dataset, the first analysis was done on text sentiment. We passed each caption to TextBlob and received a sentiment score. A positive score up to +1 indicated positive words and a negative score down to -1 indicated negative words. Alongside this numerical feature, we made the original language into a categorical variable to see if different languages were associated with differences in the likes ratio.

The other NLP method we tried utilized bag-of-words. In this approach, we created a count vector of all the words in a given caption and found the 100 most used words among all the captions. These words were then turned into binary features, a 1 indicating it was in the caption and a 0 indicating it was not in the caption. This approach was heavily reliant on the caption being entirely English as having multiple languages would create different top words that would not be common among all the captions.
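The bag-of-words step can be sketched with the standard library alone. This is a simplified illustration that assumes captions have already been cleaned and translated, splits on whitespace only, and uses hypothetical function names and toy captions; our actual pipeline may have tokenized differently.

```python
from collections import Counter


def top_words(captions, n=100):
    """Return the n most frequent words across all captions."""
    counts = Counter(
        word for caption in captions for word in caption.lower().split()
    )
    return [word for word, _ in counts.most_common(n)]


def binary_features(caption, vocabulary):
    """Return 1 for each vocabulary word present in the caption, else 0."""
    present = set(caption.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


# Toy captions; n=2 keeps the example readable (we used the top 100 words).
captions = ["love summer beach", "love coffee", "summer love"]
vocabulary = top_words(captions, n=2)
print(vocabulary, binary_features("coffee love", vocabulary))
```

Each caption then contributes one row of 0/1 values, one column per vocabulary word, which can be appended to the other features.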

Now that we had generated all of our features, we were ready to begin working on an initial model.

Regression Analysis

Our first step in building a model was to try various regression models with all of the data, including the number of likes and comments. Obviously this information is what we are trying to predict, but seeing how well a regression model could do with it would give us a benchmark score to reach. Our best models were XGBoost and Gradient Boosting, so we decided to stick with these regression models. They achieved a test mean squared error of around 0.032–0.034, which is a root mean squared error of around 0.18.

Figure 7: Mean squared error scores for models already including the information we are attempting to predict

This gave us our goal: a root mean squared test error better than 0.18.

Next, we removed the features that are not available before the picture is posted, such as the number of likes and comments, to test our actual predictive model. The mean squared error rose drastically to around 0.26–0.27, which is a root mean squared error of around 0.52.
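The conversion between the two error scales is just a square root. The short check below uses the midpoints of the reported MSE ranges, which is an assumption made for illustration, since the exact per-model values appear only in the figures.

```python
import math


def rmse_from_mse(mse):
    """RMSE is the square root of MSE, in the same units as the target."""
    return math.sqrt(mse)


# Midpoints of the reported ranges (assumed for this example).
benchmark_rmse = rmse_from_mse(0.033)  # model that can see likes and comments
realistic_rmse = rmse_from_mse(0.265)  # model restricted to pre-post features
print(round(benchmark_rmse, 2), round(realistic_rmse, 2))
```

Because RMSE is in the units of the target, an RMSE near 0.5 means a typical prediction misses by roughly half of an account's average likes.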

Figure 8: Mean squared error scores for true model

Our next step was to see which generated features worked best with XGBoost. The generated features we decided to try on the model were number of faces, number of smiles, and the natural language features. XGBoost performed best with the original features plus the number of faces. The root mean square error was 0.51.

With this regression analysis we were able to determine that using the original features with number of faces yielded the best prediction with XGBoost and Gradient Boosting. However, these new features only marginally improved the model, even after tuning our XGBoost regressor. Therefore, our final step is to test the new features with a neural network.

Figure 9: Mean squared error scores after adding image and natural language features.

Neural Networks

Because XGBoost was not coming close to our target RMSE score of around 0.18, we decided to try neural networks. This would also allow us to use the profile and post images we initially downloaded directly and to determine whether they play a role in a post's popularity. Interestingly, we found that most neural networks are used for classification problems rather than regression. Still, we decided to pursue neural networks for regression, i.e. predicting a continuous value.

Our first step was to use only categorical and numerical data in a neural network to see how it compared to the XGBoost results. Next, we used a convolutional neural network on the images, both the profile image and the post image. Finally, we combined both into a two-branch neural network. All training was done using Google Colab, which offers free GPU access.

1. Multilayer Perceptron (MLP):

A multilayer perceptron is a deep, artificial neural network composed of an input layer, an output layer, and at least one hidden layer that performs the computation. The hidden layers and activation functions give us a much more powerful model than linear regression. The first thing we did was to use scikit-learn's MinMaxScaler to scale our features to the range [0, 1]. We also scaled the output targets to [0, 1] to reduce the range of our output predictions. This significantly improved the neural network's results. This is because neural networks learn by adding gradient vectors multiplied by a learning rate, and this learning rate may overcompensate (or undercompensate) in its corrections if the ranges of the feature distributions are all different. The simple MLP architecture we used is shown below; it uses a linear activation function for the final dense layer.
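The scaling step amounts to the following arithmetic. This NumPy sketch is a hand-written stand-in for scikit-learn's MinMaxScaler, with hypothetical function names and toy values, shown only to make the transformation and its inverse explicit.

```python
import numpy as np


def fit_min_max(values):
    """Return the per-column minimum and range of the training data."""
    values = np.asarray(values, dtype=float)
    minimum = values.min(axis=0)
    span = values.max(axis=0) - minimum
    span = np.where(span == 0, 1.0, span)  # avoid dividing by zero
    return minimum, span


def scale(values, minimum, span):
    """Map values into [0, 1] using statistics from the training data."""
    return (np.asarray(values, dtype=float) - minimum) / span


def unscale(scaled, minimum, span):
    """Invert the scaling, e.g. to read predictions in original units."""
    return np.asarray(scaled, dtype=float) * span + minimum


# Toy features on very different ranges (e.g. hour of day vs. followers).
features = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
f_min, f_span = fit_min_max(features)
scaled_features = scale(features, f_min, f_span)
print(scaled_features)
```

The minimum and range should be fitted on the training split only and then reused for the test split, and errors measured on scaled targets are in scaled units unless the predictions are mapped back with the inverse transform.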

Figure 10: MLP architecture

Results: This MLP architecture worked significantly better than any previous method! The RMSE was much lower than XGBoost's, and we were able to far surpass our benchmark score of 0.18. The results after training for 30 epochs using the Adam optimizer are described below.

Mean Absolute Percentage Difference: 55.40%

Root Mean Square Error: 0.0807
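The two metrics reported above can be computed as in the sketch below. The article does not spell out the formula behind the mean absolute percentage difference, so the definition used here (absolute error divided by the actual value, averaged and expressed as a percentage) is an assumption, and the sample values are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_abs_pct_diff(actual, predicted):
    """Mean of |predicted - actual| / |actual|, as a percentage (assumed form)."""
    ratios = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical popularity ratios and model predictions.
actual = [1.0, 2.0, 4.0]
predicted = [1.5, 2.0, 3.0]
print(rmse(actual, predicted), mean_abs_pct_diff(actual, predicted))
```

The percentage metric is dominated by posts with small actual values, which helps explain how it can stay high while the RMSE is low.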

Figure 11: Line of best fit by MLP model

2. Convolutional Neural Network (CNN):

Processing Images:

Using a CNN gives us the power to put all those gigabytes of downloaded images to good use. This did pose a few problems, however.

Image Size: The downloaded images were not standardized in size and most were too big for ideal computation times. By analyzing our image shapes it seemed that most were around 960px x 1200px. Ideally, we would resize the images to maintain their aspect ratio, but then they would all be different sizes. Our solution was to resize the images to 100px x 128px. This seemed to maintain image features better than square images since most images included faces or bodies. Additionally, it would not require the CNN to learn unnecessary information as is the case with padding.

Two Images: The next problem we dealt with was deciding on how to process our images in the CNN. Each data entry consists of two images, a profile picture and a post picture. We considered two options of dealing with this issue.

  1. Use two CNNs, one for post images and one for profile images, and use the popularity of the post as the label for each image.
  2. Create a new image that is the profile image and the post image side-by-side.

The first option seems like a bad choice. First, we will have a post image and a profile image that both share the same label. Second, we will have a different label for every single profile picture since popularity only differs by post. Therefore this option will probably make it much harder for the CNN to learn popularity by the images.

Option two seems to be ideal for many reasons. A big indicator of popularity is the person who is posting. Consider the exact same image of a plate of food at a fancy restaurant. One is posted by me and the other by Kim Kardashian. Which one do you think will get more likes? Therefore, combining the profile image with the post image side-by-side can be a great indicator of a post's popularity. Additionally, this allows the CNN to learn features from the profile and post image at the same time.
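Combining the two images is a single array concatenation once both have been resized to the same shape. The sketch below assumes each image is already a NumPy array of 128 rows by 100 columns with 3 color channels (reading our 100px x 128px size as width x height); the function name and the random stand-in images are illustrative.

```python
import numpy as np

HEIGHT, WIDTH, CHANNELS = 128, 100, 3  # assumed layout after resizing


def combine_side_by_side(profile_img, post_img):
    """Place the profile image to the left of the post image."""
    if profile_img.shape != post_img.shape:
        raise ValueError("both images must be resized to the same shape first")
    return np.concatenate([profile_img, post_img], axis=1)  # join along width


# Random arrays stand in for a resized profile picture and post picture.
rng = np.random.default_rng(0)
profile_img = rng.integers(0, 256, size=(HEIGHT, WIDTH, CHANNELS), dtype=np.uint8)
post_img = rng.integers(0, 256, size=(HEIGHT, WIDTH, CHANNELS), dtype=np.uint8)
combined = combine_side_by_side(profile_img, post_img)
print(combined.shape)
```

The combined array is twice as wide as either input, so the CNN's input shape has to be set to match.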

Figure 12: Final image example input into CNN

CNN Architecture:

Building a CNN architecture comes from experience and hours of modifications. There is no black-box way to get the best architecture for your data and the problem you are trying to solve. As such, we spent a long time testing different architectures with different numbers of convolutional layers, max-pool layers, and fully connected layers. We also experimented with batch normalization and dropout. Finally, we decided on the best architecture for our regression problem, shown below.

Figure 13: CNN architecture

This architecture is a deep network that consists of four convolutional layers, four max-pool layers, and five fully connected layers. Using this architecture, we attempted two variations of predicting popularity.

Gradient Boosted CNN:

Initially, we attempted to take the final feature layer of our CNN before the output, and use those features to train our XGBoost model. We reasoned that since this would include image features along with our categorical and numerical data it would significantly improve our results. We attempted this on both the ‘4’ node layer and the ‘25’ node layer but did not find any significant improvement. The RMSE was still around 0.52 which is much worse than our MLP results.

Image Regression:

Recall that CNNs are used mostly for classification. To use a CNN for regression, one simply has to replace the fully connected softmax layer with a single node (with a linear activation function). Additionally, make sure your error metric is one for continuous values; for example, we used root mean square error. We did this and passed in our side-by-side images to obtain a predicted popularity value for each post. The results were better than XGBoost but inferior to the MLP. The results are summarized below. As you can tell, the line of best fit worsened a lot compared to the MLP.

Mean Absolute Percentage Difference: 66.78%

Root Mean Square Error: 0.0850

Figure 14: CNN results

3. Two-Branch Neural Network: Mixed Data:

Our final idea was to bring these two neural networks together. This is a mixed data machine learning problem where our model must be able to use all of our data (image, categorical, numerical) to make predictions. Mixed data is still a very open area of machine learning research, so we were very eager to attempt it. We had been using Keras for our neural networks, and Keras models can handle multiple inputs. We already had the two models shown above, so the next step was to combine them into one two-branch neural network. Each branch trains on one type of data. After this, the branches are concatenated to make a final prediction.

Figure 15: How to combine mixed inputs

To create a multiple-input neural network we first need to create our two branches. The branches are the neural networks we described in the first two parts: the MLP and the CNN. There is one change, however: because these two models will be concatenated, we remove the final output layer of each model. We want the actual regression to be computed at the end by the multiple-branch neural network, as opposed to the individual branches. The resulting architecture is shown below:

Figure 16: Mixed Input NN architecture

As you can see, the combined outputs of the two branches (four neurons each) are the input to the layers of the mixed data network. We add another dense layer of four neurons to this and finally a linear activation on our final neuron which is the predicted popularity. This resulted in our best model with the smallest error. The results are described next.
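The structure just described can be sketched with the Keras functional API. This is a deliberately small illustration, not our exact architecture: the number and width of the layers inside each branch, the input sizes, and the build_mixed_model name are assumptions, while the four-neuron branch outputs, the four-neuron dense layer after concatenation, and the single linear output follow the description above.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_mixed_model(num_features, image_shape=(128, 200, 3)):
    """Two-branch regression model: tabular MLP branch plus image CNN branch."""
    # Branch 1: MLP over categorical/numerical features, ending in 4 neurons.
    tabular_in = keras.Input(shape=(num_features,), name="tabular")
    x = layers.Dense(8, activation="relu")(tabular_in)
    x = layers.Dense(4, activation="relu")(x)

    # Branch 2: small CNN over the side-by-side image, ending in 4 neurons.
    image_in = keras.Input(shape=image_shape, name="image")
    y = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(image_in)
    y = layers.MaxPooling2D((2, 2))(y)
    y = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(y)
    y = layers.MaxPooling2D((2, 2))(y)
    y = layers.Flatten()(y)
    y = layers.Dense(4, activation="relu")(y)

    # Neither branch has its own output layer; the regression happens here.
    merged = layers.concatenate([x, y])
    z = layers.Dense(4, activation="relu")(merged)
    output = layers.Dense(1, activation="linear", name="popularity")(z)

    model = keras.Model(inputs=[tabular_in, image_in], outputs=output)
    model.compile(optimizer="adam", loss="mse")
    return model


model = build_mixed_model(num_features=10)
model.summary()
```

Training such a model takes a list of two arrays as input, one per branch, with rows aligned so that each tabular row and image belong to the same post.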

Mean Absolute Percentage Difference: 55.22%

Root Mean Square Error: 0.08021

This model gave us popularity predictions within 0.25 of the actual popularity for over 60% of our data. The line of best fit also improved dramatically. We were very happy with our final RMSE value of 0.08021. In RMSE terms, our predictions were off by about 0.08 in our popularity metric, which we consider a very good result.

Figure 17: Line of best fit of our mixed input model

Here is a table of some of our predictions.

Figure 18: Predictions of our mixed input model

Conclusion and Future Work

For this project, the best results were delivered by the MLP and the two-branch neural network. The resulting RMSE was significantly better than the benchmark scores XGBoost gave even when it knew the number of likes on a post. By combining the methods described above, we were able to achieve a very good predictive model. Still, there are a few avenues we could have explored further but, due to time constraints, were not able to. For example, translating text from various other languages to English introduced noise. Limiting our search to English posts could have reduced this noise. Another step we could have taken to reduce variability in our data was to exclude some types of accounts, such as the most popular ones. Because of this natural variability in the data and features, it was difficult for any model to predict the popularity of a post exactly. Finally, we could have generated further features for the NLP task from hashtags and mentions to see whether they affected the number of likes the posts received.

By Dylan Bray, Hassan Chughtai, Numan Gilani, Simon Xie, Charlie Yeng, and Gui Zamorano