Zerotogans Series 2: Predicting Twitter Buzz Using PyTorch

Cassie Guo
3 min read · Jun 5, 2020

--

When it comes to machine learning 101, everyone has to start with a regression model. In this post, we use one to answer the question:

Can we use Twitter topic statistics to evaluate topic popularity?

image source (https://images.app.goo.gl/4gFvadPppcokbakR6)

Before we dive into the pool of linear regression, STOP.

Let’s first review the Safety Measures of Regressions:

What are the assumptions that we have to make before doing regressions?

  • The Gauss-Markov theorem explains the basic assumptions behind Ordinary Least Squares (OLS), which is the backbone of the Generalized Linear Model (GLM).
  • These assumptions include: linearity between X and Y, constant variance of the errors, no perfect collinearity, zero conditional mean of the errors, and normally distributed error terms.
  • For GLMs these assumptions are followed more loosely, depending on the purpose of your regression.
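For the linear model, these assumptions can be written compactly in standard notation (with y the response, X the design matrix of k predictors, β the coefficients, and ε the error term):

```latex
y = X\beta + \varepsilon,\qquad
\mathbb{E}[\varepsilon \mid X] = 0,\qquad
\operatorname{Var}(\varepsilon \mid X) = \sigma^2 I,\qquad
\operatorname{rank}(X) = k
```

The zero-conditional-mean and constant-variance conditions are the Gauss-Markov ones; normality of ε is the extra assumption needed for exact inference (confidence intervals, t-tests), not for OLS to be the best linear unbiased estimator.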

1. First peek of the data

The original dataset was used by researchers to understand the dynamics of buzz on social media. It collects statistics about each topic on Twitter.

Examples of the statistics:

  • the number of discussions created at time step t involving the instance’s topic;
  • a measure of the attention paid to the instance’s topic on a social media platform;
  • the number of authors interacting with the instance’s topic at time t.

In our case, we will use a proxy for popularity (buzz), the Number of Active Discussions (NAD), as Y, and all other statistical measurements as Xs in the regression.

These exploratory plots gave us a better idea of which factors could impact NAD:

Among these features, number of created discussions (NCD) is a strong indicator of NAD. This is not surprising.

2. From numpy to tensor

This is a baby step towards a great adventure!

Here we use DataLoader() with a batch_size to create an iterable data sampler from the TensorDataset we created. This helps us avoid manually creating batches and keeping track of them. No sweat for the downstream process!
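A minimal sketch of this step; the arrays, their shapes, and the batch size here are placeholders for illustration, not the actual Buzz dataset:

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Placeholder arrays standing in for the topic statistics (X) and NAD (Y)
inputs = np.random.rand(100, 11).astype(np.float32)   # 100 topics, 11 features
targets = np.random.rand(100, 1).astype(np.float32)   # 100 NAD values

# Convert the numpy arrays to tensors and pair them up in a TensorDataset
dataset = TensorDataset(torch.from_numpy(inputs), torch.from_numpy(targets))

# DataLoader yields shuffled mini-batches of (x, y) pairs for training
batch_size = 16
train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

for xb, yb in train_loader:
    # each full batch has shape [16, 11] for x and [16, 1] for y
    break
```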

3. PopularityModel Class

Following an object-oriented pattern, we can write our model as a class to help with compartmentalization and reuse. What should be included in our PopularityModel class?

  • A constructor: to initialize the instance
  • A forward() method: to generate predictions from a batch of inputs
  • A training method: to train the model and calculate the loss
  • A validation method: to validate the model and calculate the loss
  • A validation ending method: to aggregate the validation results from each batch and get the average loss

The class looks like this:
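A sketch of such a class, assuming mean squared error as the loss and 11 input features; the method names mirror the bullets above but may differ from the original notebook:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PopularityModel(nn.Module):
    def __init__(self, input_size, output_size=1):
        super().__init__()
        # constructor: a single linear layer is enough for linear regression
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, xb):
        # generate predictions for a batch of inputs
        return self.linear(xb)

    def training_step(self, batch):
        # compute the training loss for one batch
        xb, yb = batch
        return F.mse_loss(self(xb), yb)

    def validation_step(self, batch):
        # compute the validation loss for one batch
        xb, yb = batch
        return {'val_loss': F.mse_loss(self(xb), yb)}

    def validation_epoch_end(self, outputs):
        # aggregate the per-batch validation losses into one average
        batch_losses = [out['val_loss'] for out in outputs]
        return {'val_loss': torch.stack(batch_losses).mean().item()}
```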

4. Fit Method

After we instantiate the PopularityModel() class, what we need is a function to:

  • take the training data, model, learning rate, and optimizer as input
  • call the train method of the model
  • compute the loss
  • compute the gradient
  • update the parameter using the calculated gradient
  • reset the gradients to zero: this step is critical since PyTorch accumulates gradients

We choose Stochastic Gradient Descent (SGD) as our optimizer; it performs gradient descent on each mini-batch, which speeds up optimization.

The code is as below:
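A sketch of such a fit function, assuming the model exposes the training and validation methods described in section 3 (the exact code in the original notebook may differ):

```python
import torch

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # training phase: one pass over the mini-batches
        for batch in train_loader:
            loss = model.training_step(batch)   # compute the loss
            loss.backward()                     # compute the gradients
            optimizer.step()                    # update the parameters
            optimizer.zero_grad()               # reset gradients to zero
        # validation phase: average the loss over the validation batches
        outputs = [model.validation_step(batch) for batch in val_loader]
        result = model.validation_epoch_end(outputs)
        history.append(result)
    return history
```

Because fit returns one aggregated result per epoch, the returned history list is exactly what we plot to track the fitting process.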

And we can track the history of our fitting process.

5. Try different learning rate

Choosing the learning rate deliberately is key to successful training. Even though the loss function of linear regression is convex, the learning process can still go wrong in both directions.

If the learning rate is too small, it may take many more epochs to reach the optimal solution, resulting in slow training; if the learning rate is too large, you may overshoot the minimum when taking bigger steps, which results in an oscillating (or even diverging) solution.

As the graph below shows, when we use a learning rate of 1e-6, the loss actually increases a lot after several epochs. We don’t want this to happen because we are beating around the bush!

Learning rate comparison
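To see both failure modes concretely, here is a toy comparison on synthetic data (fitting y = 2x + 1), not the Buzz dataset; train_with_lr is a helper invented for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_with_lr(lr, epochs=20, seed=0):
    """Fit y = 2x + 1 with plain SGD and record the loss per epoch (toy data)."""
    torch.manual_seed(seed)                      # same init for every run
    x = torch.linspace(-1, 1, 50).unsqueeze(1)
    y = 2 * x + 1
    model = nn.Linear(1, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    history = []
    for _ in range(epochs):
        loss = F.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        history.append(loss.item())
    return history

small = train_with_lr(1e-4)           # too small: the loss barely moves
good = train_with_lr(1e-1)            # reasonable: the loss drops steadily
big = train_with_lr(10.0, epochs=5)   # too large: the loss explodes
```

Plotting the three histories side by side reproduces the kind of comparison shown in the graph above.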

6. Prediction

Here we take a single example from the validation set and predict its result. We have to use the unsqueeze function to insert a dimension at position 0 so that the model views it as a batch of one.
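A minimal sketch of this step, with a plain nn.Linear standing in for the trained PopularityModel (11 input features assumed, matching the topic statistics):

```python
import torch
import torch.nn as nn

# Stand-in for the trained model: 11 input features -> 1 output (NAD)
model = nn.Linear(11, 1)

def predict_single(x, model):
    xb = x.unsqueeze(0)          # [11] -> [1, 11]: a batch containing one example
    preds = model(xb)            # output shape [1, 1]
    return preds[0].detach()     # drop the batch dimension: shape [1]

x = torch.rand(11)               # one example's topic statistics
pred = predict_single(x, model)
```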

The entire notebook can be found below. Happy regression!

--


Cassie Guo

Data Scientist; write about data culinary (data: ingredients 🍗🥬🧂🍅; model: recipe 📖; results: delicious dish🥘) and other shenanigans