Training a logistic regression on tweets

In this example, we’ll train a logistic regression to classify tweets using only their natural language text. We’ll only need about 800 tweets per user account.

Setup

To run this example, you will need the following packages.

install.packages(c("dplyr", "ROCR", "jsonlite"))

Step 1: Embedding all tweets

We’ll use a collection of about 800 tweets each from Bill Gates and Kanye West and train a logistic regression to predict, given a tweet, which account it belongs to. To do that, we’ll first load the tweets that ship with the basilica package.

library(jsonlite)
bill <- fromJSON(system.file("extdata/twitter/billgates.json", package="basilica"))
kanye <- fromJSON(system.file("extdata/twitter/kanyewest.json", package="basilica"))

Now that we’ve loaded the JSON files, we can embed the text of these tweets using Basilica.

library(basilica)
conn <- connect("05e19f1c-39de-ed9c-ae42-feab42f5f84d")

# column 7 holds the tweet text
embeddings <- rbind(
  embed_sentences(bill[, 7], conn = conn),
  embed_sentences(kanye[, 7], conn = conn)
)
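
Before moving on, it’s worth a quick sanity check that we got one embedding row per tweet (assuming, as the code above does, that embed_sentences returns one row per input sentence).

# there should be one embedding row per tweet
dim(embeddings)
nrow(bill) + nrow(kanye) # should match the number of rows reported above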

Step 2: Running PCA + Cleaning Data

Now that we have these embeddings, we’ll run PCA and keep the 100 features that explain the most variance. We’ll also add a column to the matrix with the account each tweet belongs to.

# run PCA on the embedding matrix
pca <- prcomp(t(embeddings), center = TRUE, scale. = TRUE)
# keep the first 100 components as features (one row per tweet)
features <- pca$rotation[, 1:100]

# label Bill Gates tweets as 1 and Kanye West tweets as 0
type <- c(integer(dim(bill)[1]) + 1, integer(dim(kanye)[1]))
# attach the labels, shuffle the rows, and convert to a data frame
features <- cbind(type, features)
features <- data.frame(features[sample.int(nrow(features)), ])
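
If you’re curious how much of the variance those 100 components actually capture, you can check it directly from the PCA object. This is just a diagnostic and isn’t needed for the rest of the example.

# proportion of variance explained by the first 100 components
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
sum(var_explained[1:100])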

Step 3: Training the model

Finally, we can train our model. To do that, we’ll first split the data into training and test sets.

library(dplyr)
# hold out 20% of the shuffled rows for testing
train_data <- sample_frac(features, 0.8)
train_index <- as.numeric(rownames(train_data))
# rows not sampled into the training set become the test set
test_data <- features[-train_index, ]
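
It can also be worth confirming that both accounts are represented in each split (recall that 1 marks Bill Gates and 0 marks Kanye West):

# class balance in the two splits
table(train_data$type)
table(test_data$type)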

model <- glm(type ~ ., data = train_data, family = "binomial")
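
If you want to inspect the fitted model itself before evaluating it, the usual summary() call works here as for any glm:

# coefficients and deviance of the fitted logistic regression
summary(model)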

Step 4: Verifying Results

After training the model, we can check how well it performs by taking a look at the confusion matrix.

# predicted probability that each test tweet belongs to Bill Gates (type = 1)
preds <- predict(model, newdata = test_data, type = "response")
table(test_data$type, preds > 0.5)
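
From the same predictions you can also compute a plain accuracy figure, i.e. how often the 0.5-thresholded prediction matches the true label:

# overall accuracy on the test set
mean(as.integer(preds > 0.5) == test_data$type)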

library(ROCR)
ROCRpred <- prediction(preds, test_data$type)
ROCRperf <- performance(ROCRpred, 'tpr','fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2, 1.7))
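
The ROC curve can also be summarised as a single number, the area under the curve, using ROCR’s "auc" performance measure:

# area under the ROC curve on the test set
auc <- performance(ROCRpred, measure = "auc")
auc@y.values[[1]]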

You have now trained a logistic regression using only the natural language text of the tweets and about 800 data points per category, and obtained an R squared of about 0.80.