Machine learning classification: the success of Kickstarter tech projects

As of April 2019, over 400,000 projects have been launched on Kickstarter. With crowdfunding becoming an increasingly popular way of raising capital, I thought it would be interesting to explore the data behind Kickstarter projects and to apply a machine learning model that predicts whether or not a project will be successful based on its category and fundraising target.

Obtaining the data & exploratory analysis

The data comes from a Kaggle dataset and covers projects up until January 2018. No web scraping or data cleaning was required (which felt like a bit of a cop-out); I simply downloaded the CSV straight from Kaggle.
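Loading it is a one-liner. A minimal sketch, assuming the CSV keeps its Kaggle file name of ks-projects-201801.csv:

# read the Kaggle CSV (file name assumed from the dataset download)
data <- read.csv("ks-projects-201801.csv", stringsAsFactors = FALSE)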

Many types of project are funded on Kickstarter, so I thought it made sense to first compare how much different types of projects try to raise, and how successful they are in doing so. I filtered for only “successful” or “failed” projects, choosing not to worry about those that had been cancelled or suspended, were undefined, or were still live.

After creating a binary variable for success/fail, I grouped the data by project category (using group_by) to find the average fundraising target and success rate for each group, then ordered the categories by ascending average target and used ggplot2 to display the results.

# keep only successful or failed projects
data <- data[data$state %in% c("successful", "failed"), ]
# make a dummy variable based on success/fail
data$success <- as.numeric(data$state == "successful")
# activate libraries
library(ggplot2)
library(dplyr)
# group by project category
grouped_data <- data %>%
  group_by(main_category) %>%
  summarise(
    target = mean(goal),
    success_rate = mean(success)
  )
# change to data frame format
grouped_data <- as.data.frame(grouped_data)
# order categories by ascending average target
grouped_data$main_category <- factor(
  grouped_data$main_category,
  levels = grouped_data$main_category[order(grouped_data$target)]
)
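The chart itself is then straightforward. A minimal sketch of the ggplot2 call, assuming a simple bar chart of success rate with the categories in the order set above:

# bar chart of success rate by category, ordered by average target
ggplot(grouped_data, aes(x = main_category, y = success_rate)) +
  geom_col() +
  coord_flip() +
  labs(x = "Category", y = "Success rate")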

Perhaps unsurprisingly, it seems the most success comes when your initial fundraising target is low. The dance, theatre & comics categories top the list of best success rates, and all have fundraising targets that average out at less than $30,000.

On the other hand, 2 of the least successful categories find themselves in the top 3 when it comes to mean fundraising goal. Technology, with the highest average goal of $114,700, is also the least successful category with an average success rate of just 23.8%. Journalism projects weren’t far behind, with 24.4%.

Diving deeper into the tech project stats, a geographical breakdown might give us a little more colour on the poor success rate. Splitting the projects into US-based and non-US-based suggests that US projects experience marginally more success (26.2% vs. 23.8%). Running a simple t-test confirms (with p-value = 1.36e-08) that this difference is statistically significant.

# splitting US from non-US projects
us_tech_projects <- filter(tech_projects, country == "US")
non_us_tech_projects <- filter(tech_projects, country != "US")
# checking for significantly different mean success rates
t.test(us_tech_projects$success, non_us_tech_projects$success,
       alternative = "two.sided", conf.level = 0.95)

Whether it’s Kickstarter brand awareness in the US or just the impact of a more affluent population in general, if you set up your tech project in the US you’ve got a better chance of succeeding than if you did so elsewhere.

If we run the same t-test as above, but this time splitting projects by whether their original goal was below or above $50,000, we get another significant difference between the groups (a sketch of this test follows below). Aim to raise below $50,000 with your project, and your chances of success sit at 26%. Over $50,000, that number drops to just 16%. Obviously, the relationship isn’t truly binary, but it nevertheless shows the cost of setting an overly ambitious goal.
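For completeness, a minimal sketch of that test, assuming the usd_goal_real column that’s used later in the post:

# split tech projects by fundraising goal
low_goal <- filter(tech_projects, usd_goal_real < 50000)
high_goal <- filter(tech_projects, usd_goal_real >= 50000)
# checking for significantly different mean success rates
t.test(low_goal$success, high_goal$success)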

Within the tech category, there are plenty of sub-categories:

The popularity of tech project sub-categories
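The counts behind that chart can be reproduced with a one-liner; a quick sketch:

# number of projects per tech sub-category, most popular first
sort(table(tech_projects$category), decreasing = TRUE)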

If we skirt past the rather confusingly named “Technology” sub-category, some of the most popular projects are Apps, Web, Software & Hardware. This doesn’t tell us a whole lot though. The following code prints out the success rate of each individual sub-category:

# mean success rate for each sub-category, sorted ascending
sort(tapply(tech_projects$success, tech_projects$category, mean))

This offers a lot more insight, as we see that some of the most successful categories include 3D Printing (43%) and Robots (46%). These tower over some of the poorest performing sub-categories such as Apps (7%), Web (8%) & Software (14%).

What about project length? If you max out your project duration by setting a deadline 60 days from your launch date, are you more likely to hit your target?

To find out, I had to split the “launched” variable from its date-time format into 2 separate variables: “launch_date” & “launch_time”. From here, I could simply subtract “launch_date” from “deadline” and create a new variable that tracks project duration.

# split "launched" variable into date and time variables
library(tidyr)
tech_projects <- separate(data = tech_projects, col = launched, into = c("launch_date", "launch_time"), sep = " ")
# subtract from one another
tech_projects$date_diff <- as.Date(as.character(tech_projects$deadline), format="%Y-%m-%d")-
as.Date(as.character(tech_projects$launch_date), format="%Y-%m-%d")
# group by sub-category before plotting
grouped_data_tech <- tech_projects %>%
group_by(category) %>%
summarise(
target = mean(date_diff),
success_rate = mean(success)
)

Using ggplot2, I plotted the scatter diagram below:
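A minimal sketch of the plotting call, assuming the grouped data frame built above (duration is a difftime, hence the as.numeric):

# scatter plot of average campaign duration against success rate
ggplot(grouped_data_tech, aes(x = as.numeric(duration), y = success_rate)) +
  geom_point() +
  labs(x = "Average campaign duration (days)", y = "Success rate")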

Relationship between campaign duration & success rate

Running a quick linear regression model on these data revealed that, unsurprisingly considering the plot, there’s no relationship between success and project duration.
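A quick sketch of that check, assuming the regression was run on the grouped sub-category data:

# regress success rate on average duration across sub-categories
summary(lm(success_rate ~ as.numeric(duration), data = grouped_data_tech))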

Using classification models to predict project success

The insights obtained so far are interesting, but I wanted to go further and build a couple of classification models that look at a project and determine whether it is likely to be successful. I’m going to compare the output of a support-vector machine (SVM) model with that of a logistic regression model.

Before starting, I need to recode my categorical “category” variable into multiple dummy variables that the models can understand.

# get the category variable
tech_projects_dummy <- select(tech_projects, category)
# split into multiple dummy variables (dummy.data.frame comes from the dummies package)
library(dummies)
tech_projects_dummy <- dummy.data.frame(tech_projects_dummy, sep = ".")
# reattach variables of interest to tech_projects_dummy
tech_projects_dummy$goal <- tech_projects$usd_goal_real
tech_projects_dummy$success <- tech_projects$success
# encode dependent variable as a factor
tech_projects_dummy$success <- factor(tech_projects_dummy$success, levels = c(0, 1))

With the above code, we have a data frame containing a dummy variable for each sub-category of tech project, the fundraising target of each project, and the binary dependent variable, “success”. The final line simply ensures our dependent variable is encoded as a factor.

Next step, split the data frame so we have one for training the model and another to test its results. This can easily be done in a few lines of code:

# splitting dataset into training and test set
library(caTools)
set.seed(123)
split <- sample.split(tech_projects_dummy$success, SplitRatio = 0.75)
training_set <- subset(tech_projects_dummy, split == TRUE)
test_set <- subset(tech_projects_dummy, split == FALSE)

Next, we apply feature scaling so that each variable carries comparable weight. This matters especially for the SVM: the fundraising goal spans a huge range and would otherwise swamp the 0/1 dummy variables.

# feature scaling everything but the dependent variable (column 18 is "success")
training_set[-18] = scale(training_set[-18])
test_set[-18] = scale(test_set[-18])

From here, all that’s left to do is to apply our classifiers to our training sets and let them do their work!

# fitting logistic regression to the training set
l_classifier = glm(formula = success ~ .,
                   family = binomial,
                   data = training_set)
# fitting SVM to the training set
library(e1071)
svm_classifier = svm(formula = success ~ .,
                     data = training_set,
                     type = 'C-classification',
                     kernel = 'linear')
# predicting the logistic regression test set results
l_pred = predict(l_classifier, type = 'response', newdata = test_set[-18])
# create binary variable from the logistic regression predictions
l_pred = ifelse(l_pred > 0.5, 1, 0)
# predicting the SVM test set results
svm_pred = predict(svm_classifier, newdata = test_set[-18])

Now that we have our predictions l_pred and svm_pred, we can build a confusion matrix for each model, comparing the predictions against the actual values of the success variable in our test set.

# logistic regression confusion matrix
l_cm = table(test_set[,18], l_pred)
# SVM confusion matrix
svm_cm = table(test_set[,18], svm_pred)
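To turn those confusion matrices into accuracy figures, divide the correctly classified projects (the diagonal of each table) by the total number of predictions:

# overall accuracy for each model
sum(diag(l_cm)) / sum(l_cm)
sum(diag(svm_cm)) / sum(svm_cm)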

Of the 6,762 projects in our test set, the logistic regression model correctly predicted 5,166 (76.4% accuracy) and the SVM model 5,131 (75.9% accuracy). The logistic regression model just comes out on top, but there is clearly still more fine-tuning to do here in order to build a more reliable indicator of tech project success on Kickstarter.

Thanks so much for reading! I’m no expert so I welcome all feedback and constructive criticism in the comments.