Kickstarter Investigation
Data 501 Project - Kickstarter Analysis
DATA501 Foundations of Data Science Using R
Travis McKenzie
2023-12-12
ABSTRACT
I received and evaluated a broad data set containing numerous variables associated with Kickstarter campaigns from 2009 through 2020, and applied the data science process to it. Exploration identified the primary questions of interest, which I then addressed. This required extensive wrangling and observation, as well as adapting hypothesis tests and modeling. Please enjoy the findings of this investigation. The goal is to provide context and insight into how a Kickstarter campaign might be broadly considered successful.
INTRODUCTION
This is a massive data set with numerous variables and limited contextual information. The data and methods are described in more detail later in this analysis. A Kickstarter campaign is an effort to raise money for a stated project. These projects fall within 15 primary categories, with subcategories that may be assigned to better delineate the type of project being pursued. Campaigns seek backers to fund their projects. Generally, a project can be launched if its campaign reaches its funding goal through monetary commitments pledged by backers. There are also situations where a project receives enough in pledges to move forward even while falling short of the overall goal. Even with some funding secured, some projects are still canceled, suspended or fail. Given the wide range and style of the campaigns, it can be difficult to quantify when a project has been truly successful. My analysis aims to address what success means for a Kickstarter campaign. It also aims to shed light on which category or categories might hold the best opportunities for success. As an introduction, here is the head (or top) of our data set.
This should grant you an introductory view into how the data is structured:
## # A tibble: 6 × 22
## CASEID NAME PID CATEGORY CATEGORY_ID SUBCATEGORY SUBCATEGORY_ID
## <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 MASKED BY ICPSR 2.14e9 Film & … 11 Science Fi… 301
## 2 2 MASKED BY ICPSR 1.50e9 Film & … 11 Fantasy 296
## 3 3 MASKED BY ICPSR 9.53e8 Technol… 16 Software 51
## 4 4 MASKED BY ICPSR 1.37e9 Publish… 18 Publishing 18
## 5 5 MASKED BY ICPSR 1.72e9 Art 1 Illustrati… 22
## 6 6 MASKED BY ICPSR 1.12e9 Journal… 13 Video 360
## # ℹ 15 more variables: PROJECT_PAGE_LOCATION_NAME <chr>,
## # PROJECT_PAGE_LOCATION_STATE <chr>, PROJECT_PAGE_LOCATION_COUNTY <chr>,
## # UID <dbl>, LAUNCHED_DATE <chr>, DEADLINE_DATE <chr>,
## # PROJECT_CURRENCY <chr>, GOAL_IN_ORIGINAL_CURRENCY <dbl>,
## # PLEDGED_IN_ORIGINAL_CURRENCY <dbl>, GOAL_IN_USD <dbl>,
## # PLEDGED_IN_USD <dbl>, BACKERS_COUNT <dbl>, STATE <chr>, URL_NAME <chr>,
## # pledgeVgoal <dbl>

Primary Questions of Interest
The “Film & Video” category contained the most campaigns, 75,808 in total. The “Dance” category contained the fewest, at 4,298. We can see those totals here:
## # A tibble: 15 × 2
## CATEGORY categoryTotal
## <chr> <int>
## 1 Film & Video 75808
## 2 Music 63486
## 3 Games 56700
## 4 Publishing 52082
## 5 Technology 44706
## 6 Design 43503
## 7 Art 41455
## 8 Fashion 33066
## 9 Food 30758
## 10 Comics 17560
## 11 Photography 12646
## 12 Theater 12349
## 13 Crafts 11917
## 14 Journalism 5865
## 15 Dance 4298
Using bar graphs, histograms and point graphs, we can observe some time-based trends in our data.
Using our bar graph, we can easily observe that July generated the most attempted Kickstarter launches and that December had the fewest.
Our histogram of all data from 2009 - 2020 very clearly shows that total launches peaked around 2015.
Using our point graph, we can quickly see which years resulted in the most total Kickstarter launches. The year 2015 was a clear peak, and the number of launches has decreased relatively consistently since then. The number of launches in 2020, our last year of available data, was the lowest since 2011.
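As a rough illustration of how launch counts by month and year might be tallied before plotting, here is a minimal base R sketch. The report does not show its code, so the toy data frame and the `%Y-%m-%d` date format are assumptions; the real LAUNCHED_DATE strings may use a different format.

```r
# Toy stand-in for the real data (hypothetical rows; date format is assumed).
kick <- data.frame(
  LAUNCHED_DATE = c("2015-07-04", "2015-07-19", "2015-12-01",
                    "2016-07-10", "2020-03-15"),
  stringsAsFactors = FALSE
)

# LAUNCHED_DATE is stored as character, so parse it into Date first.
launched <- as.Date(kick$LAUNCHED_DATE, format = "%Y-%m-%d")

# Extract the calendar month (as an ordered factor) and the year.
kick$launchMonth <- factor(month.name[as.integer(format(launched, "%m"))],
                           levels = month.name)
kick$launchYear <- as.integer(format(launched, "%Y"))

# Launch counts per calendar month and per year.
launchesByMonth <- table(kick$launchMonth)
launchesByYear <- table(kick$launchYear)

# Month and year with the most launches in this toy sample.
busiestMonth <- names(which.max(launchesByMonth))
busiestYear <- names(which.max(launchesByYear))
```

Tables like `launchesByMonth` and `launchesByYear` can be passed straight to `barplot()` or converted with `as.data.frame()` for plotting.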
The data also records whether or not each Kickstarter campaign was successful, as seen in the data sample above. The character variable “STATE” indicates whether a campaign was deemed “successful”, as opposed to “canceled”, “failed” or “suspended”. Using the codebook, I was able to see that 38.4% of campaigns were successful and the remaining 61.6% were canceled, failed or suspended. I can replicate this in my analysis by grouping by CATEGORY and then by STATE. This gives me a very nice view of the success ratios within each CATEGORY, which I can then plot. I can also use it to rank the categories by percentage of success. Surprisingly, Dance has the highest percentage of success even though it had the lowest raw quantity of campaigns.
If I quantify success purely as the percent of successful campaigns within each category, then Dance performs the highest and Technology the lowest. This is the conclusion I will carry forward when applying my hypotheses for evaluating success, and I will also use it to focus my modeling efforts.
However, as these views show, trying to quantify success is not so simple. Please view these exhibits and my further exploration of this concept.
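The grouping by CATEGORY and then STATE described above might look roughly like the following. This is a sketch on a hypothetical eight-row sample, written in base R; the tibble output in this report suggests dplyr was actually used, so treat this as an equivalent illustration, not the original code.

```r
# Hypothetical sample rows standing in for the full data set.
kick <- data.frame(
  CATEGORY = c("Dance", "Dance", "Dance",
               "Technology", "Technology", "Technology",
               "Technology", "Technology"),
  STATE = c("successful", "successful", "failed",
            "successful", "failed", "failed", "canceled", "suspended"),
  stringsAsFactors = FALSE
)

# Count campaigns for every CATEGORY x STATE combination.
stateCounts <- table(kick$CATEGORY, kick$STATE)

# Convert each row (category) to percentages of that category's total.
statePercent <- round(100 * prop.table(stateCounts, margin = 1), 1)

# Keep only the "successful" share and rank categories by it.
successRate <- sort(statePercent[, "successful"], decreasing = TRUE)
```

Using `margin = 1` makes each category's percentages sum to 100, which is what allows a small category such as Dance to rank first despite its low raw count.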
## # A tibble: 15 × 4
## # Groups: CATEGORY [15]
## CATEGORY STATE total totalPercent
## <chr> <chr> <int> <dbl>
## 1 Dance successful 2650 61.7
## 2 Comics successful 10666 60.7
## 3 Theater successful 7408 60.0
## 4 Music successful 31854 50.2
## 5 Art successful 18955 45.7
## 6 Games successful 24028 42.4
## 7 Design successful 17000 39.1
## 8 Film & Video successful 28597 37.7
## 9 Publishing successful 17831 34.2
## 10 Photography successful 4175 33.0
## 11 Fashion successful 9615 29.1
## 12 Crafts successful 3061 25.7
## 13 Food successful 7844 25.5
## 14 Journalism successful 1351 23.0
## 15 Technology successful 9423 21.1
This is an additional exploration of “success” while evaluating and attempting to map out my analysis. This highlights the litany of different directions that this evaluation could have pursued.
From a raw numerical standpoint, Music has the greatest total of successful campaign outcomes at 31,854, overshadowing Dance's 2,650 by a factor of about 12. By this measure alone, Music is the most successful category and Journalism the least. However, this measure considers neither the ratio of failure to success nor the extent of success. To remedy this, I conducted further analysis.
Earlier in the preparation of the data, I created a pledged vs. goal ratio (referred to henceforth as PVG) as yet another marker of potential success for the campaigns. It is interesting to note in my statistical summary that three categories had campaigns that failed to meet their goal but were still considered successful: Film & Video, Design and Comics. This calls into question the nature of the “successful” STATE. Within these categories, there could be opportunities to proceed with the project even with incomplete funding of the goal. It also raises the question of whether the STATE variable is the best indicator of true success when evaluating a campaign. The category of Games has the highest average PVG. I interpret this to mean that pledges for Games campaigns tend to exceed their funding goals by the largest margin, on average. Successful Games campaigns also have the highest median PVG, with a notable gap between Games and the next highest median. Because the median resists outliers, this helps clarify the middle of the data.
Attempting to evaluate all categories without limiting to the successful campaigns allows me to view the success in contrast with failure. This provides interesting context in regard to interpreting and defining Kickstarter success.
For the purpose of this analysis, I am choosing to define success by first limiting to campaigns deemed successful under the STATE variable. Within that filtered group, I am prioritizing the median PVG due to its resistance to the numerous outliers within and across categories. By that measure, I find the Games category to be the most successful when comparing what was pledged against the goal being sought. Within this alternative analysis, Dance would be considered the least successful, with the lowest median PVG among successful campaigns. This conflicts with my first conclusion related to success and merits enhanced techniques and hypothesis testing. I now have established where to direct the focus of my hypothesis testing and modeling approach.
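A summary of mean and median PVG among successful campaigns could be computed along these lines. The rows below are invented for illustration, and base R is used in place of whatever the original analysis ran; the one extreme Games value is there to show why the median is the more stable measure.

```r
# Invented rows; one extreme Games campaign acts as an outlier.
kick <- data.frame(
  CATEGORY = c("Games", "Games", "Games",
               "Dance", "Dance", "Dance", "Dance"),
  STATE = c("successful", "successful", "successful",
            "successful", "successful", "failed", "successful"),
  pledgeVgoal = c(1.5, 2.0, 40.0, 1.0, 1.1, 0.2, 1.3),
  stringsAsFactors = FALSE
)

# Keep only campaigns whose STATE is "successful".
successful <- subset(kick, STATE == "successful")

# Mean and median PVG per category.
meanPVG <- tapply(successful$pledgeVgoal, successful$CATEGORY, mean)
medianPVG <- tapply(successful$pledgeVgoal, successful$CATEGORY, median)

pvgSummary <- data.frame(
  CATEGORY = names(meanPVG),
  meanPVG = as.numeric(meanPVG),
  medianPVG = as.numeric(medianPVG)
)

# Rank categories by median PVG, highest first.
pvgSummary <- pvgSummary[order(-pvgSummary$medianPVG), ]
```

In this toy sample the single 40.0 campaign pulls the Games mean to 14.5 while its median stays at 2.0, the same kind of gap between mean and median that appears in the table that follows.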
## # A tibble: 15 × 4
## CATEGORY successful meanPVG medianPVG
## <chr> <int> <dbl> <dbl>
## 1 Games 24028 18.0 2.26
## 2 Music 31854 15.5 1.11
## 3 Film & Video 28597 12.0 1.09
## 4 Technology 9423 11.5 1.77
## 5 Comics 10666 10.0 1.46
## 6 Crafts 3061 7.86 1.42
## 7 Design 17000 6.54 1.80
## 8 Publishing 17831 5.71 1.18
## 9 Art 18955 5.17 1.27
## 10 Fashion 9615 4.94 1.34
## 11 Food 7844 3.99 1.13
## 12 Journalism 1351 2.44 1.13
## 13 Photography 4175 1.88 1.18
## 14 Theater 7408 1.53 1.08
## 15 Dance 2650 1.40 1.08

DATA AND METHODOLOGY
This is global Kickstarter data from 2009 - 2020. Each entry is a different Kickstarter campaign. Each has a category (one of 15) and a subcategory (one of 161), and each category and subcategory has a respective numerical ID. The data also contains the project location name, location state and location county. There is a unique identifier number (UID) associated with each entry. There are dates for when the campaign was launched as well as its deadline. Each project has a financial goal and an amount that was pledged, both expressed in the original currency and in US dollars. BACKERS_COUNT expresses the number of backers associated with the project. Judging by how it compares with the amount pledged, this appears to be the number of contributors who pledged. The names and URLs of each campaign are masked to protect respondent anonymity and prevent disclosure risk. There is a STATE variable which indicates whether each campaign was successful, canceled, failed or suspended.
The data is gathered by way of observational study. The primary response variable was the amount of money pledged for the respective campaign.
There were limitations to the quality of the data. The lack of names and URLs limits specificity. Many of the variables were entered as characters. In particular, when attempting to evaluate the success of the campaigns by evaluating the goal and pledged amounts, the data had to be wrangled to modify the characters into numeric variables. This included removal of the dollar sign (which was more problematic than initially expected) and removing commas. When attempting to evaluate the pledge vs. goal ratio, there were also issues with there being campaigns that had no stated goal.
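The character-to-numeric cleanup described above might be sketched as follows. The sample strings and the helper name `toNumber` are hypothetical, and marking a zero goal as NA is only one possible way to handle campaigns with no stated goal; the report does not say which approach was taken. One common pitfall with the dollar sign is that `$` is a regular-expression metacharacter, so it has to be escaped or placed inside a character class.

```r
# Hypothetical raw values as they might appear before cleaning.
raw <- data.frame(
  GOAL_IN_USD = c("$5,000", "$1,200.50", "$0", "$10,000"),
  PLEDGED_IN_USD = c("$7,500", "$600.25", "$50", "$0"),
  stringsAsFactors = FALSE
)

# Strip dollar signs and commas, then convert to numeric.
# Inside [...] the "$" is matched literally, not as "end of string".
toNumber <- function(x) as.numeric(gsub("[$,]", "", x))

raw$GOAL_IN_USD <- toNumber(raw$GOAL_IN_USD)
raw$PLEDGED_IN_USD <- toNumber(raw$PLEDGED_IN_USD)

# Pledged-to-goal ratio. Dividing by a goal of 0 would give Inf or NaN,
# so those rows are set to NA here (an assumption, not the report's rule).
raw$pledgeVgoal <- ifelse(raw$GOAL_IN_USD > 0,
                          raw$PLEDGED_IN_USD / raw$GOAL_IN_USD,
                          NA_real_)
```

An unescaped pattern such as `gsub("$", "", x)` would leave the strings unchanged, because it matches the empty position at the end of each string instead of the dollar character.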
Location information has also been troublesome. The entries were not sufficiently constrained. Preference would have been a clear city and/or state, and country. This would have allowed for better opportunities to map the data and evaluate location trends.
## CASEID NAME PID CATEGORY
## Min. : 1 Length:506199 Min. :2.941e+03 Length:506199
## 1st Qu.:126551 Class :character 1st Qu.:5.372e+08 Class :character
## Median :253100 Mode :character Median :1.075e+09 Mode :character
## Mean :253100 Mean :1.074e+09
## 3rd Qu.:379650 3rd Qu.:1.610e+09
## Max. :506199 Max. :2.147e+09
##
## CATEGORY_ID SUBCATEGORY SUBCATEGORY_ID PROJECT_PAGE_LOCATION_NAME
## Min. : 1.0 Length:506199 Min. : 1.0 Length:506199
## 1st Qu.: 9.0 Class :character 1st Qu.: 22.0 Class :character
## Median :12.0 Mode :character Median : 34.0 Mode :character
## Mean :11.6 Mean :104.8
## 3rd Qu.:15.0 3rd Qu.:262.0
## Max. :26.0 Max. :396.0
##
## PROJECT_PAGE_LOCATION_STATE PROJECT_PAGE_LOCATION_COUNTY UID
## Length:506199 Length:506199 Min. :3.000e+00
## Class :character Class :character 1st Qu.:5.359e+08
## Mode :character Mode :character Median :1.072e+09
## Mean :1.073e+09
## 3rd Qu.:1.610e+09
## Max. :2.147e+09
##
## LAUNCHED_DATE DEADLINE_DATE PROJECT_CURRENCY
## Length:506199 Length:506199 Length:506199
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## GOAL_IN_ORIGINAL_CURRENCY PLEDGED_IN_ORIGINAL_CURRENCY GOAL_IN_USD
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 2000 1st Qu.: 48 1st Qu.: 2000
## Median : 5000 Median : 765 Median : 5000
## Mean : 50845 Mean : 16875 Mean : 41951
## 3rd Qu.: 16000 3rd Qu.: 4733 3rd Qu.: 15000
## Max. :100000000 Max. :481621841 Max. :166877652
##
## PLEDGED_IN_USD BACKERS_COUNT STATE URL_NAME
## Min. : 0 Min. : -2.00 Length:506199 Length:506199
## 1st Qu.: 45 1st Qu.: 2.00 Class :character Class :character
## Median : 748 Median : 14.00 Mode :character Mode :character
## Mean : 10792 Mean : 60.14
## 3rd Qu.: 4510 3rd Qu.: 59.00
## Max. :20338986 Max. :999.00
## NA's :10148
## pledgeVgoal
## Min. : 0.00
## 1st Qu.: 0.01
## Median : 0.18
## Mean : 4.02
## 3rd Qu.: 1.10
## Max. :194843.00
RESULTS
After a rigorous assessment of the data, I found that the Dance, Music and Games categories each showed indicators that might suggest a proclivity towards a successful Kickstarter campaign. I am primarily interested in whether or not there is a significant difference across categories, and across these focus categories in particular, in meeting or exceeding the goal with pledged funding.
I established two primary hypotheses.
Hypothesis A: my null hypothesis is that each category has the same pledged-to-goal (PVG) ratio on average when the campaign STATE is marked as successful. My alternative hypothesis is that there is a significant difference across categories in PVG. For this hypothesis, I use a Tukey HSD test. I need to evaluate the mean PVG across Dance, Music and Games to determine whether they have the same PVG on average, or whether there is a meaningful difference in this data.
My density graph allows me to view the distribution of the PVG. I then run my tests.
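A Tukey HSD comparison of mean PVG by category can be run in base R by fitting a one-way ANOVA with `aov()` and passing it to `TukeyHSD()`. The sketch below uses simulated values for three categories, so its numbers are not the report's; the object names are placeholders.

```r
set.seed(501)

# Simulated PVG values for three categories (illustrative only).
successful <- data.frame(
  CATEGORY = factor(rep(c("Dance", "Games", "Music"), each = 30)),
  pledgeVgoal = c(rlnorm(30, meanlog = 0.1, sdlog = 0.3),
                  rlnorm(30, meanlog = 0.8, sdlog = 0.9),
                  rlnorm(30, meanlog = 0.2, sdlog = 0.5))
)

# One-way ANOVA of PVG on category, then all pairwise Tukey comparisons.
pvgAov <- aov(pledgeVgoal ~ CATEGORY, data = successful)
pvgTukey <- TukeyHSD(pvgAov, "CATEGORY", conf.level = 0.95)

# Tidy the result: one row per pair with diff, lwr, upr and adjusted p-value.
tukeyTable <- as.data.frame(pvgTukey$CATEGORY)
tukeyTable$sizeComp <- rownames(tukeyTable)
tukeyTable <- tukeyTable[order(tukeyTable$`p adj`), ]
```

Three groups give choose(3, 2) = 3 pairwise rows. With all 15 categories the same call yields choose(15, 2) = 105 rows, which matches the 105-row table that follows.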
## # A tibble: 105 × 5
## sizeComp diff lwr upr p.adj
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Publishing-Games -6.20 -13.8 1.38 0.260
## 2 Games-Food 7.13 -1.71 16.0 0.283
## 3 Publishing-Music -5.82 -13.2 1.56 0.320
## 4 Music-Food 6.75 -1.92 15.4 0.343
## 5 Games-Fashion 6.64 -2.00 15.3 0.364
## 6 Music-Fashion 6.26 -2.20 14.7 0.433
## 7 Games-Art 5.76 -2.31 13.8 0.500
## 8 Music-Art 5.38 -2.50 13.3 0.580
## 9 Games-Design 5.41 -2.55 13.4 0.587
## 10 Technology-Games -5.22 -13.1 2.67 0.633
## # ℹ 95 more rows
Hypothesis B: my null hypothesis is that, across the categories of Dance, Music and Games, each is equally likely to have a successful STATE campaign. My alternative hypothesis is that they are not equally likely to result in a successful STATE campaign. For this, we will conduct a chi-squared test of homogeneity. This requires a table comparison, which lets us compare and contrast the observed counts, expected counts and residuals.
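The test can be reproduced from the observed counts alone. The sketch below rebuilds the 3 × 4 table by hand from the counts printed in this report (the name `kickChisq` is taken from the test output; how the table was originally constructed is not shown) and runs `chisq.test()` on it.

```r
# Observed counts copied from the report's output: rows are categories,
# columns are campaign STATE values.
kickChisq <- matrix(
  c( 204,  1427,  2650,  17,
    9686, 22565, 24028, 421,
    4598, 26805, 31854, 229),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Dance", "Games", "Music"),
                  c("canceled", "failed", "successful", "suspended"))
)

# Pearson chi-squared test of homogeneity across the three categories.
chisqResult <- chisq.test(kickChisq)

observedCounts <- chisqResult$observed
expectedCounts <- chisqResult$expected    # row total * column total / N
pearsonResiduals <- chisqResult$residuals # (observed - expected) / sqrt(expected)
```

Each expected count is the row total times the column total divided by the grand total; for Dance and successful that is 4298 × 58532 / 124484 ≈ 2020.9. A positive residual means more campaigns were observed in that cell than the null hypothesis predicts.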
##
## Pearson's Chi-squared test
##
## data: kickChisq
## X-squared = 3453.1, df = 6, p-value < 2.2e-16
##
##       canceled failed successful suspended
## Dance      204   1427       2650        17
## Games     9686  22565      24028       421
## Music     4598  26805      31854       229
##
##        canceled    failed successful suspended
## Dance  500.2203  1753.844   2020.907  23.02919
## Games 6598.9975 23137.029  26660.168 303.80531
## Music 7388.7822 25906.127  29850.925 340.16550
##
##        canceled    failed successful suspended
## Dance -13.24446 -7.804497   13.99400 -1.256376
## Games  38.00126 -3.760661  -16.12064  6.723730
## Music -32.46680  5.584663   11.59360 -6.027329
Lastly, considering these findings, we are going to evaluate the Dance category for a relationship to the output of USD pledged.
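A model of this shape, log of pledged dollars regressed on backer count and goal, might be fitted as sketched below. The data are simulated with made-up coefficients, so the estimates will not match the report's; only the model formula is taken from the output that follows. The variance inflation factor is computed by hand here to avoid assuming any particular package.

```r
set.seed(501)
n <- 500

# Simulated stand-in for successful Dance campaigns (values are invented).
kickDance <- data.frame(
  BACKERS_COUNT = rpois(n, lambda = 40) + 1,
  GOAL_IN_USD = round(runif(n, min = 500, max = 10000))
)
kickDance$PLEDGED_IN_USD <- exp(6 +
                                0.01 * kickDance$BACKERS_COUNT +
                                0.0001 * kickDance$GOAL_IN_USD +
                                rnorm(n, sd = 0.4))

# Multiple linear regression on the log of pledged dollars.
danceModel <- lm(log(PLEDGED_IN_USD) ~ BACKERS_COUNT + GOAL_IN_USD,
                 data = kickDance)

# VIF for a predictor is 1 / (1 - R^2) from regressing it on the others.
r2 <- summary(lm(BACKERS_COUNT ~ GOAL_IN_USD, data = kickDance))$r.squared
vif <- 1 / (1 - r2)

# With a log response, a coefficient b implies a 100 * (exp(b) - 1) percent
# change in pledged dollars per one-unit increase in the predictor.
pctPerBacker <- 100 * (exp(coef(danceModel)[["BACKERS_COUNT"]]) - 1)
```

With only two predictors the two VIFs are necessarily equal, which is consistent with the identical values of 2.01797 in the output. Read the same way, the reported BACKERS_COUNT coefficient of 3.676e-03 corresponds to roughly a 0.37% increase in pledged dollars per additional backer, holding the goal fixed.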
## `geom_smooth()` using formula = 'y ~ x'
##
## Call:
## lm(formula = log(PLEDGED_IN_USD) ~ BACKERS_COUNT + GOAL_IN_USD,
## data = kickDance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7234 -0.2109 0.1543 0.3670 1.5889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.354e+00 1.629e-02 451.40 <2e-16 ***
## BACKERS_COUNT 3.676e-03 2.750e-04 13.37 <2e-16 ***
## GOAL_IN_USD 1.182e-04 3.039e-06 38.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5628 on 2641 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.6499, Adjusted R-squared: 0.6497
## F-statistic: 2452 on 2 and 2641 DF, p-value: < 2.2e-16
## BACKERS_COUNT GOAL_IN_USD
## 2.01797 2.01797
## Analysis of Variance Table
##
## Response: log(PLEDGED_IN_USD)
## Df Sum Sq Mean Sq F value Pr(>F)
## BACKERS_COUNT 1 1073.82 1073.82 3390.8 < 2.2e-16 ***
## GOAL_IN_USD 1 478.99 478.99 1512.5 < 2.2e-16 ***
## Residuals 2641 836.38 0.32
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 4468.198
## Df Sum Sq Mean Sq F value Pr(>F)
## BACKERS_COUNT 1 1073.8 1073.8 3391 <2e-16 ***
## GOAL_IN_USD 1 479.0 479.0 1512 <2e-16 ***
## Residuals 2641 836.4 0.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness

CONCLUSION
In regard to hypothesis A, when comparing our 15 categories, there appears to be no statistical evidence that category predicts the pledge v. goal ratio (PVG); the smallest adjusted p-value among the pairwise comparisons shown is 0.260. We fail to reject the null: the data do not show category to be statistically significant for the amount of money pledged against the goal being sought.
In regard to hypothesis B, when comparing our 3 focus categories, there is a notable difference in the distribution of the STATE outcomes of canceled, failed, successful and suspended, and the result is statistically significant (p-value < 2.2e-16). Dance achieves more successful outcomes than expected, as does Music. Games, however, has fewer successful campaigns than expected in the observed data.
The findings are quite surprising, particularly when contrasted against the initial exploration of the data.
Dance appears to ultimately be one of the most reliably successful categories. Within that category, I created a multiple linear regression model to evaluate the best predictors of money pledged in US dollars. Both the backer count and the goal in US dollars turned out to be decent predictors for fitting a model to the data. However, there were outliers and leverage points impacting the model, and I removed a few of those to improve the fit. Ultimately, the predictive capacity of this particular data set has limits.
APPENDIX
The assumptions for the Tukey HSD test were met: 1. Observations are independent within and across groups 2. Each group's observations are normally distributed 3. There is homogeneity of variance across groups
The assumptions for the chi-squared test for homogeneity were met: 1. Expected counts of all cells >5 2. Each observation contributes to only 1 cell 3. Independent groups.
Professor Gore, I am generally satisfied with my investigation into this data set. I found it extremely challenging and thought provoking. I continued to iterate, then reiterate, and found myself going in circles trying to consider all of the statistical challenges that are present when trying to craft a cohesive and thoughtful evaluation of this set of data. I probably got too hung up on analyzing “success” and did not move to the hypotheses and modeling as quickly as I should have. I hope you enjoyed my analysis as much as I enjoyed creating it. It certainly is not perfect and has a lot of room for improvement, but this has given me a lot to consider in how the lessons of the course can be implemented in real world analysis. I hope that I achieved a reasonable outcome for what was expected for this assignment. I still find myself needing to go back to my notes regularly and check/re-check my intuitions and assumptions. There are still some things that I don’t feel like I have fully mastered, and this project has helped me better identify what I need to spend more time practicing. I am going to spend some of the break playing around with the other categories and working on the modeling process.
Thank you for a very challenging experience. I hope that this product approximates the expected target of the project.
-Travis McKenzie
