Kickstarter Investigation

Data 501 Project - Kickstarter Analysis

ABSTRACT

I received and evaluated a broad data set that provides numerous variables associated with Kickstarter campaigns from 2009 through 2020, and applied the data science process to it. Through much exploration, the primary questions of interest were identified and addressed. This required extensive wrangling and observation, as well as adapting hypothesis tests and modeling. Please enjoy the findings of this investigation. The goal is to provide context and insight into how a Kickstarter campaign might be broadly considered successful.

INTRODUCTION

This is a massive data set with numerous variables and limited contextual information. The data and methods are described in more detail later in this analysis. A Kickstarter campaign is an effort to raise money for a stated project. These projects fall within 15 primary categories, and subcategories may be assigned to better delineate the type of project being pursued. Campaigns seek backers to fund their projects. Generally, a project can launch if its campaign reaches its funding goal through pledged monetary commitments from backers. There are also situations where a project receives enough in pledges to move forward even while falling short of the overall goal. Even with some funding secured, some projects are still canceled, suspended or fail. Given the wide range and style of campaigns, it can be difficult to quantify when a project has been truly successful. My analysis aims to address what success means in the context of a Kickstarter campaign. It also aims to shed light on which category or categories might hold the best opportunities for success. As an introduction, here is the head (or top) of our data set.

This should grant you an introductory view into how the data is structured:

## # A tibble: 6 × 22
##   CASEID NAME                PID CATEGORY CATEGORY_ID SUBCATEGORY SUBCATEGORY_ID
##    <dbl> <chr>             <dbl> <chr>          <dbl> <chr>                <dbl>
## 1      1 MASKED BY ICPSR  2.14e9 Film & …          11 Science Fi…            301
## 2      2 MASKED BY ICPSR  1.50e9 Film & …          11 Fantasy                296
## 3      3 MASKED BY ICPSR  9.53e8 Technol…          16 Software                51
## 4      4 MASKED BY ICPSR  1.37e9 Publish…          18 Publishing              18
## 5      5 MASKED BY ICPSR  1.72e9 Art                1 Illustrati…             22
## 6      6 MASKED BY ICPSR  1.12e9 Journal…          13 Video                  360
## # ℹ 15 more variables: PROJECT_PAGE_LOCATION_NAME <chr>,
## #   PROJECT_PAGE_LOCATION_STATE <chr>, PROJECT_PAGE_LOCATION_COUNTY <chr>,
## #   UID <dbl>, LAUNCHED_DATE <chr>, DEADLINE_DATE <chr>,
## #   PROJECT_CURRENCY <chr>, GOAL_IN_ORIGINAL_CURRENCY <dbl>,
## #   PLEDGED_IN_ORIGINAL_CURRENCY <dbl>, GOAL_IN_USD <dbl>,
## #   PLEDGED_IN_USD <dbl>, BACKERS_COUNT <dbl>, STATE <chr>, URL_NAME <chr>,
## #   pledgeVgoal <dbl>

Primary Questions of Interest

The “Film & Video” category contained the greatest number of campaigns at 75,808. The “Dance” category contained the fewest at 4,298. We can see those totals here:

## # A tibble: 15 × 2
##    CATEGORY     categoryTotal
##    <chr>                <int>
##  1 Film & Video         75808
##  2 Music                63486
##  3 Games                56700
##  4 Publishing           52082
##  5 Technology           44706
##  6 Design               43503
##  7 Art                  41455
##  8 Fashion              33066
##  9 Food                 30758
## 10 Comics               17560
## 11 Photography          12646
## 12 Theater              12349
## 13 Crafts               11917
## 14 Journalism            5865
## 15 Dance                 4298

Using bar graphs, histograms and point graphs, we can observe some time-based trends in our data.

Using our bar graph, we can easily observe that July generated the most attempted Kickstarter launches and that December had the fewest.

Our histogram shows all data from 2009 - 2020 and makes it clear that total launches peaked around 2015.

Using our point graph, we can quickly see which years resulted in the most total Kickstarter launches. The year 2015 was a clear peak, and the number of launches has decreased relatively consistently since then. The number of launches in 2020, our last year of available data, was the lowest it has been since 2011.
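
The month and year views above depend on pulling date parts out of LAUNCHED_DATE, which is stored as character. A minimal sketch of one way to do that in base R follows. The sample values and the year-month-day format are assumptions for illustration only; the actual format in the data set may differ.

```r
# Hypothetical sample values; the real LAUNCHED_DATE format is an assumption.
launched <- c("2015-07-04", "2015-12-20", "2014-07-15", "2020-03-01")

# Convert the character strings to Date objects.
launch_date <- as.Date(launched, format = "%Y-%m-%d")

# Extract the year as an integer and the month as an ordered calendar factor,
# so months with no launches still appear with a count of zero.
launch_year <- as.integer(format(launch_date, "%Y"))
launch_month <- factor(
  month.name[as.integer(format(launch_date, "%m"))],
  levels = month.name
)

# Counts that could feed a bar graph (by month) or point graph (by year).
launches_by_year <- table(launch_year)
launches_by_month <- table(launch_month)
print(launches_by_year)
print(launches_by_month)
```

Keeping the months as a factor with all twelve levels preserves calendar order on a plot instead of sorting alphabetically.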

The data itself records whether or not each Kickstarter campaign was successful, as the data sample above shows. The categorical character variable “STATE” labels each campaign as “successful”, “canceled”, “failed” or “suspended”. Using the codebook, I was able to see that 38.4% of campaigns were successful and the remaining 61.6% were canceled, failed or suspended. I can replicate this in my analysis by grouping by CATEGORY and then by STATE. This gives me a very nice view of the success ratios within each CATEGORY, which I can then plot. I can also use this to rank each CATEGORY by its percentage of successful campaigns. Surprisingly, Dance has the highest percentage of success across the categories, even though it had the lowest raw quantity of campaigns.

If I were to quantify success purely as the percent of successful campaigns within each category, then Dance performs the highest and Technology is the lowest performer. This is the conclusion I will carry forward when framing my hypotheses for evaluating success. I will also leverage this conclusion to focus my modeling efforts.

However, as these views show, trying to quantify success is not so simple. Please view these exhibits and my further exploration of this concept.

## # A tibble: 15 × 4
## # Groups:   CATEGORY [15]
##    CATEGORY     STATE      total totalPercent
##    <chr>        <chr>      <int>        <dbl>
##  1 Dance        successful  2650         61.7
##  2 Comics       successful 10666         60.7
##  3 Theater      successful  7408         60.0
##  4 Music        successful 31854         50.2
##  5 Art          successful 18955         45.7
##  6 Games        successful 24028         42.4
##  7 Design       successful 17000         39.1
##  8 Film & Video successful 28597         37.7
##  9 Publishing   successful 17831         34.2
## 10 Photography  successful  4175         33.0
## 11 Fashion      successful  9615         29.1
## 12 Crafts       successful  3061         25.7
## 13 Food         successful  7844         25.5
## 14 Journalism   successful  1351         23.0
## 15 Technology   successful  9423         21.1
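
The percentages in this table can be checked directly from the two sets of counts already printed in this report. The sketch below re-enters those published counts by hand in base R; it is a verification of the arithmetic, not the original grouping code.

```r
# Counts copied from the category totals and the successful-campaign table.
rates <- data.frame(
  category = c("Film & Video", "Music", "Games", "Publishing", "Technology",
               "Design", "Art", "Fashion", "Food", "Comics",
               "Photography", "Theater", "Crafts", "Journalism", "Dance"),
  total = c(75808, 63486, 56700, 52082, 44706,
            43503, 41455, 33066, 30758, 17560,
            12646, 12349, 11917, 5865, 4298),
  successful = c(28597, 31854, 24028, 17831, 9423,
                 17000, 18955, 9615, 7844, 10666,
                 4175, 7408, 3061, 1351, 2650)
)

# Percent of campaigns in each category with STATE == "successful".
rates$pct <- round(100 * rates$successful / rates$total, 1)

# Rank from highest to lowest success percentage.
ranked <- rates[order(-rates$pct), ]
print(ranked, row.names = FALSE)

# Overall success rate across all categories.
overall_pct <- round(100 * sum(rates$successful) / sum(rates$total), 1)
print(overall_pct)
```

The overall figure works out to 38.4%, which agrees with the codebook value quoted earlier.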

This is an additional exploration of “success” made while evaluating the data and mapping out my analysis. It highlights the many different directions that this evaluation could have pursued.

From a raw numerical standpoint, I can see that Music has the greatest total of successful campaign outcomes at 31,854 and completely overshadows the raw numbers of Dance at 2,650 by a factor of about 12. Judged on this variable alone, Music is the most successful and Journalism the least successful. However, when limiting to this variable, we are not considering the ratio of failure to success. We are also not considering the extent of success. To remedy this, I conducted further analysis.

Earlier in the preparation of the data, I created a pledged vs. goal ratio (referred to henceforth as PVG) to serve as another marker of potential success for the campaigns. It is interesting to note in my statistical summary that three categories had campaigns that failed to meet their goal but were still considered successful: Film & Video, Design and Comics. This calls into question the nature of the “successful” STATE. Within these categories, there could be opportunities to proceed with the project even with incomplete funding of the goal. It also raises the question of whether the STATE categorical variable is the best indicator of true success when evaluating a campaign. Among successful campaigns, the Games category has the highest average PVG. I interpret this to mean that pledges for Games campaigns tend to exceed their funding goals by the widest margin, on average. Successful Games campaigns also have the highest median PVG, and there is a notable gap between Games and the next highest median (2.26 against 1.80 for Design). Because the median resists outliers, it gives a clearer picture of the middle of the data.

Attempting to evaluate all categories without limiting to the successful campaigns allows me to view the success in contrast with failure. This provides interesting context in regard to interpreting and defining Kickstarter success.

For the purpose of this analysis, I am choosing to define success by limiting to campaigns deemed successful under the STATE variable and, within that filtered group, prioritizing the median PVG due to its resistance to the numerous outliers within and across categories. Ultimately, I find the Games category to be the most successful when considering what was pledged against the goal being sought. Within this alternative analysis, Dance would be considered the least successful, with the lowest median PVG within the filtered successful STATE (1.08, level with Theater at the displayed precision). This conflicts with my first conclusion related to success and merits enhanced techniques and hypothesis testing. I have now established where to direct the focus of my hypothesis testing and modeling approach.

## # A tibble: 15 × 4
##    CATEGORY     successful meanPVG medianPVG
##    <chr>             <int>   <dbl>     <dbl>
##  1 Games             24028   18.0       2.26
##  2 Music             31854   15.5       1.11
##  3 Film & Video      28597   12.0       1.09
##  4 Technology         9423   11.5       1.77
##  5 Comics            10666   10.0       1.46
##  6 Crafts             3061    7.86      1.42
##  7 Design            17000    6.54      1.80
##  8 Publishing        17831    5.71      1.18
##  9 Art               18955    5.17      1.27
## 10 Fashion            9615    4.94      1.34
## 11 Food               7844    3.99      1.13
## 12 Journalism         1351    2.44      1.13
## 13 Photography        4175    1.88      1.18
## 14 Theater            7408    1.53      1.08
## 15 Dance              2650    1.40      1.08
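
To illustrate why the choice between mean and median matters here, the sketch below re-enters the summary values from the table above and re-ranks the categories by median PVG. The `meanToMedian` column is a helper introduced only for this illustration; a large value suggests the mean is being pulled upward by a few very large campaigns.

```r
# Values copied from the PVG summary table (successful campaigns only).
pvg <- data.frame(
  category = c("Games", "Music", "Film & Video", "Technology", "Comics",
               "Crafts", "Design", "Publishing", "Art", "Fashion",
               "Food", "Journalism", "Photography", "Theater", "Dance"),
  meanPVG = c(18.0, 15.5, 12.0, 11.5, 10.0,
              7.86, 6.54, 5.71, 5.17, 4.94,
              3.99, 2.44, 1.88, 1.53, 1.40),
  medianPVG = c(2.26, 1.11, 1.09, 1.77, 1.46,
                1.42, 1.80, 1.18, 1.27, 1.34,
                1.13, 1.13, 1.18, 1.08, 1.08)
)

# Illustrative helper: how far the mean sits above the median.
pvg$meanToMedian <- round(pvg$meanPVG / pvg$medianPVG, 1)

# Re-rank by median PVG, which resists outliers.
by_median <- pvg[order(-pvg$medianPVG), ]
print(by_median, row.names = FALSE)
```

Ranked this way, Games stays first, but Music falls from second place by mean to twelfth by median, which is why the two measures lead to different pictures of success.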

DATA AND METHODOLOGY

This is global Kickstarter data from 2009 - 2020. Each entry is a different Kickstarter campaign. Each has a category (15 in total) and a subcategory (161 in total), and each category and subcategory has a respective numerical ID. The data also contains information on the project location name, location state and location county. There is a unique identifier number (UID) associated with each entry. There are dates for when the campaign was launched as well as its deadline. Each project has a financial goal and an amount that was pledged, expressed in both the original currency and US dollars. BACKERS_COUNT expresses the number of backers associated with the project; judged against the amount pledged, this appears to be the number of contributors who pledged. The names and URLs of each campaign are masked to protect respondent anonymity and prevent disclosure risk. There is a STATE variable which records whether each campaign was successful, canceled, failed or suspended.

The data was gathered by way of an observational study. The primary response variable was the amount of money pledged for the respective campaign.

There were limitations to the quality of the data. The lack of names and URLs limits specificity. Many of the variables were entered as characters. In particular, when attempting to evaluate the success of the campaigns through the goal and pledged amounts, the data had to be wrangled to convert the character values into numeric variables. This included removing the dollar sign (which was more problematic than initially expected) and removing commas. When attempting to evaluate the pledge vs. goal ratio, there were also issues with campaigns that had no stated goal.
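
A minimal sketch of this kind of cleanup in base R is shown below. The sample strings and the helper name `to_number` are hypothetical, and treating a zero or missing goal as NA is an assumption made for illustration rather than a record of how the analysis handled those campaigns. One plausible reason the dollar sign causes trouble is that `$` is a regular expression anchor for the end of a string, so it has to be placed inside a character class (or escaped) to be matched literally.

```r
# Hypothetical raw values in the style described above.
raw_goal <- c("$5,000", "$1,250.50", "0", NA)
raw_pledged <- c("$7,500", "$300", "$40", "$10")

# Strip dollar signs and commas, then convert to numeric.
# Inside [ ] the "$" is treated as a literal character, not an anchor.
to_number <- function(x) as.numeric(gsub("[$,]", "", x))

goal <- to_number(raw_goal)
pledged <- to_number(raw_pledged)

# Pledged vs. goal ratio; a zero or missing goal yields NA instead of Inf/NaN
# (an assumption for this sketch).
pvg <- ifelse(is.na(goal) | goal == 0, NA_real_, pledged / goal)
print(data.frame(goal, pledged, pvg))
```

A PVG of 1 or more means the campaign met or exceeded its goal, so the first hypothetical row (1.5) would count as overfunded by 50%.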

Location information was also troublesome. The entries were not sufficiently constrained. I would have preferred a clear city and/or state, and country. This would have allowed for better opportunities to map the data and evaluate location trends.

##      CASEID           NAME                PID              CATEGORY        
##  Min.   :     1   Length:506199      Min.   :2.941e+03   Length:506199     
##  1st Qu.:126551   Class :character   1st Qu.:5.372e+08   Class :character  
##  Median :253100   Mode  :character   Median :1.075e+09   Mode  :character  
##  Mean   :253100                      Mean   :1.074e+09                     
##  3rd Qu.:379650                      3rd Qu.:1.610e+09                     
##  Max.   :506199                      Max.   :2.147e+09                     
##                                                                            
##   CATEGORY_ID   SUBCATEGORY        SUBCATEGORY_ID  PROJECT_PAGE_LOCATION_NAME
##  Min.   : 1.0   Length:506199      Min.   :  1.0   Length:506199             
##  1st Qu.: 9.0   Class :character   1st Qu.: 22.0   Class :character          
##  Median :12.0   Mode  :character   Median : 34.0   Mode  :character          
##  Mean   :11.6                      Mean   :104.8                             
##  3rd Qu.:15.0                      3rd Qu.:262.0                             
##  Max.   :26.0                      Max.   :396.0                             
##                                                                              
##  PROJECT_PAGE_LOCATION_STATE PROJECT_PAGE_LOCATION_COUNTY      UID           
##  Length:506199               Length:506199                Min.   :3.000e+00  
##  Class :character            Class :character             1st Qu.:5.359e+08  
##  Mode  :character            Mode  :character             Median :1.072e+09  
##                                                           Mean   :1.073e+09  
##                                                           3rd Qu.:1.610e+09  
##                                                           Max.   :2.147e+09  
##                                                                              
##  LAUNCHED_DATE      DEADLINE_DATE      PROJECT_CURRENCY  
##  Length:506199      Length:506199      Length:506199     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  GOAL_IN_ORIGINAL_CURRENCY PLEDGED_IN_ORIGINAL_CURRENCY  GOAL_IN_USD       
##  Min.   :        0         Min.   :        0            Min.   :        0  
##  1st Qu.:     2000         1st Qu.:       48            1st Qu.:     2000  
##  Median :     5000         Median :      765            Median :     5000  
##  Mean   :    50845         Mean   :    16875            Mean   :    41951  
##  3rd Qu.:    16000         3rd Qu.:     4733            3rd Qu.:    15000  
##  Max.   :100000000         Max.   :481621841            Max.   :166877652  
##                                                                            
##  PLEDGED_IN_USD     BACKERS_COUNT       STATE             URL_NAME        
##  Min.   :       0   Min.   : -2.00   Length:506199      Length:506199     
##  1st Qu.:      45   1st Qu.:  2.00   Class :character   Class :character  
##  Median :     748   Median : 14.00   Mode  :character   Mode  :character  
##  Mean   :   10792   Mean   : 60.14                                        
##  3rd Qu.:    4510   3rd Qu.: 59.00                                        
##  Max.   :20338986   Max.   :999.00                                        
##                     NA's   :10148                                         
##   pledgeVgoal       
##  Min.   :     0.00  
##  1st Qu.:     0.01  
##  Median :     0.18  
##  Mean   :     4.02  
##  3rd Qu.:     1.10  
##  Max.   :194843.00  
## 

RESULTS

After conducting a rigorous assessment of the data, I found that the Dance, Music and Games categories each possessed indicators which might suggest a proclivity towards a successful Kickstarter campaign. I am primarily interested in whether or not there is a significant difference across categories, and across these focus categories in particular, in meeting or exceeding the goal with pledge funding.

I established two primary hypotheses.

Hypothesis A: my null hypothesis is that each category tends to have the same pledge to goal (PVG) ratio on average when the campaign STATE is marked as successful. My alternative hypothesis is that there is a significant difference across categories in mean PVG. For this hypothesis, I require a Tukey HSD test. I need to evaluate the mean PVG across Dance, Music and Games to determine if they have the same PVG on average, or if there is a meaningful difference in this data.

My density graph allows me to view the distribution of the PVG. I then run my tests.

## # A tibble: 105 × 5
##    sizeComp          diff    lwr   upr p.adj
##    <chr>            <dbl>  <dbl> <dbl> <dbl>
##  1 Publishing-Games -6.20 -13.8   1.38 0.260
##  2 Games-Food        7.13  -1.71 16.0  0.283
##  3 Publishing-Music -5.82 -13.2   1.56 0.320
##  4 Music-Food        6.75  -1.92 15.4  0.343
##  5 Games-Fashion     6.64  -2.00 15.3  0.364
##  6 Music-Fashion     6.26  -2.20 14.7  0.433
##  7 Games-Art         5.76  -2.31 13.8  0.500
##  8 Music-Art         5.38  -2.50 13.3  0.580
##  9 Games-Design      5.41  -2.55 13.4  0.587
## 10 Technology-Games -5.22 -13.1   2.67 0.633
## # ℹ 95 more rows
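
For readers who want to see the mechanics, here is a minimal sketch of a Tukey HSD comparison in base R. The data are simulated and are NOT the Kickstarter data; the group names, sample sizes and distribution parameters are assumptions chosen only to make the example run. The test is applied on top of a one-way ANOVA fit, and the result is sorted by adjusted p-value in the same spirit as the output above.

```r
# Simulated, right-skewed PVG-like values for three groups (illustration only).
set.seed(501)
sim <- data.frame(
  category = factor(rep(c("Dance", "Games", "Music"), each = 50)),
  pvg = c(rlnorm(50, meanlog = 0.1, sdlog = 0.3),
          rlnorm(50, meanlog = 0.8, sdlog = 0.9),
          rlnorm(50, meanlog = 0.2, sdlog = 0.6))
)

# One-way ANOVA, then Tukey's honest significant difference on the factor.
fit <- aov(pvg ~ category, data = sim)
tukey <- TukeyHSD(fit, "category")

# One row per pair of groups: difference in means, confidence bounds,
# and the p-value adjusted for multiple comparisons.
pairs <- as.data.frame(tukey$category)
pairs <- pairs[order(pairs[["p adj"]]), ]
print(pairs)

# With all 15 categories there would be choose(15, 2) pairwise comparisons.
n_pairs_all <- choose(15, 2)
print(n_pairs_all)
```

The count of 105 pairs matches the number of rows reported in the output above, which indicates the comparison was run across all 15 categories rather than only the three focus categories.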

Hypothesis B: my null hypothesis is that across the categories of Dance, Music and Games, each is equally likely to have a successful STATE campaign. My alternative hypothesis is that each is not equally likely to result in a successful STATE campaign. For this, we will conduct a chi-squared test of homogeneity. This requires a table comparison. The three tables in the output below are, in order, the observed counts, the expected counts and the residuals, which we are able to compare and contrast.

## 
##  Pearson's Chi-squared test
## 
## data:  kickChisq
## X-squared = 3453.1, df = 6, p-value < 2.2e-16
##       canceled failed successful suspended
## Dance      204   1427       2650        17
## Games     9686  22565      24028       421
## Music     4598  26805      31854       229
##        canceled    failed successful suspended
## Dance  500.2203  1753.844   2020.907  23.02919
## Games 6598.9975 23137.029  26660.168 303.80531
## Music 7388.7822 25906.127  29850.925 340.16550
##        canceled    failed successful suspended
## Dance -13.24446 -7.804497   13.99400 -1.256376
## Games  38.00126 -3.760661  -16.12064  6.723730
## Music -32.46680  5.584663   11.59360 -6.027329
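
The test above can be reproduced from the observed counts alone. The sketch below re-enters the observed table by hand and runs the same base R function; the object name `observed` is introduced here for illustration and stands in for the `kickChisq` table named in the output.

```r
# Observed counts copied from the first table in the output above.
observed <- matrix(
  c( 204,  1427,  2650,  17,
    9686, 22565, 24028, 421,
    4598, 26805, 31854, 229),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("Dance", "Games", "Music"),
    c("canceled", "failed", "successful", "suspended")
  )
)

res <- chisq.test(observed)
print(res)

# Expected counts under homogeneity: (row total * column total) / grand total.
print(round(res$expected, 1))

# Pearson residuals: (observed - expected) / sqrt(expected).
# Positive values mean more campaigns than expected in that cell.
print(round(res$residuals, 2))

# Assumption check: every expected count should exceed 5.
expected_ok <- all(res$expected > 5)
print(expected_ok)
```

Reading the successful column of the residuals, Dance and Music sit above their expected counts while Games sits below, and the smallest expected count (about 23, for suspended Dance campaigns) is comfortably above 5.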

Lastly, considering these findings, we are going to model the amount pledged in USD within the Dance category to see which variables relate to it.

## `geom_smooth()` using formula = 'y ~ x'

## 
## Call:
## lm(formula = log(PLEDGED_IN_USD) ~ BACKERS_COUNT + GOAL_IN_USD, 
##     data = kickDance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7234 -0.2109  0.1543  0.3670  1.5889 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.354e+00  1.629e-02  451.40   <2e-16 ***
## BACKERS_COUNT 3.676e-03  2.750e-04   13.37   <2e-16 ***
## GOAL_IN_USD   1.182e-04  3.039e-06   38.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5628 on 2641 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6499, Adjusted R-squared:  0.6497 
## F-statistic:  2452 on 2 and 2641 DF,  p-value: < 2.2e-16

## BACKERS_COUNT   GOAL_IN_USD 
##       2.01797       2.01797
## Analysis of Variance Table
## 
## Response: log(PLEDGED_IN_USD)
##                 Df  Sum Sq Mean Sq F value    Pr(>F)    
## BACKERS_COUNT    1 1073.82 1073.82  3390.8 < 2.2e-16 ***
## GOAL_IN_USD      1  478.99  478.99  1512.5 < 2.2e-16 ***
## Residuals     2641  836.38    0.32                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 4468.198
##                 Df Sum Sq Mean Sq F value Pr(>F)    
## BACKERS_COUNT    1 1073.8  1073.8    3391 <2e-16 ***
## GOAL_IN_USD      1  479.0   479.0    1512 <2e-16 ***
## Residuals     2641  836.4     0.3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
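
Several of the headline figures in the regression output can be cross-checked from the analysis of variance table with simple arithmetic. The sketch below copies the printed sums of squares, degrees of freedom, coefficient and VIF into base R. The interpretation of the coefficient as a percent change follows from the log-transformed response and is offered as a reading aid, not as part of the original output.

```r
# Values copied from the ANOVA table and model summary above.
ss_backers <- 1073.82
ss_goal <- 478.99
ss_resid <- 836.38
df_model <- 2
df_resid <- 2641

# R-squared: share of total variation explained by the two predictors.
ss_model <- ss_backers + ss_goal
r_squared <- ss_model / (ss_model + ss_resid)

# Residual standard error and overall F-statistic.
resid_se <- sqrt(ss_resid / df_resid)
f_stat <- (ss_model / df_model) / (ss_resid / df_resid)

# With exactly two predictors, VIF = 1 / (1 - r^2), where r is the correlation
# between them, so the printed VIF implies the size of that correlation.
vif <- 2.01797
predictor_cor <- sqrt(1 - 1 / vif)

# The response is log(PLEDGED_IN_USD), so exp(coefficient) - 1 is the
# approximate proportional change in pledged USD per additional backer.
pct_per_backer <- 100 * (exp(3.676e-03) - 1)

print(round(c(r_squared = r_squared, resid_se = resid_se, f_stat = f_stat,
              predictor_cor = predictor_cor, pct_per_backer = pct_per_backer), 4))
```

These reproduce the reported R-squared, residual standard error and F-statistic to rounding, and suggest that backer count and goal are fairly strongly correlated with each other in the Dance subset.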

CONCLUSION

In regard to hypothesis A, when comparing our 15 categories, there appears to be no statistical evidence that any one category differs from another in mean pledge vs. goal ratio (PVG); the lowest adjusted p-value in the output shown is 0.260. We fail to reject the null hypothesis: the data do not show a statistically significant difference between categories in the amount of money pledged against the goal being sought.

In regard to hypothesis B, when comparing our 3 focus categories, there does appear to be a notable difference in the distribution of the various STATES of canceled, failed, successful and suspended, and the result is statistically significant (p-value < 2.2e-16). Dance achieves more successful outcomes than expected. Music also achieves successful campaigns more often than expected. However, Games performs worse than expected in the observed data.

The findings are quite surprising, particularly when contrasted against the initial exploration of the data.

Dance appears to ultimately be one of the most reliably successful categories. Within that category, I created a multiple linear regression model to evaluate the best predictors of money pledged in US dollars. Both the backer count and the goal in US dollars turned out to be statistically significant predictors, and the model explains roughly 65% of the variance in the log of the amount pledged. However, there were outliers and leverage points impacting the model. I was able to remove a few of those to improve the fit. Ultimately, this particular data set has limits to its predictive capacity.

APPENDIX

The assumptions for the Tukey HSD test were met: 1. Observations are independent within and across groups 2. Each group's observations are normally distributed 3. There is homogeneity of variance across groups

The assumptions for the chi-squared test for homogeneity were met: 1. Expected counts of all cells >5 2. Each observation contributes to only 1 cell 3. Independent groups.

Professor Gore, I am generally satisfied with my investigation into this data set. I found it extremely challenging and thought-provoking. I continued to iterate, then reiterate, and found myself going in circles trying to consider all of the statistical challenges that are present when trying to craft a cohesive and thoughtful evaluation of this set of data. I probably got too hung up on analyzing “success” and did not move to the hypotheses and modeling as quickly as I should have. I hope you enjoyed my analysis as much as I enjoyed creating it. It certainly is not perfect and has a lot of room for improvement, but this has given me a lot to consider in how the lessons of the course can be implemented in real world analysis. I hope that I achieved a reasonable outcome for what was expected for this assignment. I still find myself needing to go back to my notes regularly and check/re-check my intuitions and assumptions. There are still some things that I don’t feel like I have fully mastered, and this project has helped me better identify what I need to spend more time practicing. I am going to spend some of the break playing around with the other categories and working on the modeling process.

Thank you for a very challenging experience. I hope that this product approximates the expected target of the project.

-Travis McKenzie