Introduction

Do you love where you live and are shocked why everyone just doesn’t live in your country? Or maybe you absolutely hate where you live and don’t understand why anyone would move to your country. Either way, you can get your questions answered through this R analysis! We will perform a data science pipeline on the World Happiness Report. The World Happiness Report is a dataset released by the United Nations anually that ranks countries based on happiness levels. The report contains a Happiness Score that is then used to determine a country's rank relative to other country’s scores. There are different metrics that are juxtaposed with the score and I was curious to see which one metric could be singled out (if at all) to be the best predictor of a country’s score.

To learn more about the World Happiness Report visit : http://worldhappiness.report/ed/2018/

Required Libraries

Replication of analysis requires RStudio version 3.4.3 and the following libraries.

require(rvest)
require(magrittr)
require(dplyr)
require(tidyr)
require(ggplot2)
require(tree)
require(maps)
library(ggmap)
library(stringr)
library(rworldmap)
require(RColorBrewer)
library(randomForest)
library(ISLR)
library(cvTools)

1. Gather Data

Welcome to the first step in the data pipeline! We can’t perform any operations or conduct any analysis until we have data to work with. Because we are strong, independent data scientists, we do not require a CSV file to be provided beforehand. Instead, we obtain our data through a process called HTML scraping. HTML scraping involves using pre-existing html files in webpages to extract the data we are interested in. HTML code is organized in a bunch of different way’s but what we are concerned with is the table class because that’s where the data lies.

url <- "https://en.wikipedia.org/wiki/World_Happiness_Report"
df_2018 <- read_html(url) %>% 
        #look for html node table
        html_nodes("table") %>% 
        #extract 5th occurence of html table (Wikipedia treats things as tables that shouldn't be tables)
        extract2(5) %>%
        html_table() %>% 
        as.data.frame()
as_tibble(df_2018)
## # A tibble: 156 x 9
##    `Overall Rank` Country     Score `GDP per capita` `Social support`
##             <int> <chr>       <dbl>            <dbl>            <dbl>
##  1              1 Finland      7.63             1.30             1.59
##  2              2 Norway       7.59             1.46             1.58
##  3              3 Denmark      7.56             1.35             1.59
##  4              4 Iceland      7.50             1.34             1.64
##  5              5 Switzerland  7.49             1.42             1.55
##  6              6 Netherlands  7.44             1.36             1.49
##  7              7 Canada       7.33             1.33             1.53
##  8              8 New Zealand  7.32             1.27             1.60
##  9              9 Sweden       7.31             1.36             1.50
## 10             10 Australia    7.27             1.34             1.57
## # ... with 146 more rows, and 4 more variables: `Healthy life
## #   expectancy` <dbl>, `Freedom to make life choices` <dbl>,
## #   Generosity <dbl>, `Perceptions of corruption` <chr>

Hey that’s a pretty good looking data frame! However, looks can be decieving and we still need to tidy up this frame.

2. Tidy Data

Tidying data is a process where data scientists structure a dataset to make it easier to perform analysis and operations on the data. Hidden in this data frame is a single NA, or missing entry for our Perceptions of Corruption column for the entity identifying as the United Arab Emirates. Unfortunately, I have no friends in the United Nations and so I never got an answer to why this entry was missing, but that’s okay, we will continue on using plan B, imputing data. Imputing data requires us to insert a value for the missing entry in an unbiased way. We will accomplish this by calculating the mean of Perceptions of Corruption and inserting that value as our missing data. This will ensure the central tendency of the data is the same.

Another checkpoint for tidy data is ensuring that column types match their values. Everything is dandy up until Perceptions of Corruption. This column is encoded as “chr vector” or string which mean’s all of its values will also be a string. This is detrimental to our methods because we have to compute the mean of the column to use in imputing for our missing data and the mean of strings makes no sense. Let’s convert every value in the column to a double.

df_2018 <- 
  df_2018 %>% 
  #rename columns so that they are easier on the eye
  set_colnames(c("Rank", "Country","Score","GDP", "Social_Support","Health_Expectancy","Freedom","Generosity", "Perceptions_Of_Corruption"))
  #type convert from character vector to numeric double
  df_2018$Perceptions_Of_Corruption <- as.numeric(as.character(df_2018$Perceptions_Of_Corruption))
  #impute mean value for missing entry
  df_2018[is.na(df_2018)] <- mean(df_2018$Perceptions_Of_Corruption, na.rm = TRUE)
as_tibble(df_2018)
## # A tibble: 156 x 9
##     Rank Country     Score   GDP Social_Support Health_Expectancy Freedom
##    <int> <chr>       <dbl> <dbl>          <dbl>             <dbl>   <dbl>
##  1     1 Finland      7.63  1.30           1.59             0.874   0.681
##  2     2 Norway       7.59  1.46           1.58             0.861   0.686
##  3     3 Denmark      7.56  1.35           1.59             0.868   0.683
##  4     4 Iceland      7.50  1.34           1.64             0.914   0.677
##  5     5 Switzerland  7.49  1.42           1.55             0.927   0.660
##  6     6 Netherlands  7.44  1.36           1.49             0.878   0.638
##  7     7 Canada       7.33  1.33           1.53             0.896   0.653
##  8     8 New Zealand  7.32  1.27           1.60             0.876   0.669
##  9     9 Sweden       7.31  1.36           1.50             0.913   0.659
## 10    10 Australia    7.27  1.34           1.57             0.910   0.647
## # ... with 146 more rows, and 2 more variables: Generosity <dbl>,
## #   Perceptions_Of_Corruption <dbl>

Our data set is looking good! We’re ready to explore.

3. Exploratory Data Analysis

Our data set is now tidy so we can start analyzing and inferring information from our data manipulation operations. This is accomplished through data visualization, and statistical measures.

Let's look at our original mean and standard deviation before we do any data transformations.

mean <- mean(df_2018$Score)
mean
## [1] 5.375917
std <- sd(df_2018$Score)
std
## [1] 1.119506

A country, lets call it mean country, with the score of 5.375 would lie between the 78th and 79th observations in our data. In other words mean country would be happier than Greece but not as happy as Serbia.

3.1 Data Transformation

Now, we will center out data, and then scale it. The centering can be done by subtracting the mean of the score from each observations’ score. Scaling can be done by dividing by the standard deviation.

standardized_df <- df_2018 %>%
  mutate(mean_score = mean(Score)) %>%
  mutate(sd_aff = sd(Score)) %>%
  mutate(z_aff = (Score - mean_score) / sd_aff)
mean_std <- mean(standardized_df$z_aff)
mean_std
## [1] -2.273211e-16
std_std <- sd(standardized_df$Score)
std_std
## [1] 1.119506

Great now we have a standardized dataset. Note the new value of our mean (it’s a miniscule number but it should be 0 in a perfect world). Standardized data sets will always have a mean value of 0.

3.2 Visualization

Let’s begin with some simple visuals. Below are three graphs, the first two will be scatter plots of Score vs GDP and Score vs Health Expectancy, and the final graph will be a box plot of Score vs Freedom.

df_2018 %>% ggplot(aes(x= Score, y = GDP)) + geom_point() + geom_smooth(lm = loess) + labs(title = "Score vs GDP")

Treating Score as the independent variable, we see GDP rises as Score increases. We will test this inclination more thoroughly through a linear regression model later.

df_2018 %>% ggplot(aes(x= Score, y = Health_Expectancy)) + geom_point() + geom_smooth(lm = loess) + labs(title = "Score vs Health Expectancy")

Treating Score as the independent variable, we see Health Expectancy rises as Score increases. We will test this inclination more thoroughly through a linear regression model later.

df_2018 %>% ggplot(aes(x= Score, y = Freedom)) + geom_boxplot() + labs(title = "Score vs Freedom")

I chose a boxplot for freedom to switch thigs up a bit. Box plots have 5 metrics that they display(also called the five number summary) - minimum, maximum, first quartile, third quartile, and mean. Box plots also show distribution. From the plot we can justify that half of the data has a Freedom score of .35-.6 and the mean is around .5.

3.2.1 World Map

To easily discertain which countries have a high Happiness score vs countries that do not have a high happiness score, we will create a world map. Each country will be colored a shade of green where the darkest green indicates a high happiness score and a light color indicates a poor happiness score.

d <- data.frame(
  country=df_2018$Country,
  value=df_2018$Score)
cols <- colorRampPalette(brewer.pal(7,"Greens"))(length(df_2018))
n <- invisible(joinCountryData2Map(d, joinCode="NAME", nameJoinColumn="country"))
mapCountryData(n, nameColumnToPlot="value", mapTitle="World Map for Happiness Score",colourPalette=cols, oceanCol = "#CCCCCCCC", addLegend = TRUE,aspect = 1.1, borderCol = "Black", lwd =.1)

4 Machine Learning

Machine learning is the process where a computer is fed data, the more the merrier, that is then used by the computer to simulate interactions and create a prediction. Let’s begin by defining what our model should predict. Our model will predict whether or not a country is a happy country or not based on it’s score. Arbitrarily, a score of >5.375 indicates a happy country, anything below that threshold is considered an unhappy country looking to improve it’s conditions. Notice, the cutoff is the mean calculated in our Exploratory Data Analysis. By using this value, roughly half the countries in our data set will be happy and the other half will be unhappy.

4.1 Hypothesis Testing

The hypothesis testing framework is set up so that we use our findings to reject the hypothesis that a change in each factor does not change the score. We reject this null hypothesis if our p value is greater than our \(\alpha\) value, which we will take to be .05.

4.2 Linear Regression

Let’s fit a regression model to our data. This can be accomplished with the lm function. We want to see how Score changes based on all the relevant columns so our model should look something like : \[ Score \approx \beta_0 + \beta_1 \times GDP + \beta_2 \times SocialSupport + \beta_3 \times HealthExpectancy + \beta_4 \times Freedom + \beta_5 \times Generosity + \beta_6 \times PerceptionsOfCorruption \]

fit <- lm(Score ~ 1 + GDP+Social_Support+Health_Expectancy+Freedom+Generosity+Perceptions_Of_Corruption, data = df_2018)
broom::tidy(fit) %>% knitr::kable() 
term estimate std.error statistic p.value
(Intercept) 1.8803969 0.1939790 9.693818 0.0000000
GDP 1.1026589 0.2092118 5.270539 0.0000005
Social_Support 0.9977967 0.2013778 4.954850 0.0000019
Health_Expectancy 0.8055928 0.3290109 2.448529 0.0155047
Freedom 1.4069590 0.3183240 4.419897 0.0000189
Generosity 0.5587707 0.4703165 1.188074 0.2366946
Perceptions_Of_Corruption 0.7160442 0.5263550 1.360383 0.1757634

The table above tells us how score changes if the other variables were to change. For example, an increase in GDP would cause the Score value to increase by a factor of 1.1. \(\beta_0\)’s value is 1.88 which means that a company with 0’s in all factors would have a happiness score of 1.88 put it way in last place. Taking our \(\alpha\) value to be .05, we can’t reject our null hypothesis for the Perceptions of Corruption and Generosity metrics.

4.3 Tree Based Methods

Tree based methods are antoher method in performing regression. A decision tree partitions our predictor (GDP) into regions and values (Score) are determine based on conditioning.

tree <- tree(Score~GDP, data=df_2018)
plot(tree)
text(tree, pretty=0, cex=1.3)

From this tree we can see that Score is determined based on conditioning on only GDP. Our model predicts a country with GDP of .8 to have a happiness score of 5.275.

tree <- tree(Score~GDP+Social_Support+Health_Expectancy+Freedom, data=df_2018)
plot(tree)
text(tree, pretty=10, cex=.5)

From this tree we can see that Score is determined based on conditioning on the relevant predictors observed through the linear regression model earlier. As you can see creating more predictors causes the tree to be more complex and harder to interpret because of the increasing amount of conditionals. From this tree we can predict that a country with GDP of 1.1 and Freedom score of .65 will have a happiness score of 7.287.

4.3.1 Random Forests
set.seed(1234)
train_indices <- sample(nrow(df_2018), nrow(df_2018)/2)
train_set <- df_2018[train_indices,]
test_set <- df_2018[-train_indices,]

model <- randomForest(Score~GDP+Social_Support+Health_Expectancy+Freedom, importance=TRUE, mtry=3, data=train_set)
plot(model)

This plot shows the error rate based on the number of trees used. 500 trees is a bit too much, especially for this data set. According to https://www.researchgate.net/publication/230766603_How_Many_Trees_in_a_Random_Forest the optimal number of trees should be between 64 and 128 to optimize processing time and results.

variable_importance <- importance(model)
knitr::kable(head(round(variable_importance, digits=2)))
%IncMSE IncNodePurity
GDP 23.49 32.56
Social_Support 18.52 28.07
Health_Expectancy 16.79 26.25
Freedom 8.52 7.95

Once again, we see that GDP is the strongest predictor.

4.4 Cross Validation

Thus far, we’ve seen linear regression and tree fit for our Happiness Data. Let’s do one more, logistic regression, and then let’s use cross validation to determine which model is better. Cross validation, specifically the t-test, obtains error rates by comparing predicted value to observed value and determines which model is a more accurate representation. Ultimately, a regression model will be fit of error rates to determine which model outperforms the other (based on comparing estimate values).

data(df_2018)
df_2018 <- df_2018 %>% mutate(happy = ifelse(Score > 5.375, "Yes", "No"))
fold_indices <- cvFolds(n=nrow(df_2018), K=10)

error_rates <- sapply(1:10, function(fold_index) {
  test_indices <- which(fold_indices$which == fold_index)
  test_set <- df_2018[test_indices,]
  train_set <- df_2018[-test_indices,]
  
  logis_fit <- glm(Score~GDP+Social_Support+Health_Expectancy+Freedom+Generosity+Perceptions_Of_Corruption, data=train_set)
  logis_pred <-ifelse(predict(logis_fit, newdata=test_set, type="response") > 5.375,"Yes","No")
  logis_error <- mean(test_set$happy != logis_pred)
  
  tree_fit <- tree(Score~GDP+Social_Support+Health_Expectancy+Freedom+Generosity+Perceptions_Of_Corruption, data=train_set)
  pruned_tree <- prune.tree(tree_fit, best=3)

  tree_pred <- ifelse(predict(pruned_tree, newdata=test_set) > 5.375, "Yes", "No")
  tree_error <- mean(test_set$happy != tree_pred)
  c(logis_error, tree_error)
  })
rownames(error_rates) <- c("logis", "tree")
error_rates <- as.data.frame(t(error_rates))

error_rates <- error_rates %>%
  mutate(fold=1:n()) %>%
  gather(method,error,-fold)

error_rates %>%
  head() %>%
  knitr::kable("html")
fold method error
1 logis 0.0000
2 logis 0.1875
3 logis 0.0625
4 logis 0.0625
5 logis 0.1875
6 logis 0.3125
dotplot(error~method, data=error_rates, ylab="Mean Prediction Error")

lm(error~method, data=error_rates) %>% 
  broom::tidy() %>%
  knitr::kable()
term estimate std.error statistic p.value
(Intercept) 0.14125 0.030163 4.682883 0.0001853
methodtree 0.09625 0.042657 2.256372 0.0367220

We see that both models have very similar estimates and thus we cannot justify one model being superior to the other.

Conclusion

The World Happiness report is a useful report produced by the United Nations that allows the nations to see how it’s citizens’ happiness level relates to other country’s happiness levels. To a data scientist, this report opens up alot more avenues to go take a ride on. Questions such as “what specific characteristics of a nation determine it’s citizens’ happiness level?” and “which of those characteristics should a nation look to improve upon to ensure it’s citizens are happy?” (hey if you made it this far now you know!) can be answered through the data science pipeline process.

To answer the quesitons above, we know that GDP is the strongest indicator of happiness. A country dissappointed in it’s score should look to increase it’s GDP per capita first-and-foremost.Obtaining better ratings in Health Expectancy, Social Support, and Freedom wouldn’t hurt, and Generosity and Perceptions of Corruptness may be hit or miss (all of those metrics would help it’s just a matter of which one a country should focus on to maximize return on time and resource investment into bettering it’s happiness score)

Congratulations! You’re now a data scientist. To flex your new muscles I encourage you to ask different questions about this report and solve them using the data science pipeline. To get you started: If a Utopia is considered a 9.5 on the Happiness score and a dystopia is a .5, curate all the past Happiness reports and answer the two questions “which country is on it’s way to becoming the first Utopia?” and “Which country is on it’s way to becoming the first Dystopia?”. Hint: Figure out how each country’s score changes over time and extrapolate using machine learning techniques learned in this tutorial.