1 Background

We will again use the hate crimes data we used in Problem Set 05. The FiveThirtyEight article about those data are in the Jan 23, 2017 "Higher Rates Of Hate Crimes Are Tied To Income Inequality." This week, we will use these data to run regression models with a single categorical predictor (explanatory) variable and a single numeric predictor (explanatory) variable.

1.1 Setup

First load the necessary packages:

library(ggplot2)
library(dplyr)
library(moderndive)
library(readr)
library(scales)

The following code uses the function read_csv() to read in the data and store the information in an object named hate_crimes.

hate_crimes <- read_csv("http://bit.ly/2ItxYg3")

Next, let's explore the hate_crimes data set using the glimpse() function from the dplyr package:

glimpse(hate_crimes)
Rows: 51
Columns: 9
$ state            <chr> "New Mexico", "Maine", "New York", "Illinois", "Dela…
$ median_house_inc <chr> "low", "low", "low", "low", "high", "high", "high", …
$ share_pop_metro  <dbl> 0.69, 0.54, 0.94, 0.90, 0.90, 1.00, 0.87, 0.86, 0.97…
$ hs               <dbl> 83, 90, 85, 86, 87, 85, 89, 90, 81, 91, 89, 89, 87, …
$ hate_crimes      <dbl> 0.295, 0.616, 0.351, 0.195, 0.323, 0.095, 0.833, 0.6…
$ trump_support    <chr> "low", "low", "low", "low", "low", "low", "low", "lo…
$ unemployment     <chr> "high", "low", "low", "high", "low", "high", "high",…
$ urbanization     <chr> "low", "low", "high", "high", "high", "high", "high"…
$ income           <dbl> 46686, 51710, 54310, 54916, 57522, 58633, 58875, 590…

You should also examine the data in the data viewer.

Each case/row in these data is a state in the US. This week we will consider the response variable income, which is the numeric variable of median income of households in each state.

We will use

  • A categorical explanatory variable urbanization: level of urbanization in a region
  • A numerical explanatory variable share_pop_hs: the percentage of adults 25 and older with a high school degree

1.2 Income, Education and Urbanization

We will start by modeling the relationship between:

  • \(y\): Median household income in 2016
  • \(x_1\): numerical variable percent of adults 25 and older with a high-school degree, contained in the hs variable
  • \(x_2\): categorical variable level of urbanization in a state: low, or high, as contained in the variable urbanization

2 Exploratory Data Analysis

We will start by creating a scatterplot showing:

  • Median household income on the \(y\) axis
  • Percent of adults 25 or older with a high school degree on the \(x\) axis
  • The points colored by the variable urbanization
  • A line of best fit (regression line) for each level of the variable urbanization (one for "low", one for "high")
ggplot(data = hate_crimes, aes(y = income, x = hs, color = urbanization)) +
  geom_point()+ 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(
    x = "Percent of adults with high-school degree", 
    y = "Median household income in USD $", 
    title = "Education and income in states with differing levels of urbanization") + 
  theme_bw()
Education and income in states with differing levels of urbanization

Figure 2.1: Education and income in states with differing levels of urbanization


  1. Do you think the relationship between hs and income is strong or weak, linear or non-linear?

The relationship appears to be a weakly positive linear one according to this scatter plot.

  1. Which regression line (high urbanization or low urbanization) appears to have the larger intercept?

The high urbanization regression line appears to have the larger intercept.


  1. Does the slope look fairly similar (parallel) for the two levels of urbanization?

Though not parallel, the slopes of the two regression lines appear fairly similar.

  1. Based on the data visualization in Figure 2.1, and your answer to 3, do you think it would be best to run a "parallel slopes" model (i.e. a model that estimates one shared slope for the two levels of urbanization), or a more complex "interaction model" (i.e. a model that estimates a separate slope for the two levels of urbanization)?

Considering the slim differential apperance of the two lines, it would be more useful to combine them into a common one via running a parrallel slopes model.


  1. Create a data visualization comparing median household income at "low" and "high" levels of urbanization (you do not need to include the hs variable in this plot). Please include axis labels and and title.
ggplot(data = hate_crimes, aes(x = urbanization, y = income, fill = urbanization)) +
  geom_boxplot() +
  labs(title = "Median Household Income Based on Levels of Urbanization",
       x = "Urbanization Level",
       y = "Median Household Income")


  1. Run a linear regression model that examines the relationship between household income (as response variable), and high-school education, and urbanization as explanatory variables. Store the results of your model in an object named med_income_model. Generate the regression table using the get_regression_table() function from the moderndive package.
med_income_model <- lm(income ~ hs + urbanization, hate_crimes)
get_regression_table(med_income_model)
term estimate std_error statistic p_value lower_ci upper_ci
intercept -113725.193 23552.919 -4.828 0 -161163.206 -66287.180
hs 1986.794 272.930 7.279 0 1437.084 2536.504
urbanizationlow -7333.326 1857.659 -3.948 0 -11074.843 -3591.808

  1. Is the intercept the same for the states with a "low" and "high" level of urbanization? Is the slope the same? (look at the data visualization in Figure 2.1 to help with this!)

The intercept for the states with "low" levels of urbanization is around $7,333.33 lower than those with "high" levels. The approximate slope of $1,986.79 (Income / % of adults w/ high-school degree) is the same for both levels of urbanization.

  1. What is the slope for the regression line of the states with a "high" level of urbanization? What is the intercept?

The slope for states with a "high" level of urbanization is about $1,986.79 (Income / % of adults w/ high-school degree) while its intercept is approximately -$113,725.

  1. What is the slope for the regression line of the states with a "low" level of urbanization? What is the intercept?

The slope for states with a "low" level of urbanization is about $1,986.79 (Income / % of adults w/ high-school degree) while its intercept is approximately -$121,059.

  1. Based on your regression table output (and the data visualizations), is median household income greater in states that have lower or higher levels of urbanization? By how much?

The median household income is greater in states that have higher levels of urbanization by about $7,333.33.

  1. For every increase in 1 percentage point of high-school educated adults, what is the associated increase in the median household income of a state?

About $1,986.79.


  1. The regression equation for med_income_model is given in Equation (2.1). Write the regression equation for a US state in which urbanization is "high".
\[\begin{equation} \widehat{income} = -113725 + 1986 \times \text{hs} + -7333 \times 1_{\mbox{low urbanization}}(x) \tag{2.1} \end{equation}\]
\[\begin{equation} \widehat{income} = -113725 + 1986 \times \text{hs} \end{equation}\]

  1. What would you predict as the median household income for a state with a high level of urbanization, in which 85% of the share of adults have a high school degree?
q13 <- dollar((get_regression_table(med_income_model)$estimate[1] + (get_regression_table(med_income_model)$estimate[2] * 85)))

$55,152.30


  1. What would you predict as the median household income for a state with a low level of urbanization, in which 85% of the share of adults have a high school degree?
q14 <- dollar((get_regression_table(med_income_model)$estimate[1] + (get_regression_table(med_income_model)$estimate[2] * 85) + get_regression_table(med_income_model)$estimate[3]))

$47,818.97


  1. What would you predict as the median household income for a state with a low level of urbanization, in which 30% of adults have a high school degree?

-$61,454.70 This is an extrapolation so it wouldn't be feasible do actually use this number for anything other than a placeholder value.


  1. What was the observed income value for Maine (row 2)? What was the prediction for Maine according to our model (med_income_model)? What is the residual? Did our model over or underestimate the median income for this state?
q16 <- get_regression_points(med_income_model) %>%
        filter(ID == 2)
q16
ID income hs urbanization income_hat residual
2 51710 90 low 57752.93 -6042.926
  • The observed income value for Maine was ≈ $51,710
  • The fitted income value according to the model was ≈ $57,752.93
  • The residual is ≈ -$6,042.93
  • The model overestimated the median income for this state by about $6,042.93

3 Independent Analysis

You will now use the tools you have learned, and a new data set to solve a conservation problem.

Wildlife biologists are interested in managing/protecting habitats for a declining species of vole, but are not sure about what habitats it prefers. Two things that biologists can easily control with management is percent cover of vegetation, and where habitat improvements occur (i.e. is it important to create/protect habitat in moist or dry sites, etc). To help inform habitat management of this vole species, the researchers in this study counted the number of voles at 56 random study sites. At each site, they measured percent cover of vegetation, and recorded whether a site had moist or dry soil.

The data are read into the object vole_trapping using the read_csv() function below.

vole_trapping <- read_csv("http://bit.ly/2IgDF0E")

The data contains the variables:

  • site for the id of each random study site (each case or row is a survey/trapping site)
  • voles for the vole count at each site
  • veg for the percent cover of vegetation at each site
  • soil identifying a site as "moist" or "dry"

Generate a regression model with voles as the response variable y and veg and soil as explanatory variables. Store the model in an object named voles_mod. Use the results of the model to answer the following questions based on the available data. Create a data visualization (parallel slopes) to help answer the questions.

voles_mod <- lm(voles ~ veg + soil, vole_trapping)
voles_mod

Call:
lm(formula = voles ~ veg + soil, data = vole_trapping)

Coefficients:
(Intercept)          veg    soilmoist  
    15.4640       0.2591       9.1003  
ggplot(vole_trapping, aes(x = veg, y = voles, color = soil)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE) + 
  labs(title = "Number of Voles observed based on % of Vegetation Coverage w/ dry and moist soil",
       x = "Percentage of Vegetation Coverage",
       y = "Number of Voles")


  1. Would you recommend to a manager that they try to protect sites with high or low vegetation cover? Why?

High vegetation coverage, because the ratio voles:veg is moderately positive.


  1. Dry sites are typically a lot less money to purchase and maintain for conservation organizations. Thus, if a conservation organization decides to purchase a few dry sites, roughly what percent cover of vegetation do they need to maintain on these sites (at a minimum) to support a population of about 30 voles at the site?

Approximately 56.12% vegetation coverage would be necessary.


  1. The Nature Conservancy is looking at purchasing a site for this species (in the same study area) that has moist soil and 40% vegetation cover. Using the regression equation what would you predict as the possible vole population the site might be able to support?

About 35 voles.