We will again use the hate crimes data we used in Problem Set 05. The FiveThirtyEight article about those data are in the Jan 23, 2017 "Higher Rates Of Hate Crimes Are Tied To Income Inequality." This week, we will use these data to run regression models with a single categorical predictor (explanatory) variable and a single numeric predictor (explanatory) variable.
First load the necessary packages:
library(ggplot2)
library(dplyr)
library(moderndive)
library(readr)
library(scales)
The following code uses the function read_csv()
to read in the data and store the information in an object named hate_crimes
.
hate_crimes <- read_csv("http://bit.ly/2ItxYg3")
Next, let's explore the hate_crimes
data set using the glimpse()
function from the dplyr
package:
glimpse(hate_crimes)
Rows: 51
Columns: 9
$ state <chr> "New Mexico", "Maine", "New York", "Illinois", "Dela…
$ median_house_inc <chr> "low", "low", "low", "low", "high", "high", "high", …
$ share_pop_metro <dbl> 0.69, 0.54, 0.94, 0.90, 0.90, 1.00, 0.87, 0.86, 0.97…
$ hs <dbl> 83, 90, 85, 86, 87, 85, 89, 90, 81, 91, 89, 89, 87, …
$ hate_crimes <dbl> 0.295, 0.616, 0.351, 0.195, 0.323, 0.095, 0.833, 0.6…
$ trump_support <chr> "low", "low", "low", "low", "low", "low", "low", "lo…
$ unemployment <chr> "high", "low", "low", "high", "low", "high", "high",…
$ urbanization <chr> "low", "low", "high", "high", "high", "high", "high"…
$ income <dbl> 46686, 51710, 54310, 54916, 57522, 58633, 58875, 590…
You should also examine the data in the data viewer.
Each case/row in these data is a state in the US. This week we will consider the response variable income
, which is the numeric variable of median income of households in each state.
We will use
urbanization
: level of urbanization in a regionshare_pop_hs
: the percentage of adults 25 and older with a high school degreeWe will start by modeling the relationship between:
hs
variablelow
, or high
, as contained in the variable urbanization
We will start by creating a scatterplot showing:
income
on the \(y\) axisurbanization
urbanization
(one for "low", one for "high")ggplot(data = hate_crimes, aes(y = income, x = hs, color = urbanization)) +
geom_point()+
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "Percent of adults with high-school degree",
y = "Median household income in USD $",
title = "Education and income in states with differing levels of urbanization") +
theme_bw()
hs
and income
is strong or weak, linear or non-linear?The relationship appears to be a weakly positive linear one according to this scatter plot.
urbanization
or low urbanization
) appears to have the larger intercept?The high urbanization
regression line appears to have the larger intercept.
Though not parallel, the slopes of the two regression lines appear fairly similar.
Considering the slim differential apperance of the two lines, it would be more useful to combine them into a common one via running a parrallel slopes model.
hs
variable in this plot). Please include axis labels and and title.ggplot(data = hate_crimes, aes(x = urbanization, y = income, fill = urbanization)) +
geom_boxplot() +
labs(title = "Median Household Income Based on Levels of Urbanization",
x = "Urbanization Level",
y = "Median Household Income")
income
(as response variable), and high-school education, and urbanization
as explanatory variables. Store the results of your model in an object named med_income_model
. Generate the regression table using the get_regression_table()
function from the moderndive
package.med_income_model <- lm(income ~ hs + urbanization, hate_crimes)
get_regression_table(med_income_model)
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | -113725.193 | 23552.919 | -4.828 | 0 | -161163.206 | -66287.180 |
hs | 1986.794 | 272.930 | 7.279 | 0 | 1437.084 | 2536.504 |
urbanizationlow | -7333.326 | 1857.659 | -3.948 | 0 | -11074.843 | -3591.808 |
The intercept for the states with "low" levels of urbanization is around $7,333.33 lower than those with "high" levels. The approximate slope of $1,986.79 (Income / % of adults w/ high-school degree) is the same for both levels of urbanization.
The slope for states with a "high" level of urbanization is about $1,986.79 (Income / % of adults w/ high-school degree) while its intercept is approximately -$113,725.
The slope for states with a "low" level of urbanization is about $1,986.79 (Income / % of adults w/ high-school degree) while its intercept is approximately -$121,059.
The median household income is greater in states that have higher levels of urbanization by about $7,333.33.
About $1,986.79.
med_income_model
is given in Equation (2.1). Write the regression equation for a US state in which urbanization
is "high".q13 <- dollar((get_regression_table(med_income_model)$estimate[1] + (get_regression_table(med_income_model)$estimate[2] * 85)))
$55,152.30
q14 <- dollar((get_regression_table(med_income_model)$estimate[1] + (get_regression_table(med_income_model)$estimate[2] * 85) + get_regression_table(med_income_model)$estimate[3]))
$47,818.97
-$61,454.70 This is an extrapolation so it wouldn't be feasible do actually use this number for anything other than a placeholder value.
income
value for Maine (row 2)? What was the prediction for Maine according to our model (med_income_model
)? What is the residual? Did our model over or underestimate the median income for this state?q16 <- get_regression_points(med_income_model) %>%
filter(ID == 2)
q16
ID | income | hs | urbanization | income_hat | residual |
---|---|---|---|---|---|
2 | 51710 | 90 | low | 57752.93 | -6042.926 |
income
value for Maine was ≈ $51,710income
value according to the model was ≈ $57,752.93You will now use the tools you have learned, and a new data set to solve a conservation problem.
Wildlife biologists are interested in managing/protecting habitats for a declining species of vole, but are not sure about what habitats it prefers. Two things that biologists can easily control with management is percent cover of vegetation, and where habitat improvements occur (i.e. is it important to create/protect habitat in moist or dry sites, etc). To help inform habitat management of this vole species, the researchers in this study counted the number of voles
at 56 random study sites. At each site, they measured percent cover of veg
etation, and recorded whether a site had moist or dry soil
.
The data are read into the object vole_trapping
using the read_csv()
function below.
vole_trapping <- read_csv("http://bit.ly/2IgDF0E")
The data contains the variables:
site
for the id of each random study site (each case or row is a survey/trapping site)voles
for the vole count at each siteveg
for the percent cover of vegetation at each sitesoil
identifying a site as "moist" or "dry"Generate a regression model with voles
as the response variable y
and veg
and soil
as explanatory variables. Store the model in an object named voles_mod
. Use the results of the model to answer the following questions based on the available data. Create a data visualization (parallel slopes) to help answer the questions.
voles_mod <- lm(voles ~ veg + soil, vole_trapping)
voles_mod
Call:
lm(formula = voles ~ veg + soil, data = vole_trapping)
Coefficients:
(Intercept) veg soilmoist
15.4640 0.2591 9.1003
ggplot(vole_trapping, aes(x = veg, y = voles, color = soil)) +
geom_point() +
geom_parallel_slopes(se = FALSE) +
labs(title = "Number of Voles observed based on % of Vegetation Coverage w/ dry and moist soil",
x = "Percentage of Vegetation Coverage",
y = "Number of Voles")
High vegetation coverage, because the ratio voles
:veg
is moderately positive.
Approximately 56.12% vegetation coverage would be necessary.
About 35 voles.