Welcome to the WiDS Data Dive, in partnership with the Chicago/Northern Illinois Red Cross.
You will have 2.5 hours to answer the primary question at the event.
Below is a table of the number of incidents per neighborhood from 2015-2019 that the Chicago/Northern Illinois Red Cross responded to. A few neighborhoods account for a large share of the incidents; those with over 150 incidents over the time period are shown in the simple bar chart. Our question is: what is going on in these neighborhoods?
primary_neighborhood | n |
---|---|
englewood | 476 |
austin | 355 |
roseland | 248 |
garfield park | 213 |
humboldt park | 198 |
auburn gresham | 189 |
north lawndale | 181 |
south shore | 176 |
west pullman | 175 |
new city | 165 |
grand crossing | 156 |
south chicago | 119 |
chicago lawn | 118 |
chatham | 114 |
little village | 101 |
woodlawn | 101 |
NOTE: This is not a question that we or the Red Cross definitively know the answer to. Feel free to be as creative as time allows!
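The simple bar chart mentioned above can be reproduced in a few lines of ggplot2. Here is a minimal sketch using only the over-150 rows of the table; the axis labels and title are our own choices, not part of the original chart:

```r
library(tidyverse)

# counts copied from the table above (neighborhoods with > 150 incidents)
incidents <- tribble(
  ~primary_neighborhood, ~n,
  "englewood",      476,
  "austin",         355,
  "roseland",       248,
  "garfield park",  213,
  "humboldt park",  198,
  "auburn gresham", 189,
  "north lawndale", 181,
  "south shore",    176,
  "west pullman",   175,
  "new city",       165,
  "grand crossing", 156
)

incidents %>%
  mutate(primary_neighborhood = fct_reorder(primary_neighborhood, n)) %>%
  ggplot(aes(primary_neighborhood, n)) +
  geom_col() +
  coord_flip() + # horizontal bars so neighborhood names stay readable
  labs(x = NULL, y = "Incidents, 2015-2019",
       title = "Neighborhoods with over 150 Red Cross incidents")
```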
With the first question looking into where these incidents are occurring, another question the Red Cross is asking is when these incidents are occurring. Look at the seasonality of the dataset at a monthly, weekly, and/or daily level.
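As a starting point for the seasonality question, here is a sketch of counting incidents by month and weekday with lubridate. The real data has a `date` column of class Date; the tiny tibble below is toy data for illustration only:

```r
library(tidyverse)
library(lubridate)

# toy stand-in for the real data's `date` column
incidents <- tibble(
  date = as.Date(c("2019-01-05", "2019-01-20", "2019-07-04", "2019-12-25"))
)

# incidents per calendar month (1 = January)
monthly <- incidents %>%
  count(month = month(date))

# incidents per day of week (1 = Sunday by lubridate's default)
weekly <- incidents %>%
  count(weekday = wday(date))
```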
The Red Cross of Greater Chicago/Illinois responds to disaster incidents throughout Illinois and aids those who are affected by those disasters.
In the dataset provided by the Red Cross, each row is an incident and each column is a variable related to the incidents. The Red Cross responds primarily to fires and other disasters, and most of these disasters happen in the city of Chicago. The data provided covers only Chicago disasters.
The US Fire Administration has collected data on fires from around the U.S. While these are national estimates and may not adequately represent Chicago-specific fires, they can further our understanding of fire disasters. Data on buildings can also help us understand why and how these disasters occur. To aid your exploratory analysis, we have provided data on building code violations, environmental factors, vacant buildings, and 311 calls.
NOTE: This is not a modeling exercise; we simply are trying to look into interesting pathways to explore further.
Your team will need to create 1 slide with your best visualization or takeaway. You will get a maximum of 90 seconds to present your slide at WiDS downtown on Friday, March 6.
Please email your 1 slide with the subject: “DATA DIVE”. In the body of the email, list the names of the team members who will be attending WiDS downtown. Send to mcwhalen@u.northwestern.edu by 8:00 pm on Thursday, March 5.
The codebook lists each variable name, a description of the variable, and the variable type. If a variable is categorical, each of the possible categorical responses is listed.
We will give you a zip file to download HERE that has:

- `Redcross.csv`: Red Cross incident data from 2015-2019. This is described more below.
- `Neighborhoods_2012b.shp`: shapefile for community areas in Chicago (NOTE: you need ALL of the file types to stay in that `neighborhood_shapefile/` directory, not just the `*.shp`!)
- `chicago_community_data.csv`: Community Snapshot Data by neighborhood. This has a TON of information you can use: population, demographics, and transit information!
- `calls_311.csv`: Data from the City of Chicago Data Portal about 311 calls.
- `enviro_inspection.csv`: Data from the City of Chicago Data Portal about environmental inspections.
- `build_vio_nhood_data.csv`: Data from the City of Chicago Data Portal about building violations.
- `vacant.csv`: Data from the City of Chicago Data Portal about vacant buildings. Note that this does not have latitude/longitude information and therefore was not joined with neighborhood names like the above datasets.

If you simply clone the repository on your local machine, all of these are also included in the `data/` folder, except `build_vio_nhood_data.csv`, as it is extremely big. You can use as many or as few of these as you would like, and also supplement with any additional information you can find and cite.
Here are some potential packages that might be of use in your exploratory data analysis:
# Loading packages ------------------------------------------------------------
require(tidyverse)
require(rgdal)
require(maptools)
require(rgeos)
require(maps)
require(mapproj)
require(ggthemes)
require(RCurl)
require(lubridate)
require(janitor)
require(hms)
require(kableExtra)
require(naniar) # for all you R nerds, here is a new package that is great for handling missingness
require(sf)
If you would like to download the zip file, unzip it, and read all the .csvs into R, here is code to do that.
download.file(url = "https://raw.githubusercontent.com/menawhalen/WiDS_data_dive_2020/master/data_sources.zip",
              destfile = "data_sources_WiDS.zip")
# unzip the .zip file
unzip(zipfile = "data_sources_WiDS.zip", exdir = "data_from_zip")
# list only the .csv files (the zip also contains the shapefile directory)
files <- list.files(path = "data_from_zip", pattern = "\\.csv$")
# iterate read_csv over the files; map() returns a named list of tibbles
# (the files have different columns, so we keep them separate rather than row-binding)
data <- map(file.path("data_from_zip", files), read_csv) %>%
  set_names(files)
Below is some R code for reading in the already-cleaned data. All dates and time stamps are coded to the correct type, and incident_number has been changed to an integer without a dash.
red_cross <- read_csv("data/Redcross.csv")
missingness_info <- red_cross %>% miss_var_summary() # displays the number and percent of NAs in each column
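If naniar is not available, an equivalent per-column NA summary can be computed with plain dplyr and tidyr. A toy sketch (the small tibble here is invented data for illustration):

```r
library(tidyverse)

# toy data with some missing values
toy <- tibble(a = c(1, NA, 3), b = c(NA, NA, 6))

# count NAs per column, then reshape to one row per variable,
# mirroring the shape of naniar::miss_var_summary()
na_summary <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_miss") %>%
  mutate(pct_miss = 100 * n_miss / nrow(toy)) %>%
  arrange(desc(n_miss))
```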
Below is the code that was used to create the cleaned rds file.
# read in the text file for red cross from Github
red_cross_text<-getURL("https://raw.githubusercontent.com/menawhalen/WiDS_data_dive_2020/master/data/Redcross.csv")
# reads the csv text into a data frame, with the first row as column names, separated by ","
red_cross <- read.csv(text = red_cross_text, header = TRUE, sep = ",", fileEncoding = "UTF-8") %>%
  as_tibble() %>% # puts it into a tibble
  clean_names() %>% # makes the column names lowercase
  rename(date = x_u_feff_date) %>% # drops the UTF-8 byte-order mark left in the first column name
# puts all the dates and time stamps in the right format for R
mutate(date=as.Date(date, format = "%Y-%m-%d"),
verified=as.Date(verified, format = "%Y-%m-%d"),
dispatched=as.Date(dispatched, format = "%Y-%m-%d"),
volunteers_identified=as.Date(volunteers_identified, format = "%Y-%m-%d"),
on_scene=as.Date(on_scene, format = "%Y-%m-%d"),
off_scene=as.Date(off_scene, format = "%Y-%m-%d"),
incident_number=as.character(incident_number),
incident_number=as.integer(gsub("-", "", incident_number)),# makes incident an integer
primary_neighborhood = factor(tolower(primary_neighborhood)),
secondary_neighborhood = factor(tolower(secondary_neighborhood)))
#write_rds(red_cross, "data/Redcross.rds")
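For example, the `incident_number` step above turns a dashed ID into a plain integer (the ID value here is hypothetical):

```r
incident_number <- "19-00257" # hypothetical dashed incident ID
as.integer(gsub("-", "", incident_number))
# -> 1900257
```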
Our incident data has primary and secondary neighborhood variables. The example code below joins the Red Cross incident data with spatial data based on primary neighborhood and makes a map of the number of incidents in 2019.
# Chicago Neighborhood Map ----------------------------------------------------
red_cross <- read_rds("data/Redcross.rds") %>%
as_tibble() %>%
clean_names()
# Forming Chicago Neighborhood Map Tibble ------------------------------------
#all spatial data files should be in neighborhood_shapefile directory! You need all of them!
il_map_dat <- readOGR(dsn = "data/neighborhood_shapefiles/Neighborhoods_2012b.shp") # read in the chicago neighborhood shapefile
il_map_dat@data <- il_map_dat@data %>%
  clean_names() %>%
  mutate(pri_neigh = str_to_lower(pri_neigh))
il_map_dat@data$id <- rownames(il_map_dat@data)
il_points <- fortify(il_map_dat, region = "pri_neigh") %>% #fortify helps us turn the map into data points in a data frame
as_tibble() %>%
filter(id %in% red_cross$primary_neighborhood)
#creating mapping dataset
il_map_df <- full_join(il_points, il_map_dat@data, by=c("id"= "pri_neigh")) %>%
as_tibble()
neighborhood_raw_count <- red_cross %>%
mutate(year = year(date),
primary_neighborhood = as.character(primary_neighborhood)) %>%
group_by(primary_neighborhood, year) %>%
summarise(incident_count = n()) %>% #counts the number of incidents in each community area in each year
ungroup() %>%
filter(year == 2019) %>% #we just will look at 2019
mutate(more_than_10 = case_when(incident_count <= 5 ~ "<= 5",
incident_count <= 10 ~ "6-10",
TRUE ~ "> 10")) # make a discrete coloring scale that we don't end up using
# make the plot!
il_map_df %>%
left_join(neighborhood_raw_count, by = c("id" = "primary_neighborhood")) %>%
ggplot(aes(long, lat, group = group, fill = incident_count)) +
scale_fill_gradient(
name = "Incident Count", # changes legend title
low = "blue",
high = "red",
space = "Lab") +
geom_polygon() +
geom_path(color = "black") +
coord_quickmap() +
theme_map() + ggtitle('Incidents Per Chicago Neighborhood in 2019')
This piece of code takes the latitude and longitude of the 311 calls (original file: `OLD_calls_311.csv`) and finds the neighborhood each call was made from. The updated version, with a `geometry` column, is saved as `calls_311.csv`.
nhood <- st_read(dsn="data/neighborhood_shapefiles/Neighborhoods_2012b.shp")
calls_311 <- read_csv('data/OLD_calls_311.csv' ) %>%
drop_na(longitude)
calls_311 <- st_as_sf(calls_311, coords = c("longitude", "latitude"))
st_crs(calls_311) <- st_crs(nhood)
calls_311_nhood_data <- st_join(calls_311, nhood['PRI_NEIGH'], join = st_intersects)
#write_csv(calls_311_nhood_data, 'calls_311.csv')
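Once each call has a `PRI_NEIGH` value from the spatial join, a natural next step is counting calls per neighborhood. A sketch on a toy stand-in for `calls_311_nhood_data` (the neighborhood values below are invented; calls falling outside any polygon come back as NA from `st_join`):

```r
library(tidyverse)

# toy stand-in for the joined 311 data
calls <- tibble(PRI_NEIGH = c("ENGLEWOOD", "ENGLEWOOD", "AUSTIN", NA))

# drop calls that matched no neighborhood, then count per neighborhood
calls_per_nhood <- calls %>%
  drop_na(PRI_NEIGH) %>%
  count(PRI_NEIGH, sort = TRUE)
```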
We’d like to thank Jim McGowan of the Chicago/Northern Illinois Red Cross for his partnership, and Kisa Kowal, Arend Kuyper, Karen Smilowitz, and Reut Nocham for their technical/logistical support.
Team 2
Team 4
Team 5
Team 7
Team 8