Data

Data Sources

The primary data source of this project is the Air Quality Historical Data Platform.
This public dataset collects daily air quality index (AQI) of various pollutants in each city from local weather bureau. All AQI values were calculated using the US EPA standard.
We can search for the name of our target city, and download a csv file containing daily AQI value for past seven years of six kinds of pollutants: PM2.5, PM10, O3, NO2, SO2 and CO. Data for all 100 cities were searched and downloaded manually by all group members.

The data of GDP and population in top 100 Chinese cities in GDP are collected online. We then create a cvs file containing city name, GDP in billion CNY, and population in thousand.

The daily average temperature data are from NOAA. We used below code chunk to extract weather data for each city:

weather_df = 
  rnoaa::meteo_pull_monitors(
    c("CHM00054511", "CHM00058362", "CHM00050953", "CHM00054342", "CHM00055591", "CHM00056294", "CHM00056778", "CHM00059287", "CHM00057036", "CHM00057494", "CHM00054161", "CHM00057687", "CHM00057515", "CHM00058847", "CHM00057816", "CHM00058321", "CHM00054823", "CHM00052889", "CHM00058606", "CHM00058238", "CHM00059431", "CHM00053698", "CHM00054527", "CHM00051463", "CHM00052866", "CHM00053614", "CHM00057083"),
    var = c("PRCP", "TAVG"), 
    date_min = "2020-02-01",
    date_max = "2020-04-30") %>%
  mutate(
    name = recode(
      id, 
      CHM00054511 = "Beijing", 
      CHM00058362 = "Shanghai",
      CHM00050953 = "Harbin",
      CHM00054342 = "Shenyang",
      CHM00055591 = "Lhasa",
      CHM00056294 = "Chengdu",
      CHM00056778 = "Kunming",
      CHM00059287 = "Guangzhou",
      CHM00057036 = "Xian",
      CHM00057494 = "Wuhan",
      CHM00054161 = "Changchun",
      CHM00057687 = "Changsha",
      CHM00057515 = "Chongqing",
      CHM00058847 = "Fuzhou",
      CHM00057816 = "Guiyang",
      CHM00058321 = "Hefei",
      CHM00054823 = "Jinan",
      CHM00052889 = "Lanzhou",
      CHM00058606 = "Nanchang",
      CHM00058238 = "Nanjing",
      CHM00059431 = "Nanning",
      CHM00053698 = "Shijiazhuang",
      CHM00053772 = "Taiyuan",
      CHM00054527 = "Tianjin",
      CHM00051463 = "Wulumuqi",
      CHM00052866 = "Xining",
      CHM00053614 = "Yinchuan",
      CHM00057083 = "Zhengzhou"),
    tavg = tavg / 10,
    prcp = prcp / 10) %>%
  select(-id) %>% 
  rename(city = name) %>% 
  relocate(city)

Data description

AQI data

The downloaded AQI files of each city are in two directories: “30_cities_data” contains the 30 major cities that were used for bar graphs, box plots and line charts. The “100_cities_data” contains all 100 cities we downloaded, which were used for creating the interactive map, fitting the regression models and performing statistical tests. Each city’s AQI file is very straightforward. There are only seven variables:
date: The date of the following pollutant’s AQI recordings, in YYYY/MM/DD format
pm25: The AQI of tiny particles or droplets in the air that are two and one half microns or less in width
pm10: The AQI of tiny particles or droplets in the air that are 10 microns or less in width
o3: Ozone’s AQI.
no2: nitrogen dioxide’s AQI
so2: sulfur dioxide’s AQI
co: carbon monoxide’s AQI

Source of PM 2.5 and PM 10’s definition

GDP and population data

The GDP and population dataset contains 4 variables:
rank: a city’s GDP ranking
city: city name
gdp_billion: GDP in billion CHY
population_thousand: population in thousand

Weather data

The resulting weather dataset contains 4 variables:
city: city name
date: the date that daily average temperature was collected
prcp: precipitation in mm
tavg: daily average temperature

Data cleaning and processing

The daily AQI data of each city was not ordered in time as expected. For example, January 2021 came before December 2021. We used as.Date() to covert all dates into R’s date format, and used the lubridate library and %within% to directly filter out the date period we are interested in.
After some exploratory graphing, We realized that not all pollutants from all time are recorded by each city’s local weather bureau, and we had to change our city selections. For example, we excluded the city “Hangzhou” from out original set of cities, since Hangzhou’s PM2.5 AQI data during the lockdown period was missing. We used another city close to Hangzhou to fill it’s position.

For graph representations, We choose 30 advanced-developed cities in china among the 100 cities. We wrote a function to loop through each city’s air quality data in the 30_cities_data directory, and did the same process for the 100_cities_data directory. We also create a data frame named: city_period_diff. It will hold each city’s mean daily AQI difference of each pollutant, between Feb~Aprl 2019 and Feb~Aprl 2020. We used interval() to create time intervals that will be used to extract air quality data from out interested time period: “2018-02-01” ~ “2018-04-30”, “2019-02-01” ~ “2019-04-30”, “2020-02-01” ~ “2020-04-30”, and “2021-02-01” ~ “2021-04-30”.

city_files = list.files("30_cities_data")
onehundrd_city_files = list.files("100_cities_data")


city_period_diff = 
  tibble(city = character(),
         pm25 = numeric(),
         pm10 = numeric(),
         o3 = numeric(),
         no2 = numeric(),
         so2 = numeric(),
         co = numeric(),
         )

period_18 = interval(ymd("2018-02-01"), ymd("2018-04-30"))
period_19 = interval(ymd("2019-02-01"), ymd("2019-04-30"))
period_20 = interval(ymd("2020-02-01"), ymd("2020-04-30"))
period_21 = interval(ymd("2021-02-01"), ymd("2021-04-30"))

Compute the daily mean pollutant AQI from Feb to Aprl of year 2019 and year 2020 in each city.

for (city_file in city_files) {
  #print(city_file)
  
  path = str_c("30_cities_data/", city_file)
  city = strsplit(city_file, split = '-')[[1]][1]
  
  cityAir = read_csv(path) %>% 
    mutate(date = as.Date(date, "%Y/%m/%d")) %>%
    arrange(date)
  
  cityAir_19 = cityAir %>% 
    filter(date %within% period_19) 
  
  cityAir_20 = cityAir %>% 
    filter(date %within% period_20)
  
  
  pm25_19 = mean(cityAir_19$pm25, na.rm = T)
  pm25_20 = mean(cityAir_20$pm25, na.rm = T)
  pm25d = pm25_20 - pm25_19
  
  pm10_19 = mean(cityAir_19$pm10, na.rm = T)
  pm10_20 = mean(cityAir_20$pm10, na.rm = T)
  pm10d = pm10_20 - pm10_19
  
  o3_19 = mean(cityAir_19$o3, na.rm = T)
  o3_20 = mean(cityAir_20$o3, na.rm = T)
  o3d = o3_20 - o3_19
  
  no2_19 = mean(cityAir_19$no2, na.rm = T)
  no2_20 = mean(cityAir_20$no2, na.rm = T)
  no2d = no2_20 - no2_19
  
  so2_19 = mean(cityAir_19$so2, na.rm = T)
  so2_20 = mean(cityAir_20$so2, na.rm = T)
  so2d = so2_20 - so2_19
  
  co_19 = mean(cityAir_19$co, na.rm = T)
  co_20 = mean(cityAir_20$co, na.rm = T)
  cod = co_20 - co_19
  
  city_period_diff = 
    city_period_diff %>% 
    add_row(city = city, 
            pm25 = pm25d,
            pm10 = pm10d,
            o3 = o3d,
            no2 = no2d,
            so2 = so2d,
            co = cod)
}

Then, we transfer the “city” in city_period_diff from the character to factor and make a decreasing sequence by using reorder based on the pm2.5.

city_period_diff = 
  city_period_diff %>% 
  mutate(
    city = paste(
      toupper(substring(city, 1, 1)), 
      substring(city, 2), 
      sep = ""),
    city = fct_reorder(city, pm25, .desc = T),
    )

Now we will see how the distribution of daily PM25 AQI differ between time period 2019 Feb-Aprl and 2020 Feb~Aprl.

city_AQI_Distribution = tibble()

for (city_file in city_files) {

  path = str_c("30_cities_data/", city_file)
  city = strsplit(city_file, split = '-')[[1]][1]
  
  cityAir = read_csv(path) %>% 
    mutate(date = as.Date(date, "%Y/%m/%d")) %>%
    arrange(date)
  
  city_19 = cityAir %>% 
    filter(date %within% period_19) %>% 
    mutate(period = "2019Feb-Aprl",
           day = format(date,"%m-%d"),
           city = city) %>% 
    relocate(city, period, day)
  
  #add a fake date "2019-02-29" with all AQI values as NA
  city_19 = 
    city_19 %>% 
    add_row(city = city, 
            period = "2019Feb-Aprl", 
            day = "02-29") %>% 
    mutate(day = as.factor(day))
  
  
  city_20 = cityAir %>% 
    filter(date %within% period_20) %>% 
    mutate(period = "2020Feb-Aprl",
           day = format(date,"%m-%d"),
           day = as.factor(day),
           city = city) %>% 
    relocate(city, period, day)
  
  city_AQI_Distribution = rbind(city_AQI_Distribution, city_19)
  city_AQI_Distribution = rbind(city_AQI_Distribution, city_20)
}

city_AQI_Distribution = 
  city_AQI_Distribution %>% 
  mutate(period = factor(period, levels = c("2020Feb-Aprl", "2019Feb-Aprl")),
         city = paste(
           toupper(substring(city, 1, 1)), 
           substring(city, 2), 
           sep = ""))

In order to see the previous data, we focus on past three years. so we tidy the past Four Years’ Feb~Aprl Mean PM2.5 AQI. The steps are the same as the previous one.

for (city_file in city_files) {
  #print(city_file)
  
  path = str_c("30_cities_data/", city_file)
  city = strsplit(city_file, split = '-')[[1]][1]
  
  cityAir = read_csv(path) %>% 
    mutate(date = as.Date(date, "%Y/%m/%d")) %>%
    arrange(date)
  
  cityAir_18 = cityAir %>% 
    filter(date %within% period_18)
  
  cityAir_19 = cityAir %>% 
    filter(date %within% period_19) 
  
  cityAir_20 = cityAir %>% 
    filter(date %within% period_20)
  
  cityAir_21 = cityAir %>% 
    filter(date %within% period_21)
  
  mean_18 = mean(cityAir_18$pm25, na.rm = T)
  mean_19 = mean(cityAir_19$pm25, na.rm = T)
  mean_20 = mean(cityAir_20$pm25, na.rm = T)
  mean_21 = mean(cityAir_21$pm25, na.rm = T)
  
  city_4year_meanPM25 = 
    city_4year_meanPM25 %>%
    add_row(city = city,
            mean_18 = mean_18,
            mean_19 = mean_19, 
            mean_20 = mean_20, 
            mean_21 = mean_21)
}

city_4year_meanPM25 = 
  city_4year_meanPM25 %>% 
  mutate(city = factor(city)) %>% 
  pivot_longer(
    mean_18:mean_21,
    values_to = "mean",
    names_to = "years"
  )

Besides analyzing PM2.5, we also want to know how NO2 mean API mean change for four years from Feb to Aprl. The steps for tidying and cleaning data for NO2 are similar.

city_4year_meanno2 = 
  tibble(city = character(),
         mean_18 = numeric(),
         mean_19 = numeric(),
         mean_20 = numeric(),
         mean_21 = numeric())

for (city_file in city_files) {
  #print(city_file)
  
  path = str_c("30_cities_data/", city_file)
  city = strsplit(city_file, split = '-')[[1]][1]
  
  cityAir = read_csv(path) %>% 
    mutate(date = as.Date(date, "%Y/%m/%d")) %>%
    arrange(date)
  
  cityAir_18 = cityAir %>% 
    filter(date %within% period_18)
  
  cityAir_19 = cityAir %>% 
    filter(date %within% period_19) 
  
  cityAir_20 = cityAir %>% 
    filter(date %within% period_20)
  
  cityAir_21 = cityAir %>% 
    filter(date %within% period_21)
  
  mean_18 = mean(cityAir_18$no2, na.rm = T)
  mean_19 = mean(cityAir_19$no2, na.rm = T)
  mean_20 = mean(cityAir_20$no2, na.rm = T)
  mean_21 = mean(cityAir_21$no2, na.rm = T)
  
  city_4year_meanno2 = 
    city_4year_meanno2 %>%
    add_row(city = city,
            mean_18 = mean_18,
            mean_19 = mean_19, 
            mean_20 = mean_20, 
            mean_21 = mean_21)
}

city_4year_meanno2 = 
  city_4year_meanno2 %>% 
  mutate(city = factor(city)) %>% 
  pivot_longer(
    mean_18:mean_21,
    values_to = "mean",
    names_to = "years"
  )

In terms of the dataset of GDP and population for regression analysis, we calculate mean PM2.5 AQI (Feb-Apr) in 2019 and 2020 and mean PM2.5 AQI difference (2019 minus 2020) for each top 100 city in GDP. We then use left_join function to join the GDP and population dataset to the mean PM2.5 AQI difference dataset to get the resulting dataset for regression analysis.

library(tidyverse)
library(ggridges)

theme_set(theme_minimal() + theme(legend.position = "bottom"))

options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis"
)

scale_colour_discrete = scale_colour_viridis_d
scale_fill_discrete = scale_fill_viridis_d

city_100_df = 
  tibble(
  file = list.files("100_cities_data")) %>% 
  mutate(
    city = str_remove(file, "-air-quality.csv"),
    path = str_c("100_cities_data/", file),
    data = map(path, read_csv)
  ) %>% 
  unnest(data) %>% 
  select(-file, -path) %>% 
  mutate(
    city = str_to_title(city),
    date = as.Date(date, format = "%Y/%m/%d"))

pm25_2020 = 
  city_100_df %>% 
  filter(date > "2020-01-31" & date < "2020-05-01") %>% 
  group_by(city) %>%
  summarize(mean_pm25_2020 = mean(pm25, na.rm = T))
  
pm25_2019 = 
  city_100_df %>% 
  filter(date > "2019-01-31" & date < "2019-05-01") %>% 
  group_by(city) %>%
  summarize(mean_pm25_2019 = mean(pm25, na.rm = T))

pm25_diff = 
  left_join(pm25_2020, pm25_2019) %>% 
  mutate(pm25_diff = mean_pm25_2019 - mean_pm25_2020)

gdp_pop_df = 
  read_csv("data/gpd_and_popluation.csv") %>% 
  janitor::clean_names() %>% 
  mutate(
    gdp_trillion = gdp_billion / 1000,
    pop_million = population_thousand / 1000) %>% 
  select(city, gdp_trillion, pop_million)

diff_gdp_pop_df = 
  left_join(pm25_diff, gdp_pop_df) %>% 
  select(-mean_pm25_2020, -mean_pm25_2019)

For regression of daily PM2.5 vs daily average temperature, we filter PM2.5 AQI from February through April in only 2020 , and collect daily average temperature data during this time period from NOAA. We then join them using left_join function.

weather_df = 
  rnoaa::meteo_pull_monitors(
    c("CHM00054511", "CHM00058362", "CHM00050953", "CHM00054342", "CHM00055591", "CHM00056294", "CHM00056778", "CHM00059287", "CHM00057036", "CHM00057494", "CHM00054161", "CHM00057687", "CHM00057515", "CHM00058847", "CHM00057816", "CHM00058321", "CHM00054823", "CHM00052889", "CHM00058606", "CHM00058238", "CHM00059431", "CHM00053698", "CHM00054527", "CHM00051463", "CHM00052866", "CHM00053614", "CHM00057083"),
    var = c("PRCP", "TAVG"), 
    date_min = "2020-02-01",
    date_max = "2020-04-30") %>%
  mutate(
    name = recode(
      id, 
      CHM00054511 = "Beijing", 
      CHM00058362 = "Shanghai",
      CHM00050953 = "Harbin",
      CHM00054342 = "Shenyang",
      CHM00055591 = "Lhasa",
      CHM00056294 = "Chengdu",
      CHM00056778 = "Kunming",
      CHM00059287 = "Guangzhou",
      CHM00057036 = "Xian",
      CHM00057494 = "Wuhan",
      CHM00054161 = "Changchun",
      CHM00057687 = "Changsha",
      CHM00057515 = "Chongqing",
      CHM00058847 = "Fuzhou",
      CHM00057816 = "Guiyang",
      CHM00058321 = "Hefei",
      CHM00054823 = "Jinan",
      CHM00052889 = "Lanzhou",
      CHM00058606 = "Nanchang",
      CHM00058238 = "Nanjing",
      CHM00059431 = "Nanning",
      CHM00053698 = "Shijiazhuang",
      CHM00053772 = "Taiyuan",
      CHM00054527 = "Tianjin",
      CHM00051463 = "Wulumuqi",
      CHM00052866 = "Xining",
      CHM00053614 = "Yinchuan",
      CHM00057083 = "Zhengzhou"),
    tavg = tavg / 10,
    prcp = prcp / 10) %>%
  select(-id) %>% 
  rename(city = name) %>% 
  relocate(city)

city_30_df =
  city_100_df %>% 
  filter(date > "2020-01-31" & date < "2020-05-01") %>% 
  filter(city %in% c("Beijing", "Shanghai", "Harbin", "Shenyang", "Lhasa", "Chengdu", "Kunming", "Guangzhou", "Xian", "Wuhan", "Changchun", "Changsha", "Chongqing", "Fuzhou", "Guiyang", "Hefei", "Jinan", "Lanzhou", "Nanchang", "Nanjing", "Nanning", "Shijiazhuang", "Taiyuan", "Tianjin", "Wulumuqi", "Xining", "Yinchuan", "Zhengzhou"))

pm25_tavg_df = 
  left_join(city_30_df, weather_df, by = c("city", "date")) %>% 
  arrange(date) %>% 
  select(city, date, pm25, tavg) %>% 
  filter(pm25 != "NA")