2 Data

3 Data

3.1 Data Sources

Dataset 1: “Greenhouse Gas Emissions from Large Facilities”
- Collector: Canadian Government
- Data Collection: This dataset is likely collected through mandatory reporting by facilities, adhering to government environmental regulations.
- Format: CSV
- Dimensions: Includes facility ID, name, company name, location, total emissions, and industry classification.
- Frequency of Updates: Annually
- Issues/Problems: Potential underreporting or inaccuracies in self-reported data.
Dataset 2: “PDGES-GHGRP-GHGEmissionsGES-2004-Present”
- Collector: Environment and Climate Change Canada
- Data Collection: Collected under the Greenhouse Gas Reporting Program (GHGRP), focusing on facilities emitting 10,000 tonnes or more of GHGs in CO2 equivalent per year.
- Format: CSV
- Dimensions: Extensive data including facility ID, location, year of reference, and detailed emission data across different categories.
- Frequency of Updates: Annually
- Issues/Problems: Complex data structure requiring significant preprocessing for analysis.

3.2 Importing the Data

The datasets will be imported into R for analysis. R is particularly adept at handling and processing data, making it an excellent choice for this task.

Reading the CSV Files: The read.csv() or read_csv() function in R will be used to import the datasets. The two functions are versatile and can handle various data formats and complexities.
Handling Encoding Issues: If there are any encoding issues (common with diverse data sources), the fileEncoding parameter can be adjusted accordingly.
Data Inspection: After importing the data, functions like head(), summary(), and str() can be used to inspect the datasets. This step is crucial to understand the structure and quality of the data.

3.3 Description

The datasets will be used to answer the research questions as follows:

Emission Patterns: Time-series analysis will be performed to observe trends in emissions over the years.
Sector Analysis: Emissions will be aggregated by industrial sectors to identify major contributors.
Geographical Distribution: Geospatial analysis will reveal regional disparities in emissions.
Facility-Specific Insights: Individual facilities with unusually high or low emissions will be identified for further investigation.

3.4 Missing value analysis

Code

library(ggplot2)
library(reshape2)
library(readr)
library(dplyr)
library(forcats)

Fac_emis <- read_csv("data/PDGES-GHGRP-GHGEmissionsGES-2004-Present.csv", locale=locale(encoding="latin1"))

data <- read.csv("data/Greenhouse gas emissions from large facilities.csv", na.strings = c("", "n/a", "N/A", "NA"))

# Create a dataframe to visualize missing values
missing_data <- is.na(data)

# Bar Plot for Missing Values
missing_counts <- colSums(missing_data)
missing_df <- data.frame(column = names(missing_counts), missing_count = missing_counts)
ggplot(missing_df, aes(x = reorder(column, missing_count), y = missing_count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Bar Plot of Missing Values in Each Column", x = "Columns", y = "Number of Missing Values") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code

# Heatmap for Missing Values
melted_data <- melt(missing_data)
ggplot(melted_data, aes(x = Var2, y = Var1)) +
  geom_tile(aes(fill = factor(value))) +
  scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "skyblue")) +
  labs(title = "Heatmap of Missing Values in the Dataset", x = "Columns", y = "Rows") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar plot and heat map show the missing values of Dataset 1 concentrate on 3 columns: City, Address, and Post Code. Compared with the missing values in Address and Post Code, the missing values in City are less and sparser. This data missing pattern makes sense because the three columns are all about geospatial features of the facilities. More precisely, if the city of one greenhouse gas emissions facility is missing, then the more detailed address and post code of such facility will also be missing for certain. On the other hand, even if the city is not missing, the specific address and post code of the facility could still be missing.

The missing values in Dataset 2 are distributed across almost all features. Hence, we decided to observe the data missing pattern with respect to each province in Canada

Code

Provinces <- unique(Fac_emis$`Facility Province or Territory / Province ou territoire de l'installation`)
Province_missing <- c()
Num_of_Reports <- c()

for (j in Provinces) {
  Province_missing <- append(Province_missing, sum(is.na(Fac_emis[Fac_emis$`Facility Province or Territory / Province ou territoire de l'installation` == j, ])))
  Num_of_Reports <- append(Num_of_Reports, nrow(Fac_emis[Fac_emis$`Facility Province or Territory / Province ou territoire de l'installation` == j, ]))
}



df <- data.frame(Provinces, Province_missing, Num_of_Reports)
ggplot(df, aes(x = Province_missing, y = fct_reorder(factor(Provinces), 
                                                     Province_missing))) +
  geom_col(fill='cornflowerblue') +
  xlab("Missing Value Count") +
  ylab("Provinces") +
  ggtitle("Pattern in Missing values based on Canadian Provinces") +
  theme(plot.title = element_text(color = "cornflowerblue"))

That is, we can see the missing values are mainly from 5 Canadian provinces: Alberta, Ontario, British Columbia, Quebec, Saskatchewan. However, if we count the number of filed facility emission reports from each province:

Code

ggplot(df, aes(x = Num_of_Reports, y = fct_reorder(factor(Provinces), 
                                                     Num_of_Reports))) +
  geom_col() +
  xlab("Number of Facility Emission Reports") +
  ylab("Provinces") +
  ggtitle("Number of Facility Emission Reports on each Province")

We can also see the majority of facility emission reports is concentrated on the same 5 Canadian provinces: Alberta, Ontario, British Columbia, Quebec, and Saskatchewan. Because these 5 provinces filed more facility emission reports, it would be natural to expect them having more missing values.

In order to derive a deeper sight to understand what the data missing pattern is across the Canadian provinces, we divide the total missing values in each province by the number of reports filed by that province to attain the average number of missing values with respect to each province’s facility emission report.

Code

df <- df %>% cbind(Average_missing = df$Province_missing/df$Num_of_Reports)
ggplot(df,
       aes(x=Average_missing, 
           y=fct_reorder(factor(Provinces), 
                         Average_missing, 
                         .na_rm=TRUE)))+
  geom_point(na.rm = TRUE)+
  xlab("Data Missing Density per Province")+
  ylab("Provinces")+
  ggtitle("Average Data Missing across Canadian Provinces")+
  theme(plot.title = element_text(color = "cornflowerblue"))

Now it becomes clear from the Cleveland plot that the provinces like Nunavut, Saskatchewan, Northwest Territories etc. will have more missing values for each facility emission report they filed. This can be explained by the fact that these provinces are deep in the North and West where the data collection task is more difficult to carry out due to the severe weather condition in comparison with other southern provinces in Canada.

Note: analyzing missing value pattern based on Canadian province is definitely not the only approach to analyze missing values in this extensive dataset. Other plausible ways could be analyzing missing value pattern for each year from 2004 to 2021, or analyzing missing value in one specific columns like Carbon dioxide in each province to develop deeper insight etc.