-c conda-forge stats-can
conda install # OR
-can
pip install stats# code not run for this document.
Libraries for Accessing Statistics Canada Data Tables
Goals
Find and extract socio-economic data from Statistics Canada. In this example we manipulate and plot unemployment rates and housing starts. Relationships between the two datasets are examined.
Data Provider
Statistics Canada hosts a large set of socio-economic datasets. When searching through these, take note of the table identifier. The stats-can and cansim packages provide Python and R APIs respectively for downloading these tables. See also the stats-can or cansim vignettes for examples and instructions to pull Statistics Canada datasets.
Installing stats-can and cansim
install.packages("cansim")
library(cansim)
vignette("cansim")
#code not run for this document.
Basic libraries
from stats_can import StatsCan # for pulling Stats Can socioeconomic data
import pandas as pd # for data wrangling
import seaborn as sb # for plotting
import matplotlib.pyplot as plt # for plot management
library(tidyverse) # for its nice flow.
library(cansim) #for pulling Stat Can socioeconomic data
library(lubridate) # for dealing with dates
library(GGally) # for pairs plot
library(ggpubr) # for putting a regression equation onto a ggplot figure
Pulling a table
Visit here for example to see labour force data in html. Notice that the StatCan webpage shows the table id, here it is table: 14-10-0287-01.
We’ll use that to grab the table. Be warned that this data table is a csv of around 1 GB in size.
= StatsCan() # We create a StatsCan Object that will be used to pull data sc
/Users/iamdavecampbell/anaconda3/lib/python3.11/site-packages/stats_can/api_class.py:24: FutureWarning: This class will be deprecated in upcoming v3 release. Please see the docs for details
warn(
= sc.table_to_df("14-10-0287-01") data
/Users/iamdavecampbell/anaconda3/lib/python3.11/site-packages/stats_can/sc.py:608: FutureWarning: This function will be deprecated in the v3 release. Please see the docs for details.
warn(
/Users/iamdavecampbell/anaconda3/lib/python3.11/site-packages/stats_can/sc.py:326: FutureWarning: This function will be deprecated in the v3 release. Please see the docs for details.
warn(
data.head()
REF_DATE GEO DGUID ... SYMBOL TERMINATED DECIMALS
0 1976-01-01 Canada 2016A000011124 ... NaN NaN 1
1 1976-01-01 Canada 2016A000011124 ... NaN NaN 1
2 1976-01-01 Canada 2016A000011124 ... NaN NaN 1
3 1976-01-01 Canada 2016A000011124 ... NaN NaN 1
4 1976-01-01 Canada 2016A000011124 ... NaN NaN 1
[5 rows x 19 columns]
data.tail()
REF_DATE GEO DGUID ... SYMBOL TERMINATED DECIMALS
5241748 2024-07-01 British Columbia 2016A000259 ... NaN NaN 1
5241749 2024-07-01 British Columbia 2016A000259 ... NaN NaN 1
5241750 2024-07-01 British Columbia 2016A000259 ... NaN NaN 1
5241751 2024-07-01 British Columbia 2016A000259 ... NaN NaN 1
5241752 2024-07-01 British Columbia 2016A000259 ... NaN NaN 1
[5 rows x 19 columns]
= get_cansim("14-10-0287-01") |> rename_all(make.names)
data
|> glimpse() data
Rows: 5,241,753
Columns: 33
$ REF_DATE <chr> "1976-01", "1976-…
$ GEO <chr> "Canada", "Canada…
$ DGUID <chr> "2016A000011124",…
$ UOM <chr> "Persons", "Perso…
$ UOM_ID <chr> "249", "249", "24…
$ SCALAR_FACTOR <chr> "thousands", "tho…
$ SCALAR_ID <chr> "3", "3", "3", "3…
$ VECTOR <chr> "v2062809", "v206…
$ COORDINATE <chr> "1.1.1.1.1.1", "1…
$ VALUE <dbl> 16852.4, 16852.4,…
$ STATUS <chr> NA, NA, NA, NA, N…
$ SYMBOL <chr> NA, NA, NA, NA, N…
$ TERMINATED <chr> NA, NA, NA, NA, N…
$ DECIMALS <chr> "1", "1", "1", "1…
$ GeoUID <chr> "11124", "11124",…
$ Hierarchy.for.GEO <chr> "1", "1", "1", "1…
$ Classification.Code.for.Labour.force.characteristics <chr> NA, NA, NA, NA, N…
$ Hierarchy.for.Labour.force.characteristics <chr> "1", "1", "1", "1…
$ Classification.Code.for.Sex <chr> NA, NA, NA, NA, N…
$ Hierarchy.for.Sex <chr> "1", "1", "1", "1…
$ Classification.Code.for.Age.group <chr> NA, NA, NA, NA, N…
$ Hierarchy.for.Age.group <chr> "1", "1", "8", "8…
$ Classification.Code.for.Statistics <chr> NA, NA, NA, NA, N…
$ Hierarchy.for.Statistics <chr> "1", "1", "1", "1…
$ Classification.Code.for.Data.type <chr> NA, NA, NA, NA, N…
$ Hierarchy.for.Data.type <chr> "1", "2", "1", "2…
$ val_norm <dbl> 16852400, 1685240…
$ Date <date> 1976-01-01, 1976…
$ Labour.force.characteristics <fct> Population, Popul…
$ Sex <fct> Both sexes, Both …
$ Age.group <fct> 15 years and over…
$ Statistics <fct> Estimate, Estimat…
$ Data.type <fct> Seasonally adjust…
There are nearly 5 million rows of data!
Make a few plots of Employment statistics
Stat Can tables are structured so that each row has a single value and columns define attributes about that value Each row of data has a specific attribute sex, age (Age.group), and Location (GEO) attributes, for units of the value (SCALAR_FACTOR), the specific type of labour force characteristic measured (Labour.force.characteristics), and the different data types such as point estimates and (bootstrap) standard errors (Statistics), and raw or seasonally adjusted (Data.type). In general this means that we need to filter and reshape the data.
Let’s inventory a few of these columns so that the exact spelling of attributes can be used below in the plots.
"Statistics"].unique() data[
['Estimate', 'Standard error of estimate', 'Standard error of month-to-month change', 'Standard error of year-over-year change']
Categories (4, object): ['Estimate', 'Standard error of estimate',
'Standard error of month-to-month change', 'Standard error of year-over-year change']
"Age group"].unique() data[
['15 years and over', '15 to 64 years', '15 to 24 years', '15 to 19 years', '20 to 24 years', '25 years and over', '25 to 54 years', '55 years and over', '55 to 64 years']
Categories (9, object): ['15 to 19 years', '15 to 24 years', '15 to 64 years', '15 years and over', ...,
'25 to 54 years', '25 years and over', '55 to 64 years',
'55 years and over']
"GEO"].unique() data[
['Canada', 'Newfoundland and Labrador', 'Prince Edward Island', 'Nova Scotia', 'New Brunswick', ..., 'Ontario', 'Manitoba', 'Saskatchewan', 'Alberta', 'British Columbia']
Length: 11
Categories (11, object): ['Alberta', 'British Columbia', 'Canada', 'Manitoba', ..., 'Ontario',
'Prince Edward Island', 'Quebec', 'Saskatchewan']
"Data type"].unique() data[
['Seasonally adjusted', 'Unadjusted', 'Trend-cycle']
Categories (3, object): ['Seasonally adjusted', 'Trend-cycle', 'Unadjusted']
$Labour.force.characteristics |> unique() data
[1] Population Labour force Employment
[4] Full-time employment Part-time employment Unemployment
[7] Unemployment rate Participation rate Employment rate
9 Levels: Population Labour force Employment ... Employment rate
$Statistics |> unique() data
[1] Estimate
[2] Standard error of estimate
[3] Standard error of month-to-month change
[4] Standard error of year-over-year change
4 Levels: Estimate ... Standard error of year-over-year change
$Age.group |> unique() data
[1] 15 years and over 15 to 64 years 15 to 24 years 15 to 19 years
[5] 20 to 24 years 25 years and over 25 to 54 years 55 years and over
[9] 55 to 64 years
9 Levels: 15 years and over 15 to 24 years 15 to 19 years ... 55 to 64 years
$GEO |> unique() data
[1] "Canada" "Newfoundland and Labrador"
[3] "Prince Edward Island" "Nova Scotia"
[5] "New Brunswick" "Quebec"
[7] "Ontario" "Manitoba"
[9] "Saskatchewan" "Alberta"
[11] "British Columbia"
Plotting provinces against each other
Let’s consider provincial differences in employment rate for 25-55 year olds. Using the spelling from the above output we can set up the data filters.
Important consideration when pivoting
Pivoting a data table will spread out a variable into new columns. Sometimes it’s more convenient to have each province as it’s own column. Here we do this to plot provinces against each other and see how they relate. In this case, we want each province’s unemployment rate from the “Labour force characteristics” column. However, notice that there is a “Data type” column that shows two possible entries for the unemployment rate attribute in the “Labour force characteristics”. If we filtered the data without selected one of these Data types, and then pivoted, python would take the mean of these two values!
= data[["GEO",
provinces "Sex",
"Age group",
"Labour force characteristics",
"REF_DATE",
"VALUE",
"Data type"]]
= provinces[(provinces["Sex"] == "Both sexes") &
provinces "GEO"] != "Canada") &
(provinces["Labour force characteristics"] == "Unemployment rate") &
(provinces["Age group"] == "25 to 54 years") &
(provinces["Data type"] == "Unadjusted")]
(provinces[
= pd.pivot_table(provinces,
pivoted = ["Sex",
index "Age group",
"Labour force characteristics",
"REF_DATE"],
= "VALUE",
values = "GEO",
columns = False).reset_index()
observed
pivoted.head()
GEO Sex Age group ... Quebec Saskatchewan
0 Both sexes 25 to 54 years ... 6.4 3.2
1 Both sexes 25 to 54 years ... 7.0 3.2
2 Both sexes 25 to 54 years ... 6.6 3.0
3 Both sexes 25 to 54 years ... 6.8 3.2
4 Both sexes 25 to 54 years ... 5.6 2.4
[5 rows x 14 columns]
When pivot_wider Works, But Not As Intended; Common Errors and How to Fix Them
Pivoting a data table will spread out a variable into new columns. Sometimes it’s more convenient to have each province as it’s own column. Here we do this to plot provinces against each other and see how they relate. It can be helpful to see how to accidentally break things, so let’s start there.
I will need to filter the data so let’s start there. Here we go right from selecting columns, filtering and then pivoting wider, which will help when plotting. Take a look at the output.
|> select(GEO,Sex,Age.group,Labour.force.characteristics,REF_DATE,VALUE)|>
data filter(Sex == "Both sexes") |>
filter(GEO != "Canada")|> # since the Canada value will be a type of weighted average
filter( Labour.force.characteristics == "Unemployment rate")|>
filter(Age.group == "25 to 54 years")|>
pivot_wider(names_from= GEO, values_from=VALUE) |> # puts each province into its own column
rename_all(make.names) # rebuilt the province names
Warning: Values from `VALUE` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
{data} |>
dplyr::summarise(n = dplyr::n(), .by = c(Sex, Age.group,
Labour.force.characteristics, REF_DATE, GEO)) |>
dplyr::filter(n > 1L)
# A tibble: 583 × 14
Sex Age.group Labour.force.charact…¹ REF_DATE Newfoundland.and.Lab…²
<fct> <fct> <fct> <chr> <list>
1 Both sexes 25 to 54 y… Unemployment rate 1976-01 <dbl [5]>
2 Both sexes 25 to 54 y… Unemployment rate 1976-02 <dbl [5]>
3 Both sexes 25 to 54 y… Unemployment rate 1976-03 <dbl [5]>
4 Both sexes 25 to 54 y… Unemployment rate 1976-04 <dbl [5]>
5 Both sexes 25 to 54 y… Unemployment rate 1976-05 <dbl [5]>
6 Both sexes 25 to 54 y… Unemployment rate 1976-06 <dbl [5]>
7 Both sexes 25 to 54 y… Unemployment rate 1976-07 <dbl [5]>
8 Both sexes 25 to 54 y… Unemployment rate 1976-08 <dbl [5]>
9 Both sexes 25 to 54 y… Unemployment rate 1976-09 <dbl [5]>
10 Both sexes 25 to 54 y… Unemployment rate 1976-10 <dbl [5]>
# ℹ 573 more rows
# ℹ abbreviated names: ¹Labour.force.characteristics,
# ²Newfoundland.and.Labrador
# ℹ 9 more variables: Prince.Edward.Island <list>, Nova.Scotia <list>,
# New.Brunswick <list>, Quebec <list>, Ontario <list>, Manitoba <list>,
# Saskatchewan <list>, Alberta <list>, British.Columbia <list>
The provinces are now list variables, whereas they should be doubles (single numbers). It wasn’t obvious to me what went wrong the first time I broke a pivot table like this. Let’s re-run those steps without the pivot_wider and see if it helps to diagnose the root of the problem. This is one of the great parts of tidyverse, we just rerun a subset of the above code, or explicitely extract one of those list elements from the original call:
# run fewer lines of the previous code:
|> select(GEO,Sex,Age.group,Labour.force.characteristics,REF_DATE,VALUE)|>
data filter(Sex == "Both sexes") |>
filter(GEO != "Canada")|> # since the Canada value will be a type of weighted average
filter( Labour.force.characteristics == "Unemployment rate")|>
filter(Age.group == "25 to 54 years")
# A tibble: 29,150 × 6
GEO Sex Age.group Labour.force.charact…¹ REF_DATE VALUE
<chr> <fct> <fct> <fct> <chr> <dbl>
1 Newfoundland and Labra… Both… 25 to 54… Unemployment rate 1976-01 10.1
2 Newfoundland and Labra… Both… 25 to 54… Unemployment rate 1976-01 11.4
3 Newfoundland and Labra… Both… 25 to 54… Unemployment rate 1976-01 NA
4 Newfoundland and Labra… Both… 25 to 54… Unemployment rate 1976-01 NA
5 Newfoundland and Labra… Both… 25 to 54… Unemployment rate 1976-01 NA
6 Prince Edward Island Both… 25 to 54… Unemployment rate 1976-01 5.7
7 Prince Edward Island Both… 25 to 54… Unemployment rate 1976-01 7.3
8 Prince Edward Island Both… 25 to 54… Unemployment rate 1976-01 NA
9 Prince Edward Island Both… 25 to 54… Unemployment rate 1976-01 NA
10 Prince Edward Island Both… 25 to 54… Unemployment rate 1976-01 NA
# ℹ 29,140 more rows
# ℹ abbreviated name: ¹Labour.force.characteristics
## or view the problem this way, select Ontario and check out the first element:
# data |> select(GEO,Sex,Age.group,Labour.force.characteristics,REF_DATE,VALUE)|>
# filter(Sex == "Both sexes") |>
# filter(GEO != "Canada")|> # since the Canada value will be a type of weighted average
# filter( Labour.force.characteristics == "Unemployment rate")|>
# filter(Age.group == "25 to 54 years")|>
# pivot_wider(names_from= GEO, values_from=VALUE) |> # puts each province into its own column
# rename_all(make.names) |>
# select(Ontario)|> head|> .[[1]]
If you look closely at the output in the first strategy you’ll notice that there are multiple rows which only differ in their VALUE. This means that pivot_wider is mapping multiple rows of VALUE into a single element. R handles this by smashing all those values together into list elements.
In order to have a one-to-one mapping add another variable to create a unique mapping. Taking another close look at the glimpse output above, we can see that there is more than one Data.type. Checking out its values we see the following:
$Data.type |> unique() data
[1] Seasonally adjusted Unadjusted Trend-cycle
Levels: Seasonally adjusted Unadjusted Trend-cycle
The multiple VALUEs are made unique when considering the Data.type variable. To simplify things, let’s select the Unadjusted unemployment rate.
When pivot_wider Works As Intended
|> select(GEO,Sex,Age.group,Labour.force.characteristics,REF_DATE,VALUE,Data.type)|>
data filter(Sex == "Both sexes") |>
filter(GEO != "Canada")|> # since the Canada value will be a type of weighted average
filter( Labour.force.characteristics == "Unemployment rate")|>
filter(Age.group == "25 to 54 years")|>
filter(Data.type == "Unadjusted")|>
pivot_wider(names_from= GEO, values_from=VALUE) |> # puts each province into its own column
rename_all(make.names)
# A tibble: 583 × 15
Sex Age.group Labour.force.characteristics REF_DATE Data.type
<fct> <fct> <fct> <chr> <fct>
1 Both sexes 25 to 54 years Unemployment rate 1976-01 Unadjusted
2 Both sexes 25 to 54 years Unemployment rate 1976-02 Unadjusted
3 Both sexes 25 to 54 years Unemployment rate 1976-03 Unadjusted
4 Both sexes 25 to 54 years Unemployment rate 1976-04 Unadjusted
5 Both sexes 25 to 54 years Unemployment rate 1976-05 Unadjusted
6 Both sexes 25 to 54 years Unemployment rate 1976-06 Unadjusted
7 Both sexes 25 to 54 years Unemployment rate 1976-07 Unadjusted
8 Both sexes 25 to 54 years Unemployment rate 1976-08 Unadjusted
9 Both sexes 25 to 54 years Unemployment rate 1976-09 Unadjusted
10 Both sexes 25 to 54 years Unemployment rate 1976-10 Unadjusted
# ℹ 573 more rows
# ℹ 10 more variables: Newfoundland.and.Labrador <dbl>,
# Prince.Edward.Island <dbl>, Nova.Scotia <dbl>, New.Brunswick <dbl>,
# Quebec <dbl>, Ontario <dbl>, Manitoba <dbl>, Saskatchewan <dbl>,
# Alberta <dbl>, British.Columbia <dbl>
Plotting Variables Against Each Other.
sb.pairplot(pivoted, vars = pivoted.columns[4:14]).fig.suptitle("Monthly Unemployment Rate: Not Seasonally Adjusted",
= 1.01,
y = 14)
fontsize ; plt.show()
|> select(GEO,Sex,Age.group,Labour.force.characteristics,REF_DATE,VALUE,Data.type)|>
data filter(Sex == "Both sexes") |>
filter(GEO != "Canada")|> # since the Canada value will be a type of weighted average
filter( Labour.force.characteristics == "Unemployment rate")|>
filter(Age.group == "25 to 54 years")|>
filter(Data.type == "Unadjusted")|>
pivot_wider(names_from= GEO, values_from=VALUE) |> # puts each province into its own column
rename_all(make.names) |>
ggpairs( columns = 6:15, title = "Monthly Unemployment Rate: Not Seasonaly Adjusted",
axisLabels = "show")
Plotting the Provinces as Time Series
Notice how we produce year and month columns by accessing properties of pandas datetime objects.
"month"] = provinces["REF_DATE"].dt.month
provinces["year"] = provinces["REF_DATE"].dt.year
provinces[
= list(provinces["GEO"].unique())
x = [1,2,3,4,5,6,7,8,9,10]
y
"darkgrid")
sb.set_style(
for province, i in zip(x, y):
3,4,i)
plt.subplot(
plt.tight_layout()
= provinces[provinces["GEO"] == province],
sb.lineplot(data = "year",
x = "VALUE")
y
'', frameon = False)
plt.legend(= 8)
plt.title(province, fontsize if province == "Quebec":
"Unemployment Rate")
plt.ylabel(else:
"")
plt.ylabel(
= 0.9)
plt.subplots_adjust(top 'Unemployment Rates by Province', fontsize=14)
plt.suptitle(; plt.show()
We won’t bother pivoting the data table anymore, since it’s convenient for ggplot to have a column as an identifying feature.
Notice that the REF_DATE column is a character but it should be a date. We’ll use the lubridate package for converting into dates. The helpfile for the parse_date_time function is really useful for specifying the input format, as it clarifies that Y is a year with century but y is the two digit year without century. Using the wrong one will break the mutation.
|> select(GEO,Sex,Age.group,Labour.force.characteristics,REF_DATE,VALUE,Data.type)|>
data filter(Sex == "Both sexes") |>
filter(GEO != "Canada")|> # since the Canada value will be a type of weighted average
filter( Labour.force.characteristics == "Unemployment rate")|>
filter(Age.group == "25 to 54 years")|>
filter(Data.type == "Unadjusted")|>
mutate(date = parse_date_time(REF_DATE, orders = "Y-m"))|> # convert from character date to date format.
rename(unemployment.rate = VALUE)|>
ggplot(aes(x=date, y = unemployment.rate, colour = GEO))+
geom_line()+
ggtitle("Unemployment Rates by Province")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+# rotate axis labels so they don't overlap
facet_wrap(~GEO)
Plot the Within Year Seasonal Trends
We plot unemployment for each year in each province using a different line. In order to highlight the effect of the pandemic we first call seaborn lineplot on a subset of the data up to the year 2019. Each year is given a different colour by setting the “hue” parameter to “year”. We then call seaborn linepot on a subset of the data from 2020 onwards. We choose a different colour palette so the pandemic years are clearly identifiable. Seaborn will overly successive plots on the same graph unless the plot is cleared using plt.clf().
In some provinces, the strong seasonality of the unemployment rate realy pops out.
= list(provinces["GEO"].unique())
x = [1,2,3,4,5,6,7,8,9,10]
y
plt.clf()
for province, i in zip(x, y):
3,4,i)
plt.subplot(
plt.tight_layout()= provinces[(provinces["GEO"] == province) &
sb.lineplot(data "year"] < 2020)],
(provinces[= "month",
x = "VALUE",
y = "year",
hue = "crest",
palette = 0.5)
linewidth
= provinces[(provinces["GEO"] == province) &
sb.lineplot(data "year"] >= 2020)],
(provinces[= "month",
x = "VALUE",
y = "year",
hue = "Reds_d",
palette = 2.5)
linewidth if province == "British Columbia":
="Year",
plt.legend(title=(1.05, 1),
bbox_to_anchor='upper left',
loc=0,
borderaxespad= 3)
ncol else:
'', frameon = False)
plt.legend(= 8)
plt.title(province, fontsize if province == "Quebec":
"Unemployment Rate")
plt.ylabel(else:
"")
plt.ylabel(
= 0.9)
plt.subplots_adjust(top 'Unemployment Rates by Province: Highlighting Pandemic Years',
plt.suptitle(= 13)
fontsize ; plt.show()
Let’s split the date into year and month columns using the separate function. Then within a provincial facet we can plot each year as a different line. This figure makes lines darker as they go through the years by using year as the colour variable but highlights 2020-2021 in red using a second call to the geom_line function. There are a few strategies to do this, here I built a dataset with a new mutated variable specifying the 2020-2021 years and used it for plotting. This begins with a lot of filtering to obtain the ‘right’ data.
In some provinces, the strong seasonality of the unemployment rate realy pops out.
= data |> select(GEO,Sex,Age.group,Labour.force.characteristics,REF_DATE,VALUE,Data.type)|>
data4plot filter(Sex == "Both sexes") |>
filter(GEO != "Canada")|> # since the Canada value will be a type of weighted average
filter( Labour.force.characteristics == "Unemployment rate")|>
filter(Age.group == "25 to 54 years")|>
filter(Data.type == "Unadjusted")|>
separate(REF_DATE, into = c("year","month"),sep = "-")|>#split year-month into year month columns
mutate(year = as.numeric(year))|>
mutate(month = as.numeric(month))|>
rename(unemployment.rate = VALUE)
ggplot()+
geom_line(data=data4plot,aes(x=month, y = unemployment.rate, colour = year, group = year))+
# re-add in the pandemic year wwhile forcing the colour red
geom_line(data=data4plot|> filter(year>2019),aes(x=month, y = unemployment.rate, group = year),colour = "red",lwd=1.24)+
#add to the manual legend.
scale_x_continuous(breaks = c(2,4,6,8,10))+
ggtitle("Unadjusted Unemployment Rates by Province for 25 - 54 Year Olds: highlighting pandemic years")+
facet_wrap(~GEO)
Plot the within year seasonal trends, but highlight by decade
A “decade” column is produced using floor division. The yearly data lines are colour grouped by decade so 2020-2021 are inherently already highlighted.
"decade"] = provinces["year"] // 10 * 10
provinces[
for province, i in zip(x, y):
3,4,i)
plt.subplot(
plt.tight_layout()
= provinces[provinces["GEO"] == province],
sb.lineplot(data = "month",
x = "VALUE",
y = "decade",
hue = "bright",
palette = "year",
units = None)
estimator
'', frameon = False)
plt.legend(= 8)
plt.title(province, fontsize
if province == "British Columbia":
="decade",
plt.legend(title=(1.05, 1),
bbox_to_anchor='upper left',
loc=0,
borderaxespad= 3)
ncol else:
'', frameon = False)
plt.legend(
if province == "Quebec":
"Unemployment Rate")
plt.ylabel(else:
"")
plt.ylabel(
= 0.9)
plt.subplots_adjust(top 'Unemployment Rates by Province: Highlighted by Decade', fontsize=13)
plt.suptitle(; plt.show()
This figure uses a mutated variable defining the decade to set line colours, so 2020-2021 are inherently already highlighted.
This time the Canadian average is included, despite the fact that it is a weighted average of the provincial values. Note that this time we are also keeping the filtered, mutated data to simplify merging datasets later.
= data |> select(GEO,Sex,Age.group,Labour.force.characteristics,REF_DATE,VALUE,Data.type)|>
unemployment filter(Sex == "Both sexes") |>
# filter(GEO != "Canada")|> # since the Canada value will be a type of weighted average
filter( Labour.force.characteristics == "Unemployment rate")|>
filter(Age.group == "25 to 54 years")|>
filter(Data.type == "Unadjusted")|>
separate(REF_DATE, into = c("year","month"),sep = "-")|>#split year-month into year month columns
mutate(year = as.numeric(year))|>
mutate(month = as.numeric(month))|>
mutate(pandemic = 1+.1*as.numeric(as.numeric(year>=2020)))|>
mutate(decade = factor(floor(year/10)*10 ))|>
rename(unemployment.rate = VALUE)|>
rename(province = GEO)
ggplot(data = unemployment)+
geom_line(aes(x=month, y = unemployment.rate, colour = decade, group = year))+
scale_colour_brewer(palette = "Paired")+ # nicer colours
scale_x_continuous(breaks = c(2,4,6,8,10))+ # place axis labels on even numbered months
ggtitle("Unemployment Rates by Province")+
facet_wrap(~province)
Housing starts
The housing starts are more complicated beacuse they are based on cities but the unemployment is based on provices
= sc.table_to_df("34-10-0154-01") housing
/Users/iamdavecampbell/anaconda3/lib/python3.11/site-packages/stats_can/sc.py:608: FutureWarning: This function will be deprecated in the v3 release. Please see the docs for details.
warn(
/Users/iamdavecampbell/anaconda3/lib/python3.11/site-packages/stats_can/sc.py:326: FutureWarning: This function will be deprecated in the v3 release. Please see the docs for details.
warn(
housing.head()
REF_DATE GEO DGUID ... SYMBOL TERMINATED DECIMALS
0 1972-01-01 Census metropolitan areas NaN ... NaN NaN 0
1 1972-01-01 Census metropolitan areas NaN ... NaN NaN 0
2 1972-01-01 Census metropolitan areas NaN ... NaN NaN 0
3 1972-01-01 Census metropolitan areas NaN ... NaN NaN 0
4 1972-01-01 Census metropolitan areas NaN ... NaN NaN 0
[5 rows x 16 columns]
# the unique cities
"GEO"].unique() housing[
['Census metropolitan areas', 'Brantford, Ontario', 'Calgary, Alberta', 'Edmonton, Alberta', 'Greater Sudbury, Ontario', ..., 'Barrie, Ontario', 'Trois-Rivières, Quebec', 'Abbotsford-Mission, British Columbia', 'Sherbrooke, Quebec', 'Montréal, Quebec']
Length: 37
Categories (37, object): ['Abbotsford-Mission, British Columbia', 'Barrie, Ontario', 'Brantford, Ontario',
'Calgary, Alberta', ..., 'Vancouver, British Columbia',
'Victoria, British Columbia', 'Windsor, Ontario', 'Winnipeg, Manitoba']
= get_cansim("34-10-0154-01") |> rename_all(make.names) housing
Accessing CANSIM NDM product 34-10-0154 from Statistics Canada
Parsing data
|> glimpse() housing
Rows: 329,940
Columns: 24
$ REF_DATE <chr> "1972-01", "1972-01", "1972-…
$ GEO <chr> "Census metropolitan areas",…
$ DGUID <chr> NA, NA, NA, NA, NA, NA, NA, …
$ UOM <chr> "Units", "Units", "Units", "…
$ UOM_ID <chr> "300", "300", "300", "300", …
$ SCALAR_FACTOR <chr> "units", "units", "units", "…
$ SCALAR_ID <chr> "0", "0", "0", "0", "0", "0"…
$ VECTOR <chr> "v42127250", "v42127251", "v…
$ COORDINATE <chr> "1.1.1", "1.1.2", "1.1.3", "…
$ VALUE <dbl> 7969, 3287, 615, 734, 3333, …
$ STATUS <chr> NA, NA, NA, NA, NA, NA, NA, …
$ SYMBOL <chr> NA, NA, NA, NA, NA, NA, NA, …
$ TERMINATED <chr> NA, NA, NA, NA, NA, NA, NA, …
$ DECIMALS <chr> "0", "0", "0", "0", "0", "0"…
$ GeoUID <chr> NA, NA, NA, NA, NA, NA, NA, …
$ Hierarchy.for.GEO <chr> "1", "1", "1", "1", "1", "1"…
$ Classification.Code.for.Housing.estimates <chr> NA, NA, NA, NA, NA, NA, NA, …
$ Hierarchy.for.Housing.estimates <chr> "1", "1", "1", "1", "1", "2"…
$ Classification.Code.for.Type.of.unit <chr> NA, NA, NA, NA, NA, NA, NA, …
$ Hierarchy.for.Type.of.unit <chr> "1", "1.2", "1.3", "1.4", "1…
$ val_norm <dbl> 7969, 3287, 615, 734, 3333, …
$ Date <date> 1972-01-01, 1972-01-01, 197…
$ Housing.estimates <fct> Housing starts, Housing star…
$ Type.of.unit <fct> Total units, Single-detached…
# the unique cities
|> pull(GEO) |> unique() housing
[1] "Census metropolitan areas"
[2] "Brantford, Ontario"
[3] "Calgary, Alberta"
[4] "Edmonton, Alberta"
[5] "Greater Sudbury, Ontario"
[6] "Guelph, Ontario"
[7] "Halifax, Nova Scotia"
[8] "Hamilton, Ontario"
[9] "Kingston, Ontario"
[10] "Kitchener-Cambridge-Waterloo, Ontario"
[11] "London, Ontario"
[12] "Moncton, New Brunswick"
[13] "Montréal excluding Saint-Jérôme, Quebec"
[14] "Ottawa-Gatineau, Ontario/Quebec"
[15] "Ottawa-Gatineau, Ontario part, Ontario/Quebec"
[16] "Ottawa-Gatineau, Quebec part, Ontario/Quebec"
[17] "Peterborough, Ontario"
[18] "Québec, Quebec"
[19] "Regina, Saskatchewan"
[20] "St. Catharines-Niagara, Ontario"
[21] "St. John's, Newfoundland and Labrador"
[22] "Saint John, New Brunswick"
[23] "Saguenay, Quebec"
[24] "Saskatoon, Saskatchewan"
[25] "Thunder Bay, Ontario"
[26] "Toronto, Ontario"
[27] "Vancouver, British Columbia"
[28] "Victoria, British Columbia"
[29] "Windsor, Ontario"
[30] "Winnipeg, Manitoba"
[31] "Kelowna, British Columbia"
[32] "Oshawa, Ontario"
[33] "Barrie, Ontario"
[34] "Trois-Rivières, Quebec"
[35] "Abbotsford-Mission, British Columbia"
[36] "Sherbrooke, Quebec"
[37] "Montréal, Quebec"
To merge the housing starts with the unemployment, we need to set both to the same measurement unit, provinces per {month, year}. To construct provincial housing starts, sum over the municipailites within a province. Many of these StatCan tables have multiple data types, so we will only consider Housing.estimates = “Housing starts” and Type.of.unit = “Total units”.
Notice that Ottawa-Gatineau is split into ON and QC parts.
Let’s use the string split function to split apart the province and city. Considering the GEO location “Ottawa-Gatineau, Ontario part, Ontario/Quebec” and splitting at the “,”, we will have to specify that we want to split at the first comma. Also note that location “Census metropolitan area” does not contain province information, so ‘None’ will appear in that column.
# Build the city and province columns by splitting at the first comma - notice the space after the comma - this serves to discard that space
'city', 'province']] = housing['GEO'].str.split(', ', n=1, expand=True)
housing[[
# Now handle the "Ontario part", "Quebec part" entries
"province"] = housing["province"].str.split(" p").str[0]
housing[
housing.head()
REF_DATE GEO ... city province
0 1972-01-01 Census metropolitan areas ... Census metropolitan areas None
1 1972-01-01 Census metropolitan areas ... Census metropolitan areas None
2 1972-01-01 Census metropolitan areas ... Census metropolitan areas None
3 1972-01-01 Census metropolitan areas ... Census metropolitan areas None
4 1972-01-01 Census metropolitan areas ... Census metropolitan areas None
[5 rows x 18 columns]
Let’s use the separate function to split apart the province and city. Considering the GEO location “Ottawa-Gatineau, Ontario part, Ontario/Quebec” and splitting at the “,” we can end up with some innaccurate provinces “Ontario/Quebec” or destroy the useful location information by removing “Ontario part”.
Here are a few options using regular expressions. Note that most of these give warnings, generally complaining about the GEO location “Census metropolitan areas”. Strategies to split the GEO location at commas will result in NA’s since this location has no comma. Those warnings are left in the output below.
# set up city by replacing everything after the first "," with "":
|> mutate(city = str_replace_all(GEO, pattern = "(,\\s).*", replacement ="")) |>
housing pull(city) |> unique() |> head(16)
[1] "Census metropolitan areas" "Brantford"
[3] "Calgary" "Edmonton"
[5] "Greater Sudbury" "Guelph"
[7] "Halifax" "Hamilton"
[9] "Kingston" "Kitchener-Cambridge-Waterloo"
[11] "London" "Moncton"
[13] "Montréal excluding Saint-Jérôme" "Ottawa-Gatineau"
[15] "Peterborough" "Québec"
#Split everything at the last comma:
# Separate GEO at a ", ", but if more than one exists find the one which is NOT followed by any characters, then a comma:
#Notice how this handles the Ontario and Quebec parts of the Ottawa-Gatineau region
|> separate(GEO, into = c("city","province"), sep = ",\\s(?!(.*,))") |>
housing select(city, province) |>
unique()|> head(16)
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 9465 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 451, 452, 453, 454, 455, ...].
# A tibble: 16 × 2
city province
<chr> <chr>
1 Census metropolitan areas <NA>
2 Brantford Ontario
3 Calgary Alberta
4 Edmonton Alberta
5 Greater Sudbury Ontario
6 Guelph Ontario
7 Halifax Nova Scotia
8 Hamilton Ontario
9 Kingston Ontario
10 Kitchener-Cambridge-Waterloo Ontario
11 London Ontario
12 Moncton New Brunswick
13 Montréal excluding Saint-Jérôme Quebec
14 Ottawa-Gatineau Ontario/Quebec
15 Ottawa-Gatineau, Ontario part Ontario/Quebec
16 Ottawa-Gatineau, Quebec part Ontario/Quebec
#Build the province by splitting at the first comma, then adjusting the province by
# handing cases such as "Ontario part, Ontario/Quebec"
# notice that I'm actually splitting at all commas and allowing anything after the
# second comma (where it exists) to be discarded.
|> separate(GEO, into = c("city","province"), sep = ",\\s") |>
housing mutate(province = str_replace_all(province, pattern = " part", replacement = ""))|>
select(city, province)|> unique()|> head(16)
Warning: Expected 2 pieces. Additional pieces discarded in 18930 rows [211, 212, 213,
214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229,
230, ...].
Expected 2 pieces. Missing pieces filled with `NA` in 9465 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 451, 452, 453, 454, 455, ...].
# A tibble: 16 × 2
city province
<chr> <chr>
1 Census metropolitan areas <NA>
2 Brantford Ontario
3 Calgary Alberta
4 Edmonton Alberta
5 Greater Sudbury Ontario
6 Guelph Ontario
7 Halifax Nova Scotia
8 Hamilton Ontario
9 Kingston Ontario
10 Kitchener-Cambridge-Waterloo Ontario
11 London Ontario
12 Moncton New Brunswick
13 Montréal excluding Saint-Jérôme Quebec
14 Ottawa-Gatineau Ontario/Quebec
15 Ottawa-Gatineau Ontario
16 Ottawa-Gatineau Quebec
#note that this leaves in an artifact province :"Ontario/Quebec" from Ottawa-Gatineau
# but that is a combo of the Ontario part and the Quebec part
# that province combo could be replaced with NA and then eliminated later
# for example:
|> separate(GEO, into = c("city","province"), sep = ",\\s") |>
housing mutate(province = str_replace_all(province, pattern = " part", replacement = ""))|>
mutate(province = str_replace_all(province, pattern = "Ontario/Quebec", replacement = "NA"))|>
select(city, province)|> unique()|> head(16)
Warning: Expected 2 pieces. Additional pieces discarded in 18930 rows [211, 212, 213,
214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229,
230, ...].
Expected 2 pieces. Missing pieces filled with `NA` in 9465 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 451, 452, 453, 454, 455, ...].
# A tibble: 16 × 2
city province
<chr> <chr>
1 Census metropolitan areas <NA>
2 Brantford Ontario
3 Calgary Alberta
4 Edmonton Alberta
5 Greater Sudbury Ontario
6 Guelph Ontario
7 Halifax Nova Scotia
8 Hamilton Ontario
9 Kingston Ontario
10 Kitchener-Cambridge-Waterloo Ontario
11 London Ontario
12 Moncton New Brunswick
13 Montréal excluding Saint-Jérôme Quebec
14 Ottawa-Gatineau NA
15 Ottawa-Gatineau Ontario
16 Ottawa-Gatineau Quebec
Preparing for Merging Datasets
Here we rebuild the dataset and sum housing starts within provinces
# Drop all data except Housing Starts and Total Units
# Rename VALUE Housing start so we know what housing parameter we are looking at
= housing[(housing["Housing estimates"] == "Housing starts") &
housing_starts "Type of unit"] == "Total units")]
(housing[
= housing_starts.rename(columns = {"VALUE" : "Housing starts"})
housing_starts
"year"] = housing_starts["REF_DATE"].dt.year
housing_starts["month"] = housing_starts["REF_DATE"].dt.month
housing_starts[
= housing_starts.groupby(["province",
housing_starts "year",
"month",
"REF_DATE"],
= False,)["Housing starts"].sum()
as_index
housing_starts.head()
province year month REF_DATE Housing starts
0 Alberta 1972 1 1972-01-01 597.0
1 Alberta 1972 2 1972-02-01 1104.0
2 Alberta 1972 3 1972-03-01 1372.0
3 Alberta 1972 4 1972-04-01 1639.0
4 Alberta 1972 5 1972-05-01 2324.0
= provinces[["GEO", "REF_DATE", "VALUE", "month", "year", "decade"]]
unemployment
= unemployment.rename(columns = {"VALUE" : "unemployment rate",
unemployment "GEO" : "province"})
unemployment.head()
province REF_DATE ... year decade
1571 Newfoundland and Labrador 1976-01-01 ... 1976 1970
2366 Prince Edward Island 1976-01-01 ... 1976 1970
3161 Nova Scotia 1976-01-01 ... 1976 1970
3956 New Brunswick 1976-01-01 ... 1976 1970
4751 Quebec 1976-01-01 ... 1976 1970
[5 rows x 6 columns]
= pd.merge(unemployment, housing_starts, on = ['province', 'year', 'month', 'REF_DATE'])
merged
merged.head()
province REF_DATE ... decade Housing starts
0 Newfoundland and Labrador 1976-01-01 ... 1970 40.0
1 Nova Scotia 1976-01-01 ... 1970 273.0
2 New Brunswick 1976-01-01 ... 1970 99.0
3 Quebec 1976-01-01 ... 1970 3638.0
4 Ontario 1976-01-01 ... 1970 3264.0
[5 rows x 7 columns]
# We will keep the final strategy above, but also mutate in the
# Housing.estimates and Type.of.unit
# select fewer variables, to clean it up prior to merging datasets
= housing |> separate(GEO, into = c("city","province"), sep = ",\\s") |>
housing.starts mutate(province = str_replace_all(province, pattern = " part", replacement = ""))|>
mutate(province = str_replace_all(province, pattern = "Ontario/Quebec", replacement = "NA"))|>
rename(housing.starts = VALUE)|> # so I'm not confused later
filter(Housing.estimates == "Housing starts")|>
filter(Type.of.unit == "Total units") |>
filter(!is.na(province))|> # get rid of census metropolitan areas
mutate(date = parse_date_time(REF_DATE, orders = "Y-m"))|> # convert from character date to date format.
separate(REF_DATE, into = c("year","month"),sep = "-")|>#split year-month into year month columns
mutate(year = as.numeric(year))|>
mutate(month = as.numeric(month))|>
group_by(province, year, month, date) |> # look within provinces at a given time
summarize(total.housing = sum(housing.starts)) |>
ungroup()
Warning: Expected 2 pieces. Additional pieces discarded in 18930 rows [211, 212, 213,
214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229,
230, ...].
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 9465 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 451, 452, 453, 454, 455, ...].
`summarise()` has grouped output by 'province', 'year', 'month'. You can
override using the `.groups` argument.
|> glimpse() housing.starts
Rows: 6,310
Columns: 5
$ province <chr> "Alberta", "Alberta", "Alberta", "Alberta", "Alberta", "…
$ year <dbl> 1972, 1972, 1972, 1972, 1972, 1972, 1972, 1972, 1972, 19…
$ month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6,…
$ date <dttm> 1972-01-01, 1972-02-01, 1972-03-01, 1972-04-01, 1972-05…
$ total.housing <dbl> 597, 1104, 1372, 1639, 2324, 2414, 1335, 1481, 1260, 103…
By default inner_join will merge two datasets based on their common column names. Only rows which appear in both datasets will be kept.
= unemployment |>
both select(province, year, month, decade,pandemic, unemployment.rate) |>
inner_join(housing.starts)
Joining with `by = join_by(province, year, month)`
|> glimpse() both
Rows: 5,247
Columns: 8
$ province <chr> "Newfoundland and Labrador", "Nova Scotia", "New Bru…
$ year <dbl> 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976…
$ month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ decade <fct> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970…
$ pandemic <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ unemployment.rate <dbl> 11.4, 6.4, 10.6, 6.4, 5.9, 4.1, 3.2, 3.4, 8.0, 12.3,…
$ date <dttm> 1976-01-01, 1976-01-01, 1976-01-01, 1976-01-01, 197…
$ total.housing <dbl> 40, 273, 99, 3638, 3264, 333, 219, 1239, 1416, 51, 6…
Comparing Unemployment and Housing Starts
Note that there are Prince Edward Island and the territories not represented within the housing starts dataset and therefore are not in the following plots.
In general as unemployment increases, housing starts decrease.
= list(merged["province"].unique())
x = y = [1,2,3,4,5,6,7,8,9]
y
for province, i in zip(x, y):
3,3,i)
plt.subplot(
plt.tight_layout()
= merged[merged["province"] == province],
sb.regplot(data = "unemployment rate",
x = "Housing starts",
y ={'s':1})
scatter_kws
= 8)
plt.title(province, fontsize '', frameon = False)
plt.legend(
if province == "Quebec":
"Total housing")
plt.ylabel(else:
"")
plt.ylabel(
= 0.9)
plt.subplots_adjust(top 'Regression plots of Housing vs. Unemployment Rate', fontsize=13)
plt.suptitle(; plt.show()
|> ggplot(aes(x=unemployment.rate, y =total.housing))+
both geom_point( aes(colour = province, group = province),alpha = .1)+#alpha is the transparancy of the points
::stat_regline_equation()+
ggpubrgeom_smooth(method = "lm")+
facet_wrap(~province)
`geom_smooth()` using formula = 'y ~ x'
Going Further
- Make residual plots for the linear regessions
- Consider other variables/ relationship, such as CPI from table 18-10-0004-01 Consumer Price Index, monthly, not seasonally adjusted
- Consider looking for a time varying trend, subsetting the years, or even multiple regression with a year effect.