Libraries for Accessing Statistics Canada Data Tables

Census Data
Regression
Time Series
Software:Python
Software:R
Things that break
Author

Dean Jayatilleke, Junpu Xie, and Dave Campbell

Published

August 10, 2024

Goals

Find and extract socio-economic data from Statistics Canada. In this example we manipulate and plot unemployment rates and housing starts. Relationships between the two datasets are examined.

Data Provider

Statistics Canada hosts a large set of socio-economic datasets. When searching through these, take note of the table identifier. The stats-can and cansim packages provide Python and R APIs respectively for downloading these tables. See also the stats-can or cansim vignettes for examples and instructions to pull Statistics Canada datasets.

Installing stats-can and cansim

conda install -c conda-forge stats-can
# OR
pip install stats-can
# code not run for this document.
install.packages("cansim")
library(cansim)
vignette("cansim")

#code not run for this document.

Basic libraries

from stats_can import StatsCan # for pulling Stats Can socioeconomic data
import pandas as pd # for data wrangling
import seaborn as sb # for plotting
import matplotlib.pyplot as plt # for plot management
library(tidyverse) # for its nice flow.
library(cansim) #for pulling Stat Can socioeconomic data
library(lubridate) # for dealing with dates
library(GGally) # for pairs plot
library(ggpubr) # for putting a regression equation onto a ggplot figure

Pulling a table

Visit here for example to see labour force data in html. Notice that the StatCan webpage shows the table id, here it is table: 14-10-0287-01.

We’ll use that to grab the table. Be warned that this data table is a csv of around 1 GB in size.

sc = StatsCan() # We create a StatsCan Object that will be used to pull data
/Users/iamdavecampbell/anaconda3/lib/python3.11/site-packages/stats_can/api_class.py:24: FutureWarning: This class will be deprecated in upcoming v3 release. Please see the docs for details
  warn(
data = sc.table_to_df("14-10-0287-01") 
/Users/iamdavecampbell/anaconda3/lib/python3.11/site-packages/stats_can/sc.py:608: FutureWarning: This function will be deprecated in the v3 release. Please see the docs for details.
  warn(
/Users/iamdavecampbell/anaconda3/lib/python3.11/site-packages/stats_can/sc.py:326: FutureWarning: This function will be deprecated in the v3 release. Please see the docs for details.
  warn(
data.head()
    REF_DATE     GEO           DGUID  ... SYMBOL TERMINATED DECIMALS
0 1976-01-01  Canada  2016A000011124  ...    NaN        NaN        1
1 1976-01-01  Canada  2016A000011124  ...    NaN        NaN        1
2 1976-01-01  Canada  2016A000011124  ...    NaN        NaN        1
3 1976-01-01  Canada  2016A000011124  ...    NaN        NaN        1
4 1976-01-01  Canada  2016A000011124  ...    NaN        NaN        1

[5 rows x 19 columns]
data.tail()
          REF_DATE               GEO        DGUID  ... SYMBOL TERMINATED DECIMALS
5241748 2024-07-01  British Columbia  2016A000259  ...    NaN        NaN        1
5241749 2024-07-01  British Columbia  2016A000259  ...    NaN        NaN        1
5241750 2024-07-01  British Columbia  2016A000259  ...    NaN        NaN        1
5241751 2024-07-01  British Columbia  2016A000259  ...    NaN        NaN        1
5241752 2024-07-01  British Columbia  2016A000259  ...    NaN        NaN        1

[5 rows x 19 columns]
data = get_cansim("14-10-0287-01") |> rename_all(make.names) 

data |> glimpse()
Rows: 5,241,753
Columns: 33
$ REF_DATE                                             <chr> "1976-01", "1976-…
$ GEO                                                  <chr> "Canada", "Canada…
$ DGUID                                                <chr> "2016A000011124",…
$ UOM                                                  <chr> "Persons", "Perso…
$ UOM_ID                                               <chr> "249", "249", "24…
$ SCALAR_FACTOR                                        <chr> "thousands", "tho…
$ SCALAR_ID                                            <chr> "3", "3", "3", "3…
$ VECTOR                                               <chr> "v2062809", "v206…
$ COORDINATE                                           <chr> "1.1.1.1.1.1", "1…
$ VALUE                                                <dbl> 16852.4, 16852.4,…
$ STATUS                                               <chr> NA, NA, NA, NA, N…
$ SYMBOL                                               <chr> NA, NA, NA, NA, N…
$ TERMINATED                                           <chr> NA, NA, NA, NA, N…
$ DECIMALS                                             <chr> "1", "1", "1", "1…
$ GeoUID                                               <chr> "11124", "11124",…
$ Hierarchy.for.GEO                                    <chr> "1", "1", "1", "1…
$ Classification.Code.for.Labour.force.characteristics <chr> NA, NA, NA, NA, N…
$ Hierarchy.for.Labour.force.characteristics           <chr> "1", "1", "1", "1…
$ Classification.Code.for.Sex                          <chr> NA, NA, NA, NA, N…
$ Hierarchy.for.Sex                                    <chr> "1", "1", "1", "1…
$ Classification.Code.for.Age.group                    <chr> NA, NA, NA, NA, N…
$ Hierarchy.for.Age.group                              <chr> "1", "1", "8", "8…
$ Classification.Code.for.Statistics                   <chr> NA, NA, NA, NA, N…
$ Hierarchy.for.Statistics                             <chr> "1", "1", "1", "1…
$ Classification.Code.for.Data.type                    <chr> NA, NA, NA, NA, N…
$ Hierarchy.for.Data.type                              <chr> "1", "2", "1", "2…
$ val_norm                                             <dbl> 16852400, 1685240…
$ Date                                                 <date> 1976-01-01, 1976…
$ Labour.force.characteristics                         <fct> Population, Popul…
$ Sex                                                  <fct> Both sexes, Both …
$ Age.group                                            <fct> 15 years and over…
$ Statistics                                           <fct> Estimate, Estimat…
$ Data.type                                            <fct> Seasonally adjust…

There are nearly 5 million rows of data!

Make a few plots of Employment statistics

Stat Can tables are structured so that each row has a single value and columns define attributes about that value Each row of data has a specific attribute sex, age (Age.group), and Location (GEO) attributes, for units of the value (SCALAR_FACTOR), the specific type of labour force characteristic measured (Labour.force.characteristics), and the different data types such as point estimates and (bootstrap) standard errors (Statistics), and raw or seasonally adjusted (Data.type). In general this means that we need to filter and reshape the data.

Let’s inventory a few of these columns so that the exact spelling of attributes can be used below in the plots.

data["Statistics"].unique()
['Estimate', 'Standard error of estimate', 'Standard error of month-to-month change', 'Standard error of year-over-year change']
Categories (4, object): ['Estimate', 'Standard error of estimate',
                         'Standard error of month-to-month change', 'Standard error of year-over-year change']
data["Age group"].unique()
['15 years and over', '15 to 64 years', '15 to 24 years', '15 to 19 years', '20 to 24 years', '25 years and over', '25 to 54 years', '55 years and over', '55 to 64 years']
Categories (9, object): ['15 to 19 years', '15 to 24 years', '15 to 64 years', '15 years and over', ...,
                         '25 to 54 years', '25 years and over', '55 to 64 years',
                         '55 years and over']
data["GEO"].unique()
['Canada', 'Newfoundland and Labrador', 'Prince Edward Island', 'Nova Scotia', 'New Brunswick', ..., 'Ontario', 'Manitoba', 'Saskatchewan', 'Alberta', 'British Columbia']
Length: 11
Categories (11, object): ['Alberta', 'British Columbia', 'Canada', 'Manitoba', ..., 'Ontario',
                          'Prince Edward Island', 'Quebec', 'Saskatchewan']
data["Data type"].unique()
['Seasonally adjusted', 'Unadjusted', 'Trend-cycle']
Categories (3, object): ['Seasonally adjusted', 'Trend-cycle', 'Unadjusted']
data$Labour.force.characteristics  |> unique()
[1] Population           Labour force         Employment          
[4] Full-time employment Part-time employment Unemployment        
[7] Unemployment rate    Participation rate   Employment rate     
9 Levels: Population Labour force Employment ... Employment rate
data$Statistics  |> unique()
[1] Estimate                               
[2] Standard error of estimate             
[3] Standard error of month-to-month change
[4] Standard error of year-over-year change
4 Levels: Estimate ... Standard error of year-over-year change
data$Age.group  |> unique()
[1] 15 years and over 15 to 64 years    15 to 24 years    15 to 19 years   
[5] 20 to 24 years    25 years and over 25 to 54 years    55 years and over
[9] 55 to 64 years   
9 Levels: 15 years and over 15 to 24 years 15 to 19 years ... 55 to 64 years
data$GEO  |> unique()
 [1] "Canada"                    "Newfoundland and Labrador"
 [3] "Prince Edward Island"      "Nova Scotia"              
 [5] "New Brunswick"             "Quebec"                   
 [7] "Ontario"                   "Manitoba"                 
 [9] "Saskatchewan"              "Alberta"                  
[11] "British Columbia"         

Plotting provinces against each other

Let’s consider provincial differences in employment rate for 25-55 year olds. Using the spelling from the above output we can set up the data filters.

Important consideration when pivoting

Pivoting a data table will spread out a variable into new columns. Sometimes it’s more convenient to have each province as it’s own column. Here we do this to plot provinces against each other and see how they relate. In this case, we want each province’s unemployment rate from the “Labour force characteristics” column. However, notice that there is a “Data type” column that shows two possible entries for the unemployment rate attribute in the “Labour force characteristics”. If we filtered the data without selected one of these Data types, and then pivoted, python would take the mean of these two values!

provinces = data[["GEO",
                  "Sex",
                  "Age group",
                  "Labour force characteristics",
                  "REF_DATE",
                  "VALUE",
                  "Data type"]]

provinces = provinces[(provinces["Sex"] == "Both sexes") &
                      (provinces["GEO"] != "Canada") &
                      (provinces["Labour force characteristics"] == "Unemployment rate") &
                      (provinces["Age group"] == "25 to 54 years") &
                      (provinces["Data type"] == "Unadjusted")]


pivoted = pd.pivot_table(provinces,
                         index = ["Sex",
                                  "Age group",
                                  "Labour force characteristics",
                                  "REF_DATE"],
                         values = "VALUE",
                         columns = "GEO",
                         observed = False).reset_index()

pivoted.head()
GEO         Sex       Age group  ... Quebec Saskatchewan
0    Both sexes  25 to 54 years  ...    6.4          3.2
1    Both sexes  25 to 54 years  ...    7.0          3.2
2    Both sexes  25 to 54 years  ...    6.6          3.0
3    Both sexes  25 to 54 years  ...    6.8          3.2
4    Both sexes  25 to 54 years  ...    5.6          2.4

[5 rows x 14 columns]
When pivot_wider Works, But Not As Intended; Common Errors and How to Fix Them

Pivoting a data table will spread out a variable into new columns. Sometimes it’s more convenient to have each province as it’s own column. Here we do this to plot provinces against each other and see how they relate. It can be helpful to see how to accidentally break things, so let’s start there.
I will need to filter the data so let’s start there. Here we go right from selecting columns, filtering and then pivoting wider, which will help when plotting. Take a look at the output.

data |> select(GEO,Sex,Age.group,Labour.force.characteristics,REF_DATE,VALUE)|>
  filter(Sex == "Both sexes") |> 
  filter(GEO != "Canada")|> # since the Canada value will be a type of weighted average
  filter( Labour.force.characteristics == "Unemployment rate")|>
  filter(Age.group == "25 to 54 years")|>
  pivot_wider(names_from= GEO, values_from=VALUE) |>  # puts each province into its own column
  rename_all(make.names) # rebuilt the province names
Warning: Values from `VALUE` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} |>
  dplyr::summarise(n = dplyr::n(), .by = c(Sex, Age.group,
  Labour.force.characteristics, REF_DATE, GEO)) |>
  dplyr::filter(n > 1L)
# A tibble: 583 × 14
   Sex        Age.group   Labour.force.charact…¹ REF_DATE Newfoundland.and.Lab…²
   <fct>      <fct>       <fct>                  <chr>    <list>                
 1 Both sexes 25 to 54 y… Unemployment rate      1976-01  <dbl [5]>             
 2 Both sexes 25 to 54 y… Unemployment rate      1976-02  <dbl [5]>             
 3 Both sexes 25 to 54 y… Unemployment rate      1976-03  <dbl [5]>             
 4 Both sexes 25 to 54 y… Unemployment rate      1976-04  <dbl [5]>             
 5 Both sexes 25 to 54 y… Unemployment rate      1976-05  <dbl [5]>             
 6 Both sexes 25 to 54 y… Unemployment rate      1976-06  <dbl [5]>             
 7 Both sexes 25 to 54 y… Unemployment rate      1976-07  <dbl [5]>             
 8 Both sexes 25 to 54 y… Unemployment rate      1976-08  <dbl [5]>             
 9 Both sexes 25 to 54 y… Unemployment rate      1976-09  <dbl [5]>             
10 Both sexes 25 to 54 y… Unemployment rate      1976-10  <dbl [5]>             
# ℹ 573 more rows
# ℹ abbreviated names: ¹​Labour.force.characteristics,
#   ²​Newfoundland.and.Labrador
# ℹ 9 more variables: Prince.Edward.Island <list>, Nova.Scotia <list>,
#   New.Brunswick <list>, Quebec <list>, Ontario <list>, Manitoba <list>,
#   Saskatchewan <list>, Alberta <list>, British.Columbia <list>

The provinces are now list variables, whereas they should be doubles (single numbers). It wasn’t obvious to me what went wrong the first time I broke a pivot table like this. Let’s re-run those steps without the pivot_wider and see if it helps to diagnose the root of the problem. This is one of the great parts of tidyverse, we just rerun a subset of the above code, or explicitely extract one of those list elements from the original call:

# run fewer lines of the previous code:
data |> select(GEO,Sex,Age.group,Labour.force.characteristics,REF_DATE,VALUE)|>
  filter(Sex == "Both sexes") |> 
  filter(GEO != "Canada")|> # since the Canada value will be a type of weighted average
  filter( Labour.force.characteristics == "Unemployment rate")|>
  filter(Age.group == "25 to 54 years")
# A tibble: 29,150 × 6
   GEO                     Sex   Age.group Labour.force.charact…¹ REF_DATE VALUE
   <chr>                   <fct> <fct>     <fct>                  <chr>    <dbl>
 1 Newfoundland and Labra… Both… 25 to 54… Unemployment rate      1976-01   10.1
 2 Newfoundland and Labra… Both… 25 to 54… Unemployment rate      1976-01   11.4
 3 Newfoundland and Labra… Both… 25 to 54… Unemployment rate      1976-01   NA  
 4 Newfoundland and Labra… Both… 25 to 54… Unemployment rate      1976-01   NA  
 5 Newfoundland and Labra… Both… 25 to 54… Unemployment rate      1976-01   NA  
 6 Prince Edward Island    Both… 25 to 54… Unemployment rate      1976-01    5.7
 7 Prince Edward Island    Both… 25 to 54… Unemployment rate      1976-01    7.3
 8 Prince Edward Island    Both… 25 to 54… Unemployment rate      1976-01   NA  
 9 Prince Edward Island    Both… 25 to 54… Unemployment rate      1976-01   NA  
10 Prince Edward Island    Both… 25 to 54… Unemployment rate      1976-01   NA  
# ℹ 29,140 more rows
# ℹ abbreviated name: ¹​Labour.force.characteristics
## or view the problem this way, select Ontario and check out the first element:
# data |> select(GEO,Sex,Age.group,Labour.force.characteristics,REF_DATE,VALUE)|>
#   filter(Sex == "Both sexes") |> 
#   filter(GEO != "Canada")|> # since the Canada value will be a type of weighted average
#   filter( Labour.force.characteristics == "Unemployment rate")|>
#   filter(Age.group == "25 to 54 years")|>
#   pivot_wider(names_from= GEO, values_from=VALUE) |>  # puts each province into its own column
#   rename_all(make.names) |> 
#   select(Ontario)|> head|> .[[1]]

If you look closely at the output in the first strategy you’ll notice that there are multiple rows which only differ in their VALUE. This means that pivot_wider is mapping multiple rows of VALUE into a single element. R handles this by smashing all those values together into list elements.

In order to have a one-to-one mapping add another variable to create a unique mapping. Taking another close look at the glimpse output above, we can see that there is more than one Data.type. Checking out its values we see the following:

data$Data.type  |> unique()
[1] Seasonally adjusted Unadjusted          Trend-cycle        
Levels: Seasonally adjusted Unadjusted Trend-cycle

The multiple VALUEs are made unique when considering the Data.type variable. To simplify things, let’s select the Unadjusted unemployment rate.

When pivot_wider Works As Intended
data |> select(GEO,Sex,Age.group,Labour.force.characteristics,REF_DATE,VALUE,Data.type)|>
  filter(Sex == "Both sexes") |> 
  filter(GEO != "Canada")|> # since the Canada value will be a type of weighted average
  filter( Labour.force.characteristics == "Unemployment rate")|>
  filter(Age.group == "25 to 54 years")|>
  filter(Data.type == "Unadjusted")|>
  pivot_wider(names_from= GEO, values_from=VALUE) |>  # puts each province into its own column
  rename_all(make.names) 
# A tibble: 583 × 15
   Sex        Age.group      Labour.force.characteristics REF_DATE Data.type 
   <fct>      <fct>          <fct>                        <chr>    <fct>     
 1 Both sexes 25 to 54 years Unemployment rate            1976-01  Unadjusted
 2 Both sexes 25 to 54 years Unemployment rate            1976-02  Unadjusted
 3 Both sexes 25 to 54 years Unemployment rate            1976-03  Unadjusted
 4 Both sexes 25 to 54 years Unemployment rate            1976-04  Unadjusted
 5 Both sexes 25 to 54 years Unemployment rate            1976-05  Unadjusted
 6 Both sexes 25 to 54 years Unemployment rate            1976-06  Unadjusted
 7 Both sexes 25 to 54 years Unemployment rate            1976-07  Unadjusted
 8 Both sexes 25 to 54 years Unemployment rate            1976-08  Unadjusted
 9 Both sexes 25 to 54 years Unemployment rate            1976-09  Unadjusted
10 Both sexes 25 to 54 years Unemployment rate            1976-10  Unadjusted
# ℹ 573 more rows
# ℹ 10 more variables: Newfoundland.and.Labrador <dbl>,
#   Prince.Edward.Island <dbl>, Nova.Scotia <dbl>, New.Brunswick <dbl>,
#   Quebec <dbl>, Ontario <dbl>, Manitoba <dbl>, Saskatchewan <dbl>,
#   Alberta <dbl>, British.Columbia <dbl>

Plotting Variables Against Each Other.

sb.pairplot(pivoted, 
            vars = pivoted.columns[4:14]).fig.suptitle("Monthly Unemployment Rate: Not Seasonally Adjusted",
                                                        y = 1.01,
                                                        fontsize = 14)
plt.show();

data |> select(GEO,Sex,Age.group,Labour.force.characteristics,REF_DATE,VALUE,Data.type)|>
  filter(Sex == "Both sexes") |> 
  filter(GEO != "Canada")|> # since the Canada value will be a type of weighted average
  filter( Labour.force.characteristics == "Unemployment rate")|>
  filter(Age.group == "25 to 54 years")|>
  filter(Data.type == "Unadjusted")|>
  pivot_wider(names_from= GEO, values_from=VALUE) |>  # puts each province into its own column
  rename_all(make.names) |> 
ggpairs( columns = 6:15, title = "Monthly Unemployment Rate: Not Seasonaly Adjusted",  
        axisLabels = "show")

Plotting the Provinces as Time Series

Notice how we produce year and month columns by accessing properties of pandas datetime objects.

provinces["month"] = provinces["REF_DATE"].dt.month
provinces["year"] = provinces["REF_DATE"].dt.year

x = list(provinces["GEO"].unique())
y = [1,2,3,4,5,6,7,8,9,10]

sb.set_style("darkgrid")

for province, i in zip(x, y):
    plt.subplot(3,4,i)
    plt.tight_layout()

    sb.lineplot(data = provinces[provinces["GEO"] == province],
                x = "year",
                y = "VALUE")
     
    
    plt.legend('', frameon = False)
    plt.title(province, fontsize = 8)
    if province == "Quebec":
        plt.ylabel("Unemployment Rate")
    else:
        plt.ylabel("") 

plt.subplots_adjust(top = 0.9)
plt.suptitle('Unemployment Rates by Province', fontsize=14)
plt.show();

We won’t bother pivoting the data table anymore, since it’s convenient for ggplot to have a column as an identifying feature.

Notice that the REF_DATE column is a character but it should be a date. We’ll use the lubridate package for converting into dates. The helpfile for the parse_date_time function is really useful for specifying the input format, as it clarifies that Y is a year with century but y is the two digit year without century. Using the wrong one will break the mutation.

data |> select(GEO,Sex,Age.group,Labour.force.characteristics,REF_DATE,VALUE,Data.type)|>
  filter(Sex == "Both sexes") |> 
  filter(GEO != "Canada")|> # since the Canada value will be a type of weighted average
  filter( Labour.force.characteristics == "Unemployment rate")|>
  filter(Age.group == "25 to 54 years")|>
  filter(Data.type == "Unadjusted")|>
  mutate(date = parse_date_time(REF_DATE, orders = "Y-m"))|> # convert from character date to date format.
  rename(unemployment.rate = VALUE)|>
  ggplot(aes(x=date, y = unemployment.rate, colour = GEO))+
  geom_line()+
  ggtitle("Unemployment Rates by Province")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+# rotate axis labels so they don't overlap
  facet_wrap(~GEO)

Housing starts

The housing starts are more complicated beacuse they are based on cities but the unemployment is based on provices

housing = sc.table_to_df("34-10-0154-01") 
/Users/iamdavecampbell/anaconda3/lib/python3.11/site-packages/stats_can/sc.py:608: FutureWarning: This function will be deprecated in the v3 release. Please see the docs for details.
  warn(
/Users/iamdavecampbell/anaconda3/lib/python3.11/site-packages/stats_can/sc.py:326: FutureWarning: This function will be deprecated in the v3 release. Please see the docs for details.
  warn(
housing.head()
    REF_DATE                        GEO DGUID  ... SYMBOL TERMINATED DECIMALS
0 1972-01-01  Census metropolitan areas   NaN  ...    NaN        NaN        0
1 1972-01-01  Census metropolitan areas   NaN  ...    NaN        NaN        0
2 1972-01-01  Census metropolitan areas   NaN  ...    NaN        NaN        0
3 1972-01-01  Census metropolitan areas   NaN  ...    NaN        NaN        0
4 1972-01-01  Census metropolitan areas   NaN  ...    NaN        NaN        0

[5 rows x 16 columns]
# the unique cities
housing["GEO"].unique()
['Census metropolitan areas', 'Brantford, Ontario', 'Calgary, Alberta', 'Edmonton, Alberta', 'Greater Sudbury, Ontario', ..., 'Barrie, Ontario', 'Trois-Rivières, Quebec', 'Abbotsford-Mission, British Columbia', 'Sherbrooke, Quebec', 'Montréal, Quebec']
Length: 37
Categories (37, object): ['Abbotsford-Mission, British Columbia', 'Barrie, Ontario', 'Brantford, Ontario',
                          'Calgary, Alberta', ..., 'Vancouver, British Columbia',
                          'Victoria, British Columbia', 'Windsor, Ontario', 'Winnipeg, Manitoba']
housing = get_cansim("34-10-0154-01") |> rename_all(make.names) 
Accessing CANSIM NDM product 34-10-0154 from Statistics Canada
Parsing data
housing |> glimpse()
Rows: 329,940
Columns: 24
$ REF_DATE                                  <chr> "1972-01", "1972-01", "1972-…
$ GEO                                       <chr> "Census metropolitan areas",…
$ DGUID                                     <chr> NA, NA, NA, NA, NA, NA, NA, …
$ UOM                                       <chr> "Units", "Units", "Units", "…
$ UOM_ID                                    <chr> "300", "300", "300", "300", …
$ SCALAR_FACTOR                             <chr> "units", "units", "units", "…
$ SCALAR_ID                                 <chr> "0", "0", "0", "0", "0", "0"…
$ VECTOR                                    <chr> "v42127250", "v42127251", "v…
$ COORDINATE                                <chr> "1.1.1", "1.1.2", "1.1.3", "…
$ VALUE                                     <dbl> 7969, 3287, 615, 734, 3333, …
$ STATUS                                    <chr> NA, NA, NA, NA, NA, NA, NA, …
$ SYMBOL                                    <chr> NA, NA, NA, NA, NA, NA, NA, …
$ TERMINATED                                <chr> NA, NA, NA, NA, NA, NA, NA, …
$ DECIMALS                                  <chr> "0", "0", "0", "0", "0", "0"…
$ GeoUID                                    <chr> NA, NA, NA, NA, NA, NA, NA, …
$ Hierarchy.for.GEO                         <chr> "1", "1", "1", "1", "1", "1"…
$ Classification.Code.for.Housing.estimates <chr> NA, NA, NA, NA, NA, NA, NA, …
$ Hierarchy.for.Housing.estimates           <chr> "1", "1", "1", "1", "1", "2"…
$ Classification.Code.for.Type.of.unit      <chr> NA, NA, NA, NA, NA, NA, NA, …
$ Hierarchy.for.Type.of.unit                <chr> "1", "1.2", "1.3", "1.4", "1…
$ val_norm                                  <dbl> 7969, 3287, 615, 734, 3333, …
$ Date                                      <date> 1972-01-01, 1972-01-01, 197…
$ Housing.estimates                         <fct> Housing starts, Housing star…
$ Type.of.unit                              <fct> Total units, Single-detached…
# the unique cities
housing |> pull(GEO) |> unique()
 [1] "Census metropolitan areas"                    
 [2] "Brantford, Ontario"                           
 [3] "Calgary, Alberta"                             
 [4] "Edmonton, Alberta"                            
 [5] "Greater Sudbury, Ontario"                     
 [6] "Guelph, Ontario"                              
 [7] "Halifax, Nova Scotia"                         
 [8] "Hamilton, Ontario"                            
 [9] "Kingston, Ontario"                            
[10] "Kitchener-Cambridge-Waterloo, Ontario"        
[11] "London, Ontario"                              
[12] "Moncton, New Brunswick"                       
[13] "Montréal excluding Saint-Jérôme, Quebec"      
[14] "Ottawa-Gatineau, Ontario/Quebec"              
[15] "Ottawa-Gatineau, Ontario part, Ontario/Quebec"
[16] "Ottawa-Gatineau, Quebec part, Ontario/Quebec" 
[17] "Peterborough, Ontario"                        
[18] "Québec, Quebec"                               
[19] "Regina, Saskatchewan"                         
[20] "St. Catharines-Niagara, Ontario"              
[21] "St. John's, Newfoundland and Labrador"        
[22] "Saint John, New Brunswick"                    
[23] "Saguenay, Quebec"                             
[24] "Saskatoon, Saskatchewan"                      
[25] "Thunder Bay, Ontario"                         
[26] "Toronto, Ontario"                             
[27] "Vancouver, British Columbia"                  
[28] "Victoria, British Columbia"                   
[29] "Windsor, Ontario"                             
[30] "Winnipeg, Manitoba"                           
[31] "Kelowna, British Columbia"                    
[32] "Oshawa, Ontario"                              
[33] "Barrie, Ontario"                              
[34] "Trois-Rivières, Quebec"                       
[35] "Abbotsford-Mission, British Columbia"         
[36] "Sherbrooke, Quebec"                           
[37] "Montréal, Quebec"                             

To merge the housing starts with the unemployment, we need to set both to the same measurement unit, provinces per {month, year}. To construct provincial housing starts, sum over the municipailites within a province. Many of these StatCan tables have multiple data types, so we will only consider Housing.estimates = “Housing starts” and Type.of.unit = “Total units”.

Notice that Ottawa-Gatineau is split into ON and QC parts.

Let’s use the string split function to split apart the province and city. Considering the GEO location “Ottawa-Gatineau, Ontario part, Ontario/Quebec” and splitting at the “,”, we will have to specify that we want to split at the first comma. Also note that location “Census metropolitan area” does not contain province information, so ‘None’ will appear in that column.

# Build the city and province columns by splitting at the first comma - notice the space after the comma - this serves to discard that space
housing[['city', 'province']] = housing['GEO'].str.split(', ', n=1, expand=True)


# Now handle the "Ontario part", "Quebec part" entries
housing["province"] = housing["province"].str.split(" p").str[0]

housing.head()
    REF_DATE                        GEO  ...                       city province
0 1972-01-01  Census metropolitan areas  ...  Census metropolitan areas     None
1 1972-01-01  Census metropolitan areas  ...  Census metropolitan areas     None
2 1972-01-01  Census metropolitan areas  ...  Census metropolitan areas     None
3 1972-01-01  Census metropolitan areas  ...  Census metropolitan areas     None
4 1972-01-01  Census metropolitan areas  ...  Census metropolitan areas     None

[5 rows x 18 columns]

Let’s use the separate function to split apart the province and city. Considering the GEO location “Ottawa-Gatineau, Ontario part, Ontario/Quebec” and splitting at the “,” we can end up with some innaccurate provinces “Ontario/Quebec” or destroy the useful location information by removing “Ontario part”.

Here are a few options using regular expressions. Note that most of these give warnings, generally complaining about the GEO location “Census metropolitan areas”. Strategies to split the GEO location at commas will result in NA’s since this location has no comma. Those warnings are left in the output below.

# set up city by replacing everything after the first "," with "":
housing |> mutate(city = str_replace_all(GEO, pattern = "(,\\s).*", replacement ="")) |> 
  pull(city) |> unique() |> head(16)
 [1] "Census metropolitan areas"       "Brantford"                      
 [3] "Calgary"                         "Edmonton"                       
 [5] "Greater Sudbury"                 "Guelph"                         
 [7] "Halifax"                         "Hamilton"                       
 [9] "Kingston"                        "Kitchener-Cambridge-Waterloo"   
[11] "London"                          "Moncton"                        
[13] "Montréal excluding Saint-Jérôme" "Ottawa-Gatineau"                
[15] "Peterborough"                    "Québec"                         
#Split everything at the last comma:
# Separate GEO at a ", ", but if more than one exists find the one which is NOT followed by any characters, then a comma:
#Notice how this handles the Ontario and Quebec parts of the Ottawa-Gatineau region
housing |> separate(GEO, into  = c("city","province"), sep = ",\\s(?!(.*,))") |> 
  select(city,  province) |>
  unique()|> head(16)
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 9465 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 451, 452, 453, 454, 455, ...].
# A tibble: 16 × 2
   city                            province      
   <chr>                           <chr>         
 1 Census metropolitan areas       <NA>          
 2 Brantford                       Ontario       
 3 Calgary                         Alberta       
 4 Edmonton                        Alberta       
 5 Greater Sudbury                 Ontario       
 6 Guelph                          Ontario       
 7 Halifax                         Nova Scotia   
 8 Hamilton                        Ontario       
 9 Kingston                        Ontario       
10 Kitchener-Cambridge-Waterloo    Ontario       
11 London                          Ontario       
12 Moncton                         New Brunswick 
13 Montréal excluding Saint-Jérôme Quebec        
14 Ottawa-Gatineau                 Ontario/Quebec
15 Ottawa-Gatineau, Ontario part   Ontario/Quebec
16 Ottawa-Gatineau, Quebec part    Ontario/Quebec
#Build the province by splitting at the first comma, then adjusting the province by 
# handing cases such as "Ontario part, Ontario/Quebec"
# notice that I'm actually splitting at all commas and allowing anything after the 
# second comma (where it exists) to be discarded.
housing |> separate(GEO, into  = c("city","province"), sep = ",\\s") |> 
  mutate(province = str_replace_all(province, pattern = " part", replacement = ""))|>
  select(city,  province)|> unique()|> head(16) 
Warning: Expected 2 pieces. Additional pieces discarded in 18930 rows [211, 212, 213,
214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229,
230, ...].
Expected 2 pieces. Missing pieces filled with `NA` in 9465 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 451, 452, 453, 454, 455, ...].
# A tibble: 16 × 2
   city                            province      
   <chr>                           <chr>         
 1 Census metropolitan areas       <NA>          
 2 Brantford                       Ontario       
 3 Calgary                         Alberta       
 4 Edmonton                        Alberta       
 5 Greater Sudbury                 Ontario       
 6 Guelph                          Ontario       
 7 Halifax                         Nova Scotia   
 8 Hamilton                        Ontario       
 9 Kingston                        Ontario       
10 Kitchener-Cambridge-Waterloo    Ontario       
11 London                          Ontario       
12 Moncton                         New Brunswick 
13 Montréal excluding Saint-Jérôme Quebec        
14 Ottawa-Gatineau                 Ontario/Quebec
15 Ottawa-Gatineau                 Ontario       
16 Ottawa-Gatineau                 Quebec        
#note that this leaves in an artifact province :"Ontario/Quebec" from Ottawa-Gatineau 
# but that is a combo of the Ontario part and the Quebec part
# that province combo could be replaced with NA and then eliminated later
# for example:

housing |> separate(GEO, into  = c("city","province"), sep = ",\\s") |> 
  mutate(province = str_replace_all(province, pattern = " part", replacement = ""))|>
  mutate(province = str_replace_all(province, pattern = "Ontario/Quebec", replacement = "NA"))|>
  select(city,  province)|> unique()|> head(16) 
Warning: Expected 2 pieces. Additional pieces discarded in 18930 rows [211, 212, 213,
214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229,
230, ...].
Expected 2 pieces. Missing pieces filled with `NA` in 9465 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 451, 452, 453, 454, 455, ...].
# A tibble: 16 × 2
   city                            province     
   <chr>                           <chr>        
 1 Census metropolitan areas       <NA>         
 2 Brantford                       Ontario      
 3 Calgary                         Alberta      
 4 Edmonton                        Alberta      
 5 Greater Sudbury                 Ontario      
 6 Guelph                          Ontario      
 7 Halifax                         Nova Scotia  
 8 Hamilton                        Ontario      
 9 Kingston                        Ontario      
10 Kitchener-Cambridge-Waterloo    Ontario      
11 London                          Ontario      
12 Moncton                         New Brunswick
13 Montréal excluding Saint-Jérôme Quebec       
14 Ottawa-Gatineau                 NA           
15 Ottawa-Gatineau                 Ontario      
16 Ottawa-Gatineau                 Quebec       

Preparing for Merging Datasets

Here we rebuild the dataset and sum housing starts within provinces

# Drop all data except Housing Starts and Total Units
# Rename VALUE Housing start so we know what housing parameter we are looking at

housing_starts = housing[(housing["Housing estimates"] == "Housing starts") &
                  (housing["Type of unit"] == "Total units")]

housing_starts = housing_starts.rename(columns = {"VALUE" : "Housing starts"})

housing_starts["year"] = housing_starts["REF_DATE"].dt.year
housing_starts["month"] = housing_starts["REF_DATE"].dt.month

housing_starts = housing_starts.groupby(["province",
                               "year",
                               "month",
                               "REF_DATE"],
                               as_index = False,)["Housing starts"].sum()

housing_starts.head()
  province  year  month   REF_DATE  Housing starts
0  Alberta  1972      1 1972-01-01           597.0
1  Alberta  1972      2 1972-02-01          1104.0
2  Alberta  1972      3 1972-03-01          1372.0
3  Alberta  1972      4 1972-04-01          1639.0
4  Alberta  1972      5 1972-05-01          2324.0
unemployment = provinces[["GEO", "REF_DATE", "VALUE", "month", "year", "decade"]]


unemployment = unemployment.rename(columns = {"VALUE" : "unemployment rate", 
                                              "GEO" : "province"})

unemployment.head()
                       province   REF_DATE  ...  year  decade
1571  Newfoundland and Labrador 1976-01-01  ...  1976    1970
2366       Prince Edward Island 1976-01-01  ...  1976    1970
3161                Nova Scotia 1976-01-01  ...  1976    1970
3956              New Brunswick 1976-01-01  ...  1976    1970
4751                     Quebec 1976-01-01  ...  1976    1970

[5 rows x 6 columns]
merged = pd.merge(unemployment, housing_starts, on = ['province', 'year', 'month', 'REF_DATE'])

merged.head()
                    province   REF_DATE  ...  decade  Housing starts
0  Newfoundland and Labrador 1976-01-01  ...    1970            40.0
1                Nova Scotia 1976-01-01  ...    1970           273.0
2              New Brunswick 1976-01-01  ...    1970            99.0
3                     Quebec 1976-01-01  ...    1970          3638.0
4                    Ontario 1976-01-01  ...    1970          3264.0

[5 rows x 7 columns]
# We will keep the final strategy above, but also mutate in the
# Housing.estimates  and Type.of.unit
# select fewer variables, to clean it up prior to merging datasets
housing.starts = housing |> separate(GEO, into  = c("city","province"), sep = ",\\s") |> 
  mutate(province = str_replace_all(province, pattern = " part", replacement = ""))|>
  mutate(province = str_replace_all(province, pattern = "Ontario/Quebec", replacement = "NA"))|>
  rename(housing.starts = VALUE)|> # so I'm not confused later
  filter(Housing.estimates == "Housing starts")|>
  filter(Type.of.unit == "Total units") |>
  filter(!is.na(province))|> # get rid of census metropolitan areas
  mutate(date = parse_date_time(REF_DATE, orders = "Y-m"))|> # convert from character date to date format.
  separate(REF_DATE, into = c("year","month"),sep = "-")|>#split year-month into year  month columns
  mutate(year = as.numeric(year))|>
  mutate(month = as.numeric(month))|>
  group_by(province, year, month, date) |> # look within provinces at a given time
  summarize(total.housing = sum(housing.starts)) |>
  ungroup() 
Warning: Expected 2 pieces. Additional pieces discarded in 18930 rows [211, 212, 213,
214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229,
230, ...].
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 9465 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 451, 452, 453, 454, 455, ...].
`summarise()` has grouped output by 'province', 'year', 'month'. You can
override using the `.groups` argument.
housing.starts |> glimpse()
Rows: 6,310
Columns: 5
$ province      <chr> "Alberta", "Alberta", "Alberta", "Alberta", "Alberta", "…
$ year          <dbl> 1972, 1972, 1972, 1972, 1972, 1972, 1972, 1972, 1972, 19…
$ month         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6,…
$ date          <dttm> 1972-01-01, 1972-02-01, 1972-03-01, 1972-04-01, 1972-05…
$ total.housing <dbl> 597, 1104, 1372, 1639, 2324, 2414, 1335, 1481, 1260, 103…

By default inner_join will merge two datasets based on their common column names. Only rows which appear in both datasets will be kept.

both =  unemployment |>
  select(province, year, month, decade,pandemic, unemployment.rate) |> 
  inner_join(housing.starts)
Joining with `by = join_by(province, year, month)`
both |> glimpse()
Rows: 5,247
Columns: 8
$ province          <chr> "Newfoundland and Labrador", "Nova Scotia", "New Bru…
$ year              <dbl> 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976…
$ month             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ decade            <fct> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970…
$ pandemic          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ unemployment.rate <dbl> 11.4, 6.4, 10.6, 6.4, 5.9, 4.1, 3.2, 3.4, 8.0, 12.3,…
$ date              <dttm> 1976-01-01, 1976-01-01, 1976-01-01, 1976-01-01, 197…
$ total.housing     <dbl> 40, 273, 99, 3638, 3264, 333, 219, 1239, 1416, 51, 6…

Comparing Unemployment and Housing Starts

Note that there are Prince Edward Island and the territories not represented within the housing starts dataset and therefore are not in the following plots.

In general as unemployment increases, housing starts decrease.

x = list(merged["province"].unique())
y = y = [1,2,3,4,5,6,7,8,9]

for province, i in zip(x, y):
    plt.subplot(3,3,i)
    plt.tight_layout()

    sb.regplot(data = merged[merged["province"] == province],
                x = "unemployment rate",
                y = "Housing starts",
                scatter_kws={'s':1})
     
    plt.title(province, fontsize = 8)
    plt.legend('', frameon = False)

    if province == "Quebec":
        plt.ylabel("Total housing")
    else:
        plt.ylabel("") 

plt.subplots_adjust(top = 0.9)
plt.suptitle('Regression plots of Housing vs. Unemployment Rate', fontsize=13)
plt.show();

both |> ggplot(aes(x=unemployment.rate, y =total.housing))+
  geom_point( aes(colour =  province, group = province),alpha = .1)+#alpha is the transparancy of the points
  ggpubr::stat_regline_equation()+
  geom_smooth(method = "lm")+
  facet_wrap(~province)
`geom_smooth()` using formula = 'y ~ x'

Going Further

  • Make residual plots for the linear regessions
  • Consider other variables/ relationship, such as CPI from table 18-10-0004-01 Consumer Price Index, monthly, not seasonally adjusted
  • Consider looking for a time varying trend, subsetting the years, or even multiple regression with a year effect.