Advanced R for Econometricians

Advanced R for EconometriciansDatabases using R1 / 13

Memory Limitations of RThe amount of data you can work with in R is mainly constraint by your computers memory. 
In times where data sets with several 100 GB or even TB become more and more common,
loading all data into memory is usually not feasable. 
Often only a subset or an aggregation of the data is required for the analysis which requires much less memory space.
How to subset or aggregate data? → Use a database!
2 / 13

(Relational) DatabasesA database is a place where data is stored and can be manipulated.
A relational database aims to reduce redundancies by spreading the data over multiple tables and
relate them to each other. 
Example of Redundent Data
 
    store_id 
    store_location 
    store_mngr 
    cust_id 
    cust_address 
    cust_name 
    date 
    total 
  


    1 
    Baker St. 
    John 
    34 
    Downing Rd. 
    Kelly 
    2019-02-12 
    14 
  

    1 
    Baker St. 
    John 
    34 
    Downing Rd. 
    Kelly 
    2019-02-06 
    12 
  

    1 
    Baker St. 
    John 
    108 
    Berwick St. 
    Aleta 
    2019-02-21 
    20 
  

    1 
    Baker St. 
    John 
    17 
    Carnaby  St. 
    Peter 
    2019-01-14 
    18 
  

    2 
    Abbey Road 
    George 
    108 
    Berwick St. 
    Aleta 
    2019-02-23 
    18 
  

    2 
    Abbey Road 
    George 
    17 
    Carnaby  St. 
    Peter 
    2019-01-25 
    18 
  

    2 
    Abbey Road 
    George 
    17 
    Carnaby  St. 
    Peter 
    2019-01-26 
    12 
  

    2 
    Abbey Road 
    George 
    17 
    Carnaby  St. 
    Peter 
    2019-01-27 
    17 
  


3 / 13

store_id	store_location	store_mngr	cust_id	cust_address	cust_name	date	total
1	Baker St.	John	34	Downing Rd.	Kelly	2019-02-12	14
1	Baker St.	John	34	Downing Rd.	Kelly	2019-02-06	12
1	Baker St.	John	108	Berwick St.	Aleta	2019-02-21	20
1	Baker St.	John	17	Carnaby St.	Peter	2019-01-14	18
2	Abbey Road	George	108	Berwick St.	Aleta	2019-02-23	18
2	Abbey Road	George	17	Carnaby St.	Peter	2019-01-25	18
2	Abbey Road	George	17	Carnaby St.	Peter	2019-01-26	12
2	Abbey Road	George	17	Carnaby St.	Peter	2019-01-27	17

Example of Non-Redundent Data

The same information would be stored in a relational database like this:

Sell Records
store_id	cust_id	date	total
1	34	2019-02-12	14
1	34	2019-02-06	12
1	108	2019-02-21	20
1	17	2019-01-14	18
2	108	2019-02-23	18
2	17	2019-01-25	18
2	17	2019-01-26	12
2	17	2019-01-27	17

Store
store_id	store_location	store_mngr
1	Baker St.	John
2	Abbey Road	George

Customer
cust_id	cust_names	cust_address
17	Peter	Carnaby St.
34	Kelly	Downing Rd.
108	Aleta	Berwick St.

The relationship between the tables is defined by the id columns (called keys).

4 / 13

R and Databases

dplyr also works with remote on-disk data stored in relational databases.
For this to work you need
```
install.packages("dbplyr")
```
You’ll also need to install a DBI backend package. The DBI package provides a common interface that allows dplyr to work with many different databases using the same code.
Some commonly used backends are:
- RMariaDB connects to MySQL and MariaDB.
- RPostgreSQL connects to Postgres and Redshift.
- RSQLite embeds a SQLite database.
- bigrquery connects to Google’s BigQuery.

5 / 13

Connect to a Database

To connect to a database authentication information such as username and password are required (the database admin should provide those).
You also need to specify a driver that makes communication with the database possible.

There is a remote MySQL database hosted on the web which we have access to:

library(dplyr)
library(dplyr)
con <- DBI::dbConnect(
      drv = RMariaDB::MariaDB(), # MariaDB driver works for MySQL as well
      username = "sql7312029",
      dbname   = "sql7312029", 
      host     = "sql7.freesqldatabase.com",
      password = rstudioapi::askForPassword("Database password"),
      port     = "3306"
      )

6 / 13

Access Data

When the connection is established we can use the object con to communicate with the database.
The function DBI::dbListTables() tells us which tables are in the database.
```
DBI::dbListTables(con)
```
```
## [1] "country_codes" "gapminder"
```

With dplyr::tbl() we can create an object which is a reference to a table in the database. Most dplyr functions work on this object as it was as a data frame (or tibble).

gap_db <- tbl(con, "gapminder")
class(gap_db)

## [1] "tbl_MariaDBConnection" "tbl_dbi"               "tbl_sql"              
## [4] "tbl_lazy"              "tbl"

7 / 13

Use Standard `dplyr` Functions on the Database

gap_avg_db <- gap_db %>% 
                group_by(continent) %>% 
                select_if(is.numeric) %>% 
                select(-year) %>%
                summarise_all(mean)

Even if it seems as if gap_avg_db is a data frame it still is only a set of instructions to be performed by the data base.

show_query() tells us which instructions are sent to the database.

show_query(gap_avg_db)

## <SQL>
## SELECT `continent`, AVG(`lifeExp`) AS `lifeExp`, AVG(`pop`) AS `pop`, AVG(`gdpPercap`) AS `gdpPercap`
## FROM (SELECT `continent`, `lifeExp`, `pop`, `gdpPercap`
## FROM `gapminder`) `dbplyr_001`
## GROUP BY `continent`

8 / 13

Use Standard `dplyr` functions on the Database

When working with databases, dplyr tries to be as lazy as possible:

It never pulls data into R unless you explicitly ask for it.
It delays doing any work until the last possible moment: it collects everything you want to do and then sends it to the database in one step.

collect() actually pulls the data into R.

gap_avg <- gap_avg_db %>% collect()
print(gap_avg)

## # A tibble: 5 x 4
##   continent lifeExp       pop gdpPercap
##   <chr>       <dbl>     <dbl>     <dbl>
## 1 Africa       48.9  9916003.     2194.
## 2 Americas     64.7 24504795.     7136.
## 3 Asia         60.1 77038722.     7902.
## 4 Europe       71.9 17169765.    14469.
## 5 Oceania      74.3  8874672.    18622.

9 / 13

Disconnect

If the connection to the database is no longer needed you should disconnect to free system ressources.
```
DBI::dbDisconnect(con)
rm(con)
```

10 / 13

Set up an SQLite Database

SQLite is very light weighted Database Management System
An SQLite database is contained in a single file. There is no need for a server.
Everything you need is to install the RSQLite packages.

Create an SQLite Database

A new SQLite database is automatically created if you call DBI::dbConnect() with argument dbname = "path_to_database" where "path_to_database" is a path to a non-existing file.

In case "path_to_database" is an existing SQLite database it connects to that database.

library(dplyr)
con <- DBI::dbConnect(
           drv = RSQLite::SQLite(),
           dbname = here::here("databases","mtcars.sqlite3"), 
           )

11 / 13

Write Data to the Database

With

DBI::dbWriteTable(con, 
              name = "mtcars_db", 
              value = mtcars)

we add the data frame mtcars as a new table with name mtcars_db to the database.

To add data to an existing table we use the additional argument append = TRUE.

new_car <- data.frame(mpg = 1, cyl = 16, disp = 800, hp = 210, 
                  drat = 5, wt = 6, qsec = 2, vs = 1, am = 0)
rownames(new_car) <- "Super Car 500"
DBI::dbWriteTable(con, 
              name = "mtcars_db", 
              value = new_car, 
              append = TRUE)
DBI::dbDisconnect(con)

12 / 13

ExercisesSet up a new SQLite database containing the Boston Crime data. 
Without retrieving the data to R let the database compute the percentage of incidents involving a shooting.
13 / 13

Memory Limitations of R

The amount of data you can work with in R is mainly constraint by your computers memory.

In times where data sets with several 100 GB or even TB become more and more common, loading all data into memory is usually not feasable.

Often only a subset or an aggregation of the data is required for the analysis which requires much less memory space.

How to subset or aggregate data? → Use a database!

2 / 13

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help

Advanced R for Econometricians

Databases using R

Memory Limitations of R

(Relational) Databases

Example of Redundent Data

Example of Non-Redundent Data

R and Databases

Connect to a Database

Access Data

Use Standard dplyr Functions on the Database

Use Standard dplyr functions on the Database

Disconnect

Set up an SQLite Database

Create an SQLite Database

Write Data to the Database

Exercises

Memory Limitations of R

Help

Use Standard `dplyr` Functions on the Database

Use Standard `dplyr` functions on the Database