+ - 0:00:00
Notes for current slide
Notes for next slide

Advanced R for Econometricians

Databases using R

1 / 13

Memory Limitations of R

  • The amount of data you can work with in R is mainly constraint by your computers memory.
  • In times where data sets with several 100 GB or even TB become more and more common, loading all data into memory is usually not feasable.
  • Often only a subset or an aggregation of the data is required for the analysis which requires much less memory space.
  • How to subset or aggregate data? → Use a database!
2 / 13

(Relational) Databases

  • A database is a place where data is stored and can be manipulated.
  • A relational database aims to reduce redundancies by spreading the data over multiple tables and relate them to each other.

Example of Redundent Data

store_id store_location store_mngr cust_id cust_address cust_name date total
1 Baker St. John 34 Downing Rd. Kelly 2019-02-12 14
1 Baker St. John 34 Downing Rd. Kelly 2019-02-06 12
1 Baker St. John 108 Berwick St. Aleta 2019-02-21 20
1 Baker St. John 17 Carnaby St. Peter 2019-01-14 18
2 Abbey Road George 108 Berwick St. Aleta 2019-02-23 18
2 Abbey Road George 17 Carnaby St. Peter 2019-01-25 18
2 Abbey Road George 17 Carnaby St. Peter 2019-01-26 12
2 Abbey Road George 17 Carnaby St. Peter 2019-01-27 17
3 / 13

Example of Non-Redundent Data

  • The same information would be stored in a relational database like this:
Sell Records
store_id cust_id date total
1 34 2019-02-12 14
1 34 2019-02-06 12
1 108 2019-02-21 20
1 17 2019-01-14 18
2 108 2019-02-23 18
2 17 2019-01-25 18
2 17 2019-01-26 12
2 17 2019-01-27 17
Store
store_id store_location store_mngr
1 Baker St. John
2 Abbey Road George

Customer
cust_id cust_names cust_address
17 Peter Carnaby St.
34 Kelly Downing Rd.
108 Aleta Berwick St.


  • The relationship between the tables is defined by the id columns (called keys).
4 / 13

R and Databases

  • dplyr also works with remote on-disk data stored in relational databases.
  • For this to work you need

    install.packages("dbplyr")
  • You’ll also need to install a DBI backend package. The DBI package provides a common interface that allows dplyr to work with many different databases using the same code.

  • Some commonly used backends are:

    • RMariaDB connects to MySQL and MariaDB.
    • RPostgreSQL connects to Postgres and Redshift.
    • RSQLite embeds a SQLite database.
    • bigrquery connects to Google’s BigQuery.
5 / 13

Connect to a Database

  • To connect to a database authentication information such as username and password are required (the database admin should provide those).
  • You also need to specify a driver that makes communication with the database possible.
  • There is a remote MySQL database hosted on the web which we have access to:

    library(dplyr)
    library(dplyr)
    con <- DBI::dbConnect(
    drv = RMariaDB::MariaDB(), # MariaDB driver works for MySQL as well
    username = "sql7312029",
    dbname = "sql7312029",
    host = "sql7.freesqldatabase.com",
    password = rstudioapi::askForPassword("Database password"),
    port = "3306"
    )
6 / 13

Access Data

  • When the connection is established we can use the object con to communicate with the database.
  • The function DBI::dbListTables() tells us which tables are in the database.

    DBI::dbListTables(con)
    ## [1] "country_codes" "gapminder"
  • With dplyr::tbl() we can create an object which is a reference to a table in the database. Most dplyr functions work on this object as it was as a data frame (or tibble).

    gap_db <- tbl(con, "gapminder")
    class(gap_db)
    ## [1] "tbl_MariaDBConnection" "tbl_dbi" "tbl_sql"
    ## [4] "tbl_lazy" "tbl"
7 / 13

Use Standard dplyr Functions on the Database

gap_avg_db <- gap_db %>%
group_by(continent) %>%
select_if(is.numeric) %>%
select(-year) %>%
summarise_all(mean)
  • Even if it seems as if gap_avg_db is a data frame it still is only a set of instructions to be performed by the data base.
  • show_query() tells us which instructions are sent to the database.

    show_query(gap_avg_db)
    ## <SQL>
    ## SELECT `continent`, AVG(`lifeExp`) AS `lifeExp`, AVG(`pop`) AS `pop`, AVG(`gdpPercap`) AS `gdpPercap`
    ## FROM (SELECT `continent`, `lifeExp`, `pop`, `gdpPercap`
    ## FROM `gapminder`) `dbplyr_001`
    ## GROUP BY `continent`
8 / 13

Use Standard dplyr functions on the Database

When working with databases, dplyr tries to be as lazy as possible:

  • It never pulls data into R unless you explicitly ask for it.
  • It delays doing any work until the last possible moment: it collects everything you want to do and then sends it to the database in one step.
  • collect() actually pulls the data into R.

    gap_avg <- gap_avg_db %>% collect()
    print(gap_avg)
    ## # A tibble: 5 x 4
    ## continent lifeExp pop gdpPercap
    ## <chr> <dbl> <dbl> <dbl>
    ## 1 Africa 48.9 9916003. 2194.
    ## 2 Americas 64.7 24504795. 7136.
    ## 3 Asia 60.1 77038722. 7902.
    ## 4 Europe 71.9 17169765. 14469.
    ## 5 Oceania 74.3 8874672. 18622.
9 / 13

Disconnect

  • If the connection to the database is no longer needed you should disconnect to free system ressources.

    DBI::dbDisconnect(con)
    rm(con)
10 / 13

Set up an SQLite Database

  • SQLite is very light weighted Database Management System
  • An SQLite database is contained in a single file. There is no need for a server.
  • Everything you need is to install the RSQLite packages.

Create an SQLite Database

  • A new SQLite database is automatically created if you call DBI::dbConnect() with argument dbname = "path_to_database" where "path_to_database" is a path to a non-existing file.
  • In case "path_to_database" is an existing SQLite database it connects to that database.

    library(dplyr)
    con <- DBI::dbConnect(
    drv = RSQLite::SQLite(),
    dbname = here::here("databases","mtcars.sqlite3"),
    )
11 / 13

Write Data to the Database

  • With

    DBI::dbWriteTable(con,
    name = "mtcars_db",
    value = mtcars)

    we add the data frame mtcars as a new table with name mtcars_db to the database.

  • To add data to an existing table we use the additional argument append = TRUE.

    new_car <- data.frame(mpg = 1, cyl = 16, disp = 800, hp = 210,
    drat = 5, wt = 6, qsec = 2, vs = 1, am = 0)
    rownames(new_car) <- "Super Car 500"
    DBI::dbWriteTable(con,
    name = "mtcars_db",
    value = new_car,
    append = TRUE)
    DBI::dbDisconnect(con)
12 / 13

Exercises

  1. Set up a new SQLite database containing the Boston Crime data.
  2. Without retrieving the data to R let the database compute the percentage of incidents involving a shooting.
13 / 13

Memory Limitations of R

  • The amount of data you can work with in R is mainly constraint by your computers memory.
  • In times where data sets with several 100 GB or even TB become more and more common, loading all data into memory is usually not feasable.
  • Often only a subset or an aggregation of the data is required for the analysis which requires much less memory space.
  • How to subset or aggregate data? → Use a database!
2 / 13
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow