class: title-slide title-sqlite center middle # Advanced R for Econometricians ## Databases using R --- ## Memory Limitations of R - The amount of data you can work with in R is mainly constraint by your computers memory. - In times where data sets with several 100 GB or even TB become more and more common, loading all data into memory is usually not feasable. - Often only a subset or an aggregation of the data is required for the analysis which requires much less memory space. - How to subset or aggregate data? → Use a database! --- ## (Relational) Databases - A database is a place where data is stored and can be manipulated. - A relational database aims to reduce redundancies by spreading the data over multiple tables and relate them to each other. #### Example of Redundent Data .font90[ <table> <thead> <tr> <th style="text-align:right;"> store_id </th> <th style="text-align:left;"> store_location </th> <th style="text-align:left;"> store_mngr </th> <th style="text-align:right;"> cust_id </th> <th style="text-align:left;"> cust_address </th> <th style="text-align:left;"> cust_name </th> <th style="text-align:left;"> date </th> <th style="text-align:right;"> total </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Baker St. </td> <td style="text-align:left;"> John </td> <td style="text-align:right;"> 34 </td> <td style="text-align:left;"> Downing Rd. </td> <td style="text-align:left;"> Kelly </td> <td style="text-align:left;"> 2019-02-12 </td> <td style="text-align:right;"> 14 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Baker St. </td> <td style="text-align:left;"> John </td> <td style="text-align:right;"> 34 </td> <td style="text-align:left;"> Downing Rd. </td> <td style="text-align:left;"> Kelly </td> <td style="text-align:left;"> 2019-02-06 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Baker St. </td> <td style="text-align:left;"> John </td> <td style="text-align:right;"> 108 </td> <td style="text-align:left;"> Berwick St. </td> <td style="text-align:left;"> Aleta </td> <td style="text-align:left;"> 2019-02-21 </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Baker St. </td> <td style="text-align:left;"> John </td> <td style="text-align:right;"> 17 </td> <td style="text-align:left;"> Carnaby St. </td> <td style="text-align:left;"> Peter </td> <td style="text-align:left;"> 2019-01-14 </td> <td style="text-align:right;"> 18 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Abbey Road </td> <td style="text-align:left;"> George </td> <td style="text-align:right;"> 108 </td> <td style="text-align:left;"> Berwick St. </td> <td style="text-align:left;"> Aleta </td> <td style="text-align:left;"> 2019-02-23 </td> <td style="text-align:right;"> 18 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Abbey Road </td> <td style="text-align:left;"> George </td> <td style="text-align:right;"> 17 </td> <td style="text-align:left;"> Carnaby St. </td> <td style="text-align:left;"> Peter </td> <td style="text-align:left;"> 2019-01-25 </td> <td style="text-align:right;"> 18 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Abbey Road </td> <td style="text-align:left;"> George </td> <td style="text-align:right;"> 17 </td> <td style="text-align:left;"> Carnaby St. </td> <td style="text-align:left;"> Peter </td> <td style="text-align:left;"> 2019-01-26 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Abbey Road </td> <td style="text-align:left;"> George </td> <td style="text-align:right;"> 17 </td> <td style="text-align:left;"> Carnaby St. </td> <td style="text-align:left;"> Peter </td> <td style="text-align:left;"> 2019-01-27 </td> <td style="text-align:right;"> 17 </td> </tr> </tbody> </table> ] --- #### Example of Non-Redundent Data - The same information would be stored in a relational database like this: .pull-left[ <table> <caption>Sell Records</caption> <thead> <tr> <th style="text-align:right;"> store_id </th> <th style="text-align:right;"> cust_id </th> <th style="text-align:left;"> date </th> <th style="text-align:right;"> total </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 34 </td> <td style="text-align:left;"> 2019-02-12 </td> <td style="text-align:right;"> 14 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 34 </td> <td style="text-align:left;"> 2019-02-06 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 108 </td> <td style="text-align:left;"> 2019-02-21 </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:left;"> 2019-01-14 </td> <td style="text-align:right;"> 18 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 108 </td> <td style="text-align:left;"> 2019-02-23 </td> <td style="text-align:right;"> 18 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:left;"> 2019-01-25 </td> <td style="text-align:right;"> 18 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:left;"> 2019-01-26 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:left;"> 2019-01-27 </td> <td style="text-align:right;"> 17 </td> </tr> </tbody> </table> ] .pull-right[ <table> <caption>Store</caption> <thead> <tr> <th style="text-align:right;"> store_id </th> <th style="text-align:left;"> store_location </th> <th style="text-align:left;"> store_mngr </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Baker St. </td> <td style="text-align:left;"> John </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Abbey Road </td> <td style="text-align:left;"> George </td> </tr> </tbody> </table> <br> <table> <caption>Customer</caption> <thead> <tr> <th style="text-align:left;"> cust_id </th> <th style="text-align:left;"> cust_names </th> <th style="text-align:left;"> cust_address </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 17 </td> <td style="text-align:left;"> Peter </td> <td style="text-align:left;"> Carnaby St. </td> </tr> <tr> <td style="text-align:left;"> 34 </td> <td style="text-align:left;"> Kelly </td> <td style="text-align:left;"> Downing Rd. </td> </tr> <tr> <td style="text-align:left;"> 108 </td> <td style="text-align:left;"> Aleta </td> <td style="text-align:left;"> Berwick St. </td> </tr> </tbody> </table> ] <br> - The relationship between the tables is defined by the `id` columns (called keys). --- ## R and Databases - `dplyr` also works with remote on-disk data stored in relational databases. - For this to work you need ```r install.packages("dbplyr") ``` - You’ll also need to install a `DBI` backend package. The `DBI` package provides a common interface that allows `dplyr` to work with many different databases using the same code. - Some commonly used backends are: - `RMariaDB` connects to MySQL and MariaDB. - `RPostgreSQL` connects to Postgres and Redshift. - `RSQLite` embeds a SQLite database. - `bigrquery` connects to Google’s BigQuery. --- ## Connect to a Database - To connect to a database authentication information such as username and password are required (the database admin should provide those). - You also need to specify a driver that makes communication with the database possible. - There is a remote MySQL database hosted on the web which we have access to: ```r library(dplyr) library(dplyr) con <- DBI::dbConnect( drv = RMariaDB::MariaDB(), # MariaDB driver works for MySQL as well username = "sql7312029", dbname = "sql7312029", host = "sql7.freesqldatabase.com", password = rstudioapi::askForPassword("Database password"), port = "3306" ) ``` --- ## Access Data - When the connection is established we can use the object `con` to communicate with the database. - The function `DBI::dbListTables()` tells us which tables are in the database. ```r DBI::dbListTables(con) ``` ``` ## [1] "country_codes" "gapminder" ``` - With `dplyr::tbl()` we can create an object which is a reference to a table in the database. Most `dplyr` functions work on this object as it was as a data frame (or `tibble`). ```r gap_db <- tbl(con, "gapminder") class(gap_db) ``` ``` ## [1] "tbl_MariaDBConnection" "tbl_dbi" "tbl_sql" ## [4] "tbl_lazy" "tbl" ``` --- ## Use Standard `dplyr` Functions on the Database ```r gap_avg_db <- gap_db %>% group_by(continent) %>% select_if(is.numeric) %>% select(-year) %>% summarise_all(mean) ``` - Even if it seems as if `gap_avg_db` is a data frame it still is only a set of instructions to be performed by the data base. - `show_query()` tells us which instructions are sent to the database. ```r show_query(gap_avg_db) ``` ``` ## <SQL> ## SELECT `continent`, AVG(`lifeExp`) AS `lifeExp`, AVG(`pop`) AS `pop`, AVG(`gdpPercap`) AS `gdpPercap` ## FROM (SELECT `continent`, `lifeExp`, `pop`, `gdpPercap` ## FROM `gapminder`) `dbplyr_001` ## GROUP BY `continent` ``` --- ## Use Standard `dplyr` functions on the Database When working with databases, `dplyr` tries to be as lazy as possible: - It never pulls data into R unless you explicitly ask for it. - It delays doing any work until the last possible moment: it collects everything you want to do and then sends it to the database in one step. - `collect()` actually pulls the data into R. ```r gap_avg <- gap_avg_db %>% collect() print(gap_avg) ``` ``` ## # A tibble: 5 x 4 ## continent lifeExp pop gdpPercap ## <chr> <dbl> <dbl> <dbl> ## 1 Africa 48.9 9916003. 2194. ## 2 Americas 64.7 24504795. 7136. ## 3 Asia 60.1 77038722. 7902. ## 4 Europe 71.9 17169765. 14469. ## 5 Oceania 74.3 8874672. 18622. ``` --- ## Disconnect - If the connection to the database is no longer needed you should disconnect to free system ressources. ```r DBI::dbDisconnect(con) rm(con) ``` --- ## Set up an SQLite Database - SQLite is very light weighted Database Management System - An SQLite database is contained in a single file. There is no need for a server. - Everything you need is to install the `RSQLite` packages. #### Create an SQLite Database - A new SQLite database is automatically created if you call `DBI::dbConnect()` with argument `dbname = "path_to_database"` where `"path_to_database"` is a path to a non-existing file. - In case `"path_to_database"` is an existing SQLite database it connects to that database. ```r library(dplyr) con <- DBI::dbConnect( drv = RSQLite::SQLite(), dbname = here::here("databases","mtcars.sqlite3"), ) ``` --- ## Write Data to the Database - With ```r DBI::dbWriteTable(con, name = "mtcars_db", value = mtcars) ``` we add the data frame `mtcars` as a new table with name `mtcars_db` to the database. - To add data to an existing table we use the additional argument `append = TRUE`. ```r new_car <- data.frame(mpg = 1, cyl = 16, disp = 800, hp = 210, drat = 5, wt = 6, qsec = 2, vs = 1, am = 0) rownames(new_car) <- "Super Car 500" DBI::dbWriteTable(con, name = "mtcars_db", value = new_car, append = TRUE) DBI::dbDisconnect(con) ``` --- ## Exercises 1. Set up a new SQLite database containing the Boston Crime data. 2. Without retrieving the data to R let the database compute the percentage of incidents involving a shooting.