Hacking for Science

Block 2, Session 2: Programming with Data

Matt Bannert

Data Generating Processes in Science

Simulation

Why

  • Analyzing Complex Systems
  • Demos
  • Proposals / Grants
  • reproducible examples, reprex R package

Process

set.seed(123)
rnorm(3)
[1] -0.5604756 -0.2301775  1.5587083
set.seed(1)
rnorm(3)
[1] -0.6264538  0.1836433 -0.8356286
set.seed(123)
rnorm(3)
[1] -0.5604756 -0.2301775  1.5587083

Logging / Tracking

Process

  • webservers
  • mobile phones
  • IoT devices
  • tracking tools, e.g., Google Analytics

Form of Resulting Data

  • text files
  • granular
  • event based
  • biased (tracking)

Surveys

Types

  • multi mode surveys (paper / online forms)
  • Recordings

Form of Resulting Data

  • rectangular data (1-line-1-observation)
  • text
  • cross sectional
  • longitudinal data

APIs & Web Scraping

Process

  • Automated, regular updates
  • Transformation of structured data into analysis friendly datasets (regular expressions, DOM extraction)

Form of Resulting Data

  • text strings
  • nested data
  • standardized data

What DGPs Have You Worked With? What DGPs Do You Expect to Face in Your Work?

Representing Data

Data Management: Memory, Files, Databases

In Memory

  • vector
  • matrix
  • data.frame / data.table / tibble
  • list
  • environment

On Disk

  • .RData
  • .parquet
  • feather
  • .xlsx
  • .csv
  • .json
  • .xml

see also: RSE Book on Data Management

In a Database

  • interface
  • query language, e.g., SQL

Types of Data: Time Series

library(kofdata)
library(tstools)

tsl <- window(get_time_series("ch.kof.barometer"))
tsplot("KOF Barometer" = 
         window(tsl$ch.kof.barometer,
                start = c(2010,1))
       )

in memory

  • ts
  • xts
  • tsibble
  • zoo

(!) Try out the tsbox R package to easily switch from one representation to another.

on disk

  • .csv (long format, wide format)
  • .xml
  • .json
  • .RData

Types of Data: Rectangular Datasets

Cross Sectional Data

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
  • multiple variables
  • one period

Panel Data

  • multiple variables
  • longitudinal
  • e.g., German Socio Economic Panel (GSOEP)

Types of Data: Nested Data

l <- list()
l$element1 <- 2
l$element2 <- head(mtcars[,1:3],4)
l
$element1
[1] 2

$element2
                mpg cyl disp
Mazda RX4      21.0   6  160
Mazda RX4 Wag  21.0   6  160
Datsun 710     22.8   4  108
Hornet 4 Drive 21.4   6  258

examples:

Hands on: Block 2, Task 2 – Trying out different representations

R Three Ecosystems to Manipulate Data

Base R

  • vectors, matrices, data.frames
  • no extensions needed
  • much better than its marketing
  • split, apply, combine approach
by_cyl <- split(mtcars, mtcars$cyl)
out <- lapply(by_cyl,
              function(x) summary(
                lm(mpg~.,data = x)
                )
              )

str(out, max.level = 1)
List of 3
 $ 4:List of 11
  ..- attr(*, "class")= chr "summary.lm"
 $ 6:List of 11
  ..- attr(*, "class")= chr "summary.lm"
 $ 8:List of 11
  ..- attr(*, "class")= chr "summary.lm"

data.table

  • CRAN package
  • written by Matt Dowle et al.
  • fastest ecosystem, including fwrite/fread for fast disk i/o
library(data.table)
dt <- fread("../data/simulated_survey.csv")
head(dt)
   basic advanced structure    id
   <int>    <int>     <int> <int>
1:     2        3         3     1
2:     3        5         3     2
3:     3        3         5     3
4:     4        3         2     4
5:     5        4         3     5
6:     3        4         2     6

data.table c’d

# i, j, by
dt[, obs_avg := rowMeans(.SD), .SDcols = c("basic", "advanced", "structure")]
head(dt)
  • SQL reminiscent concept and syntax
  • works with pointers / reference -> it changes objects in place

tidyverse

  • dplyr, tibble, ….
  • best documentation
  • fast
  • uses references, too
  • very popular for interactive use
  • pipe operator: %>% (base R got a pipe in the meantime, too)
  • pretty printing

tidyverse

library(dplyr)
mtcars %>% 
  filter(cyl > 4) %>% 
  nrow()
[1] 21

Block 2, Task 3: Three R Ecosystems to Manipulate Data

Databases (DBMS)

When to Use a Database ?

Passive Use

  • direct access
  • no API
  • flexible queries needed

Active Use (in the sense of setting it up, running it)

  • as a backend, e.g., survey (transactional database)
  • in research projects when you want to share data inside your lab
  • when you need to restrict access at different levels

Active Use c’d

Which DBMS Should I Use ?

Which DBMS Should I Use ?

  1. relational DBMS should be the default choice, don’t believe me? Believe him. Note, you can use json inside SQL DBs as well.

Which DBMS Should I Use ?

  1. Which relational DBMS should I Use?
  • SQLite is for prototypes and mobile phones
  • MySQL is for kids
  • PostgreSQL, MS SQL and Oracle are at the same level for many science applications

Can You Explain ‘Relational’

Are There Any Disadvantages of Databases?

  • You have to maintain them, even if you don’t need them
  • more setup overhead: dev and production system(s) needed
  • solid, free solutions are only temporal
  • you need an interface for collaborators who don’t speak SQL
  • some really, really, really big datasets may ask for specific methods

Getting Started with Relational Databases (RDBMS)

Hosting

  • localhost (dev only)
  • docker environment
  • university VM
  • homeserver
  • cloud (either VM and install on your own, or DB specific Cloud)

Client

  • install drivers locally at OS level
  • install client library (R packages are usually wrappers around C interfaces)

Design

  • integrity (think primary key, foreign key for starters)
  • Draw up a design… dbdiagram.io is a fun tool, paper & pencil also work pretty well

Start Simple, Catch a Breath of SQL Air with SQLite

library(RSQLite)
db_path = "../data/h4sci.sqlite3"
con <- dbConnect(RSQLite::SQLite(), db_path)
dbWriteTable(con, dbQuoteIdentifier(con,"mtcars"), mtcars, overwrite = T)
dbWriteTable(con, dbQuoteIdentifier(con,"flowers"), iris, overwrite = T)

dbGetQuery(con, "SELECT * FROM flowers LIMIT 3")
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

Note: Foreign Key Handling is very limited in SQLite.

Example Query

dbGetQuery(con, "SELECT * FROM mtcars WHERE mpg > 30")
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
2 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
3 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
4 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

Database Access

library(RSQLite)
db_path = "../data/h4sci.sqlite3"
con <- dbConnect(RSQLite::SQLite(),
                 db_path)
library(RPostgres)
con <-  dbConnect(drv = Postgres(),
                  dbname = "postgres",
                  user = "postgres",
                  host = "some.server.or.ip",
                  password = .rs.askForPassword(
                    "enter your pw"
                    )
                  )
  • Essentially the same interface (DBI)
  • server, files are abstracted away
  • host, password, port etc. needed for Postgres

Basic SQL Syntax

SELECT * FROM schema.table;

INSERT INTO schema.table VALUES ('abc',2,3);

SELECT name, salary FROM staff
WHERE position = 'manager'
ORDER BY salary DESC;
  • ’*’ SELECTs all data from a given table
  • INSERTs values to a table while checking integrity (data types)
  • SELECTs name and salary from table staff for all managers and orders the output by salary from highest to lowest

Further Reads

Hands on: Block 2, Task 4: A Little SQL

How to Work with an Application Programming Interface (API)

  1. RTFM

  2. Experiment ‘per pedes’

  3. Write a Wrapper if it does not exist

  4. Use the Wrapper

Here is an R example to get ☔️ images from the MET collection.