Hacking for Science

Block 2, Session 2: Programming with Data

Matt Bannert

Data Generating Processes in Science

Simulation

Why

Analyzing Complex Systems
Demos
Proposals / Grants
reproducible examples, reprex R package

Process

set.seed(123)
rnorm(3)

[1] -0.5604756 -0.2301775  1.5587083

set.seed(1)
rnorm(3)

[1] -0.6264538  0.1836433 -0.8356286

set.seed(123)
rnorm(3)

[1] -0.5604756 -0.2301775  1.5587083
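
Building on the seed mechanics above, a small simulated dataset might look like the following sketch. The helper name simulate_survey() is hypothetical, and the sampling scheme (independent, uniform answers on a 1 to 5 scale) is an assumption chosen for illustration only.

```r
# Hypothetical helper: simulate n respondents answering three items on a 1-5 scale
simulate_survey <- function(n, seed) {
  set.seed(seed)  # fixing the seed makes the "random" draws repeatable
  data.frame(
    basic     = sample(1:5, n, replace = TRUE),
    advanced  = sample(1:5, n, replace = TRUE),
    structure = sample(1:5, n, replace = TRUE),
    id        = seq_len(n)
  )
}

survey_a <- simulate_survey(100, seed = 123)
survey_b <- simulate_survey(100, seed = 123)
identical(survey_a, survey_b)  # same seed, same data
```

Wrapping the seed and the draws in one function keeps the whole data generating process in a single, shareable place.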

Logging / Tracking

Process

webservers
mobile phones
IoT devices
tracking tools, e.g., Google Analytics

Form of Resulting Data

text files
granular
event based
biased (tracking)
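
To make the "text files, granular, event based" form concrete, here is a minimal base R sketch that turns a few raw log lines into a rectangular data.frame. The log lines and their layout are invented for illustration; real server or device logs will look different.

```r
# Hypothetical web server log lines (invented for illustration)
log_lines <- c(
  "2024-03-01 10:15:02 GET /index.html 200",
  "2024-03-01 10:15:07 GET /data.csv 404",
  "2024-03-01 10:16:30 POST /survey 200"
)

# Split each line on spaces: one event per row, one field per column
fields <- do.call(rbind, strsplit(log_lines, " ", fixed = TRUE))

events <- data.frame(
  timestamp = as.POSIXct(paste(fields[, 1], fields[, 2]), tz = "UTC"),
  method    = fields[, 3],
  path      = fields[, 4],
  status    = as.integer(fields[, 5])
)

# Aggregate the granular events, e.g., count requests per status code
status_counts <- table(events$status)
```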

Surveys

Types

multi-mode surveys (paper / online forms)
Recordings

Form of Resulting Data

rectangular data (1-line-1-observation)
text
cross sectional
longitudinal data

APIs & Web Scraping

Process

Automated, regular updates
Transformation of structured data into analysis friendly datasets (regular expressions, DOM extraction)

Form of Resulting Data

text strings
nested data
standardized data
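
As a rough illustration of the regular expression route mentioned above, the following base R sketch pulls numbers out of a text string. The HTML fragment is made up; for real pages, a proper DOM parser is usually more robust than regular expressions.

```r
# Hypothetical HTML fragment as it might arrive from a web page
html <- '<ul><li class="price">CHF 12.50</li><li class="price">CHF 7.90</li></ul>'

# Locate every number with two decimals and extract the matching substrings
matches <- regmatches(html, gregexpr("[0-9]+\\.[0-9]{2}", html))[[1]]

# Convert the extracted text into an analysis friendly numeric vector
prices <- as.numeric(matches)
```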

What DGPs Have You Worked With? What DGPs Do You Expect to Face in Your Work?

Representing Data

Data Management: Memory, Files, Databases

In Memory

vector
matrix
data.frame / data.table / tibble
list
environment
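
A compact sketch of the in-memory structures listed above, using base R only. The object names are arbitrary.

```r
v  <- c(a = 1.5, b = 2, c = 3)          # atomic vector: one type, optional names
m  <- matrix(1:6, nrow = 2)             # matrix: a vector with a dim attribute
df <- data.frame(id = 1:3, score = v)   # data.frame: list of equal-length columns
l  <- list(numbers = v, table = df)     # list: arbitrary, possibly nested elements
e  <- new.env()                         # environment: named objects, reference semantics
assign("counter", 0, envir = e)

# Environments are modified in place, so the change is visible outside the function
increment <- function(env) env$counter <- env$counter + 1
increment(e)
```

After the call, e$counter has changed even though the function returned nothing useful; a list passed the same way would have been copied instead.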

On Disk

.RData
.parquet
feather
.xlsx
.csv
.json
.xml
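
The choice of file format affects what survives a round trip. The sketch below, assuming R 4.0 or later, writes the same small data.frame as .csv and in R's binary single-object format (.rds, a close relative of .RData) and reads both back.

```r
path_csv <- tempfile(fileext = ".csv")
path_rds <- tempfile(fileext = ".rds")

df <- data.frame(id = 1:3, group = factor(c("a", "b", "a")))

write.csv(df, path_csv, row.names = FALSE)
saveRDS(df, path_rds)

from_csv <- read.csv(path_csv)
from_rds <- readRDS(path_rds)

class(from_csv$group)  # plain text cannot store the factor class
class(from_rds$group)  # the binary R format preserves it
```

Text formats are portable across tools and languages, while R-specific binary formats keep R attributes intact.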

In a Database

interface
query language, e.g., SQL

Types of Data: Time Series

library(kofdata)
library(tstools)

# get_time_series() returns a named list of ts objects
tsl <- get_time_series("ch.kof.barometer")
tsplot("KOF Barometer" =
         window(tsl$ch.kof.barometer,
                start = c(2010, 1))
       )

in memory

ts
xts
tsibble
zoo

(!) Try out the tsbox R package to easily switch from one representation to another.

on disk

.csv (long format, wide format)
.xml
.json
.RData
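
The difference between long and wide .csv layouts can be sketched with two short series in base R. The series are simulated and the column names (id, time, value) are just one common convention.

```r
# Two short monthly series (simulated, for illustration only)
set.seed(123)
a <- ts(rnorm(3), start = c(2020, 1), frequency = 12)
b <- ts(rnorm(3), start = c(2020, 1), frequency = 12)

# Wide format: one row per period, one column per series
wide <- data.frame(
  time = as.numeric(time(a)),
  a    = as.numeric(a),
  b    = as.numeric(b)
)

# Long format: one row per series and period
long <- data.frame(
  id    = rep(c("a", "b"), each = 3),
  time  = rep(wide$time, times = 2),
  value = c(wide$a, wide$b)
)
```

Long format copes well with series of different lengths or frequencies; wide format is often easier to read by eye.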

Types of Data: Rectangular Datasets

Cross Sectional Data

head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

multiple variables
one period

Panel Data

multiple variables
longitudinal
e.g., German Socio-Economic Panel (GSOEP)

Types of Data: Nested Data

l <- list()
l$element1 <- 2
l$element2 <- head(mtcars[,1:3],4)
l

$element1
[1] 2

$element2
                mpg cyl disp
Mazda RX4      21.0   6  160
Mazda RX4 Wag  21.0   6  160
Datsun 710     22.8   4  108
Hornet 4 Drive 21.4   6  258

examples:

meta information, sector classification (hierarchical), GDP components, translations, attributes, properties
geospatial data

Hands on: Block 2, Task 2 – Trying out different representations

Three R Ecosystems to Manipulate Data

Base R

vectors, matrices, data.frames
no extensions needed
much better than its marketing
split, apply, combine approach

by_cyl <- split(mtcars, mtcars$cyl)
out <- lapply(by_cyl,
              function(x) summary(
                lm(mpg~.,data = x)
                )
              )

str(out, max.level = 1)

List of 3
 $ 4:List of 11
  ..- attr(*, "class")= chr "summary.lm"
 $ 6:List of 11
  ..- attr(*, "class")= chr "summary.lm"
 $ 8:List of 11
  ..- attr(*, "class")= chr "summary.lm"

data.table

CRAN package
written by Matt Dowle et al.
fastest ecosystem, including fwrite/fread for fast disk i/o

library(data.table)
dt <- fread("../data/simulated_survey.csv")
head(dt)

   basic advanced structure    id
   <int>    <int>     <int> <int>
1:     2        3         3     1
2:     3        5         3     2
3:     3        3         5     3
4:     4        3         2     4
5:     5        4         3     5
6:     3        4         2     6

data.table cont’d

# i, j, by
dt[, obs_avg := rowMeans(.SD), .SDcols = c("basic", "advanced", "structure")]
head(dt)

SQL reminiscent concept and syntax
works with pointers / reference -> it changes objects in place
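
The i, j, by pattern and the in-place behaviour can be sketched on a small made-up table. The table and column names are hypothetical stand-ins, not the course data.

```r
library(data.table)

# Small made-up table standing in for the survey data
dt <- data.table(id = 1:4, group = c("a", "a", "b", "b"), score = c(2, 4, 3, 5))

# i: filter rows, j: compute, by: group
dt[score > 2, .(avg = mean(score)), by = group]

# := adds a column by reference; dt2 is just another name for the same object
dt2 <- dt
dt2[, score_sq := score^2]
"score_sq" %in% names(dt)   # dt has changed as well

# copy() creates an independent object
dt3 <- copy(dt)
dt3[, flag := TRUE]
"flag" %in% names(dt)       # dt is untouched this time
```

Changing objects in place avoids copies and is a large part of the speed advantage, but it also means an explicit copy() is needed whenever the original must stay unchanged.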

tidyverse

dplyr, tibble, ...
best documentation
fast
uses references, too
very popular for interactive use
pipe operator: %>% (base R has had a native pipe, |>, since version 4.1)
pretty printing

tidyverse

library(dplyr)
mtcars %>% 
  filter(cyl > 4) %>% 
  nrow()

[1] 21
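
For comparison, the same count can be written with the native pipe and base R functions only. This sketch assumes R 4.1 or later.

```r
# Base R equivalent of the dplyr pipeline, using the native pipe
n_rows <- mtcars |>
  subset(cyl > 4) |>
  nrow()
```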

Block 2, Task 3: Three R Ecosystems to Manipulate Data

Databases (DBMS)

When to Use a Database?

Passive Use

direct access
no API
flexible queries needed

Active Use (in the sense of setting it up, running it)

as a backend, e.g., survey (transactional database)
in research projects when you want to share data inside your lab
when you need to restrict access at different levels

Active Use cont’d

to expose data through an API (to external collaborators or the public)
need to integrate with different systems / tools and programming languages
specific operations, e.g., spatial data

Which DBMS Should I Use?

Relational DBMS should be the default choice. Don’t believe me? Believe him. Note: you can use JSON inside SQL databases as well.

Which DBMS Should I Use?

Which Relational DBMS Should I Use?

SQLite is for prototypes and mobile phones
MySQL is for kids
PostgreSQL, MS SQL and Oracle are at the same level for many science applications

Can You Explain ‘Relational’?

Are There Any Disadvantages of Databases?

You have to maintain them, even if you don’t need them
more setup overhead: dev and production system(s) needed
solid, free solutions are only temporary
you need an interface for collaborators who don’t speak SQL
some really, really, really big datasets may ask for specific methods

Getting Started with Relational Databases (RDBMS)

Hosting

localhost (dev only)
docker environment
university VM
homeserver
cloud (either VM and install on your own, or DB specific Cloud)

Client

install drivers locally at OS level
install client library (R packages are usually wrappers around C interfaces)

Design

integrity (think primary key, foreign key for starters)
Draw up a design… dbdiagram.io is a fun tool, paper & pencil also work pretty well

Start Simple, Catch a Breath of SQL Air with SQLite

library(DBI)
library(RSQLite)
db_path <- "../data/h4sci.sqlite3"
con <- dbConnect(RSQLite::SQLite(), db_path)
# copy two built-in datasets into the database, replacing existing tables
dbWriteTable(con, dbQuoteIdentifier(con, "mtcars"), mtcars, overwrite = TRUE)
dbWriteTable(con, dbQuoteIdentifier(con, "flowers"), iris, overwrite = TRUE)

dbGetQuery(con, "SELECT * FROM flowers LIMIT 3")

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

Note: Foreign Key Handling is very limited in SQLite.
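
The integrity idea from the design step (primary key, foreign key) can still be tried out in SQLite, as long as enforcement is switched on for the connection. The sketch below uses an in-memory database; the tables labs and researchers and their columns are hypothetical.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Ask SQLite to enforce foreign keys on this connection
dbExecute(con, "PRAGMA foreign_keys = ON")

dbExecute(con, "CREATE TABLE labs (
  lab_id INTEGER PRIMARY KEY,
  name   TEXT NOT NULL
)")
dbExecute(con, "CREATE TABLE researchers (
  researcher_id INTEGER PRIMARY KEY,
  name          TEXT NOT NULL,
  lab_id        INTEGER NOT NULL REFERENCES labs (lab_id)
)")

dbExecute(con, "INSERT INTO labs VALUES (1, 'Economics')")
dbExecute(con, "INSERT INTO researchers VALUES (1, 'Ada', 1)")

# lab_id 99 does not exist in labs, so the database should reject this row
result <- tryCatch(
  dbExecute(con, "INSERT INTO researchers VALUES (2, 'Bob', 99)"),
  error = function(e) "rejected"
)

n_researchers <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM researchers")$n
dbDisconnect(con)
```

The primary key guarantees that each row can be addressed unambiguously; the foreign key guarantees that every researcher points to a lab that actually exists.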

Example Query

dbGetQuery(con, "SELECT * FROM mtcars WHERE mpg > 30")

   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
2 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
3 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
4 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

Database Access

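library(DBI)
library(RSQLite)
# SQLite: the database is a single file, no server involved
db_path <- "../data/h4sci.sqlite3"
con <- dbConnect(RSQLite::SQLite(),
                 db_path)

library(RPostgres)
# Postgres: connect to a server and ask for the password interactively
# (rstudioapi::askForPassword() only works inside RStudio)
con <- dbConnect(drv = Postgres(),
                 dbname = "postgres",
                 user = "postgres",
                 host = "some.server.or.ip",
                 password = rstudioapi::askForPassword(
                   "enter your pw"
                   )
                 )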

Essentially the same interface (DBI)
server, files are abstracted away
host, password, port etc. needed for Postgres
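
Because the interface is shared, helper functions can be written against a generic DBI connection and reused across backends. The function count_rows() below is a hypothetical example, demonstrated here with an in-memory SQLite database.

```r
library(DBI)

# Works with any DBI connection, e.g., SQLite or Postgres
count_rows <- function(con, table) {
  # dbQuoteIdentifier() quotes the table name in the backend's own style
  sql <- paste("SELECT COUNT(*) AS n FROM", dbQuoteIdentifier(con, table))
  dbGetQuery(con, sql)$n
}

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
n <- count_rows(con, "mtcars")
dbDisconnect(con)  # always close connections when done
```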

Basic SQL Syntax

SELECT * FROM schema.table;

INSERT INTO schema.table VALUES ('abc',2,3);

SELECT name, salary FROM staff
WHERE position = 'manager'
ORDER BY salary DESC;

SELECT * returns all columns (and rows) of a given table
INSERT adds a row of values to a table while checking integrity (e.g., data types)
The last query SELECTs name and salary from the table staff for all managers and orders the output by salary, from highest to lowest
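
To try the last query from R, a throwaway in-memory SQLite database is enough. The staff table and its three rows are made up to match the column names used in the query.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up staff table matching the columns used in the query above
dbExecute(con, "CREATE TABLE staff (name TEXT, position TEXT, salary INTEGER)")
dbExecute(con, "INSERT INTO staff VALUES
  ('Ann', 'manager', 9000),
  ('Ben', 'analyst', 7000),
  ('Cem', 'manager', 9500)")

# Managers only, highest salary first
managers <- dbGetQuery(con, "
  SELECT name, salary FROM staff
  WHERE position = 'manager'
  ORDER BY salary DESC
")
dbDisconnect(con)
```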

Further Reads

SQLite Tutorial
PostgreSQL 15 Documentation (Jump in at ‘table basics’)
Vignettes of the {dm} package, in particular the data.frame to dm part

Hands on: Block 2, Task 4: A Little SQL

How to Work with an Application Programming Interface (API)

RTFM
Experiment ‘per pedes’ (manually, step by step)
Write a Wrapper if it does not exist
Use the Wrapper

Here is an R example to get ☔️ images from the MET collection.
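
A minimal sketch of what such a wrapper might look like. The base URL, the endpoint paths (/search, /objects/) and the field names (objectIDs, primaryImage) are assumptions about the MET collection API and should be checked against its documentation; the function names are hypothetical. The network calls need the jsonlite package and an internet connection.

```r
# Assumed base URL of the MET collection API; verify against the official docs
met_base <- "https://collectionapi.metmuseum.org/public/collection/v1"

# Build the search URL separately so it can be inspected without a network call
met_search_url <- function(query) {
  paste0(met_base, "/search?hasImages=true&q=",
         utils::URLencode(query, reserved = TRUE))
}

# Thin wrapper: returns the IDs of matching objects (assumed field: objectIDs)
met_search <- function(query) {
  res <- jsonlite::fromJSON(met_search_url(query))
  res$objectIDs
}

# Second wrapper: image URL for one object (assumed field: primaryImage)
met_image_url <- function(object_id) {
  obj <- jsonlite::fromJSON(paste0(met_base, "/objects/", object_id))
  obj$primaryImage
}

url <- met_search_url("umbrella rain")
```

With network access, ids <- met_search("umbrella") followed by met_image_url(ids[1]) would be the intended usage. Keeping URL construction in its own function makes the wrapper easy to test offline.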