Hacking for Science

Block 2, Session 1: R Programming 101

Matt Bannert

“Don’t get bored by the process.”

– Chris Bosh (11x NBA All-Star, 2x NBA champion) on The Knowledge Project podcast.

Today’s Goal: See R as a Programming Language

📜 Scripts (Experimental)

⬇️

♻️ Functions (Reusable)

⬇️

📦 Packages (Deployable)

R Building Blocks

Vector

a <- 1
b <- c(1,2,6,20,30,40,100,200,300)
d <- 1:10
a
[1] 1
b
[1]   1   2   6  20  30  40 100 200 300
d
 [1]  1  2  3  4  5  6  7  8  9 10
  • scalar: vector of length one.
  • all elements have the same data type -> mixed inputs are coerced to the most flexible common type (logical < integer < double < character).
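A minimal sketch of the coercion rule; the variable names are illustrative and not part of the slides:

```r
# Mixing types in c() silently coerces everything to the most flexible type.
num_and_lgl <- c(1, TRUE, FALSE)   # logical -> numeric
num_and_chr <- c(1, 2, "three")    # numeric -> character

class(num_and_lgl)   # "numeric"
class(num_and_chr)   # "character"
num_and_lgl          # TRUE became 1, FALSE became 0

# A scalar is simply a vector of length one.
length(42)           # 1
```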

Matrix

m <- matrix(b, nrow = 3)
m
     [,1] [,2] [,3]
[1,]    1   20  100
[2,]    2   30  200
[3,]    6   40  300
  • two dimensional.
  • all columns are of the same length.
  • all elements have the same data type.
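As the output above suggests, `matrix()` fills column by column by default. A small sketch of the difference, with hypothetical object names:

```r
b <- c(1, 2, 6, 20, 30, 40, 100, 200, 300)

m_by_col <- matrix(b, nrow = 3)                # default: fill column by column
m_by_row <- matrix(b, nrow = 3, byrow = TRUE)  # fill row by row instead

# Index with [row, column]
m_by_col[2, 3]   # 200
m_by_row[2, 3]   # 40

dim(m_by_col)    # number of rows, number of columns
```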

Data.frames

d <- data.frame(col1 = b[1:3],
                col2 = b[4:6],
                col3 = letters[1:3])
d
  col1 col2 col3
1    1   20    a
2    2   30    b
3    6   40    c
  • two dimensional.
  • all columns are of the same length.
  • elements may have different data types.
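A short sketch of how to inspect column types and filter rows. The name `df` is an assumption, and `stringsAsFactors = FALSE` is set explicitly because the default differs between R versions before and after 4.0.0:

```r
b <- c(1, 2, 6, 20, 30, 40, 100, 200, 300)
df <- data.frame(col1 = b[1:3],
                 col2 = b[4:6],
                 col3 = letters[1:3],
                 stringsAsFactors = FALSE)

# Each column is a vector with its own type.
sapply(df, class)

# Columns are accessed with $, rows can be filtered with a logical condition.
df$col3
df[df$col1 > 1, ]
```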

Lists

l <- list(element1 = a,
          element2 = m, 
          element3 = d)
l
$element1
[1] 1

$element2
     [,1] [,2] [,3]
[1,]    1   20  100
[2,]    2   30  200
[3,]    6   40  300

$element3
  col1 col2 col3
1    1   20    a
2    2   30    b
3    6   40    c
  • may be nested
  • may contain different data types
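A minimal sketch of accessing list elements; the list `nested` and its element names are made up for illustration:

```r
nested <- list(numbers = 1:3,
               inner = list(label = "x", flag = TRUE))

# $ and [[ ]] extract the element itself, and can be chained for nested lists.
nested$inner$label        # "x"
nested[["numbers"]][2]    # 2

# Single brackets return a (shorter) list, not the element.
class(nested["numbers"])    # "list"
class(nested[["numbers"]])  # "integer"
```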

Environments

e <- new.env()
e$cow <- "moooooo."
e$duck <- "quack."
e$dog <- "woof."

ls()
[1] "a" "b" "d" "e" "l" "m"
ls(e)
[1] "cow"  "dog"  "duck"
get("dog", envir = e)
[1] "woof."
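A few more base R helpers for working with environments, sketched on a fresh environment. Note the `inherits = FALSE` argument: without it, the lookup continues in the enclosing environments and would find the base function `cat`.

```r
e <- new.env()
e$dog <- "woof."

exists("dog", envir = e)                    # TRUE
exists("cat", envir = e, inherits = FALSE)  # FALSE: not defined in e itself

# assign() is the programmatic counterpart of e$cat <- "meow."
assign("cat", "meow.", envir = e)

# mget() fetches several objects at once and returns a named list.
mget(c("cat", "dog"), envir = e)
```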

Functions

name_of_the_function <- function(parameter_1, parameter_2){
  # function body 
  
  s <- parameter_1 + parameter_2
  
  # R does *not* need a return statement:
  # it returns the value of the last evaluated expression
  # (invisibly, if that expression is an assignment with `<-`)
  return(s)
  
  # NOTE an R function can only return ONE object (!)
}
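The two comments in the function body can be illustrated with a small sketch. The function names `add_implicit` and `sum_and_product` are hypothetical:

```r
# No return(): the value of the last evaluated expression is returned.
add_implicit <- function(x, y) {
  x + y
}

# A function returns ONE object, so bundle several results in a list.
sum_and_product <- function(x, y) {
  list(sum = x + y, product = x * y)
}

add_implicit(2, 3)        # 5

res <- sum_and_product(2, 3)
res$sum                   # 5
res$product               # 6
```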

Running Code

Script vs. Function – Call vs. Definition

set.seed(123)
d1 <- rnorm(1000)
d2 <- rnorm(1000)

d1_mean <- mean(d1)
d1_sd <- sd(d1)
d1_q <- quantile(d1)
desc_stats_d1 <- 
  list(d1_mean = d1_mean,
       d1_sd = d1_sd,
       d1_q = d1_q)

d2_mean <- mean(d2)
d2_sd <- sd(d2)
d2_q <- quantile(d2)
desc_stats_d2 <- 
  list(d2_mean = d2_mean,
       d2_sd = d2_sd,
       d2_q = d2_q)

create_basic_desc <- function(distr){
  out <- list(
    mean = mean(distr),
    sd = sd(distr),
    quantiles = quantile(distr)
  )
  out
}

create_basic_desc(d1)
$mean
[1] 0.01612787

$sd
[1] 0.991695

$quantiles
          0%          25%          50%          75%         100% 
-2.809774679 -0.628324243  0.009209639  0.664601867  3.241039935 

Documentation of Functions

#' Create Basic Descriptive Statistics
#'
#' Creates means, standard deviations and
#' default quantiles from a numeric input vector. 
#' 
#' @param distr numeric vector 
#' @export 
create_basic_desc <- function(distr){
  out <- list(
    mean = mean(distr),
    sd = sd(distr),
    quantiles = quantile(distr)
  )
  out
}
  • Roxygen documentation can be rendered to .Rd and .html
  • packages require documentation of every exported function
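A slightly fuller sketch of the same header. `@return` and `@examples` are standard roxygen2 tags, but the wording of the return description and the example call are assumptions, not part of the original slide:

```r
#' Create Basic Descriptive Statistics
#'
#' Creates means, standard deviations and
#' default quantiles from a numeric input vector.
#'
#' @param distr numeric vector
#' @return A named list with the elements `mean`, `sd` and `quantiles`.
#' @examples
#' create_basic_desc(rnorm(100))
#' @export
create_basic_desc <- function(distr){
  out <- list(
    mean = mean(distr),
    sd = sd(distr),
    quantiles = quantile(distr)
  )
  out
}

# Inside a package, the .Rd files in man/ are typically generated
# from these comments with roxygen2::roxygenise() or devtools::document().
```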

Naming

me: “How to name things?”

Google: “About 5’260’000’000 results (0.71 seconds)”

Whassssssup, i_am_a_snake!

frequently occurs in: file and folder names, function names in R and Python.

Howdie, JeSuisUnChameau!

frequently occurs in: class names (mostly UpperCamelCase).

Hello, i-am-a-kebap!

frequently occurs in: folder, file and repository names.

Functional Programming

ds_list <- list(iris = iris,
                mtcars = mtcars)
lapply(ds_list, summary)

Anonymous Functions

lapply(ds_list, function(x){
  sprintf(
    "The dataset contains %d observations.",
    nrow(x))
})
$iris
[1] "The dataset contains 150 observations."

$mtcars
[1] "The dataset contains 32 observations."
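`lapply()` always returns a list. When every result has the same type and length, `vapply()` is a stricter alternative that returns a plain vector. A minimal sketch; the name `n_obs` is illustrative:

```r
ds_list <- list(iris = iris,
                mtcars = mtcars)

# FUN.VALUE declares what each result must look like (here: one integer);
# vapply() fails loudly if a result does not match.
n_obs <- vapply(ds_list, function(x) nrow(x), FUN.VALUE = integer(1))

n_obs            # named integer vector: iris = 150, mtcars = 32
n_obs["mtcars"]
```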

Functional Programming

  • for more elaborate examples, such as function factories, read the Functional Programming chapters of Hadley Wickham’s book Advanced R.

  • Note how R is dynamically typed.
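A short sketch of what dynamic typing means in practice: neither variables nor function parameters declare a type, so the same name or function can hold or accept different kinds of objects. The helper `describe` is hypothetical:

```r
# The same variable can point to objects of different types over time.
x <- 1L
class(x)   # "integer"
x <- "one"
class(x)   # "character"

# The same function works on whatever it is handed.
describe <- function(obj) {
  sprintf("Got an object of class '%s' with length %d.",
          class(obj)[1], length(obj))
}

lapply(list(1:3, letters, mtcars), describe)
```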

tryCatch

Without Safety Net

do_numeric_op <- function(a,b){
  out <- (a + b)*a
  message(sprintf(
    "'%s', '%s' were your inputs.",
    a,b)
    )
  out
}


do_numeric_op(1,"a")

“Error in a + b : non-numeric argument to binary operator Calls: […]”

tryCatch

With Safety Net

do_numeric_op <- function(a,b){
  
  tryCatch({
    return((a + b)*a)
  
  }, error = function(e) message(
    "Operation went wrong, but function is alive."
    )
  )
  sprintf("'%s', '%s' were your inputs.", a,b)
}
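Calling the guarded version shows both paths. The sketch below repeats the definition and then adds a hypothetical variant, `do_numeric_op_na`, which is not from the slides: it reports the original error message via `conditionMessage()`, returns `NA` instead, and uses `finally` for code that runs in either case.

```r
do_numeric_op <- function(a,b){
  tryCatch({
    return((a + b)*a)
  }, error = function(e) message(
    "Operation went wrong, but function is alive."
    )
  )
  sprintf("'%s', '%s' were your inputs.", a,b)
}

do_numeric_op(1, 2)     # 3
do_numeric_op(1, "a")   # message, then "'1', 'a' were your inputs."

# Hypothetical variant: the handler's value becomes the result of tryCatch().
do_numeric_op_na <- function(a, b){
  tryCatch(
    (a + b) * a,
    error = function(e) {
      message("Caught: ", conditionMessage(e))
      NA_real_
    },
    finally = message("Done.")
  )
}

do_numeric_op_na(1, "a")  # NA
```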

Tips

  • document before you start, e.g., write a vignette outline
  • use pseudo code or comments to structure your code
  • do not comment on WHAT you do but WHY you do it
  • pair programming in the beginning (make sure to switch who’s in the driver’s seat)
  • reproducible research (include data)

Hands on: The AKQJT task

Packaging

Folder Structure

  • R: function definitions
  • man: references (functions and datasets)
  • docs: articles for GitHub Pages
  • vignettes: articles’ source, elaborate documentation source
  • inst: additional material, installed with package, e.g., sql files
  • src: source code that needs to be compiled, e.g., C++
  • tests: automated tests (basic idea in DevOps Carpentry), often created with a testing framework such as testthat or tinytest

Further Conventions & Ideas

  • In addition, every package needs a DESCRIPTION and a NAMESPACE file.
  • start with a sandbox.R file in the root folder (may be moved to inst/ later on).
  • move functions to the R/ folder once they get more mature (documented, tested)
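To get a feeling for the layout, a bare skeleton can be created with base R alone. This is a sketch under assumptions: the package name `mypkg` and all DESCRIPTION values are placeholders, and the skeleton is written to a temporary directory. The NAMESPACE file is left out here because it is commonly generated by roxygen2 from the `@export` tags.

```r
# Hypothetical package name, created in a throwaway location.
pkg_root <- file.path(tempdir(), "mypkg")

# Folders from the list above that are needed early on.
dir.create(file.path(pkg_root, "R"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg_root, "inst"), showWarnings = FALSE)

# Minimal DESCRIPTION file with placeholder values.
writeLines(c(
  "Package: mypkg",
  "Title: What the Package Does",
  "Version: 0.0.1",
  "Description: A longer description of what the package does.",
  "License: GPL-3"
), file.path(pkg_root, "DESCRIPTION"))

# Playground for experiments before functions move to R/.
file.create(file.path(pkg_root, "sandbox.R"))

list.files(pkg_root)
```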