Motivation

I don’t mind for loops in R—they’re easy to write, easy to read, and can be very fast (e.g. if the output vector or matrix is pre-allocated). But they aren’t very “elegant” (no pipes?!?!) and they require a lot of keystrokes. So I recently decided to look for a go-to alternative.

The apply family is great for replacing for loops, but you often end up with a bunch of lapply’s, sapply’s, rbind’s, and so on, which gets ugly pretty fast.

The pmap function in purrr (Henry and Wickham 2017) is almost exactly what I was looking for. You specify a function and a “list of lists” of parameter values to loop over, and it does it. But I wanted slightly different functionality, namely:

  1. Loop over all combinations of input values (as opposed to 1st set, 2nd set, …)
  2. Run multiple trials for each combination
  3. Avoid defining lists in the function call.

So I wrote a function called iterate and added it to my dvmisc package (Van Domelen 2018). You can install dvmisc from GitHub by loading devtools and running install_github("vandomed/dvmisc").

Iteration

Suppose you have a function called f.reg that fits a (possibly log-transformed) linear regression relating user-specified \(Y\) and \(X\) variables in a particular dataset (default mtcars):

f.reg <- function(data = mtcars, y, x, log = FALSE) {
  if (log) {
    formula <- paste("log(", y, ") ~ ", x, sep = "")
  } else {
    formula <- paste(y, " ~ ", x, sep = "")
  }
  summary.fit <- summary(lm(formula, data = data))
  return(list(r2 = summary.fit$r.squared, p = summary.fit$coef[2, 4]))
}

And suppose you want to regress gas mileage (mpg) on 3 separate predictors (wt = weight, hp = horsepower, qsec = 1/4 mile time), with and without log-transforming mpg in each case. To fit these 6 models via iterate, you can run:

library("dvmisc")
iterate(f.reg, y = "mpg", x = c("wt", "hp", "qsec"), log = c(TRUE, FALSE))
##      y    x   log     r2         p
## 1: mpg   wt  TRUE 0.7976  6.31e-12
## 2: mpg   wt FALSE 0.7528 1.294e-10
## 3: mpg   hp  TRUE 0.6233 7.853e-08
## 4: mpg   hp FALSE 0.6024 1.788e-07
## 5: mpg qsec  TRUE 0.1728   0.01796
## 6: mpg qsec FALSE 0.1753   0.01708

Or with pipes:

f.reg %>% iterate(y = "mpg", x = c("wt", "hp", "qsec"), log = c(TRUE, FALSE))
##      y    x   log     r2         p
## 1: mpg   wt  TRUE 0.7976  6.31e-12
## 2: mpg   wt FALSE 0.7528 1.294e-10
## 3: mpg   hp  TRUE 0.6233 7.853e-08
## 4: mpg   hp FALSE 0.6024 1.788e-07
## 5: mpg qsec  TRUE 0.1728   0.01796
## 6: mpg qsec FALSE 0.1753   0.01708

You could get the same thing with pmap, but it takes a little more work, and this particular solution doesn’t make it obvious what each set of inputs is:

library("purrr")
do.call(rbind, pmap(.l = list(y = rep("mpg", 6), 
                              x = rep(c("wt", "hp", "qsec"), each = 2), 
                              log = rep(c(TRUE, FALSE), 3)), 
                    .f = f.reg))
##      r2     p        
## [1,] 0.7976 6.31e-12 
## [2,] 0.7528 1.294e-10
## [3,] 0.6233 7.853e-08
## [4,] 0.6024 1.788e-07
## [5,] 0.1728 0.01796  
## [6,] 0.1753 0.01708

You can also pass non-scalar inputs that you want iterate to treat as fixed through the fix argument. For example, to fit Sepal.Length vs. Sepal.Width and Petal.Length in the iris dataset:

f.reg %>% iterate(y = "Sepal.Length", x = c("Sepal.Width", "Petal.Length"), 
                  fix = list(data = iris))
##               y            x      r2         p
## 1: Sepal.Length  Sepal.Width 0.01382    0.1519
## 2: Sepal.Length Petal.Length    0.76 1.039e-47

Simulation

For simulations, you typically want to run multiple trials of several different scenarios. You can do that with iterate pretty easily. Suppose you want to calculate empirical power of the one-sample t-test in various scenarios. Your “one-trial” function might be something like:

f.power <- function(n = 50, mu = 0, sigsq = 1, mu0 = 0, alpha = 0.05) {
  x <- rnorm(n = n, mean = mu, sd = sqrt(sigsq))
  fit <- t.test(x, mu = mu0)
  return(list(t = fit$statistic, p = fit$p.value))
}

To run 2 trials for three different true means and two different sample sizes:

f.power %>% iterate(mu = c(0.1, 0.25, 0.5), n = c(25, 50),  trials = 2)
##       mu  n      t         p
##  1: 0.10 25 -1.025    0.3158
##  2: 0.10 25 0.5205    0.6075
##  3: 0.10 50  1.314    0.1951
##  4: 0.10 50 0.7288    0.4696
##  5: 0.25 25 0.6227    0.5394
##  6: 0.25 25  3.049  0.005525
##  7: 0.25 50  1.716   0.09254
##  8: 0.25 50  1.427    0.1601
##  9: 0.50 25 0.3716    0.7134
## 10: 0.50 25  2.414   0.02378
## 11: 0.50 50   3.22   0.00228
## 12: 0.50 50  4.774 1.673e-05

Ramping it up to 1,000 trials each, and actually calculate the empirical power afterwards:

f.power %>% iterate(mu = c(0.1, 0.25, 0.5), n = c(25, 50), trials = 1000) %>%
  dplyr::group_by(mu, n) %>% 
  dplyr::summarise(empirical.power = mean(p < 0.05))
## # A tibble: 6 x 3
## # Groups:   mu [?]
##      mu     n empirical.power
##   <dbl> <dbl>           <dbl>
## 1 0.100  25.0          0.0700
## 2 0.100  50.0          0.111 
## 3 0.250  25.0          0.225 
## 4 0.250  50.0          0.390 
## 5 0.500  25.0          0.660 
## 6 0.500  50.0          0.929

Miraculously, we find that power increases with sample size and effect size. Biometrics here we come!

Referation (that was a stretch)

Henry, Lionel, and Hadley Wickham. 2017. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.

Van Domelen, Dane R. 2018. Dvmisc: Convenience Functions, Moving Window Statistics, and Graphics.