4 min read

Loading multiple *.csv (*.xlsx) effectively

This brief post is a “how to” load multiple csv / xlsx files into single data frame. Say, we have a couple of *.xlsx files in one folder (some of them in sub-directories) and we would like to bind them all into one. Quite common task. Let´s do it..

library(tidyverse)
library(readxl)

Step 1

list.files function with argument full.names = TRUE (preserving complete path to all files) and argument recursive = TRUE is handy in case your files are in any level of sub-directories (represented with “next_level” folder in my case).

x <- list.files(path = "../../static/data/multiple", full.names = TRUE, recursive = TRUE, pattern = "*.xlsx")
x
## [1] "../../static/data/multiple/file_1.xlsx"           
## [2] "../../static/data/multiple/file_2.xlsx"           
## [3] "../../static/data/multiple/file_3.xlsx"           
## [4] "../../static/data/multiple/next_level/file_4.xlsx"

Now I have the x vector with all full paths to all "*.xlsx" files.

Step 2

map the vector of paths to function read_excel. Use argument sheets to specify the name of the sheet.

lst <- map(x, read_excel, sheet = "Value Assessment")
head(lst[[1]])
## # A tibble: 6 x 10
##   Value Count Pattern Length Form  Case  Content Spacing Category
##   <chr> <chr> <chr>   <chr>  <chr> <chr> <chr>   <chr>   <chr>   
## 1 0     126   N       1      Nume~ Not ~ Single  Valid   Questio~
## 2 0.01  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 3 0.33  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 4 0.36  3     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 5 0.39  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 6 0.44  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## # ... with 1 more variable: `Assessment Note` <chr>

These are first 5 records of first file.

Step 3

Use bind_rows to bind the list into one tibble (data frame). Argument .id allows you to create a column with the name of source file.

df <- lst %>% 
  # set names to list items first 
  set_names(x %>% basename) %>% 
  bind_rows(.id = "filename")
head(df)
## # A tibble: 6 x 11
##   filename Value Count Pattern Length Form  Case  Content Spacing Category
##   <chr>    <chr> <chr> <chr>   <chr>  <chr> <chr> <chr>   <chr>   <chr>   
## 1 file_1.~ 0     126   N       1      Nume~ Not ~ Single  Valid   Questio~
## 2 file_1.~ 0.01  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 3 file_1.~ 0.33  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 4 file_1.~ 0.36  3     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 5 file_1.~ 0.39  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 6 file_1.~ 0.44  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## # ... with 1 more variable: `Assessment Note` <chr>

Pipe pipe pipe

Let´s pipe all these steps into single chunk of code.

x <- list.files(path = "../../static/data/multiple", full.names = TRUE, recursive = TRUE, pattern = "*.xlsx") 
x %>% 
  map(read_excel, sheet = "Value Assessment") %>% 
  set_names(x %>% basename) %>% 
  bind_rows(.id = "filename") %>% 
  head
## # A tibble: 6 x 11
##   filename Value Count Pattern Length Form  Case  Content Spacing Category
##   <chr>    <chr> <chr> <chr>   <chr>  <chr> <chr> <chr>   <chr>   <chr>   
## 1 file_1.~ 0     126   N       1      Nume~ Not ~ Single  Valid   Questio~
## 2 file_1.~ 0.01  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 3 file_1.~ 0.33  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 4 file_1.~ 0.36  3     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 5 file_1.~ 0.39  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 6 file_1.~ 0.44  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## # ... with 1 more variable: `Assessment Note` <chr>

Actually, you can reduce it to two lines using map_df..Note that without setting the names.

list.files(path = "../../static/data/multiple", full.names = TRUE, recursive = TRUE, pattern = "*.xlsx") %>% 
  map_df(~read_excel(., sheet = "Value Assessment"), .id = "filename") %>% 
  head
## # A tibble: 6 x 11
##   filename Value Count Pattern Length Form  Case  Content Spacing Category
##   <chr>    <chr> <chr> <chr>   <chr>  <chr> <chr> <chr>   <chr>   <chr>   
## 1 1        0     126   N       1      Nume~ Not ~ Single  Valid   Questio~
## 2 1        0.01  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 3 1        0.33  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 4 1        0.36  3     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 5 1        0.39  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## 6 1        0.44  1     NPNN    4      Nume~ Not ~ Single  Valid   Appears~
## # ... with 1 more variable: `Assessment Note` <chr>

Pipes are cool, makes your life easier. Two lines of code, but very powerfull.