Tidying datasets

gather: from wide to long format
spread: from long to wide format
unite variables
separate variables
Summarizing what you know with gapminder
Resources

Throughout, we will use functions in library(tidyr) and library(dplyr).

`gather`: from wide to long format

Suppose you want to compare outcomes with 3 treatments, and your data look like this

data

## # A tibble: 5 x 3
##   Treat1 Treat2 Treat3
##    <dbl>  <dbl>  <dbl>
## 1 -1.60   0.983 -0.2  
## 2  0.409 -0.671 -0.022
## 3 -0.019 -0.313 -1.74 
## 4 -0.251  3.25   1.88 
## 5  0.306  2.06  -0.083

Some people would say that the data is in “wide format.”

Data in wide format aren’t convenient for running our analyses: if you want to run an ANOVA or Tukey, you need to have all the outcomes in one column, and the categories (treatments) in another column. This alternative formatting is called “long format”. You can go from wide to long format using gather in library(tidyr).

data %>% gather(key=treatment, value=outcome, Treat1, Treat2, Treat3)

## # A tibble: 15 x 2
##    treatment outcome
##    <chr>       <dbl>
##  1 Treat1     -1.60 
##  2 Treat1      0.409
##  3 Treat1     -0.019
##  4 Treat1     -0.251
##  5 Treat1      0.306
##  6 Treat2      0.983
##  7 Treat2     -0.671
##  8 Treat2     -0.313
##  9 Treat2      3.25 
## 10 Treat2      2.06 
## 11 Treat3     -0.2  
## 12 Treat3     -0.022
## 13 Treat3     -1.74 
## 14 Treat3      1.88 
## 15 Treat3     -0.083

The first argument in gather is for naming the new column that contains the categories (the key), the second one is for naming the column where the new outcomes will be stored (the value), and then you write the names of the columns that contain the outcomes you want to gather. An equivalent way of writing the same thing is:

data %>% gather(key=treatment, value=outcome, Treat1:Treat3)

## # A tibble: 15 x 2
##    treatment outcome
##    <chr>       <dbl>
##  1 Treat1     -1.60 
##  2 Treat1      0.409
##  3 Treat1     -0.019
##  4 Treat1     -0.251
##  5 Treat1      0.306
##  6 Treat2      0.983
##  7 Treat2     -0.671
##  8 Treat2     -0.313
##  9 Treat2      3.25 
## 10 Treat2      2.06 
## 11 Treat3     -0.2  
## 12 Treat3     -0.022
## 13 Treat3     -1.74 
## 14 Treat3      1.88 
## 15 Treat3     -0.083

In Treat1:Treat3 we gave R a range of columns which we want to gather (first to last). This is useful if you have many variables.

What if your data is in wide format, but you have an uneven number of observations? That is, your data looks something like this

uneven

## # A tibble: 5 x 3
##    Treat1 Treat2  Treat3
##     <dbl>  <dbl>   <dbl>
## 1  -1.60   0.983  -0.2  
## 2   0.409 -0.671  -0.022
## 3  -0.019 -0.313  -1.74 
## 4  -0.251  3.25   NA    
## 5  NA      2.06   NA

Let’s try to gather:

uneven %>% gather(key = treatment, value = outcome, Treat1:Treat3)

## # A tibble: 15 x 2
##    treatment outcome
##    <chr>       <dbl>
##  1 Treat1     -1.60 
##  2 Treat1      0.409
##  3 Treat1     -0.019
##  4 Treat1     -0.251
##  5 Treat1     NA    
##  6 Treat2      0.983
##  7 Treat2     -0.671
##  8 Treat2     -0.313
##  9 Treat2      3.25 
## 10 Treat2      2.06 
## 11 Treat3     -0.2  
## 12 Treat3     -0.022
## 13 Treat3     -1.74 
## 14 Treat3     NA    
## 15 Treat3     NA

Unfortunately, we get some NAs. We can get rid of them with na.omit:

uneven %>% gather(key = treatment, value = outcome, Treat1:Treat3) %>% 
           na.omit

## # A tibble: 12 x 2
##    treatment outcome
##    <chr>       <dbl>
##  1 Treat1     -1.60 
##  2 Treat1      0.409
##  3 Treat1     -0.019
##  4 Treat1     -0.251
##  5 Treat2      0.983
##  6 Treat2     -0.671
##  7 Treat2     -0.313
##  8 Treat2      3.25 
##  9 Treat2      2.06 
## 10 Treat3     -0.2  
## 11 Treat3     -0.022
## 12 Treat3     -1.74

`spread`: from long to wide format

If you want to go from long to wide format, you can use spread.

For example, if your data are

data2

## # A tibble: 15 x 3
## # Groups:   treatment [3]
##    treatment outcome   ind
##    <chr>       <dbl> <int>
##  1 Treat1     -1.60      1
##  2 Treat1      0.409     2
##  3 Treat1     -0.019     3
##  4 Treat1     -0.251     4
##  5 Treat1      0.306     5
##  6 Treat2      0.983     1
##  7 Treat2     -0.671     2
##  8 Treat2     -0.313     3
##  9 Treat2      3.25      4
## 10 Treat2      2.06      5
## 11 Treat3     -0.2       1
## 12 Treat3     -0.022     2
## 13 Treat3     -1.74      3
## 14 Treat3      1.88      4
## 15 Treat3     -0.083     5

You can convert it to wide format as follows

data2 %>% spread(treatment, outcome)

## # A tibble: 5 x 4
##     ind Treat1 Treat2 Treat3
##   <int>  <dbl>  <dbl>  <dbl>
## 1     1 -1.60   0.983 -0.2  
## 2     2  0.409 -0.671 -0.022
## 3     3 -0.019 -0.313 -1.74 
## 4     4 -0.251  3.25   1.88 
## 5     5  0.306  2.06  -0.083

Note that data2 isn’t just our dataset that came out of gathering. In fact, if you start with

gath = data %>% gather(key=treatment, value=outcome, Treat1:Treat3)

and you try to spread, you’ll get an error. R complains because the rows of gath aren’t uniquely identifiable. A way to get around that is creating index variables within the treatments

gath = gath %>% group_by(treatment) %>% mutate(id=row_number())
gath

## # A tibble: 15 x 3
## # Groups:   treatment [3]
##    treatment outcome    id
##    <chr>       <dbl> <int>
##  1 Treat1     -1.60      1
##  2 Treat1      0.409     2
##  3 Treat1     -0.019     3
##  4 Treat1     -0.251     4
##  5 Treat1      0.306     5
##  6 Treat2      0.983     1
##  7 Treat2     -0.671     2
##  8 Treat2     -0.313     3
##  9 Treat2      3.25      4
## 10 Treat2      2.06      5
## 11 Treat3     -0.2       1
## 12 Treat3     -0.022     2
## 13 Treat3     -1.74      3
## 14 Treat3      1.88      4
## 15 Treat3     -0.083     5

and then, you can spread (and get rid of id):

gath %>% spread(treatment, outcome) %>% select(-id)

## # A tibble: 5 x 3
##   Treat1 Treat2 Treat3
##    <dbl>  <dbl>  <dbl>
## 1 -1.60   0.983 -0.2  
## 2  0.409 -0.671 -0.022
## 3 -0.019 -0.313 -1.74 
## 4 -0.251  3.25   1.88 
## 5  0.306  2.06  -0.083

`unite` variables

Suppose you have the following dataset

drinks

## # A tibble: 6 x 3
##   spirit         mixer        total
##   <chr>          <chr>        <dbl>
## 1 gin            tonic            8
## 2 sparkling wine orange juice     6
## 3 rum            coke             7
## 4 red wine       coke             5
## 5 <NA>           seltzer          5
## 6 beer           <NA>             5

If you would like to create a new column that merges the information in spirit and mixer, you can use the command unite.

For example, you can try

drinks %>% unite(name, spirit, mixer, sep=' & ')

## # A tibble: 6 x 2
##   name                          total
##   <chr>                         <dbl>
## 1 gin & tonic                       8
## 2 sparkling wine & orange juice     6
## 3 rum & coke                        7
## 4 red wine & coke                   5
## 5 NA & seltzer                      5
## 6 beer & NA                         5

Unfortunately, it doesn’t look great because there are NAs. You can substitute the NAs by nothing using coalesce to make the output look a little nicer:

new = drinks %>% 
        mutate(spirit=coalesce(spirit, "nothing"), mixer=coalesce(mixer, "nothing")) %>%
        unite(name, spirit, mixer, sep=' & ') %>% 
        mutate(name=trimws(name))
new

## # A tibble: 6 x 2
##   name                          total
##   <chr>                         <dbl>
## 1 gin & tonic                       8
## 2 sparkling wine & orange juice     6
## 3 rum & coke                        7
## 4 red wine & coke                   5
## 5 nothing & seltzer                 5
## 6 beer & nothing                    5

The function trimws got rid of some empty spaces at the beginning and end of name.

If you are willing to work a little bit more, you can make the new variable look even nicer:

final = new %>% mutate(name1 = gsub("& nothing","",name),
                       name = trimws(gsub("nothing &","",name1))) %>%
                       select(-name1)
final

## # A tibble: 6 x 2
##   name                          total
##   <chr>                         <dbl>
## 1 gin & tonic                       8
## 2 sparkling wine & orange juice     6
## 3 rum & coke                        7
## 4 red wine & coke                   5
## 5 seltzer                           5
## 6 beer                              5

We used the function gsub to substitute & nothing and nothing & by (literally) nothing.

`separate` variables

You can split up variables using separate. For example, if you start with

## # A tibble: 6 x 2
##   name                          total
##   <chr>                         <dbl>
## 1 gin & tonic                       8
## 2 sparkling wine & orange juice     6
## 3 rum & coke                        7
## 4 red wine & coke                   5
## 5 nothing & seltzer                 5
## 6 beer & nothing                    5

We can split up name into spirit and mixer with

new %>% separate(name, into=c("spirit", "mixer"), sep=" & ")

## # A tibble: 6 x 3
##   spirit         mixer        total
##   <chr>          <chr>        <dbl>
## 1 gin            tonic            8
## 2 sparkling wine orange juice     6
## 3 rum            coke             7
## 4 red wine       coke             5
## 5 nothing        seltzer          5
## 6 beer           nothing          5

If you use the dataset final instead, you would have to do more work to separate correctly.

Summarizing what you know with `gapminder`

The dataset gapminder in library(gapminder) has information on life expectancy, GDP per capita, and population for some countries. The variables are

str(gapminder)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

We can find the average life expectancy by continent and year with

out = gapminder %>% 
  group_by(continent, year) %>% 
  summarize(avglife=mean(lifeExp))
out

## # A tibble: 60 x 3
## # Groups:   continent [?]
##    continent  year avglife
##    <fct>     <int>   <dbl>
##  1 Africa     1952    39.1
##  2 Africa     1957    41.3
##  3 Africa     1962    43.3
##  4 Africa     1967    45.3
##  5 Africa     1972    47.5
##  6 Africa     1977    49.6
##  7 Africa     1982    51.6
##  8 Africa     1987    53.3
##  9 Africa     1992    53.6
## 10 Africa     1997    53.6
## # ... with 50 more rows

And you can plot the trends as follows:

ggplot(out, aes(x=year, y=avglife, color=continent, linetype=continent))+geom_line(size=2)

What’s going on in Oceania? Let’s find out:

out2 = gapminder %>%
  filter(continent == "Oceania")
ggplot(out2, aes(x=year, y=lifeExp, color=country)) + geom_line(size=1.2)

Let’s do something a little more difficult. Let’s try to find the population differential betyouen 2007 and 1957 (range of the data) and sort the data in ascending order by the population differential. A way to do this is

gapminder %>% select(country, year, pop) %>%
          filter(year==1952|year==2007)  %>% 
          spread(key=year, value=pop) %>%
          mutate(diffPop = `2007`-`1952`) %>%
          select(country, diffPop) %>%
          arrange(diffPop)

## # A tibble: 142 x 2
##    country               diffPop
##    <fct>                   <int>
##  1 Bulgaria                47958
##  2 Sao Tome and Principe  139568
##  3 Iceland                153969
##  4 Montenegro             270902
##  5 Equatorial Guinea      334237
##  6 Trinidad and Tobago    393758
##  7 Djibouti               433225
##  8 Hungary                452108
##  9 Slovenia               519727
## 10 Reunion                540394
## # ... with 132 more rows

Notice the backticks in the variables when you define diffPop.

Tidying datasets

Víctor Peña

gather: from wide to long format

spread: from long to wide format

unite variables

separate variables

Summarizing what you know with gapminder

Resources

`gather`: from wide to long format

`spread`: from long to wide format

`unite` variables

`separate` variables

Summarizing what you know with `gapminder`