Throughout, we will use functions in library(tidyr)
and library(dplyr)
.
gather
: from wide to long formatSuppose you want to compare outcomes with 3 treatments, and your data look like this
data
## # A tibble: 5 x 3
## Treat1 Treat2 Treat3
## <dbl> <dbl> <dbl>
## 1 -1.60 0.983 -0.2
## 2 0.409 -0.671 -0.022
## 3 -0.019 -0.313 -1.74
## 4 -0.251 3.25 1.88
## 5 0.306 2.06 -0.083
Some people would say that the data is in “wide format.”
Data in wide format aren’t convenient for running our analyses: if you want to run an ANOVA or Tukey, you need to have all the outcomes in one column, and the categories (treatments) in another column. This alternative formatting is called “long format”. You can go from wide to long format using gather
in library(tidyr)
.
data %>% gather(key=treatment, value=outcome, Treat1, Treat2, Treat3)
## # A tibble: 15 x 2
## treatment outcome
## <chr> <dbl>
## 1 Treat1 -1.60
## 2 Treat1 0.409
## 3 Treat1 -0.019
## 4 Treat1 -0.251
## 5 Treat1 0.306
## 6 Treat2 0.983
## 7 Treat2 -0.671
## 8 Treat2 -0.313
## 9 Treat2 3.25
## 10 Treat2 2.06
## 11 Treat3 -0.2
## 12 Treat3 -0.022
## 13 Treat3 -1.74
## 14 Treat3 1.88
## 15 Treat3 -0.083
The first argument in gather
is for naming the new column that contains the categories (the key
), the second one is for naming the column where the new outcomes will be stored (the value
), and then you write the names of the columns that contain the outcomes you want to gather
. An equivalent way of writing the same thing is:
data %>% gather(key=treatment, value=outcome, Treat1:Treat3)
## # A tibble: 15 x 2
## treatment outcome
## <chr> <dbl>
## 1 Treat1 -1.60
## 2 Treat1 0.409
## 3 Treat1 -0.019
## 4 Treat1 -0.251
## 5 Treat1 0.306
## 6 Treat2 0.983
## 7 Treat2 -0.671
## 8 Treat2 -0.313
## 9 Treat2 3.25
## 10 Treat2 2.06
## 11 Treat3 -0.2
## 12 Treat3 -0.022
## 13 Treat3 -1.74
## 14 Treat3 1.88
## 15 Treat3 -0.083
In Treat1:Treat3
we gave R
a range of columns which we want to gather
(first to last). This is useful if you have many variables.
What if your data is in wide format, but you have an uneven number of observations? That is, your data looks something like this
uneven
## # A tibble: 5 x 3
## Treat1 Treat2 Treat3
## <dbl> <dbl> <dbl>
## 1 -1.60 0.983 -0.2
## 2 0.409 -0.671 -0.022
## 3 -0.019 -0.313 -1.74
## 4 -0.251 3.25 NA
## 5 NA 2.06 NA
Let’s try to gather
:
uneven %>% gather(key = treatment, value = outcome, Treat1:Treat3)
## # A tibble: 15 x 2
## treatment outcome
## <chr> <dbl>
## 1 Treat1 -1.60
## 2 Treat1 0.409
## 3 Treat1 -0.019
## 4 Treat1 -0.251
## 5 Treat1 NA
## 6 Treat2 0.983
## 7 Treat2 -0.671
## 8 Treat2 -0.313
## 9 Treat2 3.25
## 10 Treat2 2.06
## 11 Treat3 -0.2
## 12 Treat3 -0.022
## 13 Treat3 -1.74
## 14 Treat3 NA
## 15 Treat3 NA
Unfortunately, we get some NA
s. We can get rid of them with na.omit
:
uneven %>% gather(key = treatment, value = outcome, Treat1:Treat3) %>%
na.omit
## # A tibble: 12 x 2
## treatment outcome
## <chr> <dbl>
## 1 Treat1 -1.60
## 2 Treat1 0.409
## 3 Treat1 -0.019
## 4 Treat1 -0.251
## 5 Treat2 0.983
## 6 Treat2 -0.671
## 7 Treat2 -0.313
## 8 Treat2 3.25
## 9 Treat2 2.06
## 10 Treat3 -0.2
## 11 Treat3 -0.022
## 12 Treat3 -1.74
spread
: from long to wide formatIf you want to go from long to wide format, you can use spread
.
For example, if your data are
data2
## # A tibble: 15 x 3
## # Groups: treatment [3]
## treatment outcome ind
## <chr> <dbl> <int>
## 1 Treat1 -1.60 1
## 2 Treat1 0.409 2
## 3 Treat1 -0.019 3
## 4 Treat1 -0.251 4
## 5 Treat1 0.306 5
## 6 Treat2 0.983 1
## 7 Treat2 -0.671 2
## 8 Treat2 -0.313 3
## 9 Treat2 3.25 4
## 10 Treat2 2.06 5
## 11 Treat3 -0.2 1
## 12 Treat3 -0.022 2
## 13 Treat3 -1.74 3
## 14 Treat3 1.88 4
## 15 Treat3 -0.083 5
You can convert it to wide format as follows
data2 %>% spread(treatment, outcome)
## # A tibble: 5 x 4
## ind Treat1 Treat2 Treat3
## <int> <dbl> <dbl> <dbl>
## 1 1 -1.60 0.983 -0.2
## 2 2 0.409 -0.671 -0.022
## 3 3 -0.019 -0.313 -1.74
## 4 4 -0.251 3.25 1.88
## 5 5 0.306 2.06 -0.083
Note that data2
isn’t just our dataset that came out of gather
ing. In fact, if you start with
gath = data %>% gather(key=treatment, value=outcome, Treat1:Treat3)
and you try to spread
, you’ll get an error. R
complains because the rows of gath
aren’t uniquely identifiable. A way to get around that is creating index variables within the treatments
gath = gath %>% group_by(treatment) %>% mutate(id=row_number())
gath
## # A tibble: 15 x 3
## # Groups: treatment [3]
## treatment outcome id
## <chr> <dbl> <int>
## 1 Treat1 -1.60 1
## 2 Treat1 0.409 2
## 3 Treat1 -0.019 3
## 4 Treat1 -0.251 4
## 5 Treat1 0.306 5
## 6 Treat2 0.983 1
## 7 Treat2 -0.671 2
## 8 Treat2 -0.313 3
## 9 Treat2 3.25 4
## 10 Treat2 2.06 5
## 11 Treat3 -0.2 1
## 12 Treat3 -0.022 2
## 13 Treat3 -1.74 3
## 14 Treat3 1.88 4
## 15 Treat3 -0.083 5
and then, you can spread
(and get rid of id
):
gath %>% spread(treatment, outcome) %>% select(-id)
## # A tibble: 5 x 3
## Treat1 Treat2 Treat3
## <dbl> <dbl> <dbl>
## 1 -1.60 0.983 -0.2
## 2 0.409 -0.671 -0.022
## 3 -0.019 -0.313 -1.74
## 4 -0.251 3.25 1.88
## 5 0.306 2.06 -0.083
unite
variablesSuppose you have the following dataset
drinks
## # A tibble: 6 x 3
## spirit mixer total
## <chr> <chr> <dbl>
## 1 gin tonic 8
## 2 sparkling wine orange juice 6
## 3 rum coke 7
## 4 red wine coke 5
## 5 <NA> seltzer 5
## 6 beer <NA> 5
If you would like to create a new column that merges the information in spirit
and mixer
, you can use the command unite
.
For example, you can try
drinks %>% unite(name, spirit, mixer, sep=' & ')
## # A tibble: 6 x 2
## name total
## <chr> <dbl>
## 1 gin & tonic 8
## 2 sparkling wine & orange juice 6
## 3 rum & coke 7
## 4 red wine & coke 5
## 5 NA & seltzer 5
## 6 beer & NA 5
Unfortunately, it doesn’t look great because there are NA
s. You can substitute the NA
s by nothing
using coalesce
to make the output look a little nicer:
new = drinks %>%
mutate(spirit=coalesce(spirit, "nothing"), mixer=coalesce(mixer, "nothing")) %>%
unite(name, spirit, mixer, sep=' & ') %>%
mutate(name=trimws(name))
new
## # A tibble: 6 x 2
## name total
## <chr> <dbl>
## 1 gin & tonic 8
## 2 sparkling wine & orange juice 6
## 3 rum & coke 7
## 4 red wine & coke 5
## 5 nothing & seltzer 5
## 6 beer & nothing 5
The function trimws
got rid of some empty spaces at the beginning and end of name
.
If you are willing to work a little bit more, you can make the new variable look even nicer:
final = new %>% mutate(name1 = gsub("& nothing","",name),
name = trimws(gsub("nothing &","",name1))) %>%
select(-name1)
final
## # A tibble: 6 x 2
## name total
## <chr> <dbl>
## 1 gin & tonic 8
## 2 sparkling wine & orange juice 6
## 3 rum & coke 7
## 4 red wine & coke 5
## 5 seltzer 5
## 6 beer 5
We used the function gsub
to substitute & nothing
and nothing &
by (literally) nothing.
separate
variablesYou can split up variables using separate
. For example, if you start with
## # A tibble: 6 x 2
## name total
## <chr> <dbl>
## 1 gin & tonic 8
## 2 sparkling wine & orange juice 6
## 3 rum & coke 7
## 4 red wine & coke 5
## 5 nothing & seltzer 5
## 6 beer & nothing 5
We can split up name
into spirit
and mixer
with
new %>% separate(name, into=c("spirit", "mixer"), sep=" & ")
## # A tibble: 6 x 3
## spirit mixer total
## <chr> <chr> <dbl>
## 1 gin tonic 8
## 2 sparkling wine orange juice 6
## 3 rum coke 7
## 4 red wine coke 5
## 5 nothing seltzer 5
## 6 beer nothing 5
If you use the dataset final
instead, you would have to do more work to separate
correctly.
gapminder
The dataset gapminder
in library(gapminder)
has information on life expectancy, GDP per capita, and population for some countries. The variables are
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
We can find the average life expectancy by continent and year with
out = gapminder %>%
group_by(continent, year) %>%
summarize(avglife=mean(lifeExp))
out
## # A tibble: 60 x 3
## # Groups: continent [?]
## continent year avglife
## <fct> <int> <dbl>
## 1 Africa 1952 39.1
## 2 Africa 1957 41.3
## 3 Africa 1962 43.3
## 4 Africa 1967 45.3
## 5 Africa 1972 47.5
## 6 Africa 1977 49.6
## 7 Africa 1982 51.6
## 8 Africa 1987 53.3
## 9 Africa 1992 53.6
## 10 Africa 1997 53.6
## # ... with 50 more rows
And you can plot the trends as follows:
ggplot(out, aes(x=year, y=avglife, color=continent, linetype=continent))+geom_line(size=2)
What’s going on in Oceania? Let’s find out:
out2 = gapminder %>%
filter(continent == "Oceania")
ggplot(out2, aes(x=year, y=lifeExp, color=country)) + geom_line(size=1.2)
Let’s do something a little more difficult. Let’s try to find the population differential betyouen 2007 and 1957 (range of the data) and sort the data in ascending order by the population differential. A way to do this is
gapminder %>% select(country, year, pop) %>%
filter(year==1952|year==2007) %>%
spread(key=year, value=pop) %>%
mutate(diffPop = `2007`-`1952`) %>%
select(country, diffPop) %>%
arrange(diffPop)
## # A tibble: 142 x 2
## country diffPop
## <fct> <int>
## 1 Bulgaria 47958
## 2 Sao Tome and Principe 139568
## 3 Iceland 153969
## 4 Montenegro 270902
## 5 Equatorial Guinea 334237
## 6 Trinidad and Tobago 393758
## 7 Djibouti 433225
## 8 Hungary 452108
## 9 Slovenia 519727
## 10 Reunion 540394
## # ... with 132 more rows
Notice the backticks in the variables when you define diffPop
.