In lecture, we learned about left_join
, right_join
, inner_join
, and full_join
. Our starting point is 2 datasets a
and b
, which have a variable in common named x
. We wish to merge the datasets by matching values of x
.
left_join(a, b, by='x')
: Keep all the rows in a
, find rows in dataset b
that have matching values of x
in a
, and include the extra columns in b
that don’t appear in a
. Don’t include rows in b
that have no matching value of x
in b
.
right_join(a, b, by='x')
: Keep all the rows in b
, find rows in dataset a
that have matching values of x
in b
, and include the extra columns in a
that don’t appear in b
. Don’t include rows in a
that have no matching value of x
in b
.
inner_join(a, b, by='x')
: Keep rows that have values of x
that appear in both a
and b
.
full_join(a, b, by='x')
: Keep all the rows in a
and b
, even if they don’t have matching values in x
.
To make this concrete, let’s work with a specific example. The tibbles
below contain names of drinks:
nonalcoholic = tibble(name=c("g&t","mimosa","rum and coke","calimocho","polar"),
mixer=c("tonic","orange juice", "coke", "coke","seltzer"))
alcoholic = tibble(name=c("g&t","mimosa","rum and coke","calimocho","IPA","dark and stormy"),
spirit=c("gin","sparkling wine", "rum", "red wine","beer","rum"))
Let’s try out what happens when we try the different join
s on the tibble
s:
left_join(nonalcoholic, alcoholic, by='name')
## # A tibble: 5 x 3
## name mixer spirit
## <chr> <chr> <chr>
## 1 g&t tonic gin
## 2 mimosa orange juice sparkling wine
## 3 rum and coke coke rum
## 4 calimocho coke red wine
## 5 polar seltzer <NA>
right_join(nonalcoholic, alcoholic, by='name')
## # A tibble: 6 x 3
## name mixer spirit
## <chr> <chr> <chr>
## 1 g&t tonic gin
## 2 mimosa orange juice sparkling wine
## 3 rum and coke coke rum
## 4 calimocho coke red wine
## 5 IPA <NA> beer
## 6 dark and stormy <NA> rum
inner_join(nonalcoholic, alcoholic, by='name')
## # A tibble: 4 x 3
## name mixer spirit
## <chr> <chr> <chr>
## 1 g&t tonic gin
## 2 mimosa orange juice sparkling wine
## 3 rum and coke coke rum
## 4 calimocho coke red wine
full_join(nonalcoholic, alcoholic, by='name')
## # A tibble: 7 x 3
## name mixer spirit
## <chr> <chr> <chr>
## 1 g&t tonic gin
## 2 mimosa orange juice sparkling wine
## 3 rum and coke coke rum
## 4 calimocho coke red wine
## 5 polar seltzer <NA>
## 6 IPA <NA> beer
## 7 dark and stormy <NA> rum
We can also use join
s for filtering. For example, semi_join(a, b, by='x')
keeps all the rows in a
that have a matching value of x
in b
(it doesn’t create more columns with the information in b
), and anti_join(a, b, by='x')
keeps the rows in a
that don’t have a matching value of x
in b
:
semi_join(nonalcoholic, alcoholic, by='name')
## # A tibble: 4 x 2
## name mixer
## <chr> <chr>
## 1 g&t tonic
## 2 mimosa orange juice
## 3 rum and coke coke
## 4 calimocho coke
anti_join(nonalcoholic, alcoholic, by='name')
## # A tibble: 1 x 2
## name mixer
## <chr> <chr>
## 1 polar seltzer
Suppose we get prices for the drinks:
nonalcoholic = nonalcoholic %>% mutate(price=c(8, 6, 7, 5, 5))
alcoholic = alcoholic %>% mutate('$'=c(8, 6, 7, 5, 5, NA))
How can we join the new tibbles
nicely?
First, we can rename the price variable in alcoholic
:
alcoholic = alcoholic %>% rename(price='$')
And then, we can full_join
the tibbles
by their name
and price
(we want a double match):
join = full_join(nonalcoholic, alcoholic, by=c('name','price'))
join
## # A tibble: 7 x 4
## name mixer price spirit
## <chr> <chr> <dbl> <chr>
## 1 g&t tonic 8 gin
## 2 mimosa orange juice 6 sparkling wine
## 3 rum and coke coke 7 rum
## 4 calimocho coke 5 red wine
## 5 polar seltzer 5 <NA>
## 6 IPA <NA> 5 beer
## 7 dark and stormy <NA> NA rum
If we want to reorder the columns so that the spirit comes first, followed by the mixer, we can use select
:
join = join %>% select(name, spirit, mixer, price)
join
## # A tibble: 7 x 4
## name spirit mixer price
## <chr> <chr> <chr> <dbl>
## 1 g&t gin tonic 8
## 2 mimosa sparkling wine orange juice 6
## 3 rum and coke rum coke 7
## 4 calimocho red wine coke 5
## 5 polar <NA> seltzer 5
## 6 IPA beer <NA> 5
## 7 dark and stormy rum <NA> NA
Now, suppose we got a new tibble
with the extra components that were missing for dark and stormy
ds = tibble(name='dark and stormy', spirit='ginger beer', price = 8)
How can we incorporate this new information?
We can start by full_join
ing
full_join(join, ds, by='name')
## # A tibble: 7 x 6
## name spirit.x mixer price.x spirit.y price.y
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 g&t gin tonic 8 <NA> NA
## 2 mimosa sparkling wine orange juice 6 <NA> NA
## 3 rum and coke rum coke 7 <NA> NA
## 4 calimocho red wine coke 5 <NA> NA
## 5 polar <NA> seltzer 5 <NA> NA
## 6 IPA beer <NA> 5 <NA> NA
## 7 dark and stormy rum <NA> NA ginger beer 8
This join
has a few problems. We’d like to combine price.x
and price.y
. We can do that using the function coalesce
, and then getting rid of price.x
and price.y
:
full_join(join, ds, by='name') %>%
mutate(price=coalesce(price.x,price.y)) %>% select(-price.x, -price.y)
## # A tibble: 7 x 5
## name spirit.x mixer spirit.y price
## <chr> <chr> <chr> <chr> <dbl>
## 1 g&t gin tonic <NA> 8
## 2 mimosa sparkling wine orange juice <NA> 6
## 3 rum and coke rum coke <NA> 7
## 4 calimocho red wine coke <NA> 5
## 5 polar <NA> seltzer <NA> 5
## 6 IPA beer <NA> <NA> 5
## 7 dark and stormy rum <NA> ginger beer 8
This is mostly fine. If we want to combine spirit.x
and spirit.y
:
full_join(join, ds, by='name') %>%
mutate(price=coalesce(price.x,price.y)) %>% select(-price.x, -price.y) %>%
replace(is.na(.),"") %>%
unite(spirit, spirit.x, spirit.y, sep="")
## # A tibble: 7 x 4
## name spirit mixer price
## <chr> <chr> <chr> <dbl>
## 1 g&t gin tonic 8
## 2 mimosa sparkling wine orange juice 6
## 3 rum and coke rum coke 7
## 4 calimocho red wine coke 5
## 5 polar "" seltzer 5
## 6 IPA beer "" 5
## 7 dark and stormy rumginger beer "" 8
First, we replaced the NAs
by empty spaces, and then we joined the variables spirit.x
, and spirit.y
using unite
. The formatting isn’t great, unforunately. We can play around a little bit to make it look nicer:
ds = ds %>% mutate(spirit=paste(" ", spirit,sep=""))
final = full_join(join, ds, by='name') %>%
mutate(price=coalesce(price.x,price.y)) %>%
select(-price.x, -price.y) %>%
replace(is.na(.),"") %>%
unite(spirit, spirit.x, spirit.y, sep="")
final
## # A tibble: 7 x 4
## name spirit mixer price
## <chr> <chr> <chr> <dbl>
## 1 g&t gin tonic 8
## 2 mimosa sparkling wine orange juice 6
## 3 rum and coke rum coke 7
## 4 calimocho red wine coke 5
## 5 polar "" seltzer 5
## 6 IPA beer "" 5
## 7 dark and stormy rum ginger beer "" 8
Now suppose that we’re at some bar where the total price for a mixed drink is equal to the sum of the prices of the components. The prices are
nonalcoholic = tibble(name=c("g&t","mimosa","rum and coke","calimocho","polar"),
mixer=c("tonic","orange juice", "coke", "coke","seltzer")) %>%
mutate(price=c(3, 2, 2, 2, 5))
alcoholic = tibble(name=c("g&t","mimosa","rum and coke","calimocho","IPA"),
spirit=c("gin","sparkling wine", "rum", "red wine","beer")) %>%
mutate(price=c(5, 4, 5, 3, 5))
How can we merge the data and combine (add) prices?
We can start by full_join
ing by name
full_join(nonalcoholic, alcoholic, by='name')
## # A tibble: 6 x 5
## name mixer price.x spirit price.y
## <chr> <chr> <dbl> <chr> <dbl>
## 1 g&t tonic 3 gin 5
## 2 mimosa orange juice 2 sparkling wine 4
## 3 rum and coke coke 2 rum 5
## 4 calimocho coke 2 red wine 3
## 5 polar seltzer 5 <NA> NA
## 6 IPA <NA> NA beer 5
Unfortunately, a simple mutate
instruction isn’t satisfactory because the sum of an NA
+ a number is equal to NA
:
full_join(nonalcoholic, alcoholic, by='name') %>% mutate(total=price.x+price.y)
## # A tibble: 6 x 6
## name mixer price.x spirit price.y total
## <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 g&t tonic 3 gin 5 8
## 2 mimosa orange juice 2 sparkling wine 4 6
## 3 rum and coke coke 2 rum 5 7
## 4 calimocho coke 2 red wine 3 5
## 5 polar seltzer 5 <NA> NA NA
## 6 IPA <NA> NA beer 5 NA
We can solve this problem in different ways. One way is using coalesce
to replace the NA
s by zeros in price.x
and price.y
, and then adding:
full_join(nonalcoholic, alcoholic, by='name') %>%
mutate(price.x=coalesce(price.x,0), price.y=coalesce(price.y,0), total=price.x+price.y) %>%
select(name, spirit, mixer, total)
## # A tibble: 6 x 4
## name spirit mixer total
## <chr> <chr> <chr> <dbl>
## 1 g&t gin tonic 8
## 2 mimosa sparkling wine orange juice 6
## 3 rum and coke rum coke 7
## 4 calimocho red wine coke 5
## 5 polar <NA> seltzer 5
## 6 IPA beer <NA> 5
In general, the way that R
deals with NA
s can be a little frustrating, as these examples show.
Source: Data wrangling cheat sheet
Consider the tibble
s y
and z
, which have matching variable names x1
and x2
(this is important!):
y
## # A tibble: 3 x 2
## x1 x2
## <chr> <int>
## 1 A 1
## 2 B 2
## 3 C 3
z
## # A tibble: 3 x 2
## x1 x2
## <chr> <int>
## 1 B 2
## 2 C 3
## 3 D 4
Set operations
intersect(y,z)
: find rows that appear in both y
and z
intersect(y,z)
## # A tibble: 2 x 2
## x1 x2
## <chr> <int>
## 1 B 2
## 2 C 3
union(y,z)
: find rows that appear in y
or z
(or both):union(y,z)
## # A tibble: 4 x 2
## x1 x2
## <chr> <int>
## 1 D 4
## 2 C 3
## 3 B 2
## 4 A 1
setdiff(y, z)
: find rows that appear in y
but not in z
:setdiff(y,z)
## # A tibble: 1 x 2
## x1 x2
## <chr> <int>
## 1 A 1
Binding rows and columns
If you want to bind the rows of y
and z
bind_rows(y,z)
## # A tibble: 6 x 2
## x1 x2
## <chr> <int>
## 1 A 1
## 2 B 2
## 3 C 3
## 4 B 2
## 5 C 3
## 6 D 4
If you want to bind the columns of y
and z
:
bind_cols(y,z)
## # A tibble: 3 x 4
## x1 x2 x11 x21
## <chr> <int> <chr> <int>
## 1 A 1 B 2
## 2 B 2 C 3
## 3 C 3 D 4