dplyr







package use
dplyr data manipulation operations tool
ggplot powerful graphics language for creating elegant and complex plots
forcats reordering factor levels
haven Import and Export ‘SPSS’, ‘Stata’ and ‘SAS’ Files to R
purrr enhances R’s functional programming toolkit
readr read data (like ‘csv’, ‘tsv’, and ‘fwf’)
readxl get data out of Excel and into R
stringr make working with strings as easy as possible (regular expressions)
tidyr The goal of tidyr is to help you create tidy data. Tidy data is data where:







Load the dataset

library(dplyr)
address <- url("http://www.trutschnig.net/RTR2015.RData")
load(address)
head(RTR2015)
##                                           id               mtime
## 329731 Oadff95f3-880a-4cb3-9534-b85e932fb137 2014-01-01 00:10:00
## 368684 Oc2852599-37bd-4c3c-8fe1-15c4884f5926 2014-01-01 00:16:00
## 137787 O48a47c74-c7a6-4ac6-9c24-fc442d0674dd 2014-01-01 00:21:00
## 200828 O69f25c23-5089-45d9-8407-3159f35f4a7e 2014-01-01 00:21:00
## 450318 Oede0db57-1d56-432c-bf1d-73327add2591 2014-01-01 00:27:00
## 106797 O383e9306-9612-4c39-af8f-1b018f6db871 2014-01-01 00:41:00
##              mymd     mym op_name nw_cat               device
## 329731 2014-01-01 2014-01    Pear     3G        Galaxy Note 2
## 368684 2014-01-01 2014-01    Pear     3G        Galaxy Note 2
## 137787 2014-01-01 2014-01    Kiwi     3G Galaxy Note 10.1 LTE
## 200828 2014-01-01 2014-01    Kiwi     3G            Galaxy S3
## 450318 2014-01-01 2014-01    Kiwi     3G Galaxy Note 10.1 LTE
## 106797 2014-01-01 2014-01   Apple     3G            iPhone 5s
##        device_platform device_has_lte longitude latitude iso_adm2
## 329731         Android          FALSE  13.20764 47.34299   at0504
## 368684         Android          FALSE  13.22863 47.34588   at0504
## 137787         Android           TRUE  15.75156 48.20457   at0319
## 200828         Android          FALSE  16.39714 48.23074   at0900
## 450318         Android           TRUE  15.63205 48.20800   at0302
## 106797             iOS           TRUE  15.51962 47.12490   at0606
##        rtr_speed_dl rtr_speed_ul rtr_ping
## 329731         3226          419  56.6295
## 368684         2068          572  41.7948
## 137787        11604         2881  63.5496
## 200828         1948         1039  45.0728
## 450318         6277         2484  59.1642
## 106797         3577         2433  47.9287
RTR=tbl_df(RTR2015)
RTR
## # A tibble: 239,060 x 15
##    id         mtime               mymd       mym   op_name nw_cat device  
##  * <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Oadff95f3… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy …
##  2 Oc2852599… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy …
##  3 O48a47c74… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  4 O69f25c23… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  5 Oede0db57… 2014-01-01 00:27:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  6 O383e9306… 2014-01-01 00:41:00 2014-01-01 2014… Apple   3G     iPhone …
##  7 O4b28bf72… 2014-01-01 00:42:00 2014-01-01 2014… Apple   3G     iPhone …
##  8 Oef71e021… 2014-01-01 00:48:00 2014-01-01 2014… Pear    3G     iPhone …
##  9 Oef957387… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy …
## 10 O6c813ae1… 2014-01-01 01:07:00 2014-01-01 2014… Kiwi    3G     iPhone …
## # ... with 239,050 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>
glimpse(RTR)
## Observations: 239,060
## Variables: 15
## $ id              <chr> "Oadff95f3-880a-4cb3-9534-b85e932fb137", "Oc28...
## $ mtime           <dttm> 2014-01-01 00:10:00, 2014-01-01 00:16:00, 201...
## $ mymd            <date> 2014-01-01, 2014-01-01, 2014-01-01, 2014-01-0...
## $ mym             <chr> "2014-01", "2014-01", "2014-01", "2014-01", "2...
## $ op_name         <chr> "Pear", "Pear", "Kiwi", "Kiwi", "Kiwi", "Apple...
## $ nw_cat          <fct> 3G, 3G, 3G, 3G, 3G, 3G, 3G, 3G, 3G, 3G, 2G, 3G...
## $ device          <fct> Galaxy Note 2, Galaxy Note 2, Galaxy Note 10.1...
## $ device_platform <fct> Android, Android, Android, Android, Android, i...
## $ device_has_lte  <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, F...
## $ longitude       <dbl> 13.20764, 13.22863, 15.75156, 16.39714, 15.632...
## $ latitude        <dbl> 47.34299, 47.34588, 48.20457, 48.23074, 48.208...
## $ iso_adm2        <fct> at0504, at0504, at0319, at0900, at0302, at0606...
## $ rtr_speed_dl    <dbl> 3226, 2068, 11604, 1948, 6277, 3577, 6526, 315...
## $ rtr_speed_ul    <dbl> 419, 572, 2881, 1039, 2484, 2433, 1999, 1921, ...
## $ rtr_ping        <dbl> 56.6295, 41.7948, 63.5496, 45.0728, 59.1642, 4...





The Basic Grammar of Dplyr

The 5 basic verbs


Verb Semantic
select() Select variables based on their name
filter() Select cases based on their values
mutate()/transmute() Add new variables that are functions of existing variables
arrange() Reorder the cases
summarise() Condense multiple values to a single value


No modification of the source dataset!


Usage


select()


Example

Select variables in which order you like
select(RTR, device, device_platform, op_name)
## # A tibble: 239,060 x 3
##    device               device_platform op_name
##  * <fct>                <fct>           <chr>  
##  1 Galaxy Note 2        Android         Pear   
##  2 Galaxy Note 2        Android         Pear   
##  3 Galaxy Note 10.1 LTE Android         Kiwi   
##  4 Galaxy S3            Android         Kiwi   
##  5 Galaxy Note 10.1 LTE Android         Kiwi   
##  6 iPhone 5s            iOS             Apple  
##  7 iPhone 5s            iOS             Apple  
##  8 iPhone 4s            iOS             Pear   
##  9 Galaxy S4 LTE        Android         Pear   
## 10 iPhone 5s            iOS             Kiwi   
## # ... with 239,050 more rows



There are six helper functions

Function Semantic
starts_with() Each variable starting with a defined string
ends_with() Each variable ending with a defined string
contains() Each variable containing a defined string
matches() Each variable matching a defined string
num_range() Each variable matching a defined string + number
one_of() Each variable whose names are in a group of names



select()

1. Select all variables from rtr_speed_dl to rtr_ping!
select(RTR, rtr_speed_dl:rtr_ping)
## # A tibble: 239,060 x 3
##    rtr_speed_dl rtr_speed_ul rtr_ping
##  *        <dbl>        <dbl>    <dbl>
##  1        3226.         419.     56.6
##  2        2068.         572.     41.8
##  3       11604.        2881.     63.5
##  4        1948.        1039.     45.1
##  5        6277.        2484.     59.2
##  6        3577.        2433.     47.9
##  7        6526.        1999.     47.5
##  8        3155.        1921.     39.9
##  9       11354.         837.     47.8
## 10       13600.        3048.     59.0
## # ... with 239,050 more rows


2. Select all variables except of id and iso_adm2!
select(RTR, -c(id, iso_adm2))
## # A tibble: 239,060 x 13
##    mtime               mymd       mym     op_name nw_cat device           
##  * <dttm>              <date>     <chr>   <chr>   <fct>  <fct>            
##  1 2014-01-01 00:10:00 2014-01-01 2014-01 Pear    3G     Galaxy Note 2    
##  2 2014-01-01 00:16:00 2014-01-01 2014-01 Pear    3G     Galaxy Note 2    
##  3 2014-01-01 00:21:00 2014-01-01 2014-01 Kiwi    3G     Galaxy Note 10.1…
##  4 2014-01-01 00:21:00 2014-01-01 2014-01 Kiwi    3G     Galaxy S3        
##  5 2014-01-01 00:27:00 2014-01-01 2014-01 Kiwi    3G     Galaxy Note 10.1…
##  6 2014-01-01 00:41:00 2014-01-01 2014-01 Apple   3G     iPhone 5s        
##  7 2014-01-01 00:42:00 2014-01-01 2014-01 Apple   3G     iPhone 5s        
##  8 2014-01-01 00:48:00 2014-01-01 2014-01 Pear    3G     iPhone 4s        
##  9 2014-01-01 01:05:00 2014-01-01 2014-01 Pear    3G     Galaxy S4 LTE    
## 10 2014-01-01 01:07:00 2014-01-01 2014-01 Kiwi    3G     iPhone 5s        
## # ... with 239,050 more rows, and 7 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


3. What does happen when you use following code?
device <- 5
select(RTR, device)
## # A tibble: 239,060 x 1
##    device              
##  * <fct>               
##  1 Galaxy Note 2       
##  2 Galaxy Note 2       
##  3 Galaxy Note 10.1 LTE
##  4 Galaxy S3           
##  5 Galaxy Note 10.1 LTE
##  6 iPhone 5s           
##  7 iPhone 5s           
##  8 iPhone 4s           
##  9 Galaxy S4 LTE       
## 10 iPhone 5s           
## # ... with 239,050 more rows
select(RTR, identity(device))
## # A tibble: 239,060 x 1
##    op_name
##  * <chr>  
##  1 Pear   
##  2 Pear   
##  3 Kiwi   
##  4 Kiwi   
##  5 Kiwi   
##  6 Apple  
##  7 Apple  
##  8 Pear   
##  9 Pear   
## 10 Kiwi   
## # ... with 239,050 more rows
select(RTR, !! device)
## # A tibble: 239,060 x 1
##    op_name
##  * <chr>  
##  1 Pear   
##  2 Pear   
##  3 Kiwi   
##  4 Kiwi   
##  5 Kiwi   
##  6 Apple  
##  7 Apple  
##  8 Pear   
##  9 Pear   
## 10 Kiwi   
## # ... with 239,050 more rows


4. Select all variables containing the string “ng”!
select(RTR, contains("ng"))
## # A tibble: 239,060 x 2
##    longitude rtr_ping
##  *     <dbl>    <dbl>
##  1      13.2     56.6
##  2      13.2     41.8
##  3      15.8     63.5
##  4      16.4     45.1
##  5      15.6     59.2
##  6      15.5     47.9
##  7      15.5     47.5
##  8      13.2     39.9
##  9      13.8     47.8
## 10      15.2     59.0
## # ... with 239,050 more rows


5. Select the variables “device”, “op_name”, and “rtr_ping” (use one_of())!
select(RTR, one_of(c("device", "op_name", "rtr_ping")))
## # A tibble: 239,060 x 3
##    device               op_name rtr_ping
##  * <fct>                <chr>      <dbl>
##  1 Galaxy Note 2        Pear        56.6
##  2 Galaxy Note 2        Pear        41.8
##  3 Galaxy Note 10.1 LTE Kiwi        63.5
##  4 Galaxy S3            Kiwi        45.1
##  5 Galaxy Note 10.1 LTE Kiwi        59.2
##  6 iPhone 5s            Apple       47.9
##  7 iPhone 5s            Apple       47.5
##  8 iPhone 4s            Pear        39.9
##  9 Galaxy S4 LTE        Pear        47.8
## 10 iPhone 5s            Kiwi        59.0
## # ... with 239,050 more rows


6. Select all variables starting with “dev” AND ending with “e”!
RTR_devices <- select(RTR, starts_with("dev"))
select(RTR_devices, ends_with("e"))
## # A tibble: 239,060 x 2
##    device               device_has_lte
##  * <fct>                <lgl>         
##  1 Galaxy Note 2        FALSE         
##  2 Galaxy Note 2        FALSE         
##  3 Galaxy Note 10.1 LTE TRUE          
##  4 Galaxy S3            FALSE         
##  5 Galaxy Note 10.1 LTE TRUE          
##  6 iPhone 5s            TRUE          
##  7 iPhone 5s            TRUE          
##  8 iPhone 4s            FALSE         
##  9 Galaxy S4 LTE        TRUE          
## 10 iPhone 5s            TRUE          
## # ... with 239,050 more rows



mutate()

  • Adds new columns to the dataset (usually functions of existing ones)
  • Works similar to the base transform()
  • If you only want to keep the new variables use transmute()


Example

mutate(RTR, diff_speed = rtr_speed_ul - rtr_speed_dl)
## # A tibble: 239,060 x 16
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Oadff95f3… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy …
##  2 Oc2852599… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy …
##  3 O48a47c74… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  4 O69f25c23… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  5 Oede0db57… 2014-01-01 00:27:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  6 O383e9306… 2014-01-01 00:41:00 2014-01-01 2014… Apple   3G     iPhone …
##  7 O4b28bf72… 2014-01-01 00:42:00 2014-01-01 2014… Apple   3G     iPhone …
##  8 Oef71e021… 2014-01-01 00:48:00 2014-01-01 2014… Pear    3G     iPhone …
##  9 Oef957387… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy …
## 10 O6c813ae1… 2014-01-01 01:07:00 2014-01-01 2014… Kiwi    3G     iPhone …
## # ... with 239,050 more rows, and 9 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>,
## #   diff_speed <dbl>


Directly refer to new variables
mutate(RTR,
       abs_diff_speed = abs(rtr_speed_ul - rtr_speed_dl),
       perc_diff_speed = abs_diff_speed / rtr_speed_dl * 100)
## # A tibble: 239,060 x 17
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Oadff95f3… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy …
##  2 Oc2852599… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy …
##  3 O48a47c74… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  4 O69f25c23… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  5 Oede0db57… 2014-01-01 00:27:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  6 O383e9306… 2014-01-01 00:41:00 2014-01-01 2014… Apple   3G     iPhone …
##  7 O4b28bf72… 2014-01-01 00:42:00 2014-01-01 2014… Apple   3G     iPhone …
##  8 Oef71e021… 2014-01-01 00:48:00 2014-01-01 2014… Pear    3G     iPhone …
##  9 Oef957387… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy …
## 10 O6c813ae1… 2014-01-01 01:07:00 2014-01-01 2014… Kiwi    3G     iPhone …
## # ... with 239,050 more rows, and 10 more variables:
## #   device_platform <fct>, device_has_lte <lgl>, longitude <dbl>,
## #   latitude <dbl>, iso_adm2 <fct>, rtr_speed_dl <dbl>,
## #   rtr_speed_ul <dbl>, rtr_ping <dbl>, abs_diff_speed <dbl>,
## #   perc_diff_speed <dbl>


Only keep new variables
transmute(RTR,
          diff_speed = rtr_speed_ul - rtr_speed_dl,
          abs_diff_speed = abs(rtr_speed_ul - rtr_speed_dl))
## # A tibble: 239,060 x 2
##    diff_speed abs_diff_speed
##         <dbl>          <dbl>
##  1     -2807.          2807.
##  2     -1496.          1496.
##  3     -8723.          8723.
##  4      -909.           909.
##  5     -3793.          3793.
##  6     -1144.          1144.
##  7     -4527.          4527.
##  8     -1234.          1234.
##  9    -10517.         10517.
## 10    -10552.         10552.
## # ... with 239,050 more rows


When adding a column from a variable outside the dataset, it either has to have a length of one
var <- "Hello World"
mutate(RTR, new = var)
## # A tibble: 239,060 x 16
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Oadff95f3… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy …
##  2 Oc2852599… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy …
##  3 O48a47c74… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  4 O69f25c23… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  5 Oede0db57… 2014-01-01 00:27:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  6 O383e9306… 2014-01-01 00:41:00 2014-01-01 2014… Apple   3G     iPhone …
##  7 O4b28bf72… 2014-01-01 00:42:00 2014-01-01 2014… Apple   3G     iPhone …
##  8 Oef71e021… 2014-01-01 00:48:00 2014-01-01 2014… Pear    3G     iPhone …
##  9 Oef957387… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy …
## 10 O6c813ae1… 2014-01-01 01:07:00 2014-01-01 2014… Kiwi    3G     iPhone …
## # ... with 239,050 more rows, and 9 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>, new <chr>


…or match the length of the dataset
var <- seq(1, nrow(RTR))
mutate(RTR, new = var)
## # A tibble: 239,060 x 16
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Oadff95f3… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy …
##  2 Oc2852599… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy …
##  3 O48a47c74… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  4 O69f25c23… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  5 Oede0db57… 2014-01-01 00:27:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  6 O383e9306… 2014-01-01 00:41:00 2014-01-01 2014… Apple   3G     iPhone …
##  7 O4b28bf72… 2014-01-01 00:42:00 2014-01-01 2014… Apple   3G     iPhone …
##  8 Oef71e021… 2014-01-01 00:48:00 2014-01-01 2014… Pear    3G     iPhone …
##  9 Oef957387… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy …
## 10 O6c813ae1… 2014-01-01 01:07:00 2014-01-01 2014… Kiwi    3G     iPhone …
## # ... with 239,050 more rows, and 9 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>, new <int>



mutate()

1. Create a variable that contains the sum of longitude and latitude!
mutate(RTR, sum_long_lat = longitude + latitude)
## # A tibble: 239,060 x 16
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Oadff95f3… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy …
##  2 Oc2852599… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy …
##  3 O48a47c74… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  4 O69f25c23… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  5 Oede0db57… 2014-01-01 00:27:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  6 O383e9306… 2014-01-01 00:41:00 2014-01-01 2014… Apple   3G     iPhone …
##  7 O4b28bf72… 2014-01-01 00:42:00 2014-01-01 2014… Apple   3G     iPhone …
##  8 Oef71e021… 2014-01-01 00:48:00 2014-01-01 2014… Pear    3G     iPhone …
##  9 Oef957387… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy …
## 10 O6c813ae1… 2014-01-01 01:07:00 2014-01-01 2014… Kiwi    3G     iPhone …
## # ... with 239,050 more rows, and 9 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>,
## #   sum_long_lat <dbl>


2. Create a variable that contains the ping in seconds (is: milliseconds)!
mutate(RTR, ping_seconds = rtr_ping / 1000)
## # A tibble: 239,060 x 16
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Oadff95f3… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy …
##  2 Oc2852599… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy …
##  3 O48a47c74… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  4 O69f25c23… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  5 Oede0db57… 2014-01-01 00:27:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  6 O383e9306… 2014-01-01 00:41:00 2014-01-01 2014… Apple   3G     iPhone …
##  7 O4b28bf72… 2014-01-01 00:42:00 2014-01-01 2014… Apple   3G     iPhone …
##  8 Oef71e021… 2014-01-01 00:48:00 2014-01-01 2014… Pear    3G     iPhone …
##  9 Oef957387… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy …
## 10 O6c813ae1… 2014-01-01 01:07:00 2014-01-01 2014… Kiwi    3G     iPhone …
## # ... with 239,050 more rows, and 9 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>,
## #   ping_seconds <dbl>


3. Create a variable that contains the quotient of latitude and longitude (divisor)! Only keep that new variable!
transmute(RTR, quot_lat_long = latitude / longitude)
## # A tibble: 239,060 x 1
##    quot_lat_long
##            <dbl>
##  1          3.58
##  2          3.58
##  3          3.06
##  4          2.94
##  5          3.08
##  6          3.04
##  7          3.04
##  8          3.59
##  9          3.40
## 10          3.21
## # ... with 239,050 more rows



filter()

  • Selects rows (cases) where the expression is true
  • Variables are separated by “,” which is equivalent to “&”
  • Use | as logical “or” operator between variables
  • Usable are all logical operators, is.na, and %in%


Example

filter(RTR, op_name == "Pear")
## # A tibble: 100,965 x 15
##    id          mtime               mymd       mym   op_name nw_cat device 
##    <chr>       <dttm>              <date>     <chr> <chr>   <fct>  <fct>  
##  1 Oadff95f3-… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy…
##  2 Oc2852599-… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy…
##  3 Oef71e021-… 2014-01-01 00:48:00 2014-01-01 2014… Pear    3G     iPhone…
##  4 Oef957387-… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy…
##  5 Oe0bab19d-… 2014-01-01 01:25:00 2014-01-01 2014… Pear    3G     Nexus 4
##  6 O25373236-… 2014-01-01 02:10:00 2014-01-01 2014… Pear    3G     Galaxy…
##  7 O8e2eaee3-… 2014-01-01 02:29:00 2014-01-01 2014… Pear    3G     Galaxy…
##  8 O6b01181c-… 2014-01-01 06:57:00 2014-01-01 2014… Pear    2G     SM-T211
##  9 O69618fa8-… 2014-01-01 07:14:00 2014-01-01 2014… Pear    4G     Galaxy…
## 10 Od48e6a8c-… 2014-01-01 07:15:00 2014-01-01 2014… Pear    3G     Galaxy…
## # ... with 100,955 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


More than one variable
filter(RTR, op_name == "Pear", nw_cat == "3G")
## # A tibble: 53,091 x 15
##    id          mtime               mymd       mym   op_name nw_cat device 
##    <chr>       <dttm>              <date>     <chr> <chr>   <fct>  <fct>  
##  1 Oadff95f3-… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy…
##  2 Oc2852599-… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy…
##  3 Oef71e021-… 2014-01-01 00:48:00 2014-01-01 2014… Pear    3G     iPhone…
##  4 Oef957387-… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy…
##  5 Oe0bab19d-… 2014-01-01 01:25:00 2014-01-01 2014… Pear    3G     Nexus 4
##  6 O25373236-… 2014-01-01 02:10:00 2014-01-01 2014… Pear    3G     Galaxy…
##  7 O8e2eaee3-… 2014-01-01 02:29:00 2014-01-01 2014… Pear    3G     Galaxy…
##  8 Od48e6a8c-… 2014-01-01 07:15:00 2014-01-01 2014… Pear    3G     Galaxy…
##  9 O89a7c80e-… 2014-01-01 07:17:00 2014-01-01 2014… Pear    3G     LG G2  
## 10 Of8ff4a2d-… 2014-01-01 07:35:00 2014-01-01 2014… Pear    3G     Nexus 5
## # ... with 53,081 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


Usage of %in%
filter(RTR, device %in% c("Galaxy Note 2", "Galaxy Note 10.1"))
## # A tibble: 2,733 x 15
##    id          mtime               mymd       mym   op_name nw_cat device 
##    <chr>       <dttm>              <date>     <chr> <chr>   <fct>  <fct>  
##  1 Oadff95f3-… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy…
##  2 Oc2852599-… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy…
##  3 O2dfb19a1-… 2014-01-01 10:34:00 2014-01-01 2014… Kiwi    3G     Galaxy…
##  4 Oc88d177a-… 2014-01-01 11:59:00 2014-01-01 2014… Apple   3G     Galaxy…
##  5 O879652cc-… 2014-01-01 13:49:00 2014-01-01 2014… Pear    2G     Galaxy…
##  6 O5ac02695-… 2014-01-01 15:13:00 2014-01-01 2014… Pear    3G     Galaxy…
##  7 O9baf2d9f-… 2014-01-02 07:11:00 2014-01-02 2014… Pear    3G     Galaxy…
##  8 O7f3e3424-… 2014-01-02 07:20:00 2014-01-02 2014… Pear    3G     Galaxy…
##  9 O1939c394-… 2014-01-02 09:32:00 2014-01-02 2014… Apple   3G     Galaxy…
## 10 Ob369df63-… 2014-01-02 09:49:00 2014-01-02 2014… Apple   3G     Galaxy…
## # ... with 2,723 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


Usage of |
filter(RTR, rtr_speed_dl < 3000 | rtr_speed_ul < 500)
## # A tibble: 55,830 x 15
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Oadff95f3… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy …
##  2 Oc2852599… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy …
##  3 O69f25c23… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  4 Oe2f2adbe… 2014-01-01 01:14:00 2014-01-01 2014… Kiwi    2G     HTC Wil…
##  5 Oe0bab19d… 2014-01-01 01:25:00 2014-01-01 2014… Pear    3G     Nexus 4 
##  6 O7cb19690… 2014-01-01 01:26:00 2014-01-01 2014… Kiwi    2G     Galaxy …
##  7 Oe1239547… 2014-01-01 02:14:00 2014-01-01 2014… Apple   2G     Galaxy …
##  8 Od5a015ed… 2014-01-01 02:31:00 2014-01-01 2014… Apple   3G     Galaxy …
##  9 O8e8bb20f… 2014-01-01 06:53:00 2014-01-01 2014… Apple   3G     Galaxy …
## 10 O6b01181c… 2014-01-01 06:57:00 2014-01-01 2014… Pear    2G     SM-T211 
## # ... with 55,820 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>



filter()

1. Filter all “Galaxy Note 2”!
filter(RTR, device == "Galaxy Note 2")
## # A tibble: 2,082 x 15
##    id          mtime               mymd       mym   op_name nw_cat device 
##    <chr>       <dttm>              <date>     <chr> <chr>   <fct>  <fct>  
##  1 Oadff95f3-… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy…
##  2 Oc2852599-… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy…
##  3 O879652cc-… 2014-01-01 13:49:00 2014-01-01 2014… Pear    2G     Galaxy…
##  4 O5ac02695-… 2014-01-01 15:13:00 2014-01-01 2014… Pear    3G     Galaxy…
##  5 O9baf2d9f-… 2014-01-02 07:11:00 2014-01-02 2014… Pear    3G     Galaxy…
##  6 O7f3e3424-… 2014-01-02 07:20:00 2014-01-02 2014… Pear    3G     Galaxy…
##  7 Ob369df63-… 2014-01-02 09:49:00 2014-01-02 2014… Apple   3G     Galaxy…
##  8 Obec412c9-… 2014-01-02 11:38:00 2014-01-02 2014… Pear    3G     Galaxy…
##  9 O71038e64-… 2014-01-02 11:45:00 2014-01-02 2014… Pear    3G     Galaxy…
## 10 Oc7129058-… 2014-01-02 12:16:00 2014-01-02 2014… Apple   3G     Galaxy…
## # ... with 2,072 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


2. Filter all Apple devices that have LTE!
filter(RTR, device_platform == "iOS", device_has_lte == TRUE)
## # A tibble: 75,025 x 15
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 O383e9306… 2014-01-01 00:41:00 2014-01-01 2014… Apple   3G     iPhone …
##  2 O4b28bf72… 2014-01-01 00:42:00 2014-01-01 2014… Apple   3G     iPhone …
##  3 O6c813ae1… 2014-01-01 01:07:00 2014-01-01 2014… Kiwi    3G     iPhone …
##  4 O56352054… 2014-01-01 02:23:00 2014-01-01 2014… Kiwi    3G     iPhone …
##  5 O170f6f0d… 2014-01-01 08:41:00 2014-01-01 2014… Apple   3G     iPhone 5
##  6 O1f55288c… 2014-01-01 09:08:00 2014-01-01 2014… Pear    3G     iPhone 5
##  7 O497ba5c7… 2014-01-01 09:23:00 2014-01-01 2014… Pear    3G     iPhone 5
##  8 O3a0cc9be… 2014-01-01 10:22:00 2014-01-01 2014… Apple   3G     iPad mi…
##  9 O8afb0b63… 2014-01-01 10:29:00 2014-01-01 2014… Apple   3G     iPhone 5
## 10 O71c8d6ea… 2014-01-01 10:30:00 2014-01-01 2014… Pear    3G     iPhone 5
## # ... with 75,015 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


3. Filter all longitude coordinates that are greater than 13 and smaller than 14!
filter(RTR, longitude > 13, longitude < 14)
## # A tibble: 23,772 x 15
##    id          mtime               mymd       mym   op_name nw_cat device 
##    <chr>       <dttm>              <date>     <chr> <chr>   <fct>  <fct>  
##  1 Oadff95f3-… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy…
##  2 Oc2852599-… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy…
##  3 Oef71e021-… 2014-01-01 00:48:00 2014-01-01 2014… Pear    3G     iPhone…
##  4 Oef957387-… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy…
##  5 Od5a015ed-… 2014-01-01 02:31:00 2014-01-01 2014… Apple   3G     Galaxy…
##  6 Of6074f8d-… 2014-01-01 02:40:00 2014-01-01 2014… Apple   3G     HTC On…
##  7 Of4f037aa-… 2014-01-01 08:06:00 2014-01-01 2014… Kiwi    3G     Galaxy…
##  8 O2e2223e0-… 2014-01-01 08:45:00 2014-01-01 2014… Pear    3G     iPhone…
##  9 Obe795f00-… 2014-01-01 10:36:00 2014-01-01 2014… Pear    3G     Galaxy…
## 10 Oaabff0d5-… 2014-01-01 10:41:00 2014-01-01 2014… Apple   3G     Galaxy…
## # ... with 23,762 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


4. Filter all longitude coordinates that are 13, 14, 15, and 16!
filter(RTR, longitude %in% c(13:16))
## # A tibble: 37 x 15
##    id           mtime               mymd       mym   op_name nw_cat device
##    <chr>        <dttm>              <date>     <chr> <chr>   <fct>  <fct> 
##  1 O3b76250e-0… 2014-10-11 07:39:44 2014-10-11 2014… Apple   3G     iPhon…
##  2 Od45f5834-0… 2014-10-16 14:37:03 2014-10-16 2014… Kiwi    3G     iPhon…
##  3 O19cd7844-b… 2014-10-20 08:16:10 2014-10-20 2014… Kiwi    3G     iPhon…
##  4 O39f99095-0… 2014-10-26 04:43:33 2014-10-26 2014… Pear    3G     iPhon…
##  5 O24bd0f28-6… 2014-12-29 20:39:35 2014-12-29 2014… Pear    4G     iPhon…
##  6 Odf21e5e7-9… 2015-01-25 19:00:03 2015-01-25 2015… Apple   4G     iPhon…
##  7 O9636cf3d-a… 2015-02-18 13:17:55 2015-02-18 2015… Pear    3G     iPhon…
##  8 O59aa5e04-b… 2015-03-07 21:27:03 2015-03-07 2015… Apple   4G     iPhon…
##  9 O3b6389d8-0… 2015-04-15 22:46:46 2015-04-15 2015… Apple   4G     iPhon…
## 10 O4cbaa4aa-6… 2015-04-15 22:47:12 2015-04-15 2015… Apple   4G     iPhon…
## # ... with 27 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


5. Filter all cases of device platforms that have no missing values!
filter(RTR, !is.na(device_platform))
## # A tibble: 239,060 x 15
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Oadff95f3… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy …
##  2 Oc2852599… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy …
##  3 O48a47c74… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  4 O69f25c23… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  5 Oede0db57… 2014-01-01 00:27:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  6 O383e9306… 2014-01-01 00:41:00 2014-01-01 2014… Apple   3G     iPhone …
##  7 O4b28bf72… 2014-01-01 00:42:00 2014-01-01 2014… Apple   3G     iPhone …
##  8 Oef71e021… 2014-01-01 00:48:00 2014-01-01 2014… Pear    3G     iPhone …
##  9 Oef957387… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy …
## 10 O6c813ae1… 2014-01-01 01:07:00 2014-01-01 2014… Kiwi    3G     iPhone …
## # ... with 239,050 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


6. Filter all cases that are greater than the mean download speed!
filter(RTR, rtr_speed_dl > mean(rtr_speed_dl))
## # A tibble: 76,742 x 15
##    id          mtime               mymd       mym   op_name nw_cat device 
##    <chr>       <dttm>              <date>     <chr> <chr>   <fct>  <fct>  
##  1 O418baa97-… 2014-01-01 07:13:00 2014-01-01 2014… Apple   3G     ME302KL
##  2 O69618fa8-… 2014-01-01 07:14:00 2014-01-01 2014… Pear    4G     Galaxy…
##  3 Oefdc9660-… 2014-01-01 09:22:00 2014-01-01 2014… Pear    3G     Galaxy…
##  4 O9a22bd24-… 2014-01-01 11:59:00 2014-01-01 2014… Pear    3G     Galaxy…
##  5 Oef360337-… 2014-01-01 15:49:00 2014-01-01 2014… Pear    4G     Galaxy…
##  6 Ob815d039-… 2014-01-01 15:50:00 2014-01-01 2014… Pear    4G     Galaxy…
##  7 O9ccd7011-… 2014-01-02 00:49:00 2014-01-02 2014… Kiwi    4G     iPhone…
##  8 Ob9e210c4-… 2014-01-02 05:27:00 2014-01-02 2014… Apple   4G     Nexus 7
##  9 O8c771612-… 2014-01-02 06:28:00 2014-01-02 2014… Pear    3G     Galaxy…
## 10 Oc1e6bea7-… 2014-01-02 06:48:00 2014-01-02 2014… Apple   4G     Nexus 7
## # ... with 76,732 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


7. Filter all cases where the upload speed is smaller than 50% of the download speed!
filter(RTR, (rtr_speed_ul / rtr_speed_dl) < 0.5)
## # A tibble: 168,036 x 15
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Oadff95f3… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy …
##  2 Oc2852599… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy …
##  3 O48a47c74… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  4 Oede0db57… 2014-01-01 00:27:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  5 O4b28bf72… 2014-01-01 00:42:00 2014-01-01 2014… Apple   3G     iPhone …
##  6 Oef957387… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy …
##  7 O6c813ae1… 2014-01-01 01:07:00 2014-01-01 2014… Kiwi    3G     iPhone …
##  8 Oe2f2adbe… 2014-01-01 01:14:00 2014-01-01 2014… Kiwi    2G     HTC Wil…
##  9 Oe0bab19d… 2014-01-01 01:25:00 2014-01-01 2014… Pear    3G     Nexus 4 
## 10 O7cb19690… 2014-01-01 01:26:00 2014-01-01 2014… Kiwi    2G     Galaxy …
## # ... with 168,026 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


8. Filter all cases where the upload speed is greater than 10,000 or the device has LTE!
filter(RTR, rtr_speed_dl > 10000| device_has_lte == TRUE)
## # A tibble: 187,675 x 15
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 O48a47c74… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  2 Oede0db57… 2014-01-01 00:27:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  3 O383e9306… 2014-01-01 00:41:00 2014-01-01 2014… Apple   3G     iPhone …
##  4 O4b28bf72… 2014-01-01 00:42:00 2014-01-01 2014… Apple   3G     iPhone …
##  5 Oef957387… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy …
##  6 O6c813ae1… 2014-01-01 01:07:00 2014-01-01 2014… Kiwi    3G     iPhone …
##  7 Oe0bab19d… 2014-01-01 01:25:00 2014-01-01 2014… Pear    3G     Nexus 4 
##  8 O25373236… 2014-01-01 02:10:00 2014-01-01 2014… Pear    3G     Galaxy …
##  9 Oe1239547… 2014-01-01 02:14:00 2014-01-01 2014… Apple   2G     Galaxy …
## 10 O56352054… 2014-01-01 02:23:00 2014-01-01 2014… Kiwi    3G     iPhone …
## # ... with 187,665 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>



arrange()

What does arrange do?

  • arranges the dataset by a column
  • Syntax: arrange(.data,...)
    • .data = df/tbl
    • = Comma separated list of unquoted variable names



How does it work?

  • gives out a copy of the original dataset
  • orders data automatically (by factor-order, alphabetical or numerical)
  • 2 Options:
    • lowest on top: arrange(x, ...)
    • highest on top: arrange(x, desc(...))


Example

arrange(RTR,mtime) #lowest on top
## # A tibble: 239,060 x 15
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Oadff95f3… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy …
##  2 Oc2852599… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy …
##  3 O48a47c74… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  4 O69f25c23… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  5 Oede0db57… 2014-01-01 00:27:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  6 O383e9306… 2014-01-01 00:41:00 2014-01-01 2014… Apple   3G     iPhone …
##  7 O4b28bf72… 2014-01-01 00:42:00 2014-01-01 2014… Apple   3G     iPhone …
##  8 Oef71e021… 2014-01-01 00:48:00 2014-01-01 2014… Pear    3G     iPhone …
##  9 Oef957387… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy …
## 10 O6c813ae1… 2014-01-01 01:07:00 2014-01-01 2014… Kiwi    3G     iPhone …
## # ... with 239,050 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>
arrange(RTR,desc(mtime)) #highest on top
## # A tibble: 239,060 x 15
##    id        mtime               mymd       mym   op_name nw_cat device   
##    <chr>     <dttm>              <date>     <chr> <chr>   <fct>  <fct>    
##  1 Of7c626b… 2015-11-24 23:54:27 2015-11-24 2015… Kiwi    3G     Galaxy S…
##  2 O8a0b863… 2015-11-24 23:48:12 2015-11-24 2015… Pear    4G     iPhone 6…
##  3 O50932ec… 2015-11-24 23:44:50 2015-11-24 2015… Apple   4G     iPhone 6 
##  4 O3d996a9… 2015-11-24 23:44:46 2015-11-24 2015… Apple   3G     Samsung …
##  5 O9bfca0d… 2015-11-24 23:43:20 2015-11-24 2015… Apple   3G     Samsung …
##  6 O5119e98… 2015-11-24 23:41:34 2015-11-24 2015… Pear    4G     Samsung …
##  7 O24c51b7… 2015-11-24 23:23:56 2015-11-24 2015… Pear    4G     Samsung …
##  8 Oafedb30… 2015-11-24 23:18:35 2015-11-24 2015… Pear    4G     Galaxy S…
##  9 O6036555… 2015-11-24 23:10:42 2015-11-24 2015… Pear    4G     Galaxy S…
## 10 Oe45067c… 2015-11-24 23:08:54 2015-11-24 2015… Pear    3G     Galaxy S…
## # ... with 239,050 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>

Example with a (non-ordered) factor

arrange(A, weekday)
## # A tibble: 1,096 x 5
##    ymd        weekday nr_weekday sum_out holiday
##    <fct>      <fct>        <int>   <int>   <dbl>
##  1 2007-01-05 Fri              5   25650   0.500
##  2 2007-01-12 Fri              5   13890   0.   
##  3 2007-01-19 Fri              5   13770   0.   
##  4 2007-01-26 Fri              5   14660   0.   
##  5 2007-02-02 Fri              5   17010   0.   
##  6 2007-02-09 Fri              5   20940   0.   
##  7 2007-02-16 Fri              5   15900   0.   
##  8 2007-02-23 Fri              5   13940   0.   
##  9 2007-03-02 Fri              5   20190   0.   
## 10 2007-03-09 Fri              5   18920   0.   
## # ... with 1,086 more rows


Arranging by multiple columns - priorities


Example

RTR_sel <- select(RTR, id, rtr_speed_ul,rtr_speed_dl,rtr_ping)
arrange(RTR_sel, rtr_speed_ul)
## # A tibble: 239,060 x 4
##    id                                   rtr_speed_ul rtr_speed_dl rtr_ping
##    <chr>                                       <dbl>        <dbl>    <dbl>
##  1 O432bcf6b-b63f-4d43-af52-87fb5adec5…           1.          42.     79.7
##  2 O9f5662ea-7408-401c-83cb-87911f3442…           1.          19.    220. 
##  3 Oe6465982-64f0-49d5-843e-45474ab0ca…           1.         418.     83.9
##  4 O3866a77e-67ec-4611-a7a4-dc5d33efbd…           1.         126.    320. 
##  5 O104a7cfe-8c62-4a8a-9afc-37c189965d…           1.        2250.     55.3
##  6 O546f6037-0dab-49d7-9b96-21c5e9cc9e…           1.          57.    279. 
##  7 O74d7ac03-ff7f-4b3c-8869-7398eb6cd2…           1.          20.    139. 
##  8 Oe9113807-1dbe-481f-aae9-b9e686c81f…           1.         105.    158. 
##  9 O1dab072d-bc1a-411a-a8b5-ecb590b054…           1.          92.    325. 
## 10 O88da59f3-ec80-4f48-a9a2-1b2eedeec7…           1.          23.    360. 
## # ... with 239,050 more rows
arrange(RTR_sel, rtr_speed_ul,rtr_speed_dl,rtr_ping)
## # A tibble: 239,060 x 4
##    id                                   rtr_speed_ul rtr_speed_dl rtr_ping
##    <chr>                                       <dbl>        <dbl>    <dbl>
##  1 O9f5662ea-7408-401c-83cb-87911f3442…           1.          19.    220. 
##  2 O74d7ac03-ff7f-4b3c-8869-7398eb6cd2…           1.          20.    139. 
##  3 O88da59f3-ec80-4f48-a9a2-1b2eedeec7…           1.          23.    360. 
##  4 O432bcf6b-b63f-4d43-af52-87fb5adec5…           1.          42.     79.7
##  5 O94402959-1e7d-4ed5-95d3-33bb4e99ac…           1.          42.    338. 
##  6 O2cbca85a-8bad-47bd-a977-0094b72787…           1.          52.    501. 
##  7 O546f6037-0dab-49d7-9b96-21c5e9cc9e…           1.          57.    279. 
##  8 O1dab072d-bc1a-411a-a8b5-ecb590b054…           1.          92.    325. 
##  9 Oe9113807-1dbe-481f-aae9-b9e686c81f…           1.         105.    158. 
## 10 O3866a77e-67ec-4611-a7a4-dc5d33efbd…           1.         126.    320. 
## # ... with 239,050 more rows



Combining arrange and mutate - arranging ranks


Example

### fastest device, within group
RTR %>%
  filter(!is.na(rtr_speed_ul) & !is.na(rtr_speed_dl))%>%
  group_by(device) %>% 
  mutate(total_speed = rtr_speed_dl + rtr_speed_ul, rank_per_device = rank(desc(total_speed))) %>%
  arrange(device, rank_per_device) %>%
  select(device, total_speed, rank_per_device)
## # A tibble: 239,060 x 3
## # Groups:   device [980]
##    device         total_speed rank_per_device
##    <fct>                <dbl>           <dbl>
##  1 02_jbla668          13209.              1.
##  2 2206_jbla768_f       8828.              1.
##  3 5042D              121736.              1.
##  4 5042D               95326.              2.
##  5 5042D               82375.              3.
##  6 5042D               76154.              4.
##  7 5042D               75454.              5.
##  8 5042D               72084.              6.
##  9 5042D               71631.              7.
## 10 5042D               70441.              8.
## # ... with 239,050 more rows
###fastest device, between group
RTR %>%
  filter(!is.na(rtr_speed_ul) & !is.na(rtr_speed_dl))%>%
  group_by(device) %>% 
  summarise(mean_total_speed = mean(rtr_speed_dl + rtr_speed_ul)) %>%
  mutate(rank_best_device = rank(desc(mean_total_speed)))%>%
  arrange(rank_best_device) %>%
  select(device, mean_total_speed, rank_best_device)
## # A tibble: 980 x 3
##    device                    mean_total_speed rank_best_device
##    <fct>                                <dbl>            <dbl>
##  1 STV100-4                           150651.               1.
##  2 PLK-UL00                           132022.               2.
##  3 E560                               124192.               3.
##  4 X12                                116530.               4.
##  5 Samsung Note 4 (SM-N910F)          115078.               5.
##  6 ZTE Blade A452                     114368.               6.
##  7 SM-N920F                           114082.               7.
##  8 SM-A510F                           112789.               8.
##  9 PLK-AL10                           112204.               9.
## 10 SM-A310F                           109553.              10.
## # ... with 970 more rows
### find the device with the smallest range in total speed
RTR %>%
  filter(!is.na(rtr_speed_ul) & !is.na(rtr_speed_dl))%>%
  group_by(device) %>% 
  summarise(range_tot_speed = diff(range(rtr_speed_dl + rtr_speed_ul)), n_devices = n()) %>%
  filter(n_devices > 5) %>%
  mutate(rank_smallest_range = rank(range_tot_speed)) %>%
  arrange(rank_smallest_range)
## # A tibble: 601 x 4
##    device                 range_tot_speed n_devices rank_smallest_range
##    <fct>                            <dbl>     <int>               <dbl>
##  1 ZP300                             158.        10                  1.
##  2 RC0709B                          1223.         8                  2.
##  3 RunboX5-King                     1579.         9                  3.
##  4 LA-Q1                            2196.         7                  4.
##  5 Galaxy Mini (GT-S5570)           2269.         7                  5.
##  6 C2005                            2299.        10                  6.
##  7 MEDION LIFE P4310                2512.        32                  7.
##  8 ZTE MEDION LIFE P4310            2790.         7                  8.
##  9 QUANTUM 4                        2946.         7                  9.
## 10 Sony Xperia tipo Dual            2990.         6                 10.
## # ... with 591 more rows


arrange()


1. Simple arrange call over RTR: arrange by upload-speed (note: see how a numeric gets ordered)
arrange(RTR, rtr_speed_ul)
## # A tibble: 239,060 x 15
##    id           mtime               mymd       mym   op_name nw_cat device
##    <chr>        <dttm>              <date>     <chr> <chr>   <fct>  <fct> 
##  1 O432bcf6b-b… 2014-01-02 14:53:00 2014-01-02 2014… Kiwi    3G     iPhon…
##  2 O9f5662ea-7… 2014-01-07 12:24:00 2014-01-07 2014… Kiwi    2G     iPhon…
##  3 Oe6465982-6… 2014-01-11 10:18:00 2014-01-11 2014… Apple   3G     iPhon…
##  4 O3866a77e-6… 2014-01-19 07:19:00 2014-01-19 2014… Pear    2G     iPhon…
##  5 O104a7cfe-8… 2014-01-26 19:45:00 2014-01-26 2014… Apple   3G     iPhon…
##  6 O546f6037-0… 2014-03-29 18:53:00 2014-03-29 2014… Pear    2G     iPhon…
##  7 O74d7ac03-f… 2014-04-26 15:21:00 2014-04-26 2014… Apple   3G     iPhon…
##  8 Oe9113807-1… 2014-05-05 10:17:00 2014-05-05 2014… Kiwi    2G     iPhon…
##  9 O1dab072d-b… 2014-05-24 15:57:00 2014-05-24 2014… Pear    2G     iPhon…
## 10 O88da59f3-e… 2014-08-09 14:24:00 2014-08-09 2014… Pear    2G     iPhon…
## # ... with 239,050 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


2. Arrange with desc() - arrange by upload-speed, highest on top and in second order by lowest ping
(e.g. person with fastest upload-speed should be on top and if 2 or more people have the same ul-speed,it gets ordered by the lower ping)!
arrange(RTR, desc(rtr_speed_ul), rtr_ping)
## # A tibble: 239,060 x 15
##    id          mtime               mymd       mym   op_name nw_cat device 
##    <chr>       <dttm>              <date>     <chr> <chr>   <fct>  <fct>  
##  1 O87536f15-… 2014-08-08 18:45:00 2014-08-08 2014… Pear    4G     Galaxy…
##  2 O430537d0-… 2015-11-18 21:57:42 2015-11-18 2015… Pear    4G     SM-G92…
##  3 Ofbc28c59-… 2014-09-08 16:58:00 2014-09-08 2014… Pear    4G     Galaxy…
##  4 O976b7a52-… 2014-11-07 12:14:40 2014-11-07 2014… Pear    4G     Galaxy…
##  5 Ocfddd626-… 2014-11-06 07:27:00 2014-11-06 2014… Pear    4G     Galaxy…
##  6 O008e8fca-… 2014-12-11 05:11:22 2014-12-11 2014… Pear    4G     A0001  
##  7 Od84cb3d1-… 2014-08-28 17:28:00 2014-08-28 2014… Pear    4G     Galaxy…
##  8 Of214831b-… 2014-09-04 06:21:00 2014-09-04 2014… Pear    4G     Galaxy…
##  9 Ofd40caec-… 2014-08-29 14:34:00 2014-08-29 2014… Pear    4G     Galaxy…
## 10 O4d895570-… 2014-10-19 09:04:39 2014-10-19 2014… Pear    4G     Sony X…
## # ... with 239,050 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


3. Arrange RTR by upload speed and in second order by the download speed and in third order by highest ping (slowest upload + slowest download + highest ping should be on top)!
arrange(RTR,rtr_speed_ul,rtr_speed_dl,desc(rtr_ping))
## # A tibble: 239,060 x 15
##    id           mtime               mymd       mym   op_name nw_cat device
##    <chr>        <dttm>              <date>     <chr> <chr>   <fct>  <fct> 
##  1 O9f5662ea-7… 2014-01-07 12:24:00 2014-01-07 2014… Kiwi    2G     iPhon…
##  2 O74d7ac03-f… 2014-04-26 15:21:00 2014-04-26 2014… Apple   3G     iPhon…
##  3 O88da59f3-e… 2014-08-09 14:24:00 2014-08-09 2014… Pear    2G     iPhon…
##  4 O94402959-1… 2014-08-24 10:08:00 2014-08-24 2014… Pear    2G     iPhon…
##  5 O432bcf6b-b… 2014-01-02 14:53:00 2014-01-02 2014… Kiwi    3G     iPhon…
##  6 O2cbca85a-8… 2015-07-06 08:42:42 2015-07-06 2015… Pear    2G     iPhon…
##  7 O546f6037-0… 2014-03-29 18:53:00 2014-03-29 2014… Pear    2G     iPhon…
##  8 O1dab072d-b… 2014-05-24 15:57:00 2014-05-24 2014… Pear    2G     iPhon…
##  9 Oe9113807-1… 2014-05-05 10:17:00 2014-05-05 2014… Kiwi    2G     iPhon…
## 10 O3866a77e-6… 2014-01-19 07:19:00 2014-01-19 2014… Pear    2G     iPhon…
## # ... with 239,050 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


4. Find the person with the slowest upload and download speed - create in the arrange-function a new variable “full_speed”, that is the sum of upload and download-speed and arrange by it (if you’ve done it correctly this happens automatically!) and in second order by highest ping and select only the device and the nw_cat!
temp1 <- arrange(RTR, full_speed = rtr_speed_dl + rtr_speed_ul, desc(rtr_ping))
select(temp1, device, nw_cat)
## # A tibble: 239,060 x 2
##    device         nw_cat
##    <fct>          <fct> 
##  1 iPhone 5s      2G    
##  2 iPhone 5s      2G    
##  3 iPhone 5s      2G    
##  4 Galaxy S3      2G    
##  5 iPhone 5       2G    
##  6 iPhone 5s      2G    
##  7 iPad 2 GSM     3G    
##  8 Sony Xperia SP 2G    
##  9 HTC One        3G    
## 10 Galaxy S2      2G    
## # ... with 239,050 more rows



summarise()

What does summarise() do?

  • uses dataset to create summary statistics
    • is a new dataset with length 1 (if data not grouped)
  • Syntax: summarise(.data,…)
    • are name-value pairs of comma separated summary functions
      • name = name of the variable in the results
      • value = an expression that returns a single value (e.g. min(x))

Example

summarise(RTR, max_download_speed = max(rtr_speed_dl))
## # A tibble: 1 x 1
##   max_download_speed
##                <dbl>
## 1            149989.


Which functions can be used in summarise?


Small but useful:

(a few) Functions Semantic
min(x) smallest value of x
max(x) gives back the biggest value of x
n() number of observations
n_distinct(x) number of unique observations of x
mean(x) mean of x
median(x) median of x


Example

RTR_fil <- filter(RTR, !is.na(rtr_speed_dl))
summarise(RTR_fil, n_devices = n(), n_unique_devices = n_distinct(device), mean_dl_speed = mean(rtr_speed_dl))
## # A tibble: 1 x 3
##   n_devices n_unique_devices mean_dl_speed
##       <int>            <int>         <dbl>
## 1    239060              980        21904.


Why should i use summarise()?



summarise()


1. Use summarise to give back the amount of devices (call the variable “n_observed”) and the amount of unique devices (call it “unique_devices”)
summarise(RTR, n_observed = n(), unique_devices = n_distinct(device))
## # A tibble: 1 x 2
##   n_observed unique_devices
##        <int>          <int>
## 1     239060            980


2. Use summarise to get the biggest download-speed, the smallest upload-speed, the median date (mymd) and the mean ping!
summarise(RTR, max(rtr_speed_dl), min(rtr_speed_ul), median(mymd), mean(rtr_ping))
## # A tibble: 1 x 4
##   `max(rtr_speed_dl)` `min(rtr_speed_ul)` `median(mymd)` `mean(rtr_ping)`
##                 <dbl>               <dbl> <date>                    <dbl>
## 1             149989.                  1. 2015-01-20                 54.1


3. Combination: Find out the number of devices (“n_devices”), the number of unique devices (“n_unique_devices”) and the mean download speed of devices that run on Android.
(Tipp: Filter the missing values in rtr_speed_dl & Android-Phones (device_platform) and then use summarise to get the variables)
temp1 <- filter(RTR, !is.na(rtr_speed_dl) & device_platform == "Android")
summarise(temp1, n_devices = n(), n_unique_devices = n_distinct(device), mean_dl_speed = mean(rtr_speed_dl))
## # A tibble: 1 x 3
##   n_devices n_unique_devices mean_dl_speed
##       <int>            <int>         <dbl>
## 1    154519              951        20261.



Pipe Operator

  • Syntax: .data %>% function(x) equivalent to function(.data,x)
    • .data = df/tbl
    • x = any parameters

summarise(RTR,min_longitude=min(longitude),max_longitude=max(longitude)) 
RTR %>% 
  summarise(min_longitude=min(longitude),max_longitude=max(longitude)) 
## # A tibble: 1 x 2
##   min_longitude max_longitude
##           <dbl>         <dbl>
## 1          9.54          17.1
summarise(RTR,min_latitude=min(latitude),max_latitude=max(latitude)) 
RTR %>% 
  summarise(min_latitude=min(latitude),max_latitude=max(latitude)) 
## # A tibble: 1 x 2
##   min_latitude max_latitude
##          <dbl>        <dbl>
## 1         46.4         49.0



### Piping


1. In this lab you have to calculate the mean download speed and the median upload speed of the RTR data using the pipe operator. Below you can see how the output should look like.
RTR %>% 
  summarise(mean_rtr_speed_dl = mean(rtr_speed_dl, na.rm = TRUE), median_rtr_speed_ul=median(rtr_speed_ul,na.rm = TRUE)) 
## # A tibble: 1 x 2
##   mean_rtr_speed_dl median_rtr_speed_ul
##               <dbl>               <dbl>
## 1            21904.               2656.


2. In this lab you have to calculate the mean download speed for all devices using the LTE technology.
RTR %>% 
  filter(device_has_lte==TRUE) %>% 
  summarise(mean_rtr_speed_dl=mean(rtr_speed_dl))
## # A tibble: 1 x 1
##   mean_rtr_speed_dl
##               <dbl>
## 1            27156.


3. In this lab you have to calculate the median download speed for all devices using the LTE technology.
RTR %>% 
  filter(device_has_lte==TRUE) %>% 
  summarise(median_rtr_speed_dl=median(rtr_speed_dl))
## # A tibble: 1 x 1
##   median_rtr_speed_dl
##                 <dbl>
## 1              16838.


4. In this lab you have to calculate the median and the maximum download speed for all devices which do not use the LTE technology.
RTR %>% 
  filter(device_has_lte==FALSE) %>% 
  summarise(median_rtr_speed_dl=median(rtr_speed_dl), max_rtr_speed_dl=max(rtr_speed_dl))
## # A tibble: 1 x 2
##   median_rtr_speed_dl max_rtr_speed_dl
##                 <dbl>            <dbl>
## 1               3261.          127231.


5. Calculate the minimum value of all columns containing the string ‘rtr’. This summary should just include devices of the Pear operator.
RTR %>% 
  filter(op_name=='Pear') %>% 
  select(contains('rtr')) %>% 
  summarise_all(funs(mean))
## # A tibble: 1 x 3
##   rtr_speed_dl rtr_speed_ul rtr_ping
##          <dbl>        <dbl>    <dbl>
## 1       25213.        9063.     51.4


6. Calculate the mean download speed for all Galaxy Note 2 with ping > 70 and upload speed >2000. This summary should just include devices of the Kiwi operator.
RTR %>% 
  filter(device=='Galaxy Note 2', op_name=='Kiwi',rtr_ping > 70, rtr_speed_ul> 2000) %>% 
  summarise(mean_rtr_speed_dl = mean(rtr_speed_dl))
## # A tibble: 1 x 1
##   mean_rtr_speed_dl
##               <dbl>
## 1             5658.


7. Salzburg’s geographic longitude is 13.033333. Compare the median download rate of regions above (east) and below (west) this longitude.
e_median_dl=RTR %>% 
  filter(longitude>13.033333) %>% 
  summarise(east_median_dl= median(rtr_speed_dl))
w_median_dl= RTR %>% 
  filter(longitude<13.033333) %>% 
  summarise(west_median_dl= median(rtr_speed_dl))

e_median_dl=e_median_dl$east_median_dl
w_median_dl=w_median_dl$west_median_dl
cat(paste('The median download rate is ',e_median_dl,' in the east of Salzburg and ',w_median_dl,' in the west.',sep=''))
## The median download rate is 11034 in the east of Salzburg and 9478 in the west.


8. Calculate the median geographic latitude. Then compare the median upload rate between above and under this line.
lat=RTR %>% 
  summarise(latitude_median= median(latitude))
lat_median=lat$latitude_median[1]
north_median=RTR %>% 
  filter(latitude>lat_median) %>% 
  summarise(north_median_ul= median(rtr_speed_ul))
south_median=RTR %>% 
  filter(latitude<lat_median) %>%
  summarise(south_median_ul= median(rtr_speed_ul))
north_median=north_median$north_median_ul
south_median=south_median$south_median_ul
cat(paste("The latitude median is ",lat_median,'. The obtained median upload rate north this line is ',north_median,' and south \n this line ',south_median, '.',sep=''))
## The latitude median is 48.08865835. The obtained median upload rate north this line is 2807 and south 
##  this line 2506.



group_by()

  • Breaks down a dataset into groups of rows
  • Verbs are then applied by group
  • Used to inspect and compare group differences


How does group_by() affect the basic verbs?

Verb Effect when grouped
select() Same effect as ungrouped. Grouping variables are always retained
arrange() Same effect as ungrouped. When .by_group = TRUE it first orders by grouping variables
filter() Same effect as ungrouped. Useful functions like rank() or min(x) == x
mutate() Same effect as ungrouped. Useful functions like rank() or min(x) == x
summarise() Computes the summary for each group


Ergo: group_by() is most powerful when combined with summarise()!


Example

group_by(RTR, device)
## # A tibble: 239,060 x 15
## # Groups:   device [980]
##    id         mtime               mymd       mym   op_name nw_cat device  
##  * <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Oadff95f3… 2014-01-01 00:10:00 2014-01-01 2014… Pear    3G     Galaxy …
##  2 Oc2852599… 2014-01-01 00:16:00 2014-01-01 2014… Pear    3G     Galaxy …
##  3 O48a47c74… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  4 O69f25c23… 2014-01-01 00:21:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  5 Oede0db57… 2014-01-01 00:27:00 2014-01-01 2014… Kiwi    3G     Galaxy …
##  6 O383e9306… 2014-01-01 00:41:00 2014-01-01 2014… Apple   3G     iPhone …
##  7 O4b28bf72… 2014-01-01 00:42:00 2014-01-01 2014… Apple   3G     iPhone …
##  8 Oef71e021… 2014-01-01 00:48:00 2014-01-01 2014… Pear    3G     iPhone …
##  9 Oef957387… 2014-01-01 01:05:00 2014-01-01 2014… Pear    3G     Galaxy …
## 10 O6c813ae1… 2014-01-01 01:07:00 2014-01-01 2014… Kiwi    3G     iPhone …
## # ... with 239,050 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


When there are multiple grouping variables, every execution of summarise() peels off one level of the grouping
dev <- group_by(RTR, device_platform, device)
(per_dev <- summarise(dev, count = n()))
## # A tibble: 980 x 3
## # Groups:   device_platform [?]
##    device_platform device           count
##    <fct>           <fct>            <int>
##  1 Android         02_jbla668           1
##  2 Android         2206_jbla768_f       1
##  3 Android         5042D               43
##  4 Android         6039Y               10
##  5 Android         6045K                6
##  6 Android         6045Y                7
##  7 Android         7043K                1
##  8 Android         709v82_jbla118      14
##  9 Android         8030Y                2
## 10 Android         96WJ99PJ76LT39B1     6
## # ... with 970 more rows
(per_plat <- summarise(per_dev, count = sum(count)))
## # A tibble: 2 x 2
##   device_platform  count
##   <fct>            <int>
## 1 Android         154519
## 2 iOS              84541


Be careful when rolling up summaries!

Think about the weighting for means and variances!


Application of the pipe operator
dl <- RTR %>%
  group_by(device) %>%
  summarise(count = n(), dl = mean(rtr_speed_dl, na.rm = TRUE), ping = mean(rtr_ping, na.rm = TRUE)) %>%
  filter(dl < 60000, count > 100)

ggplot(dl, aes(dl, ping)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()



group_by()

1. First calculate the mean upload speed per device then calculate the weighted mean upload speed per mobile phone standard (nw_cat)! Finally rank the mean in descending order!
per_device <- RTR %>%
  group_by(nw_cat, device) %>%
  summarise(count_dev = n(), ul = mean(rtr_speed_ul, na.rm = TRUE))
(per_nw_cat <- per_device %>% 
  summarise(count_nw = n(), ul = weighted.mean(ul, count_dev, na.rm = TRUE)) %>%
  mutate(rank = rank(desc(ul))) %>%
  arrange(rank))
## # A tibble: 3 x 4
##   nw_cat count_nw     ul  rank
##   <fct>     <int>  <dbl> <dbl>
## 1 4G          302 19625.    1.
## 2 3G          913  1966.    2.
## 3 2G          462   125.    3.



Complex integration of verbs


Complex examples

Explore the use of group-by in combination with summarise:

1. Take the RTR-set and Write a function, that lets you see, which devices were most frequently used from most used to least used - most used should have rank 1!
(tipp: use group_by, summarise, mutate (for a (descending) ranking) and arrange by the ranking)
RTR %>%
  group_by(device) %>%
  summarise(freq_of_use = n()) %>%
  mutate(rank = rank(desc(freq_of_use))) %>%
  arrange(rank)
## # A tibble: 980 x 3
##    device                freq_of_use  rank
##    <fct>                       <int> <dbl>
##  1 iPhone 6                    21743    1.
##  2 iPhone 5s                   18917    2.
##  3 iPhone 5                    14031    3.
##  4 Galaxy S4 LTE               10634    4.
##  5 Galaxy S5                   10366    5.
##  6 Galaxy S3                    9551    6.
##  7 iPhone 6 Plus                6734    7.
##  8 LGE Nexus 5                  5702    8.
##  9 iPad mini 2G Cellular        5638    9.
## 10 Galaxy S4 LTE-A              4852   10.
## # ... with 970 more rows


Using the pipe-operators and a combination of dplyr-functions:

2. Take the RTR set and filter for downloadspeed and uploadspeed greater then their respective means. Then create a new variable (with mutate) “sum_speed” that is the sum of the downloadspeed and the uploadspeed. create another new variable “rank_speed” where you rank “sum_speed” descendingly. Arrange by “rank_speed” and in the final output show only “sum_speed” and “rank_speed”

RTR %>%
  select(rtr_speed_dl, rtr_speed_ul) %>%
  filter(rtr_speed_dl > mean(rtr_speed_dl) & rtr_speed_ul > mean(rtr_speed_ul)) %>%
  mutate(sum_speed = (rtr_speed_dl + rtr_speed_ul), rank_speed = rank(desc(sum_speed))) %>%
  arrange(rank_speed) %>%
  select(sum_speed, rank_speed)
## # A tibble: 55,288 x 2
##    sum_speed rank_speed
##        <dbl>      <dbl>
##  1   197710.         1.
##  2   196830.         2.
##  3   196075.         3.
##  4   195933.         4.
##  5   195860.         5.
##  6   195675.         6.
##  7   195147.         7.
##  8   194772.         8.
##  9   194707.         9.
## 10   194615.        10.
## # ... with 55,278 more rows


3. Now write your own nested summarise-code: your last summarise should show how many unique devices run on which nw_cat further grouped by device_platform.

Tip: your group-by call needs 3 columns and afterwards you need 2 summarise calls.

Tip 2: use device_platform, nw_cat, device in this order; summarise with number_devices = n() to get number of devices per model and further summarise the n_distinct of number_devices to see how many unique models run on their respective nw-category split into Android and iOS-phones

RTR %>%
  group_by(device_platform, nw_cat,device) %>%
  summarise(n_per_device = n()) %>%
  summarise(unique_devices = n_distinct(n_per_device))
## # A tibble: 6 x 3
## # Groups:   device_platform [?]
##   device_platform nw_cat unique_devices
##   <fct>           <fct>           <int>
## 1 Android         2G                 71
## 2 Android         3G                179
## 3 Android         4G                117
## 4 iOS             2G                 18
## 5 iOS             3G                 28
## 6 iOS             4G                 20

Joins

What does join do?

  • Syntax: e.g. inner_join(x,y,by=NULL,copy = FALSE,suffix = c(".x",".y"),...)
    • joins 2 tbls (x,y) with a shared character vector of variables (by =“key-col”) together
      • by default: “natural join” -> all variables with common names across the 2 sets
      • 6 different join-functions in dplyr
    • best case: variables in key-col match 100 percent - bind_rows would be enough
    • usually: only a few matching variables in x and y
      • different joins - different output


Why you should use join and not merge

  • Keeps order of rows as is
  • Syntax is easier to read, learn and “more intuitive”
  • And: can be used on databases (connect and change it there with dplyr-code)


Example for explicitly stated & natural joins

file1: rows 0 - 20
file2: rows 10 - 30
(temp1 <- left_join(file1, file2, by = "id")) #explicitly stating keycol
## # A tibble: 20 x 15
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe3… Galax… Android         TRUE                  3340.          65.
##  2 Offfd1… Galax… Android         TRUE                 13085.         827.
##  3 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  4 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  5 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  6 Offfb4… Galax… Android         TRUE                 15259.        3160.
##  7 Offff0… Galax… Android         TRUE                  2486.          58.
##  8 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
##  9 Offfe8… Galax… Android         TRUE                 25868.        3362.
## 10 Offfd5… Galax… Android         TRUE                 63335.       22201.
## 11 Offfeb… SM-G9… Android         TRUE                 45008.       19824.
## 12 Offfee… HTC D… Android         TRUE                 39025.       18378.
## 13 Offfe4… iPhon… iOS             TRUE                 40061.         987.
## 14 Offfec… iPhon… iOS             TRUE                  3810.        1343.
## 15 Offfb6… HTC D… Android         FALSE                  655.         872.
## 16 Offffa… iPhon… iOS             TRUE                 59876.       13818.
## 17 Offfe4… Galax… Android         TRUE                140156.       48355.
## 18 Offfff… iPhon… iOS             TRUE                105078.       25598.
## 19 Offff8… iPhon… iOS             FALSE                 5608.        1984.
## 20 Offfe5… iPad … iOS             TRUE                 51751.       22994.
## # ... with 9 more variables: rtr_ping <dbl>, mtime <dttm>, mymd <date>,
## #   mym <chr>, op_name <chr>, nw_cat <fct>, longitude <dbl>,
## #   latitude <dbl>, iso_adm2 <fct>
(temp2 <- merge(file1, file2, by = "id", all.x = T))
##                                       id                      device
## 1  Offfb0e8a-8a4e-46d1-ae82-3ccd58e48964                   iPhone 5s
## 2  Offfb42a3-58d7-469e-bbfd-53031a70836c              Galaxy S4 Mini
## 3  Offfb681e-19c5-4285-b29e-81984e6bd36e                HTC Desire S
## 4  Offfbe951-bf98-4653-97e0-b4da1f6683f8            Galaxy S4 Active
## 5  Offfcf842-b732-4b28-b083-ea2878afbcd4            Galaxy S4 Active
## 6  Offfd173b-89ee-440c-bfb2-56a6efac165a             Galaxy S4 LTE-A
## 7  Offfd5337-2602-4a82-8f6d-37acfaaf754c               Galaxy S3 LTE
## 8  Offfdc5a7-7490-463d-abfe-e92b7e083e21                   HTC One S
## 9  Offfe3f7c-51f6-4b98-a833-a7b1e43eef99             Galaxy S4 LTE-A
## 10 Offfe464e-3c77-4bd6-b6f2-18a5646f8481                    iPhone 6
## 11 Offfe4ba7-55fe-4537-bf2b-1f9402709510             Galaxy S4 LTE-A
## 12 Offfe5a19-84ae-4909-9cae-b3443d1204da iPad Air 2 (Wi-Fi/Cellular)
## 13 Offfe8df3-5180-411e-965a-b8ebc824676d               Galaxy S4 LTE
## 14 Offfeb095-0ff8-4cb9-b276-1c9e97e637c5                    SM-G925F
## 15 Offfecdb4-76d0-4697-be0e-11b2911963ac                    iPhone 5
## 16 Offfeec5c-4b40-46a4-ae97-c121f326e1e2              HTC Desire 510
## 17 Offff0ee1-1cb0-4b23-ac4c-aa8b704b62f9                   Galaxy S5
## 18 Offff830e-3122-4992-ab3e-a19fbef43747                   iPhone 4s
## 19 Offffa9ea-865d-4f7e-8d14-e5be6006bd1a              iPhone 6s Plus
## 20 Offfffc9d-75a5-4665-9aa1-c7378f0811c2                   iPhone 6s
##    device_platform device_has_lte rtr_speed_dl rtr_speed_ul rtr_ping
## 1              iOS           TRUE         3835         2137  37.2755
## 2          Android           TRUE        15259         3160  38.0284
## 3          Android          FALSE          655          872  50.3343
## 4          Android           TRUE         3520         1877  47.8821
## 5          Android           TRUE        19065         1750  39.5813
## 6          Android           TRUE        13085          827  31.5617
## 7          Android           TRUE        63335        22201  25.0297
## 8          Android          FALSE         3962         3012  70.8923
## 9          Android           TRUE         3340           65  37.9499
## 10             iOS           TRUE        40061          987  39.9457
## 11         Android           TRUE       140156        48355  19.7262
## 12             iOS           TRUE        51751        22994  21.2528
## 13         Android           TRUE        25868         3362  40.4511
## 14         Android           TRUE        45008        19824  42.7394
## 15             iOS           TRUE         3810         1343  39.1077
## 16         Android           TRUE        39025        18378  18.7053
## 17         Android           TRUE         2486           58 140.3440
## 18             iOS          FALSE         5608         1984  60.9711
## 19             iOS           TRUE        59876        13818  39.8866
## 20             iOS           TRUE       105078        25598  25.3622
##                  mtime       mymd     mym op_name nw_cat longitude
## 1  2014-12-22 18:29:19 2014-12-22 2014-12    Kiwi     3G  16.34548
## 2  2014-08-28 09:28:00 2014-08-28 2014-08   Apple     3G  16.57574
## 3  2015-09-10 18:03:04 2015-09-10 2015-09    Pear     3G  16.32590
## 4  2014-05-22 09:57:00 2014-05-22 2014-05   Apple     3G  14.33279
## 5  2014-07-21 20:10:00 2014-07-21 2014-07    Pear     3G  12.56355
## 6  2014-05-19 13:51:00 2014-05-19 2014-05    Pear     3G  15.99446
## 7  2015-03-27 08:46:18 2015-03-27 2015-03    Kiwi     4G  16.76051
## 8  2014-06-24 19:46:00 2014-06-24 2014-06   Apple     3G  13.63181
## 9  2014-02-22 08:34:00 2014-02-22 2014-02    Pear     3G  15.89740
## 10 2015-08-19 09:47:50 2015-08-19 2015-08   Apple     4G  14.31039
## 11                <NA>       <NA>    <NA>    <NA>   <NA>        NA
## 12                <NA>       <NA>    <NA>    <NA>   <NA>        NA
## 13                <NA>       <NA>    <NA>    <NA>   <NA>        NA
## 14                <NA>       <NA>    <NA>    <NA>   <NA>        NA
## 15                <NA>       <NA>    <NA>    <NA>   <NA>        NA
## 16                <NA>       <NA>    <NA>    <NA>   <NA>        NA
## 17                <NA>       <NA>    <NA>    <NA>   <NA>        NA
## 18                <NA>       <NA>    <NA>    <NA>   <NA>        NA
## 19                <NA>       <NA>    <NA>    <NA>   <NA>        NA
## 20                <NA>       <NA>    <NA>    <NA>   <NA>        NA
##    latitude iso_adm2
## 1  48.15482   at0900
## 2  48.30935   at0308
## 3  48.24507   at0900
## 4  46.61040   at0201
## 5  47.58698   at0704
## 6  46.68344   at0623
## 7  47.94608   at0107
## 8  48.00336   at0417
## 9  47.24231   at0622
## 10 46.62826   at0201
## 11       NA     <NA>
## 12       NA     <NA>
## 13       NA     <NA>
## 14       NA     <NA>
## 15       NA     <NA>
## 16       NA     <NA>
## 17       NA     <NA>
## 18       NA     <NA>
## 19       NA     <NA>
## 20       NA     <NA>
setequal(temp1, temp2)
## TRUE
identical(temp1, temp2)
## [1] FALSE
left_join(file1, file2) #natural join - see message
## Joining, by = "id"
## # A tibble: 20 x 15
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe3… Galax… Android         TRUE                  3340.          65.
##  2 Offfd1… Galax… Android         TRUE                 13085.         827.
##  3 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  4 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  5 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  6 Offfb4… Galax… Android         TRUE                 15259.        3160.
##  7 Offff0… Galax… Android         TRUE                  2486.          58.
##  8 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
##  9 Offfe8… Galax… Android         TRUE                 25868.        3362.
## 10 Offfd5… Galax… Android         TRUE                 63335.       22201.
## 11 Offfeb… SM-G9… Android         TRUE                 45008.       19824.
## 12 Offfee… HTC D… Android         TRUE                 39025.       18378.
## 13 Offfe4… iPhon… iOS             TRUE                 40061.         987.
## 14 Offfec… iPhon… iOS             TRUE                  3810.        1343.
## 15 Offfb6… HTC D… Android         FALSE                  655.         872.
## 16 Offffa… iPhon… iOS             TRUE                 59876.       13818.
## 17 Offfe4… Galax… Android         TRUE                140156.       48355.
## 18 Offfff… iPhon… iOS             TRUE                105078.       25598.
## 19 Offff8… iPhon… iOS             FALSE                 5608.        1984.
## 20 Offfe5… iPad … iOS             TRUE                 51751.       22994.
## # ... with 9 more variables: rtr_ping <dbl>, mtime <dttm>, mymd <date>,
## #   mym <chr>, op_name <chr>, nw_cat <fct>, longitude <dbl>,
## #   latitude <dbl>, iso_adm2 <fct>


Mutating joins

  • join 2 tbls and return a new dataset
  • 4 different functions:
    • left_join: matching rows of y get joined into x
    • right_join: matching rows of x get joined into y
    • inner_join: “exclusive join” only rows with matches in the other set get joined
    • full_join: every row from y gets merged into x


Examples mutating joins

left_join(file1, file2, by = "id")
## # A tibble: 20 x 15
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe3… Galax… Android         TRUE                  3340.          65.
##  2 Offfd1… Galax… Android         TRUE                 13085.         827.
##  3 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  4 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  5 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  6 Offfb4… Galax… Android         TRUE                 15259.        3160.
##  7 Offff0… Galax… Android         TRUE                  2486.          58.
##  8 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
##  9 Offfe8… Galax… Android         TRUE                 25868.        3362.
## 10 Offfd5… Galax… Android         TRUE                 63335.       22201.
## 11 Offfeb… SM-G9… Android         TRUE                 45008.       19824.
## 12 Offfee… HTC D… Android         TRUE                 39025.       18378.
## 13 Offfe4… iPhon… iOS             TRUE                 40061.         987.
## 14 Offfec… iPhon… iOS             TRUE                  3810.        1343.
## 15 Offfb6… HTC D… Android         FALSE                  655.         872.
## 16 Offffa… iPhon… iOS             TRUE                 59876.       13818.
## 17 Offfe4… Galax… Android         TRUE                140156.       48355.
## 18 Offfff… iPhon… iOS             TRUE                105078.       25598.
## 19 Offff8… iPhon… iOS             FALSE                 5608.        1984.
## 20 Offfe5… iPad … iOS             TRUE                 51751.       22994.
## # ... with 9 more variables: rtr_ping <dbl>, mtime <dttm>, mymd <date>,
## #   mym <chr>, op_name <chr>, nw_cat <fct>, longitude <dbl>,
## #   latitude <dbl>, iso_adm2 <fct>
right_join(file1, file2, by = "id")
## # A tibble: 20 x 15
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe4… iPhon… iOS             TRUE                 40061.         987.
##  2 Offfe3… Galax… Android         TRUE                  3340.          65.
##  3 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  4 Offfd5… Galax… Android         TRUE                 63335.       22201.
##  5 Offfd1… Galax… Android         TRUE                 13085.         827.
##  6 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  7 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  8 Offfb6… HTC D… Android         FALSE                  655.         872.
##  9 Offfb4… Galax… Android         TRUE                 15259.        3160.
## 10 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
## 11 Offfaf… <NA>   <NA>            NA                      NA           NA 
## 12 Offfaf… <NA>   <NA>            NA                      NA           NA 
## 13 Offfa1… <NA>   <NA>            NA                      NA           NA 
## 14 Offf97… <NA>   <NA>            NA                      NA           NA 
## 15 Offf97… <NA>   <NA>            NA                      NA           NA 
## 16 Offf93… <NA>   <NA>            NA                      NA           NA 
## 17 Offf92… <NA>   <NA>            NA                      NA           NA 
## 18 Offf91… <NA>   <NA>            NA                      NA           NA 
## 19 Offf90… <NA>   <NA>            NA                      NA           NA 
## 20 Offf8c… <NA>   <NA>            NA                      NA           NA 
## # ... with 9 more variables: rtr_ping <dbl>, mtime <dttm>, mymd <date>,
## #   mym <chr>, op_name <chr>, nw_cat <fct>, longitude <dbl>,
## #   latitude <dbl>, iso_adm2 <fct>
inner_join(file1, file2, by = "id")
## # A tibble: 10 x 15
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe3… Galax… Android         TRUE                  3340.          65.
##  2 Offfd1… Galax… Android         TRUE                 13085.         827.
##  3 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  4 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  5 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  6 Offfb4… Galax… Android         TRUE                 15259.        3160.
##  7 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
##  8 Offfd5… Galax… Android         TRUE                 63335.       22201.
##  9 Offfe4… iPhon… iOS             TRUE                 40061.         987.
## 10 Offfb6… HTC D… Android         FALSE                  655.         872.
## # ... with 9 more variables: rtr_ping <dbl>, mtime <dttm>, mymd <date>,
## #   mym <chr>, op_name <chr>, nw_cat <fct>, longitude <dbl>,
## #   latitude <dbl>, iso_adm2 <fct>
full_join(file1, file2, by = "id")
## # A tibble: 30 x 15
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe3… Galax… Android         TRUE                  3340.          65.
##  2 Offfd1… Galax… Android         TRUE                 13085.         827.
##  3 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  4 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  5 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  6 Offfb4… Galax… Android         TRUE                 15259.        3160.
##  7 Offff0… Galax… Android         TRUE                  2486.          58.
##  8 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
##  9 Offfe8… Galax… Android         TRUE                 25868.        3362.
## 10 Offfd5… Galax… Android         TRUE                 63335.       22201.
## # ... with 20 more rows, and 9 more variables: rtr_ping <dbl>,
## #   mtime <dttm>, mymd <date>, mym <chr>, op_name <chr>, nw_cat <fct>,
## #   longitude <dbl>, latitude <dbl>, iso_adm2 <fct>


What to do, when key columns don’t match by name?

file1 <- rename(file1, id_number = id)

left_join(file1, file2) #with error
## Error: `by` required, because the data sources have no common variables
left_join(file1, file2, by = c("id_number" = "id")) #solution
## # A tibble: 20 x 15
##    id_number        device     device_platform device_has_lte rtr_speed_dl
##    <chr>            <fct>      <fct>           <lgl>                 <dbl>
##  1 Offfe3f7c-51f6-… Galaxy S4… Android         TRUE                  3340.
##  2 Offfd173b-89ee-… Galaxy S4… Android         TRUE                 13085.
##  3 Offfbe951-bf98-… Galaxy S4… Android         TRUE                  3520.
##  4 Offfdc5a7-7490-… HTC One S  Android         FALSE                 3962.
##  5 Offfcf842-b732-… Galaxy S4… Android         TRUE                 19065.
##  6 Offfb42a3-58d7-… Galaxy S4… Android         TRUE                 15259.
##  7 Offff0ee1-1cb0-… Galaxy S5  Android         TRUE                  2486.
##  8 Offfb0e8a-8a4e-… iPhone 5s  iOS             TRUE                  3835.
##  9 Offfe8df3-5180-… Galaxy S4… Android         TRUE                 25868.
## 10 Offfd5337-2602-… Galaxy S3… Android         TRUE                 63335.
## 11 Offfeb095-0ff8-… SM-G925F   Android         TRUE                 45008.
## 12 Offfeec5c-4b40-… HTC Desir… Android         TRUE                 39025.
## 13 Offfe464e-3c77-… iPhone 6   iOS             TRUE                 40061.
## 14 Offfecdb4-76d0-… iPhone 5   iOS             TRUE                  3810.
## 15 Offfb681e-19c5-… HTC Desir… Android         FALSE                  655.
## 16 Offffa9ea-865d-… iPhone 6s… iOS             TRUE                 59876.
## 17 Offfe4ba7-55fe-… Galaxy S4… Android         TRUE                140156.
## 18 Offfffc9d-75a5-… iPhone 6s  iOS             TRUE                105078.
## 19 Offff830e-3122-… iPhone 4s  iOS             FALSE                 5608.
## 20 Offfe5a19-84ae-… iPad Air … iOS             TRUE                 51751.
## # ... with 10 more variables: rtr_speed_ul <dbl>, rtr_ping <dbl>,
## #   mtime <dttm>, mymd <date>, mym <chr>, op_name <chr>, nw_cat <fct>,
## #   longitude <dbl>, latitude <dbl>, iso_adm2 <fct>


Filtering joins


Examples filtering joins

semi_join(file1, file2)
## Joining, by = "id"
## # A tibble: 10 x 7
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe3… Galax… Android         TRUE                  3340.          65.
##  2 Offfd1… Galax… Android         TRUE                 13085.         827.
##  3 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  4 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  5 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  6 Offfb4… Galax… Android         TRUE                 15259.        3160.
##  7 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
##  8 Offfd5… Galax… Android         TRUE                 63335.       22201.
##  9 Offfe4… iPhon… iOS             TRUE                 40061.         987.
## 10 Offfb6… HTC D… Android         FALSE                  655.         872.
## # ... with 1 more variable: rtr_ping <dbl>
anti_join(file1, file2)
## Joining, by = "id"
## # A tibble: 10 x 7
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offff0… Galax… Android         TRUE                  2486.          58.
##  2 Offfe8… Galax… Android         TRUE                 25868.        3362.
##  3 Offfeb… SM-G9… Android         TRUE                 45008.       19824.
##  4 Offfee… HTC D… Android         TRUE                 39025.       18378.
##  5 Offfec… iPhon… iOS             TRUE                  3810.        1343.
##  6 Offffa… iPhon… iOS             TRUE                 59876.       13818.
##  7 Offfe4… Galax… Android         TRUE                140156.       48355.
##  8 Offfff… iPhon… iOS             TRUE                105078.       25598.
##  9 Offff8… iPhon… iOS             FALSE                 5608.        1984.
## 10 Offfe5… iPad … iOS             TRUE                 51751.       22994.
## # ... with 1 more variable: rtr_ping <dbl>


What can go wrong in joins?




Joins

Creating files to work with - execute the code for file 1 and file 2

file1 contains the first 20 rows of data about devices and their respective speed + the id-variable
file 2 contains rows 10 to 30 timestamps and geographical data + the id-variable -


Mutating joins

The syntax is easy, it is just important to see, what every single function does with your set - use file1 and file2 to join and look at the output -
1.1. Can you recreate the same output with 4 different codes (all left_join())?
## left_join
# method 1 - natural call
left_join(file1, file2)

# method 2 - explicitly define the id
left_join(file1, file2, by = "id")

# method 3 - explicit call with pipe-operator
file1 %>%
  left_join(file2, by = "id")

# method 4 - natural call with pipe operator
file1 %>%
  left_join(file2)
## # A tibble: 20 x 15
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe3… Galax… Android         TRUE                  3340.          65.
##  2 Offfd1… Galax… Android         TRUE                 13085.         827.
##  3 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  4 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  5 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  6 Offfb4… Galax… Android         TRUE                 15259.        3160.
##  7 Offff0… Galax… Android         TRUE                  2486.          58.
##  8 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
##  9 Offfe8… Galax… Android         TRUE                 25868.        3362.
## 10 Offfd5… Galax… Android         TRUE                 63335.       22201.
## 11 Offfeb… SM-G9… Android         TRUE                 45008.       19824.
## 12 Offfee… HTC D… Android         TRUE                 39025.       18378.
## 13 Offfe4… iPhon… iOS             TRUE                 40061.         987.
## 14 Offfec… iPhon… iOS             TRUE                  3810.        1343.
## 15 Offfb6… HTC D… Android         FALSE                  655.         872.
## 16 Offffa… iPhon… iOS             TRUE                 59876.       13818.
## 17 Offfe4… Galax… Android         TRUE                140156.       48355.
## 18 Offfff… iPhon… iOS             TRUE                105078.       25598.
## 19 Offff8… iPhon… iOS             FALSE                 5608.        1984.
## 20 Offfe5… iPad … iOS             TRUE                 51751.       22994.
## # ... with 9 more variables: rtr_ping <dbl>, mtime <dttm>, mymd <date>,
## #   mym <chr>, op_name <chr>, nw_cat <fct>, longitude <dbl>,
## #   latitude <dbl>, iso_adm2 <fct>
## # A tibble: 20 x 15
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe3… Galax… Android         TRUE                  3340.          65.
##  2 Offfd1… Galax… Android         TRUE                 13085.         827.
##  3 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  4 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  5 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  6 Offfb4… Galax… Android         TRUE                 15259.        3160.
##  7 Offff0… Galax… Android         TRUE                  2486.          58.
##  8 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
##  9 Offfe8… Galax… Android         TRUE                 25868.        3362.
## 10 Offfd5… Galax… Android         TRUE                 63335.       22201.
## 11 Offfeb… SM-G9… Android         TRUE                 45008.       19824.
## 12 Offfee… HTC D… Android         TRUE                 39025.       18378.
## 13 Offfe4… iPhon… iOS             TRUE                 40061.         987.
## 14 Offfec… iPhon… iOS             TRUE                  3810.        1343.
## 15 Offfb6… HTC D… Android         FALSE                  655.         872.
## 16 Offffa… iPhon… iOS             TRUE                 59876.       13818.
## 17 Offfe4… Galax… Android         TRUE                140156.       48355.
## 18 Offfff… iPhon… iOS             TRUE                105078.       25598.
## 19 Offff8… iPhon… iOS             FALSE                 5608.        1984.
## 20 Offfe5… iPad … iOS             TRUE                 51751.       22994.
## # ... with 9 more variables: rtr_ping <dbl>, mtime <dttm>, mymd <date>,
## #   mym <chr>, op_name <chr>, nw_cat <fct>, longitude <dbl>,
## #   latitude <dbl>, iso_adm2 <fct>
## # A tibble: 20 x 15
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe3… Galax… Android         TRUE                  3340.          65.
##  2 Offfd1… Galax… Android         TRUE                 13085.         827.
##  3 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  4 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  5 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  6 Offfb4… Galax… Android         TRUE                 15259.        3160.
##  7 Offff0… Galax… Android         TRUE                  2486.          58.
##  8 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
##  9 Offfe8… Galax… Android         TRUE                 25868.        3362.
## 10 Offfd5… Galax… Android         TRUE                 63335.       22201.
## 11 Offfeb… SM-G9… Android         TRUE                 45008.       19824.
## 12 Offfee… HTC D… Android         TRUE                 39025.       18378.
## 13 Offfe4… iPhon… iOS             TRUE                 40061.         987.
## 14 Offfec… iPhon… iOS             TRUE                  3810.        1343.
## 15 Offfb6… HTC D… Android         FALSE                  655.         872.
## 16 Offffa… iPhon… iOS             TRUE                 59876.       13818.
## 17 Offfe4… Galax… Android         TRUE                140156.       48355.
## 18 Offfff… iPhon… iOS             TRUE                105078.       25598.
## 19 Offff8… iPhon… iOS             FALSE                 5608.        1984.
## 20 Offfe5… iPad … iOS             TRUE                 51751.       22994.
## # ... with 9 more variables: rtr_ping <dbl>, mtime <dttm>, mymd <date>,
## #   mym <chr>, op_name <chr>, nw_cat <fct>, longitude <dbl>,
## #   latitude <dbl>, iso_adm2 <fct>
## # A tibble: 20 x 15
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe3… Galax… Android         TRUE                  3340.          65.
##  2 Offfd1… Galax… Android         TRUE                 13085.         827.
##  3 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  4 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  5 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  6 Offfb4… Galax… Android         TRUE                 15259.        3160.
##  7 Offff0… Galax… Android         TRUE                  2486.          58.
##  8 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
##  9 Offfe8… Galax… Android         TRUE                 25868.        3362.
## 10 Offfd5… Galax… Android         TRUE                 63335.       22201.
## 11 Offfeb… SM-G9… Android         TRUE                 45008.       19824.
## 12 Offfee… HTC D… Android         TRUE                 39025.       18378.
## 13 Offfe4… iPhon… iOS             TRUE                 40061.         987.
## 14 Offfec… iPhon… iOS             TRUE                  3810.        1343.
## 15 Offfb6… HTC D… Android         FALSE                  655.         872.
## 16 Offffa… iPhon… iOS             TRUE                 59876.       13818.
## 17 Offfe4… Galax… Android         TRUE                140156.       48355.
## 18 Offfff… iPhon… iOS             TRUE                105078.       25598.
## 19 Offff8… iPhon… iOS             FALSE                 5608.        1984.
## 20 Offfe5… iPad … iOS             TRUE                 51751.       22994.
## # ... with 9 more variables: rtr_ping <dbl>, mtime <dttm>, mymd <date>,
## #   mym <chr>, op_name <chr>, nw_cat <fct>, longitude <dbl>,
## #   latitude <dbl>, iso_adm2 <fct>


1.2. Right join file 1 and file 2 and try to understand what happened with the data!
right_join(file1, file2)
## # A tibble: 20 x 15
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe4… iPhon… iOS             TRUE                 40061.         987.
##  2 Offfe3… Galax… Android         TRUE                  3340.          65.
##  3 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  4 Offfd5… Galax… Android         TRUE                 63335.       22201.
##  5 Offfd1… Galax… Android         TRUE                 13085.         827.
##  6 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  7 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  8 Offfb6… HTC D… Android         FALSE                  655.         872.
##  9 Offfb4… Galax… Android         TRUE                 15259.        3160.
## 10 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
## 11 Offfaf… <NA>   <NA>            NA                      NA           NA 
## 12 Offfaf… <NA>   <NA>            NA                      NA           NA 
## 13 Offfa1… <NA>   <NA>            NA                      NA           NA 
## 14 Offf97… <NA>   <NA>            NA                      NA           NA 
## 15 Offf97… <NA>   <NA>            NA                      NA           NA 
## 16 Offf93… <NA>   <NA>            NA                      NA           NA 
## 17 Offf92… <NA>   <NA>            NA                      NA           NA 
## 18 Offf91… <NA>   <NA>            NA                      NA           NA 
## 19 Offf90… <NA>   <NA>            NA                      NA           NA 
## 20 Offf8c… <NA>   <NA>            NA                      NA           NA 
## # ... with 9 more variables: rtr_ping <dbl>, mtime <dttm>, mymd <date>,
## #   mym <chr>, op_name <chr>, nw_cat <fct>, longitude <dbl>,
## #   latitude <dbl>, iso_adm2 <fct>


2. Inner join file 1 and file 2 and try to understand what happened with the data!
inner_join(file1, file2)
## # A tibble: 10 x 15
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe3… Galax… Android         TRUE                  3340.          65.
##  2 Offfd1… Galax… Android         TRUE                 13085.         827.
##  3 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  4 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  5 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  6 Offfb4… Galax… Android         TRUE                 15259.        3160.
##  7 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
##  8 Offfd5… Galax… Android         TRUE                 63335.       22201.
##  9 Offfe4… iPhon… iOS             TRUE                 40061.         987.
## 10 Offfb6… HTC D… Android         FALSE                  655.         872.
## # ... with 9 more variables: rtr_ping <dbl>, mtime <dttm>, mymd <date>,
## #   mym <chr>, op_name <chr>, nw_cat <fct>, longitude <dbl>,
## #   latitude <dbl>, iso_adm2 <fct>


3. Full join file 1 and file 2 and try to understand what happened with the data!
full_join(file1, file2)
## # A tibble: 30 x 15
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe3… Galax… Android         TRUE                  3340.          65.
##  2 Offfd1… Galax… Android         TRUE                 13085.         827.
##  3 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  4 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  5 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  6 Offfb4… Galax… Android         TRUE                 15259.        3160.
##  7 Offff0… Galax… Android         TRUE                  2486.          58.
##  8 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
##  9 Offfe8… Galax… Android         TRUE                 25868.        3362.
## 10 Offfd5… Galax… Android         TRUE                 63335.       22201.
## # ... with 20 more rows, and 9 more variables: rtr_ping <dbl>,
## #   mtime <dttm>, mymd <date>, mym <chr>, op_name <chr>, nw_cat <fct>,
## #   longitude <dbl>, latitude <dbl>, iso_adm2 <fct>


>Now for practice: the id is the same variable but with different names - use the code below to change file 1!
(file1 <- rename(file1, id_number = id))
## # A tibble: 20 x 7
##    id_number        device     device_platform device_has_lte rtr_speed_dl
##    <chr>            <fct>      <fct>           <lgl>                 <dbl>
##  1 Offfe3f7c-51f6-… Galaxy S4… Android         TRUE                  3340.
##  2 Offfd173b-89ee-… Galaxy S4… Android         TRUE                 13085.
##  3 Offfbe951-bf98-… Galaxy S4… Android         TRUE                  3520.
##  4 Offfdc5a7-7490-… HTC One S  Android         FALSE                 3962.
##  5 Offfcf842-b732-… Galaxy S4… Android         TRUE                 19065.
##  6 Offfb42a3-58d7-… Galaxy S4… Android         TRUE                 15259.
##  7 Offff0ee1-1cb0-… Galaxy S5  Android         TRUE                  2486.
##  8 Offfb0e8a-8a4e-… iPhone 5s  iOS             TRUE                  3835.
##  9 Offfe8df3-5180-… Galaxy S4… Android         TRUE                 25868.
## 10 Offfd5337-2602-… Galaxy S3… Android         TRUE                 63335.
## 11 Offfeb095-0ff8-… SM-G925F   Android         TRUE                 45008.
## 12 Offfeec5c-4b40-… HTC Desir… Android         TRUE                 39025.
## 13 Offfe464e-3c77-… iPhone 6   iOS             TRUE                 40061.
## 14 Offfecdb4-76d0-… iPhone 5   iOS             TRUE                  3810.
## 15 Offfb681e-19c5-… HTC Desir… Android         FALSE                  655.
## 16 Offffa9ea-865d-… iPhone 6s… iOS             TRUE                 59876.
## 17 Offfe4ba7-55fe-… Galaxy S4… Android         TRUE                140156.
## 18 Offfffc9d-75a5-… iPhone 6s  iOS             TRUE                105078.
## 19 Offff830e-3122-… iPhone 4s  iOS             FALSE                 5608.
## 20 Offfe5a19-84ae-… iPad Air … iOS             TRUE                 51751.
## # ... with 2 more variables: rtr_speed_ul <dbl>, rtr_ping <dbl>


4. Try a normal left_join with file 1 and 2 without specifiying the keys - what happens now? Would it change if you specifiy by = “id”?
left_join(file1, file2)
## Error: `by` required, because the data sources have no common variables


5. As the names differ you need to specifiy in the by-column which two columns have the matching data: write your own call and use a variation of by = c(“id_x” = “id_y”) to make it work
left_join(file1, file2, by = c("id_number" = "id"))
## # A tibble: 20 x 15
##    id_number        device     device_platform device_has_lte rtr_speed_dl
##    <chr>            <fct>      <fct>           <lgl>                 <dbl>
##  1 Offfe3f7c-51f6-… Galaxy S4… Android         TRUE                  3340.
##  2 Offfd173b-89ee-… Galaxy S4… Android         TRUE                 13085.
##  3 Offfbe951-bf98-… Galaxy S4… Android         TRUE                  3520.
##  4 Offfdc5a7-7490-… HTC One S  Android         FALSE                 3962.
##  5 Offfcf842-b732-… Galaxy S4… Android         TRUE                 19065.
##  6 Offfb42a3-58d7-… Galaxy S4… Android         TRUE                 15259.
##  7 Offff0ee1-1cb0-… Galaxy S5  Android         TRUE                  2486.
##  8 Offfb0e8a-8a4e-… iPhone 5s  iOS             TRUE                  3835.
##  9 Offfe8df3-5180-… Galaxy S4… Android         TRUE                 25868.
## 10 Offfd5337-2602-… Galaxy S3… Android         TRUE                 63335.
## 11 Offfeb095-0ff8-… SM-G925F   Android         TRUE                 45008.
## 12 Offfeec5c-4b40-… HTC Desir… Android         TRUE                 39025.
## 13 Offfe464e-3c77-… iPhone 6   iOS             TRUE                 40061.
## 14 Offfecdb4-76d0-… iPhone 5   iOS             TRUE                  3810.
## 15 Offfb681e-19c5-… HTC Desir… Android         FALSE                  655.
## 16 Offffa9ea-865d-… iPhone 6s… iOS             TRUE                 59876.
## 17 Offfe4ba7-55fe-… Galaxy S4… Android         TRUE                140156.
## 18 Offfffc9d-75a5-… iPhone 6s  iOS             TRUE                105078.
## 19 Offff830e-3122-… iPhone 4s  iOS             FALSE                 5608.
## 20 Offfe5a19-84ae-… iPad Air … iOS             TRUE                 51751.
## # ... with 10 more variables: rtr_speed_ul <dbl>, rtr_ping <dbl>,
## #   mtime <dttm>, mymd <date>, mym <chr>, op_name <chr>, nw_cat <fct>,
## #   longitude <dbl>, latitude <dbl>, iso_adm2 <fct>


Filtering joins

Rename the file again to avoid unnecessary work with the code below!
file1 <- rename(file1, id = id_number)


Semi-join: checks which rows of x have matching values in y and gives those of x back
Can you state the difference to an inner join?
6. Now use semi_join on file 1 and 2 and try to understand the output!
semi_join(file1, file2)
## # A tibble: 10 x 7
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offfe3… Galax… Android         TRUE                  3340.          65.
##  2 Offfd1… Galax… Android         TRUE                 13085.         827.
##  3 Offfbe… Galax… Android         TRUE                  3520.        1877.
##  4 Offfdc… HTC O… Android         FALSE                 3962.        3012.
##  5 Offfcf… Galax… Android         TRUE                 19065.        1750.
##  6 Offfb4… Galax… Android         TRUE                 15259.        3160.
##  7 Offfb0… iPhon… iOS             TRUE                  3835.        2137.
##  8 Offfd5… Galax… Android         TRUE                 63335.       22201.
##  9 Offfe4… iPhon… iOS             TRUE                 40061.         987.
## 10 Offfb6… HTC D… Android         FALSE                  655.         872.
## # ... with 1 more variable: rtr_ping <dbl>


Anti_join: returns rows of x with no matching values in y
7. Now use anti_join on file1 and 2 and try to understand the output!
anti_join(file1,file2)
## # A tibble: 10 x 7
##    id      device device_platform device_has_lte rtr_speed_dl rtr_speed_ul
##    <chr>   <fct>  <fct>           <lgl>                 <dbl>        <dbl>
##  1 Offff0… Galax… Android         TRUE                  2486.          58.
##  2 Offfe8… Galax… Android         TRUE                 25868.        3362.
##  3 Offfeb… SM-G9… Android         TRUE                 45008.       19824.
##  4 Offfee… HTC D… Android         TRUE                 39025.       18378.
##  5 Offfec… iPhon… iOS             TRUE                  3810.        1343.
##  6 Offffa… iPhon… iOS             TRUE                 59876.       13818.
##  7 Offfe4… Galax… Android         TRUE                140156.       48355.
##  8 Offfff… iPhon… iOS             TRUE                105078.       25598.
##  9 Offff8… iPhon… iOS             FALSE                 5608.        1984.
## 10 Offfe5… iPad … iOS             TRUE                 51751.       22994.
## # ... with 1 more variable: rtr_ping <dbl>



Set Operations

What are set operations and what do they do?

  • originated from baseR, dplyr overrides those to make them more generic (you can pass other arguments then x and y)
  • takes two datasets and performs an operation on them
  • functions:
    • union: takes x and y as inputs and returns all rows that are in at least one dataset
    • intersect: takes x and y as inputs and returns all rows that the datasets share
    • setdiff: returns all rows that are in x with no matching values in y
    • setequal: check if both sets have identical rows, in any order
    • combination of first 3: all rows in one set, but not in the other


Set Operations in action

file3: first 40 rows
file4: rows 20 - 60
union(file3, file4) #all rows, no duplicates
## # A tibble: 60 x 15
##    id          mtime               mymd       mym   op_name nw_cat device 
##    <chr>       <dttm>              <date>     <chr> <chr>   <fct>  <fct>  
##  1 Offf37a3a-… 2015-09-23 22:25:08 2015-09-23 2015… Kiwi    3G     iPhone…
##  2 Offf2add1-… 2015-09-12 18:28:33 2015-09-12 2015… Pear    4G     Samsun…
##  3 Offf15c81-… 2015-08-30 19:14:38 2015-08-30 2015… Apple   3G     HIGHWA…
##  4 Offf43eef-… 2015-07-19 19:28:25 2015-07-19 2015… Pear    3G     iPhone…
##  5 Offf4f1dc-… 2015-07-17 10:58:20 2015-07-17 2015… Apple   4G     iPhone…
##  6 Offf226bf-… 2015-06-26 17:13:08 2015-06-26 2015… Pear    2G     iPhone…
##  7 Offf1b772-… 2015-04-15 14:52:44 2015-04-15 2015… Pear    3G     Galaxy…
##  8 Offf26cd8-… 2014-12-29 12:42:12 2014-12-29 2014… Apple   4G     iPhone…
##  9 Offf319fe-… 2014-09-29 14:04:00 2014-09-29 2014… Kiwi    3G     Galaxy…
## 10 Offf47c9a-… 2014-09-19 19:13:00 2014-09-19 2014… Kiwi    3G     iPhone…
## # ... with 50 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>
intersect(file3, file4) #matching keys, no duplicates
## # A tibble: 20 x 15
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Offf64ac2… 2014-01-24 12:04:00 2014-01-24 2014… Pear    3G     Galaxy …
##  2 Offf92296… 2014-01-26 21:44:00 2014-01-26 2014… Pear    3G     Nexus 5 
##  3 Offf6c55b… 2014-01-31 19:40:00 2014-01-31 2014… Pear    3G     Galaxy …
##  4 Offf58d24… 2014-02-04 00:11:00 2014-02-04 2014… Pear    3G     Sony Xp…
##  5 Offf7072a… 2014-02-21 12:06:00 2014-02-21 2014… Pear    3G     Galaxy …
##  6 Offf6b1f8… 2014-02-21 15:44:00 2014-02-21 2014… Pear    2G     iPad 4 …
##  7 Offf7c734… 2014-06-05 16:04:00 2014-06-05 2014… Kiwi    3G     Sony Xp…
##  8 Offf97613… 2014-10-31 06:14:50 2014-10-31 2014… Kiwi    2G     iPhone 6
##  9 Offf73f44… 2014-11-29 23:48:43 2014-11-29 2014… Pear    2G     Galaxy …
## 10 Offf902d7… 2014-12-07 21:26:04 2014-12-07 2014… Kiwi    3G     iPad 4 …
## 11 Offf64782… 2015-01-10 15:20:08 2015-01-10 2015… Pear    3G     Galaxy …
## 12 Offf8c086… 2015-01-11 15:53:41 2015-01-11 2015… Apple   4G     LGE Nex…
## 13 Offfa188c… 2015-02-17 16:47:14 2015-02-17 2015… Pear    4G     LGE LG …
## 14 Offf939b9… 2015-03-01 09:17:46 2015-03-01 2015… Kiwi    3G     Z500    
## 15 Offfafb38… 2015-05-02 18:55:11 2015-05-02 2015… Apple   3G     LG Opti…
## 16 Offf6e668… 2015-05-06 11:09:09 2015-05-06 2015… Pear    3G     iPhone …
## 17 Offf91485… 2015-08-03 10:25:33 2015-08-03 2015… Kiwi    3G     Galaxy …
## 18 Offf72340… 2015-10-07 18:53:34 2015-10-07 2015… Pear    4G     Galaxy …
## 19 Offfaf977… 2015-11-06 22:48:16 2015-11-06 2015… Pear    4G     iPad mi…
## 20 Offf978ff… 2015-11-13 15:41:58 2015-11-13 2015… Apple   3G     iPhone 6
## # ... with 8 more variables: device_platform <fct>, device_has_lte <lgl>,
## #   longitude <dbl>, latitude <dbl>, iso_adm2 <fct>, rtr_speed_dl <dbl>,
## #   rtr_speed_ul <dbl>, rtr_ping <dbl>
setdiff(file3,file4) #all rows of x with no matching key values in y
## # A tibble: 20 x 15
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Offfe3f7c… 2014-02-22 08:34:00 2014-02-22 2014… Pear    3G     Galaxy …
##  2 Offfd173b… 2014-05-19 13:51:00 2014-05-19 2014… Pear    3G     Galaxy …
##  3 Offfbe951… 2014-05-22 09:57:00 2014-05-22 2014… Apple   3G     Galaxy …
##  4 Offfdc5a7… 2014-06-24 19:46:00 2014-06-24 2014… Apple   3G     HTC One…
##  5 Offfcf842… 2014-07-21 20:10:00 2014-07-21 2014… Pear    3G     Galaxy …
##  6 Offfb42a3… 2014-08-28 09:28:00 2014-08-28 2014… Apple   3G     Galaxy …
##  7 Offff0ee1… 2014-09-14 15:32:00 2014-09-14 2014… Apple   3G     Galaxy …
##  8 Offfb0e8a… 2014-12-22 18:29:19 2014-12-22 2014… Kiwi    3G     iPhone …
##  9 Offfe8df3… 2015-02-03 03:23:05 2015-02-03 2015… Apple   3G     Galaxy …
## 10 Offfd5337… 2015-03-27 08:46:18 2015-03-27 2015… Kiwi    4G     Galaxy …
## 11 Offfeb095… 2015-08-15 18:53:54 2015-08-15 2015… Apple   4G     SM-G925F
## 12 Offfeec5c… 2015-08-19 05:44:17 2015-08-19 2015… Pear    4G     HTC Des…
## 13 Offfe464e… 2015-08-19 09:47:50 2015-08-19 2015… Apple   4G     iPhone 6
## 14 Offfecdb4… 2015-08-28 11:29:46 2015-08-28 2015… Kiwi    3G     iPhone 5
## 15 Offfb681e… 2015-09-10 18:03:04 2015-09-10 2015… Pear    3G     HTC Des…
## 16 Offffa9ea… 2015-10-16 11:35:58 2015-10-16 2015… Apple   4G     iPhone …
## 17 Offfe4ba7… 2015-11-06 20:29:45 2015-11-06 2015… Pear    4G     Galaxy …
## 18 Offfffc9d… 2015-11-17 13:47:03 2015-11-17 2015… Pear    4G     iPhone …
## 19 Offff830e… 2015-11-17 18:01:22 2015-11-17 2015… Kiwi    3G     iPhone …
## 20 Offfe5a19… 2015-11-18 21:09:23 2015-11-18 2015… Pear    4G     iPad Ai…
## # ... with 8 more variables: device_platform <fct>, device_has_lte <lgl>,
## #   longitude <dbl>, latitude <dbl>, iso_adm2 <fct>, rtr_speed_dl <dbl>,
## #   rtr_speed_ul <dbl>, rtr_ping <dbl>
## combination:
temp1 <- union(file3,file4)
temp2 <- intersect(file3, file4)
setdiff(temp1, temp2)
## # A tibble: 40 x 15
##    id          mtime               mymd       mym   op_name nw_cat device 
##    <chr>       <dttm>              <date>     <chr> <chr>   <fct>  <fct>  
##  1 Offf37a3a-… 2015-09-23 22:25:08 2015-09-23 2015… Kiwi    3G     iPhone…
##  2 Offf2add1-… 2015-09-12 18:28:33 2015-09-12 2015… Pear    4G     Samsun…
##  3 Offf15c81-… 2015-08-30 19:14:38 2015-08-30 2015… Apple   3G     HIGHWA…
##  4 Offf43eef-… 2015-07-19 19:28:25 2015-07-19 2015… Pear    3G     iPhone…
##  5 Offf4f1dc-… 2015-07-17 10:58:20 2015-07-17 2015… Apple   4G     iPhone…
##  6 Offf226bf-… 2015-06-26 17:13:08 2015-06-26 2015… Pear    2G     iPhone…
##  7 Offf1b772-… 2015-04-15 14:52:44 2015-04-15 2015… Pear    3G     Galaxy…
##  8 Offf26cd8-… 2014-12-29 12:42:12 2014-12-29 2014… Apple   4G     iPhone…
##  9 Offf319fe-… 2014-09-29 14:04:00 2014-09-29 2014… Kiwi    3G     Galaxy…
## 10 Offf47c9a-… 2014-09-19 19:13:00 2014-09-19 2014… Kiwi    3G     iPhone…
## # ... with 30 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>



Set operators

Creating files to work with


Union: takes x and y as inputs and returns all rows that are in at least one dataset
1. Call union on file 3 and 4 and try to understand the output!
union(file3, file4)
## # A tibble: 60 x 15
##    id          mtime               mymd       mym   op_name nw_cat device 
##    <chr>       <dttm>              <date>     <chr> <chr>   <fct>  <fct>  
##  1 Offf37a3a-… 2015-09-23 22:25:08 2015-09-23 2015… Kiwi    3G     iPhone…
##  2 Offf2add1-… 2015-09-12 18:28:33 2015-09-12 2015… Pear    4G     Samsun…
##  3 Offf15c81-… 2015-08-30 19:14:38 2015-08-30 2015… Apple   3G     HIGHWA…
##  4 Offf43eef-… 2015-07-19 19:28:25 2015-07-19 2015… Pear    3G     iPhone…
##  5 Offf4f1dc-… 2015-07-17 10:58:20 2015-07-17 2015… Apple   4G     iPhone…
##  6 Offf226bf-… 2015-06-26 17:13:08 2015-06-26 2015… Pear    2G     iPhone…
##  7 Offf1b772-… 2015-04-15 14:52:44 2015-04-15 2015… Pear    3G     Galaxy…
##  8 Offf26cd8-… 2014-12-29 12:42:12 2014-12-29 2014… Apple   4G     iPhone…
##  9 Offf319fe-… 2014-09-29 14:04:00 2014-09-29 2014… Kiwi    3G     Galaxy…
## 10 Offf47c9a-… 2014-09-19 19:13:00 2014-09-19 2014… Kiwi    3G     iPhone…
## # ... with 50 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


intersect: takes x and y as inputs and returns all rows that the datasets share
2. Call intersect on file 3 and 4 and try to understand the output!
intersect(file3, file4)
## # A tibble: 20 x 15
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Offf64ac2… 2014-01-24 12:04:00 2014-01-24 2014… Pear    3G     Galaxy …
##  2 Offf92296… 2014-01-26 21:44:00 2014-01-26 2014… Pear    3G     Nexus 5 
##  3 Offf6c55b… 2014-01-31 19:40:00 2014-01-31 2014… Pear    3G     Galaxy …
##  4 Offf58d24… 2014-02-04 00:11:00 2014-02-04 2014… Pear    3G     Sony Xp…
##  5 Offf7072a… 2014-02-21 12:06:00 2014-02-21 2014… Pear    3G     Galaxy …
##  6 Offf6b1f8… 2014-02-21 15:44:00 2014-02-21 2014… Pear    2G     iPad 4 …
##  7 Offf7c734… 2014-06-05 16:04:00 2014-06-05 2014… Kiwi    3G     Sony Xp…
##  8 Offf97613… 2014-10-31 06:14:50 2014-10-31 2014… Kiwi    2G     iPhone 6
##  9 Offf73f44… 2014-11-29 23:48:43 2014-11-29 2014… Pear    2G     Galaxy …
## 10 Offf902d7… 2014-12-07 21:26:04 2014-12-07 2014… Kiwi    3G     iPad 4 …
## 11 Offf64782… 2015-01-10 15:20:08 2015-01-10 2015… Pear    3G     Galaxy …
## 12 Offf8c086… 2015-01-11 15:53:41 2015-01-11 2015… Apple   4G     LGE Nex…
## 13 Offfa188c… 2015-02-17 16:47:14 2015-02-17 2015… Pear    4G     LGE LG …
## 14 Offf939b9… 2015-03-01 09:17:46 2015-03-01 2015… Kiwi    3G     Z500    
## 15 Offfafb38… 2015-05-02 18:55:11 2015-05-02 2015… Apple   3G     LG Opti…
## 16 Offf6e668… 2015-05-06 11:09:09 2015-05-06 2015… Pear    3G     iPhone …
## 17 Offf91485… 2015-08-03 10:25:33 2015-08-03 2015… Kiwi    3G     Galaxy …
## 18 Offf72340… 2015-10-07 18:53:34 2015-10-07 2015… Pear    4G     Galaxy …
## 19 Offfaf977… 2015-11-06 22:48:16 2015-11-06 2015… Pear    4G     iPad mi…
## 20 Offf978ff… 2015-11-13 15:41:58 2015-11-13 2015… Apple   3G     iPhone 6
## # ... with 8 more variables: device_platform <fct>, device_has_lte <lgl>,
## #   longitude <dbl>, latitude <dbl>, iso_adm2 <fct>, rtr_speed_dl <dbl>,
## #   rtr_speed_ul <dbl>, rtr_ping <dbl>


Setdiff: returns all rows that are in x with no matching values in y
3. Call setdiff on file 3 and 4 and try to understand the output!
setdiff(file3,file4)
## # A tibble: 20 x 15
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Offfe3f7c… 2014-02-22 08:34:00 2014-02-22 2014… Pear    3G     Galaxy …
##  2 Offfd173b… 2014-05-19 13:51:00 2014-05-19 2014… Pear    3G     Galaxy …
##  3 Offfbe951… 2014-05-22 09:57:00 2014-05-22 2014… Apple   3G     Galaxy …
##  4 Offfdc5a7… 2014-06-24 19:46:00 2014-06-24 2014… Apple   3G     HTC One…
##  5 Offfcf842… 2014-07-21 20:10:00 2014-07-21 2014… Pear    3G     Galaxy …
##  6 Offfb42a3… 2014-08-28 09:28:00 2014-08-28 2014… Apple   3G     Galaxy …
##  7 Offff0ee1… 2014-09-14 15:32:00 2014-09-14 2014… Apple   3G     Galaxy …
##  8 Offfb0e8a… 2014-12-22 18:29:19 2014-12-22 2014… Kiwi    3G     iPhone …
##  9 Offfe8df3… 2015-02-03 03:23:05 2015-02-03 2015… Apple   3G     Galaxy …
## 10 Offfd5337… 2015-03-27 08:46:18 2015-03-27 2015… Kiwi    4G     Galaxy …
## 11 Offfeb095… 2015-08-15 18:53:54 2015-08-15 2015… Apple   4G     SM-G925F
## 12 Offfeec5c… 2015-08-19 05:44:17 2015-08-19 2015… Pear    4G     HTC Des…
## 13 Offfe464e… 2015-08-19 09:47:50 2015-08-19 2015… Apple   4G     iPhone 6
## 14 Offfecdb4… 2015-08-28 11:29:46 2015-08-28 2015… Kiwi    3G     iPhone 5
## 15 Offfb681e… 2015-09-10 18:03:04 2015-09-10 2015… Pear    3G     HTC Des…
## 16 Offffa9ea… 2015-10-16 11:35:58 2015-10-16 2015… Apple   4G     iPhone …
## 17 Offfe4ba7… 2015-11-06 20:29:45 2015-11-06 2015… Pear    4G     Galaxy …
## 18 Offfffc9d… 2015-11-17 13:47:03 2015-11-17 2015… Pear    4G     iPhone …
## 19 Offff830e… 2015-11-17 18:01:22 2015-11-17 2015… Kiwi    3G     iPhone …
## 20 Offfe5a19… 2015-11-18 21:09:23 2015-11-18 2015… Pear    4G     iPad Ai…
## # ... with 8 more variables: device_platform <fct>, device_has_lte <lgl>,
## #   longitude <dbl>, latitude <dbl>, iso_adm2 <fct>, rtr_speed_dl <dbl>,
## #   rtr_speed_ul <dbl>, rtr_ping <dbl>


4. Combination: try to write a call that gives back all rows, that are in one set, but not the other (use all 3 set operators from above)!
temp1 <- union(file3,file4)
temp2 <- intersect(file3, file4)
(temp3 <- setdiff(temp1, temp2))
## # A tibble: 40 x 15
##    id          mtime               mymd       mym   op_name nw_cat device 
##    <chr>       <dttm>              <date>     <chr> <chr>   <fct>  <fct>  
##  1 Offf37a3a-… 2015-09-23 22:25:08 2015-09-23 2015… Kiwi    3G     iPhone…
##  2 Offf2add1-… 2015-09-12 18:28:33 2015-09-12 2015… Pear    4G     Samsun…
##  3 Offf15c81-… 2015-08-30 19:14:38 2015-08-30 2015… Apple   3G     HIGHWA…
##  4 Offf43eef-… 2015-07-19 19:28:25 2015-07-19 2015… Pear    3G     iPhone…
##  5 Offf4f1dc-… 2015-07-17 10:58:20 2015-07-17 2015… Apple   4G     iPhone…
##  6 Offf226bf-… 2015-06-26 17:13:08 2015-06-26 2015… Pear    2G     iPhone…
##  7 Offf1b772-… 2015-04-15 14:52:44 2015-04-15 2015… Pear    3G     Galaxy…
##  8 Offf26cd8-… 2014-12-29 12:42:12 2014-12-29 2014… Apple   4G     iPhone…
##  9 Offf319fe-… 2014-09-29 14:04:00 2014-09-29 2014… Kiwi    3G     Galaxy…
## 10 Offf47c9a-… 2014-09-19 19:13:00 2014-09-19 2014… Kiwi    3G     iPhone…
## # ... with 30 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


5. This should be equal to the merged data of setdiff(x,y) and setdiff(y,x) but use a combination of all 3 to get the right answer - use setequal() on your creation and temp4 to check if you got the right answer!
temp4 <- bind_rows(setdiff(file3,file4),setdiff(file4,file3)) #temp4

setequal(temp3, temp4)
## TRUE



Binds

bind_rows()

  • bind_rows(): bind_rows(…,.id = NULL)
    • = data frames to combine, either a
      • Dataframe
      • List that could be a dataframe
      • List of dataframes
    • .id
      • creates column of identifiers
    • adds rows of y below rows of x
    • rows get matched by col-names
    • if missing data -> filled with NA#


Examples (and why .id is cool)

bind_rows(setdiff(file3,file4),union(file3,file4)) #not useful but helps understanding
## # A tibble: 80 x 15
##    id          mtime               mymd       mym   op_name nw_cat device 
##    <chr>       <dttm>              <date>     <chr> <chr>   <fct>  <fct>  
##  1 Offfe3f7c-… 2014-02-22 08:34:00 2014-02-22 2014… Pear    3G     Galaxy…
##  2 Offfd173b-… 2014-05-19 13:51:00 2014-05-19 2014… Pear    3G     Galaxy…
##  3 Offfbe951-… 2014-05-22 09:57:00 2014-05-22 2014… Apple   3G     Galaxy…
##  4 Offfdc5a7-… 2014-06-24 19:46:00 2014-06-24 2014… Apple   3G     HTC On…
##  5 Offfcf842-… 2014-07-21 20:10:00 2014-07-21 2014… Pear    3G     Galaxy…
##  6 Offfb42a3-… 2014-08-28 09:28:00 2014-08-28 2014… Apple   3G     Galaxy…
##  7 Offff0ee1-… 2014-09-14 15:32:00 2014-09-14 2014… Apple   3G     Galaxy…
##  8 Offfb0e8a-… 2014-12-22 18:29:19 2014-12-22 2014… Kiwi    3G     iPhone…
##  9 Offfe8df3-… 2015-02-03 03:23:05 2015-02-03 2015… Apple   3G     Galaxy…
## 10 Offfd5337-… 2015-03-27 08:46:18 2015-03-27 2015… Kiwi    4G     Galaxy…
## # ... with 70 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>
bind_rows(setdiff = setdiff(file3,file4), union = union(file3,file4), .id = "set_operator") #this is why bind_rows is cool
## # A tibble: 80 x 16
##    set_operator id     mtime               mymd       mym   op_name nw_cat
##    <chr>        <chr>  <dttm>              <date>     <chr> <chr>   <fct> 
##  1 setdiff      Offfe… 2014-02-22 08:34:00 2014-02-22 2014… Pear    3G    
##  2 setdiff      Offfd… 2014-05-19 13:51:00 2014-05-19 2014… Pear    3G    
##  3 setdiff      Offfb… 2014-05-22 09:57:00 2014-05-22 2014… Apple   3G    
##  4 setdiff      Offfd… 2014-06-24 19:46:00 2014-06-24 2014… Apple   3G    
##  5 setdiff      Offfc… 2014-07-21 20:10:00 2014-07-21 2014… Pear    3G    
##  6 setdiff      Offfb… 2014-08-28 09:28:00 2014-08-28 2014… Apple   3G    
##  7 setdiff      Offff… 2014-09-14 15:32:00 2014-09-14 2014… Apple   3G    
##  8 setdiff      Offfb… 2014-12-22 18:29:19 2014-12-22 2014… Kiwi    3G    
##  9 setdiff      Offfe… 2015-02-03 03:23:05 2015-02-03 2015… Apple   3G    
## 10 setdiff      Offfd… 2015-03-27 08:46:18 2015-03-27 2015… Kiwi    4G    
## # ... with 70 more rows, and 9 more variables: device <fct>,
## #   device_platform <fct>, device_has_lte <lgl>, longitude <dbl>,
## #   latitude <dbl>, iso_adm2 <fct>, rtr_speed_dl <dbl>,
## #   rtr_speed_ul <dbl>, rtr_ping <dbl>


bind_cols()

  • bind_cols(): bind_cols(…)
    • = (again) dataframes to combine
    • adds the cols of y to x
      • attention: keys need to match!


Examples (and why bind_cols is dangerous)

file5: 20 rows, first 8 cols
file6: 20 rows, id (as key) + 9 - 15
file7: file 6 rearranged by id
(temp1 <- bind_cols(file5,file6)) #binds it -> combined right!
## # A tibble: 20 x 16
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Offfe3f7c… 2014-02-22 08:34:00 2014-02-22 2014… Pear    3G     Galaxy …
##  2 Offfd173b… 2014-05-19 13:51:00 2014-05-19 2014… Pear    3G     Galaxy …
##  3 Offfbe951… 2014-05-22 09:57:00 2014-05-22 2014… Apple   3G     Galaxy …
##  4 Offfdc5a7… 2014-06-24 19:46:00 2014-06-24 2014… Apple   3G     HTC One…
##  5 Offfcf842… 2014-07-21 20:10:00 2014-07-21 2014… Pear    3G     Galaxy …
##  6 Offfb42a3… 2014-08-28 09:28:00 2014-08-28 2014… Apple   3G     Galaxy …
##  7 Offff0ee1… 2014-09-14 15:32:00 2014-09-14 2014… Apple   3G     Galaxy …
##  8 Offfb0e8a… 2014-12-22 18:29:19 2014-12-22 2014… Kiwi    3G     iPhone …
##  9 Offfe8df3… 2015-02-03 03:23:05 2015-02-03 2015… Apple   3G     Galaxy …
## 10 Offfd5337… 2015-03-27 08:46:18 2015-03-27 2015… Kiwi    4G     Galaxy …
## 11 Offfeb095… 2015-08-15 18:53:54 2015-08-15 2015… Apple   4G     SM-G925F
## 12 Offfeec5c… 2015-08-19 05:44:17 2015-08-19 2015… Pear    4G     HTC Des…
## 13 Offfe464e… 2015-08-19 09:47:50 2015-08-19 2015… Apple   4G     iPhone 6
## 14 Offfecdb4… 2015-08-28 11:29:46 2015-08-28 2015… Kiwi    3G     iPhone 5
## 15 Offfb681e… 2015-09-10 18:03:04 2015-09-10 2015… Pear    3G     HTC Des…
## 16 Offffa9ea… 2015-10-16 11:35:58 2015-10-16 2015… Apple   4G     iPhone …
## 17 Offfe4ba7… 2015-11-06 20:29:45 2015-11-06 2015… Pear    4G     Galaxy …
## 18 Offfffc9d… 2015-11-17 13:47:03 2015-11-17 2015… Pear    4G     iPhone …
## 19 Offff830e… 2015-11-17 18:01:22 2015-11-17 2015… Kiwi    3G     iPhone …
## 20 Offfe5a19… 2015-11-18 21:09:23 2015-11-18 2015… Pear    4G     iPad Ai…
## # ... with 9 more variables: device_platform <fct>, id1 <chr>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>
(temp2 <- bind_cols(file5,file7)) #same as temp1?
## # A tibble: 20 x 16
##    id         mtime               mymd       mym   op_name nw_cat device  
##    <chr>      <dttm>              <date>     <chr> <chr>   <fct>  <fct>   
##  1 Offfe3f7c… 2014-02-22 08:34:00 2014-02-22 2014… Pear    3G     Galaxy …
##  2 Offfd173b… 2014-05-19 13:51:00 2014-05-19 2014… Pear    3G     Galaxy …
##  3 Offfbe951… 2014-05-22 09:57:00 2014-05-22 2014… Apple   3G     Galaxy …
##  4 Offfdc5a7… 2014-06-24 19:46:00 2014-06-24 2014… Apple   3G     HTC One…
##  5 Offfcf842… 2014-07-21 20:10:00 2014-07-21 2014… Pear    3G     Galaxy …
##  6 Offfb42a3… 2014-08-28 09:28:00 2014-08-28 2014… Apple   3G     Galaxy …
##  7 Offff0ee1… 2014-09-14 15:32:00 2014-09-14 2014… Apple   3G     Galaxy …
##  8 Offfb0e8a… 2014-12-22 18:29:19 2014-12-22 2014… Kiwi    3G     iPhone …
##  9 Offfe8df3… 2015-02-03 03:23:05 2015-02-03 2015… Apple   3G     Galaxy …
## 10 Offfd5337… 2015-03-27 08:46:18 2015-03-27 2015… Kiwi    4G     Galaxy …
## 11 Offfeb095… 2015-08-15 18:53:54 2015-08-15 2015… Apple   4G     SM-G925F
## 12 Offfeec5c… 2015-08-19 05:44:17 2015-08-19 2015… Pear    4G     HTC Des…
## 13 Offfe464e… 2015-08-19 09:47:50 2015-08-19 2015… Apple   4G     iPhone 6
## 14 Offfecdb4… 2015-08-28 11:29:46 2015-08-28 2015… Kiwi    3G     iPhone 5
## 15 Offfb681e… 2015-09-10 18:03:04 2015-09-10 2015… Pear    3G     HTC Des…
## 16 Offffa9ea… 2015-10-16 11:35:58 2015-10-16 2015… Apple   4G     iPhone …
## 17 Offfe4ba7… 2015-11-06 20:29:45 2015-11-06 2015… Pear    4G     Galaxy …
## 18 Offfffc9d… 2015-11-17 13:47:03 2015-11-17 2015… Pear    4G     iPhone …
## 19 Offff830e… 2015-11-17 18:01:22 2015-11-17 2015… Kiwi    3G     iPhone …
## 20 Offfe5a19… 2015-11-18 21:09:23 2015-11-18 2015… Pear    4G     iPad Ai…
## # ... with 9 more variables: device_platform <fct>, id1 <chr>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>
setequal(temp1, temp2) #compare if rows are same (any order)
## FALSE: Rows in x but not y: 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10[...]. Rows in y but not x: 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10[...].
anti_join(file5,file7, by = "id") #do we have keyvalues in x without matching in y?
## # A tibble: 0 x 8
## # ... with 8 variables: id <chr>, mtime <dttm>, mymd <date>, mym <chr>,
## #   op_name <chr>, nw_cat <fct>, device <fct>, device_platform <fct>
#no -> arrange both files the same; if yes: semi_join and save -> arrange both sets -> bind_cols -> bind old rows again
file5 <- arrange(file5, id) 
file7 <- arrange(file7, id)

temp3 <- bind_cols(file5, file7) #bind ordered
setequal(temp1, temp3) #check if its now right
## TRUE


Why use bind_cols() or bind_rows(), when you can use baseR rbind() and cbind()?

  • faster
  • output is a tibble
  • syntax is more flexible – can also bind lists of dataframes
  • most relevant (for bind_rows at least): .id
    • easy to keep track of where the data comes from
  • when is this useful?
    • E.g. plotting with ggplot2 and its aes


Binds

Bind_rows:
Adds rows of y below the rows of x, rows get matched by name but the datasets don’t need to completely match - empty spaces get filled up with NAs
1. Use any set operators to create 2 datasets and use bind_rows on them (as we did with temp4 in set operators) - like this, this is not really useful - but wait for 2)!
bind_rows(setdiff(file3,file4),union(file3,file4)) #this is not useful but helps you understand how bind_rows work
## # A tibble: 80 x 15
##    id          mtime               mymd       mym   op_name nw_cat device 
##    <chr>       <dttm>              <date>     <chr> <chr>   <fct>  <fct>  
##  1 Offfe3f7c-… 2014-02-22 08:34:00 2014-02-22 2014… Pear    3G     Galaxy…
##  2 Offfd173b-… 2014-05-19 13:51:00 2014-05-19 2014… Pear    3G     Galaxy…
##  3 Offfbe951-… 2014-05-22 09:57:00 2014-05-22 2014… Apple   3G     Galaxy…
##  4 Offfdc5a7-… 2014-06-24 19:46:00 2014-06-24 2014… Apple   3G     HTC On…
##  5 Offfcf842-… 2014-07-21 20:10:00 2014-07-21 2014… Pear    3G     Galaxy…
##  6 Offfb42a3-… 2014-08-28 09:28:00 2014-08-28 2014… Apple   3G     Galaxy…
##  7 Offff0ee1-… 2014-09-14 15:32:00 2014-09-14 2014… Apple   3G     Galaxy…
##  8 Offfb0e8a-… 2014-12-22 18:29:19 2014-12-22 2014… Kiwi    3G     iPhone…
##  9 Offfe8df3-… 2015-02-03 03:23:05 2015-02-03 2015… Apple   3G     Galaxy…
## 10 Offfd5337-… 2015-03-27 08:46:18 2015-03-27 2015… Kiwi    4G     Galaxy…
## # ... with 70 more rows, and 8 more variables: device_platform <fct>,
## #   device_has_lte <lgl>, longitude <dbl>, latitude <dbl>, iso_adm2 <fct>,
## #   rtr_speed_dl <dbl>, rtr_speed_ul <dbl>, rtr_ping <dbl>


2. Now use the same call, but define the “.id”-variable as “set_operator” and name the datasets by the set operator you used! Can you explain why this is useful?
bind_rows(setdiff = setdiff(file3,file4), union = union(file3,file4), .id = "set_operator")
## # A tibble: 80 x 16
##    set_operator id     mtime               mymd       mym   op_name nw_cat
##    <chr>        <chr>  <dttm>              <date>     <chr> <chr>   <fct> 
##  1 setdiff      Offfe… 2014-02-22 08:34:00 2014-02-22 2014… Pear    3G    
##  2 setdiff      Offfd… 2014-05-19 13:51:00 2014-05-19 2014… Pear    3G    
##  3 setdiff      Offfb… 2014-05-22 09:57:00 2014-05-22 2014… Apple   3G    
##  4 setdiff      Offfd… 2014-06-24 19:46:00 2014-06-24 2014… Apple   3G    
##  5 setdiff      Offfc… 2014-07-21 20:10:00 2014-07-21 2014… Pear    3G    
##  6 setdiff      Offfb… 2014-08-28 09:28:00 2014-08-28 2014… Apple   3G    
##  7 setdiff      Offff… 2014-09-14 15:32:00 2014-09-14 2014… Apple   3G    
##  8 setdiff      Offfb… 2014-12-22 18:29:19 2014-12-22 2014… Kiwi    3G    
##  9 setdiff      Offfe… 2015-02-03 03:23:05 2015-02-03 2015… Apple   3G    
## 10 setdiff      Offfd… 2015-03-27 08:46:18 2015-03-27 2015… Kiwi    4G    
## # ... with 70 more rows, and 9 more variables: device <fct>,
## #   device_platform <fct>, device_has_lte <lgl>, longitude <dbl>,
## #   latitude <dbl>, iso_adm2 <fct>, rtr_speed_dl <dbl>,
## #   rtr_speed_ul <dbl>, rtr_ping <dbl>
# as you should see now, the output has a new column named set_operator that shows you from where your data origins - this can be used as groups in group_by and summarise


##### bind_cols: ##### Adds the cols of y to the cols of x - expects the rows to be in the matching order and to have the same number of rows

create files to work with

file 5 holds 20 rows and the first 8 columns of RTR file 6 the id-column for identification purposes and the cols 9 to 15 from RTR - ideal to bind_cols file 7 is file 6 arranged by id


3. Use bind_cols on file5 and file6 and save it as temp1!
temp1 <- bind_cols(file5,file6)


4. Now, file7 is just file 6 arranged by id while file 5 is not arranged - what happens when you use bind_cols on file 5 with file 7? Try it and store it as temp2!
temp2 <- bind_cols(file5,file7)


5. Compare temp1 - the one in the right order (f5,f6) and temp2 (f5,f7) the one with not matching order using setequal()!
setequal(temp1, temp2) 
## FALSE: Rows in x but not y: 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10[...]. Rows in y but not x: 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10[...].
# doesn't look to bad, only that its completely false and you didn't even get an error!
setequal() tells you that what you did was nonsense! - so before joining make sure that what you do is right!


6. Use anti-join on file 5 and file 7 to see that all data matches.
anti_join(file5,file7, by = "id")
## # A tibble: 0 x 8
## # ... with 8 variables: id <chr>, mtime <dttm>, mymd <date>, mym <chr>,
## #   op_name <chr>, nw_cat <fct>, device <fct>, device_platform <fct>


7. Now that you are sure that at least your data matches (your tibble should’ve been empty), arrange both sets by the id-column to not run into the risk of doing dumb things again (don’t forget to store it again)!
file5 <- arrange(file5, id)
file7 <- arrange(file7, id)


8. Now bind them together with bind_cols and check with setequal if you are now similar to temp1!
temp3 <- bind_cols(file5, file7)
setequal(temp1, temp3)
## TRUE
# Congrats, you passed all of our exercises! We hope you had fun and learned a lot!