Lubridate & Tidytext

Athanasios Paschos, Florian Nemetz & Kaja Benz

3th of June 2019

Introduction - Specifying Dates

Example:

Introduction - Specifying Dates

Example:

x <- "23-10-2018"

str(x)   
##  chr "23-10-2018"
#  first possible function to import dates 
y <- as.Date(x) 
y
## [1] "0023-10-20"
str(y) 
##  Date[1:1], format: "0023-10-20"

Introduction - With the right format

library(tidyverse)
x <- "2013-07-13"
  
y <- as.Date(x)

y
## [1] "2013-07-13"
str(y)    #   structure of y 
##  Date[1:1], format: "2013-07-13"
jan_12 <- as.Date("2015-01-12")
jan_12
## [1] "2015-01-12"
unclass(jan_12)
## [1] 16447

Introduction - Arithmetic and logical operators

x <- as.Date(c("2019-06-03", "2015-01-01", "1985-04-24", "2010-03-31", "1945-10-23"))
x
## [1] "2019-06-03" "2015-01-01" "1985-04-24" "2010-03-31" "1945-10-23"
min(x)
## [1] "1945-10-23"
x[1] > x[4]
## [1] TRUE
y <- x[1] - x[3]
y
## Time difference of 12458 days

Introduction- Getting datetimes into R

Examples

x <- as.POSIXct(c("2003-10-01 22:12:00", "2012-03-24 11:34:00"))
x
## [1] "2003-10-01 22:12:00 CEST" "2012-03-24 11:34:00 CET"
x1 <- as.POSIXct(c("2010-10-01 12:12:00", "2012-03-24 11:34:00"), tz = "America/Los_Angeles")
x1
## [1] "2010-10-01 12:12:00 PDT" "2012-03-24 11:34:00 PDT"
str(x1)
##  POSIXct[1:2], format: "2010-10-01 12:12:00" "2012-03-24 11:34:00"
unclass(x1)
## [1] 1285960320 1332614040
## attr(,"tzone")
## [1] "America/Los_Angeles"

Lubridate - Introduction

Lubridate has the advantage that:

  1. is simple to use,

  2. parses formats that don’t have the ISO - 8601 format

  3. allows parsing more than one date format in one vector

  4. All functions return the same time-zone “UTC”.

  5. Ignores seperators

Lubridate - Parsing functions - 1st option

Examples

library(lubridate)
x1 <- mdy(c("12.23.2009", "2/13/2000", "10231998"))  # without separators

x1
## [1] "2009-12-23" "2000-02-13" "1998-10-23"
x2 <- ymd("2018:11:30")
    
x2
## [1] "2018-11-30"
str(x2)
##  Date[1:1], format: "2018-11-30"

Examples

# two different separators 
    
x3 <- dmy("27-2.1999")   
    
x3
## [1] "1999-02-27"
x4 <- dmy("12.5.2000", "12.23.2010") 
## Warning: 1 failed to parse.
x4
## [1] "2000-05-12" NA

Lubridate - Parsing Functions - 2nd option

There is the function parse_date_time() which takes an additional argument, orders,

where you can specify the order of the components in the date.

General Format : parse_date_time(x, orders = " " or c( " " ), tz = “UTC”, truncated = ,…)

Example :

# parse "2010 September 20th" 

parse_date_time("2010 September 20th", orders = "ymd")
## [1] "2010-09-20 UTC"

that would be equivalent to using the ymd() function from the previous exercise.

Lubridate - Parsing Functions - 2nd option (2)

Examples :

x <- "Monday June 1st 2001 at 5pm"

parse_date_time(x, orders = "AmdyIp")
## Warning: All formats failed to parse. No formats found.
## [1] NA
parse_date_time(c("10.23.2010", "2019-10-23"), order = c("mdy","ymd"))
## [1] "2010-10-23 UTC" "2019-10-23 UTC"

Examples

#  Specify order to include both "mdy" and "dmy"

dates <- c("October 23, 2011", "September 10, 2010", "April 26, 2012", "13 July 2011", "3 November 2014")

parse_date_time(dates, orders = c("mdy", "dmy"))
## [1] "2011-10-23 UTC" "2010-09-10 UTC" "2012-04-26 UTC" "2011-07-13 UTC"
## [5] "2014-11-03 UTC"
dates <- c("October 23, 2011", "September 10, 2010", "April 26, 2012", "13 July 2011", "3.10.2014")

parse_date_time(dates, orders = c("mdy", "dmy"))
## [1] "2011-10-23 UTC" "2010-09-10 UTC" "2012-04-26 UTC" "2011-07-13 UTC"
## [5] "2014-03-10 UTC"
parse_date_time(dates, orders = c("dmy", "mdy"))
## [1] "2011-10-23 UTC" "2010-09-10 UTC" "2012-04-26 UTC" "2011-07-13 UTC"
## [5] "2014-10-03 UTC"

Lubridate - Parsing date-times

dmy_hm() ydm_hms() mdy_h()

Examples

x <- c("2014-04-14-04-35-59", "2010-04-01-12-00-00")
y <- ymd_hms(x)
y 
## [1] "2014-04-14 04:35:59 UTC" "2010-04-01 12:00:00 UTC"
parse_date_time(c("13.4.1994 12:00:00", "2000-23-9 03:30:40","5.16.2008 11:25:30"),
                     orders = c("dmy_hms", "ydm_HMS","mdy_HMS"))
## Warning: hms, hm and ms usage is deprecated, please use HMS, HM or MS
## instead. Deprecated in version '1.5.6'.
## [1] "1994-04-13 12:00:00 UTC" "2000-09-23 03:30:40 UTC"
## [3] "2008-05-16 11:25:30 UTC"
parse_date_time(c("13.4.1994 12:00:00", "2000-23-9 03:30:40","5.16.2008 11:25:30"),
                                            orders = c("dmy_HMS", "ydm_HMS","mdy_HMS"))
## [1] "1994-04-13 12:00:00 UTC" "2000-09-23 03:30:40 UTC"
## [3] "2008-05-16 11:25:30 UTC"

Lubridate - One tricky form for incomplete dates

ymd_hms(…, tz = NULL, truncated = 0)

Examples

x <- c("2011-12-31 12:59:59",   "2010-01-01 12:11", "2010-01-01 12",  "2010-11")
ymd_hms(x, truncated = 4)
## [1] "2011-12-31 12:59:59 UTC" "2010-01-01 12:11:00 UTC"
## [3] "2010-01-01 12:00:00 UTC" "2010-11-01 00:00:00 UTC"
x1 <- c("2011-12-31",   "2010-01-01", "2010-01-01",   "2010-11")
ymd(x1, truncated = 1)
## [1] "2011-12-31" "2010-01-01" "2010-01-01" "2010-11-01"
ymd_hms(x, truncated = 3)
## [1] "2011-12-31 12:59:59 UTC" "2010-01-01 12:11:00 UTC"
## [3] "2010-01-01 12:00:00 UTC" "2020-10-11 00:00:00 UTC"
ymd_hms(x1, truncated = 1)
## Warning: All formats failed to parse. No formats found.
## [1] NA NA NA NA

Lubridate - What can we extract?

Components of a datetime can be extracted by lubridate functions with the same name like:

They all work the same way just pass in a datetime or vector of datetimes. There are also a few useful functions that return other aspects of a datetime like if it occurs in the morning am(), in a leap_year(), or which quarter() or semester() it occurs in.

Examples

x <- mdy_hms("12.23.2018 12:34:45",  "2-28-2015 02:34:02", "4-15-2013 11:56:34",   "3-5-2011 06:12:00")

month(x)
## [1] 12  2  4  3
year(x)
## [1] 2018 2015 2013 2011
semester(x)
## [1] 2 1 1 1
yday(x)
## [1] 357  59 105  64
wday(x)
## [1] 1 7 2 7

Lubridate - Adding useful labels

month(x) and wday(x)

  1. label and
  2. abbr

Examples

x <- mdy_hms("12.23.2018 12:34:45", "2-28-2015 02:34:02", "4-15-2013 11:56:34", "3-5-2011 06:12:00")
x
## [1] "2018-12-23 12:34:45 UTC" "2015-02-28 02:34:02 UTC"
## [3] "2013-04-15 11:56:34 UTC" "2011-03-05 06:12:00 UTC"
month(x)
## [1] 12  2  4  3
month(x, label = TRUE, abbr = FALSE)
## [1] Dezember Februar  April    März    
## 12 Levels: Jänner < Februar < März < April < Mai < Juni < ... < Dezember
wday(x, label = TRUE, abbr = TRUE ) # more readable  with %>% table()
## [1] So\\. Sa\\. Mo\\. Sa\\.
## Levels: So\\. < Mo\\. < Di\\. < Mi\\. < Do\\. < Fr\\. < Sa\\.
wday(x)
## [1] 1 7 2 7

Lubridate - Adding useful labels

For “R” the the first day of the week is on Sunday

library(lubridate)

wday(x, label = TRUE)
## [1] So\\. Sa\\. Mo\\. Sa\\.
## Levels: So\\. < Mo\\. < Di\\. < Mi\\. < Do\\. < Fr\\. < Sa\\.
wday(x)
## [1] 1 7 2 7
##          ymd weekday nr_weekday revenue_euro
## 1 2019-05-06   Mo\\.          2        10000
## 2 2019-05-07   Di\\.          3        12000
## 3 2019-05-08   Mi\\.          4        12500
## 4 2019-05-09   Do\\.          5         9000
## 5 2019-05-10   Fr\\.          6        13000

Lubridate - Adding useful labels

Example

ymd <- ymd(c("2019-05-06", "2019-05-07", "2019-05-08", "2019-05-09","2019-05-10"))

weekday <- wday(ymd, label = TRUE, week_start = getOption("lubridate.week.start",1))
weekday
## [1] Mo\\. Di\\. Mi\\. Do\\. Fr\\.
## Levels: Mo\\. < Di\\. < Mi\\. < Do\\. < Fr\\. < Sa\\. < So\\.
nr_weekday <- wday(ymd, week_start = getOption("lubridate.week.start",1))
nr_weekday
## [1] 1 2 3 4 5
##          ymd weekday nr_weekday revenue_euro
## 1 2019-05-06   Mo\\.          1        10000
## 2 2019-05-07   Di\\.          2        12000
## 3 2019-05-08   Mi\\.          3        12500
## 4 2019-05-09   Do\\.          4         9000
## 5 2019-05-10   Fr\\.          5        13000

Lubridate - Plotting

url <- "https://archive.org/download/BildungAnDeutschenHochschulenAll2/Bildung%20an%20deutschen%20Hochschulen_all2.xlsx"
GET(url, write_disk(tf <- tempfile(fileext = ".xlsx")))
## Response [https://ia801506.us.archive.org/0/items/BildungAnDeutschenHochschulenAll2/Bildung%20an%20deutschen%20Hochschulen_all2.xlsx]
##   Date: 2019-06-03 08:51
##   Status: 200
##   Content-Type: application/octet-stream
##   Size: 37.9 kB
## <ON DISK>  C:\Users\flone\AppData\Local\Temp\RtmpuKUnII\file1fbc34923f34.xlsx
Dataset_Hochschule <- read_excel(tf)

# Original plot
plot_Hochschule <-ggplot(Dataset_Hochschule, aes(x = Year, y = Numbers/1000 , color = Citizenship, shape = Status)) +
  geom_line(alpha = 0.4, shape = 5, size =1) +
  geom_jitter(width = 0.2) +
  facet_wrap(.~Gender) +
ggtitle("
Developement of german and non-german HEI personnel from 2005-2017
by Gender, Citizenship and occupational status group") +
  scale_y_continuous("Numbers in thousands", seq(0, max(Dataset_Hochschule$Numbers), 10)) + 
   scale_x_continuous(limits = c(2005, 2017), breaks = seq(2005,2017,2)) + theme_bw() + 
  
 theme(panel.spacing = unit(1.5, "lines"))
## Warning: Ignoring unknown parameters: shape
ggdraw(add_sub(plot_Hochschule, 
"Note: 
The shape of the data points shows the occupation 
and the color of the lines shows the nationality of the persons. 
The category 'Professor' includes full-time professors and junior professors. 
The category 'Non-Professor' includes full-time scientific and artistic personnel.  
It consists of docents and assistants, scientific and artistic co-workers
and staff for other special tasks.",
x=0.6, vjust=0.5, size = 9, fontface = "bold",  vpadding = grid::unit(1, "lines")))
## Warning: Removed 5 rows containing missing values (geom_point).

Lubridate - Time zones

Time zones represent the same across different geographic locations.

Examples

#  to find a specific time zone:   "Lisbon"

grep("Lisbon",OlsonNames(),value=TRUE) 
## [1] "Europe/Lisbon"
#  To find the time in Lisbon in specific time: for example right now
t <- with_tz(now(), tz = "Europe/Lisbon")
t
## [1] "2019-06-03 09:51:53 WEST"
# to change the timezone without changing the clock time.  →  in Jamaica
time <- dmy_hms("23.07.2019 23:12:45")

jam <- force_tz(time, tz = "Jamaica")
jam
## [1] "2019-07-23 23:12:45 EST"

Lubridate - Rounding

Examples

time  <- ymd_hms("2017-06-23 07:13:28 UTC")
wday(time)
## [1] 6
floor_date(time, unit = "month")  # ( weird function ) # Round down to month
## [1] "2017-06-01 UTC"
ceiling_date(time, unit = "week") #Round up to week
## [1] "2017-06-25 UTC"
round_date(time, unit = "6 minutes")  # Round to nearest 6 minutes
## [1] "2017-06-23 07:12:00 UTC"
round_date(time, unit = "3 years") # Round to nearest 3 years
## [1] "2016-01-01 UTC"

Lubridate- Taking differences of daytimes

To get finer control over a difference between datetimes we use the base function difftime(). For example instead of time1 - time2 , we use difftime(time1, time2)

Examples

date1<- mdy("July 23, 1985")
date2<- mdy_hms("Sep 20th , 1990, 02:56:15")
# How many days between date1 and date2 ?
difftime(date2, date1, unit = "days")
## Time difference of 1885.122 days
# How many weeks between date1 and date2 ?
difftime(date2, date1, unit = "weeks")
## Time difference of 269.3032 weeks

Lubridate - Time spans

Creating a time span - Periods

years() , months() , weeks() , days() , hours() minutes() , seconds()

Examples :

weeks(5)
## [1] "35d 0H 0M 0S"
hours(10)
## [1] "10H 0M 0S"
days(30)
## [1] "30d 0H 0M 0S"
weeks(3)
## [1] "21d 0H 0M 0S"

Creating a time span - Durations

dyears() , dweeks() , ddays() , dhours() , dminutes() , dseconds()

Examples

ddays(4)
## [1] "345600s (~4 days)"
dyears(2)  
## [1] "63072000s (~2 years)"
dweeks(4)  
## [1] "2419200s (~4 weeks)"
ddays(23)
## [1] "1987200s (~3.29 weeks)"

Lubridate - Time spans - Extra

Examples - Another ways to create periods

#Separate period and units vectors

period(c(90, 5), c("second", "minute"))
## [1] "5M 90S"
# Units as arguments

period(second = 3, minute = 1, hour = 2, day = 13, week = 1)
## [1] "20d 2H 1M 3S"

Examples

days() + minutes()   
## [1] "1d 0H 1M 0S"
ymd_hms("2015.2.3 21:23:00") + weeks(3)
## [1] "2015-02-24 21:23:00 UTC"
dmy("23th Sep 2014") + dhours(81)
## [1] "2014-09-26 09:00:00 UTC"
c(1:3) * hours(1)
## [1] "1H 0M 0S" "2H 0M 0S" "3H 0M 0S"
x <- ymd("2016-10-03")
x + days(2) + hours(8) + minutes(45)
## [1] "2016-10-05 08:45:00 UTC"

Lubridate - Generating sequences of datetimes

today() + 1:10 * days(1)
##  [1] "2019-06-04" "2019-06-05" "2019-06-06" "2019-06-07" "2019-06-08"
##  [6] "2019-06-09" "2019-06-10" "2019-06-11" "2019-06-12" "2019-06-13"

Lubridate - The tricky thing about months

In general lubridate returns the same day of the month in the next month,

but since the 31st of February doesn’t exist lubridate returns a missing value, NA

jan <- ymd_hms("2010-01-31 03:04:05")
jan + months(1:3)  
## [1] NA                        "2010-03-31 03:04:05 UTC"
## [3] NA

%m+% and %m-% that have different behavior.

Examples

jan <- ymd_hms("2010-01-31 03:04:05")
jan + months(1:3) 
## [1] NA                        "2010-03-31 03:04:05 UTC"
## [3] NA
jan1 <- ymd("2012-01-31")
jan1 %m+% months(1:3)
## [1] "2012-02-29" "2012-03-31" "2012-04-30"
apr1 <- ymd("2018.04.30")
apr1 - months(1:3)
## [1] "2018-03-30" NA           "2018-01-30"
apr1 %m-% months(1:3)
## [1] "2018-03-30" "2018-02-28" "2018-01-30"

Lubridate - Roll Back Date To Last Day Of Previous Month

rollback(dates, roll_to_first = FALSE, preserve_hms = TRUE) Dates: A POSIXct, POSIXlt or Date class object.

Examples

date <- ymd("2010-03-03")
dates <- date + months(0:2)
dates
## [1] "2010-03-03" "2010-04-03" "2010-05-03"
rollback(dates)
## [1] "2010-02-28" "2010-03-31" "2010-04-30"
date <- ymd_hms("2010-03-03 12:44:22")

rollback(date, roll_to_first = TRUE)
## [1] "2010-03-01 12:44:22 UTC"
rollback(date, preserve_hms = FALSE)
## [1] "2010-02-28 UTC"
rollback(date, roll_to_first = TRUE, preserve_hms = FALSE)
## [1] "2010-03-01 UTC"

Lubridate - Comparing intervals and datetimes

y2001 <- ymd("2001-01-01") %--% ymd("2001-12-31")

x21 <- ymd("2001-02-09") %--% mdy("3.23.2001")

ymd("2002-03-30") %within% y2001
## [1] FALSE
x21 %within% y2001
## [1] TRUE

Lubridate - Converting to durations and periods

Examples

int1 <- interval(ymd("2009-01-01"), ymd("2009-08-01"))
as.duration(int1)
## [1] "18316800s (~30.29 weeks)"
as.period(int1)
## [1] "7m 0d 0H 0M 0S"

Tidytext

How would you analyze the content of a book or a tweet..?

Tidytext

Julia Silge and David Robinson developed the package “tidytext”, which makes text mining easier and more effective.

Tidytext Flow:

Text Data

To get the text data there are different R packages accessing different sources

… or you could read in any other PDF or Word documents

We will be using GutenbergR to download religious books from Buddhism, Christianity, Hinduism, Islam and Judaism.

Tidy Text

Hadley Wickham (2014) defines tidy data as data, that has been cleaned and reshaped in a way that is ready for analysis. The tidy text format is defined as a table with one-token-per-row.

Tidy text here is defined as one token-per-row, wheras a token can be

Where are we now?

Tidytext Flow:

Tokenization

By default, one token is one word.

x_tidy <- x_all %>%
 unnest_tokens(word, text)
x_tidy
## # A tibble: 19,620,261 x 2
##    gutenberg_id word        
##           <int> <chr>       
##  1          124 encoded     
##  2          124 under       
##  3          124 the         
##  4          124 direction   
##  5          124 of          
##  6          124 robert      
##  7          124 kraft       
##  8          124 for         
##  9          124 distribution
## 10          124 through     
## # ... with 19,620,251 more rows

As you can see here we have one word per row.

Tokenization

But you can also have two words (bigram) as a token.

x_ntoken <- x_all %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
x_ntoken
## # A tibble: 19,620,007 x 2
##    gutenberg_id bigram              
##           <int> <chr>               
##  1          124 encoded under       
##  2          124 under the           
##  3          124 the direction       
##  4          124 direction of        
##  5          124 of robert           
##  6          124 robert kraft        
##  7          124 kraft for           
##  8          124 for distribution    
##  9          124 distribution through
## 10          124 through the         
## # ... with 19,619,997 more rows

As you can see here we have two words per row.

Exercise 1

  1. unnest_tokens of x_all
    1. one token = one word
    2. one token = two words
    3. Count the word frequency

Where are we now?

Tidytext Flow:

Stopwords

Stopwords do not really reflect the content but are very common. Here are some examples:

##  [1] "i"         "me"        "my"        "myself"    "we"       
##  [6] "our"       "ours"      "ourselves" "you"       "your"     
## [11] "yours"     "yourself"

Code:

x_cleaned <- x_tidy %>%
  anti_join(get_stopwords())

Exercise 2

  1. Antijoin the stopwords
    1. and count again: better?

Where are we now?

Tidytext Flow:

Exercise 3

  1. Have a look at the different sentiments: get_sentiments
    1. nrc
    2. bing

Sentiments

What could be useful for the analysis, is the valence of the words:

nr <- get_sentiments("nrc")
nr[1:6,1:2]
## # A tibble: 6 x 2
##   word      sentiment
##   <chr>     <chr>    
## 1 abacus    trust    
## 2 abandon   fear     
## 3 abandon   negative 
## 4 abandon   sadness  
## 5 abandoned anger    
## 6 abandoned fear
bi <- get_sentiments("bing")
bi[10:16,1:2]
## # A tibble: 7 x 2
##   word     sentiment
##   <chr>    <chr>    
## 1 abort    negative 
## 2 aborted  negative 
## 3 aborts   negative 
## 4 abound   positive 
## 5 abounds  positive 
## 6 abrade   negative 
## 7 abrasive negative

Exercise 4

  1. Join words with sentiments using inner_join

Clean data

x_cleaned <- x_tidy %>%
  anti_join(get_stopwords()) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>% # delete __that__
  inner_join(get_sentiments("bing"))

Analysis

Now that the data is tokenized and cleaned:

x_cleaned %>%
  count(word, sort = TRUE)
## # A tibble: 5,401 x 2
##    word       n
##    <chr>  <int>
##  1 great  28975
##  2 good   25780
##  3 love   23054
##  4 like   22559
##  5 sin    19049
##  6 faith  16067
##  7 work   14622
##  8 death  14292
##  9 well   14174
## 10 divine 13943
## # ... with 5,391 more rows

Where are we now?

Tidytext Flow:

Group by

Grouping the words by religion, which results in frequencies of negative vs. positive words.

frequency <- total %>%
  count (gutenberg_bookshelf, sentiment)%>% #number of positve and negative words
  group_by (gutenberg_bookshelf) %>% #group by religion
  mutate (proportion = n / sum (n))

Where are we now?

Tidytext Flow:

Grouped Plot

Another Plot (not grouped)