Athanasios Paschos, Florian Nemetz & Kaja Benz
3th of June 2019
All-numeric dates can be written in several forms:
31/12/99 or 31.12.99 or 1999-12-31.
So we need one unambiguous and machine-readable date form.
The global standard numeric date format - ISO 8601 standard with the format YYYY-MM-DD.
Example:
If you have a character string that represents a date in the ISO 8601 standard you can turn it into a Date using the as.Date() function.
Just pass the character string (or a vector of character strings) as the first argument. Something important is to write the dates in ISO 8601 standard that means with the form YYYY-MM-DD.
Example:
x <- "23-10-2018"
str(x)
## chr "23-10-2018"
# first possible function to import dates
y <- as.Date(x)
y
## [1] "0023-10-20"
str(y)
## Date[1:1], format: "0023-10-20"
library(tidyverse)
x <- "2013-07-13"
y <- as.Date(x)
y
## [1] "2013-07-13"
str(y) # structure of y
## Date[1:1], format: "2013-07-13"
jan_12 <- as.Date("2015-01-12")
jan_12
## [1] "2015-01-12"
unclass(jan_12)
## [1] 16447
Because Date-objects are represented as the number of days since 1970-01-01 you can compare dates with the usual logical operators (<, ==, > etc.)
You can also find extremes with min() and max() and even subtract two dates to find out the time between them.
x <- as.Date(c("2019-06-03", "2015-01-01", "1985-04-24", "2010-03-31", "1945-10-23"))
x
## [1] "2019-06-03" "2015-01-01" "1985-04-24" "2010-03-31" "1945-10-23"
min(x)
## [1] "1945-10-23"
x[1] > x[4]
## [1] TRUE
y <- x[1] - x[3]
y
## Time difference of 12458 days
Just like dates without times, if you want R to recognize a string as a datetime you need to convert it, although now you use as.POSIXct().
as.POSIXct() expects strings to be with the format YYYY-MM-DD HH:MM:SS. The times will be interpreted in local time.
You can check your timezone with Sys.timezone().
If you want the time to be interpreted in a different timezone, you just set the tz = “timezone” argument of as.POSIXct()
Tip 1: use OlsonNames() to get a list of time zones defined in your machine
x <- as.POSIXct(c("2003-10-01 22:12:00", "2012-03-24 11:34:00"))
x
## [1] "2003-10-01 22:12:00 CEST" "2012-03-24 11:34:00 CET"
x1 <- as.POSIXct(c("2010-10-01 12:12:00", "2012-03-24 11:34:00"), tz = "America/Los_Angeles")
x1
## [1] "2010-10-01 12:12:00 PDT" "2012-03-24 11:34:00 PDT"
str(x1)
## POSIXct[1:2], format: "2010-10-01 12:12:00" "2012-03-24 11:34:00"
unclass(x1)
## [1] 1285960320 1332614040
## attr(,"tzone")
## [1] "America/Los_Angeles"
Lubridate has the advantage that:
is simple to use,
parses formats that don’t have the ISO - 8601 format
allows parsing more than one date format in one vector
All functions return the same time-zone “UTC”.
Ignores seperators
Provides a set of functions for parsing dates of a known order.
For example, ymd() will parse dates with year first, followed by month and then day. All the functions with y, m and d in any order exist: mdy() , dmy() , ydm(), myd(), dym()
The parsing is flexible, for example:
it will parse the m whether it is numeric (e.g. 9 or 09) and
a full month name (e.g. September), or an abbreviated month-name (e.g. Sep).
library(lubridate)
x1 <- mdy(c("12.23.2009", "2/13/2000", "10231998")) # without separators
x1
## [1] "2009-12-23" "2000-02-13" "1998-10-23"
x2 <- ymd("2018:11:30")
x2
## [1] "2018-11-30"
str(x2)
## Date[1:1], format: "2018-11-30"
# two different separators
x3 <- dmy("27-2.1999")
x3
## [1] "1999-02-27"
x4 <- dmy("12.5.2000", "12.23.2010")
## Warning: 1 failed to parse.
x4
## [1] "2000-05-12" NA
There is the function parse_date_time() which takes an additional argument, orders,
where you can specify the order of the components in the date.
General Format : parse_date_time(x, orders = " " or c( " " ), tz = “UTC”, truncated = ,…)
Example :
# parse "2010 September 20th"
parse_date_time("2010 September 20th", orders = "ymd")
## [1] "2010-09-20 UTC"
that would be equivalent to using the ymd() function from the previous exercise.
One big advantage is that you can specify a vector of orders, and that allows parsing of dates where multiple formats might be used.
Another advantage of parse_date_time() is that you can use more format characters.
For example, you can specify weekday names with A, I for 12 hour time,am/pm indicators with p and many others.
You can read more about format here (https://www.rdocumentation.org/packages/lubridate/versions/1.7.4/topics/parse_date_time)
x <- "Monday June 1st 2001 at 5pm"
parse_date_time(x, orders = "AmdyIp")
## Warning: All formats failed to parse. No formats found.
## [1] NA
parse_date_time(c("10.23.2010", "2019-10-23"), order = c("mdy","ymd"))
## [1] "2010-10-23 UTC" "2019-10-23 UTC"
# Specify order to include both "mdy" and "dmy"
dates <- c("October 23, 2011", "September 10, 2010", "April 26, 2012", "13 July 2011", "3 November 2014")
parse_date_time(dates, orders = c("mdy", "dmy"))
## [1] "2011-10-23 UTC" "2010-09-10 UTC" "2012-04-26 UTC" "2011-07-13 UTC"
## [5] "2014-11-03 UTC"
dates <- c("October 23, 2011", "September 10, 2010", "April 26, 2012", "13 July 2011", "3.10.2014")
parse_date_time(dates, orders = c("mdy", "dmy"))
## [1] "2011-10-23 UTC" "2010-09-10 UTC" "2012-04-26 UTC" "2011-07-13 UTC"
## [5] "2014-03-10 UTC"
parse_date_time(dates, orders = c("dmy", "mdy"))
## [1] "2011-10-23 UTC" "2010-09-10 UTC" "2012-04-26 UTC" "2011-07-13 UTC"
## [5] "2014-10-03 UTC"
dmy_hm() ydm_hms() mdy_h()
x <- c("2014-04-14-04-35-59", "2010-04-01-12-00-00")
y <- ymd_hms(x)
y
## [1] "2014-04-14 04:35:59 UTC" "2010-04-01 12:00:00 UTC"
parse_date_time(c("13.4.1994 12:00:00", "2000-23-9 03:30:40","5.16.2008 11:25:30"),
orders = c("dmy_hms", "ydm_HMS","mdy_HMS"))
## Warning: hms, hm and ms usage is deprecated, please use HMS, HM or MS
## instead. Deprecated in version '1.5.6'.
## [1] "1994-04-13 12:00:00 UTC" "2000-09-23 03:30:40 UTC"
## [3] "2008-05-16 11:25:30 UTC"
parse_date_time(c("13.4.1994 12:00:00", "2000-23-9 03:30:40","5.16.2008 11:25:30"),
orders = c("dmy_HMS", "ydm_HMS","mdy_HMS"))
## [1] "1994-04-13 12:00:00 UTC" "2000-09-23 03:30:40 UTC"
## [3] "2008-05-16 11:25:30 UTC"
ymd_hms(…, tz = NULL, truncated = 0)
x <- c("2011-12-31 12:59:59", "2010-01-01 12:11", "2010-01-01 12", "2010-11")
ymd_hms(x, truncated = 4)
## [1] "2011-12-31 12:59:59 UTC" "2010-01-01 12:11:00 UTC"
## [3] "2010-01-01 12:00:00 UTC" "2010-11-01 00:00:00 UTC"
x1 <- c("2011-12-31", "2010-01-01", "2010-01-01", "2010-11")
ymd(x1, truncated = 1)
## [1] "2011-12-31" "2010-01-01" "2010-01-01" "2010-11-01"
ymd_hms(x, truncated = 3)
## [1] "2011-12-31 12:59:59 UTC" "2010-01-01 12:11:00 UTC"
## [3] "2010-01-01 12:00:00 UTC" "2020-10-11 00:00:00 UTC"
ymd_hms(x1, truncated = 1)
## Warning: All formats failed to parse. No formats found.
## [1] NA NA NA NA
Components of a datetime can be extracted by lubridate functions with the same name like:
They all work the same way just pass in a datetime or vector of datetimes. There are also a few useful functions that return other aspects of a datetime like if it occurs in the morning am(), in a leap_year(), or which quarter() or semester() it occurs in.
x <- mdy_hms("12.23.2018 12:34:45", "2-28-2015 02:34:02", "4-15-2013 11:56:34", "3-5-2011 06:12:00")
month(x)
## [1] 12 2 4 3
year(x)
## [1] 2018 2015 2013 2011
semester(x)
## [1] 2 1 1 1
yday(x)
## [1] 357 59 105 64
wday(x)
## [1] 1 7 2 7
month(x) and wday(x)
x <- mdy_hms("12.23.2018 12:34:45", "2-28-2015 02:34:02", "4-15-2013 11:56:34", "3-5-2011 06:12:00")
x
## [1] "2018-12-23 12:34:45 UTC" "2015-02-28 02:34:02 UTC"
## [3] "2013-04-15 11:56:34 UTC" "2011-03-05 06:12:00 UTC"
month(x)
## [1] 12 2 4 3
month(x, label = TRUE, abbr = FALSE)
## [1] Dezember Februar April März
## 12 Levels: Jänner < Februar < März < April < Mai < Juni < ... < Dezember
wday(x, label = TRUE, abbr = TRUE ) # more readable with %>% table()
## [1] So\\. Sa\\. Mo\\. Sa\\.
## Levels: So\\. < Mo\\. < Di\\. < Mi\\. < Do\\. < Fr\\. < Sa\\.
wday(x)
## [1] 1 7 2 7
For “R” the the first day of the week is on Sunday
library(lubridate)
wday(x, label = TRUE)
## [1] So\\. Sa\\. Mo\\. Sa\\.
## Levels: So\\. < Mo\\. < Di\\. < Mi\\. < Do\\. < Fr\\. < Sa\\.
wday(x)
## [1] 1 7 2 7
## ymd weekday nr_weekday revenue_euro
## 1 2019-05-06 Mo\\. 2 10000
## 2 2019-05-07 Di\\. 3 12000
## 3 2019-05-08 Mi\\. 4 12500
## 4 2019-05-09 Do\\. 5 9000
## 5 2019-05-10 Fr\\. 6 13000
To change this you must use the extra argument :
week_start = getOption(“lubridate.week.start”, 7)
The number 7 corresponds to Sunday…
ymd <- ymd(c("2019-05-06", "2019-05-07", "2019-05-08", "2019-05-09","2019-05-10"))
weekday <- wday(ymd, label = TRUE, week_start = getOption("lubridate.week.start",1))
weekday
## [1] Mo\\. Di\\. Mi\\. Do\\. Fr\\.
## Levels: Mo\\. < Di\\. < Mi\\. < Do\\. < Fr\\. < Sa\\. < So\\.
nr_weekday <- wday(ymd, week_start = getOption("lubridate.week.start",1))
nr_weekday
## [1] 1 2 3 4 5
## ymd weekday nr_weekday revenue_euro
## 1 2019-05-06 Mo\\. 1 10000
## 2 2019-05-07 Di\\. 2 12000
## 3 2019-05-08 Mi\\. 3 12500
## 4 2019-05-09 Do\\. 4 9000
## 5 2019-05-10 Fr\\. 5 13000
url <- "https://archive.org/download/BildungAnDeutschenHochschulenAll2/Bildung%20an%20deutschen%20Hochschulen_all2.xlsx"
GET(url, write_disk(tf <- tempfile(fileext = ".xlsx")))
## Response [https://ia801506.us.archive.org/0/items/BildungAnDeutschenHochschulenAll2/Bildung%20an%20deutschen%20Hochschulen_all2.xlsx]
## Date: 2019-06-03 08:51
## Status: 200
## Content-Type: application/octet-stream
## Size: 37.9 kB
## <ON DISK> C:\Users\flone\AppData\Local\Temp\RtmpuKUnII\file1fbc34923f34.xlsx
Dataset_Hochschule <- read_excel(tf)
# Original plot
plot_Hochschule <-ggplot(Dataset_Hochschule, aes(x = Year, y = Numbers/1000 , color = Citizenship, shape = Status)) +
geom_line(alpha = 0.4, shape = 5, size =1) +
geom_jitter(width = 0.2) +
facet_wrap(.~Gender) +
ggtitle("
Developement of german and non-german HEI personnel from 2005-2017
by Gender, Citizenship and occupational status group") +
scale_y_continuous("Numbers in thousands", seq(0, max(Dataset_Hochschule$Numbers), 10)) +
scale_x_continuous(limits = c(2005, 2017), breaks = seq(2005,2017,2)) + theme_bw() +
theme(panel.spacing = unit(1.5, "lines"))
## Warning: Ignoring unknown parameters: shape
ggdraw(add_sub(plot_Hochschule,
"Note:
The shape of the data points shows the occupation
and the color of the lines shows the nationality of the persons.
The category 'Professor' includes full-time professors and junior professors.
The category 'Non-Professor' includes full-time scientific and artistic personnel.
It consists of docents and assistants, scientific and artistic co-workers
and staff for other special tasks.",
x=0.6, vjust=0.5, size = 9, fontface = "bold", vpadding = grid::unit(1, "lines")))
## Warning: Removed 5 rows containing missing values (geom_point).
Time zones represent the same across different geographic locations.
For example “2016-08-21 11:53:24” in “UCT” time zone is identical “2016-08-21 19:53:24” in “CST” TZ. Lubridate provides two functions that help dealing with time zones:
with_tz() : shows the same instant in a different timezone
force_tz() : This function changes the timezone without changing the clock time.
Tip 2: use grep(“country_name (or city_name)”,OlsonNames(),value=TRUE) to find a specific time zone
# to find a specific time zone: "Lisbon"
grep("Lisbon",OlsonNames(),value=TRUE)
## [1] "Europe/Lisbon"
# To find the time in Lisbon in specific time: for example right now
t <- with_tz(now(), tz = "Europe/Lisbon")
t
## [1] "2019-06-03 09:51:53 WEST"
# to change the timezone without changing the clock time. â in Jamaica
time <- dmy_hms("23.07.2019 23:12:45")
jam <- force_tz(time, tz = "Jamaica")
jam
## [1] "2019-07-23 23:12:45 EST"
ceiling_date() : rounds up.
All three take a unit argument which specifies the resolution of rounding.
You can specify “second”, “minute”, “hour”, “day”, “week”, “month”, “bimonth”, “quarter”,“halfyear”, or “year”.
Or, you can specify any multiple of those units, e.g. “5 years”, “3 minutes” etc.
floor_date(x, unit = " " ),
round_date(x, unit = " " ),
ceiling_date(x, unit = " " )
time <- ymd_hms("2017-06-23 07:13:28 UTC")
wday(time)
## [1] 6
floor_date(time, unit = "month") # ( weird function ) # Round down to month
## [1] "2017-06-01 UTC"
ceiling_date(time, unit = "week") #Round up to week
## [1] "2017-06-25 UTC"
round_date(time, unit = "6 minutes") # Round to nearest 6 minutes
## [1] "2017-06-23 07:12:00 UTC"
round_date(time, unit = "3 years") # Round to nearest 3 years
## [1] "2016-01-01 UTC"
To get finer control over a difference between datetimes we use the base function difftime(). For example instead of time1 - time2 , we use difftime(time1, time2)
date1<- mdy("July 23, 1985")
date2<- mdy_hms("Sep 20th , 1990, 02:56:15")
# How many days between date1 and date2 ?
difftime(date2, date1, unit = "days")
## Time difference of 1885.122 days
# How many weeks between date1 and date2 ?
difftime(date2, date1, unit = "weeks")
## Time difference of 269.3032 weeks
Period Human concept of a time span
Datetime + period of one day = Same time on the next day
Duration Stopwatch concept of a time span
Datetime + duration of one day = Datetime + 86400 seconds
years() , months() , weeks() , days() , hours() minutes() , seconds()
Examples :
weeks(5)
## [1] "35d 0H 0M 0S"
hours(10)
## [1] "10H 0M 0S"
days(30)
## [1] "30d 0H 0M 0S"
weeks(3)
## [1] "21d 0H 0M 0S"
dyears() , dweeks() , ddays() , dhours() , dminutes() , dseconds()
Examples
ddays(4)
## [1] "345600s (~4 days)"
dyears(2)
## [1] "63072000s (~2 years)"
dweeks(4)
## [1] "2419200s (~4 weeks)"
ddays(23)
## [1] "1987200s (~3.29 weeks)"
Why is there months() but no dmonths() ?
There is no direct unambiguous value of months in seconds since months have differing numbers of days.
28 or 29 days: February
The month is not a duration of time defined independently of when it occurs, but a special interval between two dates.
#Separate period and units vectors
period(c(90, 5), c("second", "minute"))
## [1] "5M 90S"
# Units as arguments
period(second = 3, minute = 1, hour = 2, day = 13, week = 1)
## [1] "20d 2H 1M 3S"
days() + minutes()
## [1] "1d 0H 1M 0S"
ymd_hms("2015.2.3 21:23:00") + weeks(3)
## [1] "2015-02-24 21:23:00 UTC"
dmy("23th Sep 2014") + dhours(81)
## [1] "2014-09-26 09:00:00 UTC"
c(1:3) * hours(1)
## [1] "1H 0M 0S" "2H 0M 0S" "3H 0M 0S"
x <- ymd("2016-10-03")
x + days(2) + hours(8) + minutes(45)
## [1] "2016-10-05 08:45:00 UTC"
By combining addition and multiplication with sequences you can generate sequences of datetimes.
For example, you can generate a sequence of periods from 1 day up to 10 days with:
1:10 * days(1)
Then by adding this sequence to a specific datetime, you can construct a sequence of datetimes from 1 day up to 10 days into the future
today() + 1:10 * days(1)
## [1] "2019-06-04" "2019-06-05" "2019-06-06" "2019-06-07" "2019-06-08"
## [6] "2019-06-09" "2019-06-10" "2019-06-11" "2019-06-12" "2019-06-13"
In general lubridate returns the same day of the month in the next month,
but since the 31st of February doesn’t exist lubridate returns a missing value, NA
jan <- ymd_hms("2010-01-31 03:04:05")
jan + months(1:3)
## [1] NA "2010-03-31 03:04:05 UTC"
## [3] NA
%m+% and %m-% that have different behavior.
Rather than returning an NA for a non-existent date,
they roll back to the last existing date.
jan <- ymd_hms("2010-01-31 03:04:05")
jan + months(1:3)
## [1] NA "2010-03-31 03:04:05 UTC"
## [3] NA
jan1 <- ymd("2012-01-31")
jan1 %m+% months(1:3)
## [1] "2012-02-29" "2012-03-31" "2012-04-30"
apr1 <- ymd("2018.04.30")
apr1 - months(1:3)
## [1] "2018-03-30" NA "2018-01-30"
apr1 %m-% months(1:3)
## [1] "2018-03-30" "2018-02-28" "2018-01-30"
Rollback changes a date to the last day of the previous month or to the first day of the month.
Optionally, the new date can retain the same hour, minute, and second information.
rollback(dates, roll_to_first = FALSE, preserve_hms = TRUE) Dates: A POSIXct, POSIXlt or Date class object.
roll_to_first Rollback to the first day of the month instead of the last day of the previous month
preserve_hms Retains the same hour, minute, and second information. If FALSE, the new date will be at 00:00:00.
date <- ymd("2010-03-03")
dates <- date + months(0:2)
dates
## [1] "2010-03-03" "2010-04-03" "2010-05-03"
rollback(dates)
## [1] "2010-02-28" "2010-03-31" "2010-04-30"
date <- ymd_hms("2010-03-03 12:44:22")
rollback(date, roll_to_first = TRUE)
## [1] "2010-03-01 12:44:22 UTC"
rollback(date, preserve_hms = FALSE)
## [1] "2010-02-28 UTC"
rollback(date, roll_to_first = TRUE, preserve_hms = FALSE)
## [1] "2010-03-01 UTC"
The operator %within% tests if the datetime (or interval) on the left hand side is within the interval of the right hand side.
For example, if y2001 is the interval covering the year 2001
y2001 <- ymd("2001-01-01") %--% ymd("2001-12-31")
x21 <- ymd("2001-02-09") %--% mdy("3.23.2001")
ymd("2002-03-30") %within% y2001
## [1] FALSE
x21 %within% y2001
## [1] TRUE
Intervals are the most specific way to represent a span of time since they retain information about the exact start and end moments.
They can be converted to periods and durations exactly: it’s possible to calculate both the exact number of seconds elapsed between the start and end date, as well as the perceived change in clock time.
To do so you use the as.period(), and as.duration() functions, parsing in an interval as the only argument.
int1 <- interval(ymd("2009-01-01"), ymd("2009-08-01"))
as.duration(int1)
## [1] "18316800s (~30.29 weeks)"
as.period(int1)
## [1] "7m 0d 0H 0M 0S"
How would you analyze the content of a book or a tweet..?
Julia Silge and David Robinson developed the package “tidytext”, which makes text mining easier and more effective.
To get the text data there are different R packages accessing different sources
… or you could read in any other PDF or Word documents
We will be using GutenbergR to download religious books from Buddhism, Christianity, Hinduism, Islam and Judaism.
Hadley Wickham (2014) defines tidy data as data, that has been cleaned and reshaped in a way that is ready for analysis. The tidy text format is defined as a table with one-token-per-row.
Tidy text here is defined as one token-per-row, wheras a token can be
Tidytext Flow:
By default, one token is one word.
x_tidy <- x_all %>%
unnest_tokens(word, text)
x_tidy
## # A tibble: 19,620,261 x 2
## gutenberg_id word
## <int> <chr>
## 1 124 encoded
## 2 124 under
## 3 124 the
## 4 124 direction
## 5 124 of
## 6 124 robert
## 7 124 kraft
## 8 124 for
## 9 124 distribution
## 10 124 through
## # ... with 19,620,251 more rows
As you can see here we have one word per row.
But you can also have two words (bigram) as a token.
x_ntoken <- x_all %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
x_ntoken
## # A tibble: 19,620,007 x 2
## gutenberg_id bigram
## <int> <chr>
## 1 124 encoded under
## 2 124 under the
## 3 124 the direction
## 4 124 direction of
## 5 124 of robert
## 6 124 robert kraft
## 7 124 kraft for
## 8 124 for distribution
## 9 124 distribution through
## 10 124 through the
## # ... with 19,619,997 more rows
As you can see here we have two words per row.
Tidytext Flow:
Stopwords do not really reflect the content but are very common. Here are some examples:
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself"
Code:
x_cleaned <- x_tidy %>%
anti_join(get_stopwords())
Tidytext Flow:
What could be useful for the analysis, is the valence of the words:
nr <- get_sentiments("nrc")
nr[1:6,1:2]
## # A tibble: 6 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
bi <- get_sentiments("bing")
bi[10:16,1:2]
## # A tibble: 7 x 2
## word sentiment
## <chr> <chr>
## 1 abort negative
## 2 aborted negative
## 3 aborts negative
## 4 abound positive
## 5 abounds positive
## 6 abrade negative
## 7 abrasive negative
x_cleaned <- x_tidy %>%
anti_join(get_stopwords()) %>%
mutate(word = str_extract(word, "[a-z']+")) %>% # delete __that__
inner_join(get_sentiments("bing"))
Now that the data is tokenized and cleaned:
x_cleaned %>%
count(word, sort = TRUE)
## # A tibble: 5,401 x 2
## word n
## <chr> <int>
## 1 great 28975
## 2 good 25780
## 3 love 23054
## 4 like 22559
## 5 sin 19049
## 6 faith 16067
## 7 work 14622
## 8 death 14292
## 9 well 14174
## 10 divine 13943
## # ... with 5,391 more rows
Tidytext Flow:
Grouping the words by religion, which results in frequencies of negative vs. positive words.
frequency <- total %>%
count (gutenberg_bookshelf, sentiment)%>% #number of positve and negative words
group_by (gutenberg_bookshelf) %>% #group by religion
mutate (proportion = n / sum (n))
Tidytext Flow: