--- title: "Lubridate & Tidytext" author: "Athanasios Paschos, Florian Nemetz & Kaja Benz" date: "3th of June 2019" output: slidy_presentation: theme: journal highlight: default footer: "Tidytext & Lubridate" --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, message = TRUE, warnings = FALSE) library(tidyverse) library(lubridate) library(ggplot2) library(dplyr) library(tidytext) library (httr) library(readxl) library(cowplot) library(grid) ``` # Introduction - Specifying Dates * All-numeric dates can be written in several forms: **31/12/99** or **31.12.99** or **1999-12-31**. * So we need one unambiguous and machine-readable date form. * The global standard numeric date format - **ISO 8601 standard** with the format **YYYY-MM-DD**. **Example**: * 2th of January 2018 ----> **2018-01-02** * Sep 12th 2017 ----> **2017-09-12** # Introduction - Specifying Dates * If you have a character string that represents a date in the ISO 8601 standard you can turn it into a Date using the as.Date() function. * Just pass the character string (or a vector of character strings) as the first argument. Something important is to write the dates in ISO 8601 standard that means with the form **YYYY-MM-DD**. **Example**: ```{r} x <- "23-10-2018" str(x) # first possible function to import dates y <- as.Date(x) y str(y) ``` # Introduction - With the right format ```{r} library(tidyverse) x <- "2013-07-13" y <- as.Date(x) y str(y) # structure of y jan_12 <- as.Date("2015-01-12") jan_12 unclass(jan_12) ``` # Introduction - Arithmetic and logical operators * Because Date-objects are represented as the number of days since **1970-01-01** you can compare dates with the usual logical operators **(<, ==, > etc.)** * You can also find extremes with **min()** and **max()** and even subtract two dates to find out the time between them. ```{r} x <- as.Date(c("2019-06-03", "2015-01-01", "1985-04-24", "2010-03-31", "1945-10-23")) x min(x) x[1] > x[4] y <- x[1] - x[3] y ``` # Introduction- Getting datetimes into R * Just like dates without times, if you want R to recognize a string as a datetime you need to convert it, although now you use **as.POSIXct()**. * **as.POSIXct()** expects strings to be with the format **YYYY-MM-DD HH:MM:SS**. The times will be interpreted in local time. * You can check your timezone with **Sys.timezone()**. * If you want the time to be interpreted in a different timezone, you just set the **tz = "timezone"** argument of as.POSIXct() * Tip 1: use OlsonNames() to get a list of time zones defined in your machine # **Examples** ```{r} x <- as.POSIXct(c("2003-10-01 22:12:00", "2012-03-24 11:34:00")) x x1 <- as.POSIXct(c("2010-10-01 12:12:00", "2012-03-24 11:34:00"), tz = "America/Los_Angeles") x1 str(x1) unclass(x1) ``` # Lubridate - Introduction **Lubridate** has the advantage that: 1. is simple to use, 2. parses formats that don't have the ISO - 8601 format 3. allows parsing more than one date format in one vector 4. All functions return the same time-zone **"UTC"**. 5. Ignores seperators # Lubridate - Parsing functions - 1st option * Provides a set of functions for parsing dates of a known order. * For example, **ymd()** will parse dates with year first, followed by month and then day. All the functions with **y**, **m** and **d** in any order exist: **mdy()** , **dmy()** , **ydm()**, **myd()**, **dym()** * The parsing is flexible, for example: * it will parse the **m** whether it is numeric (e.g. 9 or 09) and * a full month name (e.g. September), or an **abbreviated** month-name (e.g. 
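A minimal sketch of this flexibility (our own toy strings, not from the exercises): the same date written with a numeric month, a full month name, and an abbreviation all parse to the same Date.

```{r}
library(lubridate)
# one date, three spellings of the month; all three parse identically
mdy("9/23/2019")
mdy("September 23, 2019")
mdy("Sep 23 2019")
```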
# **Examples**

```{r warning = TRUE}
library(lubridate)
x1 <- mdy(c("12.23.2009", "2/13/2000", "10231998")) # without separators
x1
x2 <- ymd("2018:11:30")
x2
str(x2)
```

# **Examples**

```{r}
# two different separators
x3 <- dmy("27-2.1999")
x3
x4 <- dmy("12.5.2000", "12.23.2010")
x4
```

# Lubridate - Parsing Functions - 2nd option

There is the function **parse_date_time()**, which takes an additional argument, orders, where you can specify the order of the components in the date.

**General format**: **parse_date_time(x, orders, tz = "UTC", truncated = 0, ...)**, where orders is a single string or a character vector of orders.

**Example**:

```{r}
# parse "2010 September 20th"
parse_date_time("2010 September 20th", orders = "ymd")
```

That would be equivalent to using the **ymd()** function from the previous exercise.

# Lubridate - Parsing Functions - 2nd option (2)

* One big advantage is that you can specify a vector of orders, which allows parsing of dates where multiple formats might be used.
* Another advantage of **parse_date_time()** is that you can use more format characters.
* For example, you can specify weekday names with **A**, 12-hour time with **I**, **am/pm** indicators with **p**, and many others.
* You can read more about the formats here: https://www.rdocumentation.org/packages/lubridate/versions/1.7.4/topics/parse_date_time

# **Examples**

```{r}
x <- "Monday June 1st 2001 at 5pm"
parse_date_time(x, orders = "AmdyIp")
parse_date_time(c("10.23.2010", "2019-10-23"), orders = c("mdy", "ymd"))
```

# **Examples**

```{r}
# Specify orders to include both "mdy" and "dmy" dates
dates <- c("October 23, 2011", "September 10, 2010", "April 26, 2012",
           "13 July 2011", "3 November 2014")
parse_date_time(dates, orders = c("mdy", "dmy"))
dates <- c("October 23, 2011", "September 10, 2010", "April 26, 2012",
           "13 July 2011", "3.10.2014")
parse_date_time(dates, orders = c("mdy", "dmy"))
parse_date_time(dates, orders = c("dmy", "mdy"))
```

# Lubridate - Parsing date-times

* If your dates have times as well, you can use the functions that start with **ymd**, **dmy**, **mdy**, ... followed by one of _h, _hm or _hms:

    **dmy_hm()**, **ydm_hms()**, **mdy_h()**

* For **parse_date_time()** you need orders like **"ymd_HMS"**, **"mdy_HM"**, **"dmy_HMS"**, i.e. the time elements in uppercase letters.

All forms have tz = "UTC" (the default).

# **Examples**

```{r warning = TRUE}
x <- c("2014-04-14-04-35-59", "2010-04-01-12-00-00")
y <- ymd_hms(x)
y
parse_date_time(c("13.4.1994 12:00:00", "2000-23-9 03:30:40", "5.16.2008 11:25:30"),
                orders = c("dmy_hms", "ydm_HMS", "mdy_HMS"))
parse_date_time(c("13.4.1994 12:00:00", "2000-23-9 03:30:40", "5.16.2008 11:25:30"),
                orders = c("dmy_HMS", "ydm_HMS", "mdy_HMS"))
```

# Lubridate - One tricky form for incomplete dates

**ymd_hms(..., tz = "UTC", truncated = 0)**

* If the **truncated** parameter is non-zero, the **ymd_hms()** functions check for truncated formats.
* For example, **ymd_hms()** with truncated = 3 will also parse incomplete dates like **2012-06-01 12:23**, **2012-06-01 12** and **2012-06-01**.

# **Examples**

```{r warning = TRUE}
x <- c("2011-12-31 12:59:59", "2010-01-01 12:11", "2010-01-01 12", "2010-11")
ymd_hms(x, truncated = 4)
x1 <- c("2011-12-31", "2010-01-01", "2010-01-01", "2010-11")
ymd(x1, truncated = 1)
ymd_hms(x, truncated = 3)
ymd_hms(x1, truncated = 1)
```

# Lubridate - What can we extract?

Components of a datetime can be extracted by lubridate functions of the same name:

* **year()**,
* **month()**,
* **day()**,
* **hour()**,
* **minute()**,
* **second()**,
* **yday()**: the day of the year,
* **wday()**: the day of the week.

They all work the same way: just pass in a datetime or a vector of datetimes.

There are also a few useful functions that return other aspects of a datetime, such as whether it occurs in the morning (**am()**) or in a **leap_year()**, or which **quarter()** or **semester()** it occurs in; a small sketch follows below.
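A quick sketch of those helpers on a made-up datetime (not part of the following Examples slide):

```{r}
y <- ymd_hms("2016-02-29 08:30:00")
am(y)         # TRUE: 08:30 is before noon
leap_year(y)  # TRUE: 2016 is a leap year
quarter(y)    # 1: February is in the first quarter
```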
# **Examples**

```{r}
x <- mdy_hms("12.23.2018 12:34:45", "2-28-2015 02:34:02", "4-15-2013 11:56:34", "3-5-2011 06:12:00")
month(x)
year(x)
semester(x)
yday(x)
wday(x)
```

# Lubridate - Adding useful labels

* In the previous exercise you found the month and the day of the week of x: **month(x)** and **wday(x)**.
* Both the **month()** and **wday()** (day of the week) functions have two additional arguments:
    1. **label** and
    2. **abbr**
* Set **label = TRUE** to have the output labelled with month (or weekday) names, and **abbr = FALSE** for those names to be written in full rather than abbreviated.

***

# **Examples**

```{r}
x <- mdy_hms("12.23.2018 12:34:45", "2-28-2015 02:34:02", "4-15-2013 11:56:34", "3-5-2011 06:12:00")
x
month(x)
month(x, label = TRUE, abbr = FALSE)
wday(x, label = TRUE, abbr = TRUE) # more readable with %>% table()
wday(x)
```

# Lubridate - Adding useful labels

For **R** the first day of the week is Sunday.

```{r}
library(lubridate)
wday(x, label = TRUE)
wday(x)
```

```{r echo = FALSE}
ymd <- ymd(c("2019-05-06", "2019-05-07", "2019-05-08", "2019-05-09", "2019-05-10"))
weekday <- wday(ymd, label = TRUE)
nr_weekday <- wday(ymd)
revenue_euro <- c(10000, 12000, 12500, 9000, 13000)
bank <- data.frame(ymd, weekday, nr_weekday, revenue_euro)
bank
```

# Lubridate - Adding useful labels

* To change this, use the extra argument: **week_start = getOption("lubridate.week.start", 7)**.
* The number **7** corresponds to Sunday.
* So if you want **R** to treat another day as the first day of the week, change the value as follows:
    * **1 for Monday**,
    * **2 for Tuesday**, and so on (a short sketch of the session-wide option follows below).
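If you do not want to pass **week_start** in every call, the option behind that default can be set once per session. A minimal sketch (our own choice of Monday, with the default restored afterwards):

```{r}
# assumption: you want Monday-first weeks for the whole session
options(lubridate.week.start = 1)
wday(ymd("2019-05-06"))            # 1: this date is a Monday
options(lubridate.week.start = 7)  # restore the Sunday-first default
```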
# **Example**

```{r}
ymd <- ymd(c("2019-05-06", "2019-05-07", "2019-05-08", "2019-05-09", "2019-05-10"))
weekday <- wday(ymd, label = TRUE, week_start = getOption("lubridate.week.start", 1))
weekday
nr_weekday <- wday(ymd, week_start = getOption("lubridate.week.start", 1))
nr_weekday
```

```{r echo = FALSE}
ymd <- ymd(c("2019-05-06", "2019-05-07", "2019-05-08", "2019-05-09", "2019-05-10"))
weekday <- wday(ymd, label = TRUE)
nr_weekday <- wday(ymd, week_start = getOption("lubridate.week.start", 1))
revenue_euro <- c(10000, 12000, 12500, 9000, 13000)
bank <- data.frame(ymd, weekday, nr_weekday, revenue_euro)
bank
```

# Lubridate - Plotting

```{r echo = TRUE, message = FALSE, warning = FALSE}
url <- "https://archive.org/download/BildungAnDeutschenHochschulenAll2/Bildung%20an%20deutschen%20Hochschulen_all2.xlsx"
GET(url, write_disk(tf <- tempfile(fileext = ".xlsx")))
Dataset_Hochschule <- read_excel(tf)

# Original plot
plot_Hochschule <- ggplot(Dataset_Hochschule,
                          aes(x = Year, y = Numbers/1000, color = Citizenship, shape = Status)) +
  geom_line(alpha = 0.4, size = 1) +
  geom_jitter(width = 0.2) +
  facet_wrap(. ~ Gender) +
  ggtitle("Development of German and non-German HEI personnel from 2005-2017 by Gender, Citizenship and occupational status group") +
  scale_y_continuous("Numbers in thousands", seq(0, max(Dataset_Hochschule$Numbers), 10)) +
  scale_x_continuous(limits = c(2005, 2017), breaks = seq(2005, 2017, 2)) +
  theme_bw() +
  theme(panel.spacing = unit(1.5, "lines"))

ggdraw(add_sub(plot_Hochschule,
               "Note: The shape of the data points shows the occupation and the color of the lines shows the nationality of the persons. The category 'Professor' includes full-time professors and junior professors. The category 'Non-Professor' includes full-time scientific and artistic personnel. It consists of docents and assistants, scientific and artistic co-workers and staff for other special tasks.",
               x = 0.6, vjust = 0.5, size = 9, fontface = "bold",
               vpadding = grid::unit(1, "lines")))
```

# Lubridate - Time zones

Time zones represent the same instant in time differently across geographic locations.

* For example, **"2016-08-21 11:53:24"** in the **"UTC"** time zone is identical to **"2016-08-21 19:53:24"** in the **"CST"** (China Standard Time) time zone.

**Lubridate** provides two functions that help with time zones:

* **with_tz()**: shows the same instant in a different timezone.
* **force_tz()**: changes the timezone without changing the clock time.

* Tip 2: use **grep("country_name (or city_name)", OlsonNames(), value = TRUE)** to find a specific time zone.

# **Examples**

```{r}
# to find a specific time zone: "Lisbon"
grep("Lisbon", OlsonNames(), value = TRUE)
# to find the time in Lisbon at a specific instant: for example right now
t <- with_tz(now(), tz = "Europe/Lisbon")
t
# to change the timezone without changing the clock time: in Jamaica
time <- dmy_hms("23.07.2019 23:12:45")
jam <- force_tz(time, tz = "Jamaica")
jam
```

# Lubridate - Rounding

* **round_date()**: rounds a date to the nearest value,
* **floor_date()**: rounds down,
* **ceiling_date()**: rounds up.
* All three take a unit argument which specifies the resolution of rounding.
* You can specify **"second"**, **"minute"**, **"hour"**, **"day"**, **"week"**, **"month"**, **"bimonth"**, **"quarter"**, **"halfyear"**, or **"year"**.
* Or you can specify any multiple of those units, e.g. **"5 years"**, **"3 minutes"** etc.
* **floor_date(x, unit = " ")**,
* **round_date(x, unit = " ")**,
* **ceiling_date(x, unit = " ")**

They also work on whole vectors of datetimes; see the sketch below.
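A small sketch of the vectorised use (made-up timestamps): rounding a whole vector down to the hour is a handy way to bin events.

```{r}
# hypothetical event times, binned to the hour in which they occurred
events <- ymd_hms("2019-06-03 09:14:00") + minutes(c(0, 20, 47, 130))
floor_date(events, unit = "hour")
```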
# **Examples**

```{r}
time <- ymd_hms("2017-06-23 07:13:28 UTC")
wday(time)
floor_date(time, unit = "month")      # round down to month
ceiling_date(time, unit = "week")     # round up to week
round_date(time, unit = "6 minutes")  # round to nearest 6 minutes
round_date(time, unit = "3 years")    # round to nearest 3 years
```

# Lubridate - Taking differences of datetimes

To get finer control over a difference between datetimes we use the base function difftime(). For example, instead of **time1 - time2** we use **difftime(time1, time2)**.

* **difftime()** takes an argument units which specifies the units for the difference. Your options are
    * **"secs"**,
    * **"mins"**,
    * **"hours"**,
    * **"days"**, or
    * **"weeks"**.

# **Examples**

```{r}
date1 <- mdy("July 23, 1985")
date2 <- mdy_hms("Sep 20th , 1990, 02:56:15")
# How many days between date1 and date2?
difftime(date2, date1, units = "days")
# How many weeks between date1 and date2?
difftime(date2, date1, units = "weeks")
```

# Lubridate - Time spans

* **Period**: the human concept of a time span.
    * Datetime + period of one day = same clock time on the next day.
* **Duration**: the stopwatch concept of a time span.
    * Datetime + duration of one day = datetime + 86400 seconds.

# Creating a time span - Periods

years(), months(), weeks(), days(), hours(), minutes(), seconds()

**Examples**:

```{r}
weeks(5)
hours(10)
days(30)
weeks(3)
```

# Creating a time span - Durations

dyears(), dweeks(), ddays(), dhours(), dminutes(), dseconds()

**Examples**

```{r}
ddays(4)
dyears(2)
dweeks(4)
ddays(23)
```

# Lubridate - Time spans - Extra

* Why is there **months()** but no **dmonths()**?
* There is no direct unambiguous value of months in seconds, since months have differing numbers of days:
    * 31 days: January, March, May, July, August, October, December
    * 30 days: April, June, September, November
    * 28 or 29 days: February
* The month is not a duration of time defined independently of when it occurs, but a special interval between two dates.

# **Examples** - Other ways to create periods

```{r}
# Separate period and units vectors
period(c(90, 5), c("second", "minute"))
# Units as arguments
period(second = 3, minute = 1, hour = 2, day = 13, week = 1)
```

# **Examples**

```{r}
days() + minutes()
ymd_hms("2015.2.3 21:23:00") + weeks(3)
dmy("23rd Sep 2014") + dhours(81)
c(1:3) * hours(1)
x <- ymd("2016-10-03")
x + days(2) + hours(8) + minutes(45)
```

# Lubridate - Generating sequences of datetimes

* By combining addition and multiplication with sequences you can generate sequences of datetimes. For example, you can generate a sequence of periods from 1 day up to 10 days with: **1:10 * days(1)**
* Then, by adding this sequence to a specific datetime, you can construct a sequence of datetimes from 1 day up to 10 days into the future.

```{r}
today() + 1:10 * days(1)
```

***

# Lubridate - The tricky thing about months

In general lubridate returns the same day of the month in the next month, but since the 31st of February doesn't exist, lubridate returns a missing value, NA.

* **Example**

```{r}
jan <- ymd_hms("2010-01-31 03:04:05")
jan + months(1:3)
```

* There are alternative addition and subtraction operators, **%m+%** and **%m-%**, that behave differently.
* Rather than returning an NA for a non-existent date, they roll back to the last existing date (a related helper is sketched below).
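As an aside (a sketch): lubridate also exports **add_with_rollback()**, which gives the same rollback behaviour as %m+% in regular function form, plus an option to roll forward to the first day of the next month instead.

```{r}
# same result as ymd("2012-01-31") %m+% months(1)
add_with_rollback(ymd("2012-01-31"), months(1))
# roll_to_first = TRUE rolls the non-existent Feb 31 forward to Mar 1 instead
add_with_rollback(ymd("2012-01-31"), months(1), roll_to_first = TRUE)
```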
# **Examples**

```{r}
jan <- ymd_hms("2010-01-31 03:04:05")
jan + months(1:3)
jan1 <- ymd("2012-01-31")
jan1 %m+% months(1:3)
apr1 <- ymd("2018.04.30")
apr1 - months(1:3)
apr1 %m-% months(1:3)
```

# Lubridate - Roll back date to last day of previous month

* **rollback()** changes a date to the last day of the previous month or to the first day of the month.
* Optionally, the new date can retain the same hour, minute, and second information.

**rollback(dates, roll_to_first = FALSE, preserve_hms = TRUE)**

* **dates**: a POSIXct, POSIXlt or Date class object.
* **roll_to_first**: roll back to the first day of the month instead of the last day of the previous month.
* **preserve_hms**: retains the same hour, minute, and second information. If FALSE, the new date will be at 00:00:00.

# **Examples**

```{r}
date <- ymd("2010-03-03")
dates <- date + months(0:2)
dates
rollback(dates)
date <- ymd_hms("2010-03-03 12:44:22")
rollback(date, roll_to_first = TRUE)
rollback(date, preserve_hms = FALSE)
rollback(date, roll_to_first = TRUE, preserve_hms = FALSE)
```

***

# Lubridate - Comparing intervals and datetimes

* The operator %within% tests if the datetime (or interval) on the left hand side is within the interval on the right hand side.
* For example, if **y2001** is the interval covering the year **2001**:

```{r}
y2001 <- ymd("2001-01-01") %--% ymd("2001-12-31")
x21 <- ymd("2001-02-09") %--% mdy("3.23.2001")
ymd("2002-03-30") %within% y2001
x21 %within% y2001
```

***

# Lubridate - Converting to durations and periods

* Intervals are the most specific way to represent a span of time, since they retain information about the exact start and end moments.
* They can be converted to periods and durations exactly: it is possible to calculate both the exact number of seconds elapsed between the start and end date, and the perceived change in clock time.
* To do so you use the as.period() and as.duration() functions, passing in an interval as the only argument.

# **Examples**

```{r}
int1 <- interval(ymd("2009-01-01"), ymd("2009-08-01"))
as.duration(int1)
as.period(int1)
```

```{r, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

```{r alles, echo=FALSE, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
x <- c("gutenbergr", "tidytext", "ggplot2", "lubridate", "dplyr", "readr",
       "tidyr", "rmarkdown", "stringr", "data.table")
# install any missing packages, then load them all
ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  print(new.pkg)
  if (length(new.pkg) > 0){
    install.packages(new.pkg, dependencies = TRUE,
                     repos = c(CRAN = "http://cran.rstudio.com"))
    sapply(pkg, require, character.only = TRUE)
  }
}
ipak(pkg = x)
lapply(x, require, character.only = TRUE)
load("C:/Users/flone/OneDrive/Desktop/fulldata.RData")
```

# Tidytext

How would you analyze the content of a book or a tweet...?

# Tidytext

__Julia Silge__ and __David Robinson__ developed the package "tidytext", which makes text mining easier and more effective.

# Tidytext Flow:

![](C:/Users/flone/OneDrive/Desktop/Bilder Data Science/Bild 1.png)

# Text Data

To get the text data there are different R packages accessing different sources:

* gutenbergr
* janeaustenr
* twitteR
* instaR

... or you could read in any other PDF or Word documents.

We will be using __gutenbergr__ to download religious books from Buddhism, Christianity, Hinduism, Islam and Judaism.

# Tidy Text

__Hadley Wickham (2014)__ defines tidy data as data that has been cleaned and reshaped in a way that is ready for analysis.

The tidy text format is defined as a *table with one token per row*, whereas a token can be

+ a single word (most common)
+ an n-gram
+ a sentence
+ a paragraph
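Because the workshop data (`x_all`) is loaded from a local .RData file, here is a tiny self-contained sketch of the one-token-per-row idea on made-up text:

```{r, echo = TRUE}
library(dplyr)
library(tidytext)
toy <- tibble(line = 1:2,
              text = c("Text mining is fun", "Tidy data makes it easier"))
toy %>% unnest_tokens(word, text)   # one word per row, line numbers kept
```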
# Where are we now?

Tidytext Flow:

![](C:/Users/flone/OneDrive/Desktop/Bilder Data Science/Bild 2.png)

# Tokenization

By default, one token is one word.

```{r toki, echo = TRUE}
x_tidy <- x_all %>%
  unnest_tokens(word, text)
x_tidy
```

As you can see, we now have one word per row.

# Tokenization

But you can also use two words (a bigram) as a token.

```{r bigram, echo = TRUE}
x_ntoken <- x_all %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
x_ntoken
```

As you can see, we now have two words per row.

# Exercise 1

1. unnest_tokens of x_all
    a) one token = one word
    b) one token = two words
    c) Count the word frequency

# Where are we now?

Tidytext Flow:

![](C:/Users/flone/OneDrive/Desktop/Bilder Data Science/Bild 3.png)

# Stopwords

Stopwords do not really reflect the content but are very common. Here are some examples:

```{r stop, echo=FALSE}
sw <- get_stopwords(language = "en", source = "snowball")
sw$word[1:12]
```

* Those are not useful for the further data analysis,
* ... so we will exclude the stopwords.

Code:

```{r Code_stop, echo=TRUE, message=FALSE, warning=FALSE, include=TRUE, paged.print=FALSE}
x_cleaned <- x_tidy %>%
  anti_join(get_stopwords())
```

# Exercise 2

2. Antijoin the stopwords
    a) and count again: better?

# Where are we now?

Tidytext Flow:

![](C:/Users/flone/OneDrive/Desktop/Bilder Data Science/Bild 4.png)

# Exercise 3

3. Have a look at the different sentiment lexicons: get_sentiments
    a) nrc
    b) bing

# Sentiments

What could be useful for the analysis is the valence of the words:

```{r sent, echo=TRUE}
nr <- get_sentiments("nrc")
nr[1:6, 1:2]
bi <- get_sentiments("bing")
bi[10:16, 1:2]
```

# Exercise 4

4. Join words with sentiments using inner_join

# Clean data

* stopwords are excluded with anti_join,
* str_extract keeps only the alphabetic part of tokens that have other characters attached (e.g. `__that__`),
* sentiments are joined.

```{r ex_stop, echo=TRUE, message=FALSE}
x_cleaned <- x_tidy %>%
  anti_join(get_stopwords()) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%  # keep only the word, e.g. __that__ -> that
  inner_join(get_sentiments("bing"))
```

# Analysis

```{r idmerge, echo= FALSE}
total <- merge(x_cleaned, all_rel, by = c("gutenberg_id"))
```

Now that the data is *tokenized* and *cleaned*:

```{r count, echo= TRUE}
x_cleaned %>%
  count(word, sort = TRUE)
```

# Where are we now?

Tidytext Flow:

![](C:/Users/flone/OneDrive/Desktop/Bilder Data Science/Bild 5.png)

# Group by

Grouping the words by religion gives the proportion of negative vs. positive words per religion (a quick peek at the result follows below).

```{r group, echo= TRUE}
frequency <- total %>%
  count(gutenberg_bookshelf, sentiment) %>%  # number of positive and negative words
  group_by(gutenberg_bookshelf) %>%          # group by religion
  mutate(proportion = n / sum(n))
```
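A quick peek at the grouped proportions before plotting (a sketch, not part of the original exercises):

```{r, echo = TRUE}
# highest sentiment proportions across the religion bookshelves
frequency %>%
  arrange(desc(proportion)) %>%
  head()
```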
# Where are we now?

Tidytext Flow:

![](C:/Users/flone/OneDrive/Desktop/Bilder Data Science/Bild 6.png)

# Grouped Plot

```{r plotby}
# plot proportion of negative vs. positive words per religion
p <- ggplot(frequency, aes(x = sentiment, y = proportion))
p <- p + geom_bar(stat = "identity")
p <- p + facet_wrap(~gutenberg_bookshelf, nrow = 3)
p <- p + theme_classic()
p
```

# Another Plot (not grouped)

```{r ex1, message=FALSE}
# attach the nrc sentiments and count words per sentiment
nrc_word_counts <- x_tidy %>%
  inner_join(get_sentiments("nrc")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

# plot the top words per sentiment in the religious books
nrc_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()
```