Data Cleaning

Author

Affiliation

Jubayer Hossain

Published

December 28, 2022

Warning: package 'ggplot2' was built under R version 4.2.2

Recap on summarization

summary(x): quantile information
summarize: creates a summary table of columns of interest
- combine with across() to programmatically select columns
count(variable): how many of each unique value do you have
group_by(): changes all subsequent functions
- combine with summarize() to get statistics per group
plot() and hist() are great for a quick snapshot of the data

📃Cheatsheet

Recap on data classes

tibbles show column classes!
as.CLASS_NAME(x) can be used to change the class of an object x
Logic class objects only have TRUE or False (without quotes)
The repeat rep() and seq() functions help you create vectors with to and from arguments (and others)
sample() makes random vectors. Can be used for integers or double depending on what it is sampling from.
matrix has columns and rows but is all one data class
lists can contain multiples of any other class of data including lists!
The lubridate package is helpful for dates and times
📃Cheatsheet

Data Cleaning

In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable.

⚠️ MOST IMPORTANT RULE - LOOK 👀 AT YOUR DATA! ⚠️

Dealing with Missing Data

Missing data types

One of the most important aspects of data cleaning is missing values.

Types of “missing” data:

NA - general missing data
NaN - stands for “Not a Number”, happens when you do 0/0.
Inf and -Inf - Infinity, happens when you divide a positive number (or negative number) by 0.

Finding Missing data

is.na - looks for NAN and NA
is.nan- looks for NAN
is.infinite - looks for Inf or -Inf

test<-c(0,NA, -1)
test/0

[1]  NaN   NA -Inf

test <-test/0
is.na(test)

[1]  TRUE  TRUE FALSE

is.nan(test)

[1]  TRUE FALSE FALSE

is.infinite(test)

[1] FALSE FALSE  TRUE

Useful checking functions

any will be TRUE if ANY are true
- any(is.na(x)) - do we have any NA’s in x?

A = c(1, 2, 4, NA)
B = c(1, 2, 3, 4)
any(is.na(A)) # are there any NAs - YES/TRUE

[1] TRUE

any(is.na(B)) # are there any NAs- NO/FALSE

[1] FALSE

Fining `NA` values with `count()`

Check the values for your variables, are they what you expect?

count() is a great option because it gives you:

The unique values
The amount of these values

Check if rare values make sense.

bike <-jhur::read_bike()

bike %>% count(subType)

# A tibble: 4 × 2
  subType     n
  <chr>   <int>
1 STCLN       1
2 STRALY      3
3 STRPRD   1623
4 <NA>        4

`naniar`

Sometimes you need to look at lots of data though… the naniar package is a good option.

The pct_complete() function shows the percentage that is complete for a given data object.

#install.packages("naniar")
library(naniar)

Warning: package 'naniar' was built under R version 4.2.2

x = c(0, NA, 2, 3, 4, -0.5, 0.2)
naniar::pct_complete(x)

[1] 85.71429

test

[1]  NaN   NA -Inf

naniar::pct_complete(test) # doesn't count infinite values as missing

[1] 33.33333

Air quality data

The airquality dataset comes with R about air quality in New York in 1973.

?airquality # use this to find out more about the data
airqual <-tibble(airquality)
airqual

# A tibble: 153 × 6
   Ozone Solar.R  Wind  Temp Month   Day
   <int>   <int> <dbl> <int> <int> <int>
 1    41     190   7.4    67     5     1
 2    36     118   8      72     5     2
 3    12     149  12.6    74     5     3
 4    18     313  11.5    62     5     4
 5    NA      NA  14.3    56     5     5
 6    28      NA  14.9    66     5     6
 7    23     299   8.6    65     5     7
 8    19      99  13.8    59     5     8
 9     8      19  20.1    61     5     9
10    NA     194   8.6    69     5    10
# … with 143 more rows

naniar: `pct_complete()`

pct_complete(airquality)

[1] 95.20697

`naniar` plots

The gg_miss_var() function creates a nice plot about the number of missing values for each variable.

naniar::gg_miss_var(airqual)

Missing Data Issues

Recall that mathematical operations with NA often result in NAs.

sum(c(1,2,3,NA))

[1] NA

mean(c(2,4,NA))

[1] NA

median(c(1,2,3,NA))

[1] NA

Missing Data Issues

This is also true for logical data. Recall that TRUE is evaluated as 1 and FALSE is evaluated as 0.

x = c(TRUE, TRUE, TRUE, TRUE, FALSE, NA)
sum(x)

[1] NA

sum(x, na.rm = TRUE)

[1] 4

filter() and missing data

Be careful with missing data using subsetting:

filter() removes missing values by default. Because R can’t tell for sure if an NA value meets the condition. To keep them need to add is.na() conditional.

filter() and missing data

df

# A tibble: 6 × 2
    Dog   Cat
  <dbl> <dbl>
1     0    NA
2    NA     8
3     2     6
4     3    NA
5     1     2
6     1    NA

df %>% filter(Dog < 3)

# A tibble: 4 × 2
    Dog   Cat
  <dbl> <dbl>
1     0    NA
2     2     6
3     1     2
4     1    NA

filter() and missing data

df %>% filter(Dog < 3 | is.na(Dog))

# A tibble: 5 × 2
    Dog   Cat
  <dbl> <dbl>
1     0    NA
2    NA     8
3     2     6
4     1     2
5     1    NA

To remove rows with `NA` values for a variable use `drop_na()`

A function from the tidyr package.

df %>% drop_na(Dog)

# A tibble: 5 × 2
    Dog   Cat
  <dbl> <dbl>
1     0    NA
2     2     6
3     3    NA
4     1     2
5     1    NA

To remove rows with `NA` values for a data frame use `drop_na()`

This function of the tidyr package drops rows with any missing data in any column when used on a df.

df %>% drop_na()

# A tibble: 2 × 2
    Dog   Cat
  <dbl> <dbl>
1     2     6
2     1     2

Drop columns with any missing values

Use the miss_var_which() function from naniar

df<-df %>% mutate(test =c(1,2,3,4,5,6))
df

# A tibble: 6 × 3
    Dog   Cat  test
  <dbl> <dbl> <dbl>
1     0    NA     1
2    NA     8     2
3     2     6     3
4     3    NA     4
5     1     2     5
6     1    NA     6

miss_var_which(df) # which columns have missing values

[1] "Dog" "Cat"

Drop columns with any missing values

df %>% select(!miss_var_which(df))

# A tibble: 6 × 1
   test
  <dbl>
1     1
2     2
3     3
4     4
5     5
6     6

Removing columns with threshold of percent missing row values

is.na(df) %>% head(n = 3)

       Dog   Cat  test
[1,] FALSE  TRUE FALSE
[2,]  TRUE FALSE FALSE
[3,] FALSE FALSE FALSE

colMeans(is.na(df))#TRUE and FALSE treated like 0 and 1

      Dog       Cat      test 
0.1666667 0.5000000 0.0000000

df %>% select(which(colMeans(is.na(df)) < 0.2))# remove if over 20% missing

# A tibble: 6 × 2
    Dog  test
  <dbl> <dbl>
1     0     1
2    NA     2
3     2     3
4     3     4
5     1     5
6     1     6

Change a value to be `NA`

The na_if() function of dplyr can be helpful for this. Let’s say we think that all 0 values should be NA.

df %>% head(n = 3)

# A tibble: 3 × 3
    Dog   Cat  test
  <dbl> <dbl> <dbl>
1     0    NA     1
2    NA     8     2
3     2     6     3

df %>% mutate(Dog = na_if(Dog, 0))

# A tibble: 6 × 3
    Dog   Cat  test
  <dbl> <dbl> <dbl>
1    NA    NA     1
2    NA     8     2
3     2     6     3
4     3    NA     4
5     1     2     5
6     1    NA     6

Think about `NA`

Sometimes removing NA values leads to distorted math - be careful! Think about what your NA means for your data (are you sure ?).

Is an NA for values so low they could not be reported? Or is it this and also if there was a different issue?

Think about `NA`

If it is something more like a zero then you might want it included in your data like a zero.

Example: - survey reports NA if student has never tried cigarettes - survey reports 0 if student has tried cigarettes but did not smoke that week

You might want to keep the NA values so that you know the original sample size.

Word of caution

Calculating percentages will give you a different result depending on your choice to include NA values.

red_blue

# A tibble: 3 × 2
  color col_count
  <chr>     <int>
1 blue          3
2 red           3
3 <NA>          3

red_blue %>% mutate(percent = 
                      col_count/sum(pull(red_blue, col_count)))

# A tibble: 3 × 3
  color col_count percent
  <chr>     <int>   <dbl>
1 blue          3   0.333
2 red           3   0.333
3 <NA>          3   0.333

Word of caution

red_blue %>% mutate(percent = 
                      col_count/sum(pull(drop_na(red_blue), col_count)))

# A tibble: 3 × 3
  color col_count percent
  <chr>     <int>   <dbl>
1 blue          3     0.5
2 red           3     0.5
3 <NA>          3     0.5

Should you be dividing by 9 or 6?
It depends on your data and what NA might mean.
Pay attention to your data and your NA values!

Summary

is.na(),any(is.na()), count(), and functions from naniar like gg_miss_var() can help determine if we have NA values
filter() automatically removes NA values - can’t confirm or deny if condition is met (need | is.na() to keep them)
drop_na() can help you remove NA values from a variable or an entire data frame
NA values can change your calculation results
think about what NA values represent

Lab Part 1

🏠 Class Website
💻Lab

Recoding Variables

Example of Recoding

Say we have some data about samples in a diet study:

data_diet

# A tibble: 12 × 4
   Diet  Gender Weight_start Weight_change
   <chr> <chr>         <int>         <int>
 1 A     Male            161            18
 2 B     m               242            20
 3 B     Other           141           -10
 4 A     F               200            11
 5 B     Female          142            -9
 6 B     M               152            -4
 7 A     f               245            12
 8 B     O               117             7
 9 B     Man             120             2
10 A     f               132             0
11 B     F               197             4
12 B     O               154            15

Oh dear…

This needs lots of recoding.

data_diet %>%
  count(Gender)

# A tibble: 9 × 2
  Gender     n
  <chr>  <int>
1 f          2
2 F          2
3 Female     1
4 m          1
5 M          1
6 Male       1
7 Man        1
8 O          2
9 Other      1

`dplyr` can help!

Using Excel to find all of the different ways gender has been coded, could be hectic!

In dplyr you can use the recode function (need mutate here too!):

# General Format - this is not code!
{data_input} %>%
  mutate({variable_to_fix} = {Variable_fixing}, {old_value} = {new_value},
                                       {another_old_value} = {new_value})

data_diet %>% 
  mutate(Gender = recode(Gender, M = "Male", 
                                 m = "Male", 
                               Man = "Male",
                                 O = "Other",
                                 f = "Female",
                                 F = "Female")) %>%
  count(Gender, Diet)

`recode()`

data_diet %>% 
  mutate(Gender = recode(Gender, M = "Male", 
                                 m = "Male", 
                               Man = "Male",
                                 O = "Other",
                                 f = "Female",
                                 F = "Female")) %>%
  count(Gender, Diet)

# A tibble: 5 × 3
  Gender Diet      n
  <chr>  <chr> <int>
1 Female A         3
2 Female B         2
3 Male   A         1
4 Male   B         3
5 Other  B         3

Or you can use `case_when()`

The case_when() function of dplyr can help us to do this as well.

# General Format - this is not code!
{data_input} %>%
  mutate({variable_to_fix} = case_when({Variable_fixing}   
                                /some condition/  ~ {value_for_con})

Note that automatically values not reassigned explicitly by case_when() will be NA unless otherwise specified.

data_diet %>% 
  mutate(Gender = case_when(Gender == "M" ~ "Male"))

# A tibble: 12 × 4
   Diet  Gender Weight_start Weight_change
   <chr> <chr>         <int>         <int>
 1 A     <NA>            161            18
 2 B     <NA>            242            20
 3 B     <NA>            141           -10
 4 A     <NA>            200            11
 5 B     <NA>            142            -9
 6 B     Male            152            -4
 7 A     <NA>            245            12
 8 B     <NA>            117             7
 9 B     <NA>            120             2
10 A     <NA>            132             0
11 B     <NA>            197             4
12 B     <NA>            154            15

Use of `case_when()` without automatic `NA`

# General Format - this is not code!
{data_input} %>%
  mutate({variable_to_fix} = case_when({Variable_fixing}   
             /some condition/ ~ {value_for_con},
                         TRUE ~ {value_for_not_meeting_condition})

Here we use the original values of Gender to replace all values of Gender that do not meet the condition == "M".

data_diet %>% 
  mutate(Gender = case_when(Gender == "M" ~ "Male",
                                     TRUE ~ Gender))

# A tibble: 12 × 4
   Diet  Gender Weight_start Weight_change
   <chr> <chr>         <int>         <int>
 1 A     Male            161            18
 2 B     m               242            20
 3 B     Other           141           -10
 4 A     F               200            11
 5 B     Female          142            -9
 6 B     Male            152            -4
 7 A     f               245            12
 8 B     O               117             7
 9 B     Man             120             2
10 A     f               132             0
11 B     F               197             4
12 B     O               154            15

More complicated case_when()

data_diet %>% 
  mutate(Gender = case_when(
    Gender %in% c("M", "male", "Man", "m", "Male") ~ "Male",
    Gender %in% c("F", "Female", "f", "female")~ "Female",
    Gender %in% c("O", "Other") ~ "Other"))

# A tibble: 12 × 4
   Diet  Gender Weight_start Weight_change
   <chr> <chr>         <int>         <int>
 1 A     Male            161            18
 2 B     Male            242            20
 3 B     Other           141           -10
 4 A     Female          200            11
 5 B     Female          142            -9
 6 B     Male            152            -4
 7 A     Female          245            12
 8 B     Other           117             7
 9 B     Male            120             2
10 A     Female          132             0
11 B     Female          197             4
12 B     Other           154            15

Another reason for `case_when()`

case_when can do very sophisticated comparisons

data_diet <-data_diet %>% 
      mutate(Effect = case_when(Weight_change > 0 ~ "Increase",
                                Weight_change == 0 ~ "Same",
                                Weight_change < 0 ~ "Decrease"))

head(data_diet)

# A tibble: 6 × 5
  Diet  Gender Weight_start Weight_change Effect  
  <chr> <chr>         <int>         <int> <chr>   
1 A     Male            161            18 Increase
2 B     m               242            20 Increase
3 B     Other           141           -10 Decrease
4 A     F               200            11 Increase
5 B     Female          142            -9 Decrease
6 B     M               152            -4 Decrease

# A tibble: 4 × 3
  Diet  Effect       n
  <chr> <chr>    <int>
1 A     Increase     3
2 A     Same         1
3 B     Decrease     3
4 B     Increase     5

Working with strings

Strings in R

R can do much more than find exact matches for a whole string!

The `stringr` package

The stringr package:

Modifying or finding part or all of a character string
We will not cover grep or gsub - base R functions
- are used on forums for answers
Almost all functions start with str_*

`stringr`

str_detect, and str_replace search for matches to argument pattern within each element of a character vector (not data frame or tibble!).

str_detect - returns TRUE if pattern is found
str_replace - replaces pattern with replacement

`str_detect()`

The string argument specifies what to check
The pattern argument specifies what to check for

x<-c("cat", "dog", "mouse")
str_detect(string = x, pattern = "d")

[1] FALSE  TRUE FALSE

`str_replace()`

The string argument specifies what to check
The pattern argument specifies what to check for
The replacement argument specifies what to replace the pattern with

x<-c("cat", "dog", "mouse")
str_replace(string = x, pattern = "d", replacement = "D")

[1] "cat"   "Dog"   "mouse"

Subsetting part of a string

str_sub() allows you to subset part of a string
The string argument specifies what strings to work with The start argument specifies position of where to start
The start argument specifies position of where to end

x<-c("cat", "dog", "mouse")
str_sub(string = x, start = 1, end = 2)

[1] "ca" "do" "mo"

`filter` and `stringr` functions

head(data_diet,n = 4)

# A tibble: 4 × 5
  Diet  Gender Weight_start Weight_change Effect  
  <chr> <chr>         <int>         <int> <chr>   
1 A     Male            161            18 Increase
2 B     m               242            20 Increase
3 B     Other           141           -10 Decrease
4 A     F               200            11 Increase

data_diet %>% 
  filter(str_detect(string = Gender,
                    pattern = "M"))

# A tibble: 3 × 5
  Diet  Gender Weight_start Weight_change Effect  
  <chr> <chr>         <int>         <int> <chr>   
1 A     Male            161            18 Increase
2 B     M               152            -4 Decrease
3 B     Man             120             2 Increase

`case_when()` improved with `stringr`

count(data_diet, Gender)

# A tibble: 9 × 2
  Gender     n
  <chr>  <int>
1 f          2
2 F          2
3 Female     1
4 m          1
5 M          1
6 Male       1
7 Man        1
8 O          2
9 Other      1

data_diet %>% 
  mutate(Gender = case_when(
    Gender %in% c("M", "male", "Man", "m", "Male") ~ "Male",
    Gender %in% c("F", "Female", "f", "female")~ "Female",
    Gender %in% c("O", "Other") ~ "Other"))

`case_when()` improved with `stringr`

^ indicates the beginning of a character string $ indicates the end

data_diet %>% 
  mutate(Gender = case_when(
    str_detect(string = Gender, pattern = "^m|^M") ~ "Male",
    str_detect(string = Gender, pattern = "^f|^F") ~ "Female",
    str_detect(string = Gender, pattern = "^o|^O") ~ "Other")) %>%
  count(Gender)

# A tibble: 3 × 2
  Gender     n
  <chr>  <int>
1 Female     5
2 Male       4
3 Other      3

Separating and uniting data

What if our data looked like this?

diet_comb

# A tibble: 4 × 2
  change         n
  <chr>      <int>
1 A_Increase     3
2 A_Same         1
3 B_Decrease     3
4 B_Increase     5

Separating columns based on a separator

The separate() function from tidyr can split a column into multiple columns.
The col argument specifies what column to work with
The into argument specifies names of new columns
The sep argument specifies what to separate by

diet_comb %>% 
  separate(col = change, into = c("Diet", "Change"), sep = "_" )

# A tibble: 4 × 3
  Diet  Change       n
  <chr> <chr>    <int>
1 A     Increase     3
2 A     Same         1
3 B     Decrease     3
4 B     Increase     5

Uniting columns

The unite() function can instead help combine columns.
The col argument specifies new column name
The sep argument specifies what separator to use when combining

# A tibble: 4 × 2
     id visit
  <int> <int>
1     1     1
2     2     1
3     3     1
4     4     1

df_united <- df %>% unite(col = "unique_id", id, visit, sep = "_")
head(df_united, 4)

# A tibble: 4 × 1
  unique_id
  <chr>    
1 1_1      
2 2_1      
3 3_1      
4 4_1

Summary

recode() can help with simple recoding (not based on condition but simple swap)
case_when() can recode entire values based on conditions
- remember case_when() needs TRUE ~ varaible to keep values that aren’t specified by conditions, otherwise will be NA
stringr package has great functions for looking for specific parts of values especially filter() and str_detect() combined
- also has other useful string manipulation functions like str_replace() and more!
- separate() can split columns into additional columns
- unite() can combine columns

Lab Part 2

🏠 Class Website
💻Lab

Extra Slides

String Splitting

str_split(string, pattern) - splits strings up - returns list!

library(stringr)
x <- c("I really like writing R code")
df = tibble(x = c("I really", "like writing", "R code programs"))
y <- unlist(str_split(x, " "))
y

[1] "I"       "really"  "like"    "writing" "R"       "code"

length(y)

[1] 6

A bit on Regular Expressions

http://www.regular-expressions.info/reference.html
They can use to match a large number of strings in one statement
. matches any single character
* means repeat as many (even if 0) more times the last character
? makes the last thing optional
^ matches start of vector ^a - starts with “a”
$ matches end of vector b$ - ends with “b”

Let’s look at modifiers for `stringr`

?modifiers

fixed - match everything exactly
ignore_case is an option to not have to use tolower

Using a fixed expression

One example case is when you want to split on a period “.”. In regular expressions . means ANY character, so we need to specify that we want R to interpret “.” as simply a period.

str_split("I.like.strings", ".")

[[1]]
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

str_split("I.like.strings", fixed("."))

[[1]]
[1] "I"       "like"    "strings"

str_split("I.like.strings", "\\.")

[[1]]
[1] "I"       "like"    "strings"

Pasting strings with `paste` and `paste0`

Paste can be very useful for joining vectors together:

paste("Visit", 1:5, sep = "_")

[1] "Visit_1" "Visit_2" "Visit_3" "Visit_4" "Visit_5"

paste("Visit", 1:5, sep = "_", collapse = "_")

[1] "Visit_1_Visit_2_Visit_3_Visit_4_Visit_5"

# and paste0 can be even simpler see ?paste0 
paste0("Visit",1:5) # no space!

[1] "Visit1" "Visit2" "Visit3" "Visit4" "Visit5"

!– # Before Cleaning - Subsetting with Brackets –>

–>

–> –> –>

Comparison of `stringr` to base R - not covered

Splitting Strings

Substringing

stringr

str_split(string, pattern) - splits strings up - returns list!

Splitting String:

In stringr, str_split splits a vector on a string into a list

x <- c("I really", "like writing", "R code programs")
y <- stringr::str_split(x, pattern =  " ") # returns a list
y

[[1]]
[1] "I"      "really"

[[2]]
[1] "like"    "writing"

[[3]]
[1] "R"        "code"     "programs"

‘Find’ functions: stringr compared to base R

Base R does not use these functions. Here is a “translator” of the stringr function to base R functions

str_detect - similar to grepl (return logical)
grep(value = FALSE) is similar to which(str_detect())
str_subset - similar to grep(value = TRUE) - return value of matched
str_replace - similar to sub - replace one time
str_replace_all - similar to gsub - replace many times

Important Comparisons

Base R:

Argument order is (pattern, x)
Uses option (fixed = TRUE)

stringr

Argument order is (string, pattern) aka (x, pattern)
Uses function fixed(pattern)

some data to work with

Sal = jhur::read_salaries() # or

Showing difference in `str_extract`

str_extract extracts just the matched string

ss = str_extract(Sal$Name, "Rawling")

Warning: Unknown or uninitialised column: `Name`.

head(ss)

character(0)

ss[ !is.na(ss)]

character(0)

Showing difference in `str_extract` and `str_extract_all`

str_extract_all extracts all the matched strings

head(str_extract(Sal$AgencyID, "\\d"))

[1] "0" "2" "6" "9" "4" "9"

head(str_extract_all(Sal$AgencyID, "\\d"), 2)

[[1]]
[1] "0" "3" "0" "3" "1"

[[2]]
[1] "2" "9" "0" "4" "5"

Using Regular Expressions

Look for any name that starts with:
- Payne at the beginning,
- Leonard and then an S
- Spence then capital C

head(grep("^Payne.*", x = Sal$name, value = TRUE), 3)

[1] "Payne El,Boaz L"         "Payne El,Jackie"        
[3] "Payne Johnson,Nickole A"

head(grep("Leonard.?S", x = Sal$name, value = TRUE))

[1] "Payne,Leonard S"      "Szumlanski,Leonard S"

head(grep("Spence.*C.*", x = Sal$name, value = TRUE))

[1] "Spencer,Charles A"  "Spencer,Clarence W" "Spencer,Michael C"

Using Regular Expressions: `stringr`

head(str_subset( Sal$name, "^Payne.*"), 3)

[1] "Payne El,Boaz L"         "Payne El,Jackie"        
[3] "Payne Johnson,Nickole A"

head(str_subset( Sal$name, "Leonard.?S"))

[1] "Payne,Leonard S"      "Szumlanski,Leonard S"

head(str_subset( Sal$name, "Spence.*C.*"))

[1] "Spencer,Charles A"  "Spencer,Clarence W" "Spencer,Michael C"

Replace

Let’s say we wanted to sort the data set by Annual Salary:

class(Sal$AnnualSalary)

[1] "character"

sort(c("1", "2", "10")) #  not sort correctly (order simply ranks the data)

[1] "1"  "10" "2"

order(c("1", "2", "10"))

[1] 1 3 2

Replace

So we must change the annual pay into a numeric:

head(Sal$AnnualSalary, 4)

[1] "$55314.00" "$74000.00" "$64500.00" "$46309.00"

head(as.numeric(Sal$AnnualSalary), 4)

Warning in head(as.numeric(Sal$AnnualSalary), 4): NAs introduced by coercion

[1] NA NA NA NA

R didn’t like the $ so it thought turned them all to NA.

sub() and gsub() can do the replacing part in base R.

Replacing and subbing

Now we can replace the $ with nothing (used fixed=TRUE because $ means ending):

Sal$AnnualSalary <- as.numeric(gsub(pattern = "$", replacement="", 
                              Sal$AnnualSalary, fixed=TRUE))
Sal <- Sal[order(Sal$AnnualSalary, decreasing=TRUE), ] 
Sal[1:5, c("name", "AnnualSalary", "JobTitle")]

# A tibble: 5 × 3
  name            AnnualSalary JobTitle              
  <chr>                  <dbl> <chr>                 
1 Mosby,Marilyn J       238772 STATE'S ATTORNEY      
2 Batts,Anthony W       211785 Police Commissioner   
3 Wen,Leana             200000 Executive Director III
4 Raymond,Henry J       192500 Executive Director III
5 Swift,Michael         187200 CONTRACT SERV SPEC II

Replacing and subbing: `stringr`

We can do the same thing (with 2 piping operations!) in dplyr

dplyr_sal = Sal
dplyr_sal = dplyr_sal %>% mutate( 
  AnnualSalary = AnnualSalary %>%
    str_replace(
      fixed("$"), 
      "") %>%
    as.numeric) %>%
  arrange(desc(AnnualSalary))
check_Sal = Sal
rownames(check_Sal) = NULL
all.equal(check_Sal, dplyr_sal)

[1] TRUE

Website

Extra slides

Creating Two-way Tables

A two-way table. If you pass in 2 vectors, table creates a 2-dimensional table.

tab <- table(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
             c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3),
              useNA = "always")
tab

      
       0 1 2 3 4 <NA>
  0    1 0 0 0 0    0
  1    0 1 0 0 0    0
  2    0 0 2 0 2    0
  3    0 0 0 4 0    0
  <NA> 0 0 0 0 0    0

Creating Two-way Tables

tab_df = tibble(x = c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
             y = c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3))
tab_df %>% count(x, y)

# A tibble: 5 × 3
      x     y     n
  <dbl> <dbl> <int>
1     0     0     1
2     1     1     1
3     2     2     2
4     2     4     2
5     3     3     4

Creating Two-way Tables

tab_df %>% 
  count(x, y) %>% 
  group_by(x) %>% mutate(pct_x = n / sum(n))

# A tibble: 5 × 4
# Groups:   x [4]
      x     y     n pct_x
  <dbl> <dbl> <int> <dbl>
1     0     0     1   1  
2     1     1     1   1  
3     2     2     2   0.5
4     2     4     2   0.5
5     3     3     4   1

Creating Two-way Tables

library(scales)
tab_df %>% 
  count(x, y) %>% 
  group_by(x) %>% mutate(pct_x = percent(n / sum(n)))

# A tibble: 5 × 4
# Groups:   x [4]
      x     y     n pct_x
  <dbl> <dbl> <int> <chr>
1     0     0     1 100% 
2     1     1     1 100% 
3     2     2     2 50%  
4     2     4     2 50%  
5     3     3     4 100%

Recap on summarization

Recap on data classes

Data Cleaning

Dealing with Missing Data

Missing data types

Finding Missing data

Useful checking functions

Fining NA values with count()

naniar

Air quality data

naniar: pct_complete()

naniar plots

Missing Data Issues

Missing Data Issues

filter() and missing data

filter() and missing data

filter() and missing data

To remove rows with NA values for a variable use drop_na()

To remove rows with NA values for a data frame use drop_na()

Drop columns with any missing values

Drop columns with any missing values

Removing columns with threshold of percent missing row values

Change a value to be NA

Think about NA

Think about NA

Word of caution

Word of caution

Summary

Lab Part 1

Recoding Variables

Example of Recoding

Oh dear…

dplyr can help!

recode()

Or you can use case_when()

Use of case_when() without automatic NA

More complicated case_when()

Another reason for case_when()

Working with strings

Strings in R

The stringr package

stringr

str_detect()

str_replace()

Subsetting part of a string

filter and stringr functions

case_when() improved with stringr

case_when() improved with stringr

Separating and uniting data

What if our data looked like this?

Separating columns based on a separator

Uniting columns

Summary

Lab Part 2

Extra Slides

String Splitting

A bit on Regular Expressions

Let’s look at modifiers for stringr

Using a fixed expression

Pasting strings with paste and paste0

Comparison of stringr to base R - not covered

Splitting Strings

Substringing

Splitting String:

‘Find’ functions: stringr compared to base R

Important Comparisons

some data to work with

Showing difference in str_extract

Showing difference in str_extract and str_extract_all

Using Regular Expressions

Using Regular Expressions: stringr

Replace

Replace

Replacing and subbing

Replacing and subbing: stringr

Website

Extra slides

Creating Two-way Tables

Creating Two-way Tables

Creating Two-way Tables

Fining `NA` values with `count()`

`naniar`

naniar: `pct_complete()`

`naniar` plots

To remove rows with `NA` values for a variable use `drop_na()`

To remove rows with `NA` values for a data frame use `drop_na()`

Change a value to be `NA`

Think about `NA`

Think about `NA`

`dplyr` can help!

`recode()`

Or you can use `case_when()`

Use of `case_when()` without automatic `NA`

Another reason for `case_when()`

The `stringr` package

`stringr`

`str_detect()`

`str_replace()`

`filter` and `stringr` functions

`case_when()` improved with `stringr`

`case_when()` improved with `stringr`

Let’s look at modifiers for `stringr`

Pasting strings with `paste` and `paste0`

Comparison of `stringr` to base R - not covered

Showing difference in `str_extract`

Showing difference in `str_extract` and `str_extract_all`

Using Regular Expressions: `stringr`

Replacing and subbing: `stringr`