Warning: package 'ggplot2' was built under R version 4.2.2
Data Cleaning
Recap on summarization
summary(x)
: quantile informationsummarize
: creates a summary table of columns of interest- combine with
across()
to programmatically select columns
- combine with
count(variable)
: how many of each unique value do you havegroup_by()
: changes all subsequent functions- combine with
summarize()
to get statistics per group
- combine with
plot()
andhist()
are great for a quick snapshot of the data
Recap on data classes
- tibbles show column classes!
as.CLASS_NAME(x)
can be used to change the class of an object x- Logic class objects only have
TRUE
orFalse
(without quotes) - The repeat
rep()
andseq()
functions help you create vectors withto
andfrom
arguments (and others) sample()
makes random vectors. Can be used for integers or double depending on what it is sampling from.- matrix has columns and rows but is all one data class
- lists can contain multiples of any other class of data including lists!
- The
lubridate
package is helpful for dates and times
📃Cheatsheet
Data Cleaning
In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable.
⚠️ MOST IMPORTANT RULE - LOOK 👀 AT YOUR DATA! ⚠️
Dealing with Missing Data
Missing data types
One of the most important aspects of data cleaning is missing values.
Types of “missing” data:
NA
- general missing dataNaN
- stands for “Not a Number”, happens when you do 0/0.Inf
and-Inf
- Infinity, happens when you divide a positive number (or negative number) by 0.
Finding Missing data
is.na
- looks forNAN
andNA
is.nan
- looks forNAN
is.infinite
- looks for Inf or -Inf
<-c(0,NA, -1)
test/0 test
[1] NaN NA -Inf
<-test/0
test is.na(test)
[1] TRUE TRUE FALSE
is.nan(test)
[1] TRUE FALSE FALSE
is.infinite(test)
[1] FALSE FALSE TRUE
Useful checking functions
any
will beTRUE
if ANY are trueany(is.na(x))
- do we have anyNA
’s inx
?
= c(1, 2, 4, NA)
A = c(1, 2, 3, 4)
B any(is.na(A)) # are there any NAs - YES/TRUE
[1] TRUE
any(is.na(B)) # are there any NAs- NO/FALSE
[1] FALSE
Fining NA
values with count()
Check the values for your variables, are they what you expect?
count()
is a great option because it gives you:
- The unique values
- The amount of these values
Check if rare values make sense.
<-jhur::read_bike()
bike
%>% count(subType) bike
# A tibble: 4 × 2
subType n
<chr> <int>
1 STCLN 1
2 STRALY 3
3 STRPRD 1623
4 <NA> 4
naniar
Sometimes you need to look at lots of data though… the naniar
package is a good option.
The pct_complete()
function shows the percentage that is complete for a given data object.
#install.packages("naniar")
library(naniar)
Warning: package 'naniar' was built under R version 4.2.2
= c(0, NA, 2, 3, 4, -0.5, 0.2)
x ::pct_complete(x) naniar
[1] 85.71429
test
[1] NaN NA -Inf
::pct_complete(test) # doesn't count infinite values as missing naniar
[1] 33.33333
Air quality data
The airquality
dataset comes with R about air quality in New York in 1973.
# use this to find out more about the data
?airquality <-tibble(airquality)
airqual airqual
# A tibble: 153 × 6
Ozone Solar.R Wind Temp Month Day
<int> <int> <dbl> <int> <int> <int>
1 41 190 7.4 67 5 1
2 36 118 8 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
# … with 143 more rows
naniar: pct_complete()
pct_complete(airquality)
[1] 95.20697
naniar
plots
The gg_miss_var()
function creates a nice plot about the number of missing values for each variable.
::gg_miss_var(airqual) naniar
Missing Data Issues
Recall that mathematical operations with NA
often result in NA
s.
sum(c(1,2,3,NA))
[1] NA
mean(c(2,4,NA))
[1] NA
median(c(1,2,3,NA))
[1] NA
Missing Data Issues
This is also true for logical data. Recall that TRUE
is evaluated as 1 and FALSE
is evaluated as 0.
= c(TRUE, TRUE, TRUE, TRUE, FALSE, NA)
x sum(x)
[1] NA
sum(x, na.rm = TRUE)
[1] 4
filter() and missing data
Be careful with missing data using subsetting:
filter()
removes missing values by default. Because R can’t tell for sure if an NA
value meets the condition. To keep them need to add is.na()
conditional.
filter() and missing data
df
# A tibble: 6 × 2
Dog Cat
<dbl> <dbl>
1 0 NA
2 NA 8
3 2 6
4 3 NA
5 1 2
6 1 NA
%>% filter(Dog < 3) df
# A tibble: 4 × 2
Dog Cat
<dbl> <dbl>
1 0 NA
2 2 6
3 1 2
4 1 NA
filter() and missing data
%>% filter(Dog < 3 | is.na(Dog)) df
# A tibble: 5 × 2
Dog Cat
<dbl> <dbl>
1 0 NA
2 NA 8
3 2 6
4 1 2
5 1 NA
To remove rows with NA
values for a variable use drop_na()
A function from the tidyr
package.
%>% drop_na(Dog) df
# A tibble: 5 × 2
Dog Cat
<dbl> <dbl>
1 0 NA
2 2 6
3 3 NA
4 1 2
5 1 NA
To remove rows with NA
values for a data frame use drop_na()
This function of the tidyr
package drops rows with any missing data in any column when used on a df.
%>% drop_na() df
# A tibble: 2 × 2
Dog Cat
<dbl> <dbl>
1 2 6
2 1 2
Drop columns with any missing values
Use the miss_var_which()
function from naniar
<-df %>% mutate(test =c(1,2,3,4,5,6))
df df
# A tibble: 6 × 3
Dog Cat test
<dbl> <dbl> <dbl>
1 0 NA 1
2 NA 8 2
3 2 6 3
4 3 NA 4
5 1 2 5
6 1 NA 6
miss_var_which(df) # which columns have missing values
[1] "Dog" "Cat"
Drop columns with any missing values
%>% select(!miss_var_which(df)) df
# A tibble: 6 × 1
test
<dbl>
1 1
2 2
3 3
4 4
5 5
6 6
Removing columns with threshold of percent missing row values
is.na(df) %>% head(n = 3)
Dog Cat test
[1,] FALSE TRUE FALSE
[2,] TRUE FALSE FALSE
[3,] FALSE FALSE FALSE
colMeans(is.na(df))#TRUE and FALSE treated like 0 and 1
Dog Cat test
0.1666667 0.5000000 0.0000000
%>% select(which(colMeans(is.na(df)) < 0.2))# remove if over 20% missing df
# A tibble: 6 × 2
Dog test
<dbl> <dbl>
1 0 1
2 NA 2
3 2 3
4 3 4
5 1 5
6 1 6
Change a value to be NA
The na_if()
function of dplyr
can be helpful for this. Let’s say we think that all 0 values should be NA
.
%>% head(n = 3) df
# A tibble: 3 × 3
Dog Cat test
<dbl> <dbl> <dbl>
1 0 NA 1
2 NA 8 2
3 2 6 3
%>% mutate(Dog = na_if(Dog, 0)) df
# A tibble: 6 × 3
Dog Cat test
<dbl> <dbl> <dbl>
1 NA NA 1
2 NA 8 2
3 2 6 3
4 3 NA 4
5 1 2 5
6 1 NA 6
Think about NA
Sometimes removing NA
values leads to distorted math - be careful! Think about what your NA
means for your data (are you sure ?).
Is an NA
for values so low they could not be reported? Or is it this and also if there was a different issue?
Think about NA
If it is something more like a zero then you might want it included in your data like a zero.
Example: - survey reports NA
if student has never tried cigarettes - survey reports 0 if student has tried cigarettes but did not smoke that week
You might want to keep the NA
values so that you know the original sample size.
Word of caution
Calculating percentages will give you a different result depending on your choice to include NA values.
red_blue
# A tibble: 3 × 2
color col_count
<chr> <int>
1 blue 3
2 red 3
3 <NA> 3
%>% mutate(percent =
red_blue /sum(pull(red_blue, col_count))) col_count
# A tibble: 3 × 3
color col_count percent
<chr> <int> <dbl>
1 blue 3 0.333
2 red 3 0.333
3 <NA> 3 0.333
Word of caution
%>% mutate(percent =
red_blue /sum(pull(drop_na(red_blue), col_count))) col_count
# A tibble: 3 × 3
color col_count percent
<chr> <int> <dbl>
1 blue 3 0.5
2 red 3 0.5
3 <NA> 3 0.5
Should you be dividing by 9 or 6?
It depends on your data and what NA
might mean.
Pay attention to your data and your NA
values!
Summary
is.na()
,any(is.na())
,count()
, and functions fromnaniar
likegg_miss_var()
can help determine if we haveNA
valuesfilter()
automatically removesNA
values - can’t confirm or deny if condition is met (need| is.na()
to keep them)drop_na()
can help you removeNA
values from a variable or an entire data frameNA
values can change your calculation results- think about what
NA
values represent
Lab Part 1
🏠 Class Website
💻Lab
Recoding Variables
Example of Recoding
Say we have some data about samples in a diet study:
data_diet
# A tibble: 12 × 4
Diet Gender Weight_start Weight_change
<chr> <chr> <int> <int>
1 A Male 161 18
2 B m 242 20
3 B Other 141 -10
4 A F 200 11
5 B Female 142 -9
6 B M 152 -4
7 A f 245 12
8 B O 117 7
9 B Man 120 2
10 A f 132 0
11 B F 197 4
12 B O 154 15
Oh dear…
This needs lots of recoding.
%>%
data_diet count(Gender)
# A tibble: 9 × 2
Gender n
<chr> <int>
1 f 2
2 F 2
3 Female 1
4 m 1
5 M 1
6 Male 1
7 Man 1
8 O 2
9 Other 1
dplyr
can help!
Using Excel to find all of the different ways gender
has been coded, could be hectic!
In dplyr
you can use the recode
function (need mutate
here too!):
# General Format - this is not code!
%>%
{data_input} mutate({variable_to_fix} = {Variable_fixing}, {old_value} = {new_value},
= {new_value}) {another_old_value}
%>%
data_diet mutate(Gender = recode(Gender, M = "Male",
m = "Male",
Man = "Male",
O = "Other",
f = "Female",
F = "Female")) %>%
count(Gender, Diet)
recode()
%>%
data_diet mutate(Gender = recode(Gender, M = "Male",
m = "Male",
Man = "Male",
O = "Other",
f = "Female",
F = "Female")) %>%
count(Gender, Diet)
# A tibble: 5 × 3
Gender Diet n
<chr> <chr> <int>
1 Female A 3
2 Female B 2
3 Male A 1
4 Male B 3
5 Other B 3
Or you can use case_when()
The case_when()
function of dplyr
can help us to do this as well.
# General Format - this is not code!
%>%
{data_input} mutate({variable_to_fix} = case_when({Variable_fixing}
/some condition/ ~ {value_for_con})
Note that automatically values not reassigned explicitly by case_when()
will be NA
unless otherwise specified.
%>%
data_diet mutate(Gender = case_when(Gender == "M" ~ "Male"))
# A tibble: 12 × 4
Diet Gender Weight_start Weight_change
<chr> <chr> <int> <int>
1 A <NA> 161 18
2 B <NA> 242 20
3 B <NA> 141 -10
4 A <NA> 200 11
5 B <NA> 142 -9
6 B Male 152 -4
7 A <NA> 245 12
8 B <NA> 117 7
9 B <NA> 120 2
10 A <NA> 132 0
11 B <NA> 197 4
12 B <NA> 154 15
Use of case_when()
without automatic NA
# General Format - this is not code!
%>%
{data_input} mutate({variable_to_fix} = case_when({Variable_fixing}
/some condition/ ~ {value_for_con},
TRUE ~ {value_for_not_meeting_condition})
Here we use the original values of Gender
to replace all values of Gender
that do not meet the condition == "M"
.
%>%
data_diet mutate(Gender = case_when(Gender == "M" ~ "Male",
TRUE ~ Gender))
# A tibble: 12 × 4
Diet Gender Weight_start Weight_change
<chr> <chr> <int> <int>
1 A Male 161 18
2 B m 242 20
3 B Other 141 -10
4 A F 200 11
5 B Female 142 -9
6 B Male 152 -4
7 A f 245 12
8 B O 117 7
9 B Man 120 2
10 A f 132 0
11 B F 197 4
12 B O 154 15
More complicated case_when()
%>%
data_diet mutate(Gender = case_when(
%in% c("M", "male", "Man", "m", "Male") ~ "Male",
Gender %in% c("F", "Female", "f", "female")~ "Female",
Gender %in% c("O", "Other") ~ "Other")) Gender
# A tibble: 12 × 4
Diet Gender Weight_start Weight_change
<chr> <chr> <int> <int>
1 A Male 161 18
2 B Male 242 20
3 B Other 141 -10
4 A Female 200 11
5 B Female 142 -9
6 B Male 152 -4
7 A Female 245 12
8 B Other 117 7
9 B Male 120 2
10 A Female 132 0
11 B Female 197 4
12 B Other 154 15
Another reason for case_when()
case_when
can do very sophisticated comparisons
<-data_diet %>%
data_diet mutate(Effect = case_when(Weight_change > 0 ~ "Increase",
== 0 ~ "Same",
Weight_change < 0 ~ "Decrease"))
Weight_change
head(data_diet)
# A tibble: 6 × 5
Diet Gender Weight_start Weight_change Effect
<chr> <chr> <int> <int> <chr>
1 A Male 161 18 Increase
2 B m 242 20 Increase
3 B Other 141 -10 Decrease
4 A F 200 11 Increase
5 B Female 142 -9 Decrease
6 B M 152 -4 Decrease
# A tibble: 4 × 3
Diet Effect n
<chr> <chr> <int>
1 A Increase 3
2 A Same 1
3 B Decrease 3
4 B Increase 5
Working with strings
Strings in R
- R can do much more than find exact matches for a whole string!
The stringr
package
The stringr
package:
- Modifying or finding part or all of a character string
- We will not cover
grep
orgsub
- base R functions- are used on forums for answers
- Almost all functions start with
str_*
stringr
str_detect
, and str_replace
search for matches to argument pattern within each element of a character vector (not data frame or tibble!).
str_detect
- returnsTRUE
ifpattern
is foundstr_replace
- replacespattern
withreplacement
str_detect()
The string
argument specifies what to check
The pattern
argument specifies what to check for
<-c("cat", "dog", "mouse")
xstr_detect(string = x, pattern = "d")
[1] FALSE TRUE FALSE
str_replace()
The string
argument specifies what to check
The pattern
argument specifies what to check for
The replacement
argument specifies what to replace the pattern with
<-c("cat", "dog", "mouse")
xstr_replace(string = x, pattern = "d", replacement = "D")
[1] "cat" "Dog" "mouse"
Subsetting part of a string
str_sub()
allows you to subset part of a string
The string
argument specifies what strings to work with The start
argument specifies position of where to start
The start
argument specifies position of where to end
<-c("cat", "dog", "mouse")
xstr_sub(string = x, start = 1, end = 2)
[1] "ca" "do" "mo"
filter
and stringr
functions
head(data_diet,n = 4)
# A tibble: 4 × 5
Diet Gender Weight_start Weight_change Effect
<chr> <chr> <int> <int> <chr>
1 A Male 161 18 Increase
2 B m 242 20 Increase
3 B Other 141 -10 Decrease
4 A F 200 11 Increase
%>%
data_diet filter(str_detect(string = Gender,
pattern = "M"))
# A tibble: 3 × 5
Diet Gender Weight_start Weight_change Effect
<chr> <chr> <int> <int> <chr>
1 A Male 161 18 Increase
2 B M 152 -4 Decrease
3 B Man 120 2 Increase
case_when()
improved with stringr
count(data_diet, Gender)
# A tibble: 9 × 2
Gender n
<chr> <int>
1 f 2
2 F 2
3 Female 1
4 m 1
5 M 1
6 Male 1
7 Man 1
8 O 2
9 Other 1
%>%
data_diet mutate(Gender = case_when(
%in% c("M", "male", "Man", "m", "Male") ~ "Male",
Gender %in% c("F", "Female", "f", "female")~ "Female",
Gender %in% c("O", "Other") ~ "Other")) Gender
case_when()
improved with stringr
^
indicates the beginning of a character string $
indicates the end
%>%
data_diet mutate(Gender = case_when(
str_detect(string = Gender, pattern = "^m|^M") ~ "Male",
str_detect(string = Gender, pattern = "^f|^F") ~ "Female",
str_detect(string = Gender, pattern = "^o|^O") ~ "Other")) %>%
count(Gender)
# A tibble: 3 × 2
Gender n
<chr> <int>
1 Female 5
2 Male 4
3 Other 3
Separating and uniting data
What if our data looked like this?
diet_comb
# A tibble: 4 × 2
change n
<chr> <int>
1 A_Increase 3
2 A_Same 1
3 B_Decrease 3
4 B_Increase 5
Separating columns based on a separator
The separate()
function from tidyr
can split a column into multiple columns.
The col
argument specifies what column to work with
The into
argument specifies names of new columns
The sep
argument specifies what to separate by
%>%
diet_comb separate(col = change, into = c("Diet", "Change"), sep = "_" )
# A tibble: 4 × 3
Diet Change n
<chr> <chr> <int>
1 A Increase 3
2 A Same 1
3 B Decrease 3
4 B Increase 5
Uniting columns
The unite()
function can instead help combine columns.
The col
argument specifies new column name
The sep
argument specifies what separator to use when combining
# A tibble: 4 × 2
id visit
<int> <int>
1 1 1
2 2 1
3 3 1
4 4 1
<- df %>% unite(col = "unique_id", id, visit, sep = "_")
df_united head(df_united, 4)
# A tibble: 4 × 1
unique_id
<chr>
1 1_1
2 2_1
3 3_1
4 4_1
Summary
recode()
can help with simple recoding (not based on condition but simple swap)case_when()
can recode entire values based on conditions- remember
case_when()
needsTRUE ~ varaible
to keep values that aren’t specified by conditions, otherwise will beNA
- remember
stringr
package has great functions for looking for specific parts of values especiallyfilter()
andstr_detect()
combined- also has other useful string manipulation functions like
str_replace()
and more! separate()
can split columns into additional columnsunite()
can combine columns
- also has other useful string manipulation functions like
Lab Part 2
🏠 Class Website
💻Lab
Extra Slides
String Splitting
str_split(string, pattern)
- splits strings up - returns list!
library(stringr)
<- c("I really like writing R code")
x = tibble(x = c("I really", "like writing", "R code programs"))
df <- unlist(str_split(x, " "))
y y
[1] "I" "really" "like" "writing" "R" "code"
length(y)
[1] 6
A bit on Regular Expressions
- http://www.regular-expressions.info/reference.html
- They can use to match a large number of strings in one statement
.
matches any single character*
means repeat as many (even if 0) more times the last character?
makes the last thing optional^
matches start of vector^a
- starts with “a”$
matches end of vectorb$
- ends with “b”
Let’s look at modifiers for stringr
?modifiers
fixed
- match everything exactlyignore_case
is an option to not have to usetolower
Using a fixed expression
One example case is when you want to split on a period “.
”. In regular expressions .
means ANY character, so we need to specify that we want R to interpret “.” as simply a period.
str_split("I.like.strings", ".")
[[1]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
str_split("I.like.strings", fixed("."))
[[1]]
[1] "I" "like" "strings"
str_split("I.like.strings", "\\.")
[[1]]
[1] "I" "like" "strings"
Pasting strings with paste
and paste0
Paste can be very useful for joining vectors together:
paste("Visit", 1:5, sep = "_")
[1] "Visit_1" "Visit_2" "Visit_3" "Visit_4" "Visit_5"
paste("Visit", 1:5, sep = "_", collapse = "_")
[1] "Visit_1_Visit_2_Visit_3_Visit_4_Visit_5"
# and paste0 can be even simpler see ?paste0
paste0("Visit",1:5) # no space!
[1] "Visit1" "Visit2" "Visit3" "Visit4" "Visit5"
!– # Before Cleaning - Subsetting with Brackets –>
–>
–> –> –>
Comparison of stringr
to base R - not covered
Splitting Strings
Substringing
stringr
str_split(string, pattern)
- splits strings up - returns list!
Splitting String:
In stringr
, str_split
splits a vector on a string into a list
<- c("I really", "like writing", "R code programs")
x <- stringr::str_split(x, pattern = " ") # returns a list
y y
[[1]]
[1] "I" "really"
[[2]]
[1] "like" "writing"
[[3]]
[1] "R" "code" "programs"
‘Find’ functions: stringr compared to base R
Base R does not use these functions. Here is a “translator” of the stringr
function to base R functions
str_detect
- similar togrepl
(return logical)grep(value = FALSE)
is similar towhich(str_detect())
str_subset
- similar togrep(value = TRUE)
- return value of matchedstr_replace
- similar tosub
- replace one timestr_replace_all
- similar togsub
- replace many times
Important Comparisons
Base R:
- Argument order is
(pattern, x)
- Uses option
(fixed = TRUE)
stringr
- Argument order is
(string, pattern)
aka(x, pattern)
- Uses function
fixed(pattern)
some data to work with
= jhur::read_salaries() # or Sal
Showing difference in str_extract
str_extract
extracts just the matched string
= str_extract(Sal$Name, "Rawling") ss
Warning: Unknown or uninitialised column: `Name`.
head(ss)
character(0)
!is.na(ss)] ss[
character(0)
Showing difference in str_extract
and str_extract_all
str_extract_all
extracts all the matched strings
head(str_extract(Sal$AgencyID, "\\d"))
[1] "0" "2" "6" "9" "4" "9"
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
[[1]]
[1] "0" "3" "0" "3" "1"
[[2]]
[1] "2" "9" "0" "4" "5"
Using Regular Expressions
- Look for any name that starts with:
- Payne at the beginning,
- Leonard and then an S
- Spence then capital C
head(grep("^Payne.*", x = Sal$name, value = TRUE), 3)
[1] "Payne El,Boaz L" "Payne El,Jackie"
[3] "Payne Johnson,Nickole A"
head(grep("Leonard.?S", x = Sal$name, value = TRUE))
[1] "Payne,Leonard S" "Szumlanski,Leonard S"
head(grep("Spence.*C.*", x = Sal$name, value = TRUE))
[1] "Spencer,Charles A" "Spencer,Clarence W" "Spencer,Michael C"
Using Regular Expressions: stringr
head(str_subset( Sal$name, "^Payne.*"), 3)
[1] "Payne El,Boaz L" "Payne El,Jackie"
[3] "Payne Johnson,Nickole A"
head(str_subset( Sal$name, "Leonard.?S"))
[1] "Payne,Leonard S" "Szumlanski,Leonard S"
head(str_subset( Sal$name, "Spence.*C.*"))
[1] "Spencer,Charles A" "Spencer,Clarence W" "Spencer,Michael C"
Replace
Let’s say we wanted to sort the data set by Annual Salary:
class(Sal$AnnualSalary)
[1] "character"
sort(c("1", "2", "10")) # not sort correctly (order simply ranks the data)
[1] "1" "10" "2"
order(c("1", "2", "10"))
[1] 1 3 2
Replace
So we must change the annual pay into a numeric:
head(Sal$AnnualSalary, 4)
[1] "$55314.00" "$74000.00" "$64500.00" "$46309.00"
head(as.numeric(Sal$AnnualSalary), 4)
Warning in head(as.numeric(Sal$AnnualSalary), 4): NAs introduced by coercion
[1] NA NA NA NA
R didn’t like the $
so it thought turned them all to NA
.
sub()
and gsub()
can do the replacing part in base R.
Replacing and subbing
Now we can replace the $
with nothing (used fixed=TRUE
because $
means ending):
$AnnualSalary <- as.numeric(gsub(pattern = "$", replacement="",
Sal$AnnualSalary, fixed=TRUE))
Sal<- Sal[order(Sal$AnnualSalary, decreasing=TRUE), ]
Sal 1:5, c("name", "AnnualSalary", "JobTitle")] Sal[
# A tibble: 5 × 3
name AnnualSalary JobTitle
<chr> <dbl> <chr>
1 Mosby,Marilyn J 238772 STATE'S ATTORNEY
2 Batts,Anthony W 211785 Police Commissioner
3 Wen,Leana 200000 Executive Director III
4 Raymond,Henry J 192500 Executive Director III
5 Swift,Michael 187200 CONTRACT SERV SPEC II
Replacing and subbing: stringr
We can do the same thing (with 2 piping operations!) in dplyr
= Sal
dplyr_sal = dplyr_sal %>% mutate(
dplyr_sal AnnualSalary = AnnualSalary %>%
str_replace(
fixed("$"),
"") %>%
%>%
as.numeric) arrange(desc(AnnualSalary))
= Sal
check_Sal rownames(check_Sal) = NULL
all.equal(check_Sal, dplyr_sal)
[1] TRUE
Website
Extra slides
Creating Two-way Tables
A two-way table. If you pass in 2 vectors, table
creates a 2-dimensional table.
<- table(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
tab c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3),
useNA = "always")
tab
0 1 2 3 4 <NA>
0 1 0 0 0 0 0
1 0 1 0 0 0 0
2 0 0 2 0 2 0
3 0 0 0 4 0 0
<NA> 0 0 0 0 0 0
Creating Two-way Tables
= tibble(x = c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
tab_df y = c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3))
%>% count(x, y) tab_df
# A tibble: 5 × 3
x y n
<dbl> <dbl> <int>
1 0 0 1
2 1 1 1
3 2 2 2
4 2 4 2
5 3 3 4
Creating Two-way Tables
%>%
tab_df count(x, y) %>%
group_by(x) %>% mutate(pct_x = n / sum(n))
# A tibble: 5 × 4
# Groups: x [4]
x y n pct_x
<dbl> <dbl> <int> <dbl>
1 0 0 1 1
2 1 1 1 1
3 2 2 2 0.5
4 2 4 2 0.5
5 3 3 4 1
Creating Two-way Tables
library(scales)
%>%
tab_df count(x, y) %>%
group_by(x) %>% mutate(pct_x = percent(n / sum(n)))
# A tibble: 5 × 4
# Groups: x [4]
x y n pct_x
<dbl> <dbl> <int> <chr>
1 0 0 1 100%
2 1 1 1 100%
3 2 2 2 50%
4 2 4 2 50%
5 3 3 4 100%