The stringr package provides an easy-to-use toolkit for working with strings, i.e. character data, in R.
There are four main families of functions in stringr:
Character manipulation: these functions allow you to manipulate individual characters within the strings in character vectors.
Whitespace tools to add, remove, and manipulate whitespace.
Locale-sensitive operations, whose behaviour varies from locale to locale.
Pattern matching functions. These recognise four engines of pattern description. The most common is regular expressions, but there are three other tools.
Getting and setting individual characters
You can get the length of a string with str_length().
This is now equivalent to the base R function nchar(). Previously it was needed to work around issues with nchar(), such as the fact that it returned 2 for nchar(NA). This has been fixed as of R 3.3.0, so it is no longer so important.
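As a small illustration (the input vector is made up for this example), str_length() counts characters element by element and passes missing values through:

```r
library(stringr)

# Illustrative input; the last element is a missing value
x <- c("a", "stringr", NA)

lens <- str_length(x)
lens
#> [1]  1  7 NA

# For non-missing strings the result agrees with base R's nchar()
same_as_nchar <- identical(str_length("stringr"), nchar("stringr"))
same_as_nchar
#> [1] TRUE
```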
You can access individual characters using str_sub(). It takes three arguments: a character vector, a start position and an end position. Either position can be a positive integer, which counts from the left, or a negative integer, which counts from the right. The positions are inclusive and, if longer than the string, will be silently truncated.
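A short sketch of how the start and end positions behave; the vector x is an arbitrary example:

```r
library(stringr)

x <- c("abcdef", "ghifjk")

# The third character of each string
third <- str_sub(x, 3, 3)
third
#> [1] "c" "i"

# From the second character to the second-to-last character
inner <- str_sub(x, 2, -2)
inner
#> [1] "bcde" "hifj"

# An end position past the end of the string is silently truncated
whole <- str_sub(x, 1, 100)
whole
#> [1] "abcdef" "ghifjk"
```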
You can also use str_sub() to modify strings.
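Modification uses the replacement form of str_sub(), assigning into the selected positions. A minimal sketch with an arbitrary example vector:

```r
library(stringr)

x <- c("abcdef", "ghifjk")

# Overwrite the third character of every string in place
str_sub(x, 3, 3) <- "X"
x
#> [1] "abXdef" "ghXfjk"
```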
To duplicate individual strings, you can use str_dup().
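For instance, with made-up inputs, str_dup() repeats each string the requested number of times, and the count is vectorised alongside the strings:

```r
library(stringr)

x <- c("ab", "xy")

# Repeat the first string twice and the second three times
dup <- str_dup(x, c(2, 3))
dup
#> [1] "abab"   "xyxyxy"
```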
Three functions add, remove, or modify whitespace:
str_pad() pads a string to a fixed length by adding extra whitespace on the left, right, or both sides.
(You can pad with other characters by using the pad argument.)
str_pad() will never make a string shorter.
So if you want to ensure that all strings are the same length (often useful for print methods), combine str_pad() and str_trunc().
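The following sketch walks through those points in order. The example strings and the width of 10 are arbitrary choices for illustration:

```r
library(stringr)

x <- c("abc", "defghi")

# Pad on the left (the default) to a width of 10
left <- str_pad(x, 10)
left
#> [1] "       abc" "    defghi"

# Pad with a character other than a space via the pad argument
dashed <- str_pad("abc", 5, pad = "-")
dashed
#> [1] "--abc"

# A width shorter than the string leaves the string untouched
unchanged <- str_pad("defghi", 4)
unchanged
#> [1] "defghi"

# Truncate first, then pad, so that every result has the same length
y <- c("Short", "This is a long string")
fixed_width <- str_pad(str_trunc(y, 10), 10, side = "right")
str_length(fixed_width)
#> [1] 10 10
```

Truncating before padding matters here: padding alone would leave the longer string at its full length.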
The opposite of str_pad() is str_trim(), which removes leading and trailing whitespace.
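A brief example with made-up input, showing trimming on both sides (the default) and on one side only:

```r
library(stringr)

x <- c("  a   ", "b   ", "   c")

# Remove whitespace from both ends
both <- str_trim(x)
both
#> [1] "a" "b" "c"

# Remove leading whitespace only
left_only <- str_trim(x, side = "left")
left_only
#> [1] "a   " "b   " "c"
```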
You can use str_wrap() to modify existing whitespace in order to wrap a paragraph of text, such that the length of each line is as similar as possible.
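As a rough sketch, the paragraph below (invented for this example) is wrapped to a target width of 40 characters; the width is an arbitrary choice:

```r
library(stringr)

# Illustrative paragraph with single spaces between words
text <- "stringr provides a cohesive set of functions designed to make working with strings as easy as possible, including tools for wrapping long paragraphs."

# Insert line breaks so that lines aim for a width of 40 characters
wrapped <- str_wrap(text, width = 40)
cat(wrapped, "\n")

# Inspect the length of each resulting line
line_lengths <- str_length(str_split(wrapped, "\n")[[1]])
line_lengths
```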
A handful of stringr functions are locale-sensitive: they will perform differently in different regions of the world. These are the case transformation functions (str_to_upper(), str_to_lower() and str_to_title()) and the string ordering and sorting functions (str_order() and str_sort()).
The locale always defaults to English to ensure that the default behaviour is identical across systems. Locales always include a two-letter ISO-639-1 language code (like “en” for English or “zh” for Chinese), and optionally an ISO-3166 country code (like “en_GB” vs “en_US”). You can see a complete list of available locales by running stringi::stri_locale_list().
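To make the locale dependence concrete, here is a small sketch. The example strings are arbitrary, and the Turkish ("tr") and Lithuanian ("lt") locales are used only as illustrations of languages whose casing and ordering rules differ from English:

```r
library(stringr)

x <- "I like horses."

# Case transformation in the default (English) locale
upper <- str_to_upper(x)
upper
#> [1] "I LIKE HORSES."
title <- str_to_title(x)
title
#> [1] "I Like Horses."

# Turkish has dotted and dotless forms of i, so lowercasing "I" differs
str_to_lower(x, locale = "tr")

# Ordering and sorting in the default locale
y <- c("y", "i", "k")
ord <- str_order(y)
ord
#> [1] 2 3 1
sorted <- str_sort(y)
sorted
#> [1] "i" "k" "y"

# In Lithuanian, y is sorted between i and k
str_sort(y, locale = "lt")
```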
The vast majority of stringr functions work with patterns. These are parameterised by the task they perform and the types of patterns they match.
Each pattern matching function has the same first two arguments, a character vector of strings to process and a single pattern to match. stringr provides pattern matching functions to detect, locate, extract, match, replace, and split strings. I’ll illustrate how they work with some strings and a regular expression designed to match (US) phone numbers.
str_detect() detects the presence or absence of a pattern and returns a logical vector (similar to grepl()).
str_subset() returns the elements of a character vector that match a regular expression (similar to grep() with value = TRUE).
str_count() counts the number of matches.
str_locate() locates the first position of a pattern and returns a numeric matrix with columns start and end.
str_locate_all() locates all matches, returning a list of numeric matrices. Similar to regexpr() and gregexpr().
str_extract() extracts text corresponding to the first match, returning a character vector.
str_extract_all() extracts all matches and returns a list of character vectors.
str_match() extracts capture groups formed by () from the first match. It returns a character matrix with one column for the complete match and one column for each group.
str_match_all() extracts capture groups from all matches and returns a list of character matrices. Similar to regmatches().
str_replace() replaces the first matched pattern and returns a character vector.
str_replace_all() replaces all matches. Similar to sub() and gsub().
str_split_fixed() splits a string into a fixed number of pieces based on a pattern and returns a character matrix.
str_split() splits a string into a variable number of pieces and returns a list of character vectors.
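The functions above can be tried out with a sketch like the following. The sample strings, the phone-number regular expression and the replacement text are illustrative choices, not the only way to write them:

```r
library(stringr)

# Illustrative data: one non-match, two single matches, one double match
strings <- c(
  "apple",
  "219 733 8965",
  "329-293-8753",
  "Work: 579-499-7527; Home: 543.355.3679"
)
# Three capture groups: area code, exchange, line number
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

# Detect, subset and count
detected <- str_detect(strings, phone)
detected
#> [1] FALSE  TRUE  TRUE  TRUE
str_subset(strings, phone)
counts <- str_count(strings, phone)
counts
#> [1] 0 1 1 2

# Locate: start/end matrix for the first match, list of matrices for all
str_locate(strings, phone)
str_locate_all(strings, phone)

# Extract the first match, or every match
first <- str_extract(strings, phone)
first
#> [1] NA             "219 733 8965" "329-293-8753" "579-499-7527"
str_extract_all(strings, phone)

# Capture groups: column 1 is the full match, then one column per group
groups <- str_match(strings, phone)
groups[2, ]
#> [1] "219 733 8965" "219"          "733"          "8965"

# Replace the first match, or every match
str_replace(strings, phone, "XXX-XXX-XXXX")
masked <- str_replace_all(strings, phone, "XXX-XXX-XXXX")
masked[4]
#> [1] "Work: XXX-XXX-XXXX; Home: XXX-XXX-XXXX"

# Split into a variable or a fixed number of pieces
pieces <- str_split("a-b-c", "-")
pieces_fixed <- str_split_fixed("a-b-c", "-", n = 2)
pieces_fixed
```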
There are four main engines that stringr can use to describe patterns:
Regular expressions, the default, as shown above, and described in vignette("regular-expressions").
Fixed bytewise matching, with fixed().
Locale-sensitive character matching, with coll().
Text boundary analysis with boundary().
fixed(x) only matches the exact sequence of bytes specified by x. This is a very limited “pattern”, but the restriction can make matching much faster. Beware using fixed() with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define “á”: either as a single character or as an “a” plus an accent.
They render identically, but because they’re defined differently, fixed() doesn’t find a match. Instead, you can use coll(), explained below, to respect human character comparison rules.
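The two representations of “á” can be written with Unicode escapes, which makes the difference visible in code. The variable names a1 and a2 are just labels for this sketch:

```r
library(stringr)

a1 <- "\u00e1"   # single precomposed character
a2 <- "a\u0301"  # "a" followed by a combining acute accent

# Both print the same way, but they are different byte sequences
c(a1, a2)
same <- a1 == a2
same
#> [1] FALSE

# Bytewise matching does not see them as equal
with_fixed <- str_detect(a1, fixed(a2))
with_fixed
#> [1] FALSE

# Collation-based matching does
with_coll <- str_detect(a1, coll(a2))
with_coll
#> [1] TRUE
```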
coll(x) looks for a match to x using human-language collation rules, and is particularly important if you want to do case-insensitive matching. Collation rules differ around the world, so you’ll also need to supply a locale parameter.
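A sketch of case-insensitive matching under two locales. The vector holds the four Latin and Turkish forms of the letter i, written with Unicode escapes where needed; Turkish ("tr") is used as an example of a locale with different casing rules:

```r
library(stringr)

# "I", capital dotted I, "i", lowercase dotless i
i <- c("I", "\u0130", "i", "\u0131")

# Default (English) rules: only the plain I and i match
english <- str_subset(i, coll("i", ignore_case = TRUE))
english
#> [1] "I" "i"

# Under Turkish rules the set of letters treated as matching "i" changes
turkish <- str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
turkish
```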
The downside of coll() is speed. Because the rules for recognising which characters are the same are complicated, coll() is relatively slow compared to regex() and fixed(). Note that while both fixed() and regex() have ignore_case arguments, they perform a much simpler comparison than coll().
boundary() matches boundaries between characters, lines, sentences or words. It’s most useful with str_split(), but can be used with all pattern matching functions.
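For example, using an arbitrary sentence, boundary("word") can drive splitting, counting, locating and extracting:

```r
library(stringr)

x <- "This is a sentence."

# Split on word boundaries; punctuation and spaces are dropped
words <- str_split(x, boundary("word"))
words[[1]]
#> [1] "This"     "is"       "a"        "sentence"

# Count the words
n_words <- str_count(x, boundary("word"))
n_words
#> [1] 4

# The same modifier works with the other pattern matching functions
str_locate_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
```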
By convention, "" is treated as boundary("character").
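A quick check of the empty-pattern behaviour, using an arbitrary ASCII example string: splitting on "" gives the same pieces as splitting on character boundaries.

```r
library(stringr)

x <- "This is a sentence."

# Split with an empty pattern and with an explicit character boundary
by_empty <- str_split(x, "")[[1]]
by_char <- str_split(x, boundary("character"))[[1]]

head(by_empty, 4)
#> [1] "T" "h" "i" "s"

identical(by_empty, by_char)
#> [1] TRUE
```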