5. Text Processing

5.1. Strings or Character Vectors

Forming a character vector:

> 'abc'
[1] "abc"
> "abc"
[1] "abc"

A vector of character vectors:

> c("abc", "xyz")
[1] "abc" "xyz"

Pasting

paste takes multiple character vectors as input, combines them element by element using a separator, and collapses the resulting character vector if required.

Combining two character vectors:

> paste(c('a', 'b', 'c'), c('x', 'y', 'z'))
[1] "a x" "b y" "c z"

The default separator is blank space. Changing the separator:

> paste(c('a', 'b', 'c'), c('x', 'y', 'z'), sep='')
[1] "ax" "by" "cz"
> paste(c('a', 'b', 'c'), c('x', 'y', 'z'), sep='-')
[1] "a-x" "b-y" "c-z"
>

By default the result is another character vector. It is possible to collapse the result into a single string:

> paste(c('a', 'b', 'c'), c('x', 'y', 'z'), sep='', collapse=' ')
[1] "ax by cz"
> paste(c('a', 'b', 'c'), c('x', 'y', 'z'), sep='', collapse='')
[1] "axbycz"
> paste(c('a', 'b', 'c'), c('x', 'y', 'z'), sep='', collapse='o')
[1] "axobyocz"
> paste(c('a', 'b', 'c'), c('x', 'y', 'z'), sep='-', collapse='o')
[1] "a-xob-yoc-z"

The smaller vector is recycled if necessary:

> paste(c("a", "b", "c"), 1:8)
[1] "a 1" "b 2" "c 3" "a 4" "b 5" "c 6" "a 7" "b 8"
> paste(c("a", "b", "c"), 1:8, sep="")
[1] "a1" "b2" "c3" "a4" "b5" "c6" "a7" "b8"

Two strings can be combined easily:

> paste('hello', 'world')
[1] "hello world"
> paste('hello', 'world', sep='')
[1] "helloworld"
> paste('hello', 'world', sep='-')
[1] "hello-world"
> paste('hello', 'world', collapse='')
[1] "hello world"

It automatically converts any type into strings:

> paste(1:4, 5:8)
[1] "1 5" "2 6" "3 7" "4 8"

Can work with more than two vectors too:

> paste(1:5, 11:15, 21:25, 31:35, 41:45)
[1] "1 11 21 31 41" "2 12 22 32 42" "3 13 23 33 43" "4 14 24 34 44" "5 15 25 35 45"
> paste(1:5, 11:15, 21:25, 31:35, 41:45, sep='')
[1] "111213141" "212223242" "313233343" "414243444" "515253545"

This approach can be used for combining multiple strings into one:

> paste("a", "b", "c", sep="")
[1] "abc"

It is also possible to collapse a character vector into a single string easily:

> paste(c('a', 'b', 'c'))
[1] "a" "b" "c"
> paste(c('a', 'b', 'c'), collapse='')
[1] "abc"
> paste(c("abc", "xyz"), collapse="")
[1] "abcxyz"

Splitting strings

The function strsplit is the workhorse for splitting strings. It takes a character vector (a vector of strings) and returns a list of character vectors. Each input string corresponds to one output character vector in the output list.

Simple examples:

> strsplit('hello world', split=' ')
[[1]]
[1] "hello" "world"

> strsplit('hello world again', split=' ')
[[1]]
[1] "hello" "world" "again"

> strsplit('hello-world-again', split='-')
[[1]]
[1] "hello" "world" "again"

> strsplit('hello-world-again', split=' ')
[[1]]
[1] "hello-world-again"

Multiple character strings in input:

> strsplit(c('hello world', 'amazing world'), split=' ')
[[1]]
[1] "hello" "world"

[[2]]
[1] "amazing" "world"

It is possible to create a unified character vector at output:

> unlist(strsplit(c('hello world', 'amazing world'), split=' '))
[1] "hello"   "world"   "amazing" "world"
> unlist(strsplit(rep('c a', 10), split=' '))
 [1] "c" "a" "c" "a" "c" "a" "c" "a" "c" "a" "c" "a" "c" "a" "c" "a"
[17] "c" "a" "c" "a"

The split argument supports regular expressions:

> unlist(strsplit('hello world-again', split='[ -]'))
[1] "hello" "world" "again"

Splitting a name:

> unlist(strsplit("Ryan, Mr. Edward" , split='[, .]'))
[1] "Ryan"   ""       "Mr"     ""       "Edward"

Removing the blank strings:

> parts <- unlist(strsplit("Ryan, Mr. Edward" , split='[, .]'))
> parts
[1] "Ryan"   ""       "Mr"     ""       "Edward"
> ?filter
> parts[parts != ""]
[1] "Ryan"   "Mr"     "Edward"

White Space

> trimws(' hello')
[1] "hello"
> trimws(' hello ')
[1] "hello"
> trimws('hello  ')
[1] "hello"
> trimws(' hello ', which='left')
[1] "hello "
> trimws(' hello ', which='right')
[1] " hello"
> trimws(' hello ', which='both')
[1] "hello"

5.2. Pattern Matching and Replacement

sub replaces the first match of a pattern with replacement string.

Trimming whitespace at the beginning:

> sub(' ', '', ' hello')
[1] "hello"
> sub(' ', '', ' hello ')
[1] "hello "