Strings in Rust

Contents

Introduction
Creating Strings
Size of Strings
Accessing Single Characters
Trimming a String
Replacing Characters in Strings
Splitting Strings
Finding Substrings
Accessing Substrings
Byte Strings and Byte Arrays

Introduction

Rust has two main string types. Confusing them can lead to some nasty debugging, although the compiler usually points out the mistake clearly.

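String is an owned string object that stores its data as a vector of bytes (Vec<u8>) and can be modified if declared mut
&str is a string slice, a borrowed view into UTF-8 bytes stored elsewhere (think of a pointer plus a length, roughly a const char * with a known size in C). String literals are of this type, and a &str cannot be used to modify the data it points to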

A String keeps its data on the heap, and both String and &str are guaranteed to be valid UTF-8 sequences. This makes handling them a bit different from strings in the C programming language, which are raw bytes (e.g. ASCII characters) terminated by a null byte.
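
The relationship between the two types can be sketched with two small functions. The names shout and first_word are made up for this illustration and do not come from the standard library:

```rust
/// Takes a borrowed string slice and returns a new, owned String.
fn shout(input: &str) -> String {
    // to_uppercase allocates a new String; the input is left untouched.
    input.to_uppercase()
}

/// Returns a view into the input without copying any bytes.
fn first_word(input: &str) -> &str {
    // split_whitespace yields &str slices that borrow from `input`.
    input.split_whitespace().next().unwrap_or("")
}
```

Both functions accept a &str, so a String can be passed by reference (first_word(&my_string)) and a literal can be passed directly (shout("hello")).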

Creating Strings

We are mainly interested in String for all examples below. However, for the sake of completeness, this is how string slices are created:

let test_str = "test";

With explicit typing it would look like this:

let test_str: &'static str = "test";
let test_str: &str = "test";

There are basically three options to create a String. First, we can apply to_string or into to a string literal (to_owned does the same thing for a &str). Second, we can create a String from a literal or slice using String::from. Third, we can create an empty string object (String::new()), which has to be mutable, and append a single character (push) or a string slice (push_str) to the end of it. Combining these different ways yields a “Hello, World!” that looks like this:

fn main(){
    let string_0: String = "Hello, ".to_string();
    let string_1: String = String::from("Worl");
    let string_2: String = "d".into();
    let mut string_3: String = String::new();
    string_3.push_str(&string_0);
    string_3.push_str(&string_1);
    string_3.push_str(&string_2);
    string_3.push('!');

    println!("{}",string_3);
}

Compiling and running the code above will output

Hello, World!

It is important to understand that a character literal (char) is not the same as &str, even if the string holds only a single character. Double quotes always produce a &str whereas single quotes produce a char. Passing a &str to push fails to compile:

error[E0308]: mismatched types
 --> src/main.rs:9:19
  |
9 |     string_3.push("!");
  |                   ^^^ expected `char`, found `&str`
  |
help: if you meant to write a `char` literal, use single quotes
  |
9 |     string_3.push('!');
  |     

Another way to concatenate strings would be using the + operator:

let string_hello: String = String::from("Hello,");
let string_world: String = String::from(" World!");
let string_hw: String = string_hello + &string_world;

The left operand is required to be a String, and it is moved into the result, so string_hello cannot be used afterwards. Everything added to it needs to be a string slice (&str). If string_hello is mutable the += operator works as well:

let mut string_hello: String = String::from("Hello,");
let string_world: String = String::from(" World!");
string_hello += &string_world;
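
The ownership difference is easier to see when the concatenation is wrapped in functions. This is only a sketch: concat_plus and concat_format are hypothetical helper names, and format! is shown as one alternative that borrows both inputs instead of consuming the left one:

```rust
/// Concatenates with `+`. The left operand is taken by value,
/// so the caller gives up ownership of `left`.
fn concat_plus(left: String, right: &str) -> String {
    left + right
}

/// Concatenates with format!, which only borrows both inputs.
fn concat_format(left: &str, right: &str) -> String {
    format!("{}{}", left, right)
}
```

After calling concat_format(&a, &b) both a and b are still usable, whereas a String passed to concat_plus is gone.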

If a String is declared mutable (mut), we can insert characters or string slices into it. This, however, shifts every byte after the insertion point, and we have to know the byte position, which might be a bit tricky to figure out (see Size of Strings). Inserting a leading whitespace would look like this:

// inserting char
string_3.insert(0, ' ');

// inserting string view
let whitespace = " ";
string_3.insert_str(0, whitespace);
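
Because insert expects a byte offset, a position counted in characters has to be translated first. A minimal sketch of that translation, where insert_at_char is a made-up helper name and returning false for an out-of-range position is just one possible design:

```rust
/// Inserts `ch` before the character at `char_pos` (counted in chars).
/// Returns false if `char_pos` lies beyond the end of the string.
fn insert_at_char(s: &mut String, char_pos: usize, ch: char) -> bool {
    // Translate the character position into a byte offset.
    let byte_pos = match s.char_indices().nth(char_pos) {
        Some((idx, _)) => idx,
        // One past the last character means "append at the end".
        None if char_pos == s.chars().count() => s.len(),
        None => return false,
    };
    s.insert(byte_pos, ch);
    true
}
```

Passing a byte offset that falls inside a multi-byte character directly to insert would panic, which this lookup avoids.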

There are scenarios in which we need to create a string from a byte array (vector), assuming that it contains ASCII or UTF-8 character sequences. A simple example would be the output of a deciphered text. In such a case String::from_utf8(byte_vector) could be of use; it returns a Result because the conversion fails on invalid input. For UTF-16 data there is String::from_utf16, which takes a slice of u16 values rather than bytes. If the input may contain invalid sequences, String::from_utf8_lossy or String::from_utf16_lossy might be a better choice as they replace invalid sequences with a ‘�’ (U+FFFD). There also exists an unsafe option to convert a byte vector to a string (String::from_utf8_unchecked), which skips validation and is questionable to use at all.
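
The strict and the lossy conversion could be wrapped like this. The function names are hypothetical, and turning the error into None is a simplification for the sake of the example:

```rust
/// Strict conversion: returns None if the bytes are not valid UTF-8.
fn bytes_to_string_strict(bytes: Vec<u8>) -> Option<String> {
    // from_utf8 takes ownership of the vector and reuses its buffer.
    String::from_utf8(bytes).ok()
}

/// Lossy conversion: invalid sequences are replaced by U+FFFD.
fn bytes_to_string_lossy(bytes: &[u8]) -> String {
    // from_utf8_lossy returns a Cow<str>; into_owned makes it a String.
    String::from_utf8_lossy(bytes).into_owned()
}
```

In real code it is usually better to keep the error returned by from_utf8, since it reports where the invalid data starts.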

Size of Strings

This one is tricky because a string is a UTF-8 sequence in which a single character can take up to four bytes. Calling string_0.len() will return the size of the underlying vector in bytes but not the number of characters in the string. We often want to know the number of characters of a string and not necessarily its size. In such a case we would have to convert it to chars first and count them afterwards:

let string_0_num_chars = string_0.chars().count();

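This approach gives the expected result for many non-ASCII characters. However, it does not solve the problem for every character. A char is a single Unicode scalar value, and what a reader perceives as one character may be composed of several of them if it is not pre-composed (e.g. a base letter followed by a combining accent). This is a Unicode issue rather than a UTF-8 one, and UnicodeSegmentation::graphemes from the unicode-segmentation crate addresses it by iterating over grapheme clusters.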
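
The difference between bytes and chars can be checked with the standard library alone. The helper name byte_and_char_count is invented for this sketch, and grapheme counting is left out because it needs an external crate:

```rust
/// Returns (size in bytes, number of Unicode scalar values).
fn byte_and_char_count(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For "Hello" this gives (5, 5). A pre-composed é ("\u{e9}") gives (2, 1), whereas an e followed by a combining acute accent ("e\u{301}") gives (3, 2) although both are displayed as one character.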

Accessing Single Characters

The chars method allows accessing single characters as well. To return the “H” of the “ Hello, World!” example above we need to access the nth character (counting from zero), which means that the position needs to be known:

let string_h: String = string_3.chars().nth(1).unwrap().to_string();
println!("{}", string_h);
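
Since nth returns an Option, calling unwrap panics if the string is shorter than expected. A small sketch that hands the Option back to the caller instead (nth_char is a hypothetical helper name):

```rust
/// Returns the character at position `n`, counted in chars rather than
/// bytes, or None if the string has fewer than `n + 1` characters.
fn nth_char(s: &str, n: usize) -> Option<char> {
    // chars() walks the string from the start, so this takes O(n) time.
    s.chars().nth(n)
}
```

The caller can then decide how to handle a missing character, for example with if let or unwrap_or.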

Trimming a String

Trimming a string results in a string slice (&str), so no memory is copied and the trimmed slice remains immutable. Removing all matches from a String in place (remove_matches) was not yet available in stable Rust at the time of writing.

Calling trim() removes all leading and trailing whitespace, including e.g. carriage returns and newlines.
Applying trim like this

let string_4: String = String::from(" Hello, World!  \r\n");
let string_4_trimmed: &str = string_4.trim();

println!("{}", string_4.chars().count());
println!("{}", string_4_trimmed.chars().count());

yields

18
13

Let’s assume that we want to remove all leading and trailing null characters (0x00, not the digit 0, which is 0x30) from a string. In such a case we could use trim_matches:

let string_5: String = String::from("12000592390\0\0");
let string_5_trimmed: &str = string_5.trim_matches(char::from(0));

println!("{}",string_5);
println!("{}",string_5.chars().count());
println!("{}",string_5_trimmed);
println!("{}",string_5_trimmed.chars().count());

12000592390
13
12000592390
11

Removing characters like 0x00, 0 and 2 would be done like this:

let string_6: String = String::from("212000592390\0\0");
let chars_to_be_removed_0: &[char] = &[char::from(0), '0', '2'];
let string_6_trimmed: &str = string_6.trim_matches(chars_to_be_removed_0);

println!("{}",string_6);
println!("{}",string_6.chars().count());
println!("{}",string_6_trimmed);
println!("{}",string_6_trimmed.chars().count());

However, this will remove leading and trailing characters only:

212000592390
14
1200059239
10

The order of the characters in the slice, e.g. &['0', '2', char::from(0)], does not matter:

let string_7: String = String::from("212000592390\0\0");
let chars_to_be_removed_1: &[char] = &['0', '2',char::from(0)];
let string_7_trimmed: &str = string_7.trim_matches(chars_to_be_removed_1);

println!("{}",string_7);
println!("{}",string_7.chars().count());
println!("{}",string_7_trimmed);
println!("{}",string_7_trimmed.chars().count());

212000592390
14
1200059239
10
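
If only one side should be trimmed, the standard library also offers trim_start_matches and trim_end_matches. A short sketch with two hypothetical helpers:

```rust
/// Strips trailing NUL padding only; leading characters are kept.
fn strip_trailing_nul(s: &str) -> &str {
    s.trim_end_matches('\0')
}

/// Strips leading zeros only, e.g. for zero-padded numbers.
fn strip_leading_zeros(s: &str) -> &str {
    s.trim_start_matches('0')
}
```

Like trim_matches, both return a slice of the original string and do not allocate.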

Replacing Characters in Strings

Replacing characters requires a new memory allocation because replace returns a new String. We can use it like this:

let string_for_repl: String = String::from("212000592390");
let string_repl = string_for_repl.replace(&['0','2'][..],"");
println!("{}", string_repl);

In this case the characters in question are removed by replacing them with an empty string.

Using regular expressions is possible as well (via the regex crate). Iterators combined with match have some nice advantages for dealing with data that is not UTF-8 compliant and allow more complex operations with better readability for everyone without in-depth knowledge of regular expressions.
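
As an illustration of the iterator approach, the removal from the replace example could also be written with filter. The function name remove_chars is an assumption made for this sketch:

```rust
/// Builds a new String that contains none of the characters in `unwanted`.
fn remove_chars(input: &str, unwanted: &[char]) -> String {
    input
        .chars()
        // Keep only characters that are not in the list.
        .filter(|c| !unwanted.contains(c))
        .collect()
}
```

Calling remove_chars("212000592390", &['0', '2']) returns "15939". The closure can be swapped for any other condition, which is where this style becomes more flexible than replace.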

Splitting Strings

Let us assume that we have a comma-separated string (one line of a CSV file). String provides the method split, which returns an iterator over the initial string that yields a &str for each entry.

let csv_string: String = String::from("a,23,b,90");
let csv_items = csv_string.split(",");
for entry in csv_items {
    println!("{}",entry);
}

a
23
b
90

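There is also a method called split_whitespace. It is similar to .split(" ") but splits on any Unicode whitespace (tabs and newlines included) and treats consecutive whitespace as a single separator, so it never yields empty entries.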
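
The difference between the two shows up as soon as the input contains a tab or more than one space in a row. A small comparison, with both function names invented for this sketch:

```rust
/// Splits on a single space character only.
fn split_on_space(s: &str) -> Vec<&str> {
    s.split(' ').collect()
}

/// Splits on runs of any Unicode whitespace.
fn split_on_whitespace(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}
```

For the input "a  b\tc" (two spaces, then a tab) the first function returns ["a", "", "b\tc"] while the second returns ["a", "b", "c"].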

Finding Substrings

There are a couple of ways to find a substring. Unlike when splitting a string (see above), we are often only interested in the existence of a substring. contains is just the right method to figure out if a substring exists in a string.

let pangram: String = String::from("The quick brown fox jumps over the lazy dog");
let substr_found: bool = pangram.contains("fox");
println!("{}", substr_found);

The find method yields the start position (as a byte offset) of the substring that is searched for, wrapped in an Option:

// Substring exists: find returns Some(16).
let find_res: Option<usize> = pangram.find("fox");
// Fall back to a position past the end of the string if nothing is found.
let start_pos: usize = find_res.unwrap_or(pangram.len() + 1);
println!("{}", start_pos);

// Substring does not exist: find returns None and the fallback (44) is used.
let find_res_missing: Option<usize> = pangram.find("foo");
let missing_pos: usize = find_res_missing.unwrap_or(pangram.len() + 1);
println!("{}", missing_pos);

There exists a reverse find method as well (rfind), which searches from the end (right side) and yields the start position of the last match.
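
Both searches can be combined to get the first and the last occurrence in one go. This is a sketch only, and first_and_last_pos is a hypothetical helper:

```rust
/// Returns the byte offsets of the first and the last occurrence of
/// `needle`, or None if `needle` does not occur at all.
fn first_and_last_pos(haystack: &str, needle: &str) -> Option<(usize, usize)> {
    // The ? operator returns None early if a search finds nothing.
    let first = haystack.find(needle)?;
    let last = haystack.rfind(needle)?;
    Some((first, last))
}
```

If the needle occurs exactly once, both positions are identical.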

Accessing Substrings

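Knowing the start position and length of a substring, it can be accessed with a range, just like a slice of a vector. The indices are byte offsets and must fall on character boundaries, otherwise the program panics. Using the start position of “fox” found above: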

let pangram_substr_0: &str = &pangram[start_pos..];
let pangram_substr_1: &str = &pangram[start_pos..start_pos+3];
println!("{}", pangram_substr_0);
println!("{}", pangram_substr_1);

fox jumps over the lazy dog
fox
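
When the offsets come from untrusted input, the get method is a non-panicking alternative to the index syntax. A minimal sketch, where substr is an invented helper name:

```rust
/// Returns `len` bytes starting at byte offset `start`, or None if the
/// range is out of bounds or cuts through a multi-byte character.
fn substr(s: &str, start: usize, len: usize) -> Option<&str> {
    // checked_add guards against overflow of start + len.
    let end = start.checked_add(len)?;
    s.get(start..end)
}
```

With the pangram from above, substr(&pangram, 16, 3) returns Some("fox"), while a range that ends past the string or splits a character yields None instead of a panic.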

Byte Strings and Byte Arrays

String and &str are valid UTF-8 sequences and that makes perfect sense if we look at a string that represents something we would consider a readable text (displayable/printable). There are other situations in which we may need a sequence of bytes sometimes called byte string and sometimes it is an array of bytes. A common application for this are comms via serial port (RS-232) where command and response codes are often not “human readable” but plain sequences of bytes. Personally, I recommend using byte arrays for this but that might not be always possible depending on what other function/class we’re interacting with.

Byte vectors are of type Vec<u8> and elements are added as with any other vector. There is not much to say about this.
Byte arrays or byte strings (both immutable) are a bit trickier to define. A byte string literal is not of type &str but a reference to a byte array that can be used as &[u8], and it does not imply a valid UTF-8 sequence as &str would.

let byte_string: &[u8] = b"A byte string\x00";
println!("{:?}", byte_string);

Since a byte slice is not a string, only the raw values are printed:

[65, 32, 98, 121, 116, 101, 32, 115, 116, 114, 105, 110, 103, 0]
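
For debugging serial protocols it is often handy to view the same bytes either as text, if they happen to be valid UTF-8, or as hex. A sketch with two hypothetical helpers, using only the standard library:

```rust
/// Interprets the bytes as text, or returns None if they are not valid UTF-8.
fn bytes_as_text(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

/// Formats the bytes as space-separated, two-digit uppercase hex values.
fn to_hex(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|b| format!("{:02X}", b))
        .collect::<Vec<String>>()
        .join(" ")
}
```

For b"A\x00" the hex helper returns "41 00". Going the other way, a &str or String exposes its raw bytes through as_bytes().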