Contents
- Introduction
- Creating Strings
- Size of Strings
- Accessing Single Characters
- Trimming a String
- Replacing Characters in Strings
- Splitting Strings
- Finding Substrings
- Accessing Substrings
- Byte Strings and Byte Arrays
Introduction
There are two important string types in Rust. Mixing them up can lead to some tedious debugging, although the compiler usually points out the mistake clearly.
- String is an owned, growable string that is stored as a byte vector (Vec<u8>) on the heap and can be modified if it is declared mut
- &str is a string slice, a borrowed view (pointer plus length) into a UTF-8 byte sequence (&[u8]), roughly comparable to a const char * with a known length in C. String literals have this type (&'static str), and a &str can never be used to modify the data it points to
Both types are guaranteed to hold valid UTF-8 sequences, and a String keeps its data on the heap. This makes handling them a bit different from strings in C, which are raw, null-terminated byte sequences (e.g. ASCII characters).
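The relation between the two types can be sketched as follows. The function shout is a hypothetical helper made up for this illustration; it takes a &str so that it works with both literals and borrowed String values:

```rust
// Hypothetical helper for illustration: accepts any string slice.
fn shout(text: &str) -> String {
    // to_uppercase allocates and returns a new, owned String.
    text.to_uppercase()
}

fn main() {
    // A string literal is a &'static str.
    let literal: &'static str = "hello";
    // An owned String that keeps its data on the heap.
    let owned: String = String::from("world");
    // &String is coerced to &str automatically when passed to shout.
    println!("{} {}", shout(literal), shout(&owned));
}
```

Taking &str as the parameter type is usually the more flexible choice, since a String can always be borrowed as a &str without copying.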
Creating Strings
We are mainly interested in String for all examples below. However, for the sake of completeness, this is how string slices are created:
let test_str = "test";
With explicit typing it would look like this:
let test_str: &'static str = "test";
let test_str: &str = "test";
There are basically three options to create a String. First, we can apply to_string or into to a string literal (to_owned does the same thing for a &str). Second, we can create a String from a literal or string slice using String::from. Third, we can create an empty string object (String::new()), which has to be mutable, and append a single character (push) or a string slice (push_str) to the end of it. Combining these different ways yields a “Hello, World!” that looks like this:
fn main(){
let string_0: String = "Hello, ".to_string();
let string_1: String = String::from("Worl");
let string_2: String = "d".into();
let mut string_3: String = String::new();
string_3.push_str(&string_0);
string_3.push_str(&string_1);
string_3.push_str(&string_2);
string_3.push('!');
println!("{}",string_3);
}
Compiling and running the code above will output
Hello, World!
It is important to understand that a character literal (char) is not the same as a &str, even if the string consists of a single character. Double quotes always produce a &str, whereas single quotes produce a char. Passing a &str to push therefore fails to compile:
error[E0308]: mismatched types
--> src/main.rs:9:19
|
9 | string_3.push("!");
| ^^^ expected `char`, found `&str`
|
help: if you meant to write a `char` literal, use single quotes
|
9 | string_3.push('!');
|
Another way to concatenate strings would be using the +
operator:
let string_hello: String = String::from("Hello,");
let string_world: String = String::from(" World!");
let string_hw: String = string_hello + &string_world;
The left operand is required to be a String (it is moved into the result and cannot be used afterwards), whereas everything added to it needs to be a string slice (&str). If string_hello is mutable, the += operator works as well:
let mut string_hello: String = String::from("Hello,");
let string_world: String = String::from(" World!");
string_hello += &string_world;
If a String is declared mutable (mut), we can also insert characters or string slices into it. However, this shifts every byte after the insertion point, and we have to know the byte position, which might be a bit tricky to figure out (see Size of Strings). Inserting a leading whitespace would look like this:
// inserting char
string_3.insert(0, ' ');
// inserting string view
let whitespace = " ";
string_3.insert_str(0, whitespace);
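Position 0 is always a valid byte position, but any other position has to fall on a character boundary. The following is a minimal sketch of how a character position could be translated into a byte position using char_indices. The helper insert_at_char is hypothetical and only meant as an illustration:

```rust
// Hypothetical helper: insert `ch` before the n-th character (not byte) of `text`.
fn insert_at_char(text: &mut String, n: usize, ch: char) {
    // char_indices yields (byte_offset, char) pairs.
    // If n is past the last character, fall back to appending at the end.
    let byte_pos = text
        .char_indices()
        .nth(n)
        .map(|(pos, _)| pos)
        .unwrap_or(text.len());
    text.insert(byte_pos, ch);
}

fn main() {
    // Each of these characters takes two bytes in UTF-8.
    let mut text = String::from("äöü");
    insert_at_char(&mut text, 1, '-');
    println!("{}", text); // ä-öü
}
```

Calling text.insert(1, '-') directly would panic here, because byte 1 lies in the middle of the two-byte character ‘ä’.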
There are scenarios in which we need to create a string from a byte vector, assuming that it contains an ASCII or UTF-8 character sequence. A simple example would be the output of a deciphered text. In such a case String::from_utf8(byte_vector) could be of use. String::from_utf16 is the counterpart for UTF-16 data and takes a slice of u16 values (&[u16]) instead of bytes. Both return a Result because the input may be invalid. If the data may contain invalid sequences, String::from_utf8_lossy or String::from_utf16_lossy might be a better choice, as they replace invalid sequences with a ‘�’ (U+FFFD). There is also an unsafe option to convert a byte vector to a string without any validation (String::from_utf8_unchecked), which is questionable to use at all.
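A minimal sketch of the strict and the lossy conversion, under the assumption that a lossy result is acceptable as a fallback. The helper decode is hypothetical and not part of the standard library:

```rust
// Hypothetical helper: decode bytes strictly, falling back to lossy decoding.
fn decode(bytes: &[u8]) -> String {
    match String::from_utf8(bytes.to_vec()) {
        Ok(text) => text,
        // Invalid sequences are replaced with U+FFFD.
        Err(_) => String::from_utf8_lossy(bytes).into_owned(),
    }
}

fn main() {
    let valid: Vec<u8> = vec![72, 105]; // "Hi"
    let invalid: Vec<u8> = vec![72, 105, 0xFF]; // 0xFF never occurs in valid UTF-8
    println!("{}", decode(&valid));
    println!("{}", decode(&invalid));

    // from_utf16 expects u16 code units rather than bytes.
    let utf16: [u16; 2] = [0x0048, 0x0069];
    if let Ok(text) = String::from_utf16(&utf16) {
        println!("{}", text);
    }
}
```

In practice String::from_utf8_lossy alone gives the same result for valid input; the match is only there to show how the Result of the strict variant can be handled.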
Size of Strings
This one is tricky because a string is stored as a UTF-8 sequence in which a character can take up between one and four bytes. Calling string_0.len() returns the size of the underlying vector in bytes, not the number of characters in the string. We often want to know the number of characters of a string and not necessarily its size in bytes. In such a case we have to iterate over its chars first and count them:
let string_0_num_chars = string_0.chars().count();
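This approach is more reliable when non-ASCII characters are used. However, it does not solve the problem for every character. A char is a single Unicode scalar value, and a character as perceived by the reader (a grapheme cluster) may be composed of several of them if it is not pre-composed, e.g. a base letter followed by a combining accent. UnicodeSegmentation::graphemes from the external unicode-segmentation crate solves this issue, as grapheme segmentation is not part of the standard library.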
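The difference between bytes and chars can be sketched with the standard library alone. The helper sizes is hypothetical and only used for illustration:

```rust
// Hypothetical helper: returns (size in bytes, number of chars).
fn sizes(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

fn main() {
    // ASCII only: one byte per character.
    println!("{:?}", sizes("abc")); // (3, 3)
    // 'ä' as a single pre-composed code point (U+00E4): two bytes, one char.
    println!("{:?}", sizes("\u{00E4}")); // (2, 1)
    // 'a' followed by a combining diaeresis (U+0308): three bytes, two chars,
    // although it is displayed as a single character.
    println!("{:?}", sizes("a\u{0308}")); // (3, 2)
}
```

The last case is the one where only grapheme segmentation would report a length of one.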
Accessing Single Characters
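The chars method allows accessing single characters as well. To return the “H” of the “ Hello, World!” example above (with a single leading whitespace inserted), we need to access the nth character, which means that its zero-based character position needs to be known. Note that nth walks the string from the start and returns an Option, which is None if the position is out of range: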
let string_h: String = string_3.chars().nth(1).unwrap().to_string();
println!("{}", string_h);
H
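Calling unwrap panics if the position does not exist. A minimal sketch that handles the missing case explicitly could look like this; nth_char is a hypothetical helper:

```rust
// Hypothetical helper: n-th character (zero-based) of a string slice, if any.
fn nth_char(text: &str, n: usize) -> Option<char> {
    text.chars().nth(n)
}

fn main() {
    let text = " Hello, World!";
    match nth_char(text, 1) {
        Some(c) => println!("{}", c), // H
        None => println!("position out of range"),
    }
    // A position past the end yields None instead of a panic.
    println!("{:?}", nth_char(text, 100)); // None
}
```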
Trimming a String
Trimming a string results in a string slice (&str) that points into the original string. Therefore no memory is copied and the trimmed string slice remains immutable. An in-place removal of matches (String::remove_matches) is not available in a stable Rust version at the time of writing.
Calling trim() removes all leading and trailing whitespace, including e.g. carriage returns and newlines.
Applying trim like this
let string_4: String = String::from(" Hello, World! \r\n");
let string_4_trimmed: &str = string_4.trim();
println!("{}", string_4.chars().count());
println!("{}", string_4_trimmed.chars().count());
yields
17
13
Let’s assume that we want to remove all leading and trailing null characters (0x00, not the digit ‘0’, which is 0x30) from a string. In such a case we could use trim_matches:
let string_5: String = String::from("12000592390\0\0");
let string_5_trimmed: &str = string_5.trim_matches(char::from(0));
println!("{}",string_5);
println!("{}",string_5.chars().count());
println!("{}",string_5_trimmed);
println!("{}",string_5_trimmed.chars().count());
12000592390
13
12000592390
11
Removing several characters such as 0x00, ‘0’ and ‘2’ at once is done by passing a slice of chars:
let string_6: String = String::from("212000592390\0\0");
let chars_to_be_removed_0: &[char] = &[char::from(0), '0', '2'];
let string_6_trimmed: &str = string_6.trim_matches(chars_to_be_removed_0);
println!("{}",string_6);
println!("{}",string_6.chars().count());
println!("{}",string_6_trimmed);
println!("{}",string_6_trimmed.chars().count());
However, this will remove leading and trailing characters only:
212000592390
14
1200059239
10
The order of the characters in the slice, e.g. &['0', '2', char::from(0)], does not matter because every character in the slice is treated as a possible match:
let string_7: String = String::from("212000592390\0\0");
let chars_to_be_removed_1: &[char] = &['0', '2',char::from(0)];
let string_7_trimmed: &str = string_7.trim_matches(chars_to_be_removed_1);
println!("{}",string_7);
println!("{}",string_7.chars().count());
println!("{}",string_7_trimmed);
println!("{}",string_7_trimmed.chars().count());
212000592390
14
1200059239
10
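If only one side of the string should be trimmed, trim_start_matches and trim_end_matches can be used in the same way. A minimal sketch with two hypothetical helper functions:

```rust
// Hypothetical helpers for illustration: strip a character from one side only.
fn strip_leading(text: &str, unwanted: char) -> &str {
    text.trim_start_matches(unwanted)
}

fn strip_trailing(text: &str, unwanted: char) -> &str {
    text.trim_end_matches(unwanted)
}

fn main() {
    let padded = "0012300";
    println!("{}", strip_leading(padded, '0')); // 12300
    println!("{}", strip_trailing(padded, '0')); // 00123
}
```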
Replacing Characters in Strings
Replacing characters always allocates a new String, and the original string is left untouched. We can use replace to do so:
let string_for_repl: String = String::from("212000592390");
let string_repl = string_for_repl.replace(&['0','2'][..],"");
println!("{}", string_repl);
15939
In this case the characters in question are removed by replacing them with nothing.
Using regular expressions is possible as well. Iterators and match have some nice advantages for dealing with data that is not UTF-8 compliant, and they allow more complex operations with better readability for everyone without in-depth knowledge of regular expressions.
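The iterator-based approach could be sketched like this. Both helpers, remove_chars and mask_digits, are hypothetical names chosen for this illustration:

```rust
// Hypothetical helper: drop every character listed in `unwanted`.
fn remove_chars(text: &str, unwanted: &[char]) -> String {
    text.chars().filter(|c| !unwanted.contains(c)).collect()
}

// Hypothetical helper: replace every ASCII digit with '#' using match.
fn mask_digits(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            '0'..='9' => '#',
            other => other,
        })
        .collect()
}

fn main() {
    println!("{}", remove_chars("212000592390", &['0', '2'])); // 15939
    println!("{}", mask_digits("a1b22")); // a#b##
}
```

The match arms can be extended with further patterns, which tends to stay readable longer than a growing regular expression.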
Splitting Strings
Let’s assume that we have a comma-separated string (one line of a CSV file). String provides the method split, which returns an iterator over the initial string that yields a &str for each individual entry.
let csv_string: String = String::from("a,23,b,90");
let csv_items = csv_string.split(",");
for entry in csv_items {
println!("{}",entry);
}
a
23
b
90
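There is also a method called split_whitespace. It is similar to .split(" ") but splits on any Unicode whitespace (spaces, tabs, newlines) and treats consecutive whitespace as a single separator, so it never yields empty entries.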
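The difference between the two methods shows up as soon as the input contains repeated spaces or tabs. A minimal sketch with two hypothetical wrapper functions:

```rust
// Hypothetical wrappers for illustration: collect the entries into vectors.
fn split_on_space(text: &str) -> Vec<&str> {
    text.split(" ").collect()
}

fn split_on_whitespace(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    // Two spaces between "a" and "b", a tab between "b" and "c".
    let text = "a  b\tc";
    println!("{:?}", split_on_space(text)); // ["a", "", "b\tc"]
    println!("{:?}", split_on_whitespace(text)); // ["a", "b", "c"]
}
```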
Finding Substrings
There are a couple of ways to find a substring. Unlike when splitting a string (see above), we are often only interested in the existence of a substring. contains
is just the right method to figure out if a substring exists in a string.
let pangram: String = String::from("The quick brown fox jumps over the lazy dog");
let substr_found: bool = pangram.contains("fox");
println!("{}", substr_found);
The find method yields the start position (byte index) of the first occurrence of a substring, wrapped in an Option that is None if nothing was found:
let start_pos: usize;
let find_res: Option<usize> = pangram.find("fox");
if find_res.is_some() {
start_pos = find_res.unwrap();
} else {
start_pos = pangram.len()+1;
}
println!("{}", start_pos);
16
let start_pos: usize;
let find_res: Option<usize> = pangram.find("foo");
if find_res.is_some() {
start_pos = find_res.unwrap();
} else {
start_pos = pangram.len()+1;
}
println!("{}", start_pos);
44
There is a reverse find method as well (rfind), which yields the start position of the last occurrence, i.e. it searches from the end (right side).
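A minimal sketch that compares find and rfind and handles the Option with match instead of is_some and unwrap. The helper describe is hypothetical:

```rust
// Hypothetical helper: report the first and last byte position of `needle`.
fn describe(haystack: &str, needle: &str) -> String {
    match (haystack.find(needle), haystack.rfind(needle)) {
        (Some(first), Some(last)) => format!("first at {}, last at {}", first, last),
        _ => String::from("not found"),
    }
}

fn main() {
    let pangram = "The quick brown fox jumps over the lazy dog";
    // The first 'o' is in "brown", the last one in "dog".
    println!("{}", describe(pangram, "o")); // first at 12, last at 41
    println!("{}", describe(pangram, "foo")); // not found
}
```

Matching on the Option avoids the need for a sentinel value such as pangram.len()+1 for the case in which nothing is found.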
Accessing Substrings
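Knowing the start position and length of a substring (here start_pos is 16, the position of “fox” found above), it can be accessed with a byte range like a slice of a vector. The range must be within bounds and must start and end on character boundaries, otherwise the program panics: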
let pangram_substr_0: &str = &pangram[start_pos..];
let pangram_substr_1: &str = &pangram[start_pos..start_pos+3];
println!("{}", pangram_substr_0);
println!("{}", pangram_substr_1);
fox jumps over the lazy dog
fox
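Indexing with a range panics if the range is out of bounds or does not fall on character boundaries. The get method returns an Option instead, which can be sketched as follows. The helper substr is hypothetical:

```rust
// Hypothetical helper: substring by byte position and length.
// Returns None instead of panicking if the range is invalid.
fn substr(text: &str, start: usize, len: usize) -> Option<&str> {
    // checked_add guards against an overflow of start + len.
    text.get(start..start.checked_add(len)?)
}

fn main() {
    let pangram = "The quick brown fox jumps over the lazy dog";
    println!("{:?}", substr(pangram, 16, 3)); // Some("fox")
    // Out of bounds: the string is only 43 bytes long.
    println!("{:?}", substr(pangram, 44, 3)); // None
    // Byte 1 lies in the middle of the two-byte character 'ä'.
    println!("{:?}", substr("äb", 1, 1)); // None
}
```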
Byte Strings and Byte Arrays
String and &str are valid UTF-8 sequences, and that makes perfect sense if we look at a string that represents something we would consider a readable text (displayable/printable). There are other situations in which we need a plain sequence of bytes, sometimes called a byte string and sometimes a byte array. A common application for this is communication via a serial port (RS-232), where command and response codes are often not “human readable” but plain sequences of bytes. Personally, I recommend using byte arrays for this, but that might not always be possible depending on what other function/class we’re interacting with.
Byte vectors are of type Vec<u8> and each element is added accordingly. There is not much to say about this.
Byte arrays or byte strings (both immutable) are a bit tricky to define. They are not of type &str but of type &[u8] (a byte string literal b"..." has the type &[u8; N] and coerces to a slice), and they do not imply a valid UTF-8 sequence as &str would.
let byte_string: &[u8] = b"A byte string\x00";
println!("{:?}", byte_string);
Since a byte slice is not a string, it can only be printed with the debug formatter ({:?}), which shows the raw values:
[65, 32, 98, 121, 116, 101, 32, 115, 116, 114, 105, 110, 103, 0]
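A minimal sketch of building a byte vector and of converting bytes back into text when they happen to be valid UTF-8. The helpers frame and bytes_to_text are hypothetical, and wrapping a payload in STX (0x02) and ETX (0x03) is only an assumed example of a serial protocol, not one taken from a specific device:

```rust
// Hypothetical helper: wrap a payload in STX/ETX control bytes (assumed framing).
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut bytes: Vec<u8> = Vec::new();
    bytes.push(0x02); // STX
    bytes.extend_from_slice(payload);
    bytes.push(0x03); // ETX
    bytes
}

// Hypothetical helper: view bytes as text if they form valid UTF-8.
fn bytes_to_text(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    println!("{:?}", frame(b"OK")); // [2, 79, 75, 3]
    println!("{:?}", bytes_to_text(b"abc")); // Some("abc")
    // 0xFF never occurs in valid UTF-8.
    println!("{:?}", bytes_to_text(&[0xFF])); // None
}
```

std::str::from_utf8 does not copy the data; it only validates the bytes and returns a &str that points to the same memory.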