Jul 23, 2015

Strings in Swift 2

Swift provides a performant, Unicode-compliant string implementation as part of its Standard Library. In Swift 2, the String type no longer conforms to the CollectionType protocol, where String was previously a collection of Character values, similar to an array. Now, String provides a characters property that exposes a character collection view.

Why the change? Although it may seem natural to model a string as a collection of characters, the String type behaves quite differently from collection types like Array, Set, or Dictionary. This has always been true, but with the addition of protocol extensions to Swift 2, these differences made several fundamental changes necessary.

Different Than the Sum of Its Parts

When you add an element to a collection, you expect that the collection will contain that element. That is, when you append a value to an array, the array then contains that value. The same applies to a dictionary or a set. However, when you append a combining mark character to a string, the contents of the string itself change.

Consider the string cafe, which has four characters: c, a, f, and e:

var letters: [Character] = ["c", "a", "f", "e"]
var string: String = String(letters)

print(letters.count) // 4
print(string) // cafe
print(string.characters.count) // 4

If you append the combining acute accent character (U+0301 ´), the string still has four characters, but the last character is now é:

let acuteAccent: Character = "\u{0301}" // ´ COMBINING ACUTE ACCENT (U+0301)

string.append(acuteAccent)
print(string.characters.count) // 4
print(string.characters.last!) // é

The string’s characters property does not contain the original lowercase e, nor does it contain the combining acute accent ´ that was just appended. Instead, the string now contains a lowercase “e” with acute accent é:

string.characters.contains("e") // false
string.characters.contains("\u{0301}") // false (the combining acute accent ´)
string.characters.contains("é") // true

If we were to treat strings like any other collection, this result would be as surprising as adding UIColor.redColor() and UIColor.greenColor() to a set and the set then reporting that it contains UIColor.yellowColor().
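The original e and the appended accent have not disappeared; they are still stored in the string as separate Unicode scalars. The following is a minimal sketch of one way to check this, using the unicodeScalars view described later in this post. The constant names are illustrative, and the sketch deliberately avoids the characters property so that it relies only on the scalar view:

```swift
// The same text as above, written with an explicit combining accent.
let accentedWord = "cafe\u{0301}"

// Collect the numeric value of each Unicode scalar stored in the string.
let storedScalars = accentedWord.unicodeScalars.map { $0.value }

print(storedScalars.count)            // 5
print(storedScalars.contains(0x65))   // true  (LATIN SMALL LETTER E)
print(storedScalars.contains(0x0301)) // true  (COMBINING ACUTE ACCENT)
print(storedScalars.contains(0xE9))   // false (precomposed é is not stored)
```

The string holds five scalars, yet it presents four characters, which is exactly the mismatch that makes a string unlike an ordinary collection.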

Judged by the Contents of Its Characters

Another difference between strings and collections is the way they determine equality.

Two arrays are equal only if both have the same count, and each pair of elements at corresponding indices are equal.
Two sets are equal only if both have the same count, and each element contained in the first set is also contained in the second.
Two dictionaries are equal only if they have the same set of key, value pairs.
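As a quick illustration of these three rules, here is a small sketch with made-up values (the names are purely illustrative):

```swift
// Arrays: same count, and equal elements at corresponding indices.
let firstArray = [1, 2, 3]
let secondArray = [1, 2, 3]
let reorderedArray = [3, 2, 1]
print(firstArray == secondArray)    // true
print(firstArray == reorderedArray) // false, because order matters

// Sets: same count and the same members, in any order.
let firstSet: Set<Int> = [1, 2, 3]
let reorderedSet: Set<Int> = [3, 2, 1]
print(firstSet == reorderedSet) // true

// Dictionaries: the same set of key, value pairs.
let firstDictionary = ["a": 1, "b": 2]
let secondDictionary = ["b": 2, "a": 1]
print(firstDictionary == secondDictionary) // true
```

In every case, equality is decided element by element; no element is ever reinterpreted in light of its neighbors.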

However, String determines equality based on canonical equivalence. Characters are canonically equivalent if they have the same linguistic meaning and appearance, even if they are composed from different Unicode scalars behind the scenes.

Consider the Korean writing system, which consists of 24 letters, or Jamo, representing individual consonants and vowels. When written out, these letters are combined into characters for each syllable. For example, the character “가” ([ga]) is composed of the letters “ᄀ” ([g]) and “ᅡ” ([a]). In Swift, strings are considered equal regardless of whether they are constructed from decomposed or precomposed character sequences:

let decomposed = "\u{1100}\u{1161}" // ᄀ + ᅡ
let precomposed = "\u{AC00}" // 가

decomposed == precomposed // true

Again, this behavior differs greatly from any of Swift’s collection types. It would be as surprising as an array with values 🐟 and 🍚 being considered equal to 🍣.
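To make the contrast concrete, the sketch below compares the same two strings both as strings and as plain arrays of scalar values. The constant names are illustrative and are chosen so they do not clash with the example above:

```swift
let decomposedSyllable = "\u{1100}\u{1161}" // ᄀ + ᅡ
let precomposedSyllable = "\u{AC00}"        // 가

// Extract the underlying scalar values as ordinary arrays of UInt32.
let decomposedValues = decomposedSyllable.unicodeScalars.map { $0.value }
let precomposedValues = precomposedSyllable.unicodeScalars.map { $0.value }

print(decomposedSyllable == precomposedSyllable) // true  (canonical equivalence)
print(decomposedValues == precomposedValues)     // false (element-wise comparison)
print(decomposedValues.count)                    // 2
print(precomposedValues.count)                   // 1
```

The strings compare equal even though the arrays built from their scalars differ in both length and content.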

Depends on Your Point of View

Strings are not collections. But they do provide views that conform to CollectionType:

characters is a collection of Character values, or extended grapheme clusters.
unicodeScalars is a collection of Unicode scalar values.
utf8 is a collection of UTF-8 code units.
utf16 is a collection of UTF-16 code units.

If we take the previous example of the word “café”, composed of the decomposed characters [ c, a, f, e ] and [ ´ ], here's what the various string views would consist of:

The characters property segments the text into extended grapheme clusters, which are an approximation of user-perceived characters (in this case: c, a, f, and é). Because a string must iterate through each of its positions (each position is called a code point) in order to determine character boundaries, operations on this view, such as counting its characters or advancing to a particular character, run in linear O(n) time. When processing strings that contain human-readable text, high-level locale-sensitive Unicode algorithms, such as those used by the localizedStandardCompare(_:) method and the localizedLowercaseString property, should be preferred to character-by-character processing.
The unicodeScalars property exposes the underlying scalar values stored in the string. If the original string were created with the precomposed character é instead of the decomposed e + ´, this would be reflected by the Unicode scalars view. Use this API when you are performing low-level manipulation of character data.
The utf8 and utf16 properties provide code units for the UTF-8 and UTF-16 representations, respectively. These values correspond to the actual bytes written to a file when translated to and from a particular encoding. UTF-8 code units are used by many POSIX string processing APIs, whereas UTF-16 code units are used throughout Cocoa and Cocoa Touch to represent string lengths and offsets.
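The differences between these views can be sketched by building “café” both ways and inspecting each view. The constant names are illustrative; the characters view is left out here because the examples earlier in this post already cover it:

```swift
let decomposedCafe = "cafe\u{0301}" // e followed by a combining acute accent
let precomposedCafe = "caf\u{E9}"   // a single precomposed é

// As strings, the two are canonically equivalent.
print(decomposedCafe == precomposedCafe) // true

// The scalar view reflects how each string was actually built.
print(decomposedCafe.unicodeScalars.count)  // 5
print(precomposedCafe.unicodeScalars.count) // 4

// UTF-8 code units: the accent and the precomposed é each take two bytes.
print(Array(decomposedCafe.utf8))  // [99, 97, 102, 101, 204, 129]
print(Array(precomposedCafe.utf8)) // [99, 97, 102, 195, 169]

// UTF-16 code units: one unit per scalar for these strings.
print(decomposedCafe.utf16.count)  // 5
print(precomposedCafe.utf16.count) // 4
```

Two strings that compare equal can therefore report different lengths in every lower-level view, so offsets taken from one view should not be assumed to be valid in another.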

For more information about working with Strings and Characters in Swift, read The Swift Programming Language and the Swift Standard Library Reference.
