How do I convert a UTF-8 code into its string representation?

Question

Created Oct ’17


The following code prints the UTF-8 code for each character in string. I would like to print the string representation of char instead.

        for char in string.utf8 {
            print(char)
        }


Answer 1

Claude31 OP

Oct ’17

You can use

    let c = UnicodeScalar(char)


Answer 2

QuinceyMorris OP

Oct ’17

Accepted Answer

Whoa, no, no, no!

The original code loops over the elements of the string's utf8 view, which are byte-sized code units. These are not individual characters, and they have no representation other than as binary values. In some cases, such as ASCII characters, the UTF-8 representation is a single code unit (e.g. 0x41 for "A"), but any byte with its high-order bit set is part of a longer sequence and is meaningless by itself.
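To make the point about code units concrete, here is a minimal sketch. The sample strings are chosen for illustration and are not from the original post. It dumps the UTF-8 code units of an ASCII and a non-ASCII string and checks the high-order bit of each byte:

```swift
// Sample strings chosen for illustration only.
let ascii = "A"
let accented = "\u{E9}"   // "é" as a single precomposed code point

let asciiBytes = Array(ascii.utf8)        // one code unit: 0x41
let accentedBytes = Array(accented.utf8)  // two code units: 0xC3, 0xA9

// A byte with its high-order bit set belongs to a multi-byte sequence.
let highBitSet = accentedBytes.map { ($0 & 0x80) != 0 }

print(asciiBytes, accentedBytes, highBitSet)
```

Neither 0xC3 nor 0xA9 means anything as a character; only the pair, decoded together, is "é".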

Note that the term "character" doesn't have a well-defined meaning in the world in general. Instead, we have the following terms:

- Code unit. This is a single value of one of several sizes (8 bits for UTF-8, 16 bits for UTF-16, etc.), which may or may not be meaningful in isolation. These are what the Swift "String.UTF8View" and "String.UTF16View" types represent.

- Code point. This is a 21-bit value representing one of the "characters" in the Unicode standard. In the current Unicode standard, this often is, but may not be, meaningful in isolation. These are what the Swift "Unicode.Scalar" type represents (strictly speaking, a Unicode scalar value is any code point outside the surrogate range).

- Grapheme cluster. This is a sequence of one or more code points that is regarded as a meaningful unit of writing in its language or script. This is what the Swift "Character" type represents, and probably what we mean by "character" in general use.

So, Claude, your suggestion works for ASCII codes, but for other code units it basically reinterprets the byte value as a Unicode code point, which is a "character", just not one that has any relationship to the UTF-8 value.
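The three terms above can be seen side by side on a single string. This is a small sketch with a sample string of my own choosing; it assumes Swift 4 or later, where String is itself a collection of Character:

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT: displays as a single "é".
let sample = "e\u{301}"

let codeUnits = sample.utf8.count          // 3 UTF-8 code units (1 + 2 bytes)
let scalars = sample.unicodeScalars.count  // 2 Unicode scalars
let characters = sample.count              // 1 grapheme cluster (Character)

print(codeUnits, scalars, characters)      // prints "3 2 1"
```

The same text therefore has three different "lengths", depending on which view you iterate.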


Answer 3

Claude31 OP

Oct ’17

Oops, I was indeed assuming that the string was ASCII. Not very international, to say the least.

So we could just walk through the string with myString.unicodeScalars:

    let myString = "   Hello   "
    for char in myString.unicodeScalars {
        print(char)
        if !char.isASCII {
            print("char is not ASCII")
        }
    }


Answer 4

QuinceyMorris OP

Oct ’17

Or just iterate through the string's "characters", which would produce the same result for ASCII. It's not clear why the OP is deliberately starting from the UTF-8 representation, so it's also not clear what the correct approach is.
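For reference, iterating by character might look like the sketch below. The sample string is an assumption for illustration. In Swift 4 and later you iterate the String directly; in Swift 3 you would go through its characters view instead:

```swift
let greeting = "Hi 🙂"   // sample string, not from the original post

// Iterating a String directly yields Character values (Swift 4 and later).
var pieces: [String] = []
for character in greeting {
    pieces.append(String(character))
}

print(pieces)   // prints ["H", "i", " ", "🙂"]
```

Each element is a whole grapheme cluster, so the emoji arrives as one piece rather than as four UTF-8 bytes.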


Answer 5

ShinehahGnolaum OP

Oct ’17

If I stick with my original code using string.utf8, can I still use UnicodeScalar(char), or do I need to change that?


Answer 6

Claude31 OP

Oct ’17

You can, but for non-ASCII input the result is meaningless.

Test with this string, which contains emoji:

    let myString = " 🙂 Hello 😁 "
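A possible way to run that test is sketched below. It compares the per-byte UnicodeScalar(char) conversion with decoding all the code units together. Using String(decoding:as:) here is my suggestion, not something proposed in the thread, and the variable names are illustrative:

```swift
let text = " 🙂 Hello 😁 "

// Per-byte conversion: every UTF-8 code unit is treated as if it were a code point.
var perByte = ""
for byte in text.utf8 {
    perByte.unicodeScalars.append(UnicodeScalar(byte))
}

// Decoding the code units together as UTF-8 recovers the original text.
let decoded = String(decoding: text.utf8, as: UTF8.self)

print(perByte == text)   // false: each emoji became four unrelated scalars
print(decoded == text)   // true
```

If you really do need to start from string.utf8, decoding the bytes as a sequence rather than one at a time is what keeps multi-byte characters intact.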
