Incorrect results from `isSuperset` when used with specific emoji symbols and prefix.

Hi,

It seems that the public `func isSuperset(of other: CharacterSet) -> Bool` API gives inconsistent results for some emoji symbols when used with and without prefix text. Here is a playground example:

import Foundation

let input1 = "🥀"
let input11 = "a🥀"

let input2 = "😀"
let input22 = "a😀"

let letters = CharacterSet.letters

print("'\(input1)' is part of 'CharacterSet.letters': \(letters.isSuperset(of: CharacterSet(charactersIn: input1)))") // Gives false
print("'\(input11)' is part of 'CharacterSet.letters': \(letters.isSuperset(of: CharacterSet(charactersIn: input11)))") // INCORRECT: Should give false, but it gives true

print("'\(input2)' is part of 'CharacterSet.letters': \(letters.isSuperset(of: CharacterSet(charactersIn: input2)))") // Gives false
print("'\(input22)' is part of 'CharacterSet.letters': \(letters.isSuperset(of: CharacterSet(charactersIn: input22)))") // Gives false

Output:

'🥀' is part of 'CharacterSet.letters': false
'a🥀' is part of 'CharacterSet.letters': true
'😀' is part of 'CharacterSet.letters': false
'a😀' is part of 'CharacterSet.letters': false

Has anyone observed this?

The behavior you're observing is due to the way Swift handles characters in a string when checking if they are members of a CharacterSet. When you create a CharacterSet using CharacterSet(charactersIn: input), it considers the individual characters in the input string, not the entire string as a single unit.

In your example:

let input11 = "a🥀"

When you create a CharacterSet from input11, it includes the characters "a" and "🥀". The isSuperset(of:) check then tests the membership of each element individually; the string is never treated as a single unit.

By contrast, the set created from "🥀" alone evaluates to false, because "🥀" by itself is not in CharacterSet.letters.

This behavior is consistent with how Swift handles characters and character sets. If you want to check if all characters in a string are letters, you can iterate over the string and check each character individually. For example:

let input11 = "a🥀"
let allLetters = input11.allSatisfy { $0.isLetter }
print("'\(input11)' contains only letters: \(allLetters)")

This will correctly output false for "a🥀" because it checks each character individually.
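A scalar-level variant of the same check also works, testing each Unicode scalar's membership against CharacterSet.letters directly instead of constructing a second set (isAllLetters is a name made up for this illustration):

```swift
import Foundation

let letterSet = CharacterSet.letters

// Test each Unicode scalar's membership directly; no second
// CharacterSet is ever constructed from the input string.
func isAllLetters(_ s: String) -> Bool {
    s.unicodeScalars.allSatisfy { letterSet.contains($0) }
}

print(isAllLetters("a🥀"))  // false: U+1F940 is a symbol, not a letter
print(isAllLetters("ab"))   // true
```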

I think you’re seeing a bug in initializing a CharacterSet with a multi-character string containing characters outside the Unicode Basic Multilingual Plane (BMP). In this case the emoji lives in the Supplementary Multilingual Plane, so its code point needs 17 bits, but the character set seems to drop the high bit. I get this in the Swift REPL:

  1> import Foundation
  2> print(CharacterSet(charactersIn: "a"))
<CFCharacterSet Items(U+0061)>
  3> print(CharacterSet(charactersIn: "ab"))
<CFCharacterSet Items(U+0061 U+0062)>
  4> print(CharacterSet(charactersIn: "a\u{1F940}"))
<CFCharacterSet Items(U+0061 U+F940)>

If we assume the printed description is accurate, then the last line is clearly wrong: the string contains U+1F940 WILTED FLOWER but the resulting character set contains U+F940 CJK COMPATIBILITY IDEOGRAPH-F940 (a Chinese character). Given this broken character set, your diagnostic tests produce expected results.
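The truncation hypothesis is easy to check arithmetically: keeping only the low 16 bits of U+1F940 yields exactly U+F940. A quick sketch of the suspected failure, not Foundation's actual code:

```swift
let wiltedFlower: UInt32 = 0x1F940                      // U+1F940 WILTED FLOWER
let lowBits = UInt16(truncatingIfNeeded: wiltedFlower)  // drop everything above bit 15
print(String(lowBits, radix: 16, uppercase: true))      // F940
```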

Oddly, this doesn’t happen if the string contains exactly one character. The description is formatted differently but appears to be correct, as 129344 == 0x1F940:

  5> print(CharacterSet(charactersIn: "\u{1F940}")) 
<CFCharacterSet Range(129344, 1)>

All this may relate to the known issue that NSString works differently from Swift String for characters outside the BMP because... reasons. Read the documentation around “extended grapheme clusters” if you want to dive into it.

Trying to understand why there are different results when checking the two strings "a🥀" and "a😀".

And those two emoji would produce different results in your letters test because:

  • U+1F940 WILTED FLOWER gets corrupted to U+F940 CJK COMPATIBILITY IDEOGRAPH-F940 which is actually categorized as a letter in Unicode:
  1> print("\u{F940}".unicodeScalars.first!.properties.generalCategory)
otherLetter
  • U+1F600 GRINNING FACE gets corrupted to U+F600, which has no assigned meaning (it’s in a Private Use Area), so it’s not categorized as a letter:
  2> print("\u{F600}".unicodeScalars.first!.properties.generalCategory) 
privateUse
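Those categories are exactly what CharacterSet.letters keys off (it covers the Unicode L* and M* categories), so the two corrupted code points land on opposite sides of the superset test:

```swift
import Foundation

let letters = CharacterSet.letters

// U+F940 CJK COMPATIBILITY IDEOGRAPH-F940, category Lo: a letter.
print(letters.contains(Unicode.Scalar(0xF940)!))  // true

// U+F600, a Private Use Area code point: not a letter.
print(letters.contains(Unicode.Scalar(0xF600)!))  // false
```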

CharacterSet is badly named )-: It’s a Swift wrapper around NSCharacterSet, which is more like a ‘UTF-16 code point set’. That makes sense in the Objective-C world, because the elements of an NSString are UTF-16 code points. OTOH, the elements in a Swift String are extended grapheme clusters.

In retrospect, I think we should have imported NSCharacterSet into Swift as NSCharacterSet, to make it clear that it doesn’t have the semantics you’d expect from a Swift perspective.

IMO it’s best to reserve CharacterSet for situations that play to its strengths.
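The element-type mismatch is easy to see by counting the elements in each view of the same string:

```swift
let s = "a🥀"
print(s.count)                 // 2 Characters (extended grapheme clusters)
print(s.unicodeScalars.count)  // 2 Unicode scalars
print(s.utf16.count)           // 3 UTF-16 code units: 🥀 needs a surrogate pair
```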


And to be clear, CharacterSet doesn’t produce great results even if you stick to the BMP. Consider this snippet:

let letters = CharacterSet.letters
let s1 = "i\u{0308}"
let s2 = "5\u{0308}"
for c in s1.unicodeScalars {
    print(c, letters.contains(c))
}
print("--")
for c in s2.unicodeScalars {
    print(c, letters.contains(c))
}

It prints:

i true
̈ true
--
5 false
̈ true

So the U+0308 COMBINING DIAERESIS is a ‘letter’ even when it’s used in conjunction with a digit.
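If grapheme-cluster-aware classification is what you want, Character’s own queries are a better fit. As far as I can tell, Character.isLetter keys off the cluster’s leading scalar, so the combining mark no longer counts on its own:

```swift
let s1 = "i\u{0308}"  // "i" + combining diaeresis: one Character
let s2 = "5\u{0308}"  // "5" + combining diaeresis: one Character

print(s1.allSatisfy { $0.isLetter })  // true
print(s2.allSatisfy { $0.isLetter })  // false: the base is a digit
```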

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

Arguably it’s even more confusing: not even considering combining sequences, NSString is indeed basically “UTF-16 code unit string” and doesn’t process surrogate pairs (for code points outside the BMP) for operations such as length. But then NSCharacterSet does actually support surrogate pairs and all the supplementary planes, aside from the bug that @Amiorkov encountered here.

BTW, you can actually see the bug in Apple open source in CFCharacterSet.c. In the specific scenario of creating a character set of both BMP and non-BMP characters, you end up in CFCharacterSetAddCharactersInRange() which assumes the range is in BMP and silently truncates the range’s location (starting code point) to 16 bits while appending to the existing string of BMP code points.
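Here’s a minimal Swift sketch of that failure mode (my own illustration; the real code is C in CFCharacterSet.c): appending a 32-bit code point to 16-bit UniChar storage silently truncates it.

```swift
typealias UniChar = UInt16  // CF's 16-bit character type

// Hypothetical stand-in for the BMP-only append path: the range's
// starting code point is assumed to fit in a single UniChar.
func appendAssumingBMP(_ codePoint: UInt32, to storage: inout [UniChar]) {
    storage.append(UniChar(truncatingIfNeeded: codePoint))
}

var members: [UniChar] = []
appendAssumingBMP(0x0061, to: &members)   // 'a' survives intact
appendAssumingBMP(0x1F940, to: &members)  // silently becomes 0xF940

print(members.map { String($0, radix: 16) })  // ["61", "f940"]
```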
