NSJSONSerialization silently drops U+FEFF from JSON string content — keys merge, characters vanish

TL;DR: NSJSONSerialization deletes U+FEFF (ZERO WIDTH NO-BREAK SPACE / BOM) from anywhere inside parsed JSON strings — not just a leading document BOM, and even when written as the \uFEFF escape (it's removed after unescaping). Distinct strings/keys silently collapse onto their U+FEFF-less twins. If you're seeing JSON keys mysteriously merge or a character disappear from a parsed value, this is probably why. It is not your code. Workaround and exhaustive scope below.

The workaround

Two options, depending on how attached you are to Foundation:
A. Stay on NSJSONSerialization — swap U+FEFF for a private-use sentinel before parsing, restore after. You must handle both the raw bytes and the \uFEFF escape (the escape bites too, since deletion happens post-unescape):

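#import <Foundation/Foundation.h>

// 1. Pick a private-use scalar you've verified is absent from the source text.
// 2. Replace every in-content U+FEFF (raw char AND \uFEFF escape) with it.
// 3. Parse. NSJSONSerialization preserves the sentinel.
// 4. Recursively restore the sentinel -> U+FEFF in the parsed tree.
// RestoreSentinel implements step 4: it returns a copy of o in which every
// occurrence of sentinel s (in dictionary keys and in string values) is bom again.
static id RestoreSentinel(id o, NSString *s, NSString *bom) {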
    if ([o isKindOfClass:NSString.class])
        return [o rangeOfString:s].location == NSNotFound ? o
             : [o stringByReplacingOccurrencesOfString:s withString:bom];
    if ([o isKindOfClass:NSArray.class]) {
        NSMutableArray *a = [NSMutableArray arrayWithCapacity:[o count]];
        for (id e in o) [a addObject:RestoreSentinel(e, s, bom)];
        return a;
    }
    if ([o isKindOfClass:NSDictionary.class]) {
        NSMutableDictionary *d = [NSMutableDictionary dictionary];
        [o enumerateKeysAndObjectsUsingBlock:^(id k, id v, BOOL *stop) {
            d[RestoreSentinel(k, s, bom)] = RestoreSentinel(v, s, bom);
        }];
        return d;
    }
    return o;
}

Swap the escape form with a backslash-parity-aware regex so that \\uFEFF (an escaped backslash followed by the literal text "uFEFF") is left intact:

(?<!\\)((?:\\\\)*)\\u[Ff][Ee][Ff][Ff]   ->   $1<sentinel>
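
If you'd rather not run a regex over a large file, steps 1 and 2 can also be done with a single byte scan. The following is a minimal sketch under stated assumptions, not a hardened implementation: it is plain C (so it compiles unchanged inside an Objective-C file), `swap_feff_for_sentinel` and `kSentinel` are hypothetical names, U+E000 is an arbitrary private-use pick that you still have to verify is absent from your input, the input is assumed to be UTF-8, and a BOM at byte 0 is treated as a document BOM and left alone. Backslash parity falls out of the scan: every backslash consumes the character after it, so \\uFEFF is copied through untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: U+E000 (UTF-8: EE 80 80) never occurs in the input. Verify first. */
static const char kSentinel[] = "\xEE\x80\x80";
static const char kRawFEFF[]  = "\xEF\xBB\xBF";

/* True if the four bytes at p spell "FEFF" in either case. */
static int is_feff_hex(const char *p) {
    return (p[0] == 'F' || p[0] == 'f') && (p[1] == 'E' || p[1] == 'e') &&
           (p[2] == 'F' || p[2] == 'f') && (p[3] == 'F' || p[3] == 'f');
}

/* Returns a malloc'd, NUL-terminated copy of the UTF-8 JSON text in which every
 * in-content U+FEFF (raw bytes or \uFEFF escape) is replaced by the sentinel.
 * The result is never longer than the input. Caller frees it. */
char *swap_feff_for_sentinel(const char *in, size_t len, size_t *out_len) {
    assert(in != NULL && out_len != NULL);
    char *out = malloc(len + 1);
    if (out == NULL) return NULL;

    size_t i = 0, o = 0;
    /* A BOM at the very start is a document BOM, not string content: keep it. */
    if (len >= 3 && memcmp(in, kRawFEFF, 3) == 0) {
        memcpy(out, in, 3);
        i = o = 3;
    }
    while (i < len) {
        if (in[i] == '\\' && i + 1 < len) {
            if (in[i + 1] == 'u' && i + 6 <= len && is_feff_hex(in + i + 2)) {
                memcpy(out + o, kSentinel, 3);   /* real \uFEFF escape */
                o += 3;
                i += 6;
            } else {
                out[o++] = in[i++];              /* any other escape: copy the */
                out[o++] = in[i++];              /* backslash and its partner  */
            }
        } else if (i + 3 <= len && memcmp(in + i, kRawFEFF, 3) == 0) {
            memcpy(out + o, kSentinel, 3);       /* raw U+FEFF in content */
            o += 3;
            i += 3;
        } else {
            out[o++] = in[i++];
        }
    }
    out[o] = '\0';
    *out_len = o;
    return out;
}
```

The returned buffer can then be wrapped in an NSData (for example with dataWithBytesNoCopy:length:, which takes ownership of malloc'd memory) and parsed as usual, with RestoreSentinel called on the result using @"\uE000" as the sentinel.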

B. Don't use Foundation for this file: a spec-compliant C parser like yyjson preserves U+FEFF and is faster on large files. (This is the route swift-transformers took for tokenizer.json.)

Minimal repro

// Object keys collapse:
NSData *d1 = [@"{\"\\uFEFF#\":1,\"#\":2}" dataUsingEncoding:NSUTF8StringEncoding];
id o1 = [NSJSONSerialization JSONObjectWithData:d1 options:0 error:nil];
// EXPECTED: 2 keys ("\uFEFF#" and "#");  ACTUAL: 1 key ("#") — \uFEFF stripped, keys merged

// String content lost:
NSData *d2 = [@"[\"\\uFEFF\"]" dataUsingEncoding:NSUTF8StringEncoding];
id o2 = [NSJSONSerialization JSONObjectWithData:d2 options:0 error:nil];
// EXPECTED: ["\uFEFF"] (one code point);  ACTUAL: [""] (empty string)

Same outcome whether U+FEFF arrives as raw EF BB BF bytes or the \uFEFF escape.

Why this is a bug, not a quirk

Per RFC 8259 §7, a JSON string is a sequence of Unicode code points; U+FEFF is ordinary content and doesn't require escaping. Tolerating a leading document BOM is fine (§8.1 explicitly lets parsers ignore one); deleting U+FEFF from string content is not. U+FEFF leads a double life (BOM signal vs. ZERO WIDTH NO-BREAK SPACE character); Foundation treats every occurrence as a stray BOM to scrub.

Scope — exhaustive, not anecdotal

I swept all 1,112,064 valid Unicode scalars (U+0000–U+10FFFF minus surrogates) through a parse round-trip, in both the \uXXXX-escape and raw-UTF-8 forms:

  • U+FEFF is the only scalar altered. Every other scalar round-trips byte-identically, including its closest relatives: the zero-width U+200B and U+2060 and the no-break space U+00A0 all survive.
  • No Unicode normalization occurs (NFD stays decomposed, combining sequences and compatibility characters are preserved).
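
For anyone who wants to reproduce the sweep, here is a rough sketch of how the per-scalar inputs could be generated. It is not my actual harness: the function names are hypothetical, it is plain C so it can sit in an Objective-C test file, and it only builds the test documents (the parse and compare step still has to go through NSJSONSerialization). Scalars above U+FFFF need a surrogate pair in the escape form, and the raw form has to skip or escape the characters RFC 8259 §7 forbids unescaped (U+0000–U+001F, the quotation mark, and the backslash).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* A Unicode scalar value is any code point except the surrogates. */
static int is_scalar(uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Raw form: encodes cp as UTF-8 into out (up to 4 bytes); returns the byte count. */
size_t utf8_encode(uint32_t cp, char out[4]) {
    assert(is_scalar(cp));
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Escape form: writes \uXXXX (or a surrogate pair above U+FFFF) plus a NUL.
 * Returns the length without the NUL: 6 or 12. */
size_t json_escape(uint32_t cp, char out[13]) {
    assert(is_scalar(cp));
    if (cp < 0x10000)
        return (size_t)snprintf(out, 13, "\\u%04X", (unsigned)cp);
    uint32_t v = cp - 0x10000;
    return (size_t)snprintf(out, 13, "\\u%04X\\u%04X",
                            (unsigned)(0xD800 + (v >> 10)),
                            (unsigned)(0xDC00 + (v & 0x3FF)));
}

/* Wraps a payload as a one-element JSON array: ["<payload>"].
 * buf must hold n + 5 bytes; returns the document length without the NUL. */
size_t wrap_in_array(const char *payload, size_t n, char *buf) {
    memcpy(buf, "[\"", 2);
    memcpy(buf + 2, payload, n);
    memcpy(buf + 2 + n, "\"]", 3);   /* copies the terminating NUL too */
    return n + 4;
}

/* Size of the sweep domain: all code points minus the 2,048 surrogates. */
uint32_t count_scalars(void) {
    uint32_t n = 0;
    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++)
        if (is_scalar(cp)) n++;
    return n;
}
```

Each generated document is parsed, and the single string in the result is compared against the scalar that went in; a mismatch (or an empty string) flags the scalar as altered.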

So this looks like a BOM-stripping heuristic applied too broadly to string content: narrow and fixable, not general mangling.

Why it's nasty in practice

U+FEFF is zero-width, so the corruption is invisible — no trace in a diff or editor. Real-world hit: ML tokenizer vocabularies (e.g. Google's Gemma) legitimately contain U+FEFF-bearing tokens; loading tokenizer.json via NSJSONSerialization collapses those keys and assigns wrong token IDs, with zero visible symptom until output is subtly wrong.
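
Because nothing shows up in an editor, the quickest way to find out whether a given file is exposed is to count in-content U+FEFF before parsing. A minimal sketch, again in plain C with hypothetical names (`FeffAudit`, `audit_feff`) and assuming UTF-8 input; it reports a leading document BOM separately because that one is harmless:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    int    leading_bom;  /* 1 if the text starts with EF BB BF (document BOM) */
    size_t raw;          /* raw U+FEFF occurrences after the start of the text */
    size_t escaped;      /* genuine \uFEFF escapes (not \\uFEFF) */
} FeffAudit;

/* True if the four bytes at p spell "FEFF" in either case. */
static int hex_is_feff(const char *p) {
    return (p[0] == 'F' || p[0] == 'f') && (p[1] == 'E' || p[1] == 'e') &&
           (p[2] == 'F' || p[2] == 'f') && (p[3] == 'F' || p[3] == 'f');
}

/* Scans UTF-8 JSON text and counts the U+FEFF occurrences a BOM-stripping
 * parser would delete. Any nonzero raw or escaped count means the file is at risk. */
FeffAudit audit_feff(const char *json, size_t len) {
    assert(json != NULL);
    FeffAudit a = {0, 0, 0};
    size_t i = 0;
    if (len >= 3 && memcmp(json, "\xEF\xBB\xBF", 3) == 0) {
        a.leading_bom = 1;
        i = 3;
    }
    while (i < len) {
        if (json[i] == '\\' && i + 1 < len) {
            if (json[i + 1] == 'u' && i + 6 <= len && hex_is_feff(json + i + 2)) {
                a.escaped++;
                i += 6;
            } else {
                i += 2;              /* skip the backslash and its partner */
            }
        } else if (i + 3 <= len && memcmp(json + i, "\xEF\xBB\xBF", 3) == 0) {
            a.raw++;
            i += 3;
        } else {
            i++;
        }
    }
    return a;
}
```

If both counts are zero, the file is safe to hand to NSJSONSerialization as-is; otherwise one of the two workarounds above applies.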
Filed as FB23271905 — please dupe if this has bitten you. More duplicates is what gets it triaged.
