CharacterSet for ASCII chars >= 0x80

I need to scan some text files and separate the bytes into those with the sign bit set and those without. I'd like to use a scanner to do this. How would I define a character set for this scanner?


(The reason I need to do this is that the metadata for the text is encoded into bytes with the sign bit set. These files date back to the 80's.)

So, you did not find Odyssey in html… 😁


I do not see what the problem is exactly.


Does the scanner return an array of bytes?

If so, can you filter them by their value?

See

https://stackoverflow.com/questions/40250670/swift-string-character-to-ascii-int-value


I must be missing something.

You are right! The legacy code is difficult to work with, but is much more complete.

`Scanner` works on a sequence of UTF-16 code units, not on a sequence of bytes.


One possible workaround is to read the text using String.Encoding.isoLatin1.

It maps the bytes 0x00...0xFF to U+0000...U+00FF respectively.


do {
    let textUrl = URL(/*...*/)
    let text = try String(contentsOf: textUrl, encoding: .isoLatin1)
    let scanner = Scanner(string: text)
    let asciiSet = CharacterSet(charactersIn: "\u{0}"..."\u{7F}")
    let nonAsciiSet = asciiSet.inverted
    //...
} catch {
    print(error)
}
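
To illustrate the one-to-one mapping described above, here is a minimal sketch. The sample bytes are made up for the example; the point is only that each byte should come back as the Unicode scalar with the same numeric value, and that encoding again as ISO Latin-1 should recover the original bytes.

```swift
import Foundation

// Hypothetical sample: "A" followed by two bytes with the high bit set.
let sample = Data([0x41, 0x80, 0xFF])

// Decode as ISO Latin-1: one Unicode scalar per input byte.
let decoded = String(data: sample, encoding: .isoLatin1) ?? ""
let scalarValues = decoded.unicodeScalars.map { $0.value }

// Encode back as ISO Latin-1 to recover the original bytes.
let roundTrip = decoded.data(using: .isoLatin1)
```

With this encoding, a character set built from U+0080...U+00FF corresponds to the bytes with the sign bit set.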

I managed to do it with this code:


public func metadataLines(from content: String) -> [(String, String)] {
  var lines: Array<(String,String)> = []
  let reader = Scanner(string: content)
  var textData = "", metadata = ""
  let sevenBit = CharacterSet(charactersIn: UnicodeScalar(0) ..< UnicodeScalar(0x80))
  while reader.isAtEnd == false {
    metadata = reader.scanUpToCharactersFrom(sevenBit) ?? ""
    textData = reader.scanCharactersFrom(sevenBit) ?? ""
    lines.append((metadata, textData))
  }
  return lines
}


The `sevenBit` constant (line 5 of the function) is where I construct the character set I need.


Now I'm trying to find a way to "unset" the sign bit of the metadata bytes, in order to recover the characters. I'll post a solution to that problem if I find one.

You'll almost certainly get into dire problems if you try to do this starting from a String. That's especially true if the file contains single-byte characters (with or without the sign bit).


Instead, you should read the file in as Data (that is, bytes), process the sign bit yourself, then convert the result to a String as necessary.
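
A rough sketch of that approach follows. It assumes (this is not confirmed anywhere in the thread) that each metadata byte is an ordinary 7-bit character with 0x80 added, so masking with 0x7F recovers it. The function name `splitBySignBit` and the sample bytes are made up for illustration.

```swift
import Foundation

/// Separates raw bytes by their high bit and clears that bit in the metadata.
/// Assumption: metadata bytes are 7-bit characters with 0x80 added.
func splitBySignBit(_ data: Data) -> (metadata: String, text: String) {
    var metadata = ""
    var text = ""
    for byte in data {
        if (byte & 0x80) != 0 {
            // High bit set: mask it off to recover the 7-bit character.
            metadata.unicodeScalars.append(Unicode.Scalar(byte & 0x7F))
        } else {
            // Plain 7-bit byte: use it as-is.
            text.unicodeScalars.append(Unicode.Scalar(byte))
        }
    }
    return (metadata, text)
}

// In real use the bytes would come from the file, for example via
// Data(contentsOf:) with a URL of your own; here they are hypothetical.
let sample = Data([0xC8, 0xE9, 0x48, 0x69])
let parts = splitBySignBit(sample)
```

Because the work is done on bytes, no string encoding is involved until the very last step.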


Note also that ASCII characters are 0 ... 0x7F. Values 0x80 and above are not ASCII. You can assume that 0 ... 0x7F are "embedded" in Unicode as their ASCII values, but you cannot assume this for 0x80 and above.
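
A small sketch of why the encoding matters for bytes above 0x7F, using a single arbitrary byte (0xE9) chosen for the example:

```swift
import Foundation

let highByte = Data([0xE9])

// As UTF-8, a lone 0xE9 is an incomplete sequence, so decoding fails.
let asUTF8 = String(data: highByte, encoding: .utf8)

// As ISO Latin-1, the same byte decodes to U+00E9.
let asLatin1 = String(data: highByte, encoding: .isoLatin1)

// Going the other way, UTF-8 needs two bytes to encode U+00E9.
let utf8Bytes = Array("\u{E9}".utf8)
```

The same byte is a character in one encoding and invalid in another, which is why the bytes have to be inspected before any String is created.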

Yes, I just realized I'll get different bytes depending on which encoding I use to read the file. That would definitely lead to problems! Thanks very much for your comment.

Accepted Answer

After reading the advice here, I came up with some code. I really like scanners, so I made something similar for Data objects. I process my legacy data in the `linesFrom(legacyData:)` function (line 31 of the listing).


struct DataScanner {
  var index: Int
  let data: Data
  var isAtEnd: Bool {
    return index == data.count
  }
  init(data: Data) {
    self.data = data
    index = 0
  }

  mutating func scanByesInRange(_ range: Range<UInt8>) -> Array<UInt8> {
    var bytes: Array<UInt8> = []
    while index < data.count && range.contains(data[index]) {
      bytes.append(data[index])
     index += 1
    }
    return bytes
  }

  mutating func scanUpToByesInRange(_ range: Range<UInt8>) -> Array<UInt8> {
    var bytes: Array<UInt8> = []
    while index < data.count && range.contains(data[index]) == false {
      bytes.append(data[index])
      index += 1
    }
    return bytes
  }
}

func linesFrom(legacyData: Data) -> Array<(Data, String)> {
  var array: Array<(Data, String)> = []
  var reader = DataScanner(data: legacyData)
  let asciiRange = UInt8(0x0) ..< UInt8(0x80)
  while reader.isAtEnd == false {
    let metadata = reader.scanUpToByesInRange(asciiRange)
    let textData = reader.scanByesInRange(asciiRange)
    array.append((
      Data(metadata),
      textData.reduce("") {(s, b) -> String in
        s + (String(UnicodeScalar(b)))
      }))
  }
  return array
}


This gives me an array of tuples, each of which pairs some metadata with a text string. So far it works well on the legacy files I have. But if I've made any more blunders, please let me know!


One thing I'm curious about: I can instantiate a Scanner as a let constant. But when I do this with my data scanner, I get a "Cannot use mutating member on immutable value" error when I try to use it. I know it's a minor point, but is there some way of defining DataScanner so that I can instantiate it with let?

Scanner is just the Swift name for the Objective-C class `NSScanner`, so it's a reference type.


You defined your DataScanner as a struct, so it's a value type.


If you want to use `let` even though you have mutating operations, you need to define it as a class.


But the choice between a class and a struct deserves more careful consideration.
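
As an illustration of the class option, here is a minimal sketch. `ByteScanner` and its closure-based `scan(while:)` method are hypothetical names, not the code from the accepted answer. Because it is a class, the methods need no `mutating` keyword and an instance can be held in a `let` constant.

```swift
import Foundation

final class ByteScanner {
    private(set) var index: Int
    let data: Data

    var isAtEnd: Bool { index == data.endIndex }

    init(data: Data) {
        self.data = data
        self.index = data.startIndex
    }

    /// Consumes bytes for as long as `predicate` returns true.
    func scan(while predicate: (UInt8) -> Bool) -> [UInt8] {
        var bytes: [UInt8] = []
        while index < data.endIndex, predicate(data[index]) {
            bytes.append(data[index])
            index += 1
        }
        return bytes
    }
}

// `let` works because only the object's internal state changes,
// not the reference stored in the constant.
let reader = ByteScanner(data: Data([0x81, 0x82, 0x41, 0x42]))
let metadata = reader.scan { $0 >= 0x80 }
let text = reader.scan { $0 < 0x80 }
```

The trade-off is that copies of a class reference share one cursor, whereas each copy of a struct keeps its own position.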
