>> Why would the standard library not include something simpler like this?
Of course, I don't know, but I have a possible explanation that I replay to myself whenever I start cursing at String's awkward syntax.
The explanation is historical. The class NSString is jam-packed with useful functionality, but it's always been weird that the canonical underlying data representation is UTF-16. UTF-16 is fine, but there's plenty of stuff stored as UTF-8, and even now plenty of sources of pure 8-bit character arrays. This means that there are lots of actual data conversions to and from UTF-16. (At the time NSString adopted UTF-16, around 1990, the text world was quite different. Back then, it looked like 16-bit units for text would obviously "win" over 8-bit units ("wchar clobbers char, photos at 11!"). UTF-16 seemed like the forward-looking option, and UTF-8 looked like a compatibility footnote. Obviously, that's not what happened.)
In practice, concrete subclasses of NSString may actually store their data in forms other than UTF-16, which is often great for efficiency, but they are hampered by the need to present UTF-16 semantics publicly.
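To make that concrete, here's a minimal sketch of what "UTF-16 semantics" means at the API level: whatever a concrete subclass stores internally, `length` and `character(at:)` are defined in terms of UTF-16 code units.

```swift
import Foundation

let ns = NSString(string: "café 🎸")

// `length` counts UTF-16 code units, not characters: "café " is five
// BMP scalars, and 🎸 (U+1F3B8) needs a surrogate pair, so 7 in total.
print(ns.length)            // 7

// `character(at:)` returns a single UTF-16 code unit (unichar); index 5
// lands on the high surrogate of 🎸, not on a whole character.
print(ns.character(at: 5))  // 55356 (0xD83C)
```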
If you think about Swift's String type, you'll realize that it goes to extreme lengths to avoid being specific about the underlying data representation. That design avoids the NSString problem of "surprising" performance bottlenecks, where simple-looking loop code quietly triggers unwanted data conversion.
The abstraction also lets it define Character in terms of grapheme clusters (which are variable-length units of information) instead of fixed-size code units like UTF-16 or UTF-8, bridging the remaining difficulties in using Unicode to represent text globally and getting it right.
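A quick illustration of the difference, using a flag character, which is a single grapheme cluster built from two Unicode scalars:

```swift
let flag = "🇨🇦"  // regional indicators C + A

print(flag.count)                 // 1  (one Character, i.e. one grapheme cluster)
print(flag.unicodeScalars.count)  // 2  (U+1F1E8, U+1F1E6)
print(flag.utf16.count)           // 4  (each scalar is a surrogate pair)
print(flag.utf8.count)            // 8  (each scalar is four UTF-8 bytes)
```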
I don't see any simple syntax for String subscripting that doesn't promise to turn into a horrible performance bottleneck in some piece of code or other, or with some particular string's underlying representation or other. There's no readily-accessible direct-indexing notation that works well without advance knowledge of, or assumptions about, the string implementation.
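For example, here's the kind of integer subscript people keep asking for, written as a deliberately naive sketch (this extension is my own illustration, not anything in the standard library). Because grapheme cluster boundaries can only be found by scanning, each access walks from startIndex:

```swift
extension String {
    // Illustrative only: hides an O(n) walk behind innocent-looking syntax.
    subscript(_ offset: Int) -> Character {
        self[index(startIndex, offsetBy: offset)]
    }
}

let s = "a string of some length"

// Looks like a cheap array loop, but each s[i] is O(n),
// so the whole loop is O(n²).
for i in 0..<s.count {
    print(s[i])
}
```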
Even the related types that present strings as arrays of UTF-8, UTF-16, or UTF-32 code units are arranged so that they don't assume a particular representation. (I think that's why these types have changed so often. The original APIs had hidden biases about the relationship between the client's view of the data and its actual representation, as if wanting to access string data as UTF-8 implied that it probably originated as UTF-8, which is obviously not true.)
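Here's a small sketch of those views in action; each one is a projection over whatever the storage happens to be, and indices translate between views without any representation assumption:

```swift
let s = "naïve"

print(Array(s.utf8))   // [110, 97, 195, 175, 118, 101]  (ï is two bytes)
print(Array(s.utf16))  // [110, 97, 239, 118, 101]       (ï is one code unit)

// An index found in one view can be translated back to the String itself
// without knowing how the data is actually stored.
if let i = s.utf16.firstIndex(of: 239),
   let j = i.samePosition(in: s) {
    print(s[j])        // ï
}
```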
FWIW.