How to `split` without consuming values

Swift’s built-in split function returns an array of subsequences, split from a sequence or collection’s elements. For example, you may have a string and want to split it into words. Here’s a simple approach that consumes whitespace, newline, and punctuation characters:

import Foundation

let string = """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    Praesent in purus a ante semper congue posuere at lacus.
    Nullam faucibus sem vel sem vestibulum, a ullamcorper nunc
    auctor.  
    """

extension Character {
    var firstScalar: UnicodeScalar {
        return self.unicodeScalars.first ??  UnicodeScalar(0)
    }
}

// NOTE: This code demonstrates a simple way to split a collection.
// It's not a great way to actually tokenize a string, lacking both
// localization and in-word punctuation handling such as
// hyphens and single quotes.
let targetSet = CharacterSet.whitespacesAndNewlines
    .union(CharacterSet.punctuationCharacters)
let isWordSeparator: (Character) -> Bool = { c in targetSet.contains(c.firstScalar) }
let words = string.split(whereSeparator: isWordSeparator)

But what happens when you want to split a sequence or collection without consuming values? There’s no built-in Swift solution to turn to.

A few months ago, Soroush Khanlou and I were playing around with this problem, trying to return subsequences around predicate boundaries. When the predicate failed, a new subsequence would begin. We wanted to break sequences  along logical lines, for example, where a value changed or a slope trend updated from increasing to decreasing:

let x = [1, 2, 2, 3, 3, 3, 1]
 .sliced(where: { $0 != $1 })
// [ArraySlice([1]), ArraySlice([2, 2]), ArraySlice([3, 3, 3]), ArraySlice([1])]
let z = [1, 2, 2, 1, 3, 3, 1].sliced(where: >)
// [ArraySlice([1, 2, 2]), ArraySlice([1, 3, 3]), ArraySlice([1])]

We decided to work with arrays and use ArraySlice instances. Each ArraySlice provides a view onto a larger array:

Instead of copying over the elements of a slice to new storage, an `ArraySlice` instance presents a view onto the storage of a larger array. And because `ArraySlice` presents the same interface as `Array`, you can generally perform the same operations on a slice as you could on the original array.

Slices don’t allocate new storage so the splits are efficient. Be careful not to hold onto slices long term. Each slice holds a reference to the array it describes, so you can run into memory management issues if the slice outlives the array it points to.

For safety, reference the start and end of each slice with startIndex and endIndex. These values are often not zero or the array count, the way you’d expect with actual array instances:

let array = ["a", "b", "c", "d", "e"]
let slice = array[1...3]
print(slice.startIndex) // 1
print(slice.endIndex) // 4

Here’s the code I eventually settled on. It extends ArraySlice with a recursive slicing function that ends when it runs out of data or exceeds the maximum partition count. For each slice, it walks the elements until it finds a member element that fails the predicate test, at which point it recurses with a new slice. The ArraySlice code is hidden behind an Array entry point, which wraps the slice functionality away from public view.

Here are some examples of how the code could be used to break apart arrays. They show creating subsequences and run length encoding. You could easily adapt this to more complex data. For example, you might use keypaths to chunk data structures based on time stamps or look for inflection points for rising and falling values.

// Same value
let x = [1, 2, 2, 3, 3, 3, 1]
    .sliced(where: { $0 != $1 })
// [ArraySlice([1]), ArraySlice([2, 2]), ArraySlice([3, 3, 3]), ArraySlice([1])]

// Run-length encoding
let xx = [1, 2, 2, 3, 3, 3, 3, 3, 1, 1, 2, 1, 1]
  .sliced(where: !=)
  .map { ($0.count, $0.first!) }
// [(1, 1), (2, 2), (5, 3), (2, 1), (1, 2), (2, 1)]

// Strings
let y = Array("aaaabbbbcccdef")
  .sliced(where: !=)
  .map({ String($0) })
// ["aaaa", "bbbb", "ccc", "d", "e", "f"]

// Increasing subsequences
let z = [1, 2, 2, 1, 3, 3, 1].sliced(where: >)
// [ArraySlice([1, 2, 2]), ArraySlice([1, 3, 3]), ArraySlice([1])]

// Decreasing subsequences
let w = [1, 2, 2, 1, 3, 3, 1].sliced(where: <)
// [ArraySlice([1]), ArraySlice([2, 2, 1]), ArraySlice([3, 3, 1])]

Got suggestions for improvements? Drop a note in the comments. Thanks!

One Comment

  • Isn’t the predicate for this function usually positive? (i.e., the inverse of this one) Like in itertools or Data.List.

    I think you can make it lazy, also, and generic over collections:

    extension Collection {
      func group(_ eq: @escaping (Element, Element) -> Bool) -> UnfoldSequence {
        return sequence(state: startIndex) { i in
          guard i < self.endIndex else { return nil }
          let s = i
          var x: Self.Element
          repeat {
            x = self[i]
            self.formIndex(after: &i)
          } while i < self.endIndex && eq(x,self[i])
          return self[s..<i]
        }
      }
    }