Fun with Unicode Names

I don’t use Unicode all that often but I tend to use the character picker copypasta or hex codes when I do:

var ghoti = "🐟" // from character picker
print(ghoti) // 🐟
ghoti = String(UnicodeScalar(0x1F41F)!)
print(ghoti) // 🐟

Cocoa also supports loading unicode characters by name using the \N{UNICODE CHARACTER NAME} escape sequence. You can use patterns to construct unicode characters as in the following example:

// Make sure to escape the backslash with a second
// backslash to allow proper string construction
        reverse: true) // 🐟
let constructed = "I want to eat a \\N{FISH} sandwich"
        reverse: true)
print(constructed!) // "I want to eat a 🐟 sandwich"

This is a reverse transform, in that it converts from escaped names to the symbol it represents. The forward transform takes a string and inserts name escape sequences in place of unicode characters:

let transformed = "🐶🐮💩"
    reverse: false)

Unicode escapes are also usable in Cocoa regex matching. This example searches for the little blue fish in a string, printing out the results from that point:

let fishPattern = "\\N{FISH}"
let regex = try! NSRegularExpression(pattern: fishPattern, options: [])

let string = "I wish I had a 🐟 to eat"

// You have to use Cocoa-style ranges. Ugh.
let range = NSRange(location: 0, length: string.characters.count)

// There's a fair degree of turbulence between 
// the Cocoa API and Swift here, especially with
// the Boolean stop pointer
regex.enumerateMatches(in: string, options: [], range: range) {
    (result, flags, stopBoolPtr) in
    guard let result = result
        else { print("Missing text checking result"); return }
    let substring = string.substring(from: 
        string.index(string.startIndex, offsetBy: result.range.location))
    print(substring) // "🐟 to eat"

It’s hard going back from Swift’s string indexing model to Cocoa’s NSRange system. Native regex can’t arrive soon enough.

You can also break down unicode scalars to components:

var utf16View = UnicodeScalar("🐟")!.utf16
print(utf16View[0], utf16View[1]) // 55357 56351
print(String(utf16View[0], radix:16), 
    String(utf16View[1], radix: 16)) // d83d dc1f

This scalar approach goes boom  when you try to push into highly composed characters:


let utf16View = "👨‍👩‍👦‍👦".utf16
for c in utf16View {
    print(c, "\t", String(c, radix: 16))
// 55357 	 d83d
// 56424 	 dc68
// 8205 	 200d
// 55357 	 d83d
// 56425 	 dc69
// 8205 	 200d
// 55357 	 d83d
// 56422 	 dc66
// 8205 	 200d
// 55357 	 d83d
// 56422 	 dc66

It’s interesting to see the four d83d components in there.

Got any fun little Unicode tricks? Drop a comment, a tweet, or an email and let me know.


One Comment

  • d83d is a High Surrogate code point. “d83d dc66” is the surrogate pair for U+1F466, BOY. Because utf16 is so much fun. 😢