PDFKit findString(_:withOptions:) with a regular expression returns no results

I'm building a custom machine learning algorithm to extract parts of an invoice, so I need to feed the words and their bounding boxes into a model. To do that, I tokenize the PDF page string and then call

func findString(_ string: String, withOptions options: NSString.CompareOptions = []) -> [PDFSelection]

For the NSString.CompareOptions argument I pass .regularExpression, but I'm not getting any results.

Here's my code.

// tokenize string and remove empty tokens
var dummy = pdfString!.components(separatedBy: "\n").joined(separator: " ").components(separatedBy: " ").filter { $0 != "" }
// loop over every token and search for it with its position
var resultsFound = [[PDFSelection]]()
for word in dummy {
     let pattern = "\\b" + word + "\\b"
     resultsFound.append(facturaPDF!.findString(pattern, withOptions: .regularExpression))
}
resultsFound.count

// add the results to the page
for results in resultsFound {
     for result in results {
          let highlight = PDFAnnotation(bounds: result.bounds(for: paginaPDF!), forType: .highlight, withProperties: nil)
          highlight.endLineStyle = .square
          highlight.color = UIColor.orange.withAlphaComponent(0.5)
          paginaPDF!.addAnnotation(highlight)
     }
}

If anyone has any suggestions I'll be grateful 🙂

What does not work?

Is it the regex? Please post it.

Is it adding annotations?

As far as I can tell from trying it, the method `findString(_:withOptions:)` does not respect the option `.regularExpression`.


The header doc of the method reads as follows:

    // Searches entire document for string and returns an array of PDFSelections representing all instances found. May 
    // return an empty array (if not found). Supported options are: NSCaseInsensitiveSearch, NSLiteralSearch, and 
    // NSBackwardsSearch.
    open func findString(_ string: String, withOptions options: NSString.CompareOptions = []) -> [PDFSelection]

It seems regular expressions are not supported in `findString(_:withOptions:)` (nor in any of the other `find` methods). You may need to find another way to detect what you want.

But the option does appear for me. Still, if that's what the documentation says, I guess you're right.

Is there any way you figure I can find the position of every word in a pdf?


Thanks in advance

The doc for NSString.CompareOptions, the options type used by findString(_:fromSelection:withOptions:), says:

static var regularExpression: NSString.CompareOptions

The search string is treated as an ICU-compatible regular expression. If set, no other options can apply except caseInsensitive and anchored. You can use this option only with the rangeOfString:… methods and replacingOccurrences(of:with:options:range:).


Did you consider using enumerateMatches(in:options:range:using:) on the text in facturaPDF?
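As a rough illustration of that suggestion, here is a minimal sketch that tokenizes a plain string with `NSRegularExpression.enumerateMatches(in:options:range:using:)` and keeps the range of each token. The helper name `tokenRanges(in:)`, the `\S+` pattern, and the sample text are assumptions made up for this example; in the real case the input would be the page text.

```swift
import Foundation

// Hypothetical helper: returns every whitespace-delimited token in `text`
// together with its NSRange (expressed in UTF-16 code units).
func tokenRanges(in text: String) -> [(token: String, range: NSRange)] {
    // "\\S+" matches runs of non-whitespace characters. The pattern is a
    // constant known to be valid, so try! is acceptable in this sketch.
    let regex = try! NSRegularExpression(pattern: "\\S+")
    let fullRange = NSRange(text.startIndex..., in: text)
    var tokens: [(token: String, range: NSRange)] = []

    regex.enumerateMatches(in: text, options: [], range: fullRange) { match, _, _ in
        // Skip progress callbacks without a match and ranges that cannot
        // be mapped back to String indices.
        guard let match = match, let range = Range(match.range, in: text) else { return }
        tokens.append((token: String(text[range]), range: match.range))
    }
    return tokens
}

// Sample text standing in for a page string.
let tokens = tokenRanges(in: "Invoice No. 42\nTotal: 1.234,56 EUR")
for entry in tokens {
    print(entry.token, entry.range.location, entry.range.length)
}
```

Because each token arrives with its own range, there is no need to search for the word a second time, and tokens containing regex metacharacters (such as `1.234,56`) never have to be turned into patterns.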

Accepted Answer

But the option does appear for me.

`findString(_:withOptions:)` just uses the same option type as the usual String operations. That does not mean all the options defined in the type are supported.

Whether an option appears has nothing to do with whether it is supported.


Is there any way you figure I can find the position of every word in a pdf?

I do not know what would be the best way. (You can send a feature request and wait...)

But if you want to get an Array of PDFSelection, you can convert the usual NSRanges in this way:

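import PDFKit

// Assumes `document` is your PDFDocument and `pattern` is the regular expression to search for.
let regex = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)
var foundSelections: [PDFSelection] = []
let numPages = document.pageCount
for i in 0..<numPages {
    guard let page = document.page(at: i), let text = page.string else { continue }

    // Match against the page text; NSRange positions are UTF-16 offsets.
    let results = regex.matches(in: text, range: NSRange(0..<text.utf16.count))
    for result in results {
        // Skip empty matches, which would give an end index before the start.
        guard result.range.length > 0 else { continue }
        let startIndex = result.range.location
        // endIndex is the index of the last matched character.
        let endIndex = result.range.location + result.range.length - 1
        guard let selection = document.selection(from: page, atCharacterIndex: startIndex,
                                                 to: page, atCharacterIndex: endIndex) else { continue }
        foundSelections.append(selection)
    }
}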

(Calculation of `startIndex` and `endIndex` may be wrong if your PDF contains non-BMP characters.)
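To make that caveat concrete, here is a small sketch showing how the same match gets different offsets depending on whether you count UTF-16 code units (what `NSRange` reports) or Swift `Character`s. The sample text is made up, and the thread does not confirm which kind of index PDFKit expects, so treat this only as a way to check whether your text is affected.

```swift
import Foundation

// Sample text with one non-BMP character (the emoji) before the number.
let text = "Total 😀 42"
let regex = try! NSRegularExpression(pattern: "\\d+")
let match = regex.firstMatch(in: text, range: NSRange(text.startIndex..., in: text))!

// NSRange locations count UTF-16 code units; the emoji occupies two of them.
let utf16Offset = match.range.location

// Converting to Range<String.Index> lets us count whole Characters instead.
let swiftRange = Range(match.range, in: text)!
let characterOffset = text.distance(from: text.startIndex, to: swiftRange.lowerBound)

print(utf16Offset, characterOffset)
```

For this sample the two offsets are 9 and 8. If `text.utf16.count == text.count` holds for a page string, the two ways of counting agree for that page and the caveat does not apply.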

Great! I’ll try it.


Thanks
