Using Swift to get (scrape) data from a web page

I want to get a number off a webpage, to use in my app. Scraping, I guess it's called. I've done a little research on this and found this code:


        let url = NSURL(string: "https://website.com")
        let task = NSURLSession.sharedSession().dataTaskWithURL(url!) { (data, response, error) in
            // Decode the downloaded bytes as UTF-8 text and print the page source.
            print(NSString(data: data!, encoding: NSUTF8StringEncoding))
        }
        task!.resume()


This works beautifully and sends the source code of the page to the output console in Xcode. But the source code doesn't contain the number I want to grab. So instead of the code that calls a script to get the number from the server, I need the result of that code. I guess that's the HTML or XML that the browser normally renders to display the number itself.


I can't seem to find the right command for this.

Replies

I've done a little more research and playing around and came across an example of using WKWebView. This is very interesting since it gives the rendered page instead of the code. The example fills the View with the webpage and covers up my user interface, so it is not yet a solution for me.


I'm at a loss how to extract what I need but I feel I'm getting closer.


Here is the code from that example:

In the ViewController:


    import WebKit
    @IBOutlet var containerView : UIView? = nil
    var webView: WKWebView?


and

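    override func loadView() {
        super.loadView()
        // Make the web view the controller's root view (this is why it covers the rest of the UI).
        self.webView = WKWebView()
        self.view = self.webView
    }


    override func viewDidLoad() {
        super.viewDidLoad()
        let url = NSURL(string: "http://www.kinderas.com/")
        let req = NSURLRequest(URL: url!)   // the URL is optional and must be unwrapped
        self.webView!.loadRequest(req)
    }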

This really depends on how the web site is structured. For simple web sites you can download the HTML and parse that but a lot of web sites are complex in a way that makes this impossible (for example, if they generate their content dynamically via JavaScript). If you're dealing with a complex site you must run the site in a web view in order for it to render correctly. Once you do that you can run JavaScript within that web view to get out the data you want.
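For the complex case, a minimal sketch of "run JavaScript within that web view" in current Swift might look like the following. The class name `PageScraper`, the helper `textExtractionScript`, and the CSS selector you pass in are all hypothetical; you would need to inspect the real page to find the element that holds the value. It also assumes the value is already in the DOM when the page finishes loading, which a heavily scripted page does not guarantee.

```swift
import Foundation
#if canImport(WebKit)
import WebKit
#endif

/// Builds a JavaScript expression that returns the text of the first element
/// matching `selector`, or an empty string if nothing matches.
func textExtractionScript(selector: String) -> String {
    // Escape backslashes and single quotes so the selector is a valid JS string literal.
    let escaped = selector
        .replacingOccurrences(of: "\\", with: "\\\\")
        .replacingOccurrences(of: "'", with: "\\'")
    return "(document.querySelector('\(escaped)') || {}).innerText || ''"
}

#if canImport(WebKit)
/// Loads a page in a web view that is never added to a view hierarchy,
/// then reads one value from the rendered DOM once loading finishes.
/// The caller must keep this object alive; the web view holds its delegate weakly.
@MainActor
final class PageScraper: NSObject, WKNavigationDelegate {
    private let webView = WKWebView(frame: .zero)
    private let selector: String
    private let completion: (String?) -> Void

    init(selector: String, completion: @escaping (String?) -> Void) {
        self.selector = selector
        self.completion = completion
        super.init()
        webView.navigationDelegate = self
    }

    func load(_ url: URL) {
        webView.load(URLRequest(url: url))
    }

    // Called when the main document has finished loading.
    func webView(_ webView: WKWebView, didFinish navigation: WKNavigation!) {
        webView.evaluateJavaScript(textExtractionScript(selector: selector)) { result, _ in
            self.completion(result as? String)
        }
    }
}
#endif
```

Because the web view is never installed in the view hierarchy, it does not cover the app's own user interface.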

Needless to say this isn't particularly easy. Neither is it particularly related to Swift (-: If you have follow-up questions I recommend you post them to either:

  • Core OS > Networking for the networking side, or

  • one of the Safari and Web topic areas, for the JavaScript side of things

Share and Enjoy

Quinn "The Eskimo!"
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1@apple.com"

One important point: Do the owners of the website being scraped allow screen scraping?

Thanks for the reply. You have confirmed my suspicion that this is going to be challenging, even if I knew what I was doing, which I don't.


Right now I want data from two sites. One thing I need is just weather data, and I'm assuming this will be more straightforward.


The more difficult one is my local power company. I want to get their real-time-pricing for electricity. See rrtp.comed.com. They offer the hourly electricity price publicly, but the page containing the value is generated dynamically. A view of the source does not contain the value, but of course the value appears in a browser or in WKWebView.


I've communicated with Com Ed and learned that they get the pricing value from a third party. They expressed a desire to support home automation efforts like mine. However, I need to make do in the meantime. They offered no schedule for making this easier. They DO support the If This Then That (IFTTT) tool, so perhaps there would be a way to feed myself the value. Hmmm....

That's a grey area today. It's going to be quite some time before my app moves beyond development and personal use, and I'm hoping the grey area is resolved by the time that happens.


As I said in the other reply, I'm assuming the weather data I need is "scrape-able" without issue, perhaps from the National Weather Service.

If the web page has scripts, it's possible for you to inject your own script into the downloaded page, then call it. Your script can call one of the existing scripts and return a value to you in a "post back" message. This is all terribly complicated (for me it was), and daunting, as there are few examples to go by.


One of the WWDC 2014 sessions covered this topic; I believe it was "Introducing the Modern WebKit API". In the end I froze the video (or slide) and did a screen print to access the otherwise unavailable source code.


In brief, you use a WKWebView (and perhaps you can make it invisible or offscreen), you tell it to connect to a URL, at some point you add your own script, then when the page has loaded, you invoke your script, which posts back some data. I don't know JavaScript so this was a real PITA. In the end I was able to get a form listener installed, so when a user logged in I could determine the email address used.
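A rough sketch of that inject-and-post-back flow in current Swift, for anyone following along. This is not the code from the WWDC session; the names `PostBackReceiver`, `postBackScript`, and the channel name "scraped" are made up for illustration, and the selector is assumed to contain no single quotes.

```swift
import Foundation
#if canImport(WebKit)
import WebKit
#endif

// Name of the message channel; "scraped" is an arbitrary, hypothetical choice.
let handlerName = "scraped"

/// JavaScript to inject into the page. It looks up one element and posts its
/// text back to the app through the named message handler.
func postBackScript(selector: String, handler: String) -> String {
    return """
    (function () {
        var el = document.querySelector('\(selector)');
        window.webkit.messageHandlers.\(handler).postMessage(el ? el.innerText : '');
    })();
    """
}

#if canImport(WebKit)
@MainActor
final class PostBackReceiver: NSObject, WKScriptMessageHandler {
    /// Called with whatever text the injected script posts back.
    var onValue: ((String) -> Void)?

    /// Creates a web view whose pages get the script injected once the document has loaded.
    func makeWebView(selector: String) -> WKWebView {
        let controller = WKUserContentController()
        controller.addUserScript(WKUserScript(
            source: postBackScript(selector: selector, handler: handlerName),
            injectionTime: .atDocumentEnd,
            forMainFrameOnly: true))
        controller.add(self, name: handlerName)

        let configuration = WKWebViewConfiguration()
        configuration.userContentController = controller
        return WKWebView(frame: .zero, configuration: configuration)
    }

    // Receives the message posted by the injected script.
    func userContentController(_ userContentController: WKUserContentController,
                               didReceive message: WKScriptMessage) {
        if let text = message.body as? String { onValue?(text) }
    }
}
#endif
```

The web view returned by `makeWebView` can stay offscreen; you call `load` on it with a `URLRequest` and wait for `onValue` to fire.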


Good luck!


PS: in the example of how to inject a script, they used "Wikipedia" as an example site, so you can search for that on the asciiwwdc2014 site to find sessions that used that term.

Not only terribly complicated; it's also terribly fragile. The slightest change in the way the web site is implemented and your app will break. Your users will have a couple of weeks to leave lots of one-star reviews and bad-mouth you on forums while you frantically try to get a patch through app review.


It's really best to use published APIs for production code. Scraping HTML usually ends in tears at some point.

I try to separate out the "how do I" from the "should I". My point is that it is possible to do some pretty astounding things with WKWebView. Apple hyped this capability in 2014.


Unfortunately, in the real world it's not possible to get a supported API for much, if not most, of the information shown on the web. Obviously, if you use web scraping, you'd better prepare your users for the possibility that your app might stop working for a few weeks while you figure out how to adapt to some change and then get your updated app into the store.


Even better would be some way to supply the recipe for getting the data dynamically, in some blob you download from CloudKit daily.
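To make the recipe idea concrete, here is one possible shape for it, sketched in current Swift. Everything here is an assumption: the `ScrapeRecipe` format, its field names, and the sample text are invented, and the CloudKit download itself is left out (the JSON is inline).

```swift
import Foundation

/// Hypothetical recipe: the marker to search for, how many characters to
/// back up from it, and how many characters to keep.
struct ScrapeRecipe: Codable, Equatable {
    let marker: String
    let backUp: Int
    let length: Int
}

/// Applies a recipe to the downloaded text. Returns nil if the marker is
/// missing or the offsets run off either end of the text.
func extract(using recipe: ScrapeRecipe, from text: String) -> String? {
    guard recipe.backUp >= 0, recipe.length >= 0,
          let markerRange = text.range(of: recipe.marker),
          let start = text.index(markerRange.lowerBound, offsetBy: -recipe.backUp,
                                 limitedBy: text.startIndex),
          let end = text.index(start, offsetBy: recipe.length, limitedBy: text.endIndex)
    else { return nil }
    return String(text[start..<end])
}

// In the app this JSON would come from the downloaded blob; here it is inline.
let recipeJSON = Data(#"{"marker": "per kWh", "backUp": 6, "length": 4}"#.utf8)
let recipe = try? JSONDecoder().decode(ScrapeRecipe.self, from: recipeJSON)
```

When the site changes its wording, only the blob needs updating, not the app binary.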


So no disagreement from me on the fragile nature of web scraping.

Well I've gotten a lot closer but I'm still struggling.


I've since learned that the data I need can be accessed by directly linking to a servlet in a browser, see here for instance. So the "page" I need to extract the data from is much smaller, just a couple lines of text.


The code in my first post nicely prints these couple of lines to the console, and the numerical result I need is in the printed output:


      print(NSString(data: data!, encoding: NSUTF8StringEncoding))


But while I can print it, I can't seem to get the NSString data into a Swift string. Once I get past that, I can use the following code to clip out just the data I need.


    var myString  = something something from NSString(data: data!, encoding: NSUTF8StringEncoding)
    let range = myString.rangeOfString("per kWh")
    let starter = range!.startIndex
    let startIndex = advance(distance(myString.startIndex, starter), -6)   // backs up the index from where the range was found
    let myShortenedString = myString.substringFromIndex(advance(myString.startIndex, startIndex))  // gets the right-hand part of the string
    let myFinalString = myShortenedString.substringToIndex(advance(myShortenedString.startIndex, 4))  // deletes all but the first 4 chars of the right-hand part of the string
    let myNumber = NSNumberFormatter().numberFromString(myFinalString)!.doubleValue   // converts the string to a number


Almost there!

I ended up using a combination of AppleScript and Google Chrome (since Safari doesn't seem to like dealing with 'do JavaScript' very well) to run various JavaScript snippets against a web page to grab information (in this case, insurance claims from a dynamically generated web page).

var myString = String(NSString(data: data!, encoding: NSUTF8StringEncoding)!)

Edited for misspelled parameter value and missing parenthesis.
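For readers on a current toolchain, the same conversion and the "back up 6, keep 4" clipping from the earlier post can be sketched as below. The thread's code predates Swift 3, so the API names differ. The helper name `clip` and the sample response text are made up; the real servlet output may be laid out differently, in which case the offsets would need adjusting.

```swift
import Foundation

/// Finds `marker` in `text`, backs up `offset` characters from where it starts,
/// and returns the next `length` characters. Returns nil if the marker is
/// missing or the offsets run off either end of the text.
func clip(_ text: String, before marker: String, backingUp offset: Int, length: Int) -> String? {
    guard offset >= 0, length >= 0,
          let markerRange = text.range(of: marker),
          let start = text.index(markerRange.lowerBound, offsetBy: -offset,
                                 limitedBy: text.startIndex),
          let end = text.index(start, offsetBy: length, limitedBy: text.endIndex)
    else { return nil }
    return String(text[start..<end])
}

// Made-up sample standing in for the bytes the data task hands back.
let sampleData = Data("Current price: 3.25¢ per kWh".utf8)

// String(data:encoding:) yields an optional Swift String directly, no NSString needed.
let body = String(data: sampleData, encoding: .utf8) ?? ""
let clipped = clip(body, before: "per kWh", backingUp: 6, length: 4)
// Double(_:) returns nil instead of crashing when the text is not a number.
let price = clipped.flatMap { Double($0.trimmingCharacters(in: .whitespaces)) }
if let price = price {
    print(price)
}
```

Returning optionals instead of force-unwrapping means a changed page produces nil rather than a crash.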

You can often check the robots.txt file that most major websites have. It shows which pages you are allowed to scrape and which you shouldn't (mostly, all they can do is block you from accessing the site in the future). To see which ones are off limits, check the "Disallow:" lines listed under "User-agent: *". I hope this helps answer the "allowing" question.
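If you want to do that check in code, a simplified sketch in Swift could look like this. The function name `disallowedPaths` is made up, and the parsing is deliberately naive: it only collects "Disallow:" paths from groups that apply to "User-agent: *" and ignores "Allow:" rules, wildcards, and groups aimed at specific crawlers.

```swift
import Foundation

/// Collects the Disallow paths that apply to every crawler ("User-agent: *").
func disallowedPaths(inRobotsTxt text: String) -> [String] {
    var paths: [String] = []
    var inWildcardGroup = false
    var lastLineWasUserAgent = false

    for rawLine in text.components(separatedBy: .newlines) {
        // Drop trailing comments and surrounding whitespace.
        let line = (rawLine.components(separatedBy: "#").first ?? "")
            .trimmingCharacters(in: .whitespaces)
        // Lines without a "field: value" shape (including blank lines) are skipped.
        guard let colon = line.firstIndex(of: ":") else { continue }
        let field = line[..<colon].trimmingCharacters(in: .whitespaces).lowercased()
        let value = line[line.index(after: colon)...].trimmingCharacters(in: .whitespaces)

        if field == "user-agent" {
            // Consecutive User-agent lines share one group; otherwise a new group starts.
            if !lastLineWasUserAgent { inWildcardGroup = false }
            if value == "*" { inWildcardGroup = true }
            lastLineWasUserAgent = true
        } else {
            lastLineWasUserAgent = false
            // An empty Disallow value means "nothing is disallowed", so skip it.
            if field == "disallow" && inWildcardGroup && !value.isEmpty {
                paths.append(value)
            }
        }
    }
    return paths
}
```

You would feed it the text downloaded from the site's /robots.txt and compare the page you want against the returned path prefixes.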