Parsing HTML
Reading an HTML file is no different from reading a standard text file; see Reading a File to learn how to do it. However, it’s often necessary to extract specific pieces of information from HTML files, such as links, images, and table data, for further processing.
Parsing an HTML File
The handler in Listing 33-1 extracts specific tags and their content from HTML text. Provide an HTML file to read and an opening and closing tag, and indicate whether to return only the content between the tags or the tags along with their enclosed content. If no closing tag is provided, the handler extracts only the opening tag. This is useful for extracting image tags, for example, which don’t have a separate closing tag. Matches are joined with return characters, and the handler returns false if an error other than end-of-file occurs.
APPLESCRIPT
on parseHTMLFile(theFile, theOpeningTag, theClosingTag, returnContentsOnly)
    try
        set theFile to theFile as string
        set theFile to open for access file theFile
        set theCombinedResults to ""
        set theCurrentOpeningTag to ""
        repeat
            -- Skip ahead to the next tag and read it in full
            read theFile before "<"
            set theCurrentTag to read theFile until ">"
            if theCurrentTag does not start with "<" then set theCurrentTag to ("<" & theCurrentTag) as string
            if theCurrentTag begins with theOpeningTag then
                set theCurrentOpeningTag to theCurrentTag
                if theClosingTag is "" then
                    -- No closing tag requested: collect the opening tag only
                    if theCombinedResults is "" then
                        set theCombinedResults to theCombinedResults & theCurrentOpeningTag
                    else
                        set theCombinedResults to theCombinedResults & return & theCurrentOpeningTag
                    end if
                else
                    -- Accumulate text and nested tags until the closing tag is found
                    set theTextBuffer to ""
                    repeat
                        set theTextBuffer to theTextBuffer & (read theFile before "<")
                        set theTagBuffer to read theFile until ">"
                        if theTagBuffer does not start with "<" then set theTagBuffer to ("<" & theTagBuffer)
                        if theTagBuffer is theClosingTag then
                            if returnContentsOnly is false then
                                set theTextBuffer to theCurrentOpeningTag & theTextBuffer & theTagBuffer
                            end if
                            if theCombinedResults is "" then
                                set theCombinedResults to theCombinedResults & theTextBuffer
                            else
                                set theCombinedResults to theCombinedResults & return & theTextBuffer
                            end if
                            exit repeat
                        else
                            set theTextBuffer to theTextBuffer & theTagBuffer
                        end if
                    end repeat
                end if
            end if
        end repeat
        close access theFile
    on error theErrorMessage number theErrorNumber
        try
            close access theFile
        end try
        -- Error -39 is end-of-file, which ends the read loop normally
        if theErrorNumber is not -39 then return false
    end try
    return theCombinedResults
end parseHTMLFile
Listing 33-2 shows how to call the handler in Listing 33-1 to extract all hyperlinks within a chosen HTML file.
APPLESCRIPT
set theFile to choose file with prompt "Select an HTML file:"
parseHTMLFile(theFile, "<A HREF=", "</A>", false)
--> Example of Result:
--> "<A HREF="http://www.apple.com/fileA.html">Click here to view fileA.</A>
--> <A HREF="http://www.apple.com/fileB.html">Click here to view fileB.</A>"
Listing 33-3 shows how to call the handler in Listing 33-1 to extract the text of all hyperlinks within a chosen HTML file, that is, the content between each opening and closing tag.
APPLESCRIPT
set theFile to choose file with prompt "Select an HTML file:"
parseHTMLFile(theFile, "<A HREF=", "</A>", true)
--> Example of Result:
--> "Click here to view fileA.
--> Click here to view fileB."
Listing 33-4 shows how to call the handler in Listing 33-1 to extract all images within a chosen HTML file.
APPLESCRIPT
set theFile to choose file with prompt "Select an HTML file:"
parseHTMLFile(theFile, "<IMG ", "", false)
--> Example of Result:
--> "<IMG SRC="gfx/clipboard.gif" BORDER="0">
--> <IMG SRC="printer_stopped.gif" ALIGN=TOP WIDTH="32" HEIGHT="32" BORDER="0">
--> <IMG SRC="printer_on.gif" ALIGN=TOP WIDTH="32" HEIGHT="32" BORDER="0">"
Listing 33-5 shows how to call the handler in Listing 33-1 to extract any tables within a file.
APPLESCRIPT
set theFile to choose file with prompt "Select an HTML file:"
parseHTMLFile(theFile, "<TABLE", "</TABLE>", false)
--> Example of Result:
--> "<TABLE WIDTH="440"><TR><TD ALIGN="CENTER" VALIGN="TOP"><IMG SRC="gfx/clipboard.gif" BORDER="0"></TD></TR></TABLE>"
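Because the handler in Listing 33-1 joins its matches with return characters and returns false on failure, a caller usually needs to check the result before using it. The following is a minimal sketch of one way to turn that result into a list. The name listFromParseResult is a hypothetical helper, not part of the original listings, and the sketch assumes that no single match contains a line break of its own (a multi-line table, for example, would be split into several items).

```applescript
-- Hypothetical helper (an assumption, not one of the original listings):
-- converts the value returned by parseHTMLFile into a list of matches.
on listFromParseResult(theResult)
    -- parseHTMLFile returns false when an error other than end-of-file occurs
    if theResult is false then return {}
    if theResult is "" then return {}
    -- Matches are joined with return characters, so each paragraph is one match
    return paragraphs of theResult
end listFromParseResult

-- A sample value standing in for the result of a real parseHTMLFile call
set theSampleResult to "<IMG SRC=\"a.gif\">" & return & "<IMG SRC=\"b.gif\">"
set theImageTags to listFromParseResult(theSampleResult)
count of theImageTags
--> Result: 2
```

Returning an empty list for both the failure case and the no-match case keeps the calling code simple; a script that needs to tell the two apart should test for false before calling the helper.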
Parsing an HTML Tag
The handler in Listing 33-6 extracts the first instance of text contained within quotes in an HTML tag, such as the value of the tag’s first quoted attribute.
APPLESCRIPT
on parseHTMLTag(theHTMLTag)
    -- Split the tag at each quotation mark; item 2 is the first quoted value
    set AppleScript's text item delimiters to "\""
    set theHTMLTagElements to text items of theHTMLTag
    set AppleScript's text item delimiters to ""
    if length of theHTMLTagElements is greater than 1 then return item 2 of theHTMLTagElements
    return ""
end parseHTMLTag
Listing 33-7 shows how to call the handler in Listing 33-6 to extract the destination of a hyperlink tag.
APPLESCRIPT
set theHTMLTag to "<A HREF=\"http://www.apple.com/fileA.html\">Click here to view fileA.</A>"
parseHTMLTag(theHTMLTag)
--> Result: "http://www.apple.com/fileA.html"
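Returning the first quoted value works when the attribute of interest comes first, but a tag such as an image tag may list other quoted attributes before it. The following is a minimal sketch of a variant that looks up a value by attribute name. The handler name parseHTMLTagAttribute and its parameters are hypothetical and not part of the original listings. It assumes the attribute is written as NAME="value" with no spaces around the equal sign, and it would also match a longer attribute name that ends with the same text.

```applescript
-- Hypothetical helper (an assumption, not one of the original listings):
-- returns the quoted value of a named attribute, or "" if it is not found.
on parseHTMLTagAttribute(theHTMLTag, theAttributeName)
    set theSavedDelimiters to AppleScript's text item delimiters
    set theValue to ""
    try
        -- Split the tag at the attribute name followed by =" (for example SRC=")
        set AppleScript's text item delimiters to (theAttributeName & "=\"")
        set theParts to text items of theHTMLTag
        if (count of theParts) is greater than 1 then
            -- The value runs from the split point to the next quotation mark
            set AppleScript's text item delimiters to "\""
            set theValue to text item 1 of (item 2 of theParts)
        end if
    end try
    -- Restore whatever delimiters the caller was using
    set AppleScript's text item delimiters to theSavedDelimiters
    return theValue
end parseHTMLTagAttribute

set theHTMLTag to "<IMG ALT=\"Printer\" SRC=\"printer_on.gif\" BORDER=\"0\">"
parseHTMLTagAttribute(theHTMLTag, "SRC")
--> Result: "printer_on.gif"
```

Unlike Listing 33-6, this sketch saves and restores the previous text item delimiters instead of resetting them to an empty string, so it does not disturb a caller that has set its own delimiters.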
Copyright © 2018 Apple Inc. All rights reserved. Updated: 2016-06-13
