Another Approach to Java HTML Parsing

This page is dated. Please see Parsing Web Pages in Java.

The Problem

This Crowbar page talks about the difference in building web page scrapers that operate on "syntax space" vs. those that operate on "model space". A syntax space parser allows you to parse HTML into its corresponding tags and tag attributes. But there are many HTML pages that consist of both HTML and JavaScript. A browser like Firefox first parses the HTML into a Document Object Model (DOM), and then JavaScript in the page is executed, potentially changing the DOM.

To be able to deal with "in the wild" web pages, data mining software has to look at the modified DOM, i.e., model space, to capture the page that a person using a web browser would see. For example, in mining web comics, some sites create the IMG tag for today's comic in JavaScript. To find the IMG tag, you have to use software that not only parses HTML but also applies JavaScript.

Taking the process one step further, a web browser will inspect the DOM and render the page on a canvas. This step is needed if one wants to find the location of images on a page. This information is useful in some data mining operations.

I have used JRex and the Cobra Toolkit to mine web pages in Java. Unfortunately, JRex seems to be an abandoned project and is a bit unstable. The Cobra Toolkit will not handle some web pages with JavaScript and hasn't been updated recently. I started looking at using Java XPCOM, the XULRunner library that allows a Java application to control a Mozilla browser. But it appears that Java XPCOM will no longer be supported in the XULRunner 2.0 distribution.

Using Crowbar

I found the Crowbar project that is part of the MIT Simile project. Crowbar is a XULRunner application that forms a proxy server for web scraping software. One gives Crowbar a URL, it loads the corresponding page into a Mozilla browser, waits for JavaScript to be applied, and then sends HTML code that corresponds to the updated DOM to the caller.

Figure 1 - Crowbar Parser

Figure 1 shows how I used Crowbar to mine web pages in Java. Crowbar is started. By default, Crowbar listens for connections on port 10000. When the Java application needs to parse a page, it sends the request to http://127.0.0.1:10000/ with the URL to be parsed sent as an encoded query string. For example, to parse http://www.cnn.com/, the Java application would open a request to

http://127.0.0.1:10000/?url=http%3A%2F%2Fwww.cnn.com%2F

Crowbar understands several parameters. I suggest setting the delay value and the view option. Delay indicates how many milliseconds Crowbar waits after creating a DOM before it creates HTML from the DOM and returns it to the caller. This delay is the amount of time JavaScript has to update the DOM. Since JavaScript might download images or do other network tasks, I usually use a delay of 15000 (15 seconds). The view option should be set to "asis" to get undecorated HTML back. So, the URL for parsing CNN would be

http://127.0.0.1:10000/?url=http%3A%2F%2Fwww.cnn.com%2F&delay=15000&view=asis

The HTML returned by Crowbar can be parsed with a syntax space parser such as HTML Parser.
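
The request described above can be sketched with the standard java.net classes alone. This is only an illustration: the CrowbarRequest class, its method names, the timeout values, and the assumption that the response is UTF-8 are all mine and are not part of Crowbar. The url, delay, and view parameters are the ones described above. The read timeout is set longer than the delay because Crowbar does not answer until the delay has elapsed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; the class and method names are illustrative only. */
class CrowbarRequest {

    /** Builds the Crowbar request URL from the url, delay, and view parameters. */
    static String buildUrl(String crowbarAddress, String targetUrl, int delayMillis) {
        return crowbarAddress
                + "?url=" + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8)
                + "&delay=" + delayMillis
                + "&view=asis";
    }

    /** Asks Crowbar for the page and returns the HTML it sends back. */
    static String fetch(String crowbarAddress, String targetUrl, int delayMillis)
            throws IOException {
        String request = buildUrl(crowbarAddress, targetUrl, delayMillis);
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(request).toURL().openConnection();
        conn.setConnectTimeout(5000);
        // Assumption: wait for the full delay plus a margin for the page load itself.
        conn.setReadTimeout(delayMillis + 30000);
        try (InputStream in = conn.getInputStream()) {
            // Assumption: the returned HTML is decoded as UTF-8.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling CrowbarRequest.buildUrl("http://127.0.0.1:10000/", "http://www.cnn.com/", 15000) produces the CNN request URL shown above.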

Crowbar Modifications

Crowbar embeds a Mozilla browser. By default, this browser pops up a modal window whenever a problem occurs while loading a web page. For example, an invalid URL will produce a modal window titled "Server Not Found." To avoid the need for human intervention, the browser can be configured to present an error page rather than a modal popup. The caller can then parse the page returned from Crowbar to determine whether an error occurred.

I added the following line to the debug.js file that is part of the Crowbar distribution:

pref("browser.xul.error_pages.enabled", true);

For my data mining project, I wanted to know the location of images as rendered by the browser on a medium size browser window, and I wanted to know the parent tag of IMG tags. Once a page is rendered, the offset in pixels of an image from its parent tag is available. To find the page location, one has to add all the offsets of all the parents back to the BODY tag. I used the code from Javascript: Calculate Element Position to calculate the offset. I pasted the code into the crowbar.js file right before the last two lines. I inserted a "var " string before the start of the code.
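
The idea of adding up offsets along the chain of parents can be illustrated in Java, even though the actual modification runs as JavaScript inside Crowbar. The sketch below is a simplified model: the Box and PagePosition classes are hypothetical stand-ins for rendered elements, and the sketch ignores scrolling and borders, which a real position calculation may also need to account for.

```java
/** Hypothetical stand-in for a rendered element; not a browser or library type. */
class Box {
    final String tagName;
    final int offsetLeft;   // pixels from the left edge of offsetParent
    final int offsetTop;    // pixels from the top edge of offsetParent
    final Box offsetParent; // null at the top of the chain (e.g. BODY)

    Box(String tagName, int offsetLeft, int offsetTop, Box offsetParent) {
        this.tagName = tagName;
        this.offsetLeft = offsetLeft;
        this.offsetTop = offsetTop;
        this.offsetParent = offsetParent;
    }
}

class PagePosition {
    /** Returns {x, y}: the element's offsets summed over every ancestor in the chain. */
    static int[] of(Box element) {
        int x = 0;
        int y = 0;
        for (Box b = element; b != null; b = b.offsetParent) {
            x += b.offsetLeft;
            y += b.offsetTop;
        }
        return new int[] { x, y };
    }
}
```

For an IMG at offset (3, 4) inside a TD at (5, 7) inside a TABLE at (10, 20) inside a BODY at (0, 0), the summed page position is (18, 31).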

Before the "underscore" code, I pasted two functions I created:


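// Entry point: annotate every IMG element in the document.
function doDom(dom) {
   traverse(dom.documentElement) ;
}

// Recursively walk the DOM, adding x, y, and parent attributes to IMG elements.
function traverse(element) {
   var tagName = element.tagName ;
   if(tagName && tagName.toLowerCase() == "img") {
      // Page position, computed by the pasted "underscore" code.
      var p = underscore.position(element) ;
      element.setAttribute('x', p.x) ;
      element.setAttribute('y', p.y) ;
      // offsetParent is null when the image is not rendered as part of the page.
      if(element.offsetParent)
         element.setAttribute('parent', element.offsetParent.tagName) ;
   }
   // Visit every child node; text and comment nodes have no tagName and no children.
   for(var i = 0 ; i < element.childNodes.length ; i++)
      traverse(element.childNodes[i]) ;
}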

This code traverses the DOM looking for IMG tags. When it finds one, it calculates the page offset with the underscore.position function and adds it to the IMG tag as the attributes "x" and "y".

The parent tag is added as a "PARENT" attribute. Its value is the name of the tag returned by the image's offsetParent property, e.g., "TD" if the IMG tag is enclosed in a "TD" tag. No PARENT attribute is added if the IMG tag has no offset parent. A missing PARENT attribute and offsets of (0, 0) typically mean the IMG was created in JavaScript and is not currently tied to the page. For example, one site creates an image that will hover when the user clicks a particular button.

A call to doDom() is placed in crowbar.js at line 208:

      doDom(browser.contentDocument) ;

The line before this line is 'var mime_type = "text/html"' and the line after is "var serializer = new XMLSerializer() ;".

The effect of this modification is that after the HTML is parsed by Crowbar and JavaScript has been applied, doDom() is called to add "X", "Y", and "PARENT" attributes to all IMG tags. Then the DOM is converted back to HTML and sent to the caller.

Calling Crowbar from Java

To call Crowbar, HTML Parser must be given the URL of Crowbar, along with the encoded target URL and arguments:

   String xulAddress = "http://127.0.0.1:10000/" ;
   String crowbarDelay = "15000" ;
   String url = "....." ; // The URL to parse.

   String xulURL = xulAddress + "?url=" ;
   try {
     xulURL += java.net.URLEncoder.encode(url, "UTF-8") ;
   }
   catch(Exception e) {
      System.out.println("URLEncoder error:  " + url + ", " + e) ;
   }
      
   xulURL += "&delay=" + crowbarDelay + "&view=asis" ;

Then a Parser object from the HTML Parser library is created to send the request to Crowbar and parse the modified HTML that comes back.

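      ConnectionManager manager ;
      Parser parser ;

      try {
         // Use HTML Parser's shared connection manager to open the Crowbar URL.
         manager = org.htmlparser.lexer.Page.getConnectionManager() ;
         parser = new Parser(manager.openConnection(xulURL)) ;

         // Walk the parsed nodes with doTree(), described below.
         doTree(parser.elements()) ;
      }
      catch(Exception e) {
      .....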

I then use a recursive tree traversal, method doTree(), to inspect the parsed nodes and extract IMG tag information. When an IMG tag is found, the HTML Parser TagNode.getAttribute() method can be used to extract IMG attributes including the X, Y, and PARENT attributes inserted by my Crowbar modification.
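
Once the X, Y, and PARENT attribute strings have been read from an IMG tag, they still need to be interpreted. The sketch below shows one way this might be done. The ImageInfo class and its method names are hypothetical, and treating a missing or unparsable coordinate as 0 is my own assumption. The isLikelyDetached() check encodes the rule of thumb given earlier: no PARENT attribute together with offsets of (0, 0).

```java
/** Hypothetical holder for the attributes added by the Crowbar modification. */
class ImageInfo {
    final int x;
    final int y;
    final String parentTag; // null when the IMG tag carried no PARENT attribute

    ImageInfo(int x, int y, String parentTag) {
        this.x = x;
        this.y = y;
        this.parentTag = parentTag;
    }

    /** Builds an ImageInfo from raw attribute values, any of which may be null. */
    static ImageInfo fromAttributes(String x, String y, String parent) {
        return new ImageInfo(parseCoordinate(x), parseCoordinate(y), parent);
    }

    /** Assumption: a missing or unparsable coordinate is treated as 0. */
    private static int parseCoordinate(String value) {
        if (value == null) {
            return 0;
        }
        try {
            // Round in case the browser reported a fractional pixel value.
            return (int) Math.round(Double.parseDouble(value.trim()));
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    /** True when the image has no parent and sits at (0, 0). */
    boolean isLikelyDetached() {
        return parentTag == null && x == 0 && y == 0;
    }
}
```

In doTree(), the three arguments would be the values returned by TagNode.getAttribute() for the X, Y, and PARENT attributes.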

Conclusion

The drawbacks to this approach are

It's a big kludge.
Crowbar is single-threaded.

On the plus side

Crowbar should handle all the pages Firefox does.
With the modifications I gave above, this approach allows one to find the location of images on a page.

I would rather have a stable, current HTML parser written in Java that can be run in a threaded environment. Given that there is no good open source "model space" parser, Crowbar is a good compromise.

Update 2/24/2011

I've been experimenting with HtmlUnit, and it might be a workable Java-based "model space" parser. In initial tests, it handled pages with complicated JavaScript. It appears that unlike Crowbar, you cannot get the page locations of images with HtmlUnit since it doesn't render the page. But it appears to be a stable and supported package.

I've also been looking at the forklabs-javaxpcom code, which is skeleton code for building a web crawler. The code uses JavaXPCOM and SWT. Other than the problem with JavaXPCOM going away in XULRunner 2, this package has a lot of potential. Unlike the method of using Crowbar, JavaXPCOM allows tighter integration with the crawler application, and there is no need for the delay that Crowbar imposes waiting for JavaScript to run. (I have found a few pages where a delay is needed, but generally my forklabs-based crawler is much faster than the Crowbar version.)

Update 5/20/2011

Iker Jamardo Zugaza pointed out, and I confirmed, that Crowbar stalls on certain complicated pages. I did some limited debugging, and it appears that the "page loaded" event handler never fires. After the stall, giving Crowbar a URL for a simple page and then returning to the stalled page is a way to have Crowbar parse the page successfully. It is possible that a particular sequence of pages causes Crowbar to stall. Further investigation is needed.
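
The workaround above could be automated along the following lines. This is a speculative sketch, not a tested fix: the StallRecovery class is hypothetical, the fetcher is assumed to return an empty Optional when a request to Crowbar times out, and whether a single retry is enough is unknown.

```java
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical retry helper for the stall workaround; names are illustrative. */
class StallRecovery {

    /**
     * Fetches url through the given fetcher. If the first attempt yields nothing
     * (assumed to mean Crowbar stalled), a simple page is requested to get
     * Crowbar moving again, and the original URL is tried once more.
     */
    static Optional<String> fetchWithRecovery(
            Function<String, Optional<String>> fetcher,
            String url,
            String simpleUrl) {
        Optional<String> first = fetcher.apply(url);
        if (first.isPresent()) {
            return first;
        }
        // The result for the simple page is not needed; the request itself is the nudge.
        fetcher.apply(simpleUrl);
        return fetcher.apply(url);
    }
}
```

The fetcher would wrap whatever code sends the request to Crowbar and map a read timeout to Optional.empty().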

Update 9/12/2012

Selenium and the Selenium WebDriver appear to be useful tools for accessing web pages using embedded browsers. The system supports multiple languages, including Java, and multiple browsers, including Firefox. Although it is aimed more at automating site testing, it can be used for data mining.

I am currently using my own concurrent, distributed data mining software called the Hip Dragon Parser (HDP). It consists of two parts: A crowbar-like XUL application that controls a Gecko browser and a Java library that manages multiple instances of the XUL application. The XUL parsers run in multiple processes on one or more machines. Although the parsers do not run headless, it is possible to run them under a virtual frame buffer under Unix. The system can be run on a mixture of operating systems that support XUL and Java.

HDP is a lightweight system for data mining research. It can capture one or more of the following items from a web page: the full DOM after JavaScript has executed with the page locations of rendered objects added, a list of images and their attributes and locations, and/or a list of anchor tag HREF attributes.

Update 1/27/2015

I've encountered stability problems with XULRunner starting with XULRunner version 17. HDP is stable when using an earlier version, but it would be beneficial to use a newer version of the embedded browser. I've had trouble determining the cause of the problem as it occurs intermittently.

I'm currently working on a project to use the Chromium Embedded Framework to replace the XULRunner processes with Chromium. The initial prototype is underway. The embedded Chromium processes are usable from Java over a socket; however, the initial version of the parsers will only run under Windows. With rumors of .NET being available on the Mac and Linux, portable parsers might be available in the future.

Update 4/10/2015

The Chromium version of HDP, CrHDP, has been completed and provides a stable parsing subsystem for my data mining project. See the writeup for more details.