Using JRex to Retrieve the HTML of Rendered Pages

This page is dated. Please see Parsing Web Pages in Java.

I was working on a web data mining project that looks for certain types of images in web pages. I used HTML Parser to parse the HTML and find IMG and OBJECT tags. In a few cases, I discovered that the raw HTML of a page was not adequate. In particular, one page created the SRC attribute of an IMG tag at run time using JavaScript:

I looked for an open-source package that would let me access the HTML of the rendered page after JavaScript had been applied. I did not find anything suitable and considered embedding the Mozilla Gecko engine in my own application. I resisted this approach because of the time needed to set up a development environment and learn the techniques for embedding a browser.

Then I discovered C.N.Medappa's JRex, a Java wrapper for Gecko. With JRex, I wrote a Java application that visits a page and then waits for the page to load and JavaScript to be executed. It then traverses the Document Object Model (DOM) created by the browser engine and reconstructs the HTML of the rendered page. In the example above, the value of the SRC attribute would be "abc.jpg".

Getting Started

I needed a simple example to get me started developing with JRex. I was fortunate to find Dietrich Kappe's How to REALLY do Page Preview in Java with Embedded HTML Rendering page with a description of setting up a JRex development environment and a simple example. This description got me to the point where I could launch Gecko from a Java app. After that, I used the JRex and org.w3c.dom APIs, along with some experimentation, to develop an application to extract IMG and OBJECT tags from a page.

I followed Kappe's suggestion of using the java command from the JRE, as opposed to java in the JDK, to run JRex Java applications. When I use the JDK version of Java, several DLLs are not found during JRex startup. I'm sure there's a way to fix the problem, but running the JRE version of Java works fine.

The JRex Mailing List is a good source of information about JRex.

An Example

I created a Java class called Render. The parsePage() method uses JRex to open a page and wait for it to load. It then calls a recursive method called doTree() that traverses the DOM created by the browser engine. For each tag, the doElement() method is called, and then the child tags of the tag are recursively processed. When processing for a tag completes, the doTagEnd() method is called.

In Render, doElement() prints the tag, ignoring all attributes, e.g., <IMG SRC="xyz.gif"> causes <IMG> to be printed. The doTagEnd() method in Render merely prints a closing tag, e.g., </IMG> in the example.
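The traversal pattern described above can be tried without JRex. The following is a minimal sketch, assuming well-formed markup and the XML parser that ships with the JDK in place of Gecko; the class name TagWalker and the method walk() are hypothetical and are not part of Render.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical stand-alone demo of the doTree() traversal; no JRex required. */
public class TagWalker {
    private final StringBuilder out = new StringBuilder();

    /** Same shape as Render.doTree(): visit the tag, recurse over children, close the tag. */
    void doTree(Node node) {
        if (!(node instanceof Element)) {
            return; // skip text nodes, comments, and so on
        }
        Element element = (Element) node;
        out.append('<').append(element.getTagName()).append(">\n");
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            doTree(children.item(i));
        }
        out.append("</").append(element.getTagName()).append(">\n");
    }

    /** Parse well-formed markup with the JDK XML parser and return the tag trace. */
    public static String walk(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            TagWalker walker = new TagWalker();
            walker.doTree(doc.getDocumentElement());
            return walker.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(walk("<table><tr><td></td><td></td></tr></table>"));
    }
}
```

Because this sketch uses an XML parser rather than a browser engine, it does not run JavaScript, does not repair markup, and reports tag names exactly as written in the input.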

Consider the following HTML for a simple page:

<html>
<head>
<TITLE>Simple Page</TITLE>
</head>
<body>
<table>
<tr><td></td><td></td></tr>
</table>
</body>
</html>

The following output is produced by Render when it is applied to the simple page:

Note that when the DOM is constructed, a node for the missing <TBODY> tag is created.

Here is the source code to Render:

package com.benjysbrain.htmlgrab;

import org.mozilla.jrex.*;
import org.mozilla.jrex.ui.*;
import org.mozilla.jrex.window.*;
import org.mozilla.jrex.navigation.*;
import org.mozilla.jrex.event.progress.*;
import org.w3c.dom.*;
import javax.swing.*;
import java.net.*;

/** Render - This object is a wrapper for JRex, the Java library that
    allows a Java application to embed the Mozilla Gecko browser. It
    uses JRex to load a page and then act on the DOM that Gecko
    constructs. The intent of Render is to access the DOM after a page
    is loaded and JavaScript has been applied for web data extraction
    projects.
    <p>
    Subclass this object and override the
    <i>doElement(org.w3c.dom.Element element)</i> and
    <i>doTagEnd(org.w3c.dom.Element element)</i> methods to do some
    real work. In the base class, doElement() prints the tag name and
    doTagEnd() prints a closing version of the tag.
    <p>
    Thanks to Dietrich Kappe for his JRex
    <A HREF="http://blogs.pathf.com/agileajax/2007/01/how_to_really_d.html">
    article.</a> See my <A HREF="http://www.benjysbrain.com/misc/Render">
    article</a> for more details. Thanks to Jason Baumgartner for the
    tip on how to disable JRex logging of debug information.
    <p>
    Copyright (c) 2007 by Ben E. Cline. This code is presented as a
    teaching aid. No warranty is expressed or implied.
    <p>
    http://www.benjysbrain.com/

    @author Benjy Cline
*/
public class Render implements org.mozilla.jrex.event.progress.ProgressListener {
    String url;    // The page to be processed.

    // These variables can be used in subclasses and are created from
    // url. baseURL can be used to construct the absolute URL of the
    // relative URLs in the page. hostBase is just the http://host.com/
    // part of the URL and can be used to construct the full URL of
    // URLs in the page that are site relative, e.g., "/xyzzy.jpg".
    // Variable host is set to the host part of url, e.g., host.com.
    String baseURL;
    String hostBase;
    String host;

    // The JRexCanvas is the main browser component. The WebNavigation
    // object is used to access the DOM.
    JRexCanvas canvas = null;
    WebNavigation navigation = null;

    // An event handler sets "done" to true when the document is loaded.
    // Declared volatile so the polling loop in parsePage() sees the
    // update if the listener runs on another thread.
    volatile boolean done = false;

    /** Create a Render object with a target URL. */
    public Render(String URL) {
        url = URL;
    }

    /** Load the given URL in Gecko. When the page is loaded,
        recurse on the DOM and call doElement()/doTagEnd() for
        each Element node. Execution can hang if the page causes
        a window to be popped up. Return false on error.
    */
    public boolean parsePage() {
        // Parse the URL and build baseURL and hostBase for use by
        // doElement() and doTagEnd().
        URI uri = null;
        try {
            uri = new URI(url);
        } catch (Exception e) {
            System.out.println(e);
            return false;
        }
        String path = uri.getPath();
        baseURL = "http://" + uri.getHost() + path + "/";
        hostBase = "http://" + uri.getHost();
        host = uri.getHost();

        // Start up JRex/Gecko.
        try {
            JRexFactory.getInstance().startEngine();
        } catch (Exception e) {
            System.err.println("Unable to start up JRex Engine.");
            e.printStackTrace();
            return false;
        }

        // Get a window manager and put the browser in a Swing frame.
        // Based on Dietrich Kappe's code.
        JRexWindowManager winManager = (JRexWindowManager)
            JRexFactory.getInstance().getImplInstance(JRexFactory.WINDOW_MANAGER);
        winManager.create(JRexWindowManager.SINGLE_WINDOW_MODE);
        JPanel panel = new JPanel();
        JFrame frame = new JFrame();
        frame.getContentPane().add(panel);
        winManager.init(panel);

        // Get the JRexCanvas, set Render to handle progress events so
        // we can determine when the page is loaded, and get the
        // WebNavigation object.
        canvas = (JRexCanvas) winManager.getBrowserForParent(panel);
        canvas.addProgressListener(this);
        navigation = canvas.getNavigator();

        // Load and process the page.
        try {
            navigation.loadURI(url, WebNavigationConstants.LOAD_FLAGS_NONE,
                               null, null, null);

            // Swing magic.
            frame.setSize(640, 480);
            frame.setVisible(false);

            // Check if the DOM has loaded every two seconds.
            while (!done) {
                Thread.sleep(2000);
            }

            // Get the DOM and recurse on its nodes.
            Document doc = navigation.getDocument();
            Element ex = doc.getDocumentElement();
            doTree((Node) ex);
        } catch (Exception e) {
            System.out.println("Trouble walking DOM: " + e);
            return false;
        }
        return true;
    }

    /** Recurse the DOM starting with Node node. For each Node of
        type Element, call doElement() with it and recurse over its
        children. The Elements refer to the HTML tags, and the children
        are tags contained inside the parent tag.
    */
    public void doTree(Node node) {
        if (node instanceof Element) {
            Element element = (Element) node;

            // Visit tag.
            doElement(element);

            // Visit all the children, i.e., tags contained in this tag.
            NodeList nl = element.getChildNodes();
            if (nl == null) return;
            int num = nl.getLength();
            for (int i = 0; i < num; i++)
                doTree(nl.item(i));

            // Process the end of this tag.
            doTagEnd(element);
        }
    }

    /** Simple doElement() to print the tag name of the Element. Override
        to do something real.
    */
    public void doElement(Element element) {
        System.out.println("<" + element.getTagName() + ">");
    }

    /** Simple doTagEnd() to print the closing tag of the Element.
        Override to do something real.
    */
    public void doTagEnd(Element element) {
        System.out.println("</" + element.getTagName() + ">");
    }

    // org.mozilla.jrex.event.progress.ProgressListener methods.
    // onStateChange() seems the best place to watch for the
    // completion of the loading of the DOM.

    /** Noop */
    public void onLinkStatusChange(ProgressEvent event) { }

    /** Noop */
    public void onLocationChange(ProgressEvent event) { }

    /** Noop */
    public void onProgressChange(ProgressEvent event) { }

    /** Noop */
    public void onSecurityChange(ProgressEvent event) { }

    /** onStateChange is invoked several times when DOM loading is
        complete. Set the done flag the first time.
    */
    public void onStateChange(ProgressEvent event) {
        if (!event.isLoadingDocument()) {
            if (done) return;
            done = true;
        }
    }

    /** Noop */
    public void onStatusChange(ProgressEvent event) { }

    /** Main: java com.benjysbrain.htmlgrab.Render [url]. Run JRex on
        the given page, wait for the page to load, and traverse the
        DOM, printing tag names only.
    */
    public static void main(String[] args) {
        String url = "http://www.cnn.com";
        if (args.length == 1) url = args[0];
        Render p = new Render(url);
        p.parsePage();
        System.exit(0);
    }
}
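The comments in Render say that baseURL and hostBase are meant for turning relative and site-relative URLs found in a page into absolute ones. As an alternative to string concatenation, the standard java.net.URI class can do the resolution. This is a minimal sketch of that swapped-in technique, not what Render itself does; the class UrlResolver and the method absolute() are hypothetical names.

```java
import java.net.URI;

/** Hypothetical helper: resolve links found in a page against the page's own URL. */
public class UrlResolver {

    /** Return the absolute form of link, interpreted relative to pageUrl. */
    public static String absolute(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://host.com/dir/page.html";
        System.out.println(absolute(page, "img/a.jpg"));   // relative to the page's directory
        System.out.println(absolute(page, "/xyzzy.jpg"));  // site relative
        System.out.println(absolute(page, "http://other.com/x.gif")); // already absolute
    }
}
```

A subclass could call a helper like this from doElement() on the value of a SRC attribute, passing the url field as the first argument.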

To run the main() method of Render under Windows, you need to set the CLASSPATH to include JRex.jar and set two -D values:

-Djrex.dom.enable=true

-Djrex.gre.path=%JREX_GRE_PATH%

where the Windows %JREX_GRE_PATH% variable points to the JRex GRE. If you invoke Render with a URL, then it should visit the page and report the tags in the page. Otherwise, it displays the tags in the CNN home page.
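Forgetting one of the two -D values is an easy mistake, so a small check before starting the engine may save some debugging. The sketch below only reads the two property names given above; the class JRexPreflight, the method check(), and the assumption that jrex.dom.enable must be exactly "true" are mine and are not taken from the JRex documentation.

```java
import java.util.Properties;

/** Hypothetical preflight check for the two system properties Render relies on. */
public class JRexPreflight {

    /** Return a description of the first problem found, or null if both settings are present. */
    public static String check(Properties props) {
        // Assumption: the DOM is only enabled when the value is exactly "true".
        if (!"true".equals(props.getProperty("jrex.dom.enable"))) {
            return "jrex.dom.enable is not set to true";
        }
        String grePath = props.getProperty("jrex.gre.path");
        if (grePath == null || grePath.isEmpty()) {
            return "jrex.gre.path is not set";
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = check(System.getProperties());
        System.out.println(problem == null ? "JRex properties look OK" : problem);
    }
}
```

Calling JRexPreflight.check(System.getProperties()) at the top of main() and exiting when it returns a message would be one way to use it.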

To use Render to do real work, extend it in a subclass. You can override doElement() and doTagEnd() to extract information from the DOM. To extract tag attributes, first run the boolean method hasAttributes() of the Element object. If the tag has attributes, this method will return true. You can then use the getAttributes() method to obtain a NamedNodeMap object, which you can use to access the tag attributes. The Node objects referenced by the NamedNodeMap contain attribute/value pairs. The getNodeName() method of Node returns the attribute name while the getNodeValue() method returns the attribute value.
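The attribute access described above uses only the standard org.w3c.dom API, so it can be sketched without JRex. In this minimal example, the class AttributeDump and its methods are hypothetical, and the JDK XML parser stands in for Gecko to produce an Element to work on. In a Render subclass, the body of attributesOf() is what would go inside an overridden doElement().

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical demo of reading tag attributes through hasAttributes()/getAttributes(). */
public class AttributeDump {

    /** Collect the attribute name/value pairs of an Element into a map. */
    public static Map<String, String> attributesOf(Element element) {
        Map<String, String> result = new LinkedHashMap<>();
        if (!element.hasAttributes()) {
            return result; // tag has no attributes
        }
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            result.put(attr.getNodeName(), attr.getNodeValue());
        }
        return result;
    }

    /** Parse a well-formed fragment with the JDK XML parser and return its root Element. */
    public static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        Element img = parseRoot("<IMG SRC=\"xyz.gif\" ALT=\"logo\"/>");
        for (Map.Entry<String, String> entry : attributesOf(img).entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

The order in which a NamedNodeMap returns attributes is not guaranteed, so code that needs a particular attribute should look it up by name rather than by position.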

Limitations and Problems

Render is presented as an example of using JRex and isn't intended to handle all JRex situations. I use a subclass of Render daily, and for the pages I data mine, it is stable. However, the pages in my test suite are very well behaved.

I use an event listener in Render to determine when the page loads. The parsePage() method in Render repeatedly sleeps for two seconds and inspects a "done" flag that is set by the listener. If the page doesn't load, perhaps due to a modal window being displayed by the embedded browser, parsePage() will never complete.
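One possible way to avoid waiting forever is to replace the sleep loop with a wait that gives up after a fixed time. The following is a sketch of that idea using java.util.concurrent.CountDownLatch. It is not part of Render; the class LoadWaiter and its method names are hypothetical, and I have not tried this variant against JRex.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical replacement for the "done" flag: a wait that can time out. */
public class LoadWaiter {
    private final CountDownLatch loaded = new CountDownLatch(1);

    /** Call this from the progress listener, where Render currently sets done = true. */
    public void markLoaded() {
        loaded.countDown(); // safe to call more than once
    }

    /** Block until markLoaded() is called; return false on timeout or interruption. */
    public boolean awaitLoad(long timeoutMillis) {
        try {
            return loaded.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        final LoadWaiter waiter = new LoadWaiter();
        // Simulate a listener thread that reports completion after a short delay.
        new Thread(new Runnable() {
            public void run() {
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                waiter.markLoaded();
            }
        }).start();
        System.out.println(waiter.awaitLoad(5000) ? "loaded" : "timed out");
    }
}
```

With this approach, parsePage() could return false when awaitLoad() times out instead of hanging. A modal window opened by the embedded browser would still be left on screen, so this only addresses the hang in the calling code.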

When the Render class starts the embedded browser, sometimes the browser window is displayed and sometimes it is created and immediately disappears. I did not address this problem as I do not need to view the browser window during data mining.

Under Windows XP and Java 1.5.0_09, some pages cause the JVM to crash. I do not have this problem on a Windows XP system with an older version of Java.

If you have comments, suggestions, or questions, feel free to contact me at the e-mail address given in the footer of this page.

Update: See my Cobra page that describes another rendering engine.

Update (3/1/2011): It appears that JRex is no longer supported. I am now using Crowbar and HtmlUnit for HTML parsing from Java.