The first part of this article describes the history of a decade-long search for a stable parsing method that can be used from Java. The final section describes my current effort, CrHDP, which embeds Chromium to parse web pages.
My quest started in the early 2000s when the local newspaper reorganized the comics page, removing some of my favorite strips. I started reading the missing strips online and decided to create a single page for them. A colleague had developed a tag processor as a servlet, so I used his tag that would extract the nth image from a page and embed it in a new page. This technique worked well for a while and needed only occasional adjustments.
The machine learning system is described on The Comics Miner page and will not be discussed further.
In 2011, I tried to use two other packages: HtmlUnit and forklabs-javaxpcom. The former did not render pages, so I could not extract the page location of images. The forklabs code was built on JavaXPCOM. I started using the code at about the time Mozilla announced that JavaXPCOM would no longer be supported.
I finally used Crowbar as a model and produced a XULRunner application that would allow multiple concurrent, distributed Gecko parsers to run. The parsers were accessed via sockets and could run on any machine that supported XULRunner. A Java library was developed to manage the processes, send requests, capture output, and handle anomalous situations.
The parsers operated in three modes. In the first, the page would be rendered, and a list of images would be returned along with the image source URL, size, and page location. In the second mode, only the HREF attributes of anchor tags would be returned. In the final mode, the full DOM, modified to include the location of page objects, would be returned. I named this parser the Hip Dragon Parser (HDP) with the "dragon" referring to the Mozilla dragon icon.
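To make the three modes concrete, here is a minimal Java sketch of how a client might name them and send a request to a parser over a socket. It is an illustration only, not the HDP library: the names ParseMode and ParserClient, the one-line "MODE URL" request format, and the assumption that the parser closes the connection after replying are all hypothetical, since the actual wire protocol is not described here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical names throughout; the real HDP protocol may differ.
enum ParseMode {
    IMAGES, // render the page; return each image's source URL, size, and page location
    LINKS,  // return only the HREF attributes of anchor tags
    DOM     // return the full DOM, annotated with the location of page objects
}

final class ParserClient {
    // Assumed request format: "<MODE> <URL>" followed by a newline.
    static String formatRequest(ParseMode mode, String url) {
        return mode.name() + " " + url + "\n";
    }

    // Sends one request and reads the reply until the parser closes the connection (assumed).
    static String fetch(String host, int port, ParseMode mode, String url) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(formatRequest(mode, url).getBytes(StandardCharsets.UTF_8));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
            StringBuilder reply = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                reply.append(line).append('\n');
            }
            return reply.toString();
        }
    }

    public static void main(String[] args) {
        // Prints the request that would be sent for the link-extraction mode.
        System.out.print(formatRequest(ParseMode.LINKS, "http://example.com/"));
    }
}
```

Keeping the mode as an enum means a caller cannot ask for a mode the parser does not implement, and separating formatRequest from fetch lets the request format be checked without a running parser.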
The parsers could not be run headless on Windows XP, but under Linux, they could be run headless using a virtual frame buffer. Under Windows 7, the system runs headless when launched as a scheduled task, so I suspect there is a way to run the parsers in headless mode when launched in normal mode.
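As a rough sketch of the Linux case, a Java process manager could start a XULRunner application under a virtual frame buffer with ProcessBuilder. This assumes the frame buffer is Xvfb and that the xvfb-run wrapper script and the xulrunner binary are on the PATH; the class name HeadlessLauncher and the application.ini path are placeholders, and any arguments the parser itself needs (such as a port) are omitted because they are not documented here.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; assumes Xvfb's xvfb-run wrapper is installed.
final class HeadlessLauncher {
    // Builds the command line: xvfb-run -a picks a free X display number,
    // and xulrunner takes the path to the application's application.ini.
    static List<String> command(String applicationIni) {
        return Arrays.asList("xvfb-run", "-a", "xulrunner", applicationIni);
    }

    // Starts the parser process with stderr merged into stdout so one reader can capture both.
    static Process launch(String applicationIni) throws IOException {
        return new ProcessBuilder(command(applicationIni))
                .redirectErrorStream(true)
                .start();
    }

    public static void main(String[] args) {
        // Shows the command without launching anything; the path is a placeholder.
        System.out.println(String.join(" ", command("/path/to/application.ini")));
    }
}
```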
The system was stable with versions of XULRunner up to version 16. With version 17 and above, the parsers would intermittently crash or lock up. I made several attempts to debug the code, even building a version of XULRunner from scratch, but I eventually abandoned the hope of running newer versions of XULRunner. Mozilla offers little support for embedding, which makes obtaining help more difficult.
I am using the Chromium Embedded Framework (CEF3) to build the parser. CEF3 is a C library with a C++ wrapper layer. There are also bindings for other languages including Java and C#. The best approach would have been for me to write my wrapper in C++ to maintain the high level of portability that HDP has; however, I decided not to relearn a language I hadn't used in years. The Java bindings did not seem to have the power I needed, so I decided to build my code in C#. The issue of portability might be resolved by Microsoft's announced plans to port .NET to Linux and OS X, or by using Mono.
Multiple parsers can run concurrently on one or more machines. The production version of the code runs headless; however, I have an experimental version that runs inside a Windows form. This version is useful for viewing the causes of slow page loads.
The image above this text shows the CrHDP monitor. CrHDP is running on two hosts. The first has eight parsers running, while the second has four parsers. The data mining application and the monitor GUI run on the first machine. Each row in the display gives the port number on which the parser is listening for requests, a status box, and a user-supplied tag identifying the site being parsed. A yellow box means the parser is busy while a white box means the parser is idle. Other colors represent error conditions.
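The rows in the monitor suggest a simple data model: each parser has a host, a port, a status, and a tag. The following Java sketch shows one way a client might track that state and hand out idle parsers. It is not taken from the CrHDP source; the names ParserStatus, ParserEndpoint, and ParserPool, and the host names and port numbers in the usage note, are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical model of one monitor row per parser.
enum ParserStatus { IDLE, BUSY, ERROR }

final class ParserEndpoint {
    final String host;
    final int port;
    ParserStatus status = ParserStatus.IDLE;
    String tag = ""; // user-supplied tag identifying the site being parsed

    ParserEndpoint(String host, int port) {
        this.host = host;
        this.port = port;
    }
}

final class ParserPool {
    private final List<ParserEndpoint> endpoints = new ArrayList<>();

    synchronized void add(String host, int port) {
        endpoints.add(new ParserEndpoint(host, port));
    }

    // Claims the first idle parser, marks it busy, and records the site tag.
    synchronized Optional<ParserEndpoint> acquire(String tag) {
        for (ParserEndpoint e : endpoints) {
            if (e.status == ParserStatus.IDLE) {
                e.status = ParserStatus.BUSY;
                e.tag = tag;
                return Optional.of(e);
            }
        }
        return Optional.empty();
    }

    // Returns a parser to the idle state and clears its tag.
    synchronized void release(ParserEndpoint e) {
        e.status = ParserStatus.IDLE;
        e.tag = "";
    }

    // Status box color as described above; the specific error colors are not given.
    static String boxColor(ParserStatus status) {
        switch (status) {
            case BUSY: return "yellow";
            case IDLE: return "white";
            default:   return "other (error)";
        }
    }
}
```

For example, after pool.add("host1", 9001) and pool.add("host2", 9002), two calls to acquire each return a parser and a third returns an empty Optional until one of them is released.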
The current version of the code is based on the CEF3 2454 branch and was completed in November 2015. This version fixed a bug where temporary files were not deleted.
I decided to build a version of CrHDP that would run under Linux. Because of some support issues for CefGlue under Linux and questions about Linux .NET support, I decided to implement Linux CrHDP in C++. The implementation is based on the cefsimple sample program that comes with the distribution of CEF3. As of May 2017, I have a working version of Linux CrHDP that has all the features of the Windows version.