[UPHPU] Extracting templates from web pages
justin
justin at justinhileman.info
Wed Apr 2 15:11:49 MDT 2008
On Wed, Apr 2, 2008 at 11:35 AM, Richard K Miller
<richardkmiller at gmail.com> wrote:
> Adrian Holovaty (creator of ChicagoCrime.org and Django) has a Python
> script called templatemaker[1][2], which in theory would do what I want. You
> feed it a bunch of similar web pages and it produces a template with "holes"
> where the data was different across each web page. In practice, it's too
> granular; it doesn't recognize HTML. It looks at every character. I don't care about
> spaces between tags. I only care about substantial content differences
> across pages. Everything else can be moved to the template.
you could try running every page through HTML Tidy first and see whether
that normalizes the whitespace between tags. then run templatemaker on the
tidied output and see how that works out.
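
to give a rough idea of the normalization step, here is a minimal sketch in Python (the language templatemaker is written in). it is not HTML Tidy and not part of templatemaker; it is a crude regex stand-in that only collapses whitespace, and the function name and sample pages are made up for illustration. the real Tidy does much more, such as fixing unclosed tags.

```python
import re


def normalize_html_whitespace(html: str) -> str:
    """Hypothetical helper: a crude stand-in for HTML Tidy's whitespace cleanup."""
    # Collapse every run of whitespace (spaces, tabs, newlines) to one space.
    collapsed = re.sub(r"\s+", " ", html)
    # Drop whitespace that sits only between a closing '>' and an opening '<'.
    collapsed = re.sub(r">\s+<", "><", collapsed)
    return collapsed.strip()


# Two made-up pages with the same structure but different indentation.
page_a = "<ul>\n  <li>Alice</li>\n  <li>42</li>\n</ul>\n"
page_b = "<ul><li>Bob</li>\t<li>7</li></ul>"

print(normalize_html_whitespace(page_a))  # <ul><li>Alice</li><li>42</li></ul>
print(normalize_html_whitespace(page_b))  # <ul><li>Bob</li><li>7</li></ul>
```

after this step the two pages differ only in their actual content, which is what you would want templatemaker to see. note that this sketch also squashes whitespace inside <pre> blocks and between inline tags, so it is only safe when that whitespace does not matter to you.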
justin
--
http://justinhileman.com