[UPHPU] Extracting templates from web pages

thebigdog bigdog at venticon.com
Wed Apr 2 17:43:50 MDT 2008


>>  Adrian Holovaty (creator of ChicagoCrime.org and Django) has a Python
>> script called templatemaker[1][2], which in theory would do what I want. You
>> feed it a bunch of similar web pages and it produces a template with "holes"
>> where the data was different across each web page. In practice, it's too
>> granular; it doesn't recognize HTML. It looks at every I don't care about
>> spaces between tags. I only care about substantial content differences
>> across pages. Everything else can be moved to the template.
> 
> you could try running everything through HTML Tidy first, see if that
> normalizes whitespace and such. then run templatemaker and see how
> that works out.

You could use a diff program to find out where the pages differ and then kind of
do the reverse to come up with the similarities. However, I would only do that
after running everything through tidy first.
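
A rough sketch of that diff idea in Python, using the standard difflib module.
The tokenizer here is only a crude stand-in for tidy, and the {{HOLE}} marker
and the function names are made up for illustration, so treat it as a starting
point rather than a finished tool:

```python
import difflib
import re

HOLE = "{{HOLE}}"  # hypothetical placeholder marker, not from templatemaker


def tokenize(html):
    # Split into tags and text runs, then drop whitespace-only runs so that
    # spacing between tags never shows up as a difference.
    tokens = re.findall(r"<[^>]+>|[^<]+", html)
    return [t.strip() for t in tokens if t.strip()]


def make_template(page_a, page_b):
    # Keep the tokens both pages share; collapse each differing run to a hole.
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    template = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            template.extend(a[i1:i2])
        else:
            template.append(HOLE)
    return template


page_a = "<html><body><h1>Title A</h1>  <p>Body A</p></body></html>"
page_b = "<html> <body><h1>Title B</h1><p>Body B</p></body></html>"
template = make_template(page_a, page_b)
print("".join(template))
```

Because it compares whole tags and text runs instead of characters, spaces
between tags are ignored and only real content differences become holes.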

If it were up to me, I would take one page and create a template from it, then
extract all the data you need to populate the other pages with that template.
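
One way to read that, sketched in Python: mark the variable spots in a single
page, turn the result into a regular expression, and pull the data out of the
other pages with it. The {{HOLE}} marker, normalize() and extract() are
hypothetical names, and the whitespace cleanup is a crude substitute for tidy:

```python
import re

HOLE = "{{HOLE}}"  # hypothetical placeholder marker


def normalize(html):
    # Crude stand-in for HTML Tidy: drop whitespace between adjacent tags.
    return re.sub(r">\s+<", "><", html.strip())


def extract(template, page):
    # Escape the fixed parts of the template and capture whatever fills
    # each hole. Returns None when the page does not fit the template.
    pieces = [re.escape(p) for p in normalize(template).split(HOLE)]
    pattern = re.compile("(.*?)".join(pieces), re.DOTALL)
    match = pattern.fullmatch(normalize(page))
    return list(match.groups()) if match else None


template = "<h1>{{HOLE}}</h1>\n<p>{{HOLE}}</p>"
page = "<h1>Hello</h1>   <p>World</p>"
print(extract(template, page))
```

A page that does not follow the template's structure gives None instead of a
partial result, which makes the odd ones easy to spot.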

--
thebigdog

