[UPHPU] Extracting templates from web pages
thebigdog
bigdog at venticon.com
Wed Apr 2 17:43:50 MDT 2008
>> Adrian Holovaty (creator of ChicagoCrime.org and Django) has a Python
>> script called templatemaker[1][2], which in theory would do what I want. You
>> feed it a bunch of similar web pages and it produces a template with "holes"
>> where the data was different across each web page. In practice, it's too
>> granular; it doesn't recognize HTML. It looks at every I don't care about
>> spaces between tags. I only care about substantial content differences
>> across pages. Everything else can be moved to the template.
>
> you could try running everything through HTML Tidy first, see if that
> normalizes whitespace and such. then run templatemaker and see how
> that works out.
you could use a diff program to find out where the pages differ and then kinda
do the reverse to come up with the similarities... however, i would do it after
running it all through tidy first.
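a rough sketch of that diff idea, assuming Python's difflib and a crude regex
tokenizer (both are my picks for illustration, not how templatemaker works).
the names HOLE, tokenize and make_template are made up. it compares two pages
tag by tag, keeps whatever is equal, and collapses each differing run into one
hole. whitespace-only text between tags is thrown away so it never counts as a
difference:

```python
import difflib
import re

HOLE = "{{HOLE}}"  # hypothetical placeholder marking where pages differ


def tokenize(html):
    """Split HTML into tags and text runs, dropping whitespace-only runs."""
    tokens = re.findall(r"<[^>]+>|[^<]+", html)
    return [t.strip() for t in tokens if t.strip()]


def make_template(page_a, page_b):
    """Return the shared tokens of two pages, with HOLE where they differ."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    template = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical in both pages, so it belongs to the template.
            template.extend(a[i1:i2])
        elif not template or template[-1] != HOLE:
            # Any replace/insert/delete run becomes a single hole.
            template.append(HOLE)
    return template


if __name__ == "__main__":
    one = "<html><body>\n <h1>Alice</h1>\n<p>Age: 30</p></body></html>"
    two = "<html><body><h1>Bob</h1>  <p>Age: 41</p>\n</body></html>"
    print("".join(make_template(one, two)))
```

for more than two pages you could fold the result back in as one of the inputs,
but i have not tried that on real pages, so treat it as a starting point only.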
If it were up to me, i would take 1 page, create a template from it, and then
extract all the data you need to populate the other pages with that template.
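something like this is what i mean by the one-template approach, as a minimal
sketch in Python. it assumes you hand-write the template from one page and mark
the variable spots with a made-up {{HOLE}} marker. normalize and extract are
hypothetical helper names. the template is turned into a regex and matched
against each other page (ideally after tidy) to pull out the hole contents:

```python
import re

HOLE = "{{HOLE}}"  # hypothetical marker for a variable region in the template


def normalize(html):
    """Trim the ends and drop whitespace that sits between adjacent tags."""
    return re.sub(r">\s+<", "><", html.strip())


def extract(template, page):
    """Return the text found at each hole, or None if the page does not fit."""
    literals = [re.escape(part) for part in normalize(template).split(HOLE)]
    # Each hole becomes a non-greedy capture group between the literal parts.
    pattern = "(.*?)".join(literals)
    match = re.fullmatch(pattern, normalize(page), re.DOTALL)
    if match is None:
        return None
    return [group.strip() for group in match.groups()]


if __name__ == "__main__":
    template = "<html><body><h1>{{HOLE}}</h1><p>Age: {{HOLE}}</p></body></html>"
    page = "<html>\n<body>\n  <h1>Carol</h1>\n  <p>Age: 27</p>\n</body>\n</html>\n"
    print(extract(template, page))
```

a plain regex like this only works when the pages really do share the exact
same markup outside the holes. if they don't, you would want a real HTML parser
instead.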
--
thebigdog