[UPHPU] Extracting templates from web pages
Richard K Miller
richardkmiller at gmail.com
Thu Apr 3 11:06:24 MDT 2008
On Apr 2, 2008, at 4:43 PM, thebigdog wrote:
>>> Adrian Holovaty (creator of ChicagoCrime.org and Django) has a Python
>>> script called templatemaker[1][2], which in theory would do what I want.
>>> You feed it a bunch of similar web pages and it produces a template with
>>> "holes" where the data was different across each web page. In practice,
>>> it's too granular; it doesn't recognize HTML. It looks at every I don't
>>> care about spaces between tags. I only care about substantial content
>>> differences across pages. Everything else can be moved to the template.
>> you could try running everything through HTML Tidy first, see if that
>> normalizes whitespace and such. then run templatemaker and see how
>> that works out.
>
> you could use a diff program to find out where they are different
> and then kinda do the reverse and come up with the
> similarities...however i would do it after running it all through
> tidy first.
>
> If it was up to me then i would look at taking 1 page and creating a
> template from it and then extract all the data you need to populate
> other pages with that template.
Thanks, Justin and Ray. Good ideas.
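
For anyone following along, here is a rough sketch of how the two
suggestions might combine: normalize the whitespace first, then diff the
pages at the level of whole tags and text runs instead of characters, and
keep only what is shared. This is not how templatemaker works internally,
and it does not call HTML Tidy; the regex tokenizer is a crude stand-in for
a real HTML parser, and the names normalize, tokenize, merge, make_template
and the {{HOLE}} marker are all made up for illustration.

```python
import re
from difflib import SequenceMatcher

HOLE = "{{HOLE}}"  # hypothetical placeholder for content that varies


def normalize(html):
    """Drop whitespace between tags and collapse other whitespace runs."""
    html = re.sub(r">\s+<", "><", html)
    return re.sub(r"\s+", " ", html).strip()


def tokenize(html):
    """Split markup into whole tags and text runs (assumes simple HTML)."""
    return [t for t in re.findall(r"<[^>]*>|[^<]+", html) if t.strip()]


def merge(template, page_tokens):
    """Keep tokens common to both inputs; turn each differing run into one hole."""
    matcher = SequenceMatcher(None, template, page_tokens, autojunk=False)
    merged = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            merged.extend(template[i1:i2])
        elif not merged or merged[-1] != HOLE:
            merged.append(HOLE)  # collapse adjacent differences into one hole
    return merged


def make_template(pages):
    """Fold a list of similar pages into one template string with holes."""
    token_lists = [tokenize(normalize(p)) for p in pages]
    template = token_lists[0]
    for tokens in token_lists[1:]:
        template = merge(template, tokens)
    return "".join(template)


if __name__ == "__main__":
    page_a = "<html> <body>\n<h1>Alice</h1> <p>Age: 30</p></body></html>"
    page_b = "<html><body><h1>Bob</h1><p>Age: 41</p>  </body></html>"
    print(make_template([page_a, page_b]))
```

Because the comparison works on tokens, spacing differences between tags
never show up as holes; only tags or text runs that actually differ do.
Pages with heavily repeated markup or inline scripts would likely need a
proper parser rather than the regex above.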