[UPHPU] Extracting templates from web pages

Richard K Miller richardkmiller at gmail.com
Thu Apr 3 11:06:24 MDT 2008


On Apr 2, 2008, at 4:43 PM, thebigdog wrote:
>>> Adrian Holovaty (creator of ChicagoCrime.org and Django) has a Python
>>> script called templatemaker[1][2], which in theory would do what I
>>> want. You feed it a bunch of similar web pages and it produces a
>>> template with "holes" where the data was different across each web
>>> page. In practice, it's too
>>> granular; it doesn't recognize HTML. It looks at every character, but
>>> I don't care about spaces between tags. I only care about substantial
>>> content differences across pages. Everything else can be moved to the
>>> template.
>> You could try running everything through HTML Tidy first and see if
>> that normalizes whitespace and such. Then run templatemaker and see
>> how that works out.
>
> You could use a diff program to find out where they are different
> and then kinda do the reverse and come up with the
> similarities... However, I would do it after running it all through
> Tidy first.
>
> If it was up to me, then I would look at taking 1 page and creating a
> template from it and then extract all the data you need to populate
> other pages with that template.

Thanks, Justin and Ray. Good ideas.
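
A rough sketch of the diff idea above, in Python since templatemaker is a
Python script. It is not templatemaker's algorithm and it does not call
HTML Tidy. As a stand-in it splits each page into tags and text runs with a
naive regex, drops whitespace-only runs, and diffs the token lists with the
standard library's difflib. The names HOLE, tokenize, make_template and
extract are made up for illustration.

```python
import re
from difflib import SequenceMatcher

HOLE = "{{}}"  # hypothetical marker for a spot where the pages differ


def tokenize(html):
    """Split HTML into tags and text runs, ignoring whitespace-only runs.

    This is a naive regex split, not a real HTML parser.
    """
    tokens = re.findall(r"<[^>]+>|[^<]+", html)
    # Collapse internal whitespace so spacing differences never count.
    return [" ".join(t.split()) for t in tokens if t.strip()]


def make_template(page_a, page_b):
    """Keep the tokens both pages share; put one HOLE where they differ."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    template = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            template.extend(a[i1:i2])
        elif not template or template[-1] != HOLE:
            template.append(HOLE)
    return template


def extract(template, page):
    """Return the text a page has in place of each HOLE in the template."""
    tokens = tokenize(page)
    matcher = SequenceMatcher(None, template, tokens, autojunk=False)
    values = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and HOLE in template[i1:i2]:
            values.append(" ".join(tokens[j1:j2]))
    return values
```

Because the diff works on whole tags and text runs rather than characters,
a changed number inside a sentence turns the whole text run into a hole.
That is coarser than templatemaker, which may or may not be what you want.
A hole that is empty on some page is not reported by extract in this sketch.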


