[UPHPU] Web site scraping

Nathan Lane nathamberlane at gmail.com
Fri Sep 26 09:29:10 MDT 2008


Thank you, everyone, for your replies. I will look into tidy (which I've
heard of, but know nothing about). My main purpose is to scrape data, not to
produce complete XML, but XPath seems to me to be the most reliable approach,
as that is what I finally settled on in C#.
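
For anyone reading this thread later, here is a minimal sketch of the
approach Alvaro describes further down (tidy, then DOMDocument, then XPath).
The sample markup, the "prices" table id, the class names, and the
scrape_rows() helper are all made up for illustration. I'm also assuming
tidy is optional: the sketch only calls it when the extension is loaded,
and otherwise relies on loadHTML() being tolerant of sloppy markup.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
// Note the bare <br>, which would break a strict XML parser.
$html = '<html><body><table id="prices">'
      . '<tr><td class="name">Widget<br></td><td class="price">9.99</td></tr>'
      . '<tr><td class="name">Gadget</td><td class="price">19.50</td></tr>'
      . '</table></body></html>';

// Illustrative helper (not from any library): returns [name => price].
function scrape_rows(string $html): array
{
    // Clean the markup with tidy first, if ext/tidy is available.
    if (function_exists('tidy_repair_string')) {
        $clean = tidy_repair_string(
            $html,
            ['output-xhtml' => true, 'numeric-entities' => true],
            'utf8'
        );
        if ($clean !== false) {
            $html = $clean;
        }
    }

    // Load into a DOMDocument, collecting parser warnings instead of
    // printing them.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Query with XPath; the inner expressions are relative to each row.
    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//table[@id="prices"]//tr') as $tr) {
        $name  = trim($xpath->evaluate('string(td[@class="name"])', $tr));
        $price = trim($xpath->evaluate('string(td[@class="price"])', $tr));
        $rows[$name] = (float) $price;
    }
    return $rows;
}

print_r(scrape_rows($html));
```

Anchoring the query on an id and class names, rather than on the position
of elements in the page, is what I understand Alvaro to mean by patterns
that are "more resistant to changes to the website".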

On Thu, Sep 25, 2008 at 2:43 PM, Richard K Miller
<richardkmiller at gmail.com> wrote:

> Oops, I just noticed that the link Alvaro sent refers to the same
> SimpleTest (not SimpleUnit) framework that I mentioned. Well not exactly,
> but it uses the same base code. The owner of lastcraft.com is the creator
> of SimpleTest. My bad.
>
> Richard
>
> On Sep 25, 2008, at 2:40 PM, Richard K Miller wrote:
>
>> In the past I've used regular expressions, but after hearing Alvaro
>> mention tidy+xpath at a UPHPU meeting, I started using that. I've loved it.
>> SimpleXML is easy to use. I haven't ventured into XSLT, like Ray suggested,
>> but tidy+xpath has been great.
>>
>> On a similar note, I've been looking at SimpleUnit's Web Testing module
>> and it seems pretty powerful. You can use it for far more than unit testing.
>> It's like a scriptable browser, in which you can "click" links, fill out
>> forms, work with cookies, etc. The example on the website shows how to
>> perform an automated Google search:
>>
>> http://www.simpletest.org/en/start-testing.html#web
>>
>> Richard
>>
>>
>>
>> On Sep 25, 2008, at 9:44 AM, Alvaro Carrasco wrote:
>>
>>> I forgot one thing: Scriptable Browser.
>>> http://www.lastcraft.com/browser_documentation.php
>>>
>>> This makes it really easy to deal with forms, authentication, clicking
>>> on links, etc.
>>>
>>> Seriously, the combination of scriptable browser, tidy, and xpath makes
>>> scraping a piece of cake.
>>>
>>> Alvaro
>>>
>>> Alvaro Carrasco wrote:
>>>
>>>> In my experience, the easiest way is: run website through tidy, load it
>>>> into a DOMDocument, and use xpath.
>>>>
>>>> The xpath patterns are SO much easier to read and write than regex and
>>>> more resistant to changes to the website (if you write them correctly).
>>>> You can also use regex within xpath if you ever need it.
>>>>
>>>>
>>>>
>


-- 
Nathan Lane
Home, http://www.nathandelane.com
Blog, http://nathandelane.blogspot.com

