Monday, 6 May 2013

How to Use PHP's DOMDocument to Scrape a Web Page

I've been working on an SEO addon for concrete5. The issue I'm trying to solve right now is 'how to strip all irrelevant tags and content from the HTML and just return the text', aka web page scraping.

Bang head on desk.

Many Different Paths

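First, I tried PHP's strip_tags function. Um, no. It removes the tags themselves but leaves whatever was between them, so the contents of script and style blocks end up in the text.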
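
To show what goes wrong, here is a minimal sketch. The sample markup is made up for illustration and is not from a real page.

```php
<?php
// Hypothetical sample markup, invented for this example
$html = '<p>Hello</p><script>var x = 1;</script><style>p { color: red; }</style>';

// strip_tags() drops the tags but keeps everything between them
$text = strip_tags($html);

echo $text; // prints: Hellovar x = 1;p { color: red; }
```

The JavaScript and CSS survive as if they were page text, which is no good for keyword analysis.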

Next, I tried regular expressions. They were really clumsy, long-winded and not 100% effective.

Then I tried PHP's Document Object Model classes. They seemed magical.

PHP's Document Object Model Classes

The DOM (Document Object Model) classes allow you to:

    Parse HTML and XML documents;
    Traverse the DOM of those documents;
    Add and remove nodes within the DOM;
    Query the DOM using XPath.
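
Here is a rough sketch of those four operations in one place. The markup and variable names are invented for the example, so treat it as an illustration rather than a recipe.

```php
<?php
// Hypothetical sample markup, invented for this example
$html = '<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p>'
      . '<script>var x = 1;</script></body></html>';

// Parse: collect libxml warnings internally instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Traverse: walk the direct children of <body>
$body = $dom->getElementsByTagName('body')->item(0);
$tags = array();
foreach ($body->childNodes as $child) {
    $tags[] = $child->nodeName;
}

// Remove: detach the <script> element from its parent
$script = $dom->getElementsByTagName('script')->item(0);
$script->parentNode->removeChild($script);

// Add: create a new <p> and append it to <body>
$body->appendChild($dom->createElement('p', 'Added paragraph.'));

// Query: count the <p> elements using XPath
$xPath = new DOMXPath($dom);
$count = $xPath->query('//p')->length;
```

After this runs, $tags holds the names of the original children of body, the script element is gone, and $count covers both the original and the added paragraph.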

So after some time working with the DOM classes, I created this function to scrape the text from a real HTML document:

    function getTextFromHTML($html = '') {
        // Nothing to parse, and loadHTML() complains about empty input
        if (trim($html) === '') {
            return '';
        }

        // An array of words that should be removed
        // from the resultant text
        $stopWords = array(' ');

        // Initially remove the script tags using regex (there were some
        // issues if I didn't do this)
        $html = preg_replace('/<script.*?script>/is', '', $html);

        // Load the $html into a DOMDocument object
        $dom = new DOMDocument();
        $dom->preserveWhiteSpace = false;
        //libxml_use_internal_errors(true);
        $dom->loadHTML(strtolower($html));

        // Strip out scripts if there are any left. The node list is live,
        // so copy it into an array before removing anything.
        $scripts = iterator_to_array($dom->getElementsByTagName('script'));
        foreach ($scripts as $script) {
            $script->parentNode->removeChild($script);
        }

        // Strip out style blocks (same copy-first approach)
        $styles = iterator_to_array($dom->getElementsByTagName('style'));
        foreach ($styles as $style) {
            $style->parentNode->removeChild($style);
        }

        // Go through the resultant $html and get all text nodes
        $xPath = new DOMXPath($dom);
        $textNodes = $xPath->evaluate('//text()');
        $text = "";
        foreach ($textNodes as $textNode) {
            // Do some magic on the gathered text
            $nodeValue = strtolower($textNode->nodeValue);
            $nodeValue = str_replace($stopWords, ' ', $nodeValue);
            $nodeValue = preg_replace("/[.:()\/\$\'\#]/", ' ', $nodeValue);
            // Keep only letters, digits, spaces, dots, underscores and
            // hyphens (the hyphen goes last so it isn't read as a range)
            $nodeValue = preg_replace('/[^a-z0-9 ._-]/', '', $nodeValue);
            $nodeValue = trim($nodeValue);
            // Compare with '' because empty('0') is true
            if ($nodeValue !== '') {
                $text .= $nodeValue . " ";
            }
        }
        return $text;
    }
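
One thing to watch when removing nodes: getElementsByTagName returns a live list, so removing elements while looping over it directly can skip some of them. A small sketch of the copy-first approach, using made-up markup:

```php
<?php
// Hypothetical sample markup, invented for this example
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><body><script>a</script><script>b</script>'
    . '<script>c</script><p>text</p></body></html>'
);

// Copy the live DOMNodeList into a plain array, then remove each node
foreach (iterator_to_array($dom->getElementsByTagName('script')) as $script) {
    $script->parentNode->removeChild($script);
}

// Count whatever <script> elements are still in the document
$remaining = $dom->getElementsByTagName('script')->length;
```

Because the loop runs over a snapshot, every script element gets removed and $remaining ends up as 0.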

I'm almost sure there is a better way to do this, so if you have any suggestions, let me know in the comments.

Source: http://skybluesofa.com/blog/how-use-phps-domdocument-scrape-web-page/
