Thursday, 30 May 2013

Unraveling data scraping: Understanding how to scrape data can facilitate journalists' work

Ever heard of "data scraping?" The term may seem new, but programmers have been using this technique for quite a while, and now it is attracting the attention of journalists who need to access and organize data for investigative reporting.

Scraping is a way of retrieving data from websites and placing it in a simple, flexible format so it can be cross-analyzed more easily. Often the information needed to support a story is available, but it is buried in websites that are hard to navigate or in databases that are hard to use. To collect and display this information automatically, reporters turn to computer programs known as "scrapers."
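To make the idea concrete, here is a minimal sketch of what such a scraper does, written in PHP with invented HTML: it pulls the rows out of a table and rewrites them as comma-separated values that a spreadsheet can open.

```php
<?php
// Minimal scraper sketch (hypothetical data): turn an HTML table
// into comma-separated values.
function tableToCsv($html) {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real pages are rarely valid HTML
    $dom->loadHTML($html);
    $rows = array();
    foreach ($dom->getElementsByTagName('tr') as $tr) {
        $cells = array();
        foreach ($tr->childNodes as $node) {
            if ($node->nodeName === 'td' || $node->nodeName === 'th') {
                $cells[] = trim($node->textContent);
            }
        }
        $rows[] = implode(',', $cells);
    }
    return implode("\n", $rows);
}

// In practice the HTML would come from a live page, e.g.
// $html = file_get_contents('http://example.gov/budget.html');
$html = '<table><tr><th>Agency</th><th>Budget</th></tr>'
      . '<tr><td>Health</td><td>1200</td></tr></table>';
echo tableToCsv($html);
```

From here the output can be saved as a .csv file and opened in any spreadsheet for cross-analysis.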

Even though it may seem like a "geek" thing, journalists don't need to take advanced programming courses or master a complicated language in order to scrape data. According to hacker Pedro Markun, who has worked on several data scraping projects for the House of Digital Culture in São Paulo, the level of knowledge necessary to use this technique is "very basic."

“Scrapers are programs easy to handle. The big challenge and constant exercise is to find a pattern in the web pages' data - some pages are very simple, others are a never-ending headache," said Markun in an interview with the Knight Center for Journalism in the Americas.

Markun has a public profile on Scraperwiki, a website that allows you to create your own data scrapers online or to access those written by others.

Like Scraperwiki, other online tools exist to facilitate data scraping, such as Mozenda, software with a simple interface that automates most of the work, and Screen Scraper, a more complex tool that works with several programming languages to extract data from the Web. The Firebug extension for Firefox is also useful for inspecting a page's structure before scraping it.

Likewise, Google offers Google Refine, a program for cleaning up messy data and converting it into more manageable formats.

Journalists can also download Ruby, a simple and efficient programming language, for free and use its Nokogiri library to scrape documents and websites.

Data is not always available in open formats or easy to scrape. Scanned documents, for example, need to be converted to machine-readable text. For this there is Tesseract, an OCR (Optical Character Recognition) tool from Google that "reads" scanned texts and converts them to editable text.

Information and guidelines about the use of these tools are available on websites such as ProPublica, which offers several articles and tutorials on scraping tools for journalism. YouTube videos can also prove a helpful source.

Even if you have adopted the hacker philosophy and tend to learn by reading tutorials or working hands-on, you may still run into doubts or difficulties when using these tools. If so, a good option is to contact more experienced programmers through discussion groups such as Thackday and the Scraperwiki community, which offer both free and paid alternatives for finding someone to help with a scraping job.

While navigating databases might be old school for some journalists, knowing how to retrieve and organize data has only gained in importance in an age of information overload, making data-scraping skills like these all the more worthwhile.


Source: https://knightcenter.utexas.edu/blog/00-9676-unraveling-data-scraping-understanding-how-scrape-data-can-facilitate-journalists-work

Monday, 27 May 2013

PHP HTTP Screen-Scraping Class with Caching


class_http.php is a "screen-scraping" utility that makes it easy to scrape content and cache scraped content for any number of seconds desired before hitting the live source again. Caching makes you a good neighbor!
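The caching idea can be sketched in a few lines of plain PHP. This is not class_http.php's actual API, just the pattern it describes: serve a local copy until it is older than a chosen number of seconds, and only then hit the live source again.

```php
<?php
// Hedged sketch of time-based caching for scraped content.
function fetchCached($url, $ttl = 300) {
    $cacheFile = sys_get_temp_dir() . '/scrape_' . md5($url);
    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile);   // fresh enough: skip the live hit
    }
    $body = file_get_contents($url);            // hit the live source
    if ($body !== false) {
        file_put_contents($cacheFile, $body);   // remember it for next time
    }
    return $body;
}
```

Every request inside the TTL window is served from disk, which is what makes a scraper a good neighbor to the sites it reads.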

The class has 2 static methods that make it easy to extract individual tables of data out of web pages. The class even comes with a companion script that makes it easy to use and cache external images directly within img elements.

The class cloaks itself as the User Agent of the user making the request to your script. It also sends your script as the Referer, since in essence, it is the referrer. This means you should be able to screen-scrape sites that normally block screen-scraping. This class is not meant to help you break any company's usage policies. Be a good neighbor, and always use caching when you can.

Need to access protected content? The class can do basic authentication. However, a lot of sites that require login do not use basic authentication.
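For reference, basic authentication itself is just a base64-encoded header. A plain-PHP sketch, independent of class_http.php and using a placeholder URL, looks like this:

```php
<?php
// Build the HTTP Basic auth header and attach it to a request
// via a stream context.
function basicAuthHeader($user, $pass) {
    return 'Authorization: Basic ' . base64_encode($user . ':' . $pass);
}

function fetchWithBasicAuth($url, $user, $pass) {
    $context = stream_context_create(array(
        'http' => array('header' => basicAuthHeader($user, $pass)),
    ));
    return file_get_contents($url, false, $context);
}

// Hypothetical usage:
// $page = fetchWithBasicAuth('http://example.com/protected/', 'user', 'pass');
```

Sites with form-based logins need cookie and session handling instead, which is why basic auth alone is often not enough.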

The most current information, documentation, and downloads can be found at
http://www.troywolf.com/articles/php/class_http.

There are three complete PHP files listed below. First is the class file, class_http.php. The second is example.php to show you how to use the class. The third file is image_cache.php--a companion script to cache images for use within the src attribute of img elements.

Troy Wolf operates ShinySolutions Webhosting, and is the author of SnippetEdit--a PHP application providing browser-based website editing that even non-technical people can use. "Website editing as easy as it gets." Troy has been a professional Internet and database application developer for over 10 years. He has many years' experience with ASP, VBScript, PHP, Javascript, DHTML, CSS, SQL, and XML on Windows and Linux platforms.


Source: http://www.daniweb.com/web-development/php/code/216547/php-http-screen-scraping-class-with-caching

Friday, 24 May 2013

Web Scraping Benefits SEO

Search Engine Optimization (SEO) is the process of improving the visibility of a website or a web page in a search engine's "natural" or un-paid ("organic" or "algorithmic") search results. The value of having sites highly ranked and visible in search results is widely known, as rankings are the principal drivers of traffic to any website. A site's visibility on search engines could very well be the difference between the success and failure of a business.

One old but still widely used technique that SEO practitioners swear by is on-page optimization: inserting relevant meta tags for keywords and description, along with the right page title. With numerous sites in the same genre trying to outdo each other, and search engine algorithms changing by the day, it is extremely important to monitor your competitors' content, keywords, and title tags. Doing the task manually every day is time-consuming and tedious. A much faster and simpler way is to automate the process using web scraping.

Web scraping, or web data mining, is a technique used to extract data from HTML web pages. It can help greatly in monitoring the titles, keywords, content, and meta tags of competitors' websites. One can quickly get an idea of which keywords are driving traffic to a competitor's website, which content categories are attracting links and user engagement, and what resources it will take to rank your site higher than the competition. This allows you or your SEO practitioner to make changes to your site before it's too late, ensuring that it stays near the top of search results and keeps drawing the traffic that keeps your business on its growth trajectory.
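As a sketch of the monitoring step, the title and meta tags of a competitor's page can be pulled out with PHP's built-in DOM extension (the page content below is invented):

```php
<?php
// Extract the SEO-relevant tags (title, meta keywords, meta description)
// from an HTML page.
function getSeoTags($html) {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate messy real-world markup
    $dom->loadHTML($html);
    $tags = array('title' => '', 'keywords' => '', 'description' => '');
    $titles = $dom->getElementsByTagName('title');
    if ($titles->length > 0) {
        $tags['title'] = trim($titles->item(0)->textContent);
    }
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if (array_key_exists($name, $tags)) {
            $tags[$name] = $meta->getAttribute('content');
        }
    }
    return $tags;
}

// Invented competitor page; in practice: $html = file_get_contents($url);
$html = '<html><head><title>Acme Widgets</title>'
      . '<meta name="keywords" content="widgets, acme">'
      . '<meta name="description" content="Cheap widgets."></head></html>';
print_r(getSeoTags($html));
```

Running this daily against a list of competitor URLs and logging the results is the automation the paragraph above describes.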

For more information, please visit our dedicated site on web scraping.


Source: http://blog.itsyssolutions.com/web-scraping-benefits-seo/

Friday, 17 May 2013

Web Database Scraping

Web data scraping can be defined as a software technique for extracting data from the human-readable output of another program. It simulates human exploration of the World Wide Web, typically by implementing the low-level Hypertext Transfer Protocol (HTTP), and the results are meant to be usable by end users. Web data scraping is also known as web data harvesting or web data extraction.
Data scraping usually ignores binary data such as multimedia and images. It can be used to interface with a third-party system when no convenient API is available, or as an interface to a legacy system that offers no other mechanism compatible with current hardware.
Web pages are built with markup languages such as HTML and XHTML, and they contain a wealth of useful information in text form. Most web pages, however, are designed for human readers, not for ease of automated use. A tool used for data scraping can be termed a data scraper.
Uses of web data scraping include research, web integration, price monitoring, weather data collection, website change detection, and web mashups. The technique may be against the terms of use of some websites. Current approaches favor practical solutions based on existing techniques, providing different kinds of automation that include:

    Human copy-and-paste: even on websites that put up barriers to automation, information can still be copied and pasted manually for further examination.
    HTTP programming: sending HTTP requests to a server and processing the responses.
    Data mining: programs that detect templates containing the same kind of data.
    Web-scraping software: tools built to extract content and transform its format.
    Vertical aggregation
    Semantic annotation recognizing
    HTML parsers
    DOM parsing: programs extract the required parts of pages through the Document Object Model.
    Text grepping: extraction of information based on the UNIX grep command.
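A minimal example of the last item, text grepping: a regular expression pulls prices out of page text, much as grep would, serving the price-monitoring use case mentioned above. The page text here is made up.

```php
<?php
// "Text grepping" in miniature: regex extraction of dollar prices.
function grepPrices($text) {
    preg_match_all('/\$\d+(?:\.\d{2})?/', $text, $matches);
    return $matches[0];
}

$page = 'Widget A now $19.99 (was $25). Widget B: $7.50.';
print_r(grepPrices($page));  // $19.99, $25, $7.50
```

Grepping is fragile when page layout changes, which is why the DOM-parsing approaches above are usually preferred for anything structural.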

Data scraping is usually considered a mechanism of last resort, used when other systems cannot deliver.

Source: http://thewebscraping.com/web-database-scraping/

Monday, 6 May 2013

How to Use PHP's DOMDocument to Scrape a Web Page

I've been working on an SEO addon for concrete5. The issue I'm trying to solve right now is 'how to strip all irrelevant tags and content from the HTML and just return the text', aka web page scraping.

Bang head on desk.
Many Different Paths

First, I tried PHP's strip_tags function. Um, no. That just doesn't work well.

Next, I tried regular expressions. They were really clumsy, long-winded and not 100% effective.

Then I tried PHP's Document Object Model classes. They seemed magical.
PHP's Document Object Model Classes

The DOM (Document Object Model) classes allow you to:

    Parse HTML and XML documents;
    Traverse the DOM of those documents;
    Add and remove nodes within the DOM;
    Query the DOM using XPath.
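A tiny end-to-end illustration of those four capabilities, using invented markup: parse it, remove a node, then query what remains with XPath.

```php
<?php
// Parse, modify, and query a document with PHP's DOM classes.
$dom = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect markup
$dom->loadHTML('<div><p id="keep">Hello</p><p id="drop">Bye</p></div>');

$xpath = new DOMXPath($dom);

// Remove one node...
$drop = $xpath->query('//p[@id="drop"]')->item(0);
$drop->parentNode->removeChild($drop);

// ...then query what is left.
echo $xpath->query('//p')->item(0)->textContent;   // prints "Hello"
```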

So after some time working with the DOM classes, I created this function to scrape the text from a real HTML document:

    function getTextFromHTML($html='') {
        // An array of words that should be removed
        // from the resultant text
        $stopWords = array(' ');

        // Initially remove the script tags using regex (there were some
        // issues if I didn't do this)
        $html = preg_replace('/<script.*?script>/is', '', $html);

        // Load the $html into a DOMDocument object, silencing the
        // warnings that real-world malformed markup triggers
        $dom = new DOMDocument();
        $dom->preserveWhiteSpace = false;
        libxml_use_internal_errors(true);
        $dom->loadHTML(strtolower($html));

        // Strip out scripts if there are any left. (The node list is
        // live, so remove items one at a time rather than in a foreach.)
        $scripts = $dom->getElementsByTagName('script');
        while ($scripts->length > 0) {
            $script = $scripts->item(0);
            $script->parentNode->removeChild($script);
        }

        // Strip out style blocks the same way
        $styles = $dom->getElementsByTagName('style');
        while ($styles->length > 0) {
            $style = $styles->item(0);
            $style->parentNode->removeChild($style);
        }

        // Go through the resultant $html and get all text nodes
        $xPath = new DOMXPath($dom);
        $textNodes = $xPath->evaluate('//text()');
        $text = "";
        foreach ($textNodes as $textNode) {
            // Do some magic on the gathered text
            $nodeValue = strtolower($textNode->nodeValue);
            $nodeValue = str_replace($stopWords, ' ', $nodeValue);
            $nodeValue = preg_replace("/[.:()\/\$\'\#]/", ' ', $nodeValue);
            // (hyphen escaped so the character class isn't read as a range)
            $nodeValue = preg_replace('/[^a-z0-9 \-._]/', '', $nodeValue);
            $nodeValue = trim($nodeValue);
            if (!empty($nodeValue)) {
                $text .= $nodeValue . " ";
            }
        }
        return $text;
    }

I'm almost sure there is a better way to do this, so if you have any suggestions, let me know in the comments.

Source: http://skybluesofa.com/blog/how-use-phps-domdocument-scrape-web-page/

Wednesday, 1 May 2013

Analysis For Website Data Scraping

For personal or business use, data extraction and web scraping techniques are important tools for finding relevant data and information. Many companies still have employees copy and paste data from web pages by hand. The process is reliable, but very expensive, because of the time and effort it takes to achieve results.

The extracted data is usually required in a structured format such as a CSV file, a database, or an XML file. Understanding the correlations and patterns in the data then helps in making policy decisions, and the information can also be stored for future use.

The following are some common examples of data extraction process:

Extracting the names of citizens who responded to a given survey on a government portal
Scraping competitive product and pricing data from websites
Scraping stock photos and videos from websites for web design

Automatic data collection
This means collecting data at regular intervals. By determining trends in the market, it becomes possible to understand and predict changes in customer behavior from the data.

Examples of automated data collection as follows:

Monitoring hourly rates for specific tasks
Collecting mortgage rates from many lenders daily
Running checks at regular intervals where timing is essential

By using web scraping services, you can access all the information relevant to your business. The data can then be downloaded to a spreadsheet or database to be analyzed and compared.

Data extraction services make it possible to constantly track competitors' prices, email addresses, databases, profile data, and statistics.
Different techniques and processes for collecting and analyzing data have developed over time. Web scraping has recently taken hold in the market for business processes. It is a process that delivers large amounts of data from web pages and databases across different sources.

The most common techniques are web crawling, text grepping, regular-expression matching, and DOM parsing; content can also be obtained through HTML parsers or semantic annotation. The main goal of web scraping websites and databases as a service is data collection; in business, it is a way to remain relevant and competitive.

The central question is whether web scraping is relevant. Relevant to business? The answer is yes.

Source: http://www.selfgrowth.com/articles/analysis-for-website-data-scraping

Note:

Rose Marley is an experienced web scraping consultant and writes articles on web data scraping, website data scraping, web scraping services, data scraping services, website scraping, eBay product scraping, forms data entry, etc.

PHP for Web scraping and bot development

Web scraping is a computer science technique for extracting information and data from web sites. Scraping and the analysis of the scraped information are discussed in data mining research. Practically speaking, web scraping is necessary if you want to develop a web application that shows customized information from various websites. For this you have to first scrape data from the sites and then apply some logic to filter the information.

Practically, you can use many different languages to write a program that will automatically search for and collect the information. But if you're a PHP expert and want to use PHP for this kind of work, I am recommending a book that comes with a PHP library. I found the book very helpful for learning the topic, and its library is easy to use.

Source: http://thinkdiff.net/php/php-for-web-scraping-and-bot-development/
