Extract data from Web Scraping C#

I am an ASP.NET MVC developer.

I can already fetch the content of any URL (http, https, etc.) using the WebRequest class.

I have received all the content of that particular url. (for now I took

My next step is to extract buttons, header, footer, colors, text etc.

Here is my code for now:

// model holds a string URL that is entered in a text box; the method runs when the submit button is pressed.
public ActionResult GetContent(UrlModel model)
{
    WebRequest request = WebRequest.Create(model.URL);
    request.Credentials = CredentialCache.DefaultCredentials;

    // Dispose the response, stream and reader once the body has been read.
    using (WebResponse response = request.GetResponse())
    using (Stream dataStream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(dataStream))
    {
        string responseFromServer = reader.ReadToEnd();
        ViewBag.Response = responseFromServer;
    }

    return View();
}

Can someone help me with writing the code?

Please also suggest some techniques for data extraction in C#.


Scrapy, scraping price data from StubHub

I've been having a difficult time with this one.

I want to scrape all the prices listed for this Bruno Mars concert at the Hollywood Bowl so I can get the average price.

I've located the prices in the HTML and the xpath is pretty straightforward but I cannot get any values to return.

I think it has something to do with the content being generated via javascript or ajax but I can't figure out how to send the correct request to get the code to work.

Here's what I have:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from deeptix.items import DeeptixItem

class TicketSpider(BaseSpider):
    name = "deeptix"
    allowed_domains = [""]
    start_urls = [""]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[contains(@class, "q_cont")]')
        items = []
        for site in sites:
            item = DeeptixItem()
            item['price'] = site.xpath('span[contains(@class, "q")]/text()').extract()
            items.append(item)  # without this the spider always returns an empty list
        return items

Any help would be greatly appreciated. I've been struggling with this one for quite some time now. Thank you in advance!


How do you scrape AJAX pages?


All screen scraping first requires a manual review of the page you want to extract resources from. When dealing with AJAX you usually just need to analyze a bit more than the HTML.

AJAX simply means that the value you want is not in the initial HTML document that you requested; instead, JavaScript is executed that asks the server for the extra information you want.

You can therefore usually analyze the JavaScript, see which request it makes, and call that URL directly from the start.


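Take this as an example. Assume the page you want to scrape from has the following script:

<script type="text/javascript">
function ajaxFunction() {
  var xmlHttp;
  try {
    // Firefox, Opera 8.0+, Safari
    xmlHttp = new XMLHttpRequest();
  } catch (e) {
    // Internet Explorer
    try {
      xmlHttp = new ActiveXObject("Msxml2.XMLHTTP");
    } catch (e) {
      try {
        xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
      } catch (e) {
        alert("Your browser does not support AJAX!");
        return false;
      }
    }
  }
  // Reconstructed from the text below: the request goes to time.asp on the same server.
  xmlHttp.open("GET", "time.asp", true);
  xmlHttp.send(null);
}
</script>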

Then all you need to do is make an HTTP request to time.asp on the same server instead. (Example from w3schools.)
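As a rough illustration of calling the AJAX target directly, here is a minimal Python sketch. The page address is a hypothetical placeholder (the answer does not give one); only the `time.asp` target comes from the answer. It resolves the relative target against the page URL and fetches it like any other document.

```python
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def ajax_endpoint(page_url: str, script_target: str) -> str:
    """Resolve the URL that the page's JavaScript requests, relative to the page."""
    return urljoin(page_url, script_target)


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return the response body as text."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical page address; "time.asp" is the target named in the answer.
endpoint = ajax_endpoint("http://www.example.com/ajax/demo.html", "time.asp")
# fetch_text(endpoint) would then return the value the script puts on the page.
```

In practice the browser's network inspector is usually the quickest way to find the real target and any parameters or headers it expects.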


using Perl to scrape a website

I am interested in writing a perl script that goes to the following link and extracts the number 1975:

That page shows the number of white men born in 1923 who lived in San Diego County, California in 1940. I am trying to do this in a loop so that it generalizes over multiple counties and birth years.

In the file, locations.txt, I put the list of counties, such as San Diego County.

The current code runs, but instead of the number 1975 it displays unknown. The number 1975 should end up in $val.

I would very much appreciate any help!


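use strict;
use warnings;

use LWP::Simple;

open(L, "locations26.txt") or die "Cannot open locations26.txt: $!";

my $url = '';

open(O, ">out26.txt") or die "Cannot open out26.txt: $!";
my $oldh = select(O);
$| = 1;           # unbuffer the output file
select($oldh);    # restore the default output handle
while (my $location = <L>) {
    chomp $location;    # drop the trailing newline before building the URL
    $location =~ s/ /+/g;
    foreach my $year (1923..1923) {
        my $u = $url;
        $u =~ s/%LOCATION%/$location/;
        $u =~ s/%YEAR%/$year/;
        #print "$u\n";
        my $content = get($u);
        my $val = 'unknown';
        # get() returns undef on failure, so check before matching
        if (defined $content && $content =~ / of .strong.([0-9,]+)..strong. /) {
            $val = $1;
        }
        $val =~ s/,//g;
        $location =~ s/\+/ /g;
        print "'$location',$year,$val\n";
        print O "'$location',$year,$val\n";
    }
}
close(O);
close(L);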

Update: An API is not a viable solution. I have been in contact with the site developer, and the API does not cover that part of the web page. Hence, any solution relying on JSON will not be applicable.


Data Scraping using php

Here is my code









    <table align="center">
    <tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
    <tr><td>City :</td><td><?php echo $city;?></td></tr>
    <tr><td>State :</td><td><?php echo $state;?></td></tr>
    <tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
    <tr><td>Country :</td><td><?php echo $country;?></td></tr>

How do I find out the ISP provider of a person viewing a PHP page?

Is it possible to use PHP to track or reveal it?


Curl Scraping

$url = '';  // target URL (elided in the post)
$curl_handle = curl_init();  // the handle must exist before any curl_setopt() call
curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl_handle, CURLOPT_URL, $url);
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20080623 Firefox/") );
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Your application name');
$query = curl_exec($curl_handle);
curl_close($curl_handle);

echo $query;




<table align="center">
<tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
<tr><td>City :</td><td><?php echo $city;?></td></tr>
<tr><td>State :</td><td><?php echo $state;?></td></tr>
<tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
<tr><td>Country :</td><td><?php echo $country;?></td></tr>


What is wrong with my code here? Is there any alternative code that I can use?

I am not able to scrape the data as described here.

P.S. Please post full code. It would be easier for me to understand.


PDF scraping using R

I have been using the XML package successfully for extracting HTML tables but want to extend this to PDFs. From previous questions it does not appear that there is a simple R solution, but I wondered if there had been any recent developments.

Failing that, is there some way in Python (in which I am a complete novice) to obtain and manipulate PDFs so that I could finish the job off with the R XML package?

Extracting text from PDFs is hard, and nearly always requires lots of care.

I'd start with the command line tools such as pdftotext and see what they spit out. The problem is that PDFs can store the text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined up 'ff' and 'ij' that you see in proper typesetting) to throw you.

pdftotext can be installed on any Linux system.
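Since the question also asks about Python, here is a minimal sketch of driving pdftotext from a script. It assumes pdftotext is installed and on the PATH; the function names are made up for illustration. The `-layout` option asks pdftotext to keep the physical layout of the page, and `-` as the output file sends the text to standard output.

```python
import subprocess


def pdftotext_command(pdf_path: str, keep_layout: bool = True) -> list:
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")  # try to preserve the physical page layout
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path: str) -> str:
    """Run pdftotext (must be installed and on PATH) and return its output."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        check=True,           # raise if pdftotext exits with an error
        capture_output=True,  # collect stdout instead of printing it
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The returned text can be written to a file and read from R with readLines; expect to inspect the output by hand first, for the reasons given above.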


Obtaining reddit data

I am interested in obtaining data from different reddit subreddits. Does anyone know if there is a reddit (or other) API, similar to Twitter's, for crawling all the pages?

Yes, reddit has an API that can be used for a variety of purposes such as data collection, automatic commenting bots, or even to assist in subreddit moderation.

There are a few places to discover information on reddit's API:

    github reddit wiki -- provides the overview and rules for using reddit's API (follow the rules)
    automatically generated API docs -- provides information on the requests needed to access most of the API endpoints
    /r/redditdev -- the reddit community dedicated to answering questions both about reddit's source code and about reddit's API

If there is a particular programming language you are already familiar with, you should check out the existing set of API wrappers for various languages. Despite my bias (I am the package maintainer) I am quite certain PRAW, for Python, has support for the largest number of reddit API features.
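For a sense of what the raw API looks like without a wrapper, here is a small standard-library sketch. It assumes the public JSON listing for a subreddit (`/r/<name>/new.json`) is reachable without authentication, which may no longer hold and is subject to the API rules mentioned above; the user agent string and function names are hypothetical. A wrapper such as PRAW handles authentication and rate limits for you and is the better choice for anything beyond a quick look.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def listing_url(subreddit: str, limit: int = 25) -> str:
    """URL of the JSON listing of the newest posts in a subreddit."""
    return f"https://www.reddit.com/r/{quote(subreddit)}/new.json?limit={limit}"


def titles_from_listing(listing: dict) -> list:
    """Pull the post titles out of a decoded listing response."""
    return [child["data"]["title"] for child in listing["data"]["children"]]


def fetch_titles(subreddit: str, limit: int = 25,
                 user_agent: str = "example-script/0.1 (hypothetical)") -> list:
    """Download one listing page and return its post titles."""
    request = Request(listing_url(subreddit, limit),
                      headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return titles_from_listing(json.load(response))
```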


Scraping data in dynamic sites

I'm trying to scrape data from our local government. What I want is the addresses of child adoption offices. Here in Brazil, all adoptions go through the government. I have the URL of one office, and there are two or three thousand more, but if I can manage to get one, the others will be easy. I made many attempts; below I show three.

The problem could be related to JavaScript (Ajax, maybe) that refreshes the page.

Note: I am not a PHP developer.

First attempt

echo '<html><head></head><body>';
echo '<h1>Scraper PHP GET 1</h1>';

echo ini_get("allow_url_fopen");
echo ini_get("allow_url_fopen");

// I used this url for test
//$url = '';

//This is the URL that I really want
$url = '';

$html = file_get_contents($url);

echo '</body></html>';

// Output
// 11
// Warning:
transacao=CONSULTA&vara=2673) [function.file-get-contents]: failed to open stream: HTTP
request failed! HTTP/1.1 404 Not Found in /home/rsl/www/sc01_get.php on line 14
// bool(false)

Second attempt

echo '<html><head></head><body>';
echo '<h1>Scraper PHP CURL 3</h1>';

// I used this url for test
//$url = '';

//This is the URL that I really want
$url = '';

$curl = curl_init($url);
@curl_setopt($curl, CURLOPT_POSTFIELDS, "foo");
@curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
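@curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "POST");

// Reconstructed: the curl_exec() call and the closing braces are missing from the post.
$html = curl_exec($curl);

if (!$html) {
    echo "<br />cURL error number:" .curl_errno($curl);
    echo "<br />cURL error:" . curl_error($curl);
} else {
    echo '<br>begin HTML[';
    echo  $html;
    echo '<br>]end html ';
}
echo '</body></html>';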

// Output
// 1

Third attempt

function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.6 (KHTML, like Gecko) Chrome/16.0.897.0 Safari/535.6');
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_REFERER, "");

    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

echo '<html><head></head><body>';
echo '<h1>Scraper PHP CURL 5</h1>';

// I used this url for test
//$url = '';

//This is the URL that I really want
$url = '';

$curl = curl_init($url);
@curl_setopt($curl, CURLOPT_POSTFIELDS, "foo");
@curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
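@curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "POST");

// Reconstructed: the curl_exec() call and the closing braces are missing from the post.
$html = curl_exec($curl);

if (!$html) {
    echo "<br />cURL error number:" .curl_errno($curl);
    echo "<br />cURL error:" . curl_error($curl);
} else {
    echo '<br>begin HTML[';
    echo  $html;
    echo '<br>]end html ';
}
echo '</body></html>';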

// Output
// cURL error number:0
// cURL error:

If the pages are really Ajax-based, meaning the information you need to scrape is loaded or shown through JavaScript execution, you will need another approach: automating a real browser. You can go the Selenium route, which can be written in a number of languages, or use CasperJS with JavaScript as the programming language.


What is the right way of storing screen-scraping data?

I'm working on a web site that scrapes product details (names, features, prices, etc.) from various web sites, then processes and displays them. I'm considering running an update script each day to keep the data fresh.

    scrape data
    process it
    store it in the database
    read (from db) and display it

I'm already storing all the data in a SQL schema, but I'm not sure about it. After each update, all the old records vanish. If the newly scraped data comes in corrupted somehow, there is nothing to show.

So, is there a common way to archive the old data? Which is more convenient: separate SQL schemas or XML files? Or something else?
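One possible approach, sketched here in Python with SQLite purely for illustration, is to keep every scrape as a separate run in the same schema and display only the latest run that passed validation. The table names, column names and the validity rule below are all assumptions, not something from the question.

```python
import sqlite3
from datetime import datetime, timezone


def init(conn):
    """Create the two tables: one row per scrape run, many products per run."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            scraped_at TEXT NOT NULL,
            is_valid INTEGER NOT NULL
        );
        CREATE TABLE IF NOT EXISTS products (
            run_id INTEGER NOT NULL REFERENCES runs(id),
            name TEXT NOT NULL,
            price REAL
        );
    """)


def store_run(conn, products):
    """Insert one scrape as a new run; earlier runs are never overwritten."""
    # Assumed validity rule: non-empty, and every product has a name and a price.
    is_valid = int(bool(products) and all(
        p.get("name") and p.get("price") is not None for p in products))
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT INTO runs (scraped_at, is_valid) VALUES (?, ?)",
            (datetime.now(timezone.utc).isoformat(), is_valid))
        run_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO products (run_id, name, price) VALUES (?, ?, ?)",
            [(run_id, p.get("name") or "", p.get("price")) for p in products])
    return run_id


def latest_valid(conn):
    """Products from the most recent run that passed validation."""
    return conn.execute("""
        SELECT name, price FROM products
        WHERE run_id = (SELECT MAX(id) FROM runs WHERE is_valid = 1)
        ORDER BY name
    """).fetchall()
```

With this layout a corrupted update is still stored for inspection, but the site keeps showing the previous good run; old runs can be pruned on whatever schedule suits the data volume.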


Scraping dynamic data

I am scraping profiles on for a research question. The problem is that only the top most recent questions are viewable and I have to click "view more" to see the next 15.

The source code for clicking view more looks like this:

<input class="submit-button-more submit-button-more-active" name="commit" onclick="return Forms.More.allowSubmit(this)" type="submit" value="View more" />

What is an easy way of calling this 4 times before scraping the page? I want the most recent 60 posts on the site. Python is preferable.

You could probably use selenium to browse to the website and click on the button/link a few times. You can get that here:

Or you might be able to do it with mechanize:

I have also heard good things about twill, but never used it myself:
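For the Selenium route, a minimal sketch might look like the following. It assumes a Selenium 4 style driver, where `find_elements` takes a locator strategy and a value; the CSS selector is derived from the button's class in the question, and the fixed pause is a crude stand-in for a proper wait. The function name is made up.

```python
import time


def click_view_more(driver, times=4, pause_seconds=2.0,
                    selector="input.submit-button-more"):
    """Click the "View more" button up to `times` times; return the click count."""
    clicks = 0
    for _ in range(times):
        # "css selector" is the locator strategy string Selenium uses for CSS.
        buttons = driver.find_elements("css selector", selector)
        if not buttons:
            break  # nothing left to load
        buttons[0].click()
        clicks += 1
        time.sleep(pause_seconds)  # crude wait for the next batch to render
    return clicks
```

After creating a driver (for example `webdriver.Firefox()`) and loading the profile page with `driver.get(...)`, calling `click_view_more(driver)` and then reading `driver.page_source` should give the HTML with the extra posts included.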


Web Scraping data from different sites

I am looking for a few ideas on how can I solve a design problem I'm going to be faced with building a web scraper to scrape multiple sites. Writing the scraper(s) is not the problem, matching the data from different sites (which may have small differences) is.

For the sake of being generic assume that I am scraping something like this from two or more different sites:

    public class Data {
        public int id;
        public String firstname;
        public String surname;
    }

If I scrape this from two different sites, I will encounter a situation where I could have the following:

Site A: id=100, firstname=William, surname=Doe

Site B: id=1974, firstname=Bill, surname=Doe

Essentially, I would like to consider these two sets of data the same (they are the same person but with their name slightly different on each site). I am looking for possible design solutions that can handle this.

The only idea I've come up with is scraping the data from a third location and using it as a reference list. Then when I scrape site A or B I can, over time, build up a list of failures and store them in a list for each scraper so that it knows how to resolve them (if I find id=100 then I know that the firstname will be William, etc.). I can't help but feel this is a rubbish idea!

If you need any more info, or if you think my description is a bit naff, let me know!
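To make the reference-list idea concrete, here is a small Python sketch (the question's class is Java; Python is used here only for brevity). Records are compared on a normalized key instead of the site-specific id. The nickname table is a hypothetical stand-in for whatever reference list gets built up over time.

```python
from dataclasses import dataclass

# Hypothetical reference list mapping nicknames to a canonical first name.
CANONICAL_FIRST_NAMES = {"bill": "william", "will": "william", "bob": "robert"}


@dataclass(frozen=True)
class Person:
    id: int
    firstname: str
    surname: str


def match_key(person: Person) -> tuple:
    """Normalized (first name, surname) pair used to compare records across sites."""
    first = person.firstname.strip().lower()
    first = CANONICAL_FIRST_NAMES.get(first, first)
    return (first, person.surname.strip().lower())


def same_person(a: Person, b: Person) -> bool:
    """True when two records normalize to the same key; ids are ignored."""
    return match_key(a) == match_key(b)
```

Names that fail to match can be logged and reviewed by hand, and the table extended as new variants turn up. Matching on name alone will merge distinct people who share a name, so a real system would want an extra field (date of birth, location) in the key.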




Scrape Data Point Using Python

I am looking to scrape a data point using Python off of the url .

The data point I am looking to scrape is the lowest bid offer, which at the current moment looks like this:

 <td><b>Jan. 19, 2014, 2:37 a.m.</b></td>
 <td><b>66.65 CAD</b></td>

The relevant point being the 860.00 . I am looking to build this into a script which can send me an email to alert me of certain price differentials compared to other exchanges.

I'm quite new to this, so if in your explanations you could offer your thought process on why you've done certain things, it would be very much appreciated.

Thank you in advance!

Edit: This is what I have so far which will return me the name of the title correctly, I'm having trouble grabbing the table data though.

import urllib2, sys
from bs4 import BeautifulSoup

site= ""
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup.title

Here is the code for scraping the lowest bid from the 'Buying BTC' table:

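from selenium import webdriver

site = ""  # page URL (elided in the post)

fp = webdriver.FirefoxProfile()
browser = webdriver.Firefox(firefox_profile=fp)
browser.get(site)  # load the page before looking for elements

lowest_bid = float('inf')
elements = browser.find_elements_by_xpath('//div[@id="orderbook_buy"]/table/tbody/tr/td')

for element in elements:
    text = element.get_attribute('innerHTML').strip('<b>|</b>')
    try:
        bid = float(text)
        if lowest_bid > bid:
            lowest_bid = bid
    except ValueError:
        pass  # reconstructed handler: skip cells that are not plain numbers

print lowest_bid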

In order to install Selenium for Python on your Windows-PC, run from a command line:

pip install selenium (or pip install selenium --upgrade if you already have it).

If you want the 'Selling BTC' table instead, then change "orderbook_buy" to "orderbook_sell".

If you want the 'Last Trades' table instead, then change "orderbook_buy" to "orderbook_trades".


If you consider performance critical, then you can implement the data-scraping via URL-Connection instead of Selenium, and have your program running much faster. However, your code will probably end up being a lot "messier", due to the tedious XML parsing that you'll be obliged to apply...

Here is the code for sending the previous output in an email from yourself to yourself:

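import smtplib,ssl

# Note: the try/except bodies below are reconstructed; they are missing from the post.

def SendMail(username,password,contents):
    server = Connect(username)
    try:
        server.login(username,password)
        server.sendmail(username,username,contents)
    except smtplib.SMTPException,error:
        print error
    Disconnect(server)

def Connect(username):
    serverName = username[username.index("@")+1:username.index(".")]
    while True:
        try:
            server = smtplib.SMTP(serverDict[serverName])
        except smtplib.SMTPException,error:
            print error
            continue
        try:
            server.ehlo()
            if server.has_extn("starttls"):
                server.starttls()
                server.ehlo()
        except (smtplib.SMTPException,ssl.SSLError),error:
            print error
            continue
        break
    return server

def Disconnect(server):
    try:
        server.quit()
    except smtplib.SMTPException,error:
        print error

# SMTP host names are elided in the post.
serverDict = {
    "gmail"  :"",
    "hotmail":"",
    "yahoo"  :""
}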

The above code should work if your email provider is either gmail or hotmail or yahoo.

Please note that depending on your firewall configuration, it may ask your permission upon the first time you try it...


How to prevent data-scraping a valuable data web service?

I have a great idea for a windows store app. I'd like to make this app. However it requires a large and valuable database that I will need to create a service for so that people cannot easily steal it. My thinking is maybe host a mobile service on Azure (which I've never tried) and create a .net Web API project to take requests and dish out Json like candy to a windows 8 mvvmclient. However what I don't want is someone sniffing my traffic back and forth from app to service and figuring out how to get/post data from using my app and service then setting up their own app / website to display this data using my bandwidth to make them money.

How can I protect my app-to-db data access so it can't be reverse engineered?
Also, is this the best setup for developing a high-volume Windows 8 app like this? Do you have a better suggestion?

EDIT: I know I can use SSL etc to encrypt traffic to and from. What I am trying to protect is someone using Firebug or Fiddler to figure out what parameters can be posted to get a particular record back. Then creating their own site that simply uses my service as the end point and siphons my data and whores my bandwidth. ie. Just using firebug I know I can use to search the word dallas on google. Even if I encrypt the page, they can see that much in their browser. so if someone does the same get/post in their own application they would get the same records back thus using my stuff.

3 Answers

The most straightforward thing you can do is to set up authentication for your users using something like OAuth. This will allow you to ensure that no communication with your service happens anonymously.

Once you have authenticated your requests you can place controls on those requests that won't impact a normal user. You could rate limit or throttle requests, or use any number of other tactics, to make it very expensive time-wise to siphon off large portions of your data set.

For instance, you can start blocking requests when you notice a large number of users clustering from a single IP address. You could place sensible limits on each user (like 10 API calls per minute with a result set limited to 50). You get the idea I'm sure.
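As a rough sketch of the per-user limit mentioned above (10 calls per minute), here is a small sliding-window limiter in Python. It is illustrative only: the class name is made up, state is kept in memory on a single process, and a real service would more likely use shared storage or the hosting platform's own throttling.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most `max_calls` per user within a sliding time window."""

    def __init__(self, max_calls=10, window_seconds=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock                 # injectable so it can be tested
        self._calls = defaultdict(deque)   # user id -> timestamps of recent calls

    def allow(self, user_id):
        now = self.clock()
        calls = self._calls[user_id]
        # Forget calls that have fallen out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False  # over the limit: reject or delay the request
        calls.append(now)
        return True
```

The API handler would call `limiter.allow(user_id)` once per authenticated request and answer with an error (HTTP 429 is the usual choice) when it returns False. The same structure keyed by IP address covers the clustering case.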

I think we have the same concern. I'm developing a Windows 8 application that contacts a web service built on top of Windows Azure Web Sites. I don't want a bad guy to fire fake requests at my service after intercepting the traffic with a tool like Fiddler.

I asked this question on a mailing list and got a tip. I've never tried it, so take it just for your information. If your application requires user login, then the user's password is a good seed for data/traffic protection. You can use the password to generate a key pair, sign the request, and send it to the server along with the public key. The server can then verify the signature using the public key.
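To show the general shape of request signing, here is a Python sketch that swaps in a simpler symmetric scheme: instead of the key pair described above, it derives a shared key from the password with PBKDF2 and signs each request with HMAC-SHA256. This is a different technique from the tip (the Python standard library has no public-key signatures), and it assumes the server stores the same derived key. All names and parameters are illustrative.

```python
import hashlib
import hmac


def derive_key(password: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Derive a signing key from the user's password (PBKDF2-HMAC-SHA256)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)


def sign_request(key: bytes, method: str, path: str, body: bytes, timestamp: int) -> str:
    """Sign the parts of the request that must not be altered or replayed."""
    message = b"\n".join(
        [method.upper().encode(), path.encode(), str(timestamp).encode(), body])
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_request(key, method, path, body, timestamp, signature, now,
                   max_age_seconds=300):
    """Server side: reject stale timestamps, then compare in constant time."""
    if abs(now - timestamp) > max_age_seconds:
        return False  # too old or too far in the future: possible replay
    expected = sign_request(key, method, path, body, timestamp)
    return hmac.compare_digest(expected, signature)
```

Signing stops a third party from forging or replaying captured requests, but it does not stop the legitimate user from inspecting their own traffic, so it complements rather than replaces per-user limits.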

Using HTTPS is another approach. But as you know, a bad guy can still see the actual data through Fiddler even with HTTPS.

Using a certificate might be another solution, I think. But I couldn't find the relevant documentation on how to install and pick a certificate on the client's machine.


Just serve it over HTTPS; then they can't sniff it.


Business Intelligence Data Mining

Data mining can be technically defined as the automated extraction of hidden information from large databases for predictive analysis. In other words, it is the retrieval of useful information from large masses of data, which is also presented in an analyzed form for specific decision-making.

Data mining requires the use of mathematical algorithms and statistical techniques integrated with software tools. The final product is an easy-to-use software package that can be used even by non-mathematicians to effectively analyze the data they have. Data Mining is used in several applications like market research, consumer behavior, direct marketing, bioinformatics, genetics, text analysis, fraud detection, web site personalization, e-commerce, healthcare, customer relationship management, financial services and telecommunications.

Business intelligence data mining is used in market research, industry research, and for competitor analysis. It has applications in major industries like direct marketing, e-commerce, customer relationship management, healthcare, the oil and gas industry, scientific tests, genetics, telecommunications, financial services and utilities. BI uses various technologies like data mining, scorecarding, data warehouses, text mining, decision support systems, executive information systems, management information systems and geographic information systems for analyzing useful information for business decision making.

Business intelligence is a broader arena of decision-making that uses data mining as one of the tools. In fact, the use of data mining in BI makes the data more relevant in application. There are several kinds of data mining: text mining, web mining, social networks data mining, relational databases, pictorial data mining, audio data mining and video data mining, that are all used in business intelligence applications.

Some data mining tools used in BI are: decision trees, information gain, probability, probability density functions, Gaussians, maximum likelihood estimation, Gaussian Bayes classification, cross-validation, neural networks, instance-based learning (case-based, memory-based, non-parametric), regression algorithms, Bayesian networks, Gaussian mixture models, K-means and hierarchical clustering, Markov models and so on.


Importance of Data Cleansing Services

Companies hold a huge amount of data that is essential to decision making and strategy. Unfortunately, this data is sometimes inaccurate or incomplete because of the updates that arrive from time to time. Companies therefore look for ways to eliminate the information they do not need, and data cleansing is one of the processes that can do so. Data cleansing identifies information that is fraudulent or inaccurate and either deletes it or replaces it with accurate information. Unclean data has no place in a company because it can cause inefficiencies and inaccuracies in decisions. After cleansing, the inconsistencies are gone and the data sets agree with each other.

Several techniques are used in data cleansing: data transformation, parsing (detecting syntax errors), duplicate elimination, and statistical methods. These techniques help ensure that the data is clean and reliable. There are also criteria for telling whether a data set is clean. These are the things that companies look for when obtaining data cleansing services.

Data should be accurate, which requires density, integrity, and consistency. It should also be complete, to ensure that there are no discrepancies in the data set. Density expresses the relationship between the omitted values and the total number of values in the data set; a data set with good density is a good data set. Data should also be uniform, with irregularities eliminated from the set. Consistency should be present as well, which means eliminating syntactical errors in the set. Cleaning should also establish the uniqueness of the set, which reveals the number of duplicates that were present before the cleaning. Lastly, the data should have integrity, which combines the criteria of soundness and completeness. If the above criteria are met, the data set is in its best state.

A data cleansing service offers several features. Removal of duplicate records is one of the most common: identical records or data sets are tagged and identified, and the duplicates are eliminated. Data is also validated, and bogus data is removed. The set is also checked for outdated data, which is removed as well. Incomplete records are identified so that they can be given attention; once identified, they can be completed so that the set is orderly and organized.
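A few of these steps can be sketched in a handful of lines of Python. The sketch below is illustrative only: the function names are made up, duplicates are detected by comparing whitespace- and case-normalized values, and density is computed as the share of non-missing values among all expected values, which is one reading of the density criterion described earlier.

```python
def cleanse(records, required_fields):
    """Split records into clean and incomplete ones, dropping duplicates.

    Returns (clean, incomplete, duplicate_count).
    """
    seen = set()
    clean, incomplete = [], []
    duplicates = 0
    for record in records:
        # Normalize values so that "Ann" and "ann " count as the same record.
        key = tuple(sorted((k, str(v).strip().lower()) for k, v in record.items()))
        if key in seen:
            duplicates += 1
            continue
        seen.add(key)
        if any(record.get(field) in (None, "") for field in required_fields):
            incomplete.append(record)  # flagged for attention, not deleted
        else:
            clean.append(record)
    return clean, incomplete, duplicates


def density(records, fields):
    """Share of non-missing values among all expected values (0.0 to 1.0)."""
    total = len(records) * len(fields)
    if total == 0:
        return 0.0
    present = sum(1 for record in records for field in fields
                  if record.get(field) not in (None, ""))
    return present / total
```

Keeping the incomplete records in a separate list, rather than discarding them, leaves room to fill in the missing values later.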

Alongside the benefits that companies get from data cleansing services, there are also problems. Sometimes data is lost when records with limited information are eliminated. Companies that offer these services should maintain good service quality, since data cleansing is expensive and time consuming.
