One of my blog readers emailed me and asked about screen scraping. In this post I will show how to do it by downloading a simple page and extracting all the links from that page.
It only takes a few lines of code to achieve this. Take a look at the complete code below:
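// Requires: using System; using System.Data; using System.Net;
// using System.Text; using System.Text.RegularExpressions;
protected void ScrapeScreen_Clicked(object sender, EventArgs e)
{
    string url = "http://www.azamsharp.com";

    // download the page as raw bytes; the using block disposes the WebClient
    byte[] data;
    using (WebClient client = new WebClient())
    {
        data = client.DownloadData(url);
    }

    // decode the bytes as text (assumes the page is served as UTF-8)
    UTF8Encoding utf8 = new UTF8Encoding();
    string scrapedText = utf8.GetString(data);

    // use regular expression to take out the links!
    // note: "/-~" in the last character class is a range from '/' to '~',
    // so the path part also matches characters such as '<', '>', '?' and '='
    string pattern = @"(((file|gopher|news|nntp|telnet|http|ftp|https|ftps|sftp)://)|(www\.))+(([a-zA-Z0-9\._-]+\.[a-zA-Z]{2,6})|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(/[a-zA-Z0-9\&%_\./-~-]*)?";
    Regex reg = new Regex(pattern);

    // collect every match into a one-column DataTable
    DataTable linkDT = new DataTable();
    linkDT.Columns.Add("Url");
    foreach (Match m in reg.Matches(scrapedText))
    {
        DataRow row = linkDT.NewRow();
        row["Url"] = m.Value;
        linkDT.Rows.Add(row);
    }

    // bind to the gridview control
    gvLinks.DataSource = linkDT;
    gvLinks.DataBind();
}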
Basically, I am using the WebClient class to download the web page, which in this case is www.azamsharp.com. The page is downloaded as a byte array, which is then converted to a string using UTF8Encoding. Finally, a regular expression is applied to the string, and all the extracted links are added to a DataTable, which is then bound to a GridView control.
The output looks something like this:

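If you want to try the link extraction step on its own, without a page or a GridView, it can be pulled out into a small helper that works on any HTML string. The sketch below is only an illustration: the LinkExtractor class, the ExtractLinks method and the sample HTML are names I made up for this example, and the pattern is a deliberately simpler one that only reads href attributes. It is not the regular expression used in the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original page code.
public static class LinkExtractor
{
    // Simplified pattern (an assumption for this sketch): capture the value
    // of href="..." or href='...' into a named group called "url".
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*[""'](?<url>[^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the href values found in the given HTML, in document order.
    public static List<string> ExtractLinks(string html)
    {
        List<string> links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            links.Add(m.Groups["url"].Value);
        }
        return links;
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample input made up for the example; no network access is needed.
        string html = "<a href=\"http://www.azamsharp.com\">Home</a> <a href='/about'>About</a>";

        foreach (string link in LinkExtractor.ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

Keeping the extraction in a plain static method like this means the same call could be fed the scrapedText string from the button handler, and the returned list could be bound to the GridView instead of a DataTable. Note that an href-only pattern also returns relative links such as /about, which the longer pattern in the post would skip.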