KoffeeKoder


  • Screen Scraping Using ASP.NET
    published on 7/22/2008 10:19:09 AM
  • One of my blog readers emailed me to ask about screen scraping. In this post I describe how to scrape a simple page and extract all the links from it.
    It only takes a few lines of code. Take a look at the complete code below:

    // namespaces used: System.Data, System.Net, System.Text, System.Text.RegularExpressions
    protected void ScrapeScreen_Clicked(object sender, EventArgs e)
    {
        string url = "http://www.azamsharp.com";

        // download the raw bytes of the page and release the client afterwards
        byte[] data;
        using (WebClient client = new WebClient())
        {
            data = client.DownloadData(url);
        }

        // decode the bytes as text (assumes the page is UTF-8 encoded)
        UTF8Encoding utf8 = new UTF8Encoding();
        string scrapedText = utf8.GetString(data);

        // use regular expression to take out the links!
        string pattern = @"(((file|gopher|news|nntp|telnet|http|ftp|https|ftps|sftp)://)|(www\.))+(([a-zA-Z0-9\._-]+\.[a-zA-Z]{2,6})|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(/[a-zA-Z0-9\&%_\./-~-]*)?";

        Regex reg = new Regex(pattern);

        // one row per matched link
        DataTable linkDT = new DataTable();
        linkDT.Columns.Add("Url");

        foreach (Match m in reg.Matches(scrapedText))
        {
            DataRow row = linkDT.NewRow();
            row["Url"] = m.Value;
            linkDT.Rows.Add(row);
        }

        // bind to the gridview control
        gvLinks.DataSource = linkDT;
        gvLinks.DataBind();
    }

    The WebClient class downloads the web page, which in this case is www.azamsharp.com. The page arrives as a byte array, which is then converted to a string using UTF8Encoding. Finally, a regular expression is applied to the string, and all the extracted links are added to a DataTable, which is then bound to a GridView control.

    The GridView then displays the extracted links, one URL per row.
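
    The link-extraction step can also be tried on its own, without a page or a network call. The following is only a minimal sketch: the LinkExtractor class, the ExtractLinks helper and the example.com / example.org markup are hypothetical names made up for illustration; the pattern is the one from the post, unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original page.
public static class LinkExtractor
{
    // The same pattern used in ScrapeScreen_Clicked.
    private const string Pattern = @"(((file|gopher|news|nntp|telnet|http|ftp|https|ftps|sftp)://)|(www\.))+(([a-zA-Z0-9\._-]+\.[a-zA-Z]{2,6})|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(/[a-zA-Z0-9\&%_\./-~-]*)?";

    // Returns every substring of the input that the pattern matches, in order.
    public static List<string> ExtractLinks(string html)
    {
        List<string> links = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            links.Add(m.Value);
        }
        return links;
    }

    public static void Main()
    {
        // Placeholder markup standing in for the downloaded page.
        string html = "<a href=\"http://example.com/about\">About</a> " +
                      "<a href=\"https://www.example.org/\">Org</a>";

        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

    Keeping the regex in a small helper like this makes it easy to try the pattern against different snippets of markup before binding anything to the GridView.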




  • by Keith E. Cooper on 12/23/2008 12:35:54 PM
  • Hi Mohammad,

    I enjoyed this post on screen scraping! For reasons that are too long to discuss, I need to post to a web form that is running within our intranet. I have reviewed countless articles about how to do this, but I'm not getting it to work. Here is a code snippet:

    // Start of snippet:

    // Earl is a string pointing to the target site

    // Data is a NameValueCollection with six
    //fields; one of which is the tag/data pair
    //for the submit button.

    byte[] response = webClient.UploadValues(
    Earl, "POST", Data );

    UTF8Encoding utf = new UTF8Encoding();

    ResponseString = utf.GetString( response );

    // End of snippet

    No errors are thrown. ResponseString contains exactly what you would see if you did a view source on the target page. I have tried webClient.UploadData() too, and I get the same results.

    Any clues on how I can figure out what's going on? Basically, the form needs to "see" the submit button tag-data pair so that it fires its "action."

    Thanks,

    Keith E. Cooper
    Tampa, FL
  • by Keith E. Cooper on 12/24/2008 4:45:40 PM
  • Never mind! I ran Fiddler and saw that I had the wrong URL specified (long story). Fiddler and tools like it are very useful!

    Thanks again for your site and your posts!

    Peace,

    Keith E. Cooper
  • by Mohammad Azam on 12/27/2008 2:42:14 PM
  • Hi Keith,

    I am glad you solved the problem!