Tuesday, March 7, 2017

Web Scraping / Web Crawler

  • Web Scraping
Web scraping is the process of using bots to extract content and data from a website. It extracts the underlying HTML code and, with it, data stored in a database.

Web scraping is used in a variety of digital businesses that rely on data harvesting. Legitimate use cases include:
    Search engine bots crawling a site, analyzing its content and then ranking it.
    Price comparison sites deploying bots to auto-fetch prices and product descriptions for allied seller websites.
    Market research companies using scrapers to pull data from forums and social media (e.g., for sentiment analysis).

Scraper Tools and Bots
Web scraping tools are software (i.e., bots) programmed to sift through databases and extract information. A variety of bot types are used, many being fully customizable to:
    Recognize unique HTML site structures
    Extract and transform content
    Store scraped data
    Extract data from APIs
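
The first three capabilities in the list above (recognizing a site's HTML structure, extracting and transforming content, and storing the result) can be illustrated with a minimal Python sketch that uses only the standard library. The sample markup, the class names "product", "name" and "price", and the helper names are all assumptions made for illustration; a real site will have its own structure.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page markup; real sites will differ.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">$9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">$24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished records, one dict per <li>
        self._field = None   # field currently being read, if any
        self._current = {}   # record being assembled

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.rows.append(self._current)
            self._current = {}


def scrape(html):
    """Extract records, then transform the price string into a float."""
    parser = ProductParser()
    parser.feed(html)
    return [
        {"name": r["name"], "price": float(r["price"].lstrip("$"))}
        for r in parser.rows
    ]


def to_csv(rows):
    """Store the scraped rows as CSV text (a file or database would also work)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = scrape(SAMPLE_HTML)
print(to_csv(rows))
```

The parser is deliberately tied to one page layout, which is why scraper bots usually have to be customized per site.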

Since all scraping bots have the same purpose, to access site data, it can be difficult to distinguish between legitimate and malicious bots. However, several key differences help tell the two apart.

    Legitimate bots are identified with the organization for which they scrape. For example, Googlebot identifies itself in its HTTP header as belonging to Google. Malicious bots, conversely, impersonate legitimate traffic by creating a false HTTP user agent.
    Legitimate bots abide by a site’s robots.txt file, which lists those pages a bot is permitted to access and those it cannot. Malicious scrapers, on the other hand, crawl the website regardless of what the site operator has allowed.
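
As a rough illustration of the two points above, a well-behaved bot can declare who it is in its User-Agent string and consult robots.txt before fetching a page. The sketch below uses Python's standard urllib.robotparser module; the bot name "ExampleBot", the example.com URLs and the robots.txt content are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content published by a site operator.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

# Hypothetical identity; a legitimate bot names its operator here.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot)"


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, url, user_agent=USER_AGENT):
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(may_fetch(parser, "https://example.com/products"))
print(may_fetch(parser, "https://example.com/private/report"))
```

A malicious scraper simply skips this check, and can send any User-Agent string it likes, which is why the header alone is not proof of identity.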

A perpetrator who lacks the budget for dedicated scraping infrastructure often resorts to using a botnet: geographically dispersed computers, infected with the same malware and controlled from a central location. Individual botnet computer owners are unaware of their participation. The combined power of the infected systems enables large-scale scraping of many different websites by the perpetrator.

Malicious Web Scraping Examples

Price Scraping
In price scraping, a perpetrator typically uses a botnet from which to launch scraper bots to inspect competing business databases. The goal is to access pricing information, undercut rivals and boost sales.

Content Scraping
Content scraping comprises large-scale content theft from a given site. Typical targets include online product catalogues and websites relying on digital content to drive business.

Web Scraping Protection
Protection against malicious scrapers relies on granular traffic analysis.
The process involves the cross-verification of factors, including:

    HTML fingerprint – The filtering process starts with granular inspection of HTTP headers. These can provide clues as to whether a visitor is human or bot, and malicious or safe. Header signatures are compared against a constantly updated database of over 10 million known variants.
    IP reputation – We collect IP data from all attacks against our clients. Visits from IP addresses having a history of being used in assaults are treated with suspicion and are more likely to be scrutinized further.
    Behavior analysis – Tracking the ways visitors interact with a website can reveal abnormal behavioral patterns, such as a suspiciously aggressive rate of requests and illogical browsing patterns. This helps identify bots that pose as human visitors.
    Progressive challenges – We use a set of challenges, including cookie support and JavaScript execution, to filter out bots and minimize false positives. As a last resort, a CAPTCHA challenge can weed out bots attempting to pass themselves off as humans.
https://www.incapsula.com/web-application-security/web-scraping-attack.html
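
The behavior analysis factor described above mentions a "suspiciously aggressive rate of requests". One simple way to approximate that signal is a sliding-window counter per client. The Python sketch below is only an illustration under assumed thresholds; the class name RateMonitor and its parameters are hypothetical and do not represent the vendor's actual method.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Flags clients whose request count exceeds a limit within a sliding window."""

    def __init__(self, max_requests=5, window_seconds=10.0):
        # Both thresholds are illustrative assumptions, not recommended values.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> recent request timestamps

    def record(self, client_id, timestamp):
        """Record one request; return True if the client now looks suspicious."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


monitor = RateMonitor(max_requests=3, window_seconds=1.0)

# A client (documentation-range IP) firing five requests in under half a second.
fast = [monitor.record("203.0.113.7", t) for t in (0.0, 0.1, 0.2, 0.3, 0.4)]

# Another client making one request every two seconds.
slow = [monitor.record("198.51.100.4", t) for t in (0.0, 2.0, 4.0)]

print(fast)
print(slow)
```

In practice a rate signal like this would be only one input, combined with the header, IP reputation and challenge factors listed above before a visitor is blocked.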

  • What is Web Scraping?

Web scraping refers to the technique of extracting bulk data (both text and graphics) from websites and then compiling the gathered information into physical data storage units (hard disks, compact disks, etc.) to use it for financial gain or other business purposes. This is usually done with artificially intelligent web scraping software programs that simulate the complete human-computer interaction, automating manual data extraction (copy-and-paste techniques) and making it easy to harvest large volumes of data quickly and efficiently into spreadsheets.

Web Scraping Software Vs Web Browser

Web scraping software is functionally similar to a web browser in that both interact with a website in a similar way and have built-in capabilities to parse the HTML document object model (DOM). However, a web browser focuses on rendering the HTML tags into a full-fledged webpage, while web harvesting software quickly extracts only the desired content (fields such as name, phone number, address, etc.) from the HTML and saves it to a local file on your computer's hard disk or to an external database.

Web Scraping Software Vs Web Crawler
Web scraping software usually simulates the way humans explore the web, just as web crawlers do. However, while crawlers just index the data for search engines, scraping software also transforms the data from a format that is not directly usable (HTML) into a usable and readable one (the original content, such as text and images) that can be easily exported into spreadsheets for later analysis.
https://www.scrapesentry.com/news/web-scraping-definition-detection-prevention/
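
The transformation from raw HTML into readable content described above can be sketched in a few lines of Python with the standard html.parser module. The sample page, the TextExtractor class and the html_to_text helper are assumptions for illustration only; real scraping software also handles images, encodings and malformed markup.

```python
from html.parser import HTMLParser

# Hypothetical page; the contact details are made up.
SAMPLE_PAGE = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Contact</h1><p>Jane Doe, <b>555-0100</b></p>"
    "<script>track()</script></body></html>"
)


class TextExtractor(HTMLParser):
    """Keeps visible text and ignores the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than zero while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def html_to_text(html):
    """Return the readable text of an HTML document as one string."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any text still buffered by the parser
    return " ".join(extractor.parts)


print(html_to_text(SAMPLE_PAGE))
```

The resulting plain text is what would then be split into fields and exported to a spreadsheet.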

  • WebScarab
WebScarab is a framework for analysing applications that communicate using the HTTP and HTTPS protocols. It is written in Java, and is thus portable to many platforms. WebScarab has several modes of operation, implemented by a number of plugins. In its most common usage, WebScarab operates as an intercepting proxy, allowing the operator to review and modify requests created by the browser before they are sent to the server, and to review and modify responses returned from the server before they are received by the browser.
https://www.owasp.org/index.php/OWASP_WebScarab


  • What is Apache Nutch?
    Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene,
    http://nutch.apache.org/