EzDevInfo.com

web-scraping interview questions

Top frequently asked web-scraping interview questions

How can I save an image locally using Python when I already know its URL?

I know the URL of an image on the Internet.

e.g. http://www.digimouth.com/news/media/2011/09/google-logo.jpg, which contains the logo of Google.

Now, how can I download this image using Python without actually opening the URL in a browser and saving the file manually?
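One way is an HTTP client such as the third-party requests library; a minimal sketch (the standard library's urllib.request.urlretrieve works too, and the URL and filename are just the ones from the question):

import requests

# Download the image bytes and write them to a local file
url = "http://www.digimouth.com/news/media/2011/09/google-logo.jpg"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors

with open("google-logo.jpg", "wb") as f:
    f.write(response.content)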


Source: (StackOverflow)

How can I get the Google cache age of any URL or web page?

In my project I need the Google cache age to be added as important information. I have tried to find sources for the Google cache age, that is, the number of days since Google last re-indexed a given page.

Where can I get the Google cache age?
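One commonly suggested route is to fetch Google's cached copy from webcache.googleusercontent.com and parse the snapshot date out of the banner Google prepends. A Python sketch, with two caveats: Google may block automated requests, and the banner wording ("as it appeared on ...") is an assumption that can change at any time:

import re
import requests

def google_cache_date(url):
    # Try to read the snapshot date from Google's cache banner
    cache_url = "http://webcache.googleusercontent.com/search?q=cache:" + url
    headers = {"User-Agent": "Mozilla/5.0"}  # a browser-like UA helps avoid blocks
    html = requests.get(cache_url, headers=headers, timeout=10).text
    match = re.search(r"as it appeared on (.+?)\.", html)
    return match.group(1) if match else None

print(google_cache_date("example.com"))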


Source: (StackOverflow)


Web scraping - how to identify main content on a webpage

Given a news article webpage (from any major news source such as The Times or Bloomberg), I want to identify the main article content on that page and throw out the other miscellaneous elements such as ads, menus, sidebars, and user comments.

What's a generic way of doing this that will work on most major news sites?

What are some good tools or libraries for data mining? (preferably Python-based)
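On the Python side, libraries such as readability-lxml (a port of Arc90's Readability heuristics: text density, link density, DOM structure) do exactly this. A minimal sketch, assuming the readability-lxml package and an illustrative article URL:

import requests
from readability import Document  # pip install readability-lxml

html = requests.get("http://example.com/some-article", timeout=10).text
doc = Document(html)

print(doc.title())    # the extracted headline
print(doc.summary())  # HTML of the main article body, boilerplate stripped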


Source: (StackOverflow)

Headless Browser and scraping - solutions [closed]

I'm trying to put together a list of possible solutions for automated browser test suites and headless browser platforms capable of scraping.


BROWSER TESTING / SCRAPING:

  • Selenium - the polyglot flagship of browser automation, with bindings for Python, Ruby, JavaScript, C#, Haskell and more, plus an IDE for Firefox (as an extension) for faster test deployment. Can act as a server and has tons of features.

JAVASCRIPT

  • PhantomJS - JavaScript, headless testing with screen capture and automation, uses WebKit. As of version 1.8 Selenium's WebDriver API is implemented, so you can use any WebDriver binding and tests will be compatible with Selenium.
  • SlimerJS - similar to PhantomJS, uses Gecko (Firefox) instead of WebKit
  • CasperJS - JavaScript, built on both PhantomJS and SlimerJS, has extra features
  • Ghost Driver - JavaScript implementation of the WebDriver Wire Protocol for PhantomJS.
  • new PhantomCSS - CSS regression testing. A CasperJS module for automating visual regression testing with PhantomJS and Resemble.js.
  • new WebdriverCSS - plugin for Webdriver.io for automating visual regression testing
  • new PhantomFlow - Describe and visualize user flows through tests. An experimental approach to Web user interface testing.
  • new trifleJS - ports the PhantomJS API to use the Internet Explorer engine.
  • new CasperJS IDE (commercial)

NODE.JS

  • Node-phantom - bridges the gap between PhantomJS and node.js
  • WebDriverJs - Selenium WebDriver bindings for node.js by Selenium Team
  • WD.js - node module for WebDriver/Selenium 2
  • yiewd - WD.js wrapper using the latest Harmony generators! Get rid of the callback pyramid with yield
  • ZombieJs - Insanely fast, headless full-stack testing using node.js
  • NightwatchJs - Node JS based testing solution using Selenium Webdriver
  • Chimera - can do everything PhantomJS does, but in a full JS environment
  • Dalek.js - Automated cross browser testing with JavaScript through Selenium Webdriver
  • Webdriver.io - better implementation of WebDriver bindings with 50+ predefined actions
  • Nightmare - Electron bridge with a high-level API.
  • jsdom - Tailored towards web scraping. A very lightweight DOM implemented in Node.js; it supports pages with JavaScript.

WEB SCRAPING / MINING

  • Scrapy - Python, mainly a scraper/miner - fast, well documented, can be linked with Django Dynamic Scraper for nice mining deployments or with Scrapy Cloud for PaaS (server-less) deployment, works in the terminal or as a stand-alone server process, can be used with Celery, built on top of Twisted (see the minimal spider sketch after this list)
  • Snailer - node.js module, not yet tested.
  • Node-Crawler - node.js module, not yet tested.
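To give a feel for Scrapy, here is a minimal spider sketch (recent Scrapy API; quotes.toscrape.com is a public scraping sandbox used purely for illustration):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # Extract one item per quote block via CSS selectors
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if there is one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run it with: scrapy runspider quotes_spider.py -o quotes.json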

ONLINE TOOLS


RELATED LINKS & RESOURCES

Questions:

  • Any pure Node.js solution, or a Node.js-to-PhantomJS/CasperJS module, that actually works and is documented?

Answer: Chimera seems to go in that direction; see the Chimera entry above

  • Other solutions capable of easier JavaScript injection than Selenium?

  • Do you know any pure Ruby solutions?

Answer: Check out the list created by rjk of Ruby-based solutions

  • Do you know any related tech or solution?

Feel free to edit this question and add content as you wish! Thank you for your contributions!


Updates

  1. added SlimerJS to the list
  2. added Snailer and Node-Crawler and Node-phantom
  3. added Yiewd WebDriver wrapper
  4. added WebDriverJs and WD.js
  5. added Ghost Driver
  6. added Comparison of Webscraping software on Screen Scraper Blog
  7. added ZombieJs
  8. added Resemble.js and PhantomCSS and PhantomFlow, categorised and reedited content
  9. 04.01.2014, added Chimera, answered 2 questions
  10. added NightWatchJs
  11. added DalekJS
  12. added WebdriverCSS
  13. added CasperBox
  14. added trifleJS
  15. added CasperJS IDE
  16. added Nightmare
  17. added jsdom
  18. added Online HTTP client, updated CasperBox (dead)

Source: (StackOverflow)

Save and render a webpage with PhantomJS and node.js

I'm looking for an example of requesting a webpage, waiting for the JavaScript to render (JavaScript modifies the DOM), and then grabbing the HTML of the page.

This should be a simple example with an obvious use-case for PhantomJS, yet I can't find a decent example; the documentation seems to be all about command-line use.
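For contrast, the same request-render-grab flow can be sketched in Python by letting Selenium drive PhantomJS (a stand-in for a pure node.js solution; the PhantomJS driver was deprecated in later Selenium releases, and a fixed sleep is used here only for brevity):

import time
from selenium import webdriver

driver = webdriver.PhantomJS()  # assumes the phantomjs binary is on PATH
driver.get("http://example.com")

time.sleep(2)  # crude wait for scripts to finish modifying the DOM

html = driver.page_source  # HTML after JavaScript has run, not the raw source
print(html)
driver.quit()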


Source: (StackOverflow)

What is the difference between web-crawling and web-scraping?

Is there a difference between Crawling and Web-scraping?

If there's a difference, what's the best method to use in order to collect some web data to supply a database for later use in a customised search engine?


Source: (StackOverflow)

Web Scraping with Scala [closed]

Just wondering if anyone knows of a web-scraping library that takes advantage of Scala's succinct syntax. So far, I've found Chafe, but it seems poorly documented and poorly maintained. I'm wondering if anyone out there has done scraping with Scala and has advice. (I'm trying to integrate into an existing Scala framework rather than use a scraper written in, say, Python.)


Source: (StackOverflow)

Is it possible to use Selenium WebDriver to drive PhantomJS?

I'm going through the documentation for the Selenium WebDriver, and it can drive Chrome for example. I got thinking, wouldn't it be far more efficient to 'drive' PhantomJS?

Is there a way to use Selenium with PhantomJS?

My intended use would be web scraping: the sites I scrape are loaded with AJAX and lots of lovely JavaScript, and I'm thinking this setup could be a good replacement for the Scrapy Python framework that I'm currently working with.
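Yes: Selenium gained a PhantomJS driver (via Ghost Driver), although it was later deprecated in favour of headless Chrome/Firefox. A Python sketch for an AJAX-heavy page; the URL and the #results selector are illustrative assumptions:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()  # assumes phantomjs is on PATH
driver.get("http://example.com/ajax-heavy-page")

# Wait for an AJAX-loaded element instead of sleeping blindly
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#results"))
)

print(driver.page_source)
driver.quit()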


Source: (StackOverflow)

How to retrieve/calculate citation counts and/or citation indices from a list of authors?

I have a list of authors. I wish to automatically retrieve or calculate the (ideally yearly) citation index (h-index, m-quotient, g-index, HCP indicator, ...) for each author.

Author Year Index
first  2000   1
first  2001   2
first  2002   3

I can calculate all of these metrics given the citation counts for each paper of each researcher.

Author Paper Year Citation_count
first    1    2000   1
first    2    2000   2
first    3    2002   3

Despite my efforts, I have not found an API/scraping method capable of this.

My institution has access to a number of services including Web of Science.
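One possibility on the Google Scholar side is the third-party scholarly package; its API has changed across versions, so treat the calls below as an assumption, and note that Scholar aggressively rate-limits automated access:

from scholarly import scholarly  # pip install scholarly

# Look up an author and pull citation metrics (the name is illustrative)
search = scholarly.search_author("Albert Einstein")
author = scholarly.fill(next(search))

print(author["name"], author["hindex"], author["citedby"])

# Per-publication citation counts, from which yearly indices can be derived
for pub in author["publications"]:
    print(pub["bib"].get("title"), pub.get("num_citations"))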


Source: (StackOverflow)

How to use CrawlSpider from scrapy to click a link with javascript onclick?

I want Scrapy to crawl pages where the link to the next page looks like this:

<a href="#" onclick="return gotoPage('2');"> Next </a>

Will Scrapy be able to interpret the JavaScript code in that?

With livehttpheaders extension I found out that clicking Next generates a POST with a really huge piece of "garbage" starting like this:

encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n

I am trying to build my spider on the CrawlSpider class, but I can't really figure out how to code it. With BaseSpider I used the parse() method to process the first URL, which happens to be a login form, where I did a POST with:

def logon(self, response):
    login_form_data={ 'email': 'user@example.com', 'password': 'mypass22', 'action': 'sign-in' }
    return [FormRequest.from_response(response, formnumber=0, formdata=login_form_data, callback=self.submit_next)]

And then I defined submit_next() to say what to do next. But I can't figure out how to tell CrawlSpider which method to use on the first URL.

All requests in my crawling, except the first one, are POST requests. They alternate between two types of requests: posting some data, and clicking "Next" to go to the next page.
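Two notes, sketched below: Scrapy does not execute the JavaScript in onclick handlers, so the usual workaround is to replicate the POST that clicking "Next" triggers; and CrawlSpider hands each start-URL response to parse_start_url(), which is the hook for the login step. A sketch (import paths are for recent Scrapy versions; the URLs and the rule are illustrative, and the form fields are the ones from the question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import FormRequest

class LoginCrawlSpider(CrawlSpider):
    name = "login_crawl"
    start_urls = ["http://example.com/login"]
    rules = (
        Rule(LinkExtractor(allow=r"/items/"), callback="parse_item"),
    )

    def parse_start_url(self, response):
        # CrawlSpider routes the first response here, so the login POST
        # can live here without overriding CrawlSpider's internal parse()
        return FormRequest.from_response(
            response,
            formnumber=0,
            formdata={"email": "user@example.com",
                      "password": "mypass22",
                      "action": "sign-in"},
            callback=self.submit_next,
        )

    def submit_next(self, response):
        pass  # replicate the pager's POST from here, page by page

    def parse_item(self, response):
        pass  # normal item extraction for rule-matched pages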


Source: (StackOverflow)

YouTube comment scraper returns limited results

The task:

I wanted to scrape all the YouTube comments from a given video.

I successfully adapted the R code from a previous question (Scraping Youtube comments in R).

Here is the code:

library(RCurl)
library(XML)
x <- "https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo/comments?orderby=published"
html = getURL(x)
doc  = htmlParse(html, asText=TRUE) 
txt  = xpathSApply(doc, 
"//body//text()[not(ancestor::script)][not(ancestor::style)[not(ancestor::noscript)]",xmlValue)

To use it, simply replace the video ID (i.e. "4H9pTgQY_mo") with the ID you require.

The problem:

The problem is that it doesn't return all the comments. In fact, it always returns a vector with 283 elements, regardless of how many comments are in the video.

Can anyone please shed light on what is going wrong here? It is incredibly frustrating. Thank you.
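For what it's worth, the old gdata v2 comment feed capped each response at a fixed number of entries, so fetching everything meant paging with start-index/max-results. A Python sketch of that loop (the feed structure with content.$t is from the old JSON format; the v2 API has since been retired by Google, and the modern route is the YouTube Data API v3 commentThreads endpoint):

import requests

base = "https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo/comments"
start = 1
comments = []

while True:
    feed = requests.get(base, params={
        "alt": "json", "orderby": "published",
        "max-results": 50, "start-index": start,
    }).json()
    entries = feed.get("feed", {}).get("entry", [])
    if not entries:
        break  # no more pages
    comments.extend(e["content"]["$t"] for e in entries)
    start += len(entries)

print(len(comments))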


Source: (StackOverflow)

Scraping Real Time Visitors from Google Analytics

I have a lot of sites and want to build a dashboard showing the number of real time visitors on each of them on a single page. (would anyone else want this?) Right now the only way to view this information is to open a new tab for each site.

Google doesn't have a real-time API, so I'm wondering if it is possible to scrape this data. Eduardo Cereto found out that Google transfers the real-time data over the realtime/bind network request. Anyone more savvy have an idea of how I should start? Here's what I'm thinking:

  1. Figure out how to authenticate programmatically
  2. Inspect all of the realtime/bind requests to see how they change. Does each request have a unique key? Where does that come from? Below is my breakdown of the request:

    https://www.google.com/analytics/realtime/bind?VER=8

    &key=[What is this? Where does it come from? 21 character lowercase alphanumeric, stays the same each request]

    &ds=[What is this? Where does it come from? 21 character lowercase alphanumeric, stays the same each request]

    &pageId=rt-standard%2Frt-overview

    &q=t%3A0%7C%3A1%3A0%3A%2Ct%3A11%7C%3A1%3A5%3A%2Cot%3A0%3A0%3A4%2Cot%3A0%3A0%3A3%2Ct%3A7%7C%3A1%3A10%3A6%3D%3DREFERRAL%3B%2Ct%3A10%7C%3A1%3A10%3A%2Ct%3A18%7C%3A1%3A10%3A%2Ct%3A4%7C5%7C2%7C%3A1%3A10%3A2!%3Dzz%3B%2C&f

    The q variable URI decodes to this (what the?): t:0|:1:0:,t:11|:1:5:,ot:0:0:4,ot:0:0:3,t:7|:1:10:6==REFERRAL;,t:10|:1:10:,t:18|:1:10:,t:4|5|2|:1:10:2!=zz;,&f

    &RID=rpc

    &SID=[What is this? Where does it come from? 16 character uppercase alphanumeric, stays the same each request]

    &CI=0

    &AID=[What is this? Where does it come from? integer, starts at 1, increments weirdly to 150 and then 298]

    &TYPE=xmlhttp

    &zx=[What is this? Where does it come from? 12 character lowercase alphanumeric, changes each request]

    &t=1

  3. Inspect all of the realtime/bind responses to see how they change. How does the data come in? It looks like some altered JSON. How many times do I need to connect to get the data? Where is the active visitors on site number in there? Here is a dump of sample data:

    19 [[151,["noop"] ] ] 388 [[152,["rt",[{"ot:0:0:4":{"timeUnit":"MINUTES","overTimeData":[{"values":[49,53,52,40,42,55,49,41,51,52,47,42,62,82,76,71,81,66,81,86,71,66,65,65,55,51,53,73,71,81],"name":"Total"}]},"ot:0:0:3":{"timeUnit":"SECONDS","overTimeData":[{"values":[0,1,1,1,1,0,1,0,1,1,1,0,2,0,2,2,1,0,0,0,0,0,2,1,1,2,1,2,0,5,1,0,2,1,1,1,2,0,2,1,0,5,1,1,2,0,0,0,0,0,0,0,0,0,1,1,0,3,2,0],"name":"Total"}]}}]]] ] 388 [[153,["rt",[{"ot:0:0:4":{"timeUnit":"MINUTES","overTimeData":[{"values":[52,53,52,40,42,55,49,41,51,52,47,42,62,82,76,71,81,66,81,86,71,66,65,65,55,51,53,73,71,81],"name":"Total"}]},"ot:0:0:3":{"timeUnit":"SECONDS","overTimeData":[{"values":[2,1,1,1,1,1,0,1,0,1,1,1,0,2,0,2,2,1,0,0,0,0,0,2,1,1,2,1,2,0,5,1,0,2,1,1,1,2,0,2,1,0,5,1,1,2,0,0,0,0,0,0,0,0,0,1,1,0,3,2],"name":"Total"}]}}]]] ] 388 [[154,["rt",[{"ot:0:0:4":{"timeUnit":"MINUTES","overTimeData":[{"values":[53,53,52,40,42,55,49,41,51,52,47,42,62,82,76,71,81,66,81,86,71,66,65,65,55,51,53,73,71,81],"name":"Total"}]},"ot:0:0:3":{"timeUnit":"SECONDS","overTimeData":[{"values":[0,3,1,1,1,1,1,0,1,0,1,1,1,0,2,0,2,2,1,0,0,0,0,0,2,1,1,2,1,2,0,5,1,0,2,1,1,1,2,0,2,1,0,5,1,1,2,0,0,0,0,0,0,0,0,0,1,1,0,3],"name":"Total"}]}}]]] ]

Let me know if you can help with any of the items above!



Source: (StackOverflow)

How to scroll down with PhantomJS to load dynamic content

I am trying to scrape links from a page that generates content dynamically as the user scrolls down to the bottom (infinite scrolling). I have tried doing different things with PhantomJS but have not been able to gather links beyond the first page. Let's say the element at the bottom which loads content has the class .has-more-items. It is available until the final content is loaded while scrolling, and then becomes unavailable in the DOM (display:none). Here are the things I have tried:

  • Setting viewportSize to a large height right after var page = require('webpage').create();

page.viewportSize = { width: 1600, height: 10000, };

  • Using page.scrollPosition = { top: 10000, left: 0 } inside page.open, but it has no effect, like:
page.open('http://example.com/?q=houston', function(status) {
   if (status == "success") {
      page.scrollPosition = { top: 10000, left: 0 };  
   }
});
  • Also tried putting it inside the page.evaluate function, but that gives

Reference error: Can't find variable page

  • Tried using jQuery and JS code inside page.evaluate and page.open, but to no avail:

$("html, body").animate({ scrollTop: $(document).height() }, 10, function() { //console.log('check for execution'); });

as-is, and also inside document.ready. Similarly for plain JS code:

window.scrollBy(0,10000)

as-is, and also inside window.onload.

I have been really stuck on this for 2 days now and am not able to find a way. Any help or hint would be appreciated.

Update

I have found a helpful piece of code at https://groups.google.com/forum/?fromgroups=#!topic/phantomjs/8LrWRW8ZrA0

var hitRockBottom = false;
while (!hitRockBottom) {
    // Scroll the page (not sure if this is the best way to do so...)
    page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };

    // Check if we've hit the bottom
    hitRockBottom = page.evaluate(function() {
        return document.querySelector(".has-more-items") === null;
    });
}

Here .has-more-items is the class of the element I want to access; it is available at the bottom of the page initially, moves further down as we scroll, and becomes unavailable once all the data has loaded.

However, when I tested it, it was clear that it runs into an infinite loop without scrolling down (I rendered pictures to check). I have also tried to replace page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 }; with the code below (one line at a time):

window.document.body.scrollTop = '1000';
location.href = ".has-more-items";
page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };
document.location.href = ".has-more-items";

But nothing seems to work.


Source: (StackOverflow)

Call Javascript function from Python

I am working on a web-scraping project. One of the websites I am working on has its data coming from JavaScript.

There was a suggestion in one of my earlier questions that I can call the JavaScript directly from Python.

Any idea how to do it? I was not able to figure out how to call a JavaScript function from Python.

For example: if a JS function is defined as add_2(var, var2)

How do we call the same function from Python? Any useful reference would be highly appreciated.

Thanks
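If the function is self-contained (no DOM access), one lightweight option is the third-party js2py package, which interprets JavaScript in pure Python; a sketch using the add_2 example (the function body is an illustrative assumption):

import js2py  # pip install js2py

# Suppose the scraped page defines this function somewhere in a <script> tag
js_source = "function add_2(a, b) { return a + b; }"

add_2 = js2py.eval_js(js_source)  # wraps the JS function as a Python callable
print(add_2(3, 4))  # -> 7

For JavaScript that touches the DOM, the usual route is instead a real engine behind Selenium or PhantomJS.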


Source: (StackOverflow)

How do screen scrapers work?

I hear about people writing these programs all the time and I know what they do, but how do they actually do it? I'm looking for general concepts.
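At its core the loop is: fetch the raw HTML over HTTP, parse it into a tree, and extract the nodes you want by tag or attribute patterns. A minimal Python sketch of those three steps (the URL is illustrative):

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# 1. Fetch: the scraper plays the role of a browser, minus the rendering
html = requests.get("http://example.com", timeout=10).text

# 2. Parse: turn the markup into a navigable tree
soup = BeautifulSoup(html, "html.parser")

# 3. Extract: select the pieces you care about
for link in soup.find_all("a"):
    print(link.get("href"), link.get_text(strip=True))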


Source: (StackOverflow)