EzDevInfo.com

Scrapy interview questions

Top frequently asked Scrapy interview questions

How to pass a user defined argument in scrapy spider

I am trying to pass a user-defined argument to a Scrapy spider. Can anyone suggest how to do that?

I read about a parameter -a somewhere but have no idea how to use it.
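For context, a minimal sketch of how -a is commonly used (the spider name, argument name and URL below are made up): anything passed as -a name=value on the command line reaches the spider's constructor as a keyword argument and is also set as an attribute on the spider instance.

# Run with:
#   scrapy crawl myspider -a category=electronics
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # 'category' comes from -a category=...; it is also available
        # later as self.category.
        self.start_urls = ["http://www.example.com/categories/%s" % category]

    def parse(self, response):
        self.log("Crawling category page %s" % response.url)

With older Scrapy versions the same pattern works with BaseSpider instead of scrapy.Spider.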


Source: (StackOverflow)

How to use pycharm to debug scrapy projects

I am working with Scrapy 0.20 and Python 2.7.

I find PyCharm a good Python debugger.

I want to test my Scrapy spiders using it.

Does anyone know how to do that?

What I have tried

I actually tried to run the spider as a script, so I wrote such a script.

Then I tried to add my Scrapy project to PyCharm as a module, like this:

File -> Settings -> Project Structure -> Add Content Root

But I don't know what else I have to do.
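One approach that generally works, sketched here with a placeholder spider name: since a Scrapy crawl is ultimately just Python, you can create a small runner script next to scrapy.cfg and point an ordinary PyCharm Run/Debug configuration at it; breakpoints set inside the spider code are then hit as usual.

# runner.py - put this next to scrapy.cfg and run/debug this file from PyCharm.
from scrapy import cmdline

# Equivalent to typing "scrapy crawl myspider" in the terminal;
# replace 'myspider' with your spider's name.
cmdline.execute("scrapy crawl myspider".split())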


Source: (StackOverflow)


Headless Browser and scraping - solutions [closed]

I'm trying to put together a list of possible solutions for automated browser test suites and headless browser platforms capable of scraping.


BROWSER TESTING / SCRAPING:

  • Selenium - the polyglot flagship of browser automation, with bindings for Python, Ruby, JavaScript, C#, Haskell and more, plus an IDE for Firefox (as an extension) for faster test deployment. Can act as a server and has tons of features.

JAVASCRIPT

  • PhantomJS - JavaScript, headless testing with screen capture and automation, uses WebKit. As of version 1.8, Selenium's WebDriver API is implemented, so you can use any WebDriver binding and your tests will be compatible with Selenium.
  • SlimerJS - similar to PhantomJS, uses Gecko (Firefox) instead of WebKit
  • CasperJS - JavaScript, built on both PhantomJS and SlimerJS, has extra features
  • Ghost Driver - JavaScript implementation of the WebDriver Wire Protocol for PhantomJS.
  • new PhantomCSS - CSS regression testing. A CasperJS module for automating visual regression testing with PhantomJS and Resemble.js.
  • new WebdriverCSS - plugin for Webdriver.io for automating visual regression testing
  • new PhantomFlow - Describe and visualize user flows through tests. An experimental approach to Web user interface testing.
  • new trifleJS - ports the PhantomJS API to use the Internet Explorer engine.
  • new CasperJS IDE (commercial)

NODE.JS

  • Node-phantom - bridges the gap between PhantomJS and node.js
  • WebDriverJs - Selenium WebDriver bindings for node.js by Selenium Team
  • WD.js - node module for WebDriver/Selenium 2
  • yiewd - WD.js wrapper using latest Harmony generators! Get rid of the callback pyramid with yield
  • ZombieJs - Insanely fast, headless full-stack testing using node.js
  • NightwatchJs - Node JS based testing solution using Selenium Webdriver
  • Chimera - can do everything PhantomJS does, but in a full JS environment
  • Dalek.js - Automated cross browser testing with JavaScript through Selenium Webdriver
  • Webdriver.io - better implementation of WebDriver bindings with predefined 50+ actions
  • Nightmare - Electron bridge with a high-level API.
  • jsdom - Tailored towards web scraping. A very lightweight DOM implemented in Node.js, it supports pages with javascript.

WEB SCRAPING / MINING

  • Scrapy - Python, mainly a scraper/miner - fast, well documented, can be linked with Django Dynamic Scraper for nice mining deployments or with Scrapy Cloud for PaaS (server-less) deployment, works from the terminal or as a stand-alone server process, can be used with Celery, built on top of Twisted
  • Snailer - node.js module, not tested yet.
  • Node-Crawler - node.js module, not tested yet.

ONLINE TOOLS


RELATED LINKS & RESOURCES

Questions:

  • Is there any pure Node.js solution, or a Node.js-to-PhantomJS/CasperJS module, that actually works and is documented?

Answer: Chimera seems to go in that direction; check out Chimera

  • Are there other solutions capable of easier JavaScript injection than Selenium?

  • Do you know of any pure Ruby solutions?

Answer: Check out the list created by rjk with Ruby-based solutions

  • Do you know of any related tech or solutions?

Feel free to re-edit this question and add content as you wish! Thank you for your contributions!


Updates

  1. added SlimerJS to the list
  2. added Snailer and Node-Crawler and Node-phantom
  3. added Yiewd WebDriver wrapper
  4. added WebDriverJs and WD.js
  5. added Ghost Driver
  6. added Comparison of Webscraping software on Screen Scraper Blog
  7. added ZombieJs
  8. added Resemble.js and PhantomCSS and PhantomFlow, categorised and re-edited content
  9. 04.01.2014, added Chimera, answered 2 questions
  10. added NightWatchJs
  11. added DalekJS
  12. added WebdriverCSS
  13. added CasperBox
  14. added trifleJS
  15. added CasperJS IDE
  16. added Nightmare
  17. added jsdom
  18. added Online HTTP client, updated CasperBox (dead)

Source: (StackOverflow)

difference between BeautifulSoup and Scrapy crawler?

I want to make a website that shows a comparison between Amazon and eBay product prices. Which of these will work better, and why? I am somewhat familiar with BeautifulSoup but not so much with the Scrapy crawler.
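The short version of the difference: BeautifulSoup is only an HTML/XML parser, so fetching pages, following links, throttling and exporting are up to you, while Scrapy is a complete crawling framework that handles all of that and only asks you to write the parsing callbacks. A rough sketch in current Scrapy/Python 3 terms, with a made-up URL and selector:

# BeautifulSoup: you download the page yourself, then parse it.
import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.example.com/item/123").text
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("span.price"))

# Scrapy: the framework downloads, schedules, retries and exports;
# you only supply the parsing callback.
import scrapy

class PriceSpider(scrapy.Spider):
    name = "prices"
    start_urls = ["http://www.example.com/item/123"]

    def parse(self, response):
        yield {"price": response.xpath('//span[@class="price"]/text()').extract_first()}

For regularly comparing prices across two sites, the crawling infrastructure is usually the hard part, which is what Scrapy provides; BeautifulSoup can still serve as the parser inside whatever fetching code you write.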


Source: (StackOverflow)

How can I use different pipelines for different spiders in a single Scrapy project

I have a Scrapy project which contains multiple spiders. Is there any way I can define which pipelines to use for which spider? Not all the pipelines I have defined are applicable for every spider.

Thanks
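One common workaround, sketched below with made-up names, is to let each pipeline decide per item whether it applies by checking spider.name inside process_item; newer Scrapy versions also let a spider override ITEM_PIPELINES through its custom_settings attribute.

# settings.py still lists the pipeline for the whole project, e.g.:
#   ITEM_PIPELINES = {'myproject.pipelines.FirstPipeline': 300}

class FirstPipeline(object):
    # Only these spiders are handled by this pipeline.
    applicable_spiders = ['spider_one', 'spider_two']

    def process_item(self, item, spider):
        if spider.name not in self.applicable_spiders:
            return item  # pass items from other spiders through untouched
        # ... pipeline-specific processing for the relevant spiders ...
        return item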


Source: (StackOverflow)

Access django models inside of Scrapy

Is it possible to access my Django models inside a Scrapy pipeline, so that I can save my scraped data straight to my model?

I've seen this, but I don't really get how to set it up.
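In case it helps, here is a minimal sketch of the usual setup (all names and paths are placeholders: a Django project 'mysite' with an app 'products' defining a Product model). The idea is to make the Django project importable and configured before the pipeline imports any models; django.setup() is only needed on Django 1.7+.

# pipelines.py
import os
import sys

sys.path.append('/path/to/your/django/project')                # placeholder path
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'mysite.settings')

import django
django.setup()  # Django >= 1.7; older versions configure themselves on import

from products.models import Product

class DjangoSavePipeline(object):
    def process_item(self, item, spider):
        # Map scraped fields straight onto the Django model.
        Product.objects.create(name=item['name'], price=item['price'])
        return item

There is also the django-dynamic-scraper / DjangoItem route, which wraps a Django model in a Scrapy Item, but a plain pipeline like the above is the simplest thing that works.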


Source: (StackOverflow)

Scrapy crawl from script always blocks script execution after scraping

I am following this guide http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script to run scrapy from my script. Here is part of my script:

    crawler = Crawler(Settings(settings))
    crawler.configure()
    spider = crawler.spiders.create(spider_name)
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
    print "It can't be printed out!"

It works as it should: it visits pages, scrapes the needed info and stores the output JSON where I told it to (via FEED_URI). But when the spider finishes its work (I can see it by the count in the output JSON), execution of my script does not resume. It probably isn't a Scrapy problem; the answer should be somewhere in Twisted's reactor. How can I release the thread execution?
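One common fix, sketched against the same 0.16-era API as the snippet above: tell the Twisted reactor to stop when the spider closes, so that reactor.run() returns and the rest of the script executes.

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from twisted.internet import reactor

# Add this before reactor.run(): once the spider closes, stop the reactor,
# so reactor.run() returns and the code after it is reached.
dispatcher.connect(reactor.stop, signal=signals.spider_closed)

In later Scrapy versions the same idea is written as crawler.signals.connect(reactor.stop, signal=signals.spider_closed).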


Source: (StackOverflow)

How to run Scrapy from within a Python script

I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/

http://snipplr.com/view/67006/using-scrapy-from-a-script/

I can't figure out where I should put my spider code and how to call it from the main function. Please help. This is the example code:

# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script. 
# 
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
# 
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet. 

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue

class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)

# Usage
if __name__ == "__main__":
    log.start()

    """
    This example runs spider1 and then spider2 three times. 
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date  : Oct 24, 2010

Thank you.
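For what it's worth, with a snippet like this the spider itself is not defined in the script: it lives in the regular Scrapy project that SCRAPY_SETTINGS_MODULE ('project.settings' above) points to, and is looked up by name via crawler.spiders.create(spider_name). A minimal sketch of such a spider, with made-up names, using the BaseSpider class from that era of Scrapy:

# project/spiders/spider1.py
from scrapy.spider import BaseSpider

class Spider1(BaseSpider):
    name = "spider1"          # the name passed to crawler.crawl('spider1')
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # Items returned here reach the calling script through the
        # item_passed signal collected in CrawlerScript._item_passed().
        self.log("Visited %s" % response.url)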


Source: (StackOverflow)

Installing scrapy/pyopenssl in Windows' virtualenv

I am trying to install Scrapy in a virtualenv on Windows XP (32-bit):

pip install scrapy

The installer spits out this ambiguous error message:

error: Only found improper OpenSSL directories: ['E:\\cygwin', 'E:\\Program Files\\Git']

How should I configure OpenSSL / pyOpenSSL to make pip work?


Source: (StackOverflow)

Force my scrapy spider to stop crawling

Is there a way to stop crawling when a specific condition is true (like scrap_item_id == predefine_value)? My problem is similar to Scrapy - how to identify already scraped urls, but I want to 'force' my Scrapy spider to stop crawling after discovering the last scraped item.
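For reference, Scrapy ships a CloseSpider exception for exactly this: raising it from a callback asks the engine to shut the spider down. A minimal sketch in current Scrapy terms (the URL, selector and sentinel value are made up):

import scrapy
from scrapy.exceptions import CloseSpider

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["http://www.example.com/items"]
    predefine_value = "12345"   # id of the last item scraped in the previous run

    def parse(self, response):
        for scrap_item_id in response.xpath('//div[@class="item"]/@data-id').extract():
            if scrap_item_id == self.predefine_value:
                # Stops scheduling new requests; requests already in flight
                # may still finish before the spider actually closes.
                raise CloseSpider("reached the last previously scraped item")
            yield {"scrap_item_id": scrap_item_id}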


Source: (StackOverflow)

Crawling with an authenticated session in Scrapy

In my previous question, I wasn't very specific about my problem (scraping with an authenticated session with Scrapy), in the hope of being able to deduce the solution from a more general answer. I should probably have used the word crawling instead.

So, here is my code so far:

class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/login/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'), callback='parse_item', follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        if not "Hi Herman" in response.body:
            return self.login(response)
        else:
            return self.parse_item(response)

    def login(self, response):
        return [FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.parse)]


    def parse_item(self, response):
        i['url'] = response.url

        # ... do more things

        return i

As you can see, the first page I visit is the login page. If I'm not authenticated yet (in the parse function), I call my custom login function, which posts to the login form. Then, if I am authenticated, I want to continue crawling.

The problem is that the parse function I overrode in order to log in no longer makes the necessary calls to scrape any further pages (I'm assuming). And I'm not sure how to go about saving the Items that I create.

Has anyone done something like this before (authenticate, then crawl, using a CrawlSpider)? Any help would be appreciated.
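One workaround that is often used with CrawlSpider, sketched below (names and URLs are taken from the question or made up, and this is not necessarily the only way): leave parse() alone, because CrawlSpider needs it for its Rules, do the login in start_requests() with dedicated callbacks, and only hand a page over to the rule machinery once the session is authenticated.

from scrapy.http import Request, FormRequest
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'), callback='parse_item', follow=True),
    )

    def start_requests(self):
        # Fetch the login page explicitly instead of listing it in start_urls.
        return [Request('http://www.domain.com/login/', callback=self.login)]

    def login(self, response):
        return FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.check_login)

    def check_login(self, response):
        if "Hi Herman" in response.body:
            # No callback given: the default is CrawlSpider.parse, so from here
            # on the Rules above take over, inside the authenticated session.
            return Request('http://www.domain.com/')
        self.log("Login failed")

    def parse_item(self, response):
        self.log("Scraping %s" % response.url)
        # ... build and return the Item here, as in the question ...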


Source: (StackOverflow)

How to setup and launch a Scrapy spider programmatically (urls and settings)

I've written a working crawler using Scrapy, and now I want to control it through a Django webapp, that is to say:

  • Set 1 or several start_urls
  • Set 1 or several allowed_domains
  • Set settings values
  • Start the spider
  • Stop / pause / resume a spider
  • Retrieve some stats while running
  • Retrieve some stats after the spider completes.

At first I thought scrapyd was made for this, but after reading the docs, it seems to be more of a daemon able to manage 'packaged spiders', aka 'scrapy eggs', and that all the settings (start_urls, allowed_domains, settings) must still be hardcoded in the 'scrapy egg' itself; so it doesn't look like a solution to my question, unless I missed something.

I also looked at this question: How to give URL to scrapy for crawling?; but the best answer for providing multiple urls is qualified by its author as an 'ugly hack', involving some Python subprocess and complex shell handling, so I don't think the solution is to be found there. Also, it may work for start_urls, but it doesn't seem to allow allowed_domains or settings.

Then I took a look at Scrapy's web service: it seems to be a good solution for retrieving stats, but it still requires a running spider and gives no clue about changing settings.

There are several questions on this subject, but none of them seems satisfactory:

I know that Scrapy is used in production environments, and a tool like scrapyd shows that there are definitely ways to handle these requirements (I can't imagine that the scrapy eggs scrapyd deals with are generated by hand!).

Thanks a lot for your help.
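For what it's worth, current scrapyd is less static than it may look: schedule.json accepts per-run settings and arbitrary spider arguments, so at least start URLs and settings do not have to live in the egg. A rough sketch (assumes scrapyd is running on localhost:6800 with a deployed project 'myproject' and spider 'myspider'; the 'requests' library is used only for brevity, and the spider must be written to read the start_url argument):

import requests

response = requests.post("http://localhost:6800/schedule.json", data={
    "project": "myproject",
    "spider": "myspider",
    # Per-run Scrapy settings, sent as repeated 'setting' fields:
    "setting": ["DOWNLOAD_DELAY=2", "CONCURRENT_REQUESTS=8"],
    # Any other field becomes a spider argument, like -a on the command line:
    "start_url": "http://www.example.com/",
})
print response.json()   # includes the job id, usable with listjobs.json / cancel.json

This does not cover pause/resume or live stats, but it removes the need to hardcode start_urls and settings in the egg.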


Source: (StackOverflow)

How to use CrawlSpider from scrapy to click a link with javascript onclick?

I want Scrapy to crawl pages where going on to the next page looks like this:

<a rel='nofollow' href="#" onclick="return gotoPage('2');"> Next </a>

Will Scrapy be able to interpret that JavaScript code?

With the LiveHTTPHeaders extension I found out that clicking Next generates a POST with a really huge piece of "garbage" starting like this:

encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n

I am trying to build my spider on the CrawlSpider class, but I can't really figure out how to code it. With BaseSpider I used the parse() method to process the first URL, which happens to be a login form, where I did a POST with:

def logon(self, response):
    login_form_data={ 'email': 'user@example.com', 'password': 'mypass22', 'action': 'sign-in' }
    return [FormRequest.from_response(response, formnumber=0, formdata=login_form_data, callback=self.submit_next)]

And then I defined submit_next() to say what to do next. I can't figure out how to tell CrawlSpider which method to use on the first URL.

All requests in my crawl, except the first one, are POST requests. They alternate between two types of requests: pasting some data, and clicking "Next" to go to the next page.
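Scrapy itself will not execute that onclick JavaScript; the usual approach is to reproduce, with FormRequest, the POST the browser sends when Next is clicked. As for the first URL: CrawlSpider passes responses for start_urls to parse_start_url(), so that is the hook to use instead of parse(). A sketch along those lines (the URL and form fields are placeholders to be copied from LiveHTTPHeaders):

from scrapy.http import FormRequest
from scrapy.contrib.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['http://www.example.com/login']   # placeholder login URL

    def parse_start_url(self, response):
        # CrawlSpider calls this for start_urls responses, so the login
        # POST goes here rather than in an overridden parse().
        login_form_data = {'email': 'user@example.com',
                           'password': 'mypass22',
                           'action': 'sign-in'}
        return FormRequest.from_response(response, formnumber=0,
                                         formdata=login_form_data,
                                         callback=self.parse_page)

    def parse_page(self, response):
        # ... extract data from the current results page here ...
        # Then emulate clicking "Next": send the POST the onclick would send,
        # including whatever hidden/session fields the real form carries.
        # (Placeholder formdata; incrementing the page number is omitted.)
        return FormRequest.from_response(response, formnumber=0,
                                         formdata={'page': '2'},
                                         callback=self.parse_page)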


Source: (StackOverflow)

Scrapy - parse a page to extract items - then follow and store item url contents

I have a question about how to do this in Scrapy. I have a spider that crawls listing pages of items. Every time a listing page with items is found, the parse_item() callback is called to extract the item data and yield items. So far so good, everything works great.

But each item has, among other data, a URL with more details about that item. I want to follow that URL and store the fetched contents of that item's URL in another item field (url_contents).

I'm not sure how to organize the code to achieve that, since the two kinds of links (the listing links and the individual item links) are followed differently, with callbacks called at different times, yet I have to correlate them when processing the same item.

My code so far looks like this:

class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?q=example",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='), restrict_xpaths = '//div[@class="pagination"]'), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('item\/detail', )), follow = False),
    )


    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//h2[@class="title"]'

        sub_selectors = main_selector.select(xpath)

        for sel in sub_selectors:
            item = ExampleItem()
            l = ExampleLoader(item = item, selector = sel)
            l.add_xpath('title', 'a[@title]/@title')
            ......
            yield l.load_item()
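The usual pattern for this, sketched below against the same old-style API as the question (the detail-link XPath and item fields are made up, and a plain BaseSpider is used for brevity; with CrawlSpider the same Request.meta trick applies inside the rule callback): build the item, then instead of yielding it immediately, yield a Request for the detail URL carrying the half-built item in Request.meta, and yield the finished item from the second callback.

import urlparse

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class ExampleItem(Item):
    title = Field()
    url_contents = Field()

class MySpider(BaseSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/?q=example"]

    def parse(self, response):
        main_selector = HtmlXPathSelector(response)
        for sel in main_selector.select('//h2[@class="title"]'):
            item = ExampleItem()
            item['title'] = sel.select('a[@title]/@title').extract()
            # 'a/@href' is a placeholder for the real detail-page link.
            href = sel.select('a/@href').extract()[0]
            detail_url = urlparse.urljoin(response.url, href)
            # Carry the partially filled item along with the detail request.
            yield Request(detail_url, meta={'item': item},
                          callback=self.parse_url_contents)

    def parse_url_contents(self, response):
        item = response.request.meta['item']
        item['url_contents'] = response.body
        yield item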

Source: (StackOverflow)

Scrapy: Follow link to get additional Item data?

I don't have a specific code issue; I'm just not sure how to approach the following problem logistically with the Scrapy framework:

The structure of the data I want to scrape is typically a table row for each item. Straightforward enough, right?

Ultimately I want to scrape the Title, Due Date, and Details for each row. Title and Due Date are immediately available on the page...

BUT the Details themselves aren't in the table -- there's just a link to the page containing the details (if that doesn't make sense, here's a table):

|-------------------------------------------------|
|             Title              |    Due Date    |
|-------------------------------------------------|
| Job Title (Clickable Link)     |    1/1/2012    |
| Other Job (Link)               |    3/2/2012    |
|--------------------------------|----------------|

I'm afraid I still don't know how to logistically pass the item around with callbacks and requests, even after reading through the CrawlSpider section of the Scrapy documentation.
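The usual way to handle this is to scrape Title and Due Date from the row, then pass the partially built item to the details page through Request.meta and only yield it from the second callback once Details has been added. A sketch in current Scrapy terms (URL, selectors and field names are made up):

import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"
    start_urls = ["http://www.example.com/jobs"]

    def parse(self, response):
        for row in response.xpath('//table//tr[td]'):
            item = {
                "title": row.xpath('td[1]/a/text()').extract_first(),
                "due_date": row.xpath('td[2]/text()').extract_first(),
            }
            details_url = response.urljoin(row.xpath('td[1]/a/@href').extract_first())
            # Hand the half-built item to the details request.
            yield scrapy.Request(details_url, meta={"item": item},
                                 callback=self.parse_details)

    def parse_details(self, response):
        item = response.meta["item"]
        item["details"] = " ".join(response.xpath('//div[@id="details"]//text()').extract())
        yield item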


Source: (StackOverflow)