EzDevInfo.com

web-crawler interview questions

Top frequently asked web-crawler interview questions

How to identify a web crawler?

How can I filter out hits from web crawlers and the like, i.e. hits that are not from humans?

I use maxmind.com to look up the city for each IP address. It is not exactly cheap if I have to pay for ALL hits, including those from web crawlers, robots, etc.
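One cheap first-line filter is to check the User-Agent header before paying for the lookup, since most well-behaved crawlers identify themselves there. A rough Python sketch of that idea (the pattern list is an assumption, not a complete catalogue, and lookup_city stands in for your existing MaxMind call):

import re

# Substrings that identify common crawlers (an assumed, non-exhaustive list).
BOT_PATTERN = re.compile(r'bot|crawler|spider|slurp|wget|curl', re.IGNORECASE)

def is_probably_bot(user_agent):
    # A missing or empty User-Agent is treated as suspicious as well.
    return not user_agent or BOT_PATTERN.search(user_agent) is not None

# Only pay for the MaxMind lookup when the hit does not look like a bot:
#     if not is_probably_bot(request_user_agent):
#         city = lookup_city(ip)   # your existing MaxMind call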


Source: (StackOverflow)

How to pass a user defined argument in scrapy spider

I am trying to pass a user-defined argument to a Scrapy spider. Can anyone suggest how to do that?

I read about a parameter -a somewhere but have no idea how to use it.
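For reference, Scrapy's -a option takes key=value pairs on the command line and passes them to the spider's __init__ as keyword arguments (by default they also become spider attributes). A minimal sketch, where the spider name, URL pattern and category argument are all invented:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Anything passed with -a arrives here as a keyword argument.
        self.start_urls = ['http://example.com/categories/%s' % category]

    def parse(self, response):
        yield {'url': response.url}

It would then be run with something like: scrapy crawl myspider -a category=electronics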


Source: (StackOverflow)


How to find all links / pages on a website

Is it possible to find all the pages and links on ANY given website? I'd like to enter a URL and produce a directory tree of all the links from that site.

I've looked at HTTrack but that downloads the whole site and I simply need the directory tree.
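Before crawling, it can be worth checking whether the site publishes a sitemap.xml, since that is exactly an enumeration of its pages; if it does not, you are back to following links page by page (what HTTrack does, just without saving the content). A rough Python sketch of the sitemap route (the site URL is a placeholder, and not every site has a sitemap):

from urllib.parse import urlparse
from xml.etree import ElementTree
import requests

def urls_from_sitemap(site):
    # Many sites enumerate their pages in /sitemap.xml (not all do).
    resp = requests.get(site.rstrip('/') + '/sitemap.xml', timeout=10)
    resp.raise_for_status()
    root = ElementTree.fromstring(resp.content)
    # Page entries live in <url><loc> elements; a sitemap index instead lists
    # sub-sitemap URLs, which you would fetch and parse in turn.
    ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    return [loc.text for loc in root.findall('.//sm:loc', ns)]

# Print the URLs indented by path depth to approximate a directory tree.
for url in sorted(urls_from_sitemap('http://example.com')):   # placeholder site
    path = urlparse(url).path.strip('/')
    depth = path.count('/') if path else 0
    print('  ' * depth + (path.rsplit('/', 1)[-1] or '/'))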


Source: (StackOverflow)

What are some good Ruby-based web crawlers? [closed]

I am looking at writing my own, but I am wondering if there are any good web crawlers out there which are written in Ruby.

Short of a full-blown web crawler, any gems that might be helpful in building a web crawler would be useful. I know this part of the question is touched upon in a couple of places, but a list of gems applicable to building a web crawler would be a great resource as well.


Source: (StackOverflow)

Sending "User-agent" using Requests library in Python

I want to send a value for "User-agent" while requesting a webpage using Python Requests. I am not sure if it is okay to send this as part of the header, as in the code below:

import sys
import requests

debug = {'verbose': sys.stderr}
user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(url, headers=user_agent, config=debug)

The debug information isn't showing the headers being sent during the request.

Is it acceptable to send this information in the header? If not, how can I send it?
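Passing a headers dict is the right idea; in current versions of Requests the config keyword no longer exists, and you can confirm what was actually sent by inspecting response.request.headers. A minimal sketch (the URL is a placeholder):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('http://example.com/', headers=headers)   # placeholder URL

# The headers that were actually sent can be inspected afterwards:
print(response.request.headers['User-Agent'])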


Source: (StackOverflow)

How to detect search engine bots with PHP?

How can one detect search engine bots using PHP?
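The usual PHP approach is to match $_SERVER['HTTP_USER_AGENT'] against a list of known bot names; for the big engines you can additionally verify the claim with a reverse DNS lookup followed by a forward lookup, which is what Google recommends for Googlebot. A sketch of that verification step, written in Python here (the socket calls map onto PHP's gethostbyaddr and gethostbyname):

import socket

def is_verified_googlebot(ip):
    # Reverse lookup: the hostname should end in googlebot.com or google.com.
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith(('.googlebot.com', '.google.com')):
        return False
    # Forward lookup: the hostname must resolve back to the same IP.
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False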


Source: (StackOverflow)

How to write a crawler?

I have been thinking about writing a simple crawler that would crawl our NPO's websites and content and produce a list of its findings.

Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it has found, and so on?
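At its core a crawler is just a queue of URLs to visit, a set of URLs already seen, and a loop that fetches a page, records whatever counts as a finding, and pushes the page's links back onto the queue. A minimal Python sketch of that loop, assuming requests and BeautifulSoup are available (the seed URL and the notion of a "finding" are placeholders for your NPO's sites and your own rules):

from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=100):
    host = urlparse(seed).netloc
    seen, frontier = {seed}, deque([seed])
    while frontier and max_pages > 0:
        url = frontier.popleft()
        max_pages -= 1
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, 'html.parser')
        # "Send back" a finding: here it is just the URL and the page title.
        yield url, (soup.title.string or '') if soup.title else ''
        # Keep crawling: queue every same-site link we have not seen yet.
        for a in soup.find_all('a', href=True):
            link = urljoin(url, a['href']).split('#')[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                frontier.append(link)

for url, title in crawl('http://example.org/'):   # placeholder seed URL
    print(url, '-', title)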


Source: (StackOverflow)

How do I make a simple crawler in PHP?

I have a web page with a bunch of links. I want to write a script which would dump all the data contained in those links in a local file.

Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer.
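In PHP the usual building blocks are cURL or file_get_contents for fetching and DOMDocument for extracting the anchors; the main gotchas are relative URLs, duplicate links, and pages that time out. The overall flow is sketched below in Python for brevity (one start page, every linked page appended to one local file; the URL and filename are placeholders):

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

start = 'http://example.com/links.html'     # placeholder: the page that holds the links
page = requests.get(start, timeout=10).text

with open('dump.txt', 'w', encoding='utf-8') as out:
    for a in BeautifulSoup(page, 'html.parser').find_all('a', href=True):
        url = urljoin(start, a['href'])      # resolve relative hrefs
        try:
            out.write(requests.get(url, timeout=10).text + '\n')
        except requests.RequestException:
            continue                         # skip links that fail or time out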


Source: (StackOverflow)

How to request Google to re-crawl my website? [closed]

Does anyone know a way to request that Google re-crawl a website, ideally without it taking months?

My site, manishshrivastava.com, is showing the wrong heading in Google's search results.

search Result Here

How can I get it to show the correct heading and description?

Thanks!


Source: (StackOverflow)

What is the difference between web-crawling and web-scraping?

Is there a difference between web crawling and web scraping?

If there is a difference, what's the best method to use to collect web data to populate a database for later use in a customised search engine?
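Loosely put, crawling is discovering and fetching pages by following links, while scraping is extracting structured data from pages you already have; a pipeline that feeds a search-engine database typically does both. A rough Python sketch of that split, with sqlite3 standing in for whatever database you use (the URLs, table and fields are invented):

import sqlite3
import requests
from bs4 import BeautifulSoup

conn = sqlite3.connect('corpus.db')
conn.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT, body TEXT)')

def scrape(url, html):
    # Scraping: turn raw HTML into the structured record we want to keep.
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.title.string.strip() if soup.title and soup.title.string else ''
    return url, title, soup.get_text(' ', strip=True)

def store(record):
    conn.execute('INSERT OR REPLACE INTO pages VALUES (?, ?, ?)', record)
    conn.commit()

# Crawling: here just a fixed list; a real crawler would discover these URLs itself.
for url in ['http://example.com/', 'http://example.com/about']:
    store(scrape(url, requests.get(url, timeout=10).text))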


Source: (StackOverflow)

How to run Scrapy from within a Python script

I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/

http://snipplr.com/view/67006/using-scrapy-from-a-script/

I can't figure out where I should put my spider code and how to call it from the main function. Please help. This is the example code:

# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script. 
# 
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
# 
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet. 

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue

class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)

# Usage
if __name__ == "__main__":
    log.start()

    """
    This example runs spider1 and then spider2 three times. 
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date  : Oct 24, 2010

Thank you.
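The snippet above targets a very old Scrapy release (scrapy.conf, scrapy.xlib and the multiprocessing workaround have long since been removed). In recent Scrapy versions the supported way to run spiders from a script is CrawlerProcess; a minimal sketch, where MySpider and its URL are placeholders for your own spider:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com/']      # placeholder

    def parse(self, response):
        # Yield items or follow links here as in any ordinary spider.
        yield {'url': response.url, 'title': response.css('title::text').get()}

if __name__ == '__main__':
    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
    process.crawl(MySpider)        # can be called more than once to queue several spiders
    process.start()                # blocks until all queued crawls have finished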


Source: (StackOverflow)

Alternative to HtmlUnit

I have been researching the headless browsers available to date and found that HtmlUnit is used pretty extensively. Is there any alternative to HtmlUnit, and what advantages would it offer over HtmlUnit?

Thanks, Nayn


Source: (StackOverflow)

How do web crawlers handle JavaScript?

Today a lot of content on the Internet is generated using JavaScript (specifically by background AJAX calls). I was wondering how web crawlers like Google's handle it. Are they aware of JavaScript? Do they have a built-in JavaScript engine? Or do they simply ignore all JavaScript-generated content on the page (which I guess is quite unlikely)? Do people use specific techniques to get content indexed that would otherwise only be available to a normal Internet user through background AJAX requests?


Source: (StackOverflow)

What techniques can be used to detect so-called "black holes" (spider traps) when creating a web crawler?

When creating a web crawler, you have to design some kind of system that gathers links and adds them to a queue. Some, if not most, of these links will be dynamic: they appear to be different, but add no value, since they are specifically created to fool crawlers.

An example:

We tell our crawler to crawl the domain evil.com by entering an initial lookup URL.

Let's assume we let it crawl the front page initially, evil.com/index.

The returned HTML will contain several "unique" links:

  • evil.com/somePageOne
  • evil.com/somePageTwo
  • evil.com/somePageThree

The crawler will add these to the buffer of uncrawled URLs.

When somePageOne is being crawled, the crawler receives more URLs:

  • evil.com/someSubPageOne
  • evil.com/someSubPageTwo

These appear to be unique, and in a sense they are: the returned content is different from previous pages, and the URLs are new to the crawler. However, this is only because the developer has built a "loop trap" or "black hole".

The crawler will add each new sub page, and that sub page will have another sub page, which will also be added. This process can go on indefinitely. The content of each page is unique but totally useless (it is randomly generated text, or text pulled from a random source), so our crawler keeps finding new pages that we are not actually interested in.

These loop traps are very difficult to find, and if your crawler does not have anything in place to prevent them, it will get stuck on such a domain indefinitely.

My question is: what techniques can be used to detect so-called black holes?

One of the most common answers I have heard is to introduce a limit on the number of pages to be crawled. However, I cannot see how this can be a reliable technique when you do not know what kind of site is to be crawled. A legitimate site, like Wikipedia, can have hundreds of thousands of pages, so such a limit could produce a false positive for that kind of site.
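In practice crawlers tend to layer several cheap heuristics rather than rely on a single global page cap: a maximum URL depth, a cap on pages per path pattern within one domain, and duplicate or near-duplicate detection on the fetched content. A Python sketch of how such checks might be combined (all thresholds are arbitrary, assumed values):

import hashlib
from collections import defaultdict
from urllib.parse import urlparse

MAX_DEPTH = 8            # path segments before a URL looks suspicious
MAX_PER_PATTERN = 500    # pages allowed under one /first-segment/ prefix
seen_hashes = set()
pattern_counts = defaultdict(int)

def looks_like_trap(url, content):
    segments = [s for s in urlparse(url).path.split('/') if s]
    # 1. Unreasonably deep URLs are a classic symptom of generated link chains.
    if len(segments) > MAX_DEPTH:
        return True
    # 2. Too many pages under the same leading path segment.
    pattern = segments[0] if segments else ''
    pattern_counts[pattern] += 1
    if pattern_counts[pattern] > MAX_PER_PATTERN:
        return True
    # 3. Content we have already seen verbatim (randomly generated text would still
    #    pass this; a shingling or simhash comparison catches near-duplicates better).
    digest = hashlib.sha1(content.encode('utf-8', 'ignore')).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False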


Source: (StackOverflow)

Language/libraries for downloading & parsing web pages?

What language and libraries are suitable for a script to parse and download small numbers of web resources?

For example, some websites publish pseudo-podcasts, but not as proper RSS feeds; they just publish an MP3 file regularly with a web page containing the playlist. I want to write a script to run regularly and parse the relevant pages for the link and playlist info, download the MP3, and put the playlist in the MP3 tags so it shows up nicely in my iPod. There are a bunch of similar applications that I could write too.

What language would you recommend? I would like the script to run on Windows and MacOS. Here are some alternatives:

  • JavaScript. Just so I could use jQuery for the parsing. I don't know if jQuery works outside a browser though.
  • Python. Probably good library support for doing what I want. But I don't love Python syntax.
  • Ruby. I've done simple stuff (manual parsing) in Ruby before.
  • Clojure. Because I want to spend a bit of time with it.

What's your favourite language and libraries for doing this? And why? Are there any nice jQuery-like libraries for other languages?
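If Python ends up being the choice, the usual combination for this kind of job would be requests plus BeautifulSoup for the page and a tagging library such as mutagen for the ID3 fields. A rough sketch under those assumptions (the page URL and the selectors are invented and would need to match the real playlist page):

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
from mutagen.easyid3 import EasyID3
from mutagen.id3 import ID3NoHeaderError

page_url = 'http://example.com/pseudo-podcast/'        # placeholder page
soup = BeautifulSoup(requests.get(page_url, timeout=30).text, 'html.parser')

# Find the MP3 link and the playlist text (these selectors are invented).
mp3_link = soup.find('a', href=lambda h: h and h.endswith('.mp3'))['href']
playlist = soup.find('div', class_='playlist').get_text('\n', strip=True)

# Download the MP3 to disk.
with open('episode.mp3', 'wb') as f:
    f.write(requests.get(urljoin(page_url, mp3_link), timeout=120).content)

# Write the playlist info into the ID3 tags so it shows up nicely on the iPod.
try:
    tags = EasyID3('episode.mp3')
except ID3NoHeaderError:
    tags = EasyID3()
tags['title'] = playlist.splitlines()[0]
tags['album'] = 'Pseudo-podcast'
tags.save('episode.mp3')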


Source: (StackOverflow)