
crawler4j

Open Source Web Crawler for Java

Crawler4j missing outgoing links?

I'm trying to crawl the Apache mailing list archives with Crawler4j to get all the archived messages. I provided a seed URL and am trying to extract links to the other messages, but Crawler4j does not seem to be extracting all of the links.

Following is the HTML of my seed page (http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3CCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3%3DAo_J8Linhpnc%2B6y7tOcxg%40mail.gmail.com%3E):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Re: some healthy broker disappear from zookeeper</title>
  <link rel="stylesheet" type="text/css" rel='nofollow' href="/archives/style.css" />
 </head>

 <body id="archives">
  <h1>kafka-users mailing list archives</h1>

  <h5>
<a rel='nofollow' href="http://mail-archives.apache.org/mod_mbox/" title="Back to the archives depot">Site index</a> &middot; <a rel='nofollow' href="/mod_mbox/kafka-users" title="Back to the list index">List index</a></h5>  <table class="static" id="msgview">
   <thead>
    <tr>
    <th class="title">Message view</th>
    <th class="nav"><a rel='nofollow' href="/mod_mbox/kafka-users/201211.mbox/%3cCAFbh0Q1uJGxQiH15a7xS+pCwq+Jft9yKhb66t_C78UrMX338mQ@mail.gmail.com%3e" title="Previous by date">&laquo;</a> <a rel='nofollow' href="/mod_mbox/kafka-users/201211.mbox/date" title="View messages sorted by date">Date</a> <a rel='nofollow' href="/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczTexvtqSK+nmauvj37vhTF31awzeegpWdk6eZ-+LaGTVw@mail.gmail.com%3e" title="Next by date">&raquo;</a> &middot; <a rel='nofollow' href="/mod_mbox/kafka-users/201211.mbox/%3cCAPkvrnkTEtfFhYnCMj=xMs58pFU1sy-9sJuJ6e19mGVVipRg0A@mail.gmail.com%3e" title="Previous by thread">&laquo;</a> <a rel='nofollow' href="/mod_mbox/kafka-users/201211.mbox/thread" title="View messages sorted by thread">Thread</a> <a rel='nofollow' href="/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczS5JOCA+QpgLU+tXeG=Ke_MXxiG_PinMt0YDxGBtz5nPg@mail.gmail.com%3e" title="Next by thread">&raquo;</a></th>
   </tr>
   </thead>

   <tfoot>
    <tr>
    <th class="title"><a rel='nofollow' href="#archives">Top</a></th>
    <th class="nav"><a rel='nofollow' href="/mod_mbox/kafka-users/201211.mbox/%3cCAFbh0Q1uJGxQiH15a7xS+pCwq+Jft9yKhb66t_C78UrMX338mQ@mail.gmail.com%3e" title="Previous by date">&laquo;</a> <a rel='nofollow' href="/mod_mbox/kafka-users/201211.mbox/date" title="View messages sorted by date">Date</a> <a rel='nofollow' href="/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczTexvtqSK+nmauvj37vhTF31awzeegpWdk6eZ-+LaGTVw@mail.gmail.com%3e" title="Next by date">&raquo;</a> &middot; <a rel='nofollow' href="/mod_mbox/kafka-users/201211.mbox/%3cCAPkvrnkTEtfFhYnCMj=xMs58pFU1sy-9sJuJ6e19mGVVipRg0A@mail.gmail.com%3e" title="Previous by thread">&laquo;</a> <a rel='nofollow' href="/mod_mbox/kafka-users/201211.mbox/thread" title="View messages sorted by thread">Thread</a> <a rel='nofollow' href="/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczS5JOCA+QpgLU+tXeG=Ke_MXxiG_PinMt0YDxGBtz5nPg@mail.gmail.com%3e" title="Next by thread">&raquo;</a></th>
   </tr>
   </tfoot>

   <tbody>
   <tr class="from">
    <td class="left">From</td>
    <td class="right">Neha Narkhede &lt;neha.narkh...@gmail.com&gt;</td>
   </tr>
   <tr class="subject">
    <td class="left">Subject</td>
    <td class="right">Re: some healthy broker disappear from zookeeper</td>
   </tr>
   <tr class="date">
    <td class="left">Date</td>
    <td class="right">Tue, 20 Nov 2012 19:01:56 GMT</td>
   </tr>
   <tr class="contents"><td colspan="2"><pre>
zookeeper server version is 3.3.3 is pretty buggy and has known
session expiration and unexpected ephemeral node deletion bugs.
Please upgrade to 3.3.4 and retry.

Thanks,
Neha

On Tue, Nov 20, 2012 at 10:42 AM, Xiaoyu Wang &lt;xwang@rocketfuel.com&gt; wrote:
&gt; Hello everybody,
&gt;
&gt; We have run into this problem a few times in the past week. The symptom is
&gt; some broker disappear from zookeeper. The broker appears to be healthy.
&gt; After that, producers start producing lots of ZK producer cache stale log
&gt; and stop making any progress.
&gt;  "logger.info("Try #" + numRetries + " ZK producer cache is stale.
&gt; Refreshing it by reading from ZK again")"
&gt;
&gt; We are running kafka 0.7.1 and the zookeeper server version is 3.3.3.
&gt;
&gt; The missing broker will show up in zookeeper after we restart it. My
&gt; question is
&gt;
&gt;    1. Did anyone encounter the same problem? how did you fix it?
&gt;    2. Why producer is not making any progress? Can we make the producer
&gt;    work with those brokers that are listed in zookeeper.
&gt;
&gt;
&gt; Thanks,
&gt;
&gt; -Xiaoyu

</pre></td></tr>
   <tr class="mime">
    <td class="left">Mime</td>
    <td class="right">
<ul>
<li><a rel="nofollow" rel='nofollow' href="/mod_mbox/kafka-users/201211.mbox/raw/%3cCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3=Ao_J8Linhpnc+6y7tOcxg@mail.gmail.com%3e/">Unnamed text/plain</a> (inline, None, 1037 bytes)</li>
</ul>
</td>
</tr>
   <tr class="raw">
    <td class="left"></td>
    <td class="right"><a rel='nofollow' href="/mod_mbox/kafka-users/201211.mbox/raw/%3cCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3=Ao_J8Linhpnc+6y7tOcxg@mail.gmail.com%3e" rel="nofollow">View raw message</a></td>
   </tr>
   </tbody>
  </table>
 </body>
</html>

These are the outgoing URLs as identified by Crawler4j.

http://mail-archives.apache.org/archives/style.css
http://mail-archives.apache.org/mod_mbox/
http://mail-archives.apache.org/mod_mbox/kafka-users
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/date
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3CCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3%3DAo_J8Linhpnc%2B6y7tOcxg%40mail.gmail.com%3E
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/date
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread

However, the URLs that I'm interested in are missing.

http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAFbh0Q1uJGxQiH15a7xS+pCwq+Jft9yKhb66t_C78UrMX338mQ@mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczTexvtqSK+nmauvj37vhTF31awzeegpWdk6eZ-+LaGTVw@mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAPkvrnkTEtfFhYnCMj=xMs58pFU1sy-9sJuJ6e19mGVVipRg0A@mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczS5JOCA+QpgLU+tXeG=Ke_MXxiG_PinMt0YDxGBtz5nPg@mail.gmail.com%3e

What am I doing wrong? How do I get Crawler4j to extract the URLs I need?
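
As a first diagnostic, it can help to print exactly what crawler4j's parser extracted from the page, independent of any shouldVisit filtering. The sketch below (class name is a placeholder) dumps every outgoing URL found on a visited page; if the mbox message links show up here but are never crawled, the filtering in shouldVisit is the likely culprit, and if they are missing here, the link extraction itself is dropping them.

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class LinkDumpCrawler extends WebCrawler {

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            // every link the parser extracted from this page, before any filtering
            for (WebURL link : htmlParseData.getOutgoingUrls()) {
                System.out.println("outgoing: " + link.getURL());
            }
        }
    }
}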


Source: (StackOverflow)

Parsing robots.txt using Java and identifying whether a URL is allowed

I am currently using Jsoup in an application to parse and analyze web pages, but I want to make sure that I adhere to the robots.txt rules and only visit pages that are allowed.

I am pretty sure that Jsoup is not made for this; it is all about scraping and parsing. So I plan to have a function/module that reads the robots.txt of the domain/site and identifies whether a URL I am about to visit is allowed.

I did some research and found the following. But I am not sure about these, so if anyone has worked on a project that involved robots.txt parsing, please share your thoughts and ideas (a minimal crawler-commons example is sketched after the links).

http://sourceforge.net/projects/jrobotx/

https://code.google.com/p/crawler-commons/

http://code.google.com/p/crowl/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12
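
For reference, a minimal sketch of the crawler-commons route (the second link above). The fetch code, host, and agent name are placeholders; the parser itself only needs the raw bytes of robots.txt, and the returned rules object answers per-URL questions.

import java.io.InputStream;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {

    public static void main(String[] args) throws Exception {
        String robotsUrl = "http://example.com/robots.txt";      // placeholder host
        byte[] content;
        try (InputStream in = new URL(robotsUrl).openStream()) {
            content = in.readAllBytes();                          // Java 9+; any HTTP client will do
        }

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(robotsUrl, content,
                "text/plain", "MyJsoupBot");                      // agent name is a placeholder

        System.out.println(rules.isAllowed("http://example.com/some/page.html"));
    }
}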


Source: (StackOverflow)


Grails: Pass value from controller to thread

In my project, the action of my Grails controller creates a new thread and calls a class from the src/groovy folder each time the action is executed. I need to pass a value from this action to the new thread being created. How can I achieve this?

Update: I am implementing crawler4j in my project.

My controller code is as follows (thanks in advance):

class ResourceController {
    def crawl(Integer max) {
        String crawlStorageFolder = ".crawler4j"
        String website = "www.google.com"
        // controller (a CrawlController) and numberOfCrawlers are set up earlier;
        // this call eventually invokes the BasicCrawler class located in src/groovy
        controller.startNonBlocking(BasicCrawler.class, numberOfCrawlers)
        Thread.sleep(30 * 1000)
    }
}

crawler4j starts new threads when it runs the BasicCrawler class.

The BasicCrawler class has a visit function. I need to pass the value of website from ResourceController to that visit function.
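
One low-tech way to get the value across is a static field on the crawler class, set in the controller action just before startNonBlocking() is called; this is only a sketch and assumes a single crawl runs at a time. (Some crawler4j versions also expose controller.setCustomData(...) paired with getMyController().getCustomData() inside the crawler, which avoids the static field.)

// src/groovy/BasicCrawler.groovy -- sketch, assuming one crawl at a time
import edu.uci.ics.crawler4j.crawler.Page
import edu.uci.ics.crawler4j.crawler.WebCrawler

class BasicCrawler extends WebCrawler {

    // set from ResourceController before the crawl starts
    static String website

    @Override
    void visit(Page page) {
        println "Visited ${page.getWebURL().getURL()} (crawl started for ${website})"
    }
}

In the controller action you would then assign BasicCrawler.website = website on the line before controller.startNonBlocking(...).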


Source: (StackOverflow)

Set values from src/groovy classes to domain class properties

I'm working with crawler4j using Groovy and Grails.

I have a BasicCrawler.groovy class in src/groovy, a domain class Crawler.groovy, and a controller called CrawlerController.groovy.

I have a few properties in the BasicCrawler.groovy class, such as url, parentUrl, domain, etc.

I want to persist these values to the database by passing them to the domain class while the crawl is happening.

I tried doing this in my BasicCrawler class under src/groovy:

class BasicCrawler extends WebCrawler {
   Crawler obj = new Crawler()
   //crawling code 
   @Override
   void visit(Page page) {
      //crawling code
      obj.url = page.getWebURL().getURL()
      obj.parentUrl = page.getWebURL().getParentUrl()
   }

   @Override
   protected void handlePageStatusCode(WebURL webUrl, int statusCode, String   statusDescription) {
      //crawling code
      obj.httpstatus = "not found"
   }
}

And my domain class is as follows:

class Crawler extends BasicCrawler {
   String url
   String parentUrl
   String httpstatus
   static constraints = {}
}

But I got the following error:

ERROR crawler.WebCrawler  - Exception while running the visit method. Message: 'No such property: url for class: mypackage.BasicCrawler
Possible solutions: obj' at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.unwrap(ScriptBytecodeAdapter.java:50)

After this, I tried another approach. In my src/groovy/BasicCrawler.groovy class, I declared the url and parentUrl properties at the top and then used data binding (I might be wrong, since I am just a beginner):

class BasicCrawler extends WebCrawler {
   String url
   String parentUrl

   @Override
   boolean shouldVisit(WebURL url) { //code
   }

   @Override
   void visit(Page page) { //code
   }

   @Override
   protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
      //code
   }
   def bindingMap = [url: url , parentUrl: parentUrl]
   def Crawler = new Crawler(bindingMap)
}

And my Crawler.groovy domain class is as follows:

class Crawler {
   String url
   String parentUrl
   static constraints = {}
}

Now it doesn't show any error, but the values are not being persisted in the database. I am using MongoDB for the backend.
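
Two things stand out: the domain class should not extend BasicCrawler, and nothing ever calls save(). A sketch of the visit() side, assuming Grails 2.x with the mongodb GORM plugin and a plain Crawler domain class holding url and parentUrl; withTransaction (or withNewSession) is needed because crawler4j's worker threads run outside a Grails request and have no session bound.

// src/groovy/BasicCrawler.groovy -- sketch
import edu.uci.ics.crawler4j.crawler.Page
import edu.uci.ics.crawler4j.crawler.WebCrawler

class BasicCrawler extends WebCrawler {

    @Override
    void visit(Page page) {
        String url = page.getWebURL().getURL()
        String parentUrl = page.getWebURL().getParentUrl()

        // bind a GORM session to this crawler thread and persist the record
        Crawler.withTransaction {
            new Crawler(url: url, parentUrl: parentUrl).save(flush: true, failOnError: true)
        }
    }
}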


Source: (StackOverflow)

Kill threads created by an object

I have created a custom crawler using crawler4j. In my app I create a lot of controllers, and after a while the number of threads in the system hits the maximum value and the JVM throws an exception. Even though I call shutdown() on the controller, set it to null, and call System.gc(), the threads in my app remain open and the app crashes.

I used the jvisualvm.exe (Java VisualVM) and saw that at one point my app hits 931 threads.

Is there a way I can immediately kill all the threads created by the CrawlController object of the crawler4j project? (or any other object for that matter)
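
There is no supported way to hard-kill crawler4j's worker threads from the outside; the usual approach is a cooperative shutdown plus reusing a single controller rather than creating many. A sketch of the sequence (controller construction omitted):

// ask the crawler threads to stop, then block until they have actually exited
controller.shutdown();           // signals the crawlers and the monitor thread to finish
controller.waitUntilFinish();    // joins those threads before the controller is discarded

// prefer one long-lived CrawlController with a fixed crawler count, e.g.
// controller.start(MyCrawler.class, numberOfCrawlers), over many short-lived controllers:
// each controller brings its own thread pool and BerkeleyDB environment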


Source: (StackOverflow)

Crawler4j with Grails App throws error

This might be a very basic and silly question for experienced people, but please help. I am trying to use Crawler4j within my Grails app by following this tutorial. I know it is Java code, but I am using it in a controller class called CrawlerController.groovy.

I added the jar files, but when I write CrawlConfig crawlConfig = new CrawlConfig() it throws a compiler error saying "Groovy unable to resolve class". I refreshed dependencies and tried everything. Maybe I am missing something, since I am a beginner. This is what I have written so far; all the import statements and the CrawlConfig line throw errors:

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
class CrawlerController extends WebCrawler {
    public static void main(String[] args) {
    CrawlConfig crawlConfig = new CrawlConfig()

    }
}

Please help. Thanks.
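
The "unable to resolve class" message usually just means the crawler4j classes are not on the compile classpath, or the import is missing. A sketch of the dependency side, assuming Grails 2.x and the crawler4j artifact from Maven Central (the version shown is an assumption; use the one matching your jars):

// grails-app/conf/BuildConfig.groovy
repositories {
    mavenCentral()
}
dependencies {
    compile 'edu.uci.ics:crawler4j:3.5'   // version is an assumption
}

With the dependency resolved, CrawlConfig still needs its own import in the controller: import edu.uci.ics.crawler4j.crawler.CrawlConfig.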


Source: (StackOverflow)

Crawl urls with a certain prefix

I would like crawler4j to crawl only URLs that have a certain prefix.

So, for example, if a URL starts with http://url1.com/timer/image it is valid, e.g. http://url1.com/timer/image/text.php.

This URL is not valid: http://test1.com/timer/image

I tried to implement it like that:

public boolean shouldVisit(Page page, WebURL url) {
    String href = url.getURL().toLowerCase();
    String adrs1 = "http://url1.com/timer/image";
    String adrs2 = "http://url2.com/house/image";

    if (!(href.startsWith(adrs1)) || !(href.startsWith(adrs2))) {
        return false;
    }

    if (filters.matcher(href).matches()) {
        return false;
    }

    for (String crawlDomain : myCrawlDomains) {
        if (href.startsWith(crawlDomain)) {
            return true;
        }
    }

    return false;
}

However, this does not seem to work, because the crawler also visits other URLs.

Any recommendation on what I could do?

I appreciate your answer!
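
For what it's worth, the negated-or condition is a likely culprit: no URL can start with both prefixes, so !(href.startsWith(adrs1)) || !(href.startsWith(adrs2)) is true for every URL and the method returns false for everything. A corrected sketch that whitelists the two prefixes directly (signature as in recent crawler4j versions):

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    // visit a URL only if it starts with one of the allowed prefixes
    return href.startsWith("http://url1.com/timer/image")
            || href.startsWith("http://url2.com/house/image");
}

If other pages are still being visited with the original code in place, double-check that this class is the one actually passed to controller.start().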


Source: (StackOverflow)

Improving crawler4j crawler efficiency and scalability

I am using the Crawler4j crawler to crawl some domains. Now I want to improve the efficiency of the crawler: I want it to use my full bandwidth and crawl as many URLs as possible in a given time period. For that I am using the following settings (sketched in code after this list):

  • I have increased the number of crawler threads to 10 (using ContentCrawler('classfilename', 10))
  • I have reduced the politeness delay to 50 ms (using crawlConfig.setPolitenessDelay(50))
  • I have set the depth of crawling to 2 (using crawlConfig.setMaxDepthOfCrawling(2))
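
For reference, a sketch of how those settings, plus disabling the robots.txt check asked about in question 4 below, are wired up in plain crawler4j; the class, folder, and seed names are placeholders.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlSetup {

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");   // placeholder folder
        config.setPolitenessDelay(50);                // ms between requests to the same host
        config.setMaxDepthOfCrawling(2);

        PageFetcher pageFetcher = new PageFetcher(config);

        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setEnabled(false);            // skip robots.txt checks entirely

        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://example.com/");    // placeholder seed
        controller.start(MyCrawler.class, 10);        // MyCrawler is your WebCrawler subclass
    }
}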

Now what I want to know is:

1) Are there any side effects of these settings?

2) Is there anything else I have to do to improve the crawler's speed?

3) Can someone tell me the maximum limits of each setting (e.g., the maximum number of threads supported by crawler4j at a time)? I have already gone through the crawler4j code but did not find any limits anywhere.

4) How can I crawl a domain without checking its robots.txt file? I understand that crawler4j first checks a domain's robots.txt file before crawling, and I don't want that.

5) How does the page fetcher work? (Please explain briefly.)

Any help is appreciated, and please go easy on me if the question is stupid.


Source: (StackOverflow)

Get mp3/pdf files using JSoup in Groovy

I am developing an application for crawling the web using crawler4j and Jsoup. I need to parse a web page using Jsoup and check whether it has zip files, pdf/doc files, or mp3/mov files available as resources for download.

For zip files I did the following, and it works:

Elements zip = doc.select("a[href\$=.zip]")
println "No of zip files is " + zip.size()

This code correctly tells me how many zip files there are on a page. I am not sure how to count all the audio or document files using Jsoup. Any help is appreciated. Thanks.
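
Jsoup's attribute-suffix selector [href$=.ext], combined with comma-separated selector groups, covers the other file types the same way; a small sketch (extend the extension lists as needed):

Elements audio = doc.select("a[href\$=.mp3], a[href\$=.mov]")
Elements docs  = doc.select("a[href\$=.pdf], a[href\$=.doc], a[href\$=.docx]")
println "No of audio files is " + audio.size()
println "No of document files is " + docs.size()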


Source: (StackOverflow)

Crawler4j with mongoDB

I was researching crawler4j and found that it uses BerkeleyDB as its database. I am developing a Grails app using MongoDB and was wondering how flexible crawler4j will be to work within my application. I basically want to store the crawled information in the MongoDB database. Is it possible to configure crawler4j so that it uses MongoDB as the default datastore rather than BerkeleyDB? Any suggestions would be helpful. Thanks.
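
crawler4j only uses BerkeleyDB internally for its crawl frontier (the queue of URLs still to be fetched), and that part is not pluggable; whatever you extract in visit() can be written wherever you like, including MongoDB. A rough sketch with the plain MongoDB Java driver (2.x API assumed; in a Grails app the mongodb GORM plugin and a domain class would be the more idiomatic route):

import com.mongodb.BasicDBObject
import com.mongodb.DBCollection
import com.mongodb.MongoClient
import edu.uci.ics.crawler4j.crawler.Page
import edu.uci.ics.crawler4j.crawler.WebCrawler
import edu.uci.ics.crawler4j.parser.HtmlParseData

class MongoCrawler extends WebCrawler {

    // one client per JVM is enough; host, port, db and collection names are placeholders
    static final DBCollection pages =
            new MongoClient('localhost', 27017).getDB('crawldb').getCollection('pages')

    @Override
    void visit(Page page) {
        def doc = new BasicDBObject('url', page.getWebURL().getURL())
        if (page.getParseData() instanceof HtmlParseData) {
            doc.append('title', ((HtmlParseData) page.getParseData()).getTitle())
        }
        pages.insert(doc)
    }
}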


Source: (StackOverflow)

crawler4j always returns fatal transport error

This is what I get for any seed I add to crawler4j.

ERROR [Crawler 1] Fatal transport error: Connection to http://example.com refused while fetching http://example.com/page.html (link found in doc #0)

This is really weird for me. I don't know what causes it.


Source: (StackOverflow)

Web crawling (Ajax/JavaScript-enabled pages) using Java

I am very new to web crawling. I am using crawler4j to crawl websites and collect the required information from them. My problem is that I was unable to crawl the content of the following site: http://www.sciencedirect.com/science/article/pii/S1568494612005741. I want to crawl the following information from that site (please take a look at the screenshot below).

[Screenshot: the article page with three author names highlighted in red boxes]

If you look at the screenshot, it shows three names (highlighted in red boxes). If you click one of those links, you will see a popup containing the whole information about that author. I want to crawl the information in that popup.

I am using the following code to crawl the content.

public class WebContentDownloader {

private Parser parser;
private PageFetcher pageFetcher;

public WebContentDownloader() {
    CrawlConfig config = new CrawlConfig();
    parser = new Parser(config);
    pageFetcher = new PageFetcher(config);
}

private Page download(String url) {
    WebURL curURL = new WebURL();
    curURL.setURL(url);
    PageFetchResult fetchResult = null;
    try {
        fetchResult = pageFetcher.fetchHeader(curURL);
        if (fetchResult.getStatusCode() == HttpStatus.SC_OK) {
            try {
                Page page = new Page(curURL);
                fetchResult.fetchContent(page);
                if (parser.parse(page, curURL.getURL())) {
                    return page;
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    } finally {
        if (fetchResult != null) {
            fetchResult.discardContentIfNotConsumed();
        }
    }
    return null;
}

private String processUrl(String url) {
    System.out.println("Processing: " + url);
    Page page = download(url);
    if (page != null) {
        ParseData parseData = page.getParseData();
        if (parseData != null) {
            if (parseData instanceof HtmlParseData) {
                HtmlParseData htmlParseData = (HtmlParseData) parseData;
                return htmlParseData.getHtml();
            }
        } else {
            System.out.println("Couldn't parse the content of the page.");
        }
    } else {
        System.out.println("Couldn't fetch the content of the page.");
    }
    return null;
}

public String getHtmlContent(String argUrl) {
    return this.processUrl(argUrl);
}
}

I was able to crawl the content from the aforementioned link/site, but it doesn't include the information I marked in the red boxes. I think those are dynamic links.

  • My question is: how can I crawl the content from the aforementioned link/website?
  • How can I crawl content from Ajax/JavaScript-based websites?
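
For context, crawler4j's PageFetcher retrieves the raw HTML over HTTP and never executes JavaScript, so content a page builds client-side (such as those author popups) is simply not in the fetched document. One common workaround, separate from crawler4j, is to render the page in a headless browser first and then hand the rendered HTML to your existing parsing code; a rough sketch with HtmlUnit, assuming a recent 2.x release:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class RenderedDownloader {

    public static String getRenderedHtml(String url) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage(url);
            // give background Ajax requests a chance to finish before reading the DOM
            webClient.waitForBackgroundJavaScript(10000);
            return page.asXml();
        }
    }
}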

Can anyone please help me with this?

Thanks & Regards, Amar


Source: (StackOverflow)

How to crawl my site to detect 404/500 errors?

Is there any fast (maybe multi-threaded) way to crawl my site (following all local links) to look for 404/500 errors (i.e., to ensure a 200 response)?

I also want to be able to set it to visit only one of each type of link. So if I have 1000 category pages, it only visits one.

Is http://code.google.com/p/crawler4j/ a good option?

I'd like something that is super easy to set up, and I'd prefer PHP over Java (though if Java is significantly faster, that would be ok).
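
If Java via crawler4j turns out to be acceptable, very little code is needed, because the WebCrawler callback handlePageStatusCode is invoked with the HTTP status of every fetched URL. A sketch (the hostname is a placeholder, and the "one per link type" rule would still need custom URL normalization in shouldVisit):

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class StatusCheckCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // stay on the site being checked
        return url.getURL().startsWith("http://www.example.com/");
    }

    @Override
    protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
        // called for every fetched URL, including 4xx/5xx responses
        if (statusCode >= 400) {
            System.out.println(statusCode + " " + statusDescription + " -> " + webUrl.getURL());
        }
    }

    @Override
    public void visit(Page page) {
        // nothing to do for 200 responses in a pure link check
    }
}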


Source: (StackOverflow)

guide to set up crawler4j

I would like to set up the crawler to crawl a website, let's say a blog, fetch only the links on the website, and paste those links into a text file. Can you guide me step by step through setting up the crawler? I am using Eclipse.
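
As a starting point, a sketch of the visit() side that appends every link found on a crawled page to a text file; the surrounding CrawlController setup (storage folder, PageFetcher, RobotstxtServer, addSeed, start) is the standard crawler4j boilerplate, and the output file name is a placeholder. With several crawler threads you may want to synchronize the write or use one file per thread.

import java.io.FileWriter;
import java.io.IOException;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class LinkCollector extends WebCrawler {

    @Override
    public void visit(Page page) {
        if (!(page.getParseData() instanceof HtmlParseData)) {
            return;
        }
        HtmlParseData html = (HtmlParseData) page.getParseData();
        // append the links found on this page to a plain text file
        try (FileWriter out = new FileWriter("links.txt", true)) {
            for (WebURL link : html.getOutgoingUrls()) {
                out.write(link.getURL() + System.lineSeparator());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}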


Source: (StackOverflow)

How can I get crawler4j to download all links from a page more quickly?

What I do is:
- crawl the page
- fetch all links on the page and put them in a list
- start a new crawler, which visits each link in the list
- download them

There must be a quicker way, where I can download the links directly when I visit the page? Thanks!
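
If the goal is simply to save whatever the crawler fetches, a second crawl pass is unnecessary: enable binary content in the CrawlConfig (config.setIncludeBinaryContentInCrawling(true)) so non-HTML responses are fetched at all, and write page.getContentData() to disk inside visit(). A sketch with a placeholder target directory and file-naming scheme:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;

public class DownloadingCrawler extends WebCrawler {

    @Override
    public void visit(Page page) {
        byte[] body = page.getContentData();   // raw response body as fetched by crawler4j
        // derive a file name from the URL (placeholder scheme)
        String name = page.getWebURL().getURL().replaceAll("[^a-zA-Z0-9._-]", "_");
        try {
            Path dir = Paths.get("downloads");
            Files.createDirectories(dir);
            Files.write(dir.resolve(name), body);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}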


Source: (StackOverflow)