jsoup
jsoup: Java HTML Parser, with best of DOM, CSS, and jquery
jsoup Java HTML Parser, with best of DOM, CSS, and jquery open source java html parser, with dom, css, and jquery-like methods for easy data extraction.
I have the following code:
public class NewClass {
public String noTags(String str){
return Jsoup.parse(str).text();
}
public static void main(String args[]) {
String strings="<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
"<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a rel='nofollow' href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";
NewClass text = new NewClass();
System.out.println((text.noTags(strings)));
}
And I have the result:
hello world yo googlez
But I want to break the line:
hello world
yo googlez
I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.
If there's a <br>
in the markup I parse, how can I get a line break in my resulting output?
Source: (StackOverflow)
I get a SocketTimeoutException when I try to parse a lot of HTML documents using Jsoup.
For example, I got a list of links :
<a rel='nofollow' href="www.domain.com/url1.html">link1</a>
<a rel='nofollow' href="www.domain.com/url2.html">link2</a>
<a rel='nofollow' href="www.domain.com/url3.html">link3</a>
<a rel='nofollow' href="www.domain.com/url4.html">link4</a>
For each link, I parse the document linked to the URL (from the href attribute) to get other pieces of information in those pages.
So I can imagine that it takes lot of time, but how to shut off this exception?
Here is the whole stack trace:
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTPHeader(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.HttpURLConnection.getResponseCode(Unknown Source)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:381)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:364)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:143)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:132)
at app.ForumCrawler.crawl(ForumCrawler.java:50)
at Main.main(Main.java:15)
Thank you buddies!
sp00m
EDIT:
Hum... Sorry, just found the solution:
Jsoup.connect(url).timeout(0).get();
Hope that could be useful for someone else... :)
Source: (StackOverflow)
Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java).
For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the appropriate info I need off of that page? Like the title, price, description?
What would this process even be called? I have no idea were to even begin researching this.
Edit:
Okay, I'm running a test for the JSoup(the one posted by BalusC), but I keep getting this error:
Exception in thread "main" java.lang.NoSuchMethodError: java.util.LinkedList.peekFirst()Ljava/lang/Object;
at org.jsoup.parser.TokenQueue.consumeWord(TokenQueue.java:209)
at org.jsoup.parser.Parser.parseStartTag(Parser.java:117)
at org.jsoup.parser.Parser.parse(Parser.java:76)
at org.jsoup.parser.Parser.parse(Parser.java:51)
at org.jsoup.Jsoup.parse(Jsoup.java:28)
at org.jsoup.Jsoup.parse(Jsoup.java:56)
at test.main(test.java:12)
I do have Apache Commons
Source: (StackOverflow)
<td width="10"></td>
<td width="65"><img src="/images/sparks/NIFTY.png" /></td>
<td width="65">5,390.85</td>
<td width="65">5,428.15</td>
<td width="65">5,376.15</td>
<td width="65">5,413.85</td>
This is the HTML source from which i have to extract the values 5390.85,5428.15 , 5376.15 , 5413.85.
I wanted to do this using jsoup. But i am relatively new to jsoup( today i started using it). So how should i do this?
URL url = new URL("http://www.nseindia.com/content/equities/niftysparks.htm");
Document doc = Jsoup.parse(url,3*1000);
String text = doc.body().text();
I have already extracted the content of the website using jsoup.
but how to extract the values i require?
Thanks in advance
Source: (StackOverflow)
I have a process in Talend which gets the search result of a page, saves the html and writes it into files, as seen here:
Initially I had a two step process with parsing out the date from the HTML files in Java. Here is the code: It works and writes it to a mysql database. Here is the code which basically does exactly that. (I'm a beginner, sorry for the lack of elegance)
package org.jsoup.examples;
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;
import java.io.IOException;
public class parse2 {
static parse2 parseIt2 = new parse2();
String companyName = "Platzhalter";
String jobTitle = "Platzhalter";
String location = "Platzhalter";
String timeAdded = "Platzhalter";
public static void main(String[] args) throws IOException {
parseIt2.getData();
}
//
public void getData() throws IOException {
Document document = Jsoup.parse(new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
Elements elements = document.select(".joblisting");
for (Element element : elements) {
// Parse Data into Elements
Elements jobTitleElement = element.select(".job_title span");
Elements companyNameElement = element.select(".company_name span[itemprop=name]");
Elements locationElement = element.select(".locality span[itemprop=addressLocality]");
Elements dateElement = element.select(".job_date_added [datetime]");
// Strip Data from unnecessary tags
String companyName = companyNameElement.text();
String jobTitle = jobTitleElement.text();
String location = locationElement.text();
String timeAdded = dateElement.attr("datetime");
System.out.println("Firma:\t"+ companyName + "\t" + jobTitle + "\t in:\t" + location + " \t Erstellt am \t" + timeAdded );
}
}
}
Now I want to do the process End-to-End in Talend, and I got assured this works.
I tried this (which looks quite shady to me):
Basically I put all imports in "advanced settings" and the code in the "basic settings" section. This importLibrary is thought to load the jsoup parsing library, as well as the mysql connect (i might to the connect with talend tools though).
Obviously this isn't working. I tried to strip the Base Code from classes and stuff and it was even worse. Can you help me how to get the generated .txt files parsed with Java here?
EDIT: Here is the Link to the talend Job http://www.share-online.biz/dl/8M5MD99NR1
EDIT2: I changed the code to the one I tried in JavaFlex. But it didn't work (the import part in the start part of the code, the rest in "body/main" and nothing in "end".
Source: (StackOverflow)
I already know where the image is, but for simplicity's sake I wanted to download the image using JSoup itself. (This is to simplify getting cookies, referrer, etc.)
This is what I have so far:
//Open a URL Stream
Response resultImageResponse = Jsoup.connect(imageLocation).cookies(cookies).ignoreContentType(true).execute();
// output here
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(new java.io.File(outputFolder + name));
//BufferedWriter out = new BufferedWriter(new FileWriter(outputFolder + name));
out.write(resultImageResponse.body()); // resultImageResponse.body() is where the image's contents are.
out.close();
Source: (StackOverflow)
I am trying to select, using Jsoup, a <div>
that has multiple classes:
<div class="content-text right-align bold-font">...</div>
The syntax for doing so, to the best of my understanding, should be:
document.select("div.content-text.right-align.bold-font");
However, for some reason, this doesn't work for me.
When I try the same exact syntax on JSFIDDLE, it works without a hitch.
Does multi-class selection work in Jsoup?
(I'd rather find out that this is a bug in my code than find out that this is a Jsoup limitation :)
UPDATE (thanks to the answer below): Jsoup works perfectly with the aforementioned syntax.
Source: (StackOverflow)
I'm cleaning some text from unwanted HTML tags (such as <script>
) by using
String clean = Jsoup.clean(someInput, Whitelist.basicWithImages());
The problem is that it replaces for instance å
with å
(which causes troubles for me since it's not "pure xml").
For example
Jsoup.clean("hello å <script></script> world", Whitelist.basicWithImages())
yields
"hello å world"
but I would like
"hello å world"
Is there a simple way to achieve this? (I.e. simpler than converting å
back to å
in the result.)
Source: (StackOverflow)
Let's say i have a html fragment like this:
<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>
What i want to extract from that is:
foo bar foobar baz
So my question is: how can i strip all the wrapping tags from a html and get only the text in the same order as it is in the html?
As you can see in the title, i want to use jsoup for the parsing.
Example for accented html (note the 'á' character):
<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>
<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>
What i want:
Tarthatatlan biztonsági viszonyok
Tarthatatlan biztonsági viszonyok
This html is not static, generally i just want every text of a generic html fragment in decoded human readable form, width line breaks.
Source: (StackOverflow)
I'm trying to parse the frontpage of facebook with JSoup
but I always get the HTML Code for mobile devices and not the version for normal browsers(In my case Firefox 5.0).
I'm setting my User Agent like this:
doc = Jsoup.connect(url)
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0")
.get();
Am I doing something wrong?
EDIT:
I just parsed http://whatsmyuseragent.com/ and it looks like the user Agent is working. Now its even more confusing for me why the site http://www.facebook.com/ returns a different version when using JSoup and my browser. Both are using the same useragent....
I noticed this behaviour on some other sites too now. If you could explain to me what the Issue is I would be more than happy.
Source: (StackOverflow)
I am trying to parse XML with jsoup, but I can't find any examples on this task.
My XML document looks like this:
<?xml version="1.0" encoding="UTF-8">
<tests>
<test>
<id>xxx</id>
<status>xxx</status>
</test>
<test>
<id>xxx</id>
<status>xxx</status>
</test>
....
</tests>
</xml>
It should be quite straightforward, but my attempt has failed.
Code:
Element content = doc.getElementById("content");
Elements tests = content.getElementsByTag("tests");
for (Element testElement : tests) {
System.out.println(testElement.getElementsByTag("test"));
}
Source: (StackOverflow)
Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content because many pages don't have an article, but only links with some short description to the entire texts (this is common in news portals) and I don't want to discard these shorts text.
So if an API does this, get the different textual parts/the blocks splitting each one in some manner that differ from a single text (all in only one text is not useful), please report.
The Question
I download some pages from random sites, and now I want to analyze the textual content of the page.
The problem is that a web page have a lot of content like menus, publicity, banners, etc.
I want to try to exclude all that is not related with the content of the page.
Taking this page as example, I don't want the menus above neither the links in the footer.
Important: All pages are HTML and are pages from various differents sites. I need suggestion of how to exclude these contents.
At moment, I think in excluding content inside "menu" and "banner" classes from the HTML and consecutive words that looks like a proper name (first capital letter).
The solutions can be based in the the text content(without HTML tags) or in the HTML content (with the HTML tags)
Edit: I want to do this inside my Java code, not an external application (if this can be possible).
I tried a way parsing the HTML content described in this question : http://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Source: (StackOverflow)
I am trying to POST data into website to make a login into the site using Jsoup , but its not working ?
I am trying the code
Document docs = Jsoup.connect("http://some.com/login")
.data("cmd", "login","username", "xxxx","password", "yyyyy")
.referrer("http://some.com/login/").post();
here it is giving normal page of login in pagesource
i have also tried the code
Document docs = (Document) Jsoup.connect("http://some.com/login")
.data("cmd", "login","username", "xxxx","password", "yyyyy")
.referrer("http://some.com/login/").method(Method.POST).execute().parse();
here also it is giving normal page of login again in pagesource.
Any suggestions regarding the same would be highly appreciated !!
Thanks....
Source: (StackOverflow)