goose
Html Content / Article Extractor in Scala - open sourced from Gravity Labs
Can anybody help?
I am using Goose to extract text from HTML. I imported it into my project, but when I try to run it I get this exception:
07-18 09:44:13.472: E/AndroidRuntime(2565): Caused by:
java.lang.NoClassDefFoundError: com.gravity.goose.Goose
I think it might be an interoperability problem between Java and Scala.
My code looks like this:
String url = mNewsItem.getURL();
Goose goose = new Goose(new Configuration());
Article article = goose.extractContent(url);
Thx
Source: (StackOverflow)
from newspaper import Article
import pdb
from unidecode import unidecode

def get_article_newspaper(url):
    # newspaper takes the language code through the `language` keyword
    article = Article(url, language='zh')  # Chinese
    article.download()
    article.parse()  # article.text is blank!
    print unidecode(article.text).replace('Image caption', '')

url = 'http://www.tyfzw.cn/?sw=774&b=177%20'
get_article_newspaper(url)
This seemed to be the most actively maintained library, so I tried it. I also tried goose and boilerpipe, but neither worked.
Later I also want to translate the text:
import goslate

def language_translate(text):  # translates text to English
    gs = goslate.Goslate()
    # detect the language of the text itself, not of the literal string 'text'
    language_id = gs.detect(text)
    if language_id != 'en':
        text = gs.translate(text, 'en')
    return text
Source: (StackOverflow)
The goose install places goose in the python-goose directory. When I try to import goose at the IDLE prompt I get:
>>> from goose import Goose
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    from goose import Goose
ImportError: No module named goose
Because goose is installed in the python-goose directory I believe the import syntax should be:
from python-goose.goose import Goose
however when I run this I get the following syntax error message:
>>> from python-goose.goose import Goose
SyntaxError: invalid syntax
Any suggestions on how to properly import goose would be appreciated.
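A note on the syntax error: a hyphen is not valid in a Python module name, so a statement starting with from python-goose can never parse. A minimal sketch of the usual workaround, assuming the goose package really sits inside a directory named python-goose (the relative path below is a hypothetical placeholder, and both helper names are made up for illustration), is to put that directory on sys.path and import goose by its own name:

```python
import os
import re
import sys

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def is_importable_name(name):
    """Return True if every dotted part of `name` is a valid Python identifier."""
    return all(_IDENTIFIER.match(part) for part in name.split("."))


def add_to_path(directory):
    """Put `directory` at the front of sys.path (once) so packages inside it can be found."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory


print(is_importable_name("python-goose.goose"))  # the hyphen makes this name unusable
print(is_importable_name("goose"))

# Hypothetical location of the checkout; adjust to where python-goose actually is.
add_to_path("python-goose")
# After this, `from goose import Goose` is looked up inside that directory.
```

If the package was installed with setup.py or pip into the interpreter that IDLE uses, the sys.path step should not be needed at all, so it is also worth checking that IDLE runs the same Python that goose was installed into.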
Source: (StackOverflow)
I'm trying to set up a small Android application which extracts content from a web page using the Goose library. Since the library is written in Scala, I'm using the .jar I found here. The problem is, when I try to extract content from a page, it returns nothing. I successfully create an Article object using the URL I need, but the values of the object (title, domain, topImage etc.) are all null. I tried using different URLs, to see if the problem was isolated to a single website, but it doesn't appear to be so.
The code I use to set up the Goose instance is this:
gooseDir = context.getCacheDir();
Configuration config = new Configuration();
config.setLocalStoragePath(gooseDir.getAbsolutePath());
Goose goose = new Goose(config);
And then I just create the Article instance like so:
Article article = goose.extractContent(url);
Any advice?
Source: (StackOverflow)
I have a small regex problem with text extracted by Goose.
I extracted the clean text from an HTML page using Goose. The output is fine apart from one small problem: I get the string below.
My name is Sam\'s, I like to play \'football\'
The actual text looks like
My name is Sam's, I like to play 'football'
I am trying to get rid of the backslashes. When I run the code below on the text extracted by Goose, it has no effect; however, if I type the same text in myself, the code works perfectly.
I tried the following:
re.sub(r"\\","",text) or
text.replace("\\","")
text.decode()
Please find the code below:
import re
from goose import Goose
url = 'http://economictimes.indiatimes.com/news/politics-and-nation/swach-bharat-drives-draws-inspiration-from-mahatma-gandhi/articleshow/49203355.cms'
g = Goose()
article = g.extract(url=url)
text=article.cleaned_text
print text
.....International School here on Friday, Gandhi\'s 146th birth anniversary.Gurjit Singh said that apart from Gandhi\'s birth anniversary,....
text=re.sub(r"\\","",text)
print text
.....International School here on Friday, Gandhi\'s 146th birth anniversary.Gurjit Singh said that apart from Gandhi\'s birth anniversary,....
How do I get rid of the backslashes?
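One possible explanation, offered as an assumption rather than a diagnosis: if the text really contained backslash characters, either of the calls above would remove them, so the backslashes may come only from how the string is displayed (for example its repr). A small sketch with a hypothetical helper, strip_backslashes, shows that replace does work on a string that genuinely contains them:

```python
def strip_backslashes(text):
    """Remove every literal backslash character from text."""
    return text.replace("\\", "")


# A string that really contains backslashes (written as a raw literal).
sample = r"My name is Sam\'s, I like to play \'football\'"
cleaned = strip_backslashes(sample)
print(cleaned)

# Strings are immutable: the result has to be assigned, as above.
# Calling text.replace(...) without using its return value changes nothing.
```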
Source: (StackOverflow)
I have been considering using Goose Extractor for boilerplate removal, but I am not sure that trusting it blindly is the right thing to do. Does anyone know the logic behind Goose's extraction? I know it is some sort of statistical method, but I have not found the whole extraction process described anywhere.
Any help will be highly appreciated.
Thank you!
Source: (StackOverflow)
I have a problem with goose-extractor.
This is my code:
for resultado in soup.find_all('a', rel='nofollow', href=True, text=re.compile(llave)):
    url = resultado['href']
    article = g.extract(url=url)
    print article.title
And this is the error I get:
RuntimeError: maximum recursion depth exceeded
Any suggestions?
Am I a lousy programmer, or are there hidden errors that are not visible in Python?
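The cause of the recursion error cannot be read off the snippet, but one way to keep the loop going is to catch the error per URL and move on. A minimal sketch, where extract stands for any callable that takes a URL (for example lambda u: g.extract(url=u)); safe_titles, FakeArticle and fake_extract are hypothetical names used only for illustration:

```python
def safe_titles(urls, extract):
    """Call extract(url) for each URL; collect titles and remember the URLs that failed."""
    titles, failed = [], []
    for url in urls:
        try:
            titles.append(extract(url).title)
        except RuntimeError:  # covers "maximum recursion depth exceeded"
            failed.append(url)
    return titles, failed


class FakeArticle(object):
    """Stand-in for an extracted article, so the sketch runs without goose."""
    title = "ok"


def fake_extract(url):
    """Stand-in extractor that fails on URLs containing 'bad'."""
    if "bad" in url:
        raise RuntimeError("maximum recursion depth exceeded")
    return FakeArticle()


result = safe_titles(["http://example.com/good", "http://example.com/bad"], fake_extract)
print(result)
```

The list of failed URLs then shows which pages trigger the error, which makes it easier to reproduce the problem on a single page.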
Source: (StackOverflow)
I am trying to use Goose to read from .html files (a URL is used here for convenience in the example) [1]. But at times it doesn't return any text. Please help me out with this issue.
Goose version used: https://github.com/agolo/python-goose/
The current version gives some errors.
from goose import Goose
from requests import get
response = get('http://www.highbeam.com/doc/1P3-979471971.html')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text
print text
Source: (StackOverflow)
After a long-anticipated wait, I decided to give Goose a try for extracting articles. However, I am getting many Unicode problems in the extracted text.
g = Goose()
article = g.extract(url='http://www.forbes.com/sites/benkepes/2014/08/19/more-openstack-certifications-because-interoperability-is-key/?partner=yahootix')
articlebody = article.cleaned_text[:1300]
ex_article = articlebody.encode('utf-8')
print ex_article
The result looks as follows:
Trove is the database as a service component of OpenStack that lets administrators and DevOps operate many instances of a variety of different database management systems (DBMS) technologies using common infrastructure. Â Â Trove assures common administrative tasks including provisioning,
So far I have tried using
1) .encode('utf-8')
2) .decode('utf-8')
3) from __future__ import unicode_literals at the start of the file
4) Reloading sys to include UTF-8
How can I get this cleansed, pure text article with no unicode problems?
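The stray "Â" characters look like non-breaking spaces (U+00A0) whose UTF-8 bytes were displayed as Latin-1; that reading is an assumption based on the output above. A minimal sketch, with a hypothetical helper name, that replaces non-breaking spaces with ordinary ones in the unicode text before it is encoded:

```python
# -*- coding: utf-8 -*-
NBSP = u"\u00a0"  # non-breaking space


def normalize_spaces(text):
    """Replace non-breaking spaces in a unicode string with plain spaces."""
    return text.replace(NBSP, u" ")


# UTF-8 encodes U+00A0 as the two bytes C2 A0; read as Latin-1, C2 is shown as "Â".
mojibake = NBSP.encode("utf-8").decode("latin-1")
print(repr(mojibake))

print(normalize_spaces(u"infrastructure.\u00a0\u00a0Trove assures"))
```

If that assumption holds, the encoded bytes are fine and only the console's encoding disagrees with them, so applying normalize_spaces to article.cleaned_text before calling encode would hide the symptom without changing the rest of the text.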
Source: (StackOverflow)
I'm trying to work with the Python-Goose extractor. I installed virtualenv and followed the setup instructions. When running from PyCharm everything works great.
But when running from the Windows Command Prompt I'm getting this error:
C:\Users\tal>C:\virtual_enviroments\goose_venv\Scripts\activate
(goose_venv) C:\Users\tal>cd C:\main\prototypes\collection\goose-cli\app
(goose_venv) C:\main\prototypes\collection\goose-cli\app>extract-new-events.py
Traceback (most recent call last):
  File "C:\main\prototypes\collection\goose-cli\app\extract-new-events.py", line 1, in <module>
    from goose import Goose
ImportError: No module named goose
What am I doing wrong here?
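A quick diagnostic sketch, not a confirmed fix: when a script is launched on Windows by its file name alone, the .py file association may start a different interpreter than the one in the activated virtualenv. Printing the interpreter path at the top of the script shows which Python is actually running. The helper name interpreter_info is hypothetical.

```python
import sys


def interpreter_info():
    """Return the running interpreter's path and whether it looks like a virtualenv."""
    # Older virtualenv sets sys.real_prefix; venv and newer virtualenv set sys.base_prefix.
    base = getattr(sys, "real_prefix", getattr(sys, "base_prefix", sys.prefix))
    return {"executable": sys.executable, "in_virtualenv": base != sys.prefix}


info = interpreter_info()
print(info["executable"])
print(info["in_virtualenv"])
```

Running python extract-new-events.py inside the activated environment, instead of the bare file name, makes the choice of interpreter explicit.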
Source: (StackOverflow)
Question
Why will PyInstaller not work with goose files? Is it an issue with the executable creator or my code?
Code
from goose.Goose import Goose
url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
g = Goose({'debug':False,'enableImageFetching': False,'localStoragePath':'./tmp'})
article = g.extractContent(url=url)
#article.title
print article.cleanedArticleText[:150].encode("utf8","ignore")
Error Log From Pyinstaller
My program, created with pyinstaller, fails to find goose files in this path:
IOError: Couldn't open file C:\Users\user\Desktop\dist\main.exe?118272\goose/resources/text/stopwords-en.txt
This happens:
Traceback (most recent call last):
  File "<string>", line 15, in <module>
  File "C:\Users\user\Desktop\build\pyi.win32\main\out00-PYZ.pyz\goose.Goose", line 52, in extractContent
  File "C:\Users\user\Desktop\build\pyi.win32\main\out00-PYZ.pyz\goose.Goose", line 59, in sendToActor
  File "C:\Users\user\Desktop\build\pyi.win32\main\out00-PYZ.pyz\goose.Crawler", line 86, in crawl
  File "C:\Users\user\Desktop\build\pyi.win32\main\out00-PYZ.pyz\goose.extractors", line 245, in calculateBestNodeBasedOnClustering
  File "C:\Users\user\Desktop\build\pyi.win32\main\out00-PYZ.pyz\goose.text", line 97, in __init__
  File "C:\Users\user\Desktop\build\pyi.win32\main\out00-PYZ.pyz\goose.utils", line 76, in loadResourceFile
IOError: Couldn't open file C:\Users\user\Desktop\dist\main.exe?118272\goose/resources/text/stopwords-en.txt
What's wrong?
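The traceback suggests that the stopwords data file was not bundled with the executable. A hedged sketch of how a frozen program usually locates bundled data: PyInstaller's one-file bootloader exposes its unpack directory as sys._MEIPASS. Whether goose's own loader can be pointed at such a path is not shown in the question, so resource_path is a hypothetical helper for your own code, not part of goose.

```python
import os
import sys


def resource_path(relative):
    """Resolve a data file both inside a PyInstaller bundle and when run from source."""
    # sys._MEIPASS exists only inside a PyInstaller bundle; otherwise use the current directory.
    base = getattr(sys, "_MEIPASS", os.path.abspath("."))
    return os.path.join(base, relative)


stopwords_file = resource_path(os.path.join("goose", "resources", "text", "stopwords-en.txt"))
print(stopwords_file)
```

The data files also have to be included at build time, for example through the datas list of the spec file; otherwise there is nothing at that path to open.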
Source: (StackOverflow)
I am trying to implement Goose-2.1.22 into one of my applications. However, when I try to run my app with the basic code they provided me I get this error:
02-16 11:19:55.048 29391-29391/test.package.test2 E/AndroidRuntime﹕ FATAL EXCEPTION: main
java.lang.NoClassDefFoundError: com.gravity.goose.Goose
    at test.package.test2.Searching_Animation_Screen.goose_it(Searching_Animation_Screen.java:65)
    at test.package.test2.Searching_Animation_Screen.onCreate(Searching_Animation_Screen.java:59)
    at android.app.Activity.performCreate(Activity.java:5372)
    at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1104)
    at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2267)
    at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2359)
    at android.app.ActivityThread.access$700(ActivityThread.java:165)
    at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1326)
Here is the code that uses Goose (the method is called from onCreate()):
String url = "http://www.cnn.com/2010/POLITICS/08/13/democrats.social.security/index.html";
Goose goose = new Goose(new Configuration());
Article article = goose.extractContent(url);
System.out.println(article.cleanedArticleText());
text.setText(article.cleanedArticleText().toString());
Any ideas how to fix my issue? Thanks everyone!
Source: (StackOverflow)
I followed the exact instructions from https://github.com/grangier/python-goose when installing goose, and after I typed in "mkvirtualenv --no-site-packages goose", this is what I got:
172-27-220-167:~ yitongwang$ mkvirtualenv --no-site-packages goose
New python executable in goose/bin/python
Installing setuptools, pip...done.
Error: deactivate must be sourced. Run 'source deactivate'
instead of 'deactivate'.
Usage: source deactivate
removes the 'bin' directory of the environment activated with 'source
activate' from PATH.
(goose)172-27-220-167:~ yitongwang$
I installed virtualenv and virtualenvwrapper using 'sudo pip install virtualenv/virtualenvwrapper', and the strangest thing is that I still seem to have entered the goose virtual environment. After cloning the git repo and changing into the cloned python-goose directory, I ran 'pip install -r requirements.txt' and 'python setup.py install', and these are the errors:
In file included from _imagingft.c:31:
/Users/yitongwang/anaconda/include/ft2build.h:56:10: fatal error: 'freetype/config/ftheader.h' file not found
#include <freetype/config/ftheader.h>
^
1 error generated.
Building using 4 processes
gcc -bundle -undefined dynamic_lookup -L/Users/yitongwang/anaconda/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.5-x86_64-2.7/_imagingft.o -L/Users/yitongwang/.virtualenvs/goose/lib -L/usr/local/lib -L/usr/local/Cellar/freetype/2.5.5/lib -L/usr/lib -L/Users/yitongwang/anaconda/lib -lfreetype -o build/lib.macosx-10.5-x86_64-2.7/PIL/_imagingft.so
clang: error: no such file or directory: 'build/temp.macosx-10.5-x86_64-2.7/_imagingft.o'
error: command 'gcc' failed with exit status 1
----------------------------------------
Command "/Users/yitongwang/.virtualenvs/goose/bin/python -c "import setuptools, tokenize;__file__='/private/var/folders/64/dhzf31k50zg22rbgbz79c3dw0000gn/T/pip-build-nL0d0r/Pillow/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/64/dhzf31k50zg22rbgbz79c3dw0000gn/T/pip-k7HUgC-record/install-record.txt --single-version-externally-managed --compile --install-headers /Users/yitongwang/.virtualenvs/goose/include/site/python2.7" failed with error code 1 in /private/var/folders/64/dhzf31k50zg22rbgbz79c3dw0000gn/T/pip-build-nL0d0r/Pillow
In file included from _imagingft.c:31:
/Users/yitongwang/anaconda/include/ft2build.h:56:10: fatal error:
'freetype/config/ftheader.h' file not found
#include <freetype/config/ftheader.h>
^
1 error generated.
clang: error: no such file or directory: 'build/temp.macosx-10.5-x86_64-2.7/_imagingft.o'
error: Setup script exited with error: command 'gcc' failed with exit status 1
I'm not sure what exactly is wrong, because I have tried a few times from scratch, deleting the 'python-goose' directory and './virtualenv' as well as the path entry in .bash_profile.
Any help would be much much appreciated!
Thanks
P.S. I'm using Anaconda with Python 2.7 in it.
Source: (StackOverflow)
Goose is a tool that extracts text, photos, pictures and so on from URLs.
It is written in Python. All of the code is at the following URL.
https://github.com/grangier/python-goose/tree/develop/goose
My main purpose is to add processing for other languages that are NOT supported in the current version.
First, I read the tutorials; Chinese, Korean and Arabic can be handled by setting the "stop_words" parameter.
So I searched the entire package for this notion of "stop_words".
I found the following Python classes.
class StopWords(object):
class StopWordsChinese(StopWords):
class StopWordsArabic(StopWords):
class StopWordsKorean(StopWords):
I also found text files of stop words written in various languages. These files are located under /resources/text/ at the above URL.
QUESTION 1:
Are there other components in the package that I would need to rewrite in order to add Japanese and the other languages that are NOT included in the newest version?
QUESTION 2:
As a first step, I want to add Japanese support. Are there any tips for web scraping from URLs in Japanese?
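To illustrate why a new language may need more than a word list, here is a toy sketch of stop-word counting. It is not Goose's actual StopWords code; the function names and the tiny word sets are made up for illustration. Whitespace splitting works for English, but Japanese (like Chinese) is written without spaces, so a separate tokenization step is needed, which is presumably why a dedicated class such as StopWordsChinese exists.

```python
# -*- coding: utf-8 -*-


def count_stopwords(text, stopwords, tokenize=None):
    """Count how many tokens of text appear in the stopwords set."""
    tokenize = tokenize or (lambda s: s.lower().split())
    return sum(1 for token in tokenize(text) if token in stopwords)


def char_tokenize(text):
    """Naive fallback for scripts written without spaces: one token per character."""
    return [ch for ch in text if not ch.isspace()]


english = {"the", "is", "of", "a"}
english_count = count_stopwords("The cat is a friend of the dog", english)
print(english_count)

# Illustrative single-character particles only; a real list would be much larger.
japanese = {u"の", u"は", u"を"}
japanese_count = count_stopwords(u"猫は魚を食べる", japanese, tokenize=char_tokenize)
print(japanese_count)
```

Per-character matching is only a placeholder; a real Japanese version would more likely rely on a morphological tokenizer, and which one to use is left open here.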
Source: (StackOverflow)
I am trying to properly set up python-goose in a virtualenv.
Update: I nuked python and started with a clean install as outlined here.
I followed the python-goose instructions and did:
mkvirtualenv --no-site-packages goose
git clone https://github.com/grangier/python-goose.git
cd python-goose
pip install -r requirements.txt
python setup.py install
pip install -r requirements.txt fails on lxml.
The error I get now is:
error: command 'cc' failed with exit status 1
----------------------------------------
Cleaning up...
Command /Users/me/.virtualenvs/goose/bin/python -c "import setuptools, tokenize;__file__='/Users/me/.virtualenvs/goose/build/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/wg/82j6ndq50tl4m9rjkqszyx8r0000gp/T/pip-c9DtYT-record/install-record.txt --single-version-externally-managed --compile --install-headers
/Users/me/.virtualenvs/goose/include/site/python2.7 failed with error code 1 in
/Users/me/.virtualenvs/goose/build/lxml
Is there anything I am doing incorrectly or are there any alternative ways I can try to get this working?
Source: (StackOverflow)