
newspaper

News, full-text, and article metadata extraction in Python 2.6–3.4 (newspaper 0.0.2 documentation)

ImportError when installing newspaper

I am pretty new to Python and am trying to import newspaper for article extraction. Whenever I try to import the module I get ImportError: cannot import name images. Has anyone come across this problem and found a solution?
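
When an import fails inside a package's own code like this, it helps to pin down which interpreter and which installed copy are involved before guessing at a cause. A generic diagnostic sketch, not specific to this particular error:

import sys
print(sys.version)         # confirm which Python is actually running

import newspaper           # if this raises, the traceback names the file at fault
print(newspaper.__file__)  # confirm which installed copy was imported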


Source: (StackOverflow)

OpenCV Newspaper Segmentation

I'm totally new to OpenCV and to working with images. I have to detect the "areas" composing a newspaper page, such as articles, titles, and images, then keep only the text areas and run OCR on them. I have been searching for a week and found almost nothing. I tried line detection with the Hough transform, watershed segmentation, and rectangle identification... A common pattern in a newspaper page is that it is composed of rectangular areas, each separated by thick vertical and horizontal lines. Can someone point me in the right direction?
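
One approach worth sketching for this layout: use morphological opening with long, thin kernels to isolate the thick horizontal and vertical rules the question describes, remove them, and take the remaining connected blobs as candidate article regions. A minimal sketch with OpenCV's Python bindings; the file name, kernel sizes, and area threshold are illustrative assumptions, not values from the question:

import cv2

# Load the scanned page and binarize it (inverted, so ink is white).
img = cv2.imread('page.png', cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Opening with a wide (or tall) kernel keeps only long horizontal (or vertical) rules.
h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (80, 1))
v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 80))
h_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
v_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)

# Remove the rules, then dilate so the remaining ink merges into solid blocks.
content = cv2.subtract(binary, cv2.add(h_lines, v_lines))
blocks = cv2.dilate(content, cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15)))

# Each sufficiently large contour is a candidate article/title/image region.
contours = cv2.findContours(blocks, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w * h > 5000:  # arbitrary cut-off to drop specks
        cv2.rectangle(img, (x, y), (x + w, y + h), 0, 2)

cv2.imwrite('segmented.png', img)

The [-2] index keeps the contour list across OpenCV versions, whose findContours return signatures differ. Deciding which of the resulting regions are text (and therefore worth sending to OCR) still has to happen separately, e.g. by texture features or by letting the OCR engine's confidence filter them.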


Source: (StackOverflow)


downloading articles from multiple urls with newspaper

I've been trying to extract multiple articles from a webpage (Zeit Online, a German newspaper), for which I have a list of urls I want to download articles from, so I do not need to crawl the page for urls.

The newspaper package for Python does an awesome job of parsing the content of a single page. What I would need to do is automatically change the urls until all the articles are downloaded. Unfortunately, I have limited coding knowledge and haven't found a way to do that. I'd be very grateful if anyone could help me.

One of the things I tried was the following:

import newspaper
from newspaper import Article

lista = ['url', 'url']

for list in lista:  # 'list' shadows the built-in name; 'url' would be clearer
    # Bug: the % is applied to the Article object rather than to the "%s"
    # string, which is what raises the TypeError quoted below.
    first_article = Article(url="%s", language='de') % list
    first_article.download()
    first_article.parse()
    print(first_article.text)

It returned the following error: unsupported operand type(s) for %: 'Article' and 'str'.

This seems to do the job, although I'd expect there to be an easier way involving fewer apples and bananas.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from newspaper import Article

lista = ['http://www.zeit.de/1946/01/unsere-aufgabe',
         'http://www.zeit.de/1946/04/amerika-baut-auf',
         'http://www.zeit.de/1946/04/bedingung',
         'http://www.zeit.de/1946/04/bodenrecht']

apple = 0

while apple < 4:
    banana = lista[apple]  # look up the url at the top of each pass,
                           # so the last pass cannot index past the list
    first_article = Article(url=banana, language='de')
    first_article.download()
    first_article.parse()
    # Python 2 print statement; 'replace' substitutes characters
    # the Windows cp850 console cannot display
    print(first_article.text.encode('cp850', 'replace'))
    apple += 1
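
For reference, the same download-parse-print cycle reads more naturally as a plain for loop over the list, which drops the counter bookkeeping (and the apples and bananas) entirely. A sketch in the question's Python 2 style:

# -*- coding: utf-8 -*-
from newspaper import Article

urls = ['http://www.zeit.de/1946/01/unsere-aufgabe',
        'http://www.zeit.de/1946/04/amerika-baut-auf',
        'http://www.zeit.de/1946/04/bedingung',
        'http://www.zeit.de/1946/04/bodenrecht']

for url in urls:
    article = Article(url=url, language='de')
    article.download()
    article.parse()
    print(article.text.encode('cp850', 'replace'))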

Source: (StackOverflow)

How to use newspaper, a Python library?

I'm a beginner at Python, trying to build a web parser and save its output. I found newspaper and finished setting it up on my PC, and I write my code in Eclipse, but I couldn't get a good result. Please help me.

import newspaper

cnn_paper = newspaper.build('http://cnn.com')

for article in cnn_paper.articles:
    print(article.url)

This is the error message:

Traceback (most recent call last):
  File "D:\workspace2\JesElaSearchSys\NespaperScraper_01.py", line 2, in <module>
    import newspaper
  File "C:\Python27\lib\site-packages\newspaper3k-0.1.5-py2.7.egg\newspaper\__init__.py", line 10, in <module>
    from .article import Article, ArticleException
  File "C:\Python27\lib\site-packages\newspaper3k-0.1.5-py2.7.egg\newspaper\article.py", line 12, in <module>
    from . import images
  File "C:\Python27\lib\site-packages\newspaper3k-0.1.5-py2.7.egg\newspaper\images.py", line 15, in <module>
    import urllib.request
ImportError: No module named request
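
The egg paths in the traceback point at a likely cause: the installed package is newspaper3k, which targets Python 3 (urllib.request does not exist on Python 2), while the interpreter is Python 2.7. A minimal guard, assuming that mismatch is indeed the problem; on Python 2 the original newspaper package is the one to install instead:

import sys

# newspaper3k imports urllib.request, which only exists on Python 3.
if sys.version_info[0] < 3:
    raise RuntimeError("newspaper3k needs Python 3; on Python 2, "
                       "install the original 'newspaper' package")

import newspaper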

Source: (StackOverflow)

Trouble parsing URLs from a list imported with pickle using the newspaper library

I've been trying to pass a list of urls to extract articles from the pages. Extraction (with newspaper) works just fine if I build the list of urls by hand in the script (e.g. lista = ['http://www.zeit.de', ...]). Taking the list from another file does not work, however, even though printing the list works. The following is the code:

# -*- coding: utf-8 -*-
import io
from newspaper import Article

lista = ['http://www.zeit.de',
         'http://www.guardian.co.uk',
         'http://www.zeit.de',
         'http://www.spiegel.de']

apple = 0

while apple < 4:
    banana = lista[apple]                   # url for this pass
    orange = "file_" + str(apple) + ".txt"  # output file for this pass

    first_article = Article(url=banana, language='de')
    first_article.download()
    first_article.parse()

    # Python 2 print statement; encode for the Windows cp850 console
    print(first_article.text.encode('cp850', 'replace'))

    with io.open(orange, 'w', encoding='utf-8') as f:
        f.write(first_article.text)

    apple += 1

The above MCVE works fine. When I unpickle my list, printing it to the console works as I expect, for example with this script:

import pickle

lista = pickle.load(open("save.p", "rb"))
print lista

A sample of the list output looks like this:

['www.zeit.de/1998/51/Psychokrieg_mit_Todesfolge', 'www.zeit.de/1998/51/Raffgierig',
 'www.zeit.de/1998/51/Runter_geht_es_schnell', 'www.zeit.de/1998/51/Runter_mit_den_Zinsen_',
 'www.zeit.de/1998/51/SACHBUCH', 'www.zeit.de/1998/51/Schwerer_Irrtum',
 'www.zeit.de/1998/51/Silvester_mit_Geist', 'www.zeit.de/1998/51/Tannen_ohne_Nachwuchs',
 'www.zeit.de/1998/51/This_is_Mu_hen', 'www.zeit.de/1998/51/Tuechtig',
 'www.zeit.de/1998/51/Ungelehrig']

but there are thousands of urls in the list.

The error message shown doesn't tell me much (full traceback below):

Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\newspaper\parsers.py", line 53, in fromstring
    cls.doc = lxml.html.fromstring(html)
  File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 706, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 600, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121)
  File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102470)
  File "parser.pxi", line 1667, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:101229)
  File "parser.pxi", line 1035, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:96139)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:92476)
  File "parser.pxi", line 633, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:91939)
XMLSyntaxError: None

I've been trying to fix this for hours but I just haven't found a way. Any help would be greatly appreciated.
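
One plausible cause is visible in the sample output: the hard-coded list that works uses full urls ('http://www.zeit.de'), while the pickled entries start with 'www.' only. Without a scheme the download step can come back empty, and lxml then chokes on an empty document, which would match the uninformative XMLSyntaxError: None. A cheap thing to try before digging deeper (hedged, since the pickle file itself isn't shown):

import pickle

lista = pickle.load(open("save.p", "rb"))
# Prefix a scheme onto any url that lacks one.
lista = ['http://' + u if not u.startswith('http') else u for u in lista]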


Source: (StackOverflow)

news aggregator for sentiment analysis

I am writing a little news sentiment analysis app in Python. I want to prepare a database of news articles to train my classifier on, so I am wondering about the best course of action for fetching news articles off the web. I looked at newspaper, which looks like a cool module and very generic, but what I am looking for is a way of fetching old news articles, i.e. all news articles of 2014. newspaper only uses RSS feeds, which never go far back. Another option would be writing a scraper for Google News and filtering by date in the url, or using the APIs of publishers like the NYT (they have an API).

What is the best way to create a news article database like this? Is there a tool/database on the web I can use to get the articles?
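
Of the options the question lists, the publisher APIs are the most direct to sketch. A minimal example against the NYT Article Search API; the endpoint and parameter names follow NYT's public documentation, and YOUR_KEY is a placeholder for a registered API key:

import requests

resp = requests.get(
    'https://api.nytimes.com/svc/search/v2/articlesearch.json',
    params={
        'q': 'economy',            # free-text query
        'begin_date': '20140101',  # YYYYMMDD
        'end_date': '20141231',
        'api-key': 'YOUR_KEY',
    },
)
for doc in resp.json()['response']['docs']:
    print(doc['web_url'])

The urls this returns can then be fed to newspaper's Article for full-text extraction, as in the other questions above.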


Source: (StackOverflow)