
beautifulsoup interview questions

Top beautifulsoup frequently asked interview questions

Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"

Is there a way to get around the following?

httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales; I'm not sure why they would deny access at a certain depth.

I'm using mechanize and BeautifulSoup on Python2.6.

I'm hoping for a workaround.
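A minimal sketch of the usual workaround, assuming mechanize's Browser interface: mechanize honours robots.txt by default, but that handling can be switched off (whether doing so is appropriate for a given site is a separate question). The user-agent string here is a hypothetical placeholder:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # stop mechanize from fetching and obeying robots.txt
# Some sites also reject the default Python user agent:
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; ExampleBot)')]   # hypothetical UA
response = br.open('http://www.barnesandnoble.com/')
html = response.read()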


Source: (StackOverflow)

Beautiful Soup and extracting a div and its contents by ID

soup.find("tagName", { "id" : "articlebody" })

Why does this NOT return the <div id="articlebody"> ... </div> tags and the stuff in between? It returns nothing. And I know for a fact it exists, because I'm staring right at it in the output of

soup.prettify()

soup.find("div", { "id" : "articlebody" }) also does not work.

Edit: There is no answer to this post; how do I delete it? I found that BeautifulSoup is not parsing the page correctly, which probably means the page I'm trying to parse isn't properly formatted SGML or whatever.
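For reference, a minimal sketch of how the id lookup normally behaves when the markup parses cleanly (BeautifulSoup 3 syntax, as used elsewhere on this page); if this style of call returns None on a real page, the parser has usually mangled the tree, and soup.prettify() around the target div shows how it was actually parsed:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<html><body><div id="articlebody">hello</div></body></html>')
print soup.find('div', {'id': 'articlebody'})   # -> <div id="articlebody">hello</div>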


Source: (StackOverflow)


Downloading a picture via urllib and python

So I'm trying to make a Python script that downloads webcomics and puts them in a folder on my desktop. I've found a few programs on here that do something similar, but nothing quite like what I need. The one I found most similar is right here (http://bytes.com/topic/python/answers/850927-problem-using-urllib-download-images). I tried using this code:

>>> import urllib
>>> image = urllib.URLopener()
>>> image.retrieve("http://www.gunnerkrigg.com//comics/00000001.jpg","00000001.jpg")
('00000001.jpg', <httplib.HTTPMessage instance at 0x1457a80>)

I then searched my computer for a file "00000001.jpg", but all I found was the cached picture of it. I'm not even sure it saved the file to my computer. Once I understand how to get the file downloaded, I think I know how to handle the rest. Essentially just use a for loop, split the string at '00000000'.'jpg', and increment the '00000000' up to the largest number, which I would have to somehow determine. Any recommendations on the best way to do this, or on how to download the file correctly?

Thanks!
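For the download step itself, a minimal sketch using Python 2's urllib.urlretrieve, which fetches the URL and writes it to the given path in one call (the URL is the one from the question):

import urllib

# Saves the image into the current working directory as 00000001.jpg.
urllib.urlretrieve("http://www.gunnerkrigg.com//comics/00000001.jpg",
                   "00000001.jpg")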

EDIT 6/15/10

Here is the completed script; it saves the files to any directory you choose. For some odd reason the files weren't downloading at first, and then they just did. Any suggestions on how to clean it up would be much appreciated. I'm currently working out how to find out how many comics exist on the site, so I can get just the latest one rather than having the program quit after a certain number of exceptions are raised.

import urllib
import os

comicCounter = len(os.listdir('/file')) + 1  # count the files already in the folder so we start at the next comic
errorCount = 0

def download_comic(url, comicName):
    """
    Download a single comic.

    url       -- full URL of the image, e.g. http://www.example.com/00000000.jpg
    comicName -- local file name to save it under, e.g. '00000000.jpg'
    """
    image = urllib.URLopener()
    image.retrieve(url, comicName)  # download url and save it as comicName

while comicCounter <= 1000:  # not the most elegant solution
    os.chdir('/file')  # set where files download to
    try:
        # comic names are a run of zeros followed by the number, so the
        # amount of padding changes at each power of ten
        if comicCounter < 10:
            comicNumber = '0000000' + str(comicCounter)  # eight-digit comic number
        elif 10 <= comicCounter < 100:
            comicNumber = '000000' + str(comicCounter)
        elif 100 <= comicCounter < 1000:
            comicNumber = '00000' + str(comicCounter)
        else:  # quit the program if any number outside this range shows up
            break
        comicName = comicNumber + ".jpg"  # string containing the file name
        url = "http://www.gunnerkrigg.com//comics/" + comicName  # URL for the comic
        comicCounter += 1  # increment before the download in case the download raises an exception
        download_comic(url, comicName)  # uses the function defined above to download the comic
        print url
    except IOError:  # urllib raises an IOError for a 404 error, when the comic doesn't exist
        errorCount += 1  # add one to the error count
        if errorCount > 3:  # if more than three errors occur during downloading, quit the program
            break
        else:
            # the counter was already advanced, so the failed comic is comicCounter - 1
            print "comic " + str(comicCounter - 1) + " does not exist"
print "all comics are up to date"  # prints once the loop finishes

Source: (StackOverflow)

retrieve links from web page using python and BeautifulSoup

How can I retrieve the links of a webpage and copy the URL addresses of those links using Python?
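A minimal sketch of one common approach, assuming BeautifulSoup 3 (the import style used elsewhere on this page) and a hypothetical placeholder URL:

import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen('http://www.example.com/').read()   # hypothetical page
soup = BeautifulSoup(html)

# href=True skips anchors that have no href attribute at all.
for anchor in soup.findAll('a', href=True):
    print anchor['href']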


Source: (StackOverflow)

ImportError: No Module Named bs4 (BeautifulSoup)

I'm working in Python and using Flask. When I run my main Python file on my computer, it works perfectly, but when I activate the venv and run the Flask Python file in the terminal, it says that my main Python file has "No Module Named bs4." Any comments or advice are greatly appreciated.
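A virtualenv has its own site-packages, so a package installed globally is invisible inside it; installing beautifulsoup4 with the venv's own pip usually fixes this. A minimal diagnostic sketch to confirm which interpreter is actually running:

import sys
print(sys.executable)   # should point inside the venv when it is activated

try:
    import bs4
    print(bs4.__version__)
except ImportError:
    print("bs4 is not installed for this interpreter")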


Source: (StackOverflow)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.

The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.

One of the sections of code that is causing problems is shown below:

agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

Here is a stack trace produced on SOME strings when the snippet above is run:

Traceback (most recent call last):
  File "foobar.py", line 792, in <module>
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption, so there are no issues relating to internationalization or dealing with text written in anything other than English.

Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?
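The usual culprit in this pattern is calling str() on unicode data, which implicitly encodes to ASCII and fails on characters like the non-breaking space (u'\xa0'). A minimal sketch of the fix, with hypothetical values, keeping everything unicode and encoding only at the output boundary:

# -*- coding: utf-8 -*-
agent_contact = u'Jane Doe'           # hypothetical value
agent_telno = u'\xa0 0123 456 789'    # hypothetical value containing a non-breaking space

# Unicode + unicode concatenation never triggers the implicit ASCII
# encode that str() performs on unicode input.
info = (agent_contact + u' ' + agent_telno).strip()
print info.encode('utf-8')            # encode explicitly only when writing bytes out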


Source: (StackOverflow)

BeautifulSoup Grab Visible Webpage Text

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case: http://www.nytimes.com/2009/12/21/us/21storm.html. I mainly want to get the body text (the article) and maybe even a few tab names here and there. However, after trying this suggestion http://stackoverflow.com/questions/1752662/beautifulsoup-easy-way-to-to-obtain-html-free-contents, I get lots of tags and HTML comments which aren't needed. I can't figure out the right arguments to findAll (http://www.crummy.com/software/BeautifulSoup/documentation.html#arg-limit) to do what I need.

So, how should I find all visible text, excluding scripts, comments, CSS, and other junk?
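A minimal sketch of one common approach, assuming BeautifulSoup 3: take every text node, then filter out the ones whose parent is a non-visible container and the ones that are HTML comments. Which tag names count as non-visible is a judgment call:

import urllib2
from BeautifulSoup import BeautifulSoup, Comment

html = urllib2.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
soup = BeautifulSoup(html)

def visible(element):
    if element.parent.name in ('style', 'script', 'head', 'title'):
        return False
    if isinstance(element, Comment):   # drop <!-- ... --> nodes
        return False
    return True

texts = soup.findAll(text=True)
print ' '.join(t.strip() for t in texts if visible(t))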


Source: (StackOverflow)

ImportError: No module named BeautifulSoup

I have installed BeautifulSoup using easy_install and am trying to run the following script:

from BeautifulSoup import BeautifulSoup
import re

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()

But I'm not sure why this is happening:

Traceback (most recent call last):
  File "C:\Python27\reading and writing xml file from web1.py", line 49, in <module>
    from BeautifulSoup import BeautifulSoup
ImportError: No module named BeautifulSoup

Could you please help? Thanks.
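One common cause worth checking: if easy_install actually pulled in beautifulsoup4 rather than BeautifulSoup 3, the import path is different. A minimal sketch of both spellings:

# BeautifulSoup 4 installs as the bs4 package:
from bs4 import BeautifulSoup

# BeautifulSoup 3 installs as a top-level BeautifulSoup module:
# from BeautifulSoup import BeautifulSoup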


Source: (StackOverflow)

BeautifulSoup getting href [duplicate]

This question already has an answer here:

I have the following soup:

<a rel='nofollow' href="some_url">next</a>
<span class="class">...</span>

From this I want to extract the href, "some_url".

I can do it if I only have one tag, but here there are two. I can also get the text 'next', but that's not what I want.

Also, is there a good description of the API somewhere, with examples? I'm using the standard documentation, but I'm looking for something a little more organized.
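A minimal sketch, assuming BeautifulSoup 3 and the markup from the question: restricting the search to <a> tags means the <span> never gets in the way, and attributes are read with dictionary-style access:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('''<a rel='nofollow' href="some_url">next</a>
<span class="class">...</span>''')

link = soup.find('a', href=True)   # only <a> tags that actually carry an href
print link['href']                 # -> some_url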


Source: (StackOverflow)

Beautiful Soup cannot find a CSS class if the object has other classes, too

If a page has <div class="class1"> and <p class="class1">, then soup.findAll(True, 'class1') will find them both.

If it has <p class="class1 class2">, though, it will not be found. How do I find all elements with a certain class, regardless of whether they also have other classes?
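A minimal sketch of the usual BeautifulSoup 3 workaround: BS3 matches the class attribute as one exact string, so a regular expression that matches class1 as a whole word catches multi-class elements too:

import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<div class="class1">a</div><p class="class1 class2">b</p>')

# \bclass1\b matches class1 as a whole word anywhere in the attribute value.
matches = soup.findAll(True, {'class': re.compile(r'\bclass1\b')})
print [tag.name for tag in matches]   # -> [u'div', u'p']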


Source: (StackOverflow)

difference between BeautifulSoup and Scrapy crawler?

I want to make a website that compares Amazon and eBay product prices. Which of these will work better, and why? I am somewhat familiar with BeautifulSoup but not so much with the Scrapy crawler.


Source: (StackOverflow)

Can I remove script tags with BeautifulSoup?

Can script tags and all of their contents be removed from HTML with BeautifulSoup, or do I have to use regular expressions or something else?
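A minimal sketch, assuming BeautifulSoup 3: extract() removes a tag and everything inside it from the tree, so no regular expressions are needed:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<html><body><script>alert(1);</script><p>kept</p></body></html>')

for script in soup.findAll('script'):
    script.extract()   # detach the tag and its contents from the tree

print soup   # -> <html><body><p>kept</p></body></html>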


Source: (StackOverflow)

How to find children of nodes using Beautiful Soup

I want to get all the <a> tags which are children of <li>

<div>
<li class="test">
    <a>link1</a>
    <ul> 
       <li>  
          <a>link2</a> 
       </li>
    </ul>
</li>
</div>

I know how to find an element with a particular class, like this:

soup.find("li", { "class" : "test" }) 

But I don't know how to find all the <a> tags that are direct children of <li class="test"> but not any others.

For example, I want to select only:

<a> link1 </a>
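A minimal sketch, assuming BeautifulSoup 3 and the markup from the question: passing recursive=False to findAll limits the search to direct children, so the nested link2 is excluded:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('''<div><li class="test"><a>link1</a>
<ul><li><a>link2</a></li></ul></li></div>''')

li = soup.find('li', {'class': 'test'})
print li.findAll('a', recursive=False)   # -> [<a>link1</a>]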

Source: (StackOverflow)

Using BeautifulSoup to find a HTML tag that contains certain text

I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11}

<h2> this is cool #12345678901 </h2>

So, the previous would match by using:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

I'm able to get all the text that matches (see the line above). But I want the parent element of the text to match, so I can use that as a starting point for traversing the document tree. In this case, I'd want all the h2 elements returned, not the text matches.

Ideas?
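A minimal sketch, assuming BeautifulSoup 3: a text= search returns NavigableString objects, and stepping up to .parent of each match yields the enclosing element instead:

import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<h2> this is cool #12345678901 </h2><h2>no match here</h2>')

# soup(...) is shorthand for soup.findAll(...); text= matches are strings,
# so take .parent to get the enclosing tag.
headings = [text.parent for text in soup(text=re.compile(r' #\S{11}'))]
print headings   # -> [<h2> this is cool #12345678901 </h2>]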


Source: (StackOverflow)

Best way for a beginner to learn screen scraping with python

This might be one of those questions that are difficult to answer, but here goes:

I don't consider myself a programmer, but I would like to :-) I've learned R because I was sick and tired of SPSS, and because a friend introduced me to the language, so I am not a complete stranger to programming logic.

Now I would like to learn Python, primarily to do screen scraping and text analysis, but also for writing web apps with Pylons or Django.

So: how should I go about learning to screen scrape with Python? I started going through the Scrapy docs, but I feel too much "magic" is going on; after all, I am trying to learn, not just do.

On the other hand, there is no reason to reinvent the wheel, and if Scrapy is to screen scraping what Django is to web pages, then it might after all be worth jumping straight into Scrapy. What do you think?

Oh, BTW, about the kind of screen scraping: I want to scrape newspaper sites (i.e. fairly complex and big) for mentions of politicians etc. That means I will need to scrape daily, incrementally, and recursively, and I need to log the results into a database of sorts, which leads me to a bonus question: everybody is talking about NoSQL databases. Should I learn to use e.g. MongoDB right away (I don't think I need strong consistency), or is that foolish for what I want to do?

Thank you for any thoughts, and I apologize if this is too general to be considered a programming question.


Source: (StackOverflow)