mechanize
Mechanize is a Ruby library that makes automated web interaction easy.
RDoc Documentation
I'm trying to do the following:
page = Mechanize.new.get "https://sis-app.sph.harvard.edu:9030/prod/bwckschd.p_disp_dyn_sched"
But I only get this exception:
OpenSSL::SSL::SSLError: SSL_connect returned=1 errno=0 state=SSLv2/v3 read server hello A: sslv3 alert illegal parameter
from /Users/amosng/.rvm/gems/ruby-1.9.3-p194/gems/net-http-persistent-2.7/lib/net/http/persistent/ssl_reuse.rb:70:in `connect'
from /Users/amosng/.rvm/gems/ruby-1.9.3-p194/gems/net-http-persistent-2.7/lib/net/http/persistent/ssl_reuse.rb:70:in `block in connect'
from /Users/amosng/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/timeout.rb:54:in `timeout'
from /Users/amosng/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/timeout.rb:99:in `timeout'
from /Users/amosng/.rvm/gems/ruby-1.9.3-p194/gems/net-http-persistent-2.7/lib/net/http/persistent/ssl_reuse.rb:70:in `connect'
from /Users/amosng/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:755:in `do_start'
from /Users/amosng/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:750:in `start'
from /Users/amosng/.rvm/gems/ruby-1.9.3-p194/gems/net-http-persistent-2.7/lib/net/http/persistent.rb:511:in `connection_for'
from /Users/amosng/.rvm/gems/ruby-1.9.3-p194/gems/net-http-persistent-2.7/lib/net/http/persistent.rb:806:in `request'
from /Users/amosng/.rvm/gems/ruby-1.9.3-p194/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:258:in `fetch'
from /Users/amosng/.rvm/gems/ruby-1.9.3-p194/gems/mechanize-2.5.1/lib/mechanize.rb:407:in `get'
from (irb):549
from /Users/amosng/.rvm/rubies/ruby-1.9.3-p194/bin/irb:16:in `<main>'
How can I get the webpage to load in Mechanize?
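One thing worth trying, as a sketch rather than a definitive fix: the "illegal parameter" alert usually means the client and server could not agree on a protocol version or cipher during the handshake, so forcing a specific SSL/TLS version on the underlying connection (or, for testing only, relaxing certificate verification) sometimes gets past it. The attribute paths below assume Mechanize 2.x on top of net-http-persistent, as in the backtrace:
require 'mechanize'
require 'openssl'

agent = Mechanize.new

# Force a specific handshake version on the underlying net-http-persistent
# connection; which value the server accepts is a guess ('SSLv3', 'TLSv1', ...).
agent.agent.http.ssl_version = 'TLSv1'

# Only if certificate verification itself is the problem (insecure, testing only):
# agent.verify_mode = OpenSSL::SSL::VERIFY_NONE

page = agent.get 'https://sis-app.sph.harvard.edu:9030/prod/bwckschd.p_disp_dyn_sched'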
Source: (StackOverflow)
I'm using Mechanize to facilitate the downloading of some files. At the moment my script uses the following line to actually download the files...
agent.get('http://example.com/foo').save_as 'a_file_name'
However, this downloads the complete file into memory before dumping it to disk. How can I avoid this behaviour and download straight to disk instead? If I need to use something other than WWW::Mechanize, how would I go about using WWW::Mechanize's cookies with it?
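One approach, sketched under the assumption of Mechanize 2.x: register Mechanize::Download as the default pluggable parser so the response is not parsed into a Mechanize::Page, then save it:
require 'mechanize'

agent = Mechanize.new

# Treat responses as downloads instead of parsing them as HTML pages
agent.pluggable_parser.default = Mechanize::Download

agent.get('http://example.com/foo').save('a_file_name')
This skips building a parsed page; whether the body is fully streamed to disk still depends on the Mechanize version, so for very large files you may prefer to hand the request to Net::HTTP yourself while reusing agent.cookie_jar for the cookies.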
Source: (StackOverflow)
I'm working in Ruby, but my question is valid for other languages as well.
I have a Mechanize-driven application. The server I'm talking to sets a cookie using JavaScript (rather than a standard Set-Cookie header), so Mechanize doesn't catch the cookie. I need to pass that cookie back on the next GET request.
The good news is that I already know the value of the cookie, but I don't know how to tell Mechanize to include it in my next GET request.
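A sketch of what that could look like in Ruby Mechanize (the cookie name, value, and domain are placeholders for the ones you already know, and the cookie-jar API varies slightly between Mechanize versions):
require 'mechanize'

agent = Mechanize.new

cookie = Mechanize::Cookie.new('session_id', 'the-value-you-already-know')
cookie.domain = '.example.com'   # must match the host of the next request
cookie.path   = '/'

# depending on version: add!(cookie), add(uri, cookie), or add(cookie)
agent.cookie_jar.add!(cookie)

page = agent.get('http://example.com/next-page')  # the cookie is now sent along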
Source: (StackOverflow)
I know there is a set of pre-defined aliases you can use, by setting agent.user_agent_alias = 'Linux Mozilla' for instance, but what if I want to set my own user agent? I'm writing a web crawler and want to identify it to the sites I'm indexing, just like Googlebot does.
There seems to be a user_agent method, but I can't seem to find any documentation about its function.
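For reference, a minimal sketch: user_agent= just sets the raw User-Agent string sent with every request, so you can identify the crawler however you like (the string below is only an example):
require 'mechanize'

agent = Mechanize.new
agent.user_agent = 'MyCrawler/1.0 (+http://example.com/crawler-info)'

page = agent.get('http://example.com/')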
Source: (StackOverflow)
I want to screen-scrape a web-site that uses JavaScript.
There is mechanize, the programmatic web browser for Python. However, it (understandably) doesn't interpret JavaScript. Is there any programmatic browser for Python which does? If not, is there any JavaScript implementation in Python that I could use to attempt to create one?
Source: (StackOverflow)
I'm sorry to have to ask something like this, but Python's mechanize documentation seems to be really lacking and I can't figure this out. They only give one example that I can find for following a link:
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
But I don't want to use a regex; I just want to follow a link based on its URL. How would I do this? Also, what is the "nr" argument that is sometimes used when following links?
Thanks for any info
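For reference, a sketch based on the keyword arguments mechanize's find_link/follow_link accept (text, text_regex, url, url_regex, nr, ...); nr is simply a zero-based index used to pick among the links that match the other criteria, defaulting to 0, the first match. The URLs below are placeholders:
import mechanize

br = mechanize.Browser()
br.open("http://example.com/")

# follow a link by its href exactly as it appears in the page
response = br.follow_link(url="downloads/cheese.zip")

# alternatively, match part of the URL with a regex; nr picks among matches (0-based)
# response = br.follow_link(url_regex=r"cheese", nr=1)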
Source: (StackOverflow)
I have a mechanize script written in Python that fills out a web form and is supposed to click the 'create' button. But there's a problem: the form has two buttons, one for 'add attached file' and one for 'create'. Both are of type 'submit', and the attach button is listed first. So when I select the form and call br.submit(), it clicks the 'attach' button instead of 'create'. Extensive Googling has yielded nothing useful for selecting a specific button in a form. Does anyone know of a way to skip over the first 'submit' button and click the second?
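A sketch of two ways that should work: Browser.submit() passes its arguments to the form's click(), so a submit control can be chosen by its name attribute or by position (nr is a zero-based index over the form's submit buttons). The form index and button name below are assumptions about the page:
import mechanize

br = mechanize.Browser()
br.open("http://example.com/new-item")   # placeholder URL
br.select_form(nr=0)                     # the form being filled out

# ... set the form fields here ...

# Option 1: if the 'create' button has a name attribute, pick it by name
response = br.submit(name="create")

# Option 2: pick the second submit-type control, skipping 'add attached file'
# response = br.submit(nr=1)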
Source: (StackOverflow)
Is there a way to get around the following?
httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales; I'm not sure why they would deny access at a certain depth.
I'm using mechanize and BeautifulSoup on Python 2.6.
I'm hoping for a workaround.
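For what it's worth, mechanize only refuses because robots.txt handling is switched on by default; a sketch of turning it off (whether you should, given the site's terms of use, is a separate question):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # stop checking robots.txt before each request
# some sites also block the default Python user agent, so set a descriptive one
br.addheaders = [("User-Agent", "Mozilla/5.0 (compatible; my-scraper)")]

response = br.open("http://www.barnesandnoble.com/")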
Source: (StackOverflow)
I am trying to import a module from a particular directory.
The problem is that if I use sys.path.append(mod_directory) to append the path and then open the Python interpreter, the directory mod_directory gets added to the end of the list sys.path. If I export the PYTHONPATH variable before opening the Python interpreter, the directory gets added to the start of the list. In the latter case I can import the module, but in the former I cannot.
Can somebody explain why this is happening and give me a solution to add mod_directory to the start of the list, inside a Python script?
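A sketch of the usual fix: sys.path.insert(0, ...) puts the directory at the front of the search path from inside the script, which is effectively what exporting PYTHONPATH does (mod_directory and my_module are placeholders):
import sys

mod_directory = "/path/to/your/modules"   # placeholder

# insert at the front so this directory is searched before anything already on
# sys.path, e.g. an identically named module installed elsewhere
sys.path.insert(0, mod_directory)

import my_module   # now found in mod_directory first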
Source: (StackOverflow)
I installed Ruby and Mechanize. It seems to me that it is possible in Nokogiri to do what I want, but I do not know how to do it.
What about this table? It is just part of the HTML of a vBulletin forum site. I tried to keep the HTML structure but deleted some text and tag attributes. I want to get some details per thread: Title, Author, Date, Time, Replies, and Views.
Please note that there are a few tables in the HTML document. I am after one particular table, the one whose tbody is <tbody id="threadbits_forum_251">. The id will always be the same (I hope). Can I use the tbody and the id in the code?
<table >
  <tbody>
    <tr> <!-- table header --> </tr>
  </tbody>
  <!-- show threads -->
  <tbody id="threadbits_forum_251">
    <tr>
      <td></td>
      <td></td>
      <td>
        <div>
          <a rel='nofollow' href="showthread.php?t=230708" >Vb4 Gold Released</a>
        </div>
        <div>
          <span><a>Paul M</a></span>
        </div>
      </td>
      <td>
        06 Jan 2010 <span class="time">23:35</span><br />
        by <a rel='nofollow' href="member.php?find=lastposter&t=230708">shane943</a>
        </div>
      </td>
      <td><a rel='nofollow' href="#">24</a></td>
      <td>1,320</td>
    </tr>
  </tbody>
</table>
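A sketch of how the tbody id can be used, assuming the page is fetched with Mechanize and queried through Nokogiri via page.search (CSS selectors); the URL and the per-cell selectors are guesses based on the fragment above:
require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com/forumdisplay.php?f=251')  # placeholder URL

# Scope the query to the one tbody we care about, then walk its rows
page.search('tbody#threadbits_forum_251 > tr').each do |row|
  cells = row.search('td')
  next if cells.size < 6                   # skip rows that are not thread rows

  title_link = cells[2].at('a[href*="showthread.php"]')
  author     = cells[2].at('span a')
  date       = cells[3].text[/\d{1,2} \w{3} \d{4}/]
  time       = cells[3].at('span.time')
  replies    = cells[4].text.strip
  views      = cells[5].text.strip

  puts [title_link && title_link.text,
        author && author.text,
        date,
        time && time.text,
        replies,
        views].inspect
end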
Source: (StackOverflow)
Ok so I need to download some web pages using Python and did a quick investigation of my options.
Included with Python:
urllib - seems to me that I should use urllib2 instead. urllib has no cookie support, HTTP/FTP/local files only (no SSL)
urllib2 - complete HTTP/FTP client, supports most needed things like cookies, does not support all HTTP verbs (only GET and POST, no TRACE, etc.)
Full featured:
mechanize - can use/save Firefox/IE cookies, take actions like following the second link on a page, actively maintained (0.2.5 released in March 2011)
PycURL - supports everything curl does (FTP, FTPS, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP), bad news: not updated since Sep 9, 2008 (7.19.0)
New possibilities:
urllib3 - supports connection re-using/pooling and file posting
Deprecated (a.k.a. use urllib/urllib2 instead):
httplib - HTTP/HTTPS only (no FTP)
httplib2 - HTTP/HTTPS only (no FTP)
The first thing that strikes me is that urllib/urllib2/PycURL/mechanize are all pretty mature solutions that work well. mechanize and PycURL ship with a number of Linux distributions (e.g. Fedora 13) and the BSDs, so installation is typically a non-issue (which is good).
urllib2 looks good, but I'm wondering why PycURL and mechanize both seem so popular. Is there something I am missing (i.e., if I use urllib2, will I paint myself into a corner at some point)? I'd really like some feedback on the pros and cons of these options so I can make the best choice for myself.
Edit: added note on verb support in urllib2
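For context on the cookie point above, a minimal urllib2 sketch (Python 2) using cookielib; the URLs and form data are placeholders:
import cookielib
import urllib2

jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
opener.addheaders = [("User-Agent", "my-downloader/0.1")]

# POST a login form; any Set-Cookie headers land in the jar
opener.open("http://example.com/login", "user=alice&pass=secret")

# later requests send the stored cookies automatically
html = opener.open("http://example.com/protected").read()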
Source: (StackOverflow)
I'm looking for a library that has functionality similar to Perl's WWW::Mechanize, but for PHP. Basically, it should allow me to submit HTTP GET and POST requests with a simple syntax, and then parse the resulting page and return, in a simple format, all forms and their fields, along with all the links on the page.
I know about cURL, but it's a little too barebones, and the syntax is pretty ugly (tons of curl_foo($curl_handle, ...) statements).
Clarification:
I want something more high-level than the answers so far. For example, in Perl, you could do something like:
# navigate to the main page
$mech->get( 'http://www.somesite.com/' );

# follow a link that contains the text 'download this'
$mech->follow_link( text_regex => qr/download this/i );

# submit a POST form, to log into the site
$mech->submit_form(
    with_fields => {
        username => 'mungo',
        password => 'lost-and-alone',
    }
);

# save the results as a file
$mech->save_content('somefile.zip');
To do the same thing using HTTP_Client, wget, or cURL would be a lot of work: I'd have to manually parse the pages to find the links, find the form URL, extract all the hidden fields, and so on. The reason I'm asking for a PHP solution is that I have no experience with Perl, and while I could probably build what I need with a lot of work, it would be much quicker if I could do the above in PHP.
Source: (StackOverflow)
I am attempting to have mechanize select a form from a page, but the form in question has no "name" attribute in the HTML. What should I do? When I try to use
br.select_form(name = "")
I get an error saying that no form is declared with that name, and the function requires a name input. There is only one form on the page; is there some other way I can select that form?
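A sketch of the positional alternative: select_form also accepts nr, a zero-based index over the forms on the page, so the single unnamed form can be selected without a name (the URL and field name below are placeholders):
import mechanize

br = mechanize.Browser()
br.open("http://example.com/page-with-one-form")

br.select_form(nr=0)            # the first (and only) form on the page
br["some_field"] = "a value"    # placeholder field name
response = br.submit()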
Source: (StackOverflow)
I'm connecting to a web site and logging in.
The website redirects me to new pages, and Mechanize deals with all the cookie and redirection jobs, but I can't get to the last page. I used Firebug, did the same job again manually, and saw that there are two more pages I have to pass through with Mechanize.
I took a quick look at those pages and saw some JavaScript and HTML code, but I couldn't understand it because it doesn't look like normal page code. What are those pages for? How can they redirect to other pages? What should I do to pass through them?
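In case it helps to frame the question: such intermediate pages are often either a <meta http-equiv="refresh"> hop or a small form that JavaScript auto-submits. A sketch of handling both in Ruby Mechanize, under that assumption (the URL is a placeholder):
require 'mechanize'

agent = Mechanize.new
agent.follow_meta_refresh = true   # follow <meta http-equiv="refresh"> hops automatically

page = agent.get('https://example.com/after-login')  # placeholder

# If an intermediate page is a form that JavaScript would auto-submit
# (typically only hidden fields), submit it explicitly:
if (form = page.forms.first)
  page = agent.submit(form)
end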
Source: (StackOverflow)
I want to use mechanize to simulate browsing to a web page with active JavaScript (including DOM events) and AJAX, and so far I've found no way to do that.
I've looked at some Python client browsers that support JavaScript, like spynner and zope, and none of them really works for me (spynner crashes PyQt all the time, and zope doesn't seem to support JS).
Is there any way to simulate browsing with Python only, with no extra processes (like WATIR or libraries that manipulate Firefox and IE), while fully supporting JavaScript, as if actually browsing the page?
Thanks ahead
Source: (StackOverflow)