mechanize
Mechanize is a Ruby library that makes automated web interaction easy.
RDoc Documentation
I'm trying to do the following:
page = Mechanize.new.get "https://sis-app.sph.harvard.edu:9030/prod/bwckschd.p_disp_dyn_sched"
But I only get this exception:
OpenSSL::SSL::SSLError: SSL_connect returned=1 errno=0 state=SSLv2/v3 read server hello A: sslv3 alert illegal parameter
from /Users/amosng/.rvm/gems/ruby-1.9.3-p194/gems/net-http-persistent-2.7/lib/net/http/persistent/ssl_reuse.rb:70:in `connect'
from /Users/amosng/.rvm/gems/ruby-1.9.3-p194/gems/net-http-persistent-2.7/lib/net/http/persistent/ssl_reuse.rb:70:in `block in connect'
from /Users/amosng/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/timeout.rb:54:in `timeout'
from /Users/amosng/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/timeout.rb:99:in `timeout'
from /Users/amosng/.rvm/gems/ruby-1.9.3-p194/gems/net-http-persistent-2.7/lib/net/http/persistent/ssl_reuse.rb:70:in `connect'
from /Users/amosng/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:755:in `do_start'
from /Users/amosng/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:750:in `start'
from /Users/amosng/.rvm/gems/ruby-1.9.3-p194/gems/net-http-persistent-2.7/lib/net/http/persistent.rb:511:in `connection_for'
from /Users/amosng/.rvm/gems/ruby-1.9.3-p194/gems/net-http-persistent-2.7/lib/net/http/persistent.rb:806:in `request'
from /Users/amosng/.rvm/gems/ruby-1.9.3-p194/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:258:in `fetch'
from /Users/amosng/.rvm/gems/ruby-1.9.3-p194/gems/mechanize-2.5.1/lib/mechanize.rb:407:in `get'
from (irb):549
from /Users/amosng/.rvm/rubies/ruby-1.9.3-p194/bin/irb:16:in `<main>'
How can I get the webpage to load in Mechanize?
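One thing worth trying, as a sketch rather than a definitive fix: the "illegal parameter" alert usually means the client and server could not agree on a protocol version or cipher during the handshake, so forcing a specific SSL/TLS version on the underlying connection (or, for testing only, relaxing certificate verification) sometimes gets past it. The attribute paths below assume Mechanize 2.x on top of net-http-persistent, as in the backtrace:
require 'mechanize'
require 'openssl'

agent = Mechanize.new

# Force a specific handshake version on the underlying net-http-persistent
# connection; which value the server accepts is a guess ('SSLv3', 'TLSv1', ...).
agent.agent.http.ssl_version = 'TLSv1'

# Only if certificate verification itself is the problem (insecure, testing only):
# agent.verify_mode = OpenSSL::SSL::VERIFY_NONE

page = agent.get 'https://sis-app.sph.harvard.edu:9030/prod/bwckschd.p_disp_dyn_sched'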
Source: (StackOverflow)
I'm using Mechanize to facilitate the downloading of some files. At the moment my script uses the following line to actually download the files...
agent.get('http://example.com/foo').save_as 'a_file_name'
However, this downloads the complete file into memory before dumping it to disk. How can I avoid this behaviour and download straight to disk instead? If I need to use something other than WWW::Mechanize, how would I go about using WWW::Mechanize's cookies with it?
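One approach, sketched under the assumption of Mechanize 2.x: register Mechanize::Download as the default pluggable parser so the response is not parsed into a Mechanize::Page, then save it:
require 'mechanize'

agent = Mechanize.new

# Treat responses as downloads instead of parsing them as HTML pages
agent.pluggable_parser.default = Mechanize::Download

agent.get('http://example.com/foo').save('a_file_name')
This skips building a parsed page; whether the body is fully streamed to disk still depends on the Mechanize version, so for very large files you may prefer to hand the request to Net::HTTP yourself while reusing agent.cookie_jar for the cookies.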
Source: (StackOverflow)
I'm working in Ruby, but my question is valid for other languages as well.
I have a Mechanize-driven application. The server I'm talking to sets a cookie using JavaScript (rather than a standard Set-Cookie header), so Mechanize doesn't catch the cookie. I need to pass that cookie back on the next GET request.
The good news is that I already know the value of the cookie, but I don't know how to tell Mechanize to include it in my next GET request.
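A sketch of what that could look like in Ruby Mechanize (the cookie name, value, and domain are placeholders for the ones you already know, and the cookie-jar API varies slightly between Mechanize versions):
require 'mechanize'

agent = Mechanize.new

cookie = Mechanize::Cookie.new('session_id', 'the-value-you-already-know')
cookie.domain = '.example.com'   # must match the host of the next request
cookie.path   = '/'

# depending on version: add!(cookie), add(uri, cookie), or add(cookie)
agent.cookie_jar.add!(cookie)

page = agent.get('http://example.com/next-page')  # the cookie is now sent along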
Source: (StackOverflow)
I know there is a set of pre-defined aliases you can use, by setting agent.user_agent_alias = 'Linux Mozilla' for instance, but what if I want to set my own user agent? I'm writing a web crawler and want to identify it to the sites I'm indexing, just like Googlebot does.
There seems to be a user_agent method, but I can't seem to find any documentation about its function.
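For reference, a minimal sketch: user_agent= just sets the raw User-Agent string sent with every request, so you can identify the crawler however you like (the string below is only an example):
require 'mechanize'

agent = Mechanize.new
agent.user_agent = 'MyCrawler/1.0 (+http://example.com/crawler-info)'

page = agent.get('http://example.com/')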
Source: (StackOverflow)
I want to screen-scrape a web-site that uses JavaScript.
There is mechanize, the programmatic web browser for Python. However, it (understandably) doesn't interpret JavaScript. Is there any programmatic browser for Python which does? If not, is there any JavaScript implementation in Python that I could use to attempt to create one?
Source: (StackOverflow)
I'm sorry to have to ask something like this, but Python's mechanize documentation seems to be really lacking and I can't figure this out. They only give one example that I can find for following a link:
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
But I don't want to use a regex; I just want to follow a link based on its URL. How would I do this? Also, what is the "nr" argument that is sometimes used when following links?
Thanks for any info
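For reference, a sketch based on the keyword arguments mechanize's find_link/follow_link accept (text, text_regex, url, url_regex, nr, ...); nr is simply a zero-based index used to pick among the links that match the other criteria, defaulting to 0, the first match. The URLs below are placeholders:
import mechanize

br = mechanize.Browser()
br.open("http://example.com/")

# follow a link by its href exactly as it appears in the page
response = br.follow_link(url="downloads/cheese.zip")

# alternatively, match part of the URL with a regex; nr picks among matches (0-based)
# response = br.follow_link(url_regex=r"cheese", nr=1)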
Source: (StackOverflow)
I have a mechanize script written in Python that fills out a web form and is supposed to click the 'create' button. But there's a problem: the form has two buttons, one for 'add attached file' and one for 'create'. Both are of type 'submit', and the attach button is listed first. So when I select the form and call br.submit(), it clicks the 'attach' button instead of 'create'. Extensive Googling has yielded nothing useful for selecting a specific button in a form. Does anyone know of a way to skip over the first 'submit' button and click the second?
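A sketch of two ways that should work: Browser.submit() passes its arguments to the form's click(), so a submit control can be chosen by its name attribute or by position (nr is a zero-based index over the form's submit buttons). The form index and button name below are assumptions about the page:
import mechanize

br = mechanize.Browser()
br.open("http://example.com/new-item")   # placeholder URL
br.select_form(nr=0)                     # the form being filled out

# ... set the form fields here ...

# Option 1: if the 'create' button has a name attribute, pick it by name
response = br.submit(name="create")

# Option 2: pick the second submit-type control, skipping 'add attached file'
# response = br.submit(nr=1)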
Source: (StackOverflow)
Is there a way to get around the following?
httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales; I'm not sure why they would deny access at a certain depth.
I'm using mechanize and BeautifulSoup on Python 2.6.
I'm hoping for a workaround.
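For what it's worth, mechanize only refuses because robots.txt handling is switched on by default; a sketch of turning it off (whether you should, given the site's terms of use, is a separate question):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # stop checking robots.txt before each request
# some sites also block the default Python user agent, so set a descriptive one
br.addheaders = [("User-Agent", "Mozilla/5.0 (compatible; my-scraper)")]

response = br.open("http://www.barnesandnoble.com/")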
Source: (StackOverflow)
I am trying to import a module from a particular directory.
The problem is that if I use sys.path.append(mod_directory) to append the path and then open the Python interpreter, the directory mod_directory gets added to the end of the list sys.path. If I export the PYTHONPATH variable before opening the Python interpreter, the directory gets added to the start of the list. In the latter case I can import the module, but in the former I cannot.
Can somebody explain why this is happening and give me a solution to add mod_directory to the start of the list, inside a Python script?
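A sketch of the usual fix: sys.path.insert(0, ...) puts the directory at the front of the search path from inside the script, which is effectively what exporting PYTHONPATH does (mod_directory and my_module are placeholders):
import sys

mod_directory = "/path/to/your/modules"   # placeholder

# insert at the front so this directory is searched before anything already on
# sys.path, e.g. an identically named module installed elsewhere
sys.path.insert(0, mod_directory)

import my_module   # now found in mod_directory first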
Source: (StackOverflow)
I installed Ruby and Mechanize. It seems to me that it is possible in Nokogiri to do what I want, but I do not know how to do it.
What about this table? It is just part of the HTML of a vBulletin forum site. I tried to keep the HTML structure but deleted some text and tag attributes. I want to get some details per thread: Title, Author, Date, Time, Replies, and Views.
Please note that there are a few tables in the HTML document. I am after one particular table, the one whose tbody is <tbody id="threadbits_forum_251">. The id will always be the same (I hope). Can I use the tbody and the id in the code?
<table >
  <tbody>
    <tr> <!-- table header --> </tr>
  </tbody>
  <!-- show threads -->
  <tbody id="threadbits_forum_251">
    <tr>
      <td></td>
      <td></td>
      <td>
        <div>
          <a rel='nofollow' href="showthread.php?t=230708" >Vb4 Gold Released</a>
        </div>
        <div>
          <span><a>Paul M</a></span>
        </div>
      </td>
      <td>
        06 Jan 2010 <span class="time">23:35</span><br />
        by <a rel='nofollow' href="member.php?find=lastposter&t=230708">shane943</a>
        </div>
      </td>
      <td><a rel='nofollow' href="#">24</a></td>
      <td>1,320</td>
    </tr>
  </tbody>
</table>
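A sketch of how the tbody id can be used, assuming the page is fetched with Mechanize and queried through Nokogiri via page.search (CSS selectors); the URL and the per-cell selectors are guesses based on the fragment above:
require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com/forumdisplay.php?f=251')  # placeholder URL

# Scope the query to the one tbody we care about, then walk its rows
page.search('tbody#threadbits_forum_251 > tr').each do |row|
  cells = row.search('td')
  next if cells.size < 6                   # skip rows that are not thread rows

  title_link = cells[2].at('a[href*="showthread.php"]')
  author     = cells[2].at('span a')
  date       = cells[3].text[/\d{1,2} \w{3} \d{4}/]
  time       = cells[3].at('span.time')
  replies    = cells[4].text.strip
  views      = cells[5].text.strip

  puts [title_link && title_link.text,
        author && author.text,
        date,
        time && time.text,
        replies,
        views].inspect
end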
Source: (StackOverflow)
Ok so I need to download some web pages using Python and did a quick investigation of my options.
Included with Python:
urllib - seems to me that I should use urllib2 instead. urllib has no cookie support, HTTP/FTP/local files only (no SSL)
urllib2 - complete HTTP/FTP client, supports most needed things like cookies, does not support all HTTP verbs (only GET and POST, no TRACE, etc.)
Full featured:
mechanize - can use/save Firefox/IE cookies, take actions like following the second link on a page, actively maintained (0.2.5 released in March 2011)
PycURL - supports everything curl does (FTP, FTPS, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP), bad news: not updated since Sep 9, 2008 (7.19.0)
New possibilities:
urllib3 - supports connection re-using/pooling and file posting
Deprecated (a.k.a. use urllib/urllib2 instead):
httplib - HTTP/HTTPS only (no FTP)
httplib2 - HTTP/HTTPS only (no FTP)
The first thing that strikes me is that urllib/urllib2/PycURL/mechanize are all pretty mature solutions that work well. mechanize and PycURL ship with a number of Linux distributions (e.g. Fedora 13) and the BSDs, so installation is typically a non-issue (which is good).
urllib2 looks good, but I'm wondering why PycURL and mechanize both seem so popular. Is there something I am missing (i.e., if I use urllib2, will I paint myself into a corner at some point)? I'd really like some feedback on the pros and cons of these options so I can make the best choice for myself.
Edit: added note on verb support in urllib2
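For context on the cookie point above, a minimal urllib2 sketch (Python 2) using cookielib; the URLs and form data are placeholders:
import cookielib
import urllib2

jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
opener.addheaders = [("User-Agent", "my-downloader/0.1")]

# POST a login form; any Set-Cookie headers land in the jar
opener.open("http://example.com/login", "user=alice&pass=secret")

# later requests send the stored cookies automatically
html = opener.open("http://example.com/protected").read()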
Source: (StackOverflow)
I'm looking for a library that has functionality similar to Perl's WWW::Mechanize, but for PHP. Basically, it should allow me to submit HTTP GET and POST requests with a simple syntax, and then parse the resulting page and return, in a simple format, all forms and their fields, along with all the links on the page.
I know about cURL, but it's a little too barebones, and the syntax is pretty ugly (tons of curl_foo($curl_handle, ...) statements).
Clarification:
I want something more high-level than the answers so far. For example, in Perl, you could do something like:
# navigate to the main page
$mech->get( 'http://www.somesite.com/' );

# follow a link that contains the text 'download this'
$mech->follow_link( text_regex => qr/download this/i );

# submit a POST form, to log into the site
$mech->submit_form(
    with_fields => {
        username => 'mungo',
        password => 'lost-and-alone',
    }
);

# save the results as a file
$mech->save_content('somefile.zip');
To do the same thing using HTTP_Client, wget, or cURL would be a lot of work: I'd have to manually parse the pages to find the links, find the form URL, extract all the hidden fields, and so on. The reason I'm asking for a PHP solution is that I have no experience with Perl, and while I could probably build what I need with a lot of work, it would be much quicker if I could do the above in PHP.
Source: (StackOverflow)
I am attempting to have mechanize select a form from a page, but the form in question has no "name" attribute in the HTML. What should I do? When I try to use
br.select_form(name = "")
I get an error saying that no form is declared with that name, and the function requires a name input. There is only one form on the page; is there some other way I can select that form?
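A sketch of the positional alternative: select_form also accepts nr, a zero-based index over the forms on the page, so the single unnamed form can be selected without a name (the URL and field name below are placeholders):
import mechanize

br = mechanize.Browser()
br.open("http://example.com/page-with-one-form")

br.select_form(nr=0)            # the first (and only) form on the page
br["some_field"] = "a value"    # placeholder field name
response = br.submit()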
Source: (StackOverflow)
I'm connecting to a web site and logging in.
The website redirects me to new pages, and Mechanize deals with all the cookie and redirection jobs, but I can't get to the last page. I used Firebug, did the same job again manually, and saw that there are two more pages I have to pass through with Mechanize.
I took a quick look at those pages and saw some JavaScript and HTML code, but I couldn't understand it because it doesn't look like normal page code. What are those pages for? How can they redirect to other pages? What should I do to pass through them?
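In case it helps to frame the question: such intermediate pages are often either a <meta http-equiv="refresh"> hop or a small form that JavaScript auto-submits. A sketch of handling both in Ruby Mechanize, under that assumption (the URL is a placeholder):
require 'mechanize'

agent = Mechanize.new
agent.follow_meta_refresh = true   # follow <meta http-equiv="refresh"> hops automatically

page = agent.get('https://example.com/after-login')  # placeholder

# If an intermediate page is a form that JavaScript would auto-submit
# (typically only hidden fields), submit it explicitly:
if (form = page.forms.first)
  page = agent.submit(form)
end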
Source: (StackOverflow)
I want to use mechanize to simulate browsing to a web page with active JavaScript (including DOM events) and AJAX, and so far I've found no way to do that.
I've looked at some Python client browsers that support JavaScript, like spynner and zope, and none of them really works for me (spynner crashes PyQt all the time, and zope doesn't seem to support JS).
Is there any way to simulate browsing with Python only, with no extra processes (like WATIR or libraries that manipulate Firefox and IE), while fully supporting JavaScript, as if actually browsing the page?
Thanks ahead
Source: (StackOverflow)