pyquery in Python

Getting attributes in PyQuery?

I'm using PyQuery and want to print a list of links, but can't figure out how to get the href attribute from each link in the PyQuery syntax.

This is my code:

  e = pq(url=results_url)
  links = e('li.moredetails a')
  print len(links)
  for link in links:
    print link.attr('href')

This prints 10, then gives the following error:

AttributeError: 'HtmlElement' object has no attribute 'attr'

What am I doing wrong?

Source: (StackOverflow)

Why can this unbound variable work in Python (pyquery)?

The code is from the pyquery guide:

from pyquery import PyQuery
d = PyQuery('<p class="hello">Hi</p><p>Bye</p>')
d('p').filter(lambda i: PyQuery(this).text() == 'Hi')

My question is: the name 'this' in the 3rd line is an unbound variable that is never defined in the current environment, yet the code above still works.

How can it work? Why doesn't it raise NameError: name 'this' is not defined?

It seems that something happens at https://bitbucket.org/olauzanne/pyquery/src/c148e4445f49/pyquery/pyquery.py#cl-478 , could anybody explain it?
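
One plausible explanation, not checked against that exact revision: before invoking the callback, pyquery stores the current element in the function's global namespace under the name this, so the lookup succeeds at call time. The sketch below imitates that mechanism in plain Python 3; filter_with_this is a hypothetical helper, not part of pyquery, and plain strings stand in for elements.

```python
def filter_with_this(items, predicate):
    """Keep items for which predicate(index) is true, exposing each item as `this`."""
    kept = []
    for index, item in enumerate(items):
        # Inject the current item into the globals the predicate was defined in.
        predicate.__globals__['this'] = item
        try:
            if predicate(index):
                kept.append(item)
        finally:
            # Remove the injected name again so it does not leak.
            del predicate.__globals__['this']
    return kept


# `this` is resolved as a global when the lambda runs, not when it is defined.
result = filter_with_this(['Hi', 'Bye'], lambda i: this == 'Hi')
print(result)
```

Because name lookup happens when the lambda body executes, no NameError is raised as long as the name exists in the function's globals at that moment.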

Source: (StackOverflow)

Iterating over objects in pyquery

I'm scraping a page with Python's pyquery, and I'm kinda confused by the types it returns, and in particular how to iterate over a list of results.

If my HTML looks a bit like this:

<div class="formwrap">blah blah <h3>Something interesting</h3></div>
<div class="formwrap">more rubbish <h3>Something else interesting</h3></div>

How do I get the inside of the <h3> tags, one by one so I can process them? I'm trying:

results_page = pq(response.read())
formwraps = results_page(".formwrap") 
print type(formwraps)
print type([formwraps])
for my_div in [formwraps]:
    print type(my_div)
    print my_div("h3").text()

This produces:

<class 'pyquery.pyquery.PyQuery'>
<type 'list'>
<class 'pyquery.pyquery.PyQuery'>
Something interesting Something else interesting

It looks like there's no actual iteration going on. How can I pull out each element individually?

Extra question from a newbie: what are the square brackets in [formwraps] doing? It looks like they convert a special PyQuery object to a list. Is [] a standard Python operator?

------UPDATE--------

I've found an 'each' function in the pyquery docs. However, I don't understand how to use it for what I want. Say I just want to print out the content of the <h3>. This produces a syntax error: why?

formwraps.each(lambda e: print e("h3").text())

Source: (StackOverflow)

PyQuery get text node

I'm using PyQuery to process this HTML:

<div class="container">
    <strong>Personality: Strengths</strong>
    <br />
    Text
    <br />
    <br />
    <strong>Personality: Weaknesses</strong>
    <br />
    Text
    <br />
    <br />
</div>

Now that I've got a variable e pointing to .container, I'm looping through its children:

for c in e.iterchildren():
    print c.tag

But this way I can't get the text nodes (the two Text strings).

How can I loop over an element's children, including the text nodes?

Source: (StackOverflow)

How to use pyquery to modify a node attribute in Python

I want to use pyquery to do this.

For example:

html='<div>arya stark<img src="1111"/>ahahah<img src="2222"/></div>'
a=PyQuery(html)

I want to modify the HTML to:

<div>arya stark<img src="aaaa"/>ahahah<img src="bbbb"/></div>

In other words, I just need to change each img element's src attribute and get the modified HTML.

Any ideas? Or is there another method?

Thanks.

Source: (StackOverflow)

Find tag name of pyquery object

for l in d.items('nl,de,en'):
   if l.tag()=='nl':
      dothis()

How can I find the tag associated with a pyquery object? The method tag() in the example above doesn't exist...

Source: (StackOverflow)

How do I access the first item (or xth item) in a PyQuery query?

I have a query for one of my tests that returns 2 results: specifically, the 3rd level of an outline, found using

query = html("ul ol ul")

How do I select the first or second unordered list?

query[0]

decays to an HtmlElement

list(query.items())[0]

or

query.items().next() #(in case of the first element)

Is there a better way that I can't see?

Note:

query = html("ul ol ul :first")

gets the first element of each list, not the first list.

Source: (StackOverflow)

Python/PyQuery: Unable to find vcvarsall.bat?

I have Python 2.7 and I was trying to use PyQuery, so for a test I just typed "import pyquery" and I got an error:

Traceback (most recent call last):
  File "C:\Users\Jacob\Documents\dupes.py", line 1, in <module>
    import pyquery
  File "C:\Python27\lib\site-packages\pyquery-1.2.1-py2.7.egg\pyquery\__init__.py", line 12, in <module>
    from .pyquery import PyQuery
  File "C:\Python27\lib\site-packages\pyquery-1.2.1-py2.7.egg\pyquery\pyquery.py", line 8, in <module>
    from lxml import etree
ImportError: No module named lxml

So I went to the command prompt and tried to install lxml, but I got this:

Building lxml version 2.3.5.
Building without Cython.
ERROR: 'xslt-config' is not recognized as an internal or external command,
operable program or batch file.

** make sure the development packages of libxml2 and libxslt are installed **

Using build configuration of libxslt
error: Setup script exited with error: Unable to find vcvarsall.bat

I don't really understand what's wrong or what I should do...can someone help?

Thanks.

EDIT:

In response to the comment, I used easy_install...

Source: (StackOverflow)

How to unescape special characters while converting pyquery object to string

I am trying to fetch a remote page with the Python requests module, reconstruct a DOM tree, do some processing and save the result to a file. When I fetch a page and then just write it to the file, everything works (I can open the HTML file later in the browser and it is rendered correctly).

However, if I create a pyquery object, do some processing and then save it using str conversion, it fails. Specifically, special characters such as && get modified within the script tags of the saved source (caused by applying pyquery), which prevents the page from rendering correctly.

Here is my code:

import requests
from lxml import etree
from pyquery import PyQuery as pq

user_agent = {'User-agent': 'Mozilla/5.0'}
r = requests.get('http://www.google.com',headers=user_agent, timeout=4)

DOM = pq(r.text)
#some optional processing
fTest = open("fTest.html","wb")
fTest.write(str(DOM))
fTest.close()

So, the question is: how can I make sure that special characters aren't escaped after applying pyquery? I suppose it might be related to lxml (the library pyquery is built on), but after a tedious search online and experiments with different ways of serializing the object I still haven't solved it. Maybe this is also related to unicode handling?!

Many thanks in advance!

Source: (StackOverflow)

PyQuery: Get only text of element, not text of child elements

I have the following HTML:

<h1 class="price">
 <span class="strike">$325.00</span>$295.00
</h1>

I'd like to get the $295.00 out. However, if I simply use PyQuery as follows:

price = pq('h1').text()

I get both prices.

Extracting only direct child text for an element in jQuery looks reasonably complicated - is there a way to do it at all in PyQuery?

Currently I'm extracting the first price separately, then using replace to remove it from the text, which is a bit fiddly.

Thanks for your help.

Source: (StackOverflow)

How to get text content of multiple tags inside a table using PyQuery?

How do I select the text of each field (key and value) from the book-details table below, where the values are plain text inside the cells?

    <table cellspacing="0" class="fk-specs-type2">
        <tr>
            <th class="group-head" colspan="2">Book Details</th>
        </tr>
        <tr>
            <td class="specs-key">Publisher</td>
            <td class="specs-value fk-data">HARPER COLLINS INDIA</td>
        </tr>
        <tr>
            <td class="specs-key">ISBN-13</td>
            <td class="specs-value fk-data">9789350291924</td>
        </tr>
    </table>

Source: (StackOverflow)

How to use Pyquery with scrapy?

My objective is to use pyquery with scrapy, but from scrapy.selector import PyQuerySelector raises ImportError: cannot import name PyQuerySelector when I crawl the spider.

I followed this specific gist https://gist.github.com/joehillen/795180 to implement pyquery.

Any suggestions or tutorials that can help me get this job done?

Source: (StackOverflow)

Entire JSON into One SQLite Field with Python

I have what is likely an easy question. I'm trying to pull a JSON from an online source, and store it in a SQLite table. In addition to storing the data in a rich table, corresponding to the many fields in the JSON, I would like to also just dump the entire JSON into a table every time it is pulled.

The table looks like:

CREATE TABLE Raw_JSONs (ID INTEGER PRIMARY KEY ASC, T DATE DEFAULT (datetime('now','localtime')), JSON text);

I've pulled a JSON from some URL using the following python code:

from pyquery import PyQuery
from lxml import etree
import urllib

x = PyQuery(url='json')
y = x('p').text()

Now, I'd like to execute the following INSERT command:

import sqlite3

db = sqlite3.connect('a.db')
c = db.cursor()

c.execute("insert into Raw_JSONs values(NULL,DATETIME('now'),?)", y)

But I'm told that I've supplied the incorrect number of bindings (i.e. thousands, instead of just 1). I gather it's reading the y variable as all the different elements of the JSON.

Can someone help me store just the JSON, in its entirety?

Also, as I'm obviously new to this JSON game, any online resources to recommend would be amazing.

Thanks!
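
The likely cause: execute() expects a sequence of parameters, and a bare string is itself a sequence of characters, so every character is counted as one binding. Wrapping the string in a one-element tuple should fix that. A sketch in Python 3 syntax, using an in-memory database and a made-up payload in place of the scraped text:

```python
import json
import sqlite3

# In-memory database instead of a.db, so the sketch leaves no file behind.
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute(
    "CREATE TABLE Raw_JSONs (ID INTEGER PRIMARY KEY ASC, "
    "T DATE DEFAULT (datetime('now','localtime')), JSON text)"
)

# Hypothetical payload standing in for the text scraped into y.
y = json.dumps({"name": "example", "values": [1, 2, 3]})

# Note the trailing comma: (y,) is a one-element tuple, i.e. ONE binding.
c.execute("insert into Raw_JSONs values(NULL, DATETIME('now'), ?)", (y,))
db.commit()

# Read the row back and decode it to confirm the JSON survived intact.
stored = c.execute("select JSON from Raw_JSONs").fetchone()[0]
print(json.loads(stored))
```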

Source: (StackOverflow)

Stop pyquery inserting spaces where there aren't any in source HTML?

I am trying to get some text from an element, using pyquery 1.2. There are no spaces in the displayed text, but pyquery is inserting spaces.

Here is my code:

from pyquery import PyQuery as pq
html = '<h1><span class="highlight" style="background-color:">Randomized</span> and <span class="highlight" style="background-color:">non-randomized</span> <span class="highlight" style="background-color:">patients</span> in <span class="highlight" style="background-color:">clinical</span> <span class="highlight" style="background-color:">trials</span>: <span class="highlight" style="background-color:">experiences</span> with <span class="highlight" style="background-color:">comprehensive</span> <span class="highlight" style="background-color:">cohort</span> <span class="highlight" style="background-color:">studies</span>.</h1>'
doc = pq(html)
print doc('h1').text()

This produces (note spaces before colon and period):

Randomized and non-randomized patients in clinical trials : 
experiences with comprehensive cohort studies .

How can I stop pyquery inserting spaces into the text?

Source: (StackOverflow)

Convert unicode with utf-8 string as content to str

I'm using pyquery to parse a page:

dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()

But what I get in content is a unicode string whose code points are the UTF-8 bytes of the content:

u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'

How can I convert it to str without losing the content?

To make it clear:

I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
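
Since every code point in that string is below 256, encoding it as Latin-1 maps each character back to the byte with the same value. In Python 2 that encode step alone should yield the desired str. The sketch below uses Python 3 syntax, where the result is a bytes object that can then be decoded as UTF-8:

```python
# The mis-decoded value from the question, written as a Python 3 str.
content = '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

# Latin-1 maps code points 0-255 to identical byte values, so this
# recovers the original UTF-8 byte sequence.
raw_bytes = content.encode('latin-1')

# Decoding those bytes as UTF-8 gives the intended Chinese text.
text = raw_bytes.decode('utf-8')
print(text)
```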

Source: (StackOverflow)