EzDevInfo.com

Goutte

Goutte, a simple PHP Web Scraper

How can I skip or remove a list of html tags from my crawler object using Symfony DomCrawler Component and Goutte for Laravel 4?

This was my first attempt but it did not work.

$this->crawler = $client->request('GET', $this->url);
$document = new \DOMDocument('1.0', 'UTF-8');
$root = $document->appendChild($document->createElement('_root'));
$this->crawler->rewind();
$root->appendChild($document->importNode($this->crawler->current(), true));

$selectorsToRemove = ['script','p'];
foreach ($selectorsToRemove as $selector) {
   $crawlerInverse = $this->crawler->filter($selector);
   foreach ($crawlerInverse as $elementToRemove) {
      $parent = $elementToRemove->parentNode;
      $parent->removeChild($elementToRemove);
    }
}
$this->crawler->clear();
$this->crawler->add($document);

I want to get the "p" tags from this page http://www.amazon.com/dp/B00IOY8XWQ/ref=fs_kv, and it seems that there is some JS inside the paragraph, so when I call $node->text(); I get both the text and the JS from the "script" inside the "p". The structure is like this:

<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut    labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
<script>
 "JS CODE"
</script>
</p>

So I just want the Lorem ipsum text.
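
One possible way to get only the paragraph text is to read the direct text-node children of the "p" and never descend into the "script". The following is a minimal sketch using PHP's built-in DOM extension rather than Goutte itself; the sample markup is shortened and the variable names are made up for illustration.

```php
<?php
// Sketch: read only the text that sits directly inside <p>, skipping <script>.
$html = <<<'HTML'
<html><body>
<p>
Lorem ipsum dolor sit amet.
<script>
 var js = "JS CODE";
</script>
</p>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // scraped markup is rarely valid; keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// text() selects only the direct text-node children of <p>,
// so the contents of the nested <script> are never visited.
$parts = [];
foreach ($xpath->query('//p/text()') as $textNode) {
    $parts[] = $textNode->nodeValue;
}

// Collapse the leftover line breaks and indentation into single spaces.
$paragraphText = trim(preg_replace('/\s+/', ' ', implode(' ', $parts)));

echo $paragraphText, "\n";
```

Note that this also skips text wrapped in nested inline tags such as "b" or "a"; if those matter, removing the "script" nodes first and then reading the whole text content may be the better fit.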


Source: (StackOverflow)

Getting cURL timeout error using Goutte, even with config settings:

Here's the code:

$this->baseUrl = "https://sfbay.craigslist.org/sfc/apa/";
$this->client = new Goutte\Client();
$curlOptions = array(
    CURLOPT_CONNECTTIMEOUT => 600,
    CURLOPT_TIMEOUT => 600
);
$this->client->getClient()->setDefaultOption('config', ['curl' => $curlOptions]);
$crawler = $this->client->request("GET", $this->baseUrl);

And the error (repeated twice): PHP Fatal error: Uncaught exception 'GuzzleHttp\Exception\RequestException' with message '[curl] (#28) See http://curl.haxx.se/libcurl/c/libcurl-errors.html for an explanation of cURL errors [url] https://sfbay.craigslist.org/sfc/apa/' in /Users/...../vendor/guzzlehttp/guzzle/src/Adapter/Curl/MultiAdapter.php:216

The thing is: the code worked an hour ago, no issue! I added the cURL options after finding out error #28 is timeout.

Am I missing a cURL option? Or maybe I'm setting the values wrong? And why the change? I'm not on a significantly slower network (AFAIK).


Source: (StackOverflow)

Mink XML handling not as expected, adds backslashes to feed

I am using Mink with the Goutte webdriver trying to replace the contents of a form in a website with an XML feed.

I coded the following method:

public function replaceField($field)
    {
        $baseText = '<?xml version="1.0" encoding="UTF-8"?>
<RiskAssessmentReply xmlns="http://test.com" >
    <!-- ExternalId of the Order --> 
    <OrderId>TO_REPLACE</OrderId>
    <RiskInfo>
        <Actions>
            <SystemAction>SystemAction</SystemAction>
            <FinalAction>FinalAction</FinalAction>
        </Actions>
        <Score SystemScore="0"/>
    </RiskInfo>
    <!-- One of Accept, Manual_Accept, Reject, Cancel, or Ignore --> 
    <ResponseCode>Accept</ResponseCode>
    <StoreId>TESTSTORE</StoreId>
</RiskAssessmentReply>';

        $textWithOrderId = preg_replace('/TO_REPLACE/', $GLOBALS['ORDER_ID'], $baseText);
        $this->getSession()->getPage()->fillField($field, $textWithOrderId);
    }

This basically holds the XML feed; I then replace part of it with an order ID that I obtained beforehand and call the fillField function that comes bundled with Mink.

The problem is that it does not just paste the text that I provide, but mangles it by putting backslashes before the " symbols, like this:

<?xml version=\"1.0\" encoding=\"UTF-8\"?>

Therefore, when I try to submit the XML feed, the website displays the following error:

Fatal error: Uncaught exception 'ErrorException' with message 'SimpleXMLElement::__construct(): Entity: line 1: parser error : String not started expecting ' or "'

I've tried using PHP's stripslashes function, but it doesn't help: if I echo the text after inserting the order ID, it shows the original XML without slashes. So I'm guessing that fillField calls some other function that adds the backslashes to my text, but I haven't been able to find where.

Does anyone know where this conversion from " to \" happens, so that I can avoid it?

Thanks


Source: (StackOverflow)

How to run PHPUnit from a PHP script?

I am creating a custom testing application using PHPUnit and Goutte. I would like to load the Goutte library (plus any files required for the tests) within my own bootstrap file and then start the PHPUnit test runner once it is all bootstrapped.

I'm not sure how to do this without calling the phpunit script externally (which would be a separate process and wouldn't be able to see my bootstrapped libraries). Has anyone done anything like this before? What is the best way to do it?


Source: (StackOverflow)

How can I scrape website content in PHP from a website that requires a cookie login?

My problem is that it doesn't just require a basic cookie, but rather asks for a session cookie, and for randomly generated IDs. I think this means I need to use a web browser emulator with a cookie jar?

I have tried to use Snoopy, Goutte and a couple of other web browser emulators, but as of yet I have not been able to find tutorials on how to receive cookies. I am getting a little desperate!

Can anyone give me an example of how to accept cookies in Snoopy or Goutte?

Thanks in advance!


Source: (StackOverflow)

Access Guzzle Response from Goutte

I'm trying to access the Guzzle Response object from Goutte, because that object has nice methods that I want to use, getEffectiveUrl for example.

As far as I can see, there is no way of doing it without hacking the code.

Or, without accessing the response object, is there a way to get the last redirected URL from Goutte?


Source: (StackOverflow)

How to use Goutte

Issue:
Cannot fully understand the Goutte web scraper.

Request:
Can someone please help me understand, or provide code that shows, how to use the Goutte web scraper? I have read the README.md. I am looking for more information than it provides, such as what options are available in Goutte and how to write them, or, when working with forms, whether you search for the name= or the id= of the form.

Webpage Layout attempting to be scraped:
Step 1:
The webpage has a form with a radio button to choose what kind of form to fill out (i.e. Name or License). It defaults to Name, with First and Last Name textboxes and a State drop-down select list. If you choose License, jQuery or JavaScript makes the First and Last Name textboxes go away and a License textbox appears.

Step 2:
Once you have successfully submitted the form, it brings you to a page with multiple links. We can go into one of two of them to get the information we need.

Step 3:
Once we have clicked the link we want, the third page has the data we are looking for, and we want to store that data in a PHP variable.

Submitting Incorrect information:
If wrong information is submitted, jQuery/JavaScript shows the message "No records were found." on the same page as the submission.

Note:
The preferred method would be to select the license radio button, fill in the license number, choose the state and then submit the form. I have read tons of posts and blogs and other items about Goutte and nowhere can I find what options are available for Goutte, how you find out this information or how to use this information if it did exist.


Source: (StackOverflow)

Can Goutte/Guzzle be forced into UTF-8 mode?

I'm scraping from a UTF-8 site, using Goutte, which internally uses Guzzle. The site declares a meta tag of UTF-8, thus:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

However, the content type header is thus:

Content-Type: text/html

and not:

Content-Type: text/html; charset=utf-8

Thus, when I scrape, Goutte does not spot that it is UTF-8, and grabs data incorrectly. The remote site is not under my control, so I can't fix the problem there! Here's a set of scripts to replicate the problem. First, the scraper:

<?php

require_once realpath(__DIR__ . '/..') . '/vendor/goutte/goutte.phar';

$url = 'http://crawler-tests.local/utf-8.php';
use Goutte\Client;

$client = new Client();
$crawler = $client->request('get', $url);
$text = $crawler->text();
echo 'Whole page: ' . $text . "\n";

Now a test page to be placed on a web server:

<?php
// Correct
#header('Content-Type: text/html; charset=utf-8');

// Incorrect
header('Content-Type: text/html');
?>  
<!DOCTYPE html>
<html>
    <head>
        <title>UTF-8 test</title>
        <meta charset="utf-8" />
    </head>
    <body>
        <p>When the Content-Header header is incomplete, the pound sign breaks:

        £15,216</p>
    </body>
</html>

Here's the output of the Goutte test:

Whole page: UTF-8 test When the Content-Header header is incomplete, the pound sign breaks: £15,216

As you can see from the comments in the last script, properly declaring the character set in the header fixes things. I've hunted around in Goutte to see if there is anything that looks like it would force the character set, but to no avail. Any ideas?
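
One workaround that does not depend on the response header at all is to turn every non-ASCII character into a numeric entity before the markup reaches the parser. The sketch below shows the idea with PHP's built-in dom extension and assumes mbstring is installed; it is not Goutte's own mechanism, and the sample markup is invented. Symfony's DomCrawler also has an addHtmlContent() method that accepts a charset argument, which may be a more direct route if you can get hold of the raw response body.

```php
<?php
// Sketch: make non-ASCII characters charset-independent before parsing.
// Assumes the mbstring and dom extensions are available.
$html = "<!DOCTYPE html><html><head><title>UTF-8 test</title></head>"
      . "<body><p>Price: \u{00A3}15,216</p></body></html>";

// Replace every code point above ASCII with a numeric entity (e.g. &#163;).
// The markup becomes pure ASCII, so the parser's charset guess no longer matters.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$asciiHtml = mb_encode_numericentity($html, $convmap, 'UTF-8');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // silence warnings about HTML5 markup
$doc->loadHTML($asciiHtml);
libxml_clear_errors();

// The DOM extension hands text back as UTF-8, so the pound sign survives intact.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```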


Source: (StackOverflow)

Goutte - dom crawler - remove node

I have html on my site (http://testsite.com/test.php) :

<div class="first">
  <div class="second">
     <a rel='nofollow' href="/test.php">click</a>
     <span>back</span>
  </div>
</div>
<div class="first">
  <div class="second">
     <a rel='nofollow' href="/test.php">click</a>
     <span>back</span>
  </div>
</div>

I would like to receive:

<div class="first">
  <div class="second">
     <a rel='nofollow' href="/test.php">click</a>
  </div>
</div>
<div class="first">
  <div class="second">
     <a rel='nofollow' href="/test.php">click</a>
  </div>
</div>

So I would like to remove the span. I use Goutte in Symfony2, based on http://symfony.com/doc/current/components/dom_crawler.html :

    $client = new Client();
    $crawler = $client->request('GET', 'http://testsite.com/test.php');

    $crawler->filter('.first .second')->each(function ($node) {
        //??????
    });
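
A minimal sketch of the removal step, written against PHP's built-in DOM extension rather than Goutte: select the unwanted "span" nodes with XPath, detach each one from its parent, and serialize what is left. With a DomCrawler object, the same removeChild() call should be usable on the underlying DOM nodes, but that part is an assumption and is not shown here.

```php
<?php
// Sketch: delete every <span> inside ".first .second" and print the remaining markup.
$html = <<<'HTML'
<div class="first">
  <div class="second">
     <a rel='nofollow' href="/test.php">click</a>
     <span>back</span>
  </div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);              // wraps the fragment in <html><body>
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$spans = $xpath->query('//div[@class="first"]/div[@class="second"]/span');

// XPath results are a snapshot, but copying them into an array keeps the loop
// safe if the query is later swapped for a live list such as getElementsByTagName().
foreach (iterator_to_array($spans) as $span) {
    $span->parentNode->removeChild($span);
}

// Serialize only the <body> subtree, i.e. the original fragment minus the spans.
$result = $doc->saveHTML($doc->getElementsByTagName('body')->item(0));
echo $result, "\n";
```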

Source: (StackOverflow)

How to extract data with Goutte Crawler?

This code returns the hrefs to the content. Now I want to extract content from the pages behind these hrefs and send it to my view. The divs I need to extract are:

<div class="c_pad">
  <div class="c_label">
    <span class="std_header2">Contact:</span>
  </div>
<div class="c_name">
  <span class="std_text_b">Monkey</span>
</div>
<div class="clear"></div>
</div>

<div class="c_pad">
    <div class="c_label">
      <span class="std_header2">Phone number:</span>
    </div>
    <div class="c_phone">
      <span class="std_text_b">001111111</span>
    </div>
    <div class="clear"></div>
</div>

for ($i = 0; $i <= 1; $i++) {
    $p = new Client();
    $d = $p->request('GET', $link . '&std=1&results=' . $i);
    $n = $d->filter('a[class="o_title"]')->each(function ($node) {
        $pp = new Client();
        $dd = $pp->request('GET', $node->attr('href'));
        // "use ($node)" makes the outer link node visible inside the inner closure
        $kk = $dd->filter('div[id="adv_desc"]')->each(function ($tekst) use ($node) {
            echo $node->attr('href') . '<br>' . $tekst->text();
        });
    });
}
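
For the extraction itself, one option is to address each value by the class of its wrapping div and collect the results into an array that can be handed to the view. This is a sketch with PHP's built-in DOM extension on a condensed copy of the markup above; firstText() and the array keys are hypothetical names introduced only for this example.

```php
<?php
// Sketch: pull the contact name and phone number out of the "c_pad" blocks.
$html = <<<'HTML'
<div class="c_pad">
  <div class="c_label"><span class="std_header2">Contact:</span></div>
  <div class="c_name"><span class="std_text_b">Monkey</span></div>
  <div class="clear"></div>
</div>
<div class="c_pad">
  <div class="c_label"><span class="std_header2">Phone number:</span></div>
  <div class="c_phone"><span class="std_text_b">001111111</span></div>
  <div class="clear"></div>
</div>
HTML;

// Hypothetical helper: trimmed text of the first node matching $query, or null.
function firstText(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Each value is identified by the class of the div that wraps it.
$data = [
    'contact' => firstText($xpath, '//div[@class="c_name"]/span'),
    'phone'   => firstText($xpath, '//div[@class="c_phone"]/span'),
];

print_r($data);
```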

Source: (StackOverflow)

Behat & Mink : Use the test environment

I'm currently using Behat with Mink and the Goutte driver. When I use it with my dev environment, via the app_dev.php file (a typical app_dev.php from a Symfony2 Standard Edition), my tests work just fine (Gists).

But if I use an app_test file (the same as the app_dev file, except that the environment parameter is set to "test" instead of "dev" and debug mode is disabled), then in the logout scenario Goutte can't find the "user_signup" identifier, and in the "login" scenario it does not find the "Root" text node. Indeed, when I print the last response, it seems that the user is simply not logged in: I still see the login forms...

When I'm on my dev environment (app_dev) or prod environment (app), everything seems to work just fine though... Any idea?

(If you think you need some other files, please tell me).


Source: (StackOverflow)

Confusion between BehatContext and MinkContext

I am trying to do BDD in my Symfony 2.3 project and appear to be struggling with some inconsistencies.

Depending on whether I use BehatContext or MinkContext as my base class for the FeatureContext class, I am getting different results.

If I use:

class FeatureContext extends BehatContext 

everything is fine. However, if I use:

class FeatureContext extends MinkContext

I get errors that make it look as if MinkContext no longer likes the regexes that the system generated itself. Can you please help me understand:

FFFFFFF

(::) failed steps (::)

01. Ambiguous match of "I am on homepage":
    to `/^I am on homepage$/` from Main\ReferralCaptureBundle\Features\Context\FeatureContext::iAmOnHomepage()
    to `/^(?:|I )am on (?:|the )homepage$/` from Behat\MinkExtension\Context\MinkContext::iAmOnHomepage()
    In step `Given I am on homepage'.
    From scenario `Successful registration when user provides all the required info'. # src/Main/ReferralCaptureBundle/Features/registration.feature:12
    Of feature `registration'.                                                        # src/Main/ReferralCaptureBundle/Features/registration.feature

02. Ambiguous match of "I follow "sign up"":
    to `/^I follow "([^"]*)"$/` from Main\ReferralCaptureBundle\Features\Context\FeatureContext::iFollow()
    to `/^(?:|I )follow "(?P<link>(?:[^"]|\\")*)"$/` from Behat\MinkExtension\Context\MinkContext::clickLink()
    In step `And I follow "sign up"'.
    From scenario `Successful registration when user provides all the required info'. # src/Main/ReferralCaptureBundle/Features/registration.feature:12
    Of feature `registration'.                                                        # src/Main/ReferralCaptureBundle/Features/registration.feature

03. Ambiguous match of "I fill in "username" with "email@email.com"":
    to `/^I fill in "([^"]*)" with "([^"]*)"$/` from Main\ReferralCaptureBundle\Features\Context\FeatureContext::iFillInWith()
    to `/^(?:|I )fill in "(?P<field>(?:[^"]|\\")*)" with "(?P<value>(?:[^"]|\\")*)"$/` from Behat\MinkExtension\Context\MinkContext::fillField()
    In step `When I fill in "username" with "email@email.com"'.
    From scenario `Successful registration when user provides all the required info'. # src/Main/ReferralCaptureBundle/Features/registration.feature:12
    Of feature `registration'.                                                        # src/Main/ReferralCaptureBundle/Features/registration.feature

04. Ambiguous match of "I fill in "password" with "password123"":
    to `/^I fill in "([^"]*)" with "([^"]*)"$/` from Main\ReferralCaptureBundle\Features\Context\FeatureContext::iFillInWith()
    to `/^(?:|I )fill in "(?P<field>(?:[^"]|\\")*)" with "(?P<value>(?:[^"]|\\")*)"$/` from Behat\MinkExtension\Context\MinkContext::fillField()
    In step `And I fill in "password" with "password123"'.
    From scenario `Successful registration when user provides all the required info'. # src/Main/ReferralCaptureBundle/Features/registration.feature:12
    Of feature `registration'.                                                        # src/Main/ReferralCaptureBundle/Features/registration.feature

05. Ambiguous match of "I press "register"":
    to `/^I press "([^"]*)"$/` from Main\ReferralCaptureBundle\Features\Context\FeatureContext::iPress()
    to `/^(?:|I )press "(?P<button>(?:[^"]|\\")*)"$/` from Behat\MinkExtension\Context\MinkContext::pressButton()
    In step `And I press "register"'.
    From scenario `Successful registration when user provides all the required info'. # src/Main/ReferralCaptureBundle/Features/registration.feature:12
    Of feature `registration'.                                                        # src/Main/ReferralCaptureBundle/Features/registration.feature

06. Ambiguous match of "I should see "You have successfully registered"":
    to `/^I should see "([^"]*)"$/` from Main\ReferralCaptureBundle\Features\Context\FeatureContext::iShouldSee()
    to `/^(?:|I )should see "(?P<text>(?:[^"]|\\")*)"$/` from Behat\MinkExtension\Context\MinkContext::assertPageContainsText()
    In step `Then I should see "You have successfully registered"'.
    From scenario `Successful registration when user provides all the required info'. # src/Main/ReferralCaptureBundle/Features/registration.feature:12
    Of feature `registration'.                                                        # src/Main/ReferralCaptureBundle/Features/registration.feature

07. Ambiguous match of "I should be on homepage":
    to `/^I should be on homepage$/` from Main\ReferralCaptureBundle\Features\Context\FeatureContext::iShouldBeOnHomepage()
    to `/^(?:|I )should be on (?:|the )homepage$/` from Behat\MinkExtension\Context\MinkContext::assertHomepage()
    In step `And I should be on homepage'.
    From scenario `Successful registration when user provides all the required info'. # src/Main/ReferralCaptureBundle/Features/registration.feature:12
    Of feature `registration'.                                                        # src/Main/ReferralCaptureBundle/Features/registration.feature

1 scenario (1 failed)
7 steps (7 failed)

FeatureContext.php

<?php

namespace Main\ReferralCaptureBundle\Features\Context;

use Main\ReferralCaptureBundle\Features\Context\FeatureContext;

use Symfony\Component\HttpKernel\KernelInterface;
use Behat\Symfony2Extension\Context\KernelAwareInterface;
use Behat\MinkExtension\Context\MinkContext;
use Behat\MinkExtension\Context\RawMinkContext;

use Behat\Behat\Context\BehatContext,
    Behat\Behat\Exception\PendingException;
use Behat\Gherkin\Node\PyStringNode,
    Behat\Gherkin\Node\TableNode;

use Goutte\Client;

//
// Require 3rd-party libraries here:
//
   require_once 'PHPUnit/Autoload.php';
   require_once 'PHPUnit/Framework/Assert/Functions.php';
//

/**
 * Feature context.
 */
class FeatureContext extends RawMinkContext //WHAT TO USE HERE!!!!!!
                  implements KernelAwareInterface
{
    private $kernel;
    private $parameters;

    /**
     * Initializes context with parameters from behat.yml.
     *
     * @param array $parameters
     */
    public function __construct(array $parameters)
    {
        $this->parameters = $parameters;
   //     $this->useContext('mink', new MinkContext);
    }

    /**
     * Sets HttpKernel instance.
     * This method will be automatically called by Symfony2Extension ContextInitializer.
     *
     * @param KernelInterface $kernel
     */
    public function setKernel(KernelInterface $kernel)
    {
        $this->kernel = $kernel;
    }

//
// Place your definition and hook methods here:
//
//    /**
//     * @Given /^I have done something with "([^"]*)"$/
//     */
//    public function iHaveDoneSomethingWith($argument)
//    {
//        $container = $this->kernel->getContainer();
//        $container->get('some_service')->doSomethingWith($argument);
//    }
//    
//    
//    
//

    /**
     * @Given /^I am on homepage$/
     */
    public function iAmOnHomepage()
    {
        $client = new Client();
        $crawler = $client->request('GET', 'http://local.referral.com/');

        $link = $crawler->selectLink('I am a Physician')->link();


       if (!count($link)>0)
       {
          throw new Exception("Home Page Not Loaded:\n");   

       }
    }

    /**
     * @Given /^I follow "([^"]*)"$/
     */
    public function iFollow($arg1)
    {
        throw new PendingException();
    }

    /**
     * @When /^I fill in "([^"]*)" with "([^"]*)"$/
     */
    public function iFillInWith($arg1, $arg2)
    {
        throw new PendingException();
    }

    /**
     * @Given /^I press "([^"]*)"$/
     */
    public function iPress($arg1)
    {
        throw new PendingException();
    }

    /**
     * @Then /^I should see "([^"]*)"$/
     */
    public function iShouldSee($arg1)
    {
        throw new PendingException();
    }

    /**
     * @Given /^I should be on homepage$/
     */
    public function iShouldBeOnHomepage()
    {
        throw new PendingException();
    }


}

behat.yml

default:
  formatter:
    name: progress
  extensions:
    Behat\Symfony2Extension\Extension:
      mink_driver: true
      kernel:
        env: test
        debug: true
    Behat\MinkExtension\Extension:
      goutte: ~
      base_url: 'http://local.mysite.com'
      default_session: symfony2
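
The report above lists, for every step, two patterns that both accept the same text: one from the custom FeatureContext and one from MinkContext. A small sketch in plain PHP (no Behat involved) that feeds the first step to the two patterns copied from the report shows why the runner cannot choose between them; the array keys are just labels.

```php
<?php
// Sketch: show that one step text satisfies both definitions named in the report.
$step = 'I am on homepage';

$patterns = [
    'FeatureContext::iAmOnHomepage' => '/^I am on homepage$/',
    'MinkContext::iAmOnHomepage'    => '/^(?:|I )am on (?:|the )homepage$/',
];

$matching = [];
foreach ($patterns as $definition => $regex) {
    if (preg_match($regex, $step) === 1) {
        $matching[] = $definition;
    }
}

// More than one hit is what the runner reports as an "Ambiguous match".
echo count($matching), ' definitions match: ', implode(', ', $matching), "\n";
```

If that reading is right, extending MinkContext means its step definitions come along with it, so the custom steps would need either different wording or to be dropped in favour of the inherited ones.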

Source: (StackOverflow)

DOMCrawler not dumping data properly for parsing

I'm using Symfony, Goutte, and DOMCrawler to scrape a page. Unfortunately, this page has many old-fashioned tables of data and no IDs, classes, or other identifying factors. So I'm trying to find a table by parsing through the source code I get back from the request, but I can't seem to access any information.

I think when I try to filter it, it only filters the first node, and that's not where my desired data is, so it returns nothing.

So I have a $crawler object, and I've tried to loop through the following to get what I want:

$title = $crawler->filterXPath('//td[. = "Title"]/following-sibling::td[1]')->each(function (Crawler $node, $i) {
        return $node->text();
});

I'm not sure what Crawler $node is; I just got it from the example on the web page. Perhaps if I can get this working, then it will loop through each node in the $crawler object and find what I'm actually looking for.

Here's an example of the page:

<table> 
<tr>
    <td>Title</td>
    <td>The Harsh Face of Mother Nature</td>
   <td>The Harsh Face of Mother Nature</td>
</tr>
.
.
.
</table>

And this is just one table, there are many tables and a huge sloppy mess outside of this one. Any ideas?

(Note: earlier I was able to apply a filter to the $crawler object for some information I needed, then serialize() the result and finally get a string, which made sense. But I cannot get a string at all anymore, and I don't know why.)
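
To check the XPath expression independently of Goutte, it can help to run it against a small copy of the table with PHP's built-in DOM extension. This is only a sketch: the second table row is invented for illustration, and normalize-space() is used so that stray whitespace around "Title" would not break the match.

```php
<?php
// Sketch: find the cell that follows the "Title" label cell, without IDs or classes.
$html = <<<'HTML'
<table>
<tr>
    <td>Title</td>
    <td>The Harsh Face of Mother Nature</td>
</tr>
<tr>
    <td>Year</td>
    <td>(made-up row for illustration)</td>
</tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// "//" searches the whole document, so the label cell can sit in any table.
$query = '//td[normalize-space(.) = "Title"]/following-sibling::td[1]';

$titles = [];
foreach ($xpath->query($query) as $cell) {
    $titles[] = trim($cell->textContent);
}

print_r($titles);
```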


Source: (StackOverflow)

Can't Scrape Attribute from Sibling Element

I am trying to scrape some data using Symfony2, Goutte, and DomCrawler. I have a tricky situation where I need to get a value of an attribute inside a <td>.

Working section:

    $query = "//td[normalize-space(text()) = 'Event Title']/following-sibling::td[1]";
    $crawler->filterXPath($query)->each(function($crawler, $i) {
        echo $crawler->text();// . $i . PHP_EOL;
    });


<tr>
    <td>Event Title</td> 
    <td>the title is here</td> 
</tr>

Well, now it's:

<tr>
    <td>Event Title</td> 
    <td><input value="thisiswhatIneed"></td> 
</tr>

And I'm trying to change the selector

$query = "//td[normalize-space(text()) = 'Presenter']/following-sibling::td[1]/input[value]"; 

Any idea how to continue traversing so that I can access the <input> and get the value of its value="" attribute?
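
In XPath, a bare name inside square brackets, as in input[value], tests for a child element called "value", while attributes are addressed with an "@" prefix. A sketch with PHP's built-in DOM extension, using the 'Event Title' label from the working example, reads the attribute by stepping to input/@value:

```php
<?php
// Sketch: read the value="" attribute of the <input> in the cell after the label.
$html = <<<'HTML'
<table>
<tr>
    <td>Event Title</td>
    <td><input value="thisiswhatIneed"></td>
</tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// "/input/@value" steps into the <input> element and selects its value attribute.
$query = "//td[normalize-space(text()) = 'Event Title']"
       . "/following-sibling::td[1]/input/@value";

// Wrapping the path in string() makes evaluate() return a plain PHP string
// (empty if nothing matched) instead of a node list.
$value = $xpath->evaluate('string(' . $query . ')');

echo $value, "\n";
```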


Source: (StackOverflow)

Using goutte to read from a file / string

I'm using Goutte to make a webscraper.

For development, I've saved a .html document I'd like to traverse (so I'm not constantly making requests to the website). Here's what I have so far:

use Goutte\Client;

$client = new Client();
$html=file_get_contents('test.html');
$crawler = $client->request(null,null,[],[],[],$html);

Based on what I know, this should call request in Symfony\Component\BrowserKit and pass in the raw body data. Here's the error message I'm getting:

PHP Fatal error:  Uncaught exception 'GuzzleHttp\Exception\ConnectException' with message 'cURL error 7: Failed to connect to localhost port 80: Connection refused (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)' in C:\Users\Ally\Sites\scrape\vendor\guzzlehttp\guzzle\src\Handler\CurlFactory.

If I were to just use DomCrawler, it's non-trivial to create a crawler using a string. (see: http://symfony.com/doc/current/components/dom_crawler.html). I'm just unsure about how to do the equivalent with Goutte.

Thanks in advance.
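
Since no HTTP request is wanted here, one approach is to skip the client entirely and parse the saved markup directly. The sketch below does this with PHP's built-in DOM extension; a temporary file with invented content stands in for the saved test.html. The DomCrawler page linked above describes building a Crawler from markup in a similar way, which would keep the filter() style of traversal.

```php
<?php
// Sketch: parse saved markup from disk, with no HTTP request involved.
// A temporary file stands in for the saved test.html.
$path = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($path, '<html><body><h1>Saved page</h1><p>Offline copy</p></body></html>');

$html = file_get_contents($path);

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);              // parses the string; nothing is fetched
libxml_clear_errors();

$headline = $doc->getElementsByTagName('h1')->item(0)->textContent;
echo $headline, "\n";

unlink($path);                      // remove the stand-in file again
```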


Source: (StackOverflow)