API

This part of the documentation covers all the interfaces of crawly.

Runner:

This class is not part of the public interface; users should instead use the runner module attribute, which is an instance of _Runner.

class crawly._Runner

Class that manages running all requests concurrently, extracting data from the website and passing it to the pipelines.

add_pipeline(pipeline)

Add a pipeline: a callable that accepts a WebPage (or subclass) instance, which is passed to it after all the requested data has been extracted.

Return:
self to allow “Fluent Interface” creation pattern.
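For example, a pipeline can be any callable; the sketch below is illustrative (the save_title function is not part of crawly):

import crawly

def save_title(page):
    # page is the WebPage (or subclass) instance received after extraction.
    print(page.data)

crawly.runner.add_pipeline(save_title)
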
fetch(request)

Send the given request in a greenlet taken from the request pool.

filter(predicate)

Add a predicate to filter pages (URLs), keeping only the ones for which the predicate returns True.

The difference between this method and _Runner.takewhile() is that _Runner.filter() only filters individual URLs, while _Runner.takewhile() stops at the first URL for which the predicate returns False, so no URL coming after that one will be crawled.

Return:
self to allow “Fluent Interface” creation pattern.
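For example, to keep only pages whose URL mentions a keyword (the condition is illustrative):

import crawly

crawly.runner.filter(lambda page: 'python' in str(page.url))
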
log(msg, level=20)

Log msg at the given level (defaults to INFO).

on_exception(func)

Add a function to be executed when the crawler encounters an exception.

Argument:
func: A function that accepts a single argument: the greenlet that raised the exception.
Return:
self to allow “Fluent Interface” creation pattern.
on_finish(func)

Add a function to be executed when the crawler finishes crawling and all greenlets have been joined.

Argument:
func: A function that should accept no arguments.
Return:
self to allow “Fluent Interface” creation pattern.
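For example, both hooks can be attached in one fluent chain (the callback names are illustrative):

import crawly

def report_failure(greenlet):
    # The greenlet that raised is passed in; log it at ERROR level (40).
    crawly.runner.log('crawling failed in %r' % greenlet, level=40)

def done():
    crawly.runner.log('finished crawling')

crawly.runner.on_exception(report_failure).on_finish(done)
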
report

Get execution report.

The report contains the following fields:

  • CRAWLED URLS: count of crawled URLs.
  • EXTRACTED DATA: count of extracted data passed to pipelines.
  • EXCEPTIONS COUNTER: count of exceptions raised.
  • START TIME: Date and time when the crawler started.
  • FINISH TIME: Date and time when the crawler finished.
  • TOTAL TIME: The total time spent crawling.
  • SHUTDOWN REASON: Why the crawler stopped, i.e. the exception that made the crawler stop if there was one, else ‘FINISH CRAWLING’, which means the crawler finished normally.
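
After a run the report can simply be inspected, e.g.:

import crawly

crawly.runner.start()
print(crawly.runner.report)
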
set_website(website)

Set the website to crawl; the website argument can be an instance of, or a class that inherits from, the WebSite class.

Return:
self to allow “Fluent Interface” creation pattern.
start(argv=None)

Start/Launch crawling.

Argument:
argv: Command line arguments, defaults to sys.argv[1:].
Command line argument:
--config=file.json

The file.json configuration file must be in JSON format; its values replace the default configuration taken from the global configuration.
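
As a sketch, a file overriding only a couple of the keys shown in the Configuration section below might look like this (mycrawler.py stands for any script that ends with runner.start(); whether omitted keys keep their default values depends on how the file is merged with the defaults):

$ cat file.json
{
    "timeout": 30,
    "requests": {"max_retries": 5}
}
$ python mycrawler.py --config=file.json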

takewhile(predicate)

Add a predicate that stops adding URLs to fetch as soon as it returns False.

Argument:
predicate: A function that accepts a page as argument and returns a boolean; when the predicate returns False, no URL after this one in the website will be fetched.
Return:
self to allow “Fluent Interface” creation pattern.

WARNING: The page has not been fetched yet when it is passed to the predicate, so no data has been extracted from it yet.
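
For example, to stop queueing URLs once the pagination goes past a given page number (the condition is illustrative; the predicate can rely on the page URL but not on extracted data):

import crawly

crawly.runner.takewhile(lambda page: 'page=5' not in str(page.url))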

Website structures:

class crawly.WebSite

An abstract superclass that represents a website.

Classes inheriting from this class must define the url class variable, else an Exception will be raised.

Examples

>>> class PythonQuestions(WebSite):
...     url = "http://stackoverflow.com/questions/tagged/python"
...     Pagination = Pagination(
...         'http://stackoverflow.com/questions/tagged/python',
...         data={'page': '{page}'},
...         end=4
...     )
...
>>> [page.url for page in PythonQuestions().pages]      
['<GET http://stackoverflow.com/questions/tagged/python?page=1>',
 '<GET http://stackoverflow.com/questions/tagged/python?page=2>',
 '<GET http://stackoverflow.com/questions/tagged/python?page=3>',
 '<GET http://stackoverflow.com/questions/tagged/python?page=4>']
WebPageCls

alias of WebPage

pages

Get pages from the website.

If the WebSite.Pagination class variable was set, this returns the list of pages yielded by the pagination; otherwise it returns a list with a single element: a WebPage instance for this url.

class crawly.Pagination(url, data, method='GET', start=1, end=None)

Class that iterates over a website’s pages and returns a request for each one of them.

Arguments:
  • url: Pagination url.
  • data: Dictionary of data to send with the URL to get the next page; the string template {page} can be used as a dictionary value and will be replaced by the current page number before the request is sent.
  • method: HTTP method to use to request the url, default: GET.
  • start: Page number to start requesting from (inclusive), default: 1.
  • end: Last page number (inclusive); defaults to None, in which case developers must override the end_reached() method to be able to stop somewhere.

Example

>>> stackoverflow_pages = Pagination(
...     'http://stackoverflow.com/questions/tagged/python',
...     data={'page': '{page}'},
...     end=4
... )
>>> [r.pretty_url for r in stackoverflow_pages]  
['<GET http://stackoverflow.com/questions/tagged/python?page=1>',
 '<GET http://stackoverflow.com/questions/tagged/python?page=2>',
 '<GET http://stackoverflow.com/questions/tagged/python?page=3>',
 '<GET http://stackoverflow.com/questions/tagged/python?page=4>']
end_reached()

Method meant to be overridden to stop iterating over the pagination when the end constructor argument wasn’t set.

Return True to stop paginating, else False.
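
For example, an open-ended pagination can be capped by a subclass (a sketch that assumes end_reached() is consulted once per generated page):

from crawly import Pagination

class BoundedPagination(Pagination):
    # Hypothetical subclass: stop after max_pages pages since end was not set.
    max_pages = 10

    def end_reached(self):
        self._count = getattr(self, '_count', 0) + 1
        return self._count > self.max_pages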

next()

Get the next page request.

class crawly.WebPage(url_or_request, parent=None, initial=None)

Class that represents a website page, used to extract data or links to follow.

Extract data from the page

>>> class PythonJobs(WebPage):
...     toextract = {
...         'title': '//div[5]/div/div/div[2]/h2/a/text()'
...     }
...
>>> page = PythonJobs('http://www.python.org/community/jobs/')
>>> page.extract()  
{'title': ...}

Extract links to follow

>>> class PythonJobs(WebPage):
...     tofollow = '//div[5]/div/div/div[2]/h2/a/@href'
...
>>> page = PythonJobs('http://www.python.org/community/jobs/')
>>> list(page.follow_links())  
[...]
Arguments:
  • url_or_request: Either a string representing the URL of this page or, for finer customization, a request.
  • parent: A WebPage or a WebSite instance that represents the parent site/page of this one.
  • initial: Initial data related to this page.
data

Get extracted data.

WARNING: This property recomputes the data to return every time it is accessed, so be careful about side effects: if you override this property and the new version defines a value that changes on each call, e.g. datetime.now(), your data will be inconsistent. If such inconsistency is a problem, developers should use the WebPage._getdata() method instead to define any extra data; it is computed only the first time this property is accessed.
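
For example, to attach a crawl timestamp that stays stable across accesses, one might override _getdata() instead of data (a sketch; it assumes _getdata() returns a dictionary of extra data):

import datetime
from crawly import WebPage

class TimestampedPage(WebPage):
    def _getdata(self):
        # Computed only the first time the data property is accessed,
        # so the timestamp does not change on subsequent accesses.
        return {'crawled_at': datetime.datetime.now()}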

extract(toextract=None, update=True)

Extract the data given by toextract.

Argument:
  • toextract: the same argument accepted by the HTML.extract() method.
  • update: Boolean that, when set to True (the default), updates the internal data holder; otherwise the extracted data is returned without updating the internal data holder.
Return:
Extracted data.
Raise:
  • ExtractionError if extraction failed.
  • ValueError if the argument didn’t follow the documentation guideline.

follow_links(tofollow=None)

Follow the given links and return a WebPage.WebPageCls instance for each link.

Argument:
tofollow: the same argument accepted by the HTML.extract() method. If tofollow is a dictionary, it must contain a “links” key that points to the path used to extract the URLs to follow; any extra key is used to extract extra data, which must have the same length as the URLs to follow and is passed to the generated WebPageCls instances.
Return:
A generator yielding a WebPageCls instance for each link to follow.
Raise:
  • ExtractionError if extraction failed.
  • ValueError if the argument didn’t follow the documentation guideline.
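
For example, extra data can be carried along to the followed pages by passing a dictionary (a sketch reusing the page instance and XPath expressions from the examples above):

links = page.follow_links({
    # 'links' is mandatory: the path used to extract the URLs to follow.
    'links': '//div[5]/div/div/div[2]/h2/a/@href',
    # Any extra key must yield one value per URL; it is passed along to
    # the generated WebPageCls instances.
    'title': '//div[5]/div/div/div[2]/h2/a/text()',
})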
html

Get the HTML of this page as an HTML class instance.

request

Get the request used by this page.

url

Get a pretty URL of this page in the form <(method: data) url>.

class crawly.HTML(html)

Class to represent HTML code.

This class is a wrapper around the lxml.html.HtmlElement class, so developers can interact with instances of this class in the same way as with lxml.html.HtmlElement instances, with the addition of a new method, HTML.extract(), that allows extracting data from the HTML.

Example

>>> html = HTML('<html><body><div><h2>test</h2></div></body></html>')
>>> html.extract('//div/h2/text()')
'test'
extract(extractor)

Extract from this HTML the data pointed by extractor.

Argument:
extractor: Either a dictionary in the form {'name': <callable> or <string>}, a single callable object that accepts an lxml.html.HtmlElement (e.g. an XPath class instance), or a string, in which case the string is automatically transformed into an XPath instance.
Return:
The extracted data, as a dictionary if the extractor argument was a dictionary, else a list or a string depending on the extractor callbacks.
Raise:
ExtractionError if extraction failed.
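
For example, a dictionary extractor may mix a string XPath with a callable (a sketch building on the HTML example above):

html = HTML('<html><body><div><h2>test</h2></div></body></html>')
html.extract({
    'title': '//div/h2/text()',     # a string is turned into an XPath instance
    'tag': lambda root: root.tag,   # a callable receives the HtmlElement
})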

Extraction tools:

class crawly.XPath(xpath, *callbacks)

Callable class that defines an XPath query with callbacks.

Arguments:
  • xpath: A string representing the XPath query.

  • callbacks: A list of functions called in order (first to last) over the result returned by lxml.etree.XPath. This class also has a callbacks class variable that subclasses can set; it takes priority over the callbacks passed in this argument, meaning the callbacks passed here are called after the class variable callbacks.

    Illustration

    XPath("...", callback1, callback2, callback3)
        <=>
    callback3( callback2( callback1( XPath("...") ) ) )
Raise:
ExtractionError if extraction failed.

Example

>>> import string

>>> x = XPath('//div/h2/text()', string.strip)
>>> x('<html><body><div><h2>\r\ntest\n</h2></div></body></html>')
'test'

>>> x = XPath('//ul/li/text()', lambda ls: map(int, ls))
>>> x('<html><body><ul><li>1</li><li>2</li></ul></body></html>')
[1, 2]
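
Subclasses can also set the callbacks class variable mentioned above, which runs before any constructor callbacks (a sketch; a list is assumed to be an accepted value for the class variable):

from crawly import XPath

class IntXPath(XPath):
    # Applied first, before any callbacks passed to the constructor.
    callbacks = [lambda ls: map(int, ls)]

IntXPath('//ul/li/text()')('<html><body><ul><li>1</li><li>2</li></ul></body></html>')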

Exceptions:

exception crawly.ExtractionError

Error raised when extracting data from HTML fail.

Configuration:

Crawly can be configured by passing a JSON formatted file via the --config command line option; it overrides the default configuration, which is a combination of requests configuration and logging configuration.

{
    'timeout': 15,
    # Requests configuration: http://tinyurl.com/dyvdj57
    'requests': {
        'base_headers': {
            'Accept': '*/*',
            'Accept-Encoding': 'identity, deflate, compress, gzip',
            'User-Agent': 'crawly/' + __version__
        },
        'danger_mode': False,
        'encode_uri': True,
        'keep_alive': True,
        'max_redirects': 30,
        'max_retries': 3,
        'pool_connections': 10,
        'pool_maxsize': 10,
        'safe_mode': True,   # Default is False in requests.
        'strict_mode': False,
        'trust_env': True,
        'verbose': False
    },
    # Logging configuration: http://tinyurl.com/crt6rkw
    'logging': {
        'version': 1,
        'formatters': {
            'standard': {
                'format': '%(asctime)s [%(levelname)s] %(name)s: %(message)s'
            }
        },
        'handlers': {
            'console': {
                'formatter': 'standard',
                'class': 'logging.StreamHandler',
            }
        },
        'loggers': {
            '': {
                'handlers': ['console'],
                'level': 'DEBUG',
                'propagate': False,
            }
        }
    }
}
