This part of the documentation covers all the interfaces of crawly.
This class is not offered as a public interface; instead, users should use the runner module attribute, which is an instance of _Runner.
Class that manages running all requests concurrently, extracting data from the website, and writing it to the pipelines.
Add a pipeline, i.e. a callable that accepts a WebPage (or subclass) instance; the instance is passed to the pipeline after all the requested data has been extracted.
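For illustration, a pipeline that dumps each page's data could look like the following sketch; add_pipeline is a hypothetical name for the method described above, page.data stands for the extracted-data property documented under WebPage, and the runner import assumes the module attribute is exposed at the package top level.

>>> from crawly import runner
>>> def print_pipeline(page):
...     # Called once per crawled page, after extraction.
...     print(page.data)
...
>>> runner.add_pipeline(print_pipeline)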
Send the request in a greenlet taken from the request pool.
Add a predicate to filter pages (URLs), keeping only those for which the predicate returns True.
The difference between this method and _Runner.takewhile() is that _Runner.filter() only filters out individual URLs, while _Runner.takewhile() stops at the first URL for which the predicate returns False: that URL and every URL that comes after it will not be crawled.
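A short sketch of the difference; the runner attribute and the page.url attribute are assumed from the rest of this documentation.

>>> from crawly import runner
>>> # filter(): skip matching URLs, keep crawling everything else.
>>> runner.filter(lambda page: '/archive/' not in page.url)
>>> # takewhile(): the first page outside 2013 stops the crawl;
>>> # that URL and every URL after it are dropped.
>>> runner.takewhile(lambda page: '/2013/' in page.url)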
Log a message at the given level (defaults to INFO).
Add a function to be executed when the crawler encounters an exception.
Add a function to be executed when the crawler finishes crawling and all the greenlets have been joined.
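As an illustration, assuming the two registration methods above are named on_exception and on_finish (hypothetical names, as are the callback signatures):

>>> runner.on_exception(lambda exc: print('error while crawling: %r' % exc))
>>> runner.on_finish(lambda: print('all greenlets joined, crawl done'))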
Get execution report.
The report contains the following fields:
- CRAWLED URLS: count of crawled URLs.
- EXTRACTED DATA: count of extracted data passed to pipelines.
- EXCEPTIONS COUNTER: count number of exceptions raised.
- START TIME: Date time when the crawler started.
- FINISH TIME: Date time when the crawler finished.
- TOTAL TIME: The total time spent crawling.
- SHUTDOWN REASON: Why the crawler finished, i.e. the exception that made the crawler stop if there was one, otherwise 'FINISH CRAWLING', which means the crawler finished normally.
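A hypothetical report, using only the fields listed above, might render as:

CRAWLED URLS: 120
EXTRACTED DATA: 118
EXCEPTIONS COUNTER: 2
START TIME: 2013-05-01 10:00:00
FINISH TIME: 2013-05-01 10:03:25
TOTAL TIME: 0:03:25
SHUTDOWN REASON: FINISH CRAWLING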
Set the website to crawl; the website argument can be an instance of, or a class that inherits from, the WebSite class.
Start crawling.
The file.json configuration file should be in JSON format; its values replace the corresponding defaults taken from the global configuration.
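For example, a minimal file.json that only overrides the request timeout and the console log level (assuming the file is merged into the defaults shown at the end of this page, key by key):

{
    "timeout": 30,
    "logging": {
        "loggers": {
            "": {"level": "INFO"}
        }
    }
}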
Add a predicate; the crawler stops adding URLs to fetch as soon as the predicate returns False.
WARNING: The page passed to the predicate has not been fetched yet, so no data has been extracted from it.
An abstract superclass that represents a website.
Classes inheriting from this class must define the url class variable; otherwise this class raises an Exception.
Examples
>>> class PythonQuestions(WebSite):
... url = "http://stackoverflow.com/questions/tagged/python"
... Pagination = Pagination(
... 'http://stackoverflow.com/questions/tagged/python',
... data={'page': '{page}'},
... end=4
... )
...
>>> [page.url for page in PythonQuestions().pages]
['<GET http://stackoverflow.com/questions/tagged/python?page=1>',
'<GET http://stackoverflow.com/questions/tagged/python?page=2>',
'<GET http://stackoverflow.com/questions/tagged/python?page=3>',
'<GET http://stackoverflow.com/questions/tagged/python?page=4>']
Class that iterates over a website's pages and returns a request for each of them.
Example
>>> stackoverflow_pages = Pagination(
... 'http://stackoverflow.com/questions/tagged/python',
... data={'page': '{page}'},
... end=4
... )
>>> [r.pretty_url for r in stackoverflow_pages]
['<GET http://stackoverflow.com/questions/tagged/python?page=1>',
'<GET http://stackoverflow.com/questions/tagged/python?page=2>',
'<GET http://stackoverflow.com/questions/tagged/python?page=3>',
'<GET http://stackoverflow.com/questions/tagged/python?page=4>']
Method meant to be overridden to stop iterating over the pagination when the end constructor argument wasn't set.
Return True to stop paginating, else False.
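A sketch of such an override; the method name stop and the fact that it takes no arguments are guesses, not part of the documented API:

>>> class FirstTenPages(Pagination):
...     pages_seen = 0
...
...     def stop(self):
...         # Return True to stop paginating, else False: cap an
...         # open-ended pagination (no `end` argument) at 10 pages.
...         self.pages_seen += 1
...         return self.pages_seen > 10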
Get the next page request.
Class that represents a website page, which can be used to extract data or links to follow.
Extract data from the page
>>> class PythonJobs(WebPage):
...     toextract = {
...         'title': '//div[5]/div/div/div[2]/h2/a/text()'
...     }
...
>>> page = PythonJobs('http://www.python.org/community/jobs/')
>>> page.extract()
{'title': ...}
Extract links to follow
>>> class PythonJobs(WebPage):
...     tofollow = '//div[5]/div/div/div[2]/h2/a/@href'
...
>>> page = PythonJobs('http://www.python.org/community/jobs/')
>>> list(page.follow_links())
[...]
Get extracted data.
WARNING: This property recalculates the returned data every time it is accessed, so be careful about side effects: if you override this method and the new method defines a value that changes on each call, e.g. datetime.now(), your data will be inconsistent. If such inconsistency is a problem, use the WebPage._getdata() method instead to define any extra data; it is computed only the first time this property is accessed.
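For example, to attach a crawl timestamp that stays stable across repeated reads of this property (a sketch, assuming _getdata() returns the dict of extracted data):

>>> from datetime import datetime
>>> class PythonJobs(WebPage):
...     toextract = {'title': '//div[5]/div/div/div[2]/h2/a/text()'}
...
...     def _getdata(self):
...         # Runs only on first access, so every later read of the
...         # property sees the same timestamp.
...         data = super(PythonJobs, self)._getdata()
...         data['crawled_at'] = datetime.now().isoformat()
...         return data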
Extract the data given by toextract.
Follow the links given and return a WebPage.WebPageCls instance for each link.
Get the request used by this page.
Get a pretty URL of this page in the form <(method: data) url>.
Class to represent HTML code.
This class is a wrapper around the lxml.html.HtmlElement class, so developers can interact with instances of this class the same way they do with lxml.html.HtmlElement instances, with the addition of a new HTML.extract() method that allows extracting data from the HTML.
Example
>>> html = HTML('<html><body><div><h2>test</h2></div></body></html>')
>>> html.extract('//div/h2/text()')
'test'
Extract from this HTML the data pointed to by extractor.
Callable class that defines an XPath query with callbacks.
xpath: A string representing the XPath query.
callbacks: A list of functions to call in order (first to last) over the result returned by lxml.etree.XPath. This class also has a callbacks class variable that subclasses can set; it takes priority over the callbacks passed in this argument, meaning that callbacks passed here are called after the class-variable callbacks.
Illustration
XPath("...", callback1, callback2, callback3)
<=>
callback3( callback2( callback1( XPath("...") ) ) )
Example
>>> import string
>>> x = XPath('//div/h2/text()', string.strip)
>>> x('<html><body><div><h2>\r\ntest\n</h2></div></body></html>')
'test'
>>> x = XPath('//ul/li/text()', lambda ls: map(int, ls))
>>> x('<html><body><ul><li>1</li><li>2</li></ul></body></html>')
[1, 2]
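Subclasses can also set the callbacks class variable mentioned above; per the priority rule, these run before any callbacks passed to the constructor (a sketch, assuming the variable is a plain list):

>>> class StrippedXPath(XPath):
...     callbacks = [string.strip]   # applied first
...
>>> x = StrippedXPath('//div/h2/text()', string.upper)
>>> x('<html><body><div><h2> test </h2></div></body></html>')
'TEST'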
Crawly can be configured by passing a JSON formatted file via the --config command line option; it overrides the default configuration, which is a combination of requests configuration and logging configuration.
{
'timeout': 15,
# Requests configuration: http://tinyurl.com/dyvdj57
'requests': {
'base_headers': {
'Accept': '*/*',
'Accept-Encoding': 'identity, deflate, compress, gzip',
'User-Agent': 'crawly/' + __version__
},
'danger_mode': False,
'encode_uri': True,
'keep_alive': True,
'max_redirects': 30,
'max_retries': 3,
'pool_connections': 10,
'pool_maxsize': 10,
'safe_mode': True, # Default is False in requests.
'strict_mode': False,
'trust_env': True,
'verbose': False
},
# Logging configuration: http://tinyurl.com/crt6rkw
'logging': {
'version': 1,
'formatters': {
'standard': {
'format': '%(asctime)s [%(levelname)s] %(name)s: %(message)s'
}
},
'handlers': {
'console': {
'formatter': 'standard',
'class': 'logging.StreamHandler',
}
},
'loggers': {
'': {
'handlers': ['console'],
'level': 'DEBUG',
'propagate': False,
}
}
}
}