FAQ

Existing study: Why shouldn’t you use Crawly?

First of all, have you checked out Scrapy (http://scrapy.org)? If not, you should: it’s a very powerful framework. In my case, unfortunately, I found some drawbacks with Scrapy that led me to create Crawly:

Scrapy was too big and too hard to hack on. I had some problems with it, especially concerning the consistency of scraped data, which is a huge problem for scrapers, and a lot of spaghetti code made it very hard to dig into :).
Most websites look the same (at least the ones that I crawled), but Scrapy didn’t help me keep my code clean because it required a lot of boilerplate.
Scrapy is huge in terms of architecture (scrapyd, web interface, ...), and all of this consumed more memory than my little server could support, so it was crushing other processes each time Scrapy started crawling.

Define the need: Why did I create Crawly?

Because I love micro-frameworks (Flask vs. Django) and because I believe that:

Inside every large, complex program is a small, elegant program that does the same thing, correctly -- Tony Hoare

And because I wanted to fix all the problems listed above without having to dig into Scrapy. When I compared the cost of digging into Scrapy with the cost of creating a new crawler library, and what I would gain from each, well, guess what?!

Goals: What should a crawler library do?

IMHO, a crawler library should offer the following (not in order of importance):

Simple usage: Make it easy to instruct the library to crawl a given website by handling the most common website design patterns (for example single page, list->detail, paginate->list->detail and so on), and make it easy to extend for special websites.
Feedback: Log everything to the user.
Configurable: Something that every library should offer.
Encoding: Handle all HTML encodings (utf8, latin1, ...).
Scraping: Give the developer an easy way to extract data from a website, using XPath for example.
Speed: Crawling a website should be fast.
Play nice: Handle rate limits, so we don’t DoS the servers.
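To make the "Encoding" and "Play nice" goals more concrete, here is a minimal sketch using only the Python standard library. It is not Crawly's actual API: the names `RateLimiter` and `decode_html`, their parameters, and the list of candidate encodings are all assumptions made for illustration.

```python
import time


class RateLimiter:
    """Hypothetical helper: enforce a minimum delay between requests."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # seconds to leave between two requests
        self._clock = clock         # injectable so the class is easy to test
        self._sleep = sleep
        self._last = None           # time of the previous request, if any

    def wait(self):
        """Sleep until min_delay has passed since the previous call.

        Returns the number of seconds actually slept.
        """
        slept = 0.0
        if self._last is not None:
            remaining = self.min_delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


def decode_html(raw, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: decode raw bytes with the first encoding that fits."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if every candidate failed (latin-1 itself never fails):
    # keep going and replace the bytes that cannot be decoded.
    return raw.decode(encodings[0], errors="replace")
```

A crawler loop would call `limiter.wait()` before each request and pass the response body through `decode_html` before scraping it. A real library would probably also look at the charset declared in the HTTP headers or the HTML itself, which this sketch leaves out.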

Status: How does Crawly compare to Scrapy?

Speed: In terms of speed, I can tell you with confidence that Crawly is very fast, and that is all thanks to Gevent. In most of my tests I noticed that Crawly was a little bit faster than Scrapy, but nothing very noticeable (a few seconds of difference), because Scrapy is already very fast.
Memory: Crawly is small, so in terms of memory it’s very light.
Usage/Simplicity: I may be a little biased on this one, but this was one of the main reasons for me to create Crawly.
Features: For the time being, Scrapy has a lot of features that have no match in Crawly.
Maturity: Crawly is still new at this stage, while Scrapy is a very mature open source project, so there is nothing to compare here :)
