Existing study: Why you shouldn’t use Crawly?

First of all, have you checked out Scrapy (http://scrapy.org)? If not, you should: it’s a very powerful framework. Unfortunately, in my case I found some drawbacks with Scrapy that led me to create Crawly:

  • Scrapy was too big and too hard to hack on. I had some problems with it, especially concerning the consistency of scraped data, which is a huge problem for scrapers, and a lot of spaghetti code made it very hard to dig into :).
  • Most websites look the same (at least the ones I crawled), but Scrapy didn’t help me keep my code clean because it required a lot of boilerplate.
  • Scrapy is huge in terms of architecture (scrapyd, web interface, ...), and all of this consumed more memory than my little server could support, so it was crashing other processes each time Scrapy started crawling.

Define the need: Why did I create Crawly?

Because I love micro-frameworks (Flask vs. Django), and because I believe that:

Inside every large, complex program is a small, elegant program that does the same thing, correctly -- Tony Hoare

And because I wanted to fix all the problems listed above without having to dig into Scrapy. When I weighed the cost of digging into Scrapy against the cost of writing a new crawler library and what I would gain from it, well, guess what?!

Goals: What should a crawler library do?

IMHO, a crawler library should offer (in no particular order):

  • Simple usage: Make it easy to instruct the library to crawl a given website by handling the most common website design patterns, for example single page, list->detail, paginate->list->detail and such, and make it easy to extend for unusual websites.
  • Feedback: Log everything for the user.
  • Configurable: Something that every library should offer.
  • Encoding: Handle all HTML encodings (utf8, latin1, ...).
  • Scraping: Give developers an easy way to extract data from a website, using XPath for example.
  • Speed: Crawling a website should be fast.
  • Play nice: Handle rate limits, so we don’t DoS the servers.
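
To make a couple of these goals more concrete, here is a minimal sketch of the list->detail pattern combined with XPath extraction and a simple rate limit. It is not Crawly’s API: the names `PAGES`, `RateLimiter`, `fetch` and `crawl_list_detail` are all hypothetical, the "website" is an in-memory dict so the example runs without network access, and it uses the standard library’s `xml.etree.ElementTree`, which only supports a subset of XPath and only parses well-formed markup. A real crawler would likely use a proper HTML parser instead.

```python
import time
import xml.etree.ElementTree as ET

# Hypothetical in-memory "website" (assumption: stands in for real HTTP pages).
PAGES = {
    "/list": (
        "<html><body>"
        "<a class='item' href='/item/1'>One</a>"
        "<a class='item' href='/item/2'>Two</a>"
        "</body></html>"
    ),
    "/item/1": "<html><body><h1>First item</h1></body></html>",
    "/item/2": "<html><body><h1>Second item</h1></body></html>",
}


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous request, if any

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def fetch(url, limiter):
    """Return the page body for `url`, waiting first so we play nice."""
    limiter.wait()
    return PAGES[url]


def crawl_list_detail(list_url, link_xpath, field_xpaths, limiter):
    """Follow every link matched on the list page and scrape each detail page."""
    root = ET.fromstring(fetch(list_url, limiter))
    results = []
    for link in root.findall(link_xpath):
        href = link.get("href")
        detail = ET.fromstring(fetch(href, limiter))
        # Extract one text value per named field from the detail page.
        item = {name: detail.findtext(xpath) for name, xpath in field_xpaths.items()}
        item["url"] = href
        results.append(item)
    return results


items = crawl_list_detail(
    "/list",
    ".//a[@class='item']",   # which links on the list page lead to details
    {"title": ".//h1"},      # which fields to extract from each detail page
    RateLimiter(0.01),       # at most one request every 10 ms
)
```

The point of the sketch is that the caller only describes the site (one XPath for the links, one per field) and the pattern itself is reusable, which is the kind of boilerplate reduction described in the "Simple usage" goal.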

Status: How does Crawly compare to Scrapy?

  • Speed: I can tell you with confidence that Crawly is very fast, and that is all thanks to Gevent. In most of my tests I noticed that Crawly was a little faster than Scrapy, but not by much (a few seconds of difference), because Scrapy is already very fast.
  • Memory: Crawly is small, so it’s very light in terms of memory.
  • Usage/Simplicity: I may be a little biased on this one, but this was one of the main reasons I created Crawly.
  • Features: For the time being, Scrapy has a lot of features that have no equivalent in Crawly.
  • Maturity: Crawly is still new, while Scrapy is a very mature open source project, so there is nothing to compare here :)
