Crawly: Micro crawler for Python

Crawly is a Python library for crawling websites and extracting data from them through a simple API.

Crawly works by combining a few existing tools into a small library (~350 lines of code) that fetches a website's HTML, crawls it (follows links), and extracts data from each page.
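
To make that pipeline concrete, below is a minimal sketch of the same fetch/follow/extract loop written directly against requests and lxml. It illustrates the idea rather than Crawly's actual API; the crawl function, the max_pages limit, and the title extraction are assumptions made for the example.

    from urllib.parse import urljoin

    import requests
    from lxml import html

    def crawl(start_url, max_pages=10):
        """Fetch pages breadth-first, following links from each page.

        Hypothetical helper for illustration; not part of Crawly's API.
        """
        session = requests.Session()  # reuses connections across requests
        seen, queue, results = set(), [start_url], []
        while queue and len(results) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            tree = html.fromstring(session.get(url).content)
            # "Extract data" here just means reading the page title.
            results.append((url, tree.findtext('.//title')))
            # "Crawl" means following every link found on the page.
            queue.extend(urljoin(url, href) for href in tree.xpath('//a/@href'))
        return results

A production crawler would also restrict links to the same site and handle network errors; this sketch omits both for brevity.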

Libraries used:

  • requests A Python HTTP library used by Crawly to fetch website HTML. It maintains the connection pool, is easily configurable, and supports many features, including SSL, cookies, persistent connections, and content decoding.
  • gevent The engine behind Crawly's speed: gevent runs concurrent code using green threads (see the sketch after this list).
  • lxml A fast, easy-to-use Python library that parses the fetched HTML so data can be extracted easily.
  • logging A Python standard library module for logging, also easily configurable.
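
To show what the gevent bullet means in practice, here is a minimal sketch of concurrent fetching with green threads, assuming plain requests-based fetches; the URLs are placeholders. gevent.monkey.patch_all() must run before requests is imported so the standard sockets become cooperative.

    import gevent.monkey
    gevent.monkey.patch_all()  # make blocking sockets cooperative; must run first

    import gevent
    import requests

    # Placeholder URLs; swap in real pages to crawl.
    urls = [
        'https://example.com/',
        'https://example.org/',
        'https://example.net/',
    ]

    def fetch(url):
        # Each call runs in its own green thread; while one request
        # waits on the network, gevent switches to another.
        return url, requests.get(url).status_code

    jobs = [gevent.spawn(fetch, u) for u in urls]
    gevent.joinall(jobs)
    for job in jobs:
        print(job.value)  # (url, status_code)

Because the fetches overlap on the network, the total time is close to that of the slowest single request rather than the sum of all of them.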
