Welcome to scrawler's documentation!
====================================

*"scrawler" = "scraper" + "crawler"*

Provides functionality for the automatic collection of website data
(`web scraping <https://en.wikipedia.org/wiki/Web_scraping>`__) and for
following links to map an entire domain
(`crawling <https://en.wikipedia.org/wiki/Web_crawler>`__). It can
handle these tasks individually, or process several websites/domains in
parallel using ``asyncio`` and multithreading.

This project was initially developed while working at the `Fraunhofer
Institute for Systems and Innovation Research
<https://www.isi.fraunhofer.de/>`__. Many thanks for the opportunity
and support!

Installation
------------

You can install scrawler from PyPI:

::

    pip install scrawler

.. note::
    Alternatively, you can find the ``.whl`` and ``.tar.gz`` files on
    GitHub, attached to each respective release.

Getting Started
---------------

Check out the :doc:`Getting Started Guide <getting_started>`.

Important concepts and classes
------------------------------

:class:`.Website` Object
~~~~~~~~~~~~~~~~~~~~~~~~

Basic object containing the information on one website. It is
essentially a wrapper around a
`BeautifulSoup <https://www.crummy.com/software/BeautifulSoup/bs4/doc/>`__
object constructed from the website's HTML text, adding further
information such as the URL and
`its parts <https://docs.python.org/3/library/urllib.parse.html>`__, as
well as the HTTP response received when fetching the website.

- Website Object documentation: see the :doc:`API reference <reference>`

Crawling/Scraping attribute objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Used to specify options for the crawling/scraping processes, such as
which data to collect, which URLs to include, and where to save the
results.

.. autosummary::
    :nosignatures:

    ~scrawler.attributes.SearchAttributes
    ~scrawler.attributes.ExportAttributes
    ~scrawler.attributes.CrawlingAttributes

Data Extractors
~~~~~~~~~~~~~~~

Data extractors are functions used to retrieve various data points from
:class:`.Website` objects. A short illustrative sketch of a custom
extractor appears at the end of this page.

- :doc:`List of built-in data extractors <built_in_data_extractors>`
- :doc:`Guide on how to build custom data extractors <custom_data_extractors>`

.. toctree::
    :hidden:

    getting_started
    built_in_data_extractors
    custom_data_extractors
    reference

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
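
Example: a custom data extractor
================================

As a quick illustration of the concepts above, here is a minimal sketch
of a custom data extractor: a plain function that takes a
:class:`.Website` object and returns one data point. It assumes only
what this page states, namely that a :class:`.Website` wraps a
BeautifulSoup document; the import path and the way the wrapped
document is accessed below are illustrative assumptions rather than
confirmed scrawler API. See the
:doc:`guide on custom data extractors <custom_data_extractors>` for the
actual interface.

.. code-block:: python

    from scrawler import Website  # import path assumed for illustration

    def extract_title(website: Website) -> str:
        """Hypothetical extractor: return the text of the page's <title> tag."""
        # Because Website wraps a BeautifulSoup object, BeautifulSoup's
        # search methods are assumed to be reachable on it directly;
        # find() and get_text() are standard BeautifulSoup API.
        title_tag = website.find("title")
        return title_tag.get_text(strip=True) if title_tag else ""

A function like this would then presumably be handed to
:class:`.SearchAttributes`, the attribute object that specifies which
data to collect during crawling or scraping.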