Getting Started
===============

To get started, have a look at the `templates `__ folder. It contains four files, each one performing a different task. All templates include three sections:

1. **Imports** pulls in all code dependencies from the various files.
2. **Setup** is where all parameters are specified.
3. In **Execution**, an instance of the respective Python object is created and its ``run()`` method executed.

As a starting point, you can copy a template and make any adjustments you like.

Let's have a closer look at the **Setup** section. First, the *URL(s)* to be processed are specified. Then, the *attributes* that define how to accomplish the tasks are set:

.. autosummary::
    :nosignatures:

    ~scrawler.attributes.SearchAttributes
    ~scrawler.attributes.ExportAttributes
    ~scrawler.attributes.CrawlingAttributes

For more details, see the section `Attributes`_.

In the **Execution** section, these parameters are then passed to the relevant object (see next section).

Basic Objects
-------------

The basic functionality of **scrawler** is contained in two classes, :class:`.Scraper` and :class:`.Crawler`.

Functionality
~~~~~~~~~~~~~

The objects are passed all relevant parameters during object initialization. Then, three methods can be applied to them:

- ``run()``: Execute the task and return the results.
- ``run_and_export()``: This may be used when scraping/crawling many sites at once, generating huge amounts of data. To prevent a :class:`.MemoryError`, data is exported as soon as it is ready and then discarded to make room for the next sites/domains (a short sketch follows the crawling example below).
- ``export_data()``: Export the collected data to CSV file(s).

Example Crawling
~~~~~~~~~~~~~~~~

Let's have a look at an example for **crawling** ``https://example.com``. For the moment, you can ignore the variables ``search_attrs``, ``export_attrs`` and ``crawling_attrs``. We will get to them `later <#attributes>`_.

.. code:: python

    from scrawler import Crawler

    search_attrs, export_attrs, crawling_attrs = ..., ..., ...

    crawler = Crawler("https://example.com",
                      search_attributes=search_attrs,
                      export_attributes=export_attrs,
                      crawling_attributes=crawling_attrs)

    results = crawler.run()
    crawler.export_data()
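If a crawl produces more data than comfortably fits into memory, ``run_and_export()`` can replace the ``run()``/``export_data()`` pair shown above. The following is a minimal sketch based on the crawling example; it assumes that the export attributes passed at initialization are sufficient, so ``run_and_export()`` is called without further arguments here. Check the :class:`.Crawler` documentation for the exact signature.

.. code:: python

    from scrawler import Crawler

    search_attrs, export_attrs, crawling_attrs = ..., ..., ...

    crawler = Crawler("https://example.com",
                      search_attributes=search_attrs,
                      export_attributes=export_attrs,
                      crawling_attributes=crawling_attrs)

    # Exports each site's data as soon as it is ready and then discards it,
    # instead of keeping all results in memory until the end.
    crawler.run_and_export()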
Example Scraping
~~~~~~~~~~~~~~~~

Here, multiple sites are **scraped** at once.

.. code:: python

    from scrawler import Scraper

    search_attrs, export_attrs = ..., ...

    scraper = Scraper(["https://www.example1.com", "https://www.example2.com", "https://www.example3.com"],
                      search_attributes=search_attrs,
                      export_attributes=export_attrs)

    results = scraper.run()
    scraper.export_data()

Attributes
----------

Now that we know the objects that will perform our tasks, we would like to specify exactly how to go about it.

Search Attributes
~~~~~~~~~~~~~~~~~

The :class:`.SearchAttributes` specify which data to collect/search for in the website (and how to do it). This is done by passing data extractor objects to :class:`.SearchAttributes` during initialization. There are many data extractors already built into the project, see `built-in data extractors `__. You can also specify your own `custom data extractors `__.

In this example, we set up :class:`.SearchAttributes` that will extract three different data points from websites, specified using the built-in :class:`.UrlExtractor`, :class:`.TitleExtractor` and :class:`.DateExtractor` data extractors. Note how parameters for the data extractors are passed directly during initialization.

.. code:: python

    from scrawler.attributes import SearchAttributes
    from scrawler.data_extractors import *

    search_attrs = SearchAttributes(
        UrlExtractor(),      # returns the URL
        TitleExtractor(),    # returns the content of the website's <title> tag
        DateExtractor(tag_types="meta", tag_attrs={"name": "pubdate"})    # returns publication date from pubdate meta tag
    )

.. seealso:: :class:`.SearchAttributes`: More detailed documentation.

Export Attributes
~~~~~~~~~~~~~~~~~

The :class:`.ExportAttributes` specify how and where to export the collected data. Data is always exported as CSV, so the various parameters are geared towards that format.

Two parameters *must* be specified here:

- ``directory``: The directory (folder) that the file(s) will be saved to.
- ``fn``: Filename(s) of the exported CSV files containing the crawled data. You don't have to specify the file extension ``.csv``, since the files will always be CSV files (for example, use ``crawled_data`` instead of ``crawled_data.csv``).

Here's an example of creating an :class:`.ExportAttributes` object:

.. code:: python

    from scrawler.attributes import ExportAttributes

    export_attrs = ExportAttributes(
        directory=r"C:\Users\USER\Documents",
        fn=["example1_crawled_data", "example2_crawled_data", "example3_crawled_data"],
        header=["URL", "Title", "Publication Date"],
        separator="\t"
    )

.. seealso:: :class:`.ExportAttributes`: More detailed documentation.

Crawling Attributes
~~~~~~~~~~~~~~~~~~~

The :class:`.CrawlingAttributes` specify how to conduct the crawling, e.g. how to filter irrelevant URLs or how to limit the number of URLs crawled. As implied by their name, they are only relevant for crawling tasks.

Some commonly adjusted parameters include:

- ``filter_foreign_urls``: This parameter defines how the crawler knows that a given URL is still part of the target domain. For example, one may only want to crawl a subdomain, not the entire domain (only URLs from ``subdomain.example.com`` vs. the entire ``example.com`` domain). Details on valid input values can be found in the documentation for :class:`.CrawlingAttributes`. By default, this is set to ``auto``, which means that the correct mode will be inferred from the passed base/start URL. For example, if the start URL contains a subdomain, only links from that subdomain will be crawled. For details, refer to the documentation of the :func:`.extract_same_host_pattern` function. Note that you can also pass your own comparison function here. It has to accept two parameters, ``url1`` and ``url2``. The first URL is the one to be checked, and the second is the reference (the crawling start URL). The function should return ``True`` for URLs that belong to the same host, and ``False`` for foreign URLs.
- ``filter_media_files``: Controls whether media files are filtered out (ignored). Media files can be quite large and make the crawling process significantly longer, while not adding any new information, because media file data can't be parsed and processed. The crawler detects media files by looking at the URL (e.g. URLs ending in ``.pdf`` or ``.jpg``), as well as the response header `content-type <https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type>`__.
- ``blocklist``: Some directories might not be interesting for the crawling process (e.g. ``/media/``). The ``blocklist`` parameter makes it possible to pass a list of strings that might occur in a URL. If a URL contains any of the given strings, it is filtered out (a short illustration follows this list).
- ``max_no_urls``: Some domains contain many webpages. This parameter can be passed an integer as the maximum total number of URLs to crawl.
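To make the ``blocklist`` matching concrete, here is a small illustration of the substring check described above. This is plain Python, not scrawler's internal code; it only shows which URLs such a blocklist would filter out.

.. code:: python

    blocklist = ("git.", "datasets.", "nextcloud.")

    candidate_urls = [
        "https://git.example.com/some-repository",    # contains "git."       -> filtered out
        "https://www.example.com/blog/article-1",     # no match              -> kept
        "https://datasets.example.com/data.csv",      # contains "datasets."  -> filtered out
    ]

    kept = [url for url in candidate_urls
            if not any(blocked in url for blocked in blocklist)]

    print(kept)    # ['https://www.example.com/blog/article-1']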
Here's an example of creating a :class:`.CrawlingAttributes` object:

.. code:: python

    from scrawler.attributes import CrawlingAttributes

    DOMAIN_TO_CRAWL = "https://www.blog.example.com"

    crawling_attrs = CrawlingAttributes(
        filter_foreign_urls="subdomain1",    # only crawl the `blog` subdomain
        filter_media_files=True,
        blocklist=("git.", "datasets.", "nextcloud."),
        max_no_urls=1000
    )

Another example with a *custom foreign URL filter*:

.. code:: python

    import tld.exceptions

    from scrawler.attributes import CrawlingAttributes
    from scrawler.utils.web_utils import ParsedUrl

    DOMAIN_TO_CRAWL = "https://www.blog.example.com/my_directory/index.html"


    def should_be_crawled(url1: str, url2: str) -> bool:
        """Custom foreign URL filter: Crawl all URLs from host `www.blog.example.com` inside the directory `my_directory`."""
        try:    # don't forget exception handling
            url1 = ParsedUrl(url1)
            url2 = ParsedUrl(url2)
        except (tld.exceptions.TldBadUrl, tld.exceptions.TldDomainNotFound):    # URL couldn't be parsed
            return False

        return ((url1.hostname == url2.hostname)    # hostname is `www.blog.example.com`
                and ("my_directory" in url1.path)
                and ("my_directory" in url2.path))


    crawling_attrs = CrawlingAttributes(
        filter_foreign_urls=should_be_crawled,    # pass the custom foreign URL filter here
        filter_media_files=True,
        blocklist=("git.", "datasets.", "nextcloud."),
        max_no_urls=1000
    )

.. seealso:: :class:`.CrawlingAttributes`: More detailed documentation.

Other Settings
~~~~~~~~~~~~~~

Most parameters are encompassed in the three attribute objects above. However, there are some additional settings available for special cases.

If you look at the templates' **Setup** section again, it includes a ``USER_AGENT`` parameter that sets the `user agent <https://en.wikipedia.org/wiki/User_agent>`__ to be used during scraping/crawling.

Finally, `defaults.py <https://github.com/dglttr/scrawler/blob/main/scrawler/defaults.py>`__ contains standard settings that are used throughout the project.

FAQ
---

Why are there two backends?
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The module `backends <https://github.com/dglttr/scrawler/tree/main/scrawler/backends>`__ contains two files with the same functions for scraping and crawling, but built on different technologies for parallelization: one uses :mod:`asyncio` and the other :mod:`multiprocessing`, more precisely *multithreading* by means of :mod:`multiprocessing.dummy`.

In general, :mod:`.asyncio_backend` is preferable because more sites can be processed in parallel. However, on very large sites, scrawler may get stuck and the entire crawl may hang. Also, :class:`aiohttp:aiohttp.ServerDisconnectedError` may occur frequently. If you expect or experience these issues, it is preferable to use :mod:`~scrawler.backends.multithreading_backend`, which is slower, but more robust.
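The sketch below shows how switching to the more robust backend might look. The ``backend`` keyword argument is a hypothetical parameter name used for illustration only and is not confirmed on this page; consult the :class:`.Crawler` and :class:`.Scraper` documentation for the actual way to select a backend.

.. code:: python

    from scrawler import Crawler
    from scrawler.backends import multithreading_backend    # slower, but more robust

    search_attrs, export_attrs, crawling_attrs = ..., ..., ...

    # NOTE: ``backend=`` is a hypothetical parameter name used for illustration;
    # check the Crawler documentation for how a backend is actually selected.
    crawler = Crawler("https://example.com",
                      search_attributes=search_attrs,
                      export_attributes=export_attrs,
                      crawling_attributes=crawling_attrs,
                      backend=multithreading_backend)

    results = crawler.run()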