Custom Data Extractors
======================

scrawler provides several `built-in data extractors `__. However, you can also easily write your own extractors.

What kinds of data can be extracted?
------------------------------------

Data extractors are passed a :class:`.Website` object, which provides access to various data points (see also the documented attributes in the :class:`.Website` documentation). The three most important ones are:

1. The website's HTML parsed as a BeautifulSoup object (see `their documentation <https://www.crummy.com/software/BeautifulSoup/bs4/doc/>`__ for how to extract data from it). Because :class:`.Website` extends ``BeautifulSoup``, you can directly execute BeautifulSoup methods on the website object.
2. The HTTP response object (:attr:`.http_response` attribute) as :class:`requests:requests.Response` or :class:`aiohttp:aiohttp.ClientResponse` (depending on whether you are using the ``asyncio`` or ``multithreading`` backend).
3. The website's raw URL (:attr:`~scrawler.website.Website.url` attribute) and parsed URL parts (:attr:`.parsed_url` attribute).

Basic structure
---------------

Data extractors are classes that inherit from :class:`.BaseExtractor` and implement two methods:

- :func:`~scrawler.data_extractors.BaseExtractor.__init__`: Receives the extractor's parameters and stores them in object attributes.
- :func:`~scrawler.data_extractors.BaseExtractor.run`: Performs the actual extraction. Make sure that the method signature is the same as for :class:`.BaseExtractor`, i.e. it accepts two parameters: ``website`` and, optionally, ``index``.

Note on :class:`.BaseExtractor` parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are two important parameters specified in :class:`.BaseExtractor` that apply to all data extractors. Please consider them when writing your own data extractor.

``n_return_values``
^^^^^^^^^^^^^^^^^^^

The parameter ``n_return_values`` specifies the number of values that the extractor returns. This is almost always 1, but there are cases such as :class:`.DateExtractor` which may return more values. If you build your own data extractor that may return more than one value, make sure to update ``self.n_return_values``. This attribute is used to validate that the length of the CSV file header equals the number of columns generated by the search attributes. Have a look at the implementation of :class:`.DateExtractor` to see how this might be handled.

``dynamic_parameters``
^^^^^^^^^^^^^^^^^^^^^^

The parameter ``dynamic_parameters`` handles a special case of data extraction when scraping/crawling multiple sites: you may want a different set of parameters for each URL. In this case, you can pass the relevant parameter as a list and set ``dynamic_parameters`` to ``True``. The scraper/crawler will then have each URL/scraping target use a different value from that list based on an index. In this example, a different ID will be put for each crawled domain:

.. code:: python

    from scrawler.data_extractors import CustomStringPutter

    DOMAINS_TO_CRAWL = ["https://www.abc.com", "https://www.def.com", "https://www.ghi.com"]

    putter = CustomStringPutter(["id_1001", "id_1002", "id_1003"], dynamic_parameters=True)

Note that when enabling ``dynamic_parameters``, the parameters going into the data extractor can only have one of two forms (see the sketch after this list):

- A :class:`list` (not a :class:`tuple`!) where each list entry matches *exactly one* URL (in the same order as in the list of URLs, see the variable ``DOMAINS_TO_CRAWL`` in the example above).
- A constant (of a type other than list) that will be the same for all URLs.

Passing a parameter list shorter or longer than the list of URLs will raise an error.
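To make the two forms concrete, here is a short sketch complementing the ``CustomStringPutter`` example above (the constructor signature is assumed here, not taken from the API reference):

.. code:: python

    from scrawler.data_extractors import CustomStringPutter

    # Constant form (any non-list type): the same ID is put for every URL
    shared_id = CustomStringPutter("id_shared", dynamic_parameters=True)

    # List form with a wrong length (two entries for the three URLs in
    # DOMAINS_TO_CRAWL above) would raise an error:
    # bad = CustomStringPutter(["id_1001", "id_1002"], dynamic_parameters=True)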
All built-in data extractors support dynamic parameters, and you can easily add support to your custom data extractor by using the :func:`.supports_dynamic_parameters` function decorator on your :func:`~scrawler.data_extractors.BaseExtractor.run` method, like this:

.. code:: python

    from scrawler import Website
    from scrawler.data_extractors import BaseExtractor, supports_dynamic_parameters


    class CopyrightExtractor(BaseExtractor):
        def __init__(self, **kwargs):
            """Extract website copyright tag."""
            super().__init__(**kwargs)

        @supports_dynamic_parameters
        def run(self, website: Website, index: int = None):
            copyright_tag = website.find("meta", attrs={"name": "copyright"})

            # Important: Do not forget to handle exceptions, because many sites will not have this copyright tag
            try:
                copyright_text = copyright_tag.attrs["content"]
            except (AttributeError, KeyError):
                copyright_text = "NULL"

            return copyright_text

Example
-------

In this example, we build a data extractor to retrieve a website's copyright tag (if available):

.. code:: python

    from scrawler import Website
    from scrawler.data_extractors import BaseExtractor


    class CopyrightExtractor(BaseExtractor):
        def __init__(self, **kwargs):
            """Extract website copyright tag."""
            super().__init__(**kwargs)

        def run(self, website: Website, index: int = None):
            copyright_tag = website.find("meta", attrs={"name": "copyright"})

            # Important: Do not forget to handle exceptions, because many sites will not have this copyright tag
            try:
                copyright_text = copyright_tag.attrs["content"]
            except (AttributeError, KeyError):
                copyright_text = "NULL"

            return copyright_text

In this case, there is actually an easier solution: the built-in extractor :class:`.GeneralHtmlTagExtractor` already contains all the necessary functionality:

.. code:: python

    from scrawler.data_extractors import GeneralHtmlTagExtractor

    copyright_extractor = GeneralHtmlTagExtractor(tag_types="meta",
                                                  tag_attrs={"name": "copyright"},
                                                  attr_to_extract="content")
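Tying this back to the ``n_return_values`` parameter discussed above, a custom extractor that returns more than one value might look like the following sketch. The class name is hypothetical, and returning the values as a tuple is an assumption; check the implementation of :class:`.DateExtractor` for how the built-in extractors handle this case.

.. code:: python

    from scrawler import Website
    from scrawler.data_extractors import BaseExtractor, supports_dynamic_parameters


    class TitleAndDescriptionExtractor(BaseExtractor):
        def __init__(self, **kwargs):
            """Extract a website's <title> text and meta description as two values."""
            super().__init__(**kwargs)
            self.n_return_values = 2  # two values -> two CSV columns

        @supports_dynamic_parameters
        def run(self, website: Website, index: int = None):
            # The <title> tag may be missing, so fall back to "NULL"
            try:
                title = website.title.string
            except AttributeError:
                title = "NULL"

            description_tag = website.find("meta", attrs={"name": "description"})
            try:
                description = description_tag.attrs["content"]
            except (AttributeError, KeyError):
                description = "NULL"

            return title, description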