Custom Data Extractors
======================

scrawler provides several `built-in data extractors `__. However, you can also easily write your own extractors.

What kinds of data can be extracted?
------------------------------------

Data extractors are passed a :class:`.Website` object, which provides access to various data points (see also the documented attributes in the :class:`.Website` documentation). The three most important ones are:

1. The website's HTML parsed as a BeautifulSoup object (see `their documentation <https://www.crummy.com/software/BeautifulSoup/bs4/doc/>`__ for how to extract data from it). Because :class:`.Website` extends ``BeautifulSoup``, you can directly execute BeautifulSoup methods on the website object.
2. The HTTP response object (:attr:`.http_response` attribute) as :class:`requests:requests.Response` or :class:`aiohttp:aiohttp.ClientResponse` (depending on whether you are using the ``asyncio`` or ``multithreading`` backend).
3. The website's raw URL (:attr:`~scrawler.website.Website.url` attribute) and parsed URL parts (:attr:`.parsed_url` attribute).

Basic structure
---------------

Data extractors are classes that inherit from :class:`.BaseExtractor` and implement two methods:

- :func:`~scrawler.data_extractors.BaseExtractor.__init__`: Receives the extractor's parameters and stores them in object attributes.
- :func:`~scrawler.data_extractors.BaseExtractor.run`: Performs the actual extraction. Make sure that the method signature is the same as for :class:`.BaseExtractor`, i.e. it accepts two parameters: ``website`` and, optionally, ``index``.

Note on :class:`.BaseExtractor` parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are two important parameters specified in :class:`.BaseExtractor` that apply to all data extractors. Please consider them when writing your own data extractor.

``n_return_values``
^^^^^^^^^^^^^^^^^^^

The parameter ``n_return_values`` specifies the number of values that the extractor returns. This is almost always 1, but there are cases such as :class:`.DateExtractor` which may return more values. If you build your own data extractor that may return more than one value, make sure to update ``self.n_return_values``. This attribute is used to validate that the length of the CSV file header equals the number of columns generated by the search attributes. Have a look at the implementation of :class:`.DateExtractor` to see how this might be handled.

``dynamic_parameters``
^^^^^^^^^^^^^^^^^^^^^^

The parameter ``dynamic_parameters`` handles a special case of data extraction when scraping/crawling multiple sites: you may want a different set of parameters for each URL. In this case, you can pass the relevant parameter as a list and set ``dynamic_parameters`` to ``True``. The scraper/crawler will then have each URL/scraping target use a different value from that list based on an index. In this example, a different ID will be put for each crawled domain:

.. code:: python

    from scrawler.data_extractors import CustomStringPutter

    DOMAINS_TO_CRAWL = ["https://www.abc.com", "https://www.def.com", "https://www.ghi.com"]

    putter = CustomStringPutter(["id_1001", "id_1002", "id_1003"], dynamic_parameters=True)

Note that when enabling ``dynamic_parameters``, the parameters going into the data extractor can only have one of two forms (see the sketch after this list):

- A :class:`list` (not a :class:`tuple`!) where each list entry matches *exactly one* URL (in the same order as in the list of URLs, see the variable ``DOMAINS_TO_CRAWL`` in the example above).
- A constant (of a type other than list) that will be the same for all URLs.

Passing a parameter list shorter or longer than the list of URLs will raise an error.
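To make the two forms concrete, here is a short sketch complementing the ``CustomStringPutter`` example above (the constructor signature is assumed here, not taken from the API reference):

.. code:: python

    from scrawler.data_extractors import CustomStringPutter

    # Constant form (any non-list type): the same ID is put for every URL
    shared_id = CustomStringPutter("id_shared", dynamic_parameters=True)

    # List form with a wrong length (two entries for the three URLs in
    # DOMAINS_TO_CRAWL above) would raise an error:
    # bad = CustomStringPutter(["id_1001", "id_1002"], dynamic_parameters=True)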
All built-in data extractors support dynamic parameters, and you can easily add support to your custom data extractor by using the :func:`.supports_dynamic_parameters` function decorator on your :func:`~scrawler.data_extractors.BaseExtractor.run` method, like this:

.. code:: python

    from scrawler import Website
    from scrawler.data_extractors import BaseExtractor, supports_dynamic_parameters


    class CopyrightExtractor(BaseExtractor):
        def __init__(self, **kwargs):
            """Extract website copyright tag."""
            super().__init__(**kwargs)

        @supports_dynamic_parameters
        def run(self, website: Website, index: int = None):
            copyright_tag = website.find("meta", attrs={"name": "copyright"})

            # Important: Do not forget to handle exceptions, because many sites will not have this copyright tag
            try:
                copyright_text = copyright_tag.attrs["content"]
            except (AttributeError, KeyError):
                copyright_text = "NULL"

            return copyright_text

Example
-------

In this example, we build a data extractor to retrieve a website's copyright tag (if available):

.. code:: python

    from scrawler import Website
    from scrawler.data_extractors import BaseExtractor


    class CopyrightExtractor(BaseExtractor):
        def __init__(self, **kwargs):
            """Extract website copyright tag."""
            super().__init__(**kwargs)

        def run(self, website: Website, index: int = None):
            copyright_tag = website.find("meta", attrs={"name": "copyright"})

            # Important: Do not forget to handle exceptions, because many sites will not have this copyright tag
            try:
                copyright_text = copyright_tag.attrs["content"]
            except (AttributeError, KeyError):
                copyright_text = "NULL"

            return copyright_text

In this case, there is actually an easier solution: the built-in extractor :class:`.GeneralHtmlTagExtractor` already contains all the necessary functionality:

.. code:: python

    from scrawler.data_extractors import GeneralHtmlTagExtractor

    copyright_extractor = GeneralHtmlTagExtractor(tag_types="meta",
                                                  tag_attrs={"name": "copyright"},
                                                  attr_to_extract="content")
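Tying this back to the ``n_return_values`` parameter discussed above, a custom extractor that returns more than one value might look like the following sketch. The class name is hypothetical, and returning the values as a tuple is an assumption; check the implementation of :class:`.DateExtractor` for how the built-in extractors handle this case.

.. code:: python

    from scrawler import Website
    from scrawler.data_extractors import BaseExtractor, supports_dynamic_parameters


    class TitleAndDescriptionExtractor(BaseExtractor):
        def __init__(self, **kwargs):
            """Extract a website's <title> text and meta description as two values."""
            super().__init__(**kwargs)
            self.n_return_values = 2  # two values -> two CSV columns

        @supports_dynamic_parameters
        def run(self, website: Website, index: int = None):
            # The <title> tag may be missing, so fall back to "NULL"
            try:
                title = website.title.string
            except AttributeError:
                title = "NULL"

            description_tag = website.find("meta", attrs={"name": "description"})
            try:
                description = description_tag.attrs["content"]
            except (AttributeError, KeyError):
                description = "NULL"

            return title, description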