Custom Data Extractors

scrawler provides several built-in data extractors. However, you can also easily write your own extractors.

What kinds of data can be extracted?

Data extractors are passed a Website object, which provides access to various data points (see also the documented attributes in the Website documentation). The three most important ones are:

  1. The website’s HTML parsed as a BeautifulSoup object (see their documentation for how to extract data from it). Because Website extends BeautifulSoup, you can directly execute BeautifulSoup methods on the website object.

  2. The HTTP response object (http_response attribute) as requests.Response or aiohttp.ClientResponse (depending on whether you are using the asyncio or multithreading backend).

  3. The website’s raw URL (url attribute) and parsed URL parts (parsed_url attribute).
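To give a concrete sense of the third data point: parsed URL parts of the kind exposed by parsed_url correspond to what Python's standard urllib.parse.urlparse produces (whether Website uses exactly this representation is an assumption here, and the URL is made up):

```python
from urllib.parse import urlparse

# Hypothetical URL that a Website object might wrap; urlparse splits it
# into the kind of parts a parsed_url attribute typically exposes.
parts = urlparse("https://www.abc.com/blog/post?id=42")

print(parts.scheme)  # https
print(parts.netloc)  # www.abc.com
print(parts.path)    # /blog/post
print(parts.query)   # id=42
```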

Basic structure

Data extractors are classes that inherit from BaseExtractor and implement two methods:

  • __init__(): Where parameters are passed to the extractor and stored as object attributes.

  • run(): Where the actual extraction happens. Make sure that the method signature matches that of BaseExtractor, i.e. it accepts two parameters: website, and index as an optional parameter.

Note on BaseExtractor parameters

There are two important parameters specified in BaseExtractor that apply to all data extractors. Please consider these parameters when writing your own data extractor.

n_return_values

The parameter n_return_values specifies the number of values that will be returned by the extractor. This is almost always 1, but there are cases such as DateExtractor which may return more values. If you build your own data extractor that may return more than one value, make sure to update self.n_return_values. This attribute is used to validate that the length of the header of the CSV file equals the number of columns generated by the search attributes. Have a look at the implementation of DateExtractor to see how this might be handled.
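As a rough sketch of the bookkeeping involved, a multi-value extractor declares its column count up front. This is a toy stand-in for illustration only, not DateExtractor's actual implementation:

```python
class TwoValueExtractor:
    """Toy extractor returning two values per website.

    A stand-in for illustration only, not scrawler's DateExtractor.
    """

    def __init__(self):
        # Declare that this extractor contributes two CSV columns.
        self.n_return_values = 2

    def run(self, website, index=None):
        # Return exactly as many values as declared above.
        return "2024-01-01", "2024-06-30"


extractor = TwoValueExtractor()
values = extractor.run(website=None)
# The CSV writer can now check the header length against n_return_values.
assert len(values) == extractor.n_return_values
```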

dynamic_parameters

The parameter dynamic_parameters handles a special case of data extraction when scraping/crawling multiple sites. There may be cases where you would like to use a different set of parameters for each URL. In this case, you can pass the relevant parameter as a list and set dynamic_parameters to True. The scraper/crawler will then have each URL/scraping target use a different value from that list, matched by index. In this example, a different ID is put for each crawled domain:

from scrawler.data_extractors import CustomStringPutter

DOMAINS_TO_CRAWL = ["https://www.abc.com", "https://www.def.com", "https://www.ghi.com"]
putter = CustomStringPutter(["id_1001", "id_1002", "id_1003"], dynamic_parameters=True)

Note that when enabling dynamic_parameters, the parameters going into this data extractor can only take one of two forms:

  • A list (not a tuple!) where each list entry matches exactly one URL (in the same order as in the list of the URLs, see variable DOMAINS_TO_CRAWL in the example above).

  • A constant (of a type other than list) that will be the same for all URLs.

Passing a parameter list shorter or longer than the list of URLs will raise an error.
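The per-URL lookup described above can be mimicked in plain Python. This sketch reproduces the documented behaviour (list entries matched to URLs by index, non-list constants reused, mismatched lengths rejected); it is not scrawler's actual internal code:

```python
URLS = ["https://www.abc.com", "https://www.def.com", "https://www.ghi.com"]


def resolve(param, index, n_urls):
    """Resolve a possibly-dynamic parameter for the URL at position index."""
    if isinstance(param, list):
        if len(param) != n_urls:
            raise ValueError("Parameter list length must match the number of URLs.")
        return param[index]
    return param  # non-list constant: same value for every URL


ids = ["id_1001", "id_1002", "id_1003"]
assert resolve(ids, 1, len(URLS)) == "id_1002"    # one list entry per URL
assert resolve("fixed", 2, len(URLS)) == "fixed"  # constant for all URLs
```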

All built-in data extractors support dynamic parameters, and you can easily add support to your own custom data extractor by decorating its run() method with the supports_dynamic_parameters() function decorator, like this:

from scrawler import Website
from scrawler.data_extractors import BaseExtractor, supports_dynamic_parameters


class CopyrightExtractor(BaseExtractor):
    def __init__(self, **kwargs):
        """Extract website copyright tag."""
        super().__init__(**kwargs)

    @supports_dynamic_parameters
    def run(self, website: Website, index: int = None):
        copyright_tag = website.find("meta", attrs={"name": "copyright"})

        # Important: Do not forget to handle exceptions, because many sites will not have this copyright tag
        try:
            copyright_text = copyright_tag.attrs["content"]
        except (AttributeError, KeyError):
            copyright_text = "NULL"

        return copyright_text
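For intuition only, a decorator with this behaviour might look roughly like the following. This is a speculative re-implementation for illustration (hence the _sketch suffix), not scrawler's actual supports_dynamic_parameters, and StringPutter is a made-up stand-in for CustomStringPutter:

```python
import functools


def supports_dynamic_parameters_sketch(run):
    """Illustrative stand-in: substitute list-valued attributes by index before running."""
    @functools.wraps(run)
    def wrapper(self, website, index=None):
        originals = {}
        if index is not None:
            # Temporarily replace each list-valued attribute with its entry at `index`.
            for name, value in vars(self).items():
                if isinstance(value, list):
                    originals[name] = value
                    setattr(self, name, value[index])
        try:
            return run(self, website, index)
        finally:
            # Restore the original lists so the extractor can be reused.
            for name, value in originals.items():
                setattr(self, name, value)
    return wrapper


class StringPutter:
    """Toy extractor that just returns its stored string."""

    def __init__(self, string):
        self.string = string

    @supports_dynamic_parameters_sketch
    def run(self, website, index=None):
        return self.string


putter = StringPutter(["id_1001", "id_1002"])
assert putter.run(None, index=1) == "id_1002"
assert putter.string == ["id_1001", "id_1002"]  # original list restored
```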

Example

In this example, we build a data extractor to retrieve a website’s copyright tag (if available):

from scrawler import Website
from scrawler.data_extractors import BaseExtractor


class CopyrightExtractor(BaseExtractor):
    def __init__(self, **kwargs):
        """Extract website copyright tag."""
        super().__init__(**kwargs)

    def run(self, website: Website, index: int = None):
        copyright_tag = website.find("meta", attrs={"name": "copyright"})

        # Important: Do not forget to handle exceptions, because many sites will not have this copyright tag
        try:
            copyright_text = copyright_tag.attrs["content"]
        except (AttributeError, KeyError):
            copyright_text = "NULL"

        return copyright_text

In this case, there is actually a simpler solution: the built-in extractor GeneralHtmlTagExtractor already provides all the necessary functionality:

from scrawler.data_extractors import GeneralHtmlTagExtractor

copyright_extractor = GeneralHtmlTagExtractor(tag_types="meta", tag_attrs={"name": "copyright"},
                                              attr_to_extract="content")