Custom Data Extractors¶
scrawler provides several built-in data extractors. However, you can also easily write your own extractors.
What kinds of data can be extracted?¶
Data extractors are passed a Website object, which provides access to
various data points (see also the documented attributes in the Website
documentation).
The three most important ones are:

- The website’s HTML parsed as a BeautifulSoup object (see their documentation for how to extract data from it). Because Website extends BeautifulSoup, you can directly execute BeautifulSoup methods on the website object.
- The HTTP response object (http_response attribute) as requests.Response or aiohttp.ClientResponse (depending on whether you are using the asyncio or multithreading backend).
- The website’s raw URL (url attribute) and parsed URL parts (parsed_url attribute).
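Since Website extends BeautifulSoup, the usual BeautifulSoup calls work directly on it. The following minimal sketch uses a plain BeautifulSoup object to stand in for a Website (the HTML snippet and values are invented for illustration):

```python
from bs4 import BeautifulSoup

# A plain BeautifulSoup object stands in for a Website here, since
# Website inherits all of BeautifulSoup's methods.
html = "<html><head><title>Example Shop</title></head><body><h1>Welcome</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.find("title").get_text()    # text of the <title> tag
heading = soup.find("h1").get_text()     # text of the first <h1> tag
```

In a real extractor, you would call `website.find(...)` in the same way, without constructing a BeautifulSoup object yourself.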
Basic structure¶
Data extractors are classes that inherit from BaseExtractor
and implement two methods:

- __init__(): Where parameters to the extractor can be passed and are stored in object attributes.
- run(): To do the extraction. Make sure that the method signature is the same as for BaseExtractor, i.e. two parameters can be passed: website, and index as an optional parameter.
Note on BaseExtractor parameters¶
There are two important parameters specified in BaseExtractor
that apply to all data extractors.
Please consider these parameters when writing your own data extractor.
n_return_values¶
The parameter n_return_values
specifies the number of values
that will be returned by the extractor. This is almost always 1, but
there are cases such as DateExtractor
which may return more values.
If you build your own data extractor that may return more than one
value, make sure to update self.n_return_values
. This attribute is
used to validate that the length of the header of the CSV file equals
the number of columns generated by the search attributes. Have a look at
the implementation of DateExtractor
to see how this might be handled.
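The contract can be illustrated with a minimal sketch. Note that FakeBaseExtractor, FirstAndLastWordExtractor, and _StubWebsite below are invented stand-ins for illustration only; the real base class is scrawler’s BaseExtractor, and DateExtractor shows the real handling:

```python
class FakeBaseExtractor:
    """Stand-in for scrawler's BaseExtractor, just to illustrate the contract."""
    def __init__(self, **kwargs):
        self.n_return_values = 1  # default: one value, i.e. one CSV column


class FirstAndLastWordExtractor(FakeBaseExtractor):
    """Returns two values per page, so it must announce n_return_values = 2."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.n_return_values = 2  # two CSV columns will be generated

    def run(self, website, index: int = None):
        words = website.get_text().split()
        if not words:
            return "NULL", "NULL"
        return words[0], words[-1]


class _StubWebsite:
    """Hypothetical stub standing in for a Website object in this sketch."""
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


extractor = FirstAndLastWordExtractor()
first, last = extractor.run(_StubWebsite("hello brave new world"))
```

The key point is that self.n_return_values must always equal the number of values run() returns, or CSV header validation will fail.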
dynamic_parameters¶
The parameter dynamic_parameters
handles a special case of data
extraction when scraping/crawling multiple sites. There may be cases
where you would like to have a different set of parameters for each URL.
In this case, you can pass the relevant parameter as a list and set
dynamic_parameters
to True
. The scraper/crawler will then have each
URL/scraping target use a different value from that list based on an
index. In this example, a different ID will be put for each crawled domain:
from scrawler.data_extractors import CustomStringPutter
DOMAINS_TO_CRAWL = ["https://www.abc.com", "https://www.def.com", "https://www.ghi.com"]
putter = CustomStringPutter(["id_1001", "id_1002", "id_1003"], dynamic_parameters=True)
Note that when enabling dynamic_parameters, the parameters going into
this data extractor can only have one of two forms:

- A list (not a tuple!) where each list entry matches exactly one URL (in the same order as in the list of URLs, see variable DOMAINS_TO_CRAWL in the example above).
- A constant (of a type other than list) that will be the same for all URLs.

Passing a parameter list shorter or longer than the list of URLs will raise an error.
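Conceptually, the per-URL resolution works as in the following simplified sketch (this is an illustration of the behavior described above, not scrawler’s actual implementation; resolve_parameter is a hypothetical helper):

```python
def resolve_parameter(param, index, n_urls):
    """Sketch of dynamic-parameter resolution: lists are indexed per URL,
    any non-list value is shared by all URLs."""
    if isinstance(param, list):
        if len(param) != n_urls:
            raise ValueError("Parameter list length must match the number of URLs")
        return param[index]
    return param  # constant: same value for every URL


urls = ["https://www.abc.com", "https://www.def.com", "https://www.ghi.com"]
ids = ["id_1001", "id_1002", "id_1003"]

# List parameter: each URL gets "its" entry by index.
per_url = [resolve_parameter(ids, i, len(urls)) for i in range(len(urls))]

# Constant parameter: every URL gets the same value.
shared = [resolve_parameter("same_for_all", i, len(urls)) for i in range(len(urls))]
```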
All built-in data extractors support dynamic parameters and you can
easily add support to your custom data extractor by using the
supports_dynamic_parameters()
function decorator to decorate your
run()
method, like this:
from scrawler import Website
from scrawler.data_extractors import BaseExtractor, supports_dynamic_parameters
class CopyrightExtractor(BaseExtractor):
    def __init__(self, **kwargs):
        """Extract website copyright tag."""
        super().__init__(**kwargs)

    @supports_dynamic_parameters
    def run(self, website: Website, index: int = None):
        copyright_tag = website.find("meta", attrs={"name": "copyright"})
        # Important: Do not forget to handle exceptions, because many sites will not have this copyright tag
        try:
            copyright_text = copyright_tag.attrs["content"]
        except (AttributeError, KeyError):
            copyright_text = "NULL"
        return copyright_text
Example¶
In this example, we build a data extractor to retrieve a website’s copyright tag (if available):
from scrawler import Website
from scrawler.data_extractors import BaseExtractor
class CopyrightExtractor(BaseExtractor):
    def __init__(self, **kwargs):
        """Extract website copyright tag."""
        super().__init__(**kwargs)

    def run(self, website: Website, index: int = None):
        copyright_tag = website.find("meta", attrs={"name": "copyright"})
        # Important: Do not forget to handle exceptions, because many sites will not have this copyright tag
        try:
            copyright_text = copyright_tag.attrs["content"]
        except (AttributeError, KeyError):
            copyright_text = "NULL"
        return copyright_text
In this case, there is actually a simpler solution. The
built-in extractor GeneralHtmlTagExtractor already provides all the
necessary functionality:
from scrawler.data_extractors import GeneralHtmlTagExtractor
copyright_extractor = GeneralHtmlTagExtractor(tag_types="meta", tag_attrs={"name": "copyright"},
                                              attr_to_extract="content")