Reference¶
backends¶
asyncio_backend¶
- async scrawler.backends.asyncio_backend.async_crawl_domain(start_url: str, session: aiohttp.client.ClientSession, search_attributes: scrawler.attributes.SearchAttributes, export_attrs: Optional[scrawler.attributes.ExportAttributes] = None, user_agent: Optional[str] = None, pause_time: float = 0.5, respect_robots_txt: bool = True, max_no_urls: int = inf, max_distance_from_start_url: int = inf, max_subdirectory_depth: int = inf, filter_non_standard_schemes: bool = True, filter_media_files: bool = True, blocklist: Iterable = (), filter_foreign_urls: Union[str, Callable] = 'auto', strip_url_parameters: bool = False, strip_url_fragments: bool = True, return_type: str = 'data', progress_bar: Optional[scrawler.utils.general_utils.ProgressBar] = None, current_index: Optional[int] = None, semaphore: Optional[asyncio.locks.Semaphore] = None, **kwargs)[source]¶
Collect data from all sites of a given domain. The sites within the domain are found automatically by iteratively searching for all links inside all pages.
- Parameters
start_url – The first URL to be accessed. From here, links will be extracted and iteratively processed to find all linked sites.
search_attributes – Dictionary specifying what to search for and how to search it.
export_attrs – Optional. If specified, the crawled data is exported as soon as it’s ready, not after the entire crawling has finished.
user_agent – Optionally specify a user agent for making the HTTP request.
pause_time – Time to wait between the crawling of two URLs (in seconds).
respect_robots_txt – Whether to respect the specifications made in the website’s robots.txt file.
max_no_urls – Maximum number of URLs to be crawled (safety limit for very large crawls).
max_distance_from_start_url – Maximum number of links that have to be followed to arrive at a certain URL from the start_url.
max_subdirectory_depth – Maximum sub-level of the host up to which to crawl. E.g., consider this schema: hostname/sub-directory1/sub-siteA. If you want to crawl all URLs at the same level as sub-directory1, specify 1. sub-siteA will then not be found, but hostname/sub-directory2 or hostname/sub-siteB will be.
filter_non_standard_schemes – See filter_urls().
filter_media_files – See filter_urls().
blocklist – See filter_urls().
filter_foreign_urls – See filter_urls().
strip_url_parameters – See strip_unnecessary_url_parts().
strip_url_fragments – See strip_unnecessary_url_parts().
return_type – Specify which values to return (“all”, “none”, “data”).
progress_bar – If a ProgressBar object is passed, prints a progress bar on the command line.
current_index – Internal index needed to allow dynamic parameters (parameters where a list of values has been passed and only the values relevant to the currently processed URL should be used; for example, export_attrs may contain a list of filenames, and only the relevant filename for the currently processed URL should be used). See this explanation for details.
semaphore – asyncio.Semaphore used for controlling the number of concurrent tasks run.
session – aiohttp.ClientSession used to make requests in a concurrent manner.
- Returns
List of the data collected from all URLs that were found using start_url as the starting point.
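The semaphore parameter caps how many fetches are in flight at once. The pattern can be sketched with the standard library alone; fetch_one here is a hypothetical stand-in coroutine, not scrawler's actual fetching code:

```python
import asyncio

async def fetch_one(url: str, semaphore: asyncio.Semaphore) -> str:
    # The semaphore lets at most N coroutines pass this point concurrently.
    async with semaphore:
        await asyncio.sleep(0.01)  # stand-in for the actual HTTP request
        return f"data from {url}"

async def crawl(urls: list[str], max_concurrency: int = 3) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrency)
    # All fetches are scheduled at once; the semaphore throttles execution.
    return await asyncio.gather(*(fetch_one(u, semaphore) for u in urls))

results = asyncio.run(crawl([f"https://example.com/page{i}" for i in range(5)]))
```

Results come back in the order the URLs were passed, since asyncio.gather preserves input order.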
- async scrawler.backends.asyncio_backend.async_scrape_site(url: str, session: aiohttp.client.ClientSession, search_attrs: scrawler.attributes.SearchAttributes, export_attrs: Optional[scrawler.attributes.ExportAttributes] = None, user_agent: Optional[str] = None, current_index: Optional[int] = None, progress_bar: Optional[scrawler.utils.general_utils.ProgressBar] = None) list [source]¶
Scrape the data specified in search_attrs from one website.
- Parameters
url – URL to be scraped.
session – aiohttp.ClientSession used to make requests in a concurrent manner.
search_attrs – Specify which data to collect/search for in the website.
export_attrs – Specify how and where to export the collected data (as CSV).
user_agent – Optionally specify a user agent for making the HTTP request.
current_index – Internal index needed to allow dynamic parameters (parameters where a list of values has been passed and only the values relevant to the currently processed URL should be used; for example, export_attrs may contain a list of filenames, and only the relevant filename for the currently processed URL should be used). See this explanation for details.
progress_bar – If a ProgressBar object is passed, prints a progress bar on the command line.
- Returns
List of data collected from the website.
multithreading_backend¶
- scrawler.backends.multithreading_backend.crawl_domain(start_url: str, search_attributes: scrawler.attributes.SearchAttributes, export_attrs: Optional[scrawler.attributes.ExportAttributes] = None, user_agent: Optional[str] = None, pause_time: float = 0.5, respect_robots_txt: bool = True, max_no_urls: int = inf, max_distance_from_start_url: int = inf, max_subdirectory_depth: int = inf, filter_non_standard_schemes: bool = True, filter_media_files: bool = True, blocklist: Iterable = (), filter_foreign_urls: Union[str, Callable] = 'auto', strip_url_parameters: bool = False, strip_url_fragments: bool = True, return_type: str = 'data', progress_bar: Optional[scrawler.utils.general_utils.ProgressBar] = None, current_index: Optional[int] = None, **kwargs)[source]¶
Collect data from all sites of a given domain. The sites within the domain are found automatically by iteratively searching for all links inside all pages.
- Parameters
start_url – The first URL to be accessed. From here, links will be extracted and iteratively processed to find all linked sites.
search_attributes – Dictionary specifying what to search for and how to search it.
export_attrs – Optional. If specified, the crawled data is exported as soon as it’s ready, not after the entire crawling has finished.
user_agent – Optionally specify a user agent for making the HTTP request.
pause_time – Time to wait between the crawling of two URLs (in seconds).
respect_robots_txt – Whether to respect the specifications made in the website’s robots.txt file.
max_no_urls – Maximum number of URLs to be crawled (safety limit for very large crawls).
max_distance_from_start_url – Maximum number of links that have to be followed to arrive at a certain URL from the start_url.
max_subdirectory_depth – Maximum sub-level of the host up to which to crawl. E.g., consider this schema: hostname/sub-directory1/sub-siteA. If you want to crawl all URLs at the same level as sub-directory1, specify 1. sub-siteA will then not be found, but hostname/sub-directory2 or hostname/sub-siteB will be.
filter_non_standard_schemes – See filter_urls().
filter_media_files – See filter_urls().
blocklist – See filter_urls().
filter_foreign_urls – See filter_urls().
strip_url_parameters – See strip_unnecessary_url_parts().
strip_url_fragments – See strip_unnecessary_url_parts().
return_type – Specify which values to return (“all”, “none”, “data”).
progress_bar – If a ProgressBar object is passed, prints a progress bar on the command line.
current_index – Internal index needed to allow dynamic parameters (parameters where a list of values has been passed and only the values relevant to the currently processed URL should be used; for example, export_attrs may contain a list of filenames, and only the relevant filename for the currently processed URL should be used). See this explanation for details.
- Returns
List of the data collected from all URLs that were found using start_url as the starting point.
- scrawler.backends.multithreading_backend.scrape_site(url: str, search_attrs: scrawler.attributes.SearchAttributes, export_attrs: Optional[scrawler.attributes.ExportAttributes] = None, user_agent: Optional[str] = None, current_index: Optional[int] = None, progress_bar: Optional[scrawler.utils.general_utils.ProgressBar] = None) list [source]¶
Scrape the data specified in search_attrs from one website.
- Parameters
url – URL to be scraped.
search_attrs – Specify which data to collect/search for in the website.
export_attrs – Specify how and where to export the collected data (as CSV).
user_agent – Optionally specify a user agent for making the HTTP request.
current_index – Internal index needed to allow dynamic parameters (parameters where a list of values has been passed and only the values relevant to the currently processed URL should be used; for example, export_attrs may contain a list of filenames, and only the relevant filename for the currently processed URL should be used). See this explanation for details.
progress_bar – If a
ProgressBar
object is passed, prints a progress bar on the command line.
- Returns
List of data collected from the website.
utils¶
file_io_utils¶
Functions for local file import/export operations, e.g. CSV file reading and writing.
- scrawler.utils.file_io_utils.export_to_csv(data, directory: str, fn: str, header: Optional[Union[list, str, bool]] = None, encoding: str = 'utf-8', separator: str = ',', quoting: int = 0, escapechar: Optional[str] = None, current_index: Optional[int] = None, **kwargs) None [source]¶
Export data to a CSV file.
- Parameters
data – One- or two-dimensional data that will be parsed to a pandas.DataFrame.
directory – Path to directory where file will be saved.
fn – Filename (without file extension).
header – If None or False, no header will be written. If first-row or True, uses the first row of data as the header. Otherwise, pass a list of strings of appropriate length.
encoding – Encoding to use to create the CSV file.
separator – Column separator or delimiter to use for creating the CSV file.
quoting – Puts quotes around cells that contain the separator character.
escapechar – Escapes the separator character.
current_index – If fn is a list of filenames, use this to specify which filename to use.
kwargs – Any parameter supported by pandas.DataFrame.to_csv() can be passed.
- scrawler.utils.file_io_utils.get_data_in_dir(directory: str, start_idx: int = 0, end_idx: Optional[int] = None, encoding: str = 'utf-8', separator: str = ',') list [source]¶
Read all CSV files within a directory. All files in the directory must be CSV files.
- Parameters
directory – Path to the directory.
start_idx – Sometimes, not all CSV files in the directory should be read. Together with end_idx, this parameter allows specifying an interval of files to read in, e.g. the first up to the 5th file.
end_idx – See start_idx.
encoding – The character encoding of the CSV files to be read.
separator – The separator/delimiter of the CSV files to be read.
- scrawler.utils.file_io_utils.multithreaded_csv_export(list_of_datasets: list, **kwargs) None [source]¶
Export a list of multi-column datasets to CSV files in parallel using multithreading.
- Parameters
list_of_datasets – List of two-dimensional data objects that will be parsed to a pandas.DataFrame.
kwargs – Keyword arguments that are passed on to export_to_csv().
general_utils¶
General purpose utility functions.
- class scrawler.utils.general_utils.ProgressBar(total_length: int = 0, progress: int = 0, custom_message: str = '', width_in_command_line: int = 100, progress_char: str = '█', remaining_char: str = '-')[source]¶
Print a progress bar in the command line interface.
Default looks like this:
Custom Message |██████████----------| 50.0% (5 / 10)
- Parameters
total_length – Absolute length of the tracked quantity (e.g. total download size = 20,000 bytes).
progress – Share of total_length already reached (e.g. 10,000 bytes already downloaded).
custom_message – String to appear to the left of the progress bar.
width_in_command_line – Number of characters used in print to display the progress bar.
progress_char – Character to use for filling the progress bar.
remaining_char – Character to use for the space not yet filled by progress.
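The default rendering shown above can be reproduced with plain string formatting. A minimal sketch (not scrawler's actual implementation; width is fixed at 20 bar characters here to match the example, whereas the class defaults width_in_command_line to 100):

```python
def render_progress_bar(total_length: int, progress: int,
                        custom_message: str = "", width: int = 20,
                        progress_char: str = "█", remaining_char: str = "-") -> str:
    # Compute the filled share of the bar and format it like the example above.
    share = progress / total_length if total_length else 0.0
    filled = int(width * share)
    bar = progress_char * filled + remaining_char * (width - filled)
    return f"{custom_message} |{bar}| {share * 100:.1f}% ({progress} / {total_length})"

print(render_progress_bar(10, 5, "Custom Message"))
```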
validation_utils¶
Functions to make sure the specifications for a crawling/scraping are valid and work together correctly.
- scrawler.utils.validation_utils.validate_input_params(urls: List[str], search_attrs: scrawler.attributes.SearchAttributes, export_attrs: Optional[scrawler.attributes.ExportAttributes] = None, crawling_attrs: Optional[scrawler.attributes.CrawlingAttributes] = None, **kwargs)[source]¶
Validate that all URLs work and the various attributes work together.
web_utils¶
Functions for web operations (e.g. working with URLs and retrieving data from websites).
- class scrawler.utils.web_utils.ParsedUrl(url: str)[source]¶
Parse a URL string into its various parts. Basically a wrapper around tld.Result to make accessing elements easier.
- Parameters
url – URL string to parse.
- Raises
Exception – Exceptions from TLD package if the URL is invalid.
- url¶
Entire URL. In the following, this example URL is used to illustrate the various URL parts:
http://username:password@some.subdomain.example.co.uk/path1/path2?param="abc"#xyz
- async scrawler.utils.web_utils.async_get_html(url: str, session: aiohttp.client.ClientSession, user_agent: Optional[str] = None, verify: bool = True, max_content_length: int = - 1, check_http_content_type: bool = True, return_response_object: bool = False, raise_for_status: bool = False, **kwargs) Union[str, Tuple[str, aiohttp.client_reqrep.ClientResponse]] [source]¶
Collect HTML text of a given URL.
- Parameters
url – URL to retrieve the HTML from.
session – aiohttp.ClientSession to be used for making the request asynchronously.
user_agent – Optionally specify a different user agent than the default Python user agent.
verify – Whether to verify the server’s TLS certificate. Useful if TLS connections fail, but should in general be True to avoid man-in-the-middle attacks.
max_content_length – Check the HTTP header for the attribute content-length. If it is bigger than this parameter, a ValueError is raised. Set to -1 when not needed.
check_http_content_type – Whether to check the HTTP header field content-type. If it does not include text, a ValueError is raised.
return_response_object – If True, also returns the ClientResponse object from the GET request.
raise_for_status – If True, raise an HTTPError if the HTTP request returned an unsuccessful status code.
kwargs – Will be passed on to aiohttp.ClientSession.get().
- Returns
HTML text from the given URL. Optionally also returns the HTTP response object.
- Raises
aiohttp.ClientError, aiohttp.HTTPError, ValueError – Errors derived from aiohttp.ClientError include InvalidURL, ClientConnectionError and ClientResponseError. May optionally raise aiohttp.HTTPError (if raise_for_status is True) or ValueError (if the check_http_content_type or max_content_length checks fail).
- async scrawler.utils.web_utils.async_get_redirected_url(url: str, session: aiohttp.client.ClientSession, max_redirects_to_follow: int = 100, **kwargs) str [source]¶
Find final, redirected URL. Supports both HTTP redirects and HTML redirects. Also follows up on multiple redirects.
- Parameters
url – Original URL.
session – aiohttp.ClientSession to be used for making the request asynchronously.
max_redirects_to_follow – Maximum number of redirects to follow, to guard against infinite redirect loops. If the limit is reached, None is returned.
kwargs – Passed on to async_get_html().
- Returns
URL after redirects. If the URL is invalid or an error occurs, returns None.
- async scrawler.utils.web_utils.async_get_robot_file_parser(start_url: str, session: aiohttp.client.ClientSession, **kwargs) Optional[urllib.robotparser.RobotFileParser] [source]¶
Returns a RobotFileParser from the given URL. If no robots.txt file is found or an error occurs, returns None.
- Parameters
start_url – URL from which robots.txt will be collected.
session – aiohttp.ClientSession to use for making the request.
kwargs – Will be passed to get_html().
- scrawler.utils.web_utils.extract_same_host_pattern(base_url: str) str [source]¶
Looks at the passed base/start URL to determine which mode for is_same_host() is appropriate. First looks at whether the given URL contains a non-empty path. If one is found, the number of directories X is counted and directoryX is returned. Otherwise, checks whether the URL contains subdomains. If found, the number of subdomains X is counted and subdomainX is returned. If neither exists, returns fld.
See also
- scrawler.utils.web_utils.filter_urls(urls: Iterable, filter_non_standard_schemes: bool, filter_media_files: bool, blocklist: Iterable, filter_foreign_urls: Union[str, callable], base_url: Optional[str] = None, return_discarded: bool = False, **kwargs) Union[set, Tuple[set, set]] [source]¶
Filter a list of URLs along some given attributes.
- Parameters
urls – List of URLs to filter.
filter_non_standard_schemes – If True, makes sure that the URLs start with http: or https:.
filter_media_files – If True, discards URLs having media file extensions like .pdf or .jpeg. For details, see is_media_file().
blocklist – Specify a list of words or parts that, if they appear in a URL, cause the URL to be discarded (e.g. ‘git.’, ‘datasets.’).
filter_foreign_urls – Specify how to detect foreign URLs. Can either be a string that is passed to is_same_host() (for details on possible strings, see is_same_host(); note that the base_url parameter has to be passed for this to work), or a custom Callable with two parameters, url1 and url2. The first URL is the one to be checked, and the second is the reference (the crawling start URL). This function should return True for URLs that belong to the same host, and False for foreign URLs.
base_url – Used in conjunction with the filter_foreign_urls parameter to detect foreign URLs.
return_discarded – If True, also returns the discarded URLs.
- Returns
Set containing the URLs that were not filtered out. Optionally also returns the discarded URLs.
See also
is_media_file() – Checks whether the URL ends in a file extension on an allowlist, indicating it is not a media file.
is_same_host() – Checks whether two URLs have the same host.
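The scheme, media-file, and blocklist checks can be illustrated with the standard library. This sketch mirrors the filtering logic described above, but it is not scrawler's code: the function name, and the small extension set, are illustrative assumptions:

```python
from urllib.parse import urlparse

# Illustrative subset; the real is_media_file() uses allow/block lists.
MEDIA_EXTENSIONS = {".pdf", ".jpeg", ".jpg", ".png", ".zip"}

def filter_urls_sketch(urls, blocklist=()):
    kept, discarded = set(), set()
    for url in urls:
        parsed = urlparse(url)
        is_standard = parsed.scheme in ("http", "https")
        is_media = any(parsed.path.lower().endswith(ext) for ext in MEDIA_EXTENSIONS)
        is_blocked = any(part in url for part in blocklist)
        (discarded if (not is_standard or is_media or is_blocked) else kept).add(url)
    return kept, discarded

kept, discarded = filter_urls_sketch(
    ["https://example.com/a", "mailto:x@example.com",
     "https://example.com/b.pdf", "https://git.example.com/c"],
    blocklist=("git.",),
)
```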
- scrawler.utils.web_utils.fix_relative_urls(urls: Iterable, base_url: str) set [source]¶
Make relative URLs absolute by joining them with the base URL that they were found on.
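This is standard URL joining; with the standard library it can be sketched as follows (the function name is hypothetical, but urljoin is exactly the stdlib call for this behavior):

```python
from urllib.parse import urljoin

def fix_relative_urls_sketch(urls, base_url: str) -> set:
    # Relative URLs are resolved against the page they were found on;
    # already-absolute URLs pass through urljoin unchanged.
    return {urljoin(base_url, url) for url in urls}

result = fix_relative_urls_sketch(
    ["/about", "contact.html", "https://other.example.org/x"],
    base_url="https://example.com/en/index.html",
)
```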
- scrawler.utils.web_utils.get_directory_depth(url: str) Optional[int] [source]¶
Returns the directory level that a given document is in. For example, https://example.com/en/directoryA/document.html returns 3, because document.html is 3 directories deep into the website’s structure. Further, https://example.com/en/ returns 1 (the trailing / is ignored), and https://example.com returns 0.
- Parameters
url – URL whose subdirectory depth should be determined.
- Returns
Subdirectory level as path depth. If the URL is invalid, returns None.
- scrawler.utils.web_utils.get_html(url: str, timeout: int = 15, user_agent: Optional[str] = None, verify: bool = True, stream: bool = True, max_content_length: int = -1, check_http_content_type: bool = True, return_response_object: bool = False, raise_for_status: bool = False) Union[Tuple[str, requests.models.Response], str] [source]¶
Collect HTML text of a given URL.
- Parameters
url – URL to retrieve the HTML from.
timeout – If the server does not answer within the number of seconds specified here, a Timeout exception is raised.
user_agent – Optionally specify a different user agent than the default Python user agent.
verify – Whether to verify the server’s TLS certificate. Useful if TLS connections fail, but should in general be True to avoid man-in-the-middle attacks.
stream – If True, only the header of the response is retrieved at first. This allows for HTTP content type checking before actually retrieving the content. For details, see the Requests documentation.
max_content_length – Check the HTTP header for the attribute content-length. If it is bigger than this parameter, a ValueError is raised. Set to -1 when not needed.
check_http_content_type – Check the HTTP header for the attribute content-type. If it does not include text, a ValueError is raised.
return_response_object – If True, also returns the Response object from the GET request.
raise_for_status – If True, raise an HTTPError if the HTTP request returned an unsuccessful status code.
- Returns
HTML text from the given URL.
- Raises
ConnectionError, Timeout, other RequestExceptions, HTTPError, ValueError – Raises errors from the requests library when retrieval errors occur. Optionally raises HTTPError (if raise_for_status is True) and ValueError (if the check_http_content_type or max_content_length checks fail).
- scrawler.utils.web_utils.get_redirected_url(url: str, max_redirects_to_follow: int = 100, **kwargs) str [source]¶
Find final, redirected URL. Supports both HTTP redirects and HTML redirects. Also follows up on multiple redirects.
- Parameters
url – Original URL.
max_redirects_to_follow – Maximum number of redirects to follow, to guard against infinite redirect loops. If the limit is reached, None is returned.
kwargs – Passed on to get_html().
- Returns
URL after redirects. If the URL is invalid or an error occurs, returns None.
- scrawler.utils.web_utils.get_robot_file_parser(start_url: str, **kwargs) Optional[urllib.robotparser.RobotFileParser] [source]¶
Returns a RobotFileParser object from the given URL. If no robots.txt file is found or an error occurs, returns None.
- Parameters
start_url – URL from which robots.txt will be collected.
kwargs – Will be passed to get_html().
See also
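Once obtained, the parser answers per-URL permission queries. A standard-library illustration, constructed from an inline robots.txt rather than one fetched from a start URL:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse rules directly instead of fetching them over HTTP.
rp.parse(["User-agent: *", "Disallow: /private/"])

allowed = rp.can_fetch("*", "https://example.com/public/page")   # not disallowed
blocked = rp.can_fetch("*", "https://example.com/private/page")  # matches Disallow
```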
- scrawler.utils.web_utils.is_media_file(url: str, disallow_approach: bool = False, check_http_header: bool = False) bool [source]¶
Checks whether the URL ends in a file extension on an allowlist, indicating it is not a media file.
- Parameters
url – URL to be checked.
disallow_approach – If True, uses a blocklist approach, where file extensions known to be media file extensions are blocked. Note that while the blocklist used covers the most frequent file extensions, it certainly is not complete. Using the default allowlist approach guarantees that no URLs with anything but a text file extension are processed.
check_http_header – Look up the HTTP header attribute content-type and check whether it contains text/html. Note that enabling this makes the function execute much slower, because an HTTP request is made instead of just checking a string.
- Returns
True/False
- scrawler.utils.web_utils.is_same_host(url1: str, url2: str, mode: str = 'hostname') bool [source]¶
Checks whether two URLs have the same host. A comparison mode can be defined which determines the parts of the URLs that are checked for equality.
- Parameters
url1 – First URL to compare.
url2 – Second URL to compare.
mode – String describing which URL parts to check for equality. Can be any one of the attributes of the ParsedUrl class (e.g. domain, hostname, fld). Alternatively, can be set to subdomainX, with X representing an integer number up to which subdomain the URLs should be compared. E.g., when comparing http://www.sub.example.com and http://blog.sub.example.com, sub is the first level, while the second levels are www and blog, respectively. Or, can be set to directoryX, with X representing an integer number up to which directory the URLs should be compared. E.g., for http://example.com/dir1/dir2/index.html, directory2 would include all files in dir2.
- Returns
True or False. If exceptions occur, the method returns False.
- Raises
ValueError – If an invalid mode is specified.
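As an illustration of the hostname and subdomainX modes, here is a hypothetical, simplified re-implementation. Unlike the real function, which builds on ParsedUrl and the tld package, this sketch naively assumes a two-label registered domain (so it would mishandle e.g. example.co.uk):

```python
from urllib.parse import urlparse

def is_same_host_sketch(url1: str, url2: str, mode: str = "hostname") -> bool:
    h1 = urlparse(url1).hostname or ""
    h2 = urlparse(url2).hostname or ""
    if mode == "hostname":
        return h1 == h2
    if mode.startswith("subdomain"):
        # Compare the registered domain plus the X rightmost subdomain labels.
        x = int(mode[len("subdomain"):])
        keep = 2 + x  # naive: assumes a two-label registered domain like example.com
        return h1.split(".")[-keep:] == h2.split(".")[-keep:]
    raise ValueError(f"invalid mode: {mode}")
```

With the documented example URLs, subdomain1 compares only up to sub and matches, while subdomain2 also compares www against blog and does not.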
- scrawler.utils.web_utils.strip_unnecessary_url_parts(urls: Iterable, parameters: bool = False, fragments: bool = True) set [source]¶
Strip unnecessary URL parts.
- Parameters
urls – URLs to be stripped (can be any Iterable).
parameters – If True, strips URL query parameters (they always start with a ?) from the URL.
fragments – If True, strips URL fragments (introduced with #), except for relevant fragments using Google’s hash-bang syntax.
- Returns
Set of URLs with the specified parts stripped.
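A sketch of the stripping behavior using the standard library; the hash-bang handling here (a plain "#!" substring check) is a simplification of what the function describes, and the function name is hypothetical:

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit

def strip_url_parts_sketch(urls, parameters: bool = False, fragments: bool = True) -> set:
    out = set()
    for url in urls:
        if fragments and "#!" not in url:  # keep Google hash-bang fragments
            url = urldefrag(url).url
        if parameters:
            scheme, netloc, path, _query, frag = urlsplit(url)
            url = urlunsplit((scheme, netloc, path, "", frag))
        out.add(url)
    return out

stripped = strip_url_parts_sketch({"https://example.com/a?x=1#top",
                                   "https://example.com/b#!/state"})
```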
attributes¶
Specifies the attribute objects used by crawlers and scrapers.
- class scrawler.attributes.CrawlingAttributes(filter_non_standard_schemes: bool = True, filter_media_files: bool = True, blocklist: tuple = (), filter_foreign_urls: Union[str, Callable] = 'auto', strip_url_parameters: bool = False, strip_url_fragments: bool = True, max_no_urls: Optional[int] = None, max_distance_from_start_url: Optional[int] = None, max_subdirectory_depth: Optional[int] = None, pause_time: float = 0.5, respect_robots_txt: bool = True, validate: bool = True)[source]¶
Specify how to conduct the crawling, including filtering irrelevant URLs or limiting the number of crawled URLs.
- Parameters
filter_non_standard_schemes – Filter URLs starting with schemes other than http: or https: (e.g., mailto: or javascript:).
filter_media_files – Whether to filter media files. Recommended: True to avoid long runtimes caused by large file downloads.
blocklist – Filter URLs that contain one or more of the parts specified here. Has to be a list.
filter_foreign_urls – Filter URLs that do not belong to the same host (foreign URLs). Can either be a string that is passed to is_same_host(), or a custom Callable with two arguments, url1 and url2. In is_same_host(), the following string values are permitted: 1. auto: Automatically extracts a matching pattern from the start URL (see extract_same_host_pattern() for details). 2. Any one of the attributes of the ParsedUrl class (e.g. domain, hostname, fld). 3. subdomainX, with X representing an integer number up to which subdomain the URLs should be compared. E.g., when comparing http://www.sub.example.com and http://blog.sub.example.com, sub is the first level, while the second levels are www and blog, respectively. 4. directoryX, with X representing an integer number up to which directory the URLs should be compared. E.g., for http://example.com/dir1/dir2/index.html, directory2 would include all files in dir2.
strip_url_parameters – Whether to strip URL query parameters (prefixed by ?) from the URL.
strip_url_fragments – Whether to strip URL fragments (prefixed by #) from the URL.
max_no_urls – Maximum number of URLs to be crawled per domain (safety limit for very large crawls). Set to None if you want all URLs to be crawled.
max_distance_from_start_url – Maximum number of links that have to be followed to arrive at a certain URL from the start URL.
max_subdirectory_depth – Maximum sub-level of the host up to which to crawl. E.g., consider this schema: hostname/sub-directory1/sub-siteA. If you want to crawl all URLs at the same level as sub-directory1, specify 1. sub-siteA will then not be found, but hostname/sub-directory2 or hostname/sub-siteB will be.
pause_time – Time to wait between the crawling of two URLs (in seconds).
respect_robots_txt – Whether to respect the specifications made in the website’s robots.txt file.
validate – Whether to make sure that input parameters are valid.
- class scrawler.attributes.ExportAttributes(directory: str, fn: Union[str, list], header: Optional[Union[list, str, bool]] = None, encoding: str = 'utf-8', separator: str = ',', quoting: int = 0, escapechar: Optional[str] = None, validate: bool = True, **kwargs)[source]¶
Specify how and where to export the collected data.
- Parameters
directory – Folder where file(s) will be saved to.
fn – Name(s) of the file(s) containing the crawled data. Without file extension.
header – Whether the final CSV file should have a header. Possible parameters: If None or False, no header will be written. If first-row or True, uses the first row of data as the header. Otherwise, pass a list of strings of appropriate length.
encoding – Encoding to use to create the CSV file.
separator – Column separator or delimiter to use for creating the CSV file.
quoting – Puts quotes around cells that contain the separator character.
escapechar – Escapes the separator character.
validate – Whether to make sure that input parameters are valid.
kwargs – Any parameter supported by pandas.DataFrame.to_csv() can be passed.
- class scrawler.attributes.SearchAttributes(*args: scrawler.data_extractors.BaseExtractor, validate: bool = True)[source]¶
Specify which data to collect/search for in the website.
- Parameters
args – Data extractors specifying which data to extract from websites (see the built-in data extractors for possibilities, or define a custom data extractor).
validate – Whether to make sure that input parameters are valid.
- extract_all_attrs_from_website(website: scrawler.website.Website, index: Optional[int] = None) list [source]¶
Extract data from a website using the data extractors specified in the SearchAttributes definition.
- Parameters
website – Website object to collect the specified data points from.
index – Optionally pass an index for data extractors that index into passed parameters. See this explanation for details.
crawling¶
- class scrawler.crawling.Crawler(urls: Union[str, List[str]], search_attributes: scrawler.attributes.SearchAttributes, export_attributes: Optional[scrawler.attributes.ExportAttributes] = None, crawling_attributes: scrawler.attributes.CrawlingAttributes = <scrawler.attributes.CrawlingAttributes object>, user_agent: Optional[str] = None, timeout: Optional[Union[int, aiohttp.client.ClientTimeout]] = None, backend: str = 'asyncio', parallel_processes: int = 4, validate_input_parameters: bool = True)[source]¶
Crawl a domain or multiple domains in parallel.
- Parameters
urls – Start URL of domain to crawl or list of all URLs to crawl.
search_attributes – Specify which data to collect/search for in websites.
export_attributes – Specify how and where to export the collected data (as CSV).
crawling_attributes – Specify how to conduct the crawling, e.g. how to filter irrelevant URLs or limit the number of URLs crawled.
user_agent – Optionally specify a user agent for making the HTTP request.
timeout – Timeout to be used when making HTTP requests. Note that the values specified here apply to each request individually, not to an entire session. When using the asyncio_backend, you can pass an aiohttp.ClientTimeout object in which you can specify detailed timeout settings. Alternatively, you can pass an integer that will be interpreted as the total timeout for one request in seconds. If nothing is passed, a default timeout will be used.
backend – “asyncio” to use the asyncio_backend (faster when crawling many domains at once, but less stable and may hang). “multithreading” to use the multithreading_backend (more stable, but most likely slower). See also Why are there two backends?
parallel_processes – Number of concurrent processes/threads to use. Can be very large when using the asyncio_backend. When using the multithreading_backend, it should not exceed 2x the CPU count of the machine running the crawl.
validate_input_parameters – Whether to validate input parameters. Note that this validates that all URLs work and that the various attributes work together. However, the attributes themselves are also validated independently. You will need to also pass validate=False to the attributes individually to completely disable input validation.
- export_data(export_attrs: Optional[scrawler.attributes.ExportAttributes] = None) None [source]¶
Export data previously collected from crawling task.
- Parameters
export_attrs – ExportAttributes object specifying export parameters.
- run(export_immediately: bool = False) List[List[List[Any]]] [source]¶
Execute the crawling task and return the results.
- Parameters
export_immediately – May be used when crawling many sites at once. In order to prevent a MemoryError, data will be exported as soon as it is ready and then discarded to make room for the next domains.
- Returns
The result is a list with three layers. The first layer has one entry per crawled domain (result = [domain1, domain2, …]). The second layer (representing each crawled domain) is a list with one entry per processed URL (domain = [url1, url2, …]). The third layer (representing each URL) is a list with one entry per extracted datapoint (url = [datapoint1, datapoint2, …]).
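The three-layer structure can be walked with plain nested loops. A sketch with dummy data standing in for an actual Crawler.run() result (the URL strings and datapoints are invented for illustration):

```python
# Dummy result: 2 domains, each with 2 processed URLs,
# each URL with 2 extracted datapoints.
result = [
    [["https://a.com/", "Title A"], ["https://a.com/x", "Title AX"]],
    [["https://b.com/", "Title B"], ["https://b.com/y", "Title BY"]],
]

rows = []
for domain in result:            # first layer: one entry per crawled domain
    for url_data in domain:      # second layer: one entry per processed URL
        rows.append(url_data)    # third layer: datapoints for that URL
```

Flattening the first two layers like this yields one row per processed URL, which is the shape typically written out as CSV.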
- run_and_export(export_attrs: Optional[scrawler.attributes.ExportAttributes] = None) None [source]¶
Shorthand for Crawler.run(export_immediately=True).
- Parameters
export_attrs – ExportAttributes object specifying export parameters.
data_extractors¶
- class scrawler.data_extractors.BaseExtractor(*args, dynamic_parameters: bool = False, n_return_values: Optional[int] = None, **kwargs)[source]¶
Provides the basic architecture for each data extractor. Every data extractor has to inherit from BaseExtractor.
- Parameters
args – Positional arguments to be used by children inheriting from BaseExtractor.
dynamic_parameters – Set this to True when you would like to pass a list to a certain parameter, and have each URL/scraping target use a different value from that list based on an index. See also here.
n_return_values – Specifies the number of values that will be returned by the extractor. This is almost always 1, but there are cases such as DateExtractor which may return more values. See also here.
kwargs – Keyword arguments to be used by children inheriting from BaseExtractor.
- run(website: scrawler.website.Website, index: Optional[int] = None)[source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- scrawler.data_extractors.supports_dynamic_parameters(func) Callable [source]¶
Function decorator to select correct parameter based on index when using dynamic parameters.
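A minimal sketch of how such a decorator can resolve list-valued ("dynamic") parameters by index. This is an illustrative stand-in written from the documented behavior, not the library's actual implementation; the names supports_dynamic_parameters_sketch and DemoExtractor are hypothetical:

```python
import functools

def supports_dynamic_parameters_sketch(func):
    """Illustrative stand-in: if dynamic parameters are enabled and an index is
    given, replace every list argument with its index-th element before calling
    the wrapped method."""
    @functools.wraps(func)
    def wrapper(self, *args, index=None, **kwargs):
        if getattr(self, "dynamic_parameters", False) and index is not None:
            args = tuple(a[index] if isinstance(a, list) else a for a in args)
            kwargs = {k: (v[index] if isinstance(v, list) else v)
                      for k, v in kwargs.items()}
        return func(self, *args, index=index, **kwargs)
    return wrapper

class DemoExtractor:
    dynamic_parameters = True

    @supports_dynamic_parameters_sketch
    def run(self, value, index=None):
        return value

# The second scraping target (index=1) receives the second list entry.
selected = DemoExtractor().run(["a", "b", "c"], index=1)  # "b"
```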
- class scrawler.data_extractors.AccessTimeExtractor(**kwargs)[source]¶
Returns the current time as time of access. To be exact, the time of processing.
- run(website: scrawler.website.Website, index: Optional[int] = None) datetime.datetime [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.CmsExtractor(**kwargs)[source]¶
Extract the Content Management System (CMS) used for building the website.
Note: This method uses the HTML generator meta tag and some hard-coded search terms. Therefore, not all systems will be identified correctly.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.ContactNameExtractor(tag_types: tuple = 'div', tag_attrs: dict = {'class': 'employee_name'}, separator: str = ';', **kwargs)[source]¶
Find contact name(s) for a given website.
- Parameters
tag_types – Specifies which kind of tags to look at (e. g., div or span).
tag_attrs – Provide additional attributes in a dictionary, e. g. {"class": "contact"}.
separator – When more than one contact is found, they are separated by the string given here.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.CustomStringPutter(string: Union[str, list], **kwargs)[source]¶
Simply returns a given string or entry from a list of strings. Background: Sometimes, a column should be appended with a custom label for a given website (for example, an external ID).
- Parameters
string – The string to be returned by the run() method. Can optionally pass a list here and use a different value for different URLs/domains that are scraped. In that case, remember to also pass use_index=True.
- Raises
IndexError – May raise an IndexError if the parameter string is passed a list and use_index=True. This may occur when you pass a list of custom strings shorter than the list of URLs crawled.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
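The index-based behavior and the documented IndexError case can be sketched in plain Python. The function put_string below is a hypothetical stand-in for the putter's core logic, not the real class:

```python
def put_string(string, index=None):
    """Illustrative stand-in: return a static string, or the index-th entry
    when a list of per-URL/per-domain labels is passed."""
    if isinstance(string, list):
        return string[index]  # IndexError if the list is shorter than the URL list
    return string

labels = ["id-1", "id-2"]          # custom labels for two crawled domains
first = put_string(labels, index=0)            # "id-1"
static = put_string("static-label", index=5)   # "static-label" regardless of index

# A third URL with only two labels triggers the documented IndexError:
handled = False
try:
    put_string(labels, index=2)
except IndexError:
    handled = True
```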
- class scrawler.data_extractors.DateExtractor(tag_types: tuple = 'meta', tag_attrs: dict = {'name': 'pubdate'}, return_year_month_day: bool = False, **kwargs)[source]¶
Get dates by looking at passed tag. Can optionally parse dates to year, month and day.
- Parameters
tag_types – Describes the tag types to find, e. g. meta.
tag_attrs – Specifies HTML attributes and their values in a key-value dict format. Example: {"name": "pubdate"}.
return_year_month_day – If True, returns the date as 3 integers: year (YYYY), month (MM) and day (dd).
- run(website: scrawler.website.Website, index: Optional[int] = None) Union[datetime.datetime, Tuple[int, int, int]] [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.DescriptionExtractor(**kwargs)[source]¶
Get website description (the one shown in search engine results) using two common description fields.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.DirectoryDepthExtractor(**kwargs)[source]¶
Returns the directory level that a given document is in.
For example, https://www.sub.example.com/dir1/dir2/file.html returns 3.
- run(website: scrawler.website.Website, index: Optional[int] = None) int [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
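The documented counting rule can be reproduced with the standard library's urllib.parse. This is an illustrative reimplementation of the behavior described above (the function name directory_depth is hypothetical); the actual extractor may differ in edge cases:

```python
from urllib.parse import urlparse

def directory_depth(url: str) -> int:
    """Count non-empty path segments, matching the documented example
    (https://www.sub.example.com/dir1/dir2/file.html -> 3)."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])

depth = directory_depth("https://www.sub.example.com/dir1/dir2/file.html")  # 3
root_depth = directory_depth("https://example.com/")                        # 0
```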
- class scrawler.data_extractors.ExpiryDateExtractor(return_year_month_day: bool = False, **kwargs)[source]¶
Get website expiry date from HTTP header or HTML Meta tag.
- run(website: scrawler.website.Website, index: Optional[int] = None) Union[datetime.datetime, Tuple[int, int, int]] [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.GeneralHtmlTagExtractor(tag_types: tuple, tag_attrs: dict, attr_to_extract: str, fill_empty_field: bool = True, **kwargs)[source]¶
General purpose extractor for extracting HTML tags and then extracting a single attribute from the tag.
- Parameters
tag_types – Describes the tag types to find, e. g. div.
tag_attrs – Specifies the HTML attributes used to find the relevant HTML tag in a key-value dict format. Example: {"class": ["content", "main-content"]}. See also this explanation of HTML tag attributes.
attr_to_extract – The attribute that should be extracted from the found HTML tag.
fill_empty_field – Used in cases where the specified attribute in the HTML tag exists but is empty. If True, returns the value specified in DEFAULT_EMPTY_FIELD_STRING. Otherwise, returns an empty string.
kwargs –
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
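The idea of finding a tag by its attributes and pulling out a single attribute value can be sketched with the standard library's html.parser (the library itself works on BeautifulSoup objects, so this AttrGrabber class is only an illustrative stand-in):

```python
from html.parser import HTMLParser

class AttrGrabber(HTMLParser):
    """Find the first tag of a given type whose attributes match, then
    extract one attribute from it (stand-in for GeneralHtmlTagExtractor)."""
    def __init__(self, tag_type, match_attrs, attr_to_extract):
        super().__init__()
        self.tag_type = tag_type
        self.match_attrs = match_attrs
        self.attr_to_extract = attr_to_extract
        self.result = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (self.result is None and tag == self.tag_type
                and all(attrs.get(k) == v for k, v in self.match_attrs.items())):
            self.result = attrs.get(self.attr_to_extract, "")

html = '<html><head><meta name="pubdate" content="2021-06-01"></head></html>'
parser = AttrGrabber("meta", {"name": "pubdate"}, "content")
parser.feed(html)
# parser.result is now "2021-06-01"
```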
- class scrawler.data_extractors.GeneralHttpHeaderFieldExtractor(field_to_extract: str, fill_empty_field: bool = True, **kwargs)[source]¶
General purpose extractor for extracting HTTP header fields.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.HttpStatusCodeExtractor(**kwargs)[source]¶
Get status code of HTTP request.
- run(website: scrawler.website.Website, index: Optional[int] = None) int [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.KeywordsExtractor(**kwargs)[source]¶
Get keywords from HTML keyword meta tag (if present).
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.LanguageExtractor(**kwargs)[source]¶
Get language of a given website from its HTML tag lang attribute.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.LastModifiedDateExtractor(return_year_month_day: bool = False, **kwargs)[source]¶
Get website last-modified date from HTTP header or HTML Meta tag.
- run(website: scrawler.website.Website, index: Optional[int] = None) Union[datetime.datetime, Tuple[int, int, int]] [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.LinkExtractor(**kwargs)[source]¶
Find all links from a website (without duplicates).
- run(website: scrawler.website.Website, index: Optional[int] = None) set [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.MobileOptimizedExtractor(**kwargs)[source]¶
Checks whether website is optimized for mobile usage by looking up the HTML viewport meta tag.
- run(website: scrawler.website.Website, index: Optional[int] = None) int [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.ServerProductExtractor(**kwargs)[source]¶
Get website Server info from HTTP header.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.StepsFromStartPageExtractor(**kwargs)[source]¶
Returns the number of links that have to be followed from the start page to arrive at this website.
- run(website: scrawler.website.Website, index: Optional[int] = None) int [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.TermOccurrenceCountExtractor(terms: Union[List[str], str], ignore_case: bool = False, **kwargs)[source]¶
Count the number of times the given terms occur in the website’s HTML text.
- Parameters
terms – Term or list of terms to search for.
ignore_case – Whether to ignore the text’s casing (upper-/lowercase).
- Returns
Total sum of all occurrences.
- run(website: scrawler.website.Website, index: Optional[int] = None) int [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
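The counting logic can be sketched in plain Python. The function count_term_occurrences below is a hypothetical stand-in reconstructed from the documented parameters, not the extractor's actual code:

```python
def count_term_occurrences(text: str, terms, ignore_case: bool = False) -> int:
    """Illustrative stand-in: total number of times all terms occur in the text."""
    if isinstance(terms, str):
        terms = [terms]
    if ignore_case:
        text = text.lower()
        terms = [t.lower() for t in terms]
    return sum(text.count(t) for t in terms)

html_text = "Python is great. I love python."
exact = count_term_occurrences(html_text, "Python")                    # 1
folded = count_term_occurrences(html_text, "Python", ignore_case=True) # 2
```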
- class scrawler.data_extractors.TermOccurrenceExtractor(terms: Union[List[str], str], ignore_case: bool = False, **kwargs)[source]¶
Checks if the given terms occur in the website’s HTML text. Returns 0 if no term occurs in the soup’s text, 1 if at least one occurs.
- Parameters
terms – Term or list of terms to search for.
ignore_case – Whether to ignore the text’s casing (upper-/lowercase).
- run(website: scrawler.website.Website, index: Optional[int] = None) int [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.TitleExtractor(**kwargs)[source]¶
Get the title of a website (the one shown in the browser tab).
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.UrlBranchNameExtractor(branch_name_position: int = 1, **kwargs)[source]¶
Extract sub-domain names from URLs like subdomain.example.com, which often refer to an entity’s sub-branches.
- Parameters
branch_name_position – Where in the URL to look for the name. If 0, the domain will be used. Otherwise, indexes into all available sub-domains: 1 would retrieve the first sub-domain from the right, 2 the second, and so on.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
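The positional indexing described above can be sketched as follows. This stand-in (the function branch_name is hypothetical) naively assumes a two-label registered domain such as example.com, so multi-part TLDs like .co.uk would need extra handling:

```python
def branch_name(hostname: str, branch_name_position: int = 1) -> str:
    """Illustrative stand-in: 0 returns the domain itself, 1 the first
    sub-domain from the right, 2 the second, and so on."""
    labels = hostname.split(".")   # e.g. ["a", "b", "example", "com"]
    subdomains = labels[:-2]       # everything left of the registered domain
    if branch_name_position == 0:
        return labels[-2]          # the domain, e.g. "example"
    return subdomains[-branch_name_position]

first = branch_name("subdomain.example.com")                      # "subdomain"
second = branch_name("a.b.example.com", branch_name_position=2)   # "a"
```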
- class scrawler.data_extractors.UrlCategoryExtractor(category_position: int = 2, **kwargs)[source]¶
Try to identify the category of a given URL as the directory specified by category_position.
- Parameters
category_position – Specify at which position in the path the category can be found.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.UrlExtractor(**kwargs)[source]¶
Returns the website’s URL.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.WebsiteTextExtractor(mode: str = 'auto', min_length: int = 30, tag_types: tuple = 'div', tag_attrs: dict = {'class': ['content']}, allowed_string_types: List[bs4.element.NavigableString] = [<class 'bs4.element.NavigableString'>], separator: str = '[SEP]', **kwargs)[source]¶
Get readable website text, excluding <script>, <style>, <template> and other non-readable text. Several modes are available to make sure to only capture relevant text.
- Parameters
mode – Default mode is auto, which uses the readability algorithm to only extract a website’s article text. If all_strings, all readable website text (excluding script, style and other tags as well as HTML comments) will be retrieved. See also the BeautifulSoup documentation for the get_text() method. If by_length, the min_length parameter will be used to determine the minimum length of HTML strings to be included in the text. If search_in_tags, the tags dictionary will be used to identify the tags that include text.
min_length – If using mode by_length, this is the minimum length of a string to be considered. Shorter strings will be discarded.
tag_types – Describes the tag types to find, e. g. div.
tag_attrs – Specifies HTML attributes and their values in a key-value dict format. Example: {"class": ["content", "main-content"]}.
allowed_string_types – List of types that are considered to be readable. This makes sure that scripts and similar types are excluded. Note that the types passed here have to inherit from bs4.NavigableString.
separator – String to be used as separator when concatenating all found strings.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
scraping¶
- class scrawler.scraping.Scraper(urls: Union[list, str], search_attributes: scrawler.attributes.SearchAttributes, export_attributes: Optional[scrawler.attributes.ExportAttributes] = None, user_agent: Optional[str] = None, timeout: Optional[Union[int, aiohttp.client.ClientTimeout]] = None, backend: str = 'asyncio', validate_input_parameters: bool = True)[source]¶
Scrape website or multiple websites in parallel.
- Parameters
urls – Website URL or list of all URLs to scrape.
search_attributes – Specify which data to collect/search for in websites.
export_attributes – Specify how and where to export the collected data (as CSV).
user_agent – Optionally specify a user agent for making the HTTP request.
timeout – Timeout to be used when making HTTP requests. Note that the values specified here apply to each request individually, not to an entire session. When using the asyncio_backend, you can pass an aiohttp.ClientTimeout object where you can specify detailed timeout settings. Alternatively, you can pass an integer that will be interpreted as the total timeout for one request in seconds. If nothing is passed, a default timeout will be used.
backend – “asyncio” to use the asyncio_backend (faster when crawling many domains at once, but less stable and may hang). “multithreading” to use the multithreading_backend (more stable, but most likely slower). See also Why are there two backends?
validate_input_parameters – Whether to validate input parameters. Note that this validates that all URLs work and that the various attributes work together. However, the attributes themselves are also validated independently. You will need to also pass validate=False to the attributes individually to completely disable input validation.
- export_data(export_attrs: Optional[scrawler.attributes.ExportAttributes] = None, export_as_one_file: bool = True) None [source]¶
Export data previously collected from scraping task.
- Parameters
export_attrs – ExportAttributes object specifying export parameters.
export_as_one_file – If True, the data will be exported in one CSV file, each line representing one scraped URL.
- run(export_immediately: bool = False) List[List[Any]] [source]¶
Execute the scraping task and return the results.
- Parameters
export_immediately – May be used when scraping many sites at once. In order to prevent a MemoryError, data will be exported as soon as it is ready and then discarded to make room for the next sites.
- Returns
The result is a list with one entry per processed URL (result = [url1, url2, …]). Each URL entry is a list with one entry per extracted datapoint (url = [datapoint1, datapoint2, …]).
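Unlike the crawler's three-layer result, the scraper returns two layers. Again with hypothetical data for illustration:

```python
# Hypothetical result of Scraper.run() for three URLs (illustrative data only).
result = [
    ["https://example.com/", "Example Title", 200],   # one entry per datapoint
    ["https://example.org/", "Another Title", 200],
    ["https://example.net/", "Third Title", 404],
]

# Indexing follows result[url][datapoint]:
status_of_third = result[2][2]  # 404
```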
- run_and_export(export_attrs: Optional[scrawler.attributes.ExportAttributes] = None) None [source]¶
Shorthand for Scraper.run(export_immediately=True).
- Parameters
export_attrs – ExportAttributes object specifying export parameters.
website¶
- class scrawler.website.Website(url: str, steps_from_start_page: Optional[int] = None)[source]¶
The Website object is a wrapper around a BeautifulSoup object from a website’s HTML text, while adding additional information such as the URL and the HTTP response when fetching the website.
- Parameters
url – Website URL.
steps_from_start_page – Specifies number of steps from start URL to reach the given URL. Note that this is an optional parameter used in conjunction with the Crawler object.
- Raises
Exceptions raised during URL parsing.
- fetch(**kwargs)[source]¶
Fetch website from given URL and construct BeautifulSoup from response data.
- Parameters
kwargs – Are passed on to get_html().
- Raises
Exceptions from making the request (using requests.get()) and HTML parsing.
- Returns
Website object with BeautifulSoup properties.
- async fetch_async(session: aiohttp.client.ClientSession, **kwargs)[source]¶
Asynchronously fetch website from given URL and construct BeautifulSoup from response data.
- Parameters
session – aiohttp.ClientSession to be used for making the request asynchronously.
kwargs – Are passed on to async_get_html().
- Raises
Exceptions from making the request (using aiohttp.ClientSession.get()) and HTML parsing.
- Returns
Website object with BeautifulSoup properties.
- html_text¶
Website’s HTML text as a string. Only available after retrieving the Website using fetch() or fetch_async().
- http_response¶
HTTP response as requests.Response or aiohttp.ClientResponse (depending on whether the website was fetched with fetch() or fetch_async()). Only available after retrieving the Website using fetch() or fetch_async().
- steps_from_start_page¶
Number of steps from the start URL to reach this URL during crawling. This has to be passed during object initialization, which is done automatically in crawl_domain() and async_crawl_domain().
- url¶
Website URL.