Reference¶
backends¶
asyncio_backend¶
- async scrawler.backends.asyncio_backend.async_crawl_domain(start_url: str, session: aiohttp.client.ClientSession, search_attributes: scrawler.attributes.SearchAttributes, export_attrs: Optional[scrawler.attributes.ExportAttributes] = None, user_agent: Optional[str] = None, pause_time: float = 0.5, respect_robots_txt: bool = True, max_no_urls: int = inf, max_distance_from_start_url: int = inf, max_subdirectory_depth: int = inf, filter_non_standard_schemes: bool = True, filter_media_files: bool = True, blocklist: Iterable = (), filter_foreign_urls: Union[str, Callable] = 'auto', strip_url_parameters: bool = False, strip_url_fragments: bool = True, return_type: str = 'data', progress_bar: Optional[scrawler.utils.general_utils.ProgressBar] = None, current_index: Optional[int] = None, semaphore: Optional[asyncio.locks.Semaphore] = None, **kwargs)[source]¶
Collect data from all sites of a given domain. The sites within the domain are found automatically by iteratively searching for all links inside all pages.
- Parameters
start_url – The first URL to be accessed. From here, links will be extracted and iteratively processed to find all linked sites.
search_attributes – Dictionary specifying what to search for and how to search it.
export_attrs – Optional. If specified, the crawled data is exported as soon as it’s ready, not after the entire crawling has finished.
user_agent – Optionally specify a user agent for making the HTTP request.
pause_time – Time to wait between the crawling of two URLs (in seconds).
respect_robots_txt – Whether to respect the specifications made in the website’s robots.txt file.
max_no_urls – Maximum number of URLs to be crawled (safety limit for very large crawls).
max_distance_from_start_url – Maximum number of links that have to be followed to arrive at a certain URL from the start_url.
max_subdirectory_depth – Maximum sub-level of the host up to which to crawl. E.g., consider this schema: hostname/sub-directory1/sub-siteA. If you want to crawl all URLs at the same level as sub-directory1, specify 1. sub-siteA will then not be found, but hostname/sub-directory2 or hostname/sub-siteB will be.
filter_non_standard_schemes – See filter_urls().
filter_media_files – See filter_urls().
blocklist – See filter_urls().
filter_foreign_urls – See filter_urls().
strip_url_parameters – See strip_unnecessary_url_parts().
strip_url_fragments – See strip_unnecessary_url_parts().
return_type – Specify which values to return (“all”, “none”, “data”).
progress_bar – If a ProgressBar object is passed, prints a progress bar on the command line.
current_index – Internal index needed to allow dynamic parameters (parameters where a list of values has been passed and only the values relevant to the currently processed URL should be used; for example, export_attrs may contain a list of filenames, and only the relevant filename for the currently processed URL should be used). See this explanation for details.
semaphore – asyncio.Semaphore used for controlling the number of concurrent tasks run.
session – aiohttp.ClientSession used to make requests in a concurrent manner.
- Returns
List of the data collected from all URLs that were found using start_url as the starting point.
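The semaphore parameter caps how many fetches are in flight at once. The pattern can be sketched with the standard library alone; fetch_one here is a hypothetical stand-in coroutine, not scrawler's actual fetching code:

```python
import asyncio

async def fetch_one(url: str, semaphore: asyncio.Semaphore) -> str:
    # The semaphore lets at most N coroutines pass this point concurrently.
    async with semaphore:
        await asyncio.sleep(0.01)  # stand-in for the actual HTTP request
        return f"data from {url}"

async def crawl(urls: list[str], max_concurrency: int = 3) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrency)
    # All fetches are scheduled at once; the semaphore throttles execution.
    return await asyncio.gather(*(fetch_one(u, semaphore) for u in urls))

results = asyncio.run(crawl([f"https://example.com/page{i}" for i in range(5)]))
```

Results come back in the order the URLs were passed, since asyncio.gather preserves input order.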
- async scrawler.backends.asyncio_backend.async_scrape_site(url: str, session: aiohttp.client.ClientSession, search_attrs: scrawler.attributes.SearchAttributes, export_attrs: Optional[scrawler.attributes.ExportAttributes] = None, user_agent: Optional[str] = None, current_index: Optional[int] = None, progress_bar: Optional[scrawler.utils.general_utils.ProgressBar] = None) list [source]¶
Scrape the data specified in search_attrs from one website.
- Parameters
url – URL to be scraped.
session – aiohttp.ClientSession used to make requests in a concurrent manner.
search_attrs – Specify which data to collect/search for in the website.
export_attrs – Specify how and where to export the collected data (as CSV).
user_agent – Optionally specify a user agent for making the HTTP request.
current_index – Internal index needed to allow dynamic parameters (parameters where a list of values has been passed and only the values relevant to the currently processed URL should be used; for example, export_attrs may contain a list of filenames, and only the relevant filename for the currently processed URL should be used). See this explanation for details.
progress_bar – If a ProgressBar object is passed, prints a progress bar on the command line.
- Returns
List of data collected from the website.
multithreading_backend¶
- scrawler.backends.multithreading_backend.crawl_domain(start_url: str, search_attributes: scrawler.attributes.SearchAttributes, export_attrs: Optional[scrawler.attributes.ExportAttributes] = None, user_agent: Optional[str] = None, pause_time: float = 0.5, respect_robots_txt: bool = True, max_no_urls: int = inf, max_distance_from_start_url: int = inf, max_subdirectory_depth: int = inf, filter_non_standard_schemes: bool = True, filter_media_files: bool = True, blocklist: Iterable = (), filter_foreign_urls: Union[str, Callable] = 'auto', strip_url_parameters: bool = False, strip_url_fragments: bool = True, return_type: str = 'data', progress_bar: Optional[scrawler.utils.general_utils.ProgressBar] = None, current_index: Optional[int] = None, **kwargs)[source]¶
Collect data from all sites of a given domain. The sites within the domain are found automatically by iteratively searching for all links inside all pages.
- Parameters
start_url – The first URL to be accessed. From here, links will be extracted and iteratively processed to find all linked sites.
search_attributes – Dictionary specifying what to search for and how to search it.
export_attrs – Optional. If specified, the crawled data is exported as soon as it’s ready, not after the entire crawling has finished.
user_agent – Optionally specify a user agent for making the HTTP request.
pause_time – Time to wait between the crawling of two URLs (in seconds).
respect_robots_txt – Whether to respect the specifications made in the website’s robots.txt file.
max_no_urls – Maximum number of URLs to be crawled (safety limit for very large crawls).
max_distance_from_start_url – Maximum number of links that have to be followed to arrive at a certain URL from the start_url.
max_subdirectory_depth – Maximum sub-level of the host up to which to crawl. E.g., consider this schema: hostname/sub-directory1/sub-siteA. If you want to crawl all URLs at the same level as sub-directory1, specify 1. sub-siteA will then not be found, but hostname/sub-directory2 or hostname/sub-siteB will be.
filter_non_standard_schemes – See filter_urls().
filter_media_files – See filter_urls().
blocklist – See filter_urls().
filter_foreign_urls – See filter_urls().
strip_url_parameters – See strip_unnecessary_url_parts().
strip_url_fragments – See strip_unnecessary_url_parts().
return_type – Specify which values to return (“all”, “none”, “data”).
progress_bar – If a ProgressBar object is passed, prints a progress bar on the command line.
current_index – Internal index needed to allow dynamic parameters (parameters where a list of values has been passed and only the values relevant to the currently processed URL should be used; for example, export_attrs may contain a list of filenames, and only the relevant filename for the currently processed URL should be used). See this explanation for details.
- Returns
List of the data collected from all URLs that were found using start_url as the starting point.
- scrawler.backends.multithreading_backend.scrape_site(url: str, search_attrs: scrawler.attributes.SearchAttributes, export_attrs: Optional[scrawler.attributes.ExportAttributes] = None, user_agent: Optional[str] = None, current_index: Optional[int] = None, progress_bar: Optional[scrawler.utils.general_utils.ProgressBar] = None) list [source]¶
Scrape the data specified in search_attrs from one website.
- Parameters
url – URL to be scraped.
search_attrs – Specify which data to collect/search for in the website.
export_attrs – Specify how and where to export the collected data (as CSV).
user_agent – Optionally specify a user agent for making the HTTP request.
current_index – Internal index needed to allow dynamic parameters (parameters where a list of values has been passed and only the values relevant to the currently processed URL should be used; for example, export_attrs may contain a list of filenames, and only the relevant filename for the currently processed URL should be used). See this explanation for details.
progress_bar – If a
ProgressBar
object is passed, prints a progress bar on the command line.
- Returns
List of data collected from the website.
utils¶
file_io_utils¶
Functions for local file import/export operations, e.g. CSV file reading and writing.
- scrawler.utils.file_io_utils.export_to_csv(data, directory: str, fn: str, header: Optional[Union[list, str, bool]] = None, encoding: str = 'utf-8', separator: str = ',', quoting: int = 0, escapechar: Optional[str] = None, current_index: Optional[int] = None, **kwargs) None [source]¶
Export data to a CSV file.
- Parameters
data – One- or two-dimensional data that will be parsed to a pandas.DataFrame.
directory – Path to directory where file will be saved.
fn – Filename (without file extension).
header – If None or False, no header will be written. If first-row or True, uses the first row of data as the header. Otherwise, pass a list of strings of appropriate length.
encoding – Encoding to use to create the CSV file.
separator – Column separator or delimiter to use for creating the CSV file.
quoting – Puts quotes around cells that contain the separator character.
escapechar – Escapes the separator character.
current_index – If fn is a list of filenames, use this to specify which filename to use.
kwargs – Any parameter supported by pandas.DataFrame.to_csv() can be passed.
- scrawler.utils.file_io_utils.get_data_in_dir(directory: str, start_idx: int = 0, end_idx: Optional[int] = None, encoding: str = 'utf-8', separator: str = ',') list [source]¶
Read all CSV files within a directory. All files in the directory must be CSV files.
- Parameters
directory – Path to the directory.
start_idx – Sometimes, not all CSV files in the directory should be read. Together with end_idx, this parameter allows specifying an interval of files to read in, e.g. the first up to the 5th file.
end_idx – See start_idx.
encoding – The character encoding of the CSV files to be read.
separator – The separator/delimiter of the CSV files to be read.
- scrawler.utils.file_io_utils.multithreaded_csv_export(list_of_datasets: list, **kwargs) None [source]¶
Export a list of multi-column datasets to CSV files in parallel using multithreading.
- Parameters
list_of_datasets – List of two-dimensional data objects that will be parsed to a pandas.DataFrame.
kwargs – Keyword arguments that are passed on to export_to_csv().
general_utils¶
General purpose utility functions.
- class scrawler.utils.general_utils.ProgressBar(total_length: int = 0, progress: int = 0, custom_message: str = '', width_in_command_line: int = 100, progress_char: str = '█', remaining_char: str = '-')[source]¶
Print a progress bar in the command line interface.
Default looks like this:
Custom Message |██████████----------| 50.0% (5 / 10)
- Parameters
total_length – Absolute length of the tracked quantity (e.g. total download size = 20,000 bytes).
progress – Share of total_length already reached (e.g. 10,000 bytes already downloaded).
custom_message – String to appear to the left of the progress bar.
width_in_command_line – Number of characters used in print to display the progress bar.
progress_char – Character to use for filling the progress bar.
remaining_char – Character to use for the space not yet filled by progress.
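The default rendering shown above can be reproduced with plain string formatting. A minimal sketch (not scrawler's actual implementation; width is fixed at 20 bar characters here to match the example, whereas the class defaults width_in_command_line to 100):

```python
def render_progress_bar(total_length: int, progress: int,
                        custom_message: str = "", width: int = 20,
                        progress_char: str = "█", remaining_char: str = "-") -> str:
    # Compute the filled share of the bar and format it like the example above.
    share = progress / total_length if total_length else 0.0
    filled = int(width * share)
    bar = progress_char * filled + remaining_char * (width - filled)
    return f"{custom_message} |{bar}| {share * 100:.1f}% ({progress} / {total_length})"

print(render_progress_bar(10, 5, "Custom Message"))
```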
validation_utils¶
Functions to make sure the specifications for a crawling/scraping are valid and work together correctly.
- scrawler.utils.validation_utils.validate_input_params(urls: List[str], search_attrs: scrawler.attributes.SearchAttributes, export_attrs: Optional[scrawler.attributes.ExportAttributes] = None, crawling_attrs: Optional[scrawler.attributes.CrawlingAttributes] = None, **kwargs)[source]¶
Validate that all URLs work and the various attributes work together.
web_utils¶
Functions for web operations (e.g. working with URLs and retrieving data from websites).
- class scrawler.utils.web_utils.ParsedUrl(url: str)[source]¶
Parse a URL string into its various parts. Basically a wrapper around tld.Result to make accessing elements easier.
- Parameters
url – URL string to parse.
- Raises
Exception – Exceptions from TLD package if the URL is invalid.
- url¶
Entire URL. In the following, this example URL is used to illustrate the various URL parts:
http://username:password@some.subdomain.example.co.uk/path1/path2?param="abc"#xyz
- async scrawler.utils.web_utils.async_get_html(url: str, session: aiohttp.client.ClientSession, user_agent: Optional[str] = None, verify: bool = True, max_content_length: int = - 1, check_http_content_type: bool = True, return_response_object: bool = False, raise_for_status: bool = False, **kwargs) Union[str, Tuple[str, aiohttp.client_reqrep.ClientResponse]] [source]¶
Collect HTML text of a given URL.
- Parameters
url – URL to retrieve the HTML from.
session – aiohttp.ClientSession to be used for making the request asynchronously.
user_agent – Optionally specify a different user agent than the default Python user agent.
verify – Whether to verify the server’s TLS certificate. Useful if TLS connections fail, but should in general be True to avoid man-in-the-middle attacks.
max_content_length – Check the HTTP header for the attribute content-length. If it is bigger than this parameter, a ValueError is raised. Set to -1 when not needed.
check_http_content_type – Whether to check the HTTP header field content-type. If it does not include text, a ValueError is raised.
return_response_object – If True, also returns the ClientResponse object from the GET request.
raise_for_status – If True, raise an HTTPError if the HTTP request returned an unsuccessful status code.
kwargs – Will be passed on to aiohttp.ClientSession.get().
- Returns
HTML text from the given URL. Optionally also returns the HTTP response object.
- Raises
aiohttp.ClientError, aiohttp.HTTPError, ValueError – Errors derived from aiohttp.ClientError include InvalidURL, ClientConnectionError and ClientResponseError. May optionally raise aiohttp.HTTPError (if raise_for_status is True) or ValueError (if the check_http_content_type or max_content_length checks fail).
- async scrawler.utils.web_utils.async_get_redirected_url(url: str, session: aiohttp.client.ClientSession, max_redirects_to_follow: int = 100, **kwargs) str [source]¶
Find final, redirected URL. Supports both HTTP redirects and HTML redirects. Also follows up on multiple redirects.
- Parameters
url – Original URL.
session – aiohttp.ClientSession to be used for making the request asynchronously.
max_redirects_to_follow – Maximum number of redirects to follow, to guard against infinite redirect loops. If the limit is reached, None is returned.
kwargs – Passed on to async_get_html().
- Returns
URL after redirects. If the URL is invalid or an error occurs, returns None.
- async scrawler.utils.web_utils.async_get_robot_file_parser(start_url: str, session: aiohttp.client.ClientSession, **kwargs) Optional[urllib.robotparser.RobotFileParser] [source]¶
Returns a RobotFileParser from the given URL. If no robots.txt file is found or an error occurs, returns None.
- Parameters
start_url – URL from which robots.txt will be collected.
session – aiohttp.ClientSession to use for making the request.
kwargs – Will be passed to get_html().
- scrawler.utils.web_utils.extract_same_host_pattern(base_url: str) str [source]¶
Looks at the passed base/start URL to determine which mode for is_same_host() is appropriate. First looks at whether the given URL contains a non-empty path. If one is found, the number of directories X is counted and directoryX is returned. Otherwise, checks whether the URL contains subdomains. If found, the number of subdomains X is counted and subdomainX is returned. If neither exists, returns fld.
See also
- scrawler.utils.web_utils.filter_urls(urls: Iterable, filter_non_standard_schemes: bool, filter_media_files: bool, blocklist: Iterable, filter_foreign_urls: Union[str, callable], base_url: Optional[str] = None, return_discarded: bool = False, **kwargs) Union[set, Tuple[set, set]] [source]¶
Filter a list of URLs along some given attributes.
- Parameters
urls – List of URLs to filter.
filter_non_standard_schemes – If True, makes sure that the URLs start with http: or https:.
filter_media_files – If True, discards URLs having media file extensions like .pdf or .jpeg. For details, see is_media_file().
blocklist – Specify a list of words or parts that, if they appear in a URL, cause the URL to be discarded (e.g. ‘git.’, ‘datasets.’).
filter_foreign_urls – Specify how to detect foreign URLs. Can either be a string that is passed to is_same_host() (for details on possible strings, see is_same_host(); note that the base_url parameter has to be passed for this to work), or a custom Callable with two parameters, url1 and url2. The first URL is the one to be checked, and the second is the reference (the crawling start URL). This function should return True for URLs that belong to the same host, and False for foreign URLs.
base_url – Used in conjunction with the filter_foreign_urls parameter to detect foreign URLs.
return_discarded – If True, also returns the discarded URLs.
- Returns
Set containing the URLs that were not filtered out. Optionally also returns the discarded URLs.
See also
is_media_file() – Checks whether the URL ends in a file extension on an allowlist, indicating it is not a media file.
is_same_host() – Checks whether two URLs have the same host.
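The scheme, media-file, and blocklist checks can be illustrated with the standard library. This sketch mirrors the filtering logic described above, but it is not scrawler's code: the function name, and the small extension set, are illustrative assumptions:

```python
from urllib.parse import urlparse

# Illustrative subset; the real is_media_file() uses allow/block lists.
MEDIA_EXTENSIONS = {".pdf", ".jpeg", ".jpg", ".png", ".zip"}

def filter_urls_sketch(urls, blocklist=()):
    kept, discarded = set(), set()
    for url in urls:
        parsed = urlparse(url)
        is_standard = parsed.scheme in ("http", "https")
        is_media = any(parsed.path.lower().endswith(ext) for ext in MEDIA_EXTENSIONS)
        is_blocked = any(part in url for part in blocklist)
        (discarded if (not is_standard or is_media or is_blocked) else kept).add(url)
    return kept, discarded

kept, discarded = filter_urls_sketch(
    ["https://example.com/a", "mailto:x@example.com",
     "https://example.com/b.pdf", "https://git.example.com/c"],
    blocklist=("git.",),
)
```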
- scrawler.utils.web_utils.fix_relative_urls(urls: Iterable, base_url: str) set [source]¶
Make relative URLs absolute by joining them with the base URL that they were found on.
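This is standard URL joining; with the standard library it can be sketched as follows (the function name is hypothetical, but urljoin is exactly the stdlib call for this behavior):

```python
from urllib.parse import urljoin

def fix_relative_urls_sketch(urls, base_url: str) -> set:
    # Relative URLs are resolved against the page they were found on;
    # already-absolute URLs pass through urljoin unchanged.
    return {urljoin(base_url, url) for url in urls}

result = fix_relative_urls_sketch(
    ["/about", "contact.html", "https://other.example.org/x"],
    base_url="https://example.com/en/index.html",
)
```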
- scrawler.utils.web_utils.get_directory_depth(url: str) Optional[int] [source]¶
Returns the directory level that a given document is in. For example, https://example.com/en/directoryA/document.html returns 3, because document.html is 3 directories deep into the website’s structure. Further, https://example.com/en/ returns 1 (the trailing / is ignored), and https://example.com returns 0.
- Parameters
url – URL whose subdirectory depth should be determined.
- Returns
Subdirectory level as path depth. If the URL is invalid, returns None.
- scrawler.utils.web_utils.get_html(url: str, timeout: int = 15, user_agent: Optional[str] = None, verify: bool = True, stream: bool = True, max_content_length: int = -1, check_http_content_type: bool = True, return_response_object: bool = False, raise_for_status: bool = False) Union[Tuple[str, requests.models.Response], str] [source]¶
Collect HTML text of a given URL.
- Parameters
url – URL to retrieve the HTML from.
timeout – If the server does not answer within the number of seconds specified here, a Timeout exception is raised.
user_agent – Optionally specify a different user agent than the default Python user agent.
verify – Whether to verify the server’s TLS certificate. Useful if TLS connections fail, but should in general be True to avoid man-in-the-middle attacks.
stream – If True, only the header of the response is retrieved at first. This allows for HTTP content type checking before actually retrieving the content. For details, see the Requests documentation.
max_content_length – Check the HTTP header for the attribute content-length. If it is bigger than this parameter, a ValueError is raised. Set to -1 when not needed.
check_http_content_type – Check the HTTP header for the attribute content-type. If it does not include text, a ValueError is raised.
return_response_object – If True, also returns the Response object from the GET request.
raise_for_status – If True, raise an HTTPError if the HTTP request returned an unsuccessful status code.
- Returns
HTML text from the given URL.
- Raises
ConnectionError, Timeout, other RequestExceptions, HTTPError, ValueError – Raises errors from the requests library when retrieval errors occur. Optionally raises HTTPError (if raise_for_status is True) and ValueError (if the check_http_content_type or max_content_length checks fail).
- scrawler.utils.web_utils.get_redirected_url(url: str, max_redirects_to_follow: int = 100, **kwargs) str [source]¶
Find final, redirected URL. Supports both HTTP redirects and HTML redirects. Also follows up on multiple redirects.
- Parameters
url – Original URL.
max_redirects_to_follow – Maximum number of redirects to follow, to guard against infinite redirect loops. If the limit is reached, None is returned.
kwargs – Passed on to get_html().
- Returns
URL after redirects. If the URL is invalid or an error occurs, returns None.
- scrawler.utils.web_utils.get_robot_file_parser(start_url: str, **kwargs) Optional[urllib.robotparser.RobotFileParser] [source]¶
Returns a RobotFileParser object from the given URL. If no robots.txt file is found or an error occurs, returns None.
- Parameters
start_url – URL from which robots.txt will be collected.
kwargs – Will be passed to get_html().
See also
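Once obtained, the parser answers per-URL permission queries. A standard-library illustration, constructed from an inline robots.txt rather than one fetched from a start URL:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse rules directly instead of fetching them over HTTP.
rp.parse(["User-agent: *", "Disallow: /private/"])

allowed = rp.can_fetch("*", "https://example.com/public/page")   # not disallowed
blocked = rp.can_fetch("*", "https://example.com/private/page")  # matches Disallow
```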
- scrawler.utils.web_utils.is_media_file(url: str, disallow_approach: bool = False, check_http_header: bool = False) bool [source]¶
Checks whether the URL ends in a file extension on an allowlist, indicating it is not a media file.
- Parameters
url – URL to be checked.
disallow_approach – If True, uses a blocklist approach, where file extensions known to be media file extensions are blocked. Note that while the blocklist used covers the most frequent file extensions, it certainly is not complete. Using the default allowlist approach guarantees that no URLs with anything but a text file extension are processed.
check_http_header – Look up the HTTP header attribute content-type and check whether it contains text/html. Note that enabling this makes the function execute much slower, because an HTTP request is made instead of just checking a string.
- Returns
True/False
- scrawler.utils.web_utils.is_same_host(url1: str, url2: str, mode: str = 'hostname') bool [source]¶
Checks whether two URLs have the same host. A comparison mode can be defined which determines the parts of the URLs that are checked for equality.
- Parameters
url1 – First URL to compare.
url2 – Second URL to compare.
mode – String describing which URL parts to check for equality. Can be any one of the attributes of the ParsedUrl class (e.g. domain, hostname, fld). Alternatively, can be set to subdomainX, with X representing an integer number up to which subdomain the URLs should be compared. E.g., when comparing http://www.sub.example.com and http://blog.sub.example.com, sub is the first level, while the second levels are www and blog, respectively. Or, can be set to directoryX, with X representing an integer number up to which directory the URLs should be compared. E.g., for http://example.com/dir1/dir2/index.html, directory2 would include all files in dir2.
- Returns
True or False. If exceptions occur, the method returns False.
- Raises
ValueError – If an invalid mode is specified.
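As an illustration of the hostname and subdomainX modes, here is a hypothetical, simplified re-implementation. Unlike the real function, which builds on ParsedUrl and the tld package, this sketch naively assumes a two-label registered domain (so it would mishandle e.g. example.co.uk):

```python
from urllib.parse import urlparse

def is_same_host_sketch(url1: str, url2: str, mode: str = "hostname") -> bool:
    h1 = urlparse(url1).hostname or ""
    h2 = urlparse(url2).hostname or ""
    if mode == "hostname":
        return h1 == h2
    if mode.startswith("subdomain"):
        # Compare the registered domain plus the X rightmost subdomain labels.
        x = int(mode[len("subdomain"):])
        keep = 2 + x  # naive: assumes a two-label registered domain like example.com
        return h1.split(".")[-keep:] == h2.split(".")[-keep:]
    raise ValueError(f"invalid mode: {mode}")
```

With the documented example URLs, subdomain1 compares only up to sub and matches, while subdomain2 also compares www against blog and does not.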
- scrawler.utils.web_utils.strip_unnecessary_url_parts(urls: Iterable, parameters: bool = False, fragments: bool = True) set [source]¶
Strip unnecessary URL parts.
- Parameters
urls – URLs to be stripped (can be any Iterable).
parameters – If True, strips URL query parameters (they always start with a ?) from the URL.
fragments – If True, strips URL fragments (introduced with #), except for relevant fragments using Google’s hash-bang syntax.
- Returns
Set of URLs with the specified parts stripped.
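A sketch of the stripping behavior using the standard library; the hash-bang handling here (a plain "#!" substring check) is a simplification of what the function describes, and the function name is hypothetical:

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit

def strip_url_parts_sketch(urls, parameters: bool = False, fragments: bool = True) -> set:
    out = set()
    for url in urls:
        if fragments and "#!" not in url:  # keep Google hash-bang fragments
            url = urldefrag(url).url
        if parameters:
            scheme, netloc, path, _query, frag = urlsplit(url)
            url = urlunsplit((scheme, netloc, path, "", frag))
        out.add(url)
    return out

stripped = strip_url_parts_sketch({"https://example.com/a?x=1#top",
                                   "https://example.com/b#!/state"})
```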
attributes¶
Specifies the attribute objects used by crawlers and scrapers.
- class scrawler.attributes.CrawlingAttributes(filter_non_standard_schemes: bool = True, filter_media_files: bool = True, blocklist: tuple = (), filter_foreign_urls: Union[str, Callable] = 'auto', strip_url_parameters: bool = False, strip_url_fragments: bool = True, max_no_urls: Optional[int] = None, max_distance_from_start_url: Optional[int] = None, max_subdirectory_depth: Optional[int] = None, pause_time: float = 0.5, respect_robots_txt: bool = True, validate: bool = True)[source]¶
Specify how to conduct the crawling, including filtering irrelevant URLs or limiting the number of crawled URLs.
- Parameters
filter_non_standard_schemes – Filter URLs starting with schemes other than http: or https: (e.g., mailto: or javascript:).
filter_media_files – Whether to filter media files. Recommended: True to avoid long runtimes caused by large file downloads.
blocklist – Filter URLs that contain one or more of the parts specified here. Has to be a list.
filter_foreign_urls – Filter URLs that do not belong to the same host (foreign URLs). Can either be a string that is passed to is_same_host(), or a custom Callable with two arguments, url1 and url2. In is_same_host(), the following string values are permitted: 1. auto: Automatically extracts a matching pattern from the start URL (see extract_same_host_pattern() for details). 2. Any one of the attributes of the ParsedUrl class (e.g. domain, hostname, fld). 3. subdomainX, with X representing an integer number up to which subdomain the URLs should be compared. E.g., when comparing http://www.sub.example.com and http://blog.sub.example.com, sub is the first level, while the second levels are www and blog, respectively. 4. directoryX, with X representing an integer number up to which directory the URLs should be compared. E.g., for http://example.com/dir1/dir2/index.html, directory2 would include all files in dir2.
strip_url_parameters – Whether to strip URL query parameters (prefixed by ?) from the URL.
strip_url_fragments – Whether to strip URL fragments (prefixed by #) from the URL.
max_no_urls – Maximum number of URLs to be crawled per domain (safety limit for very large crawls). Set to None if you want all URLs to be crawled.
max_distance_from_start_url – Maximum number of links that have to be followed to arrive at a certain URL from the start URL.
max_subdirectory_depth – Maximum sub-level of the host up to which to crawl. E.g., consider this schema: hostname/sub-directory1/sub-siteA. If you want to crawl all URLs at the same level as sub-directory1, specify 1. sub-siteA will then not be found, but hostname/sub-directory2 or hostname/sub-siteB will be.
pause_time – Time to wait between the crawling of two URLs (in seconds).
respect_robots_txt – Whether to respect the specifications made in the website’s robots.txt file.
validate – Whether to make sure that input parameters are valid.
- class scrawler.attributes.ExportAttributes(directory: str, fn: Union[str, list], header: Optional[Union[list, str, bool]] = None, encoding: str = 'utf-8', separator: str = ',', quoting: int = 0, escapechar: Optional[str] = None, validate: bool = True, **kwargs)[source]¶
Specify how and where to export the collected data.
- Parameters
directory – Folder where file(s) will be saved to.
fn – Name(s) of the file(s) containing the crawled data. Without file extension.
header – Whether the final CSV file should have a header. Possible parameters: If None or False, no header will be written. If first-row or True, uses the first row of data as the header. Otherwise, pass a list of strings of appropriate length.
encoding – Encoding to use to create the CSV file.
separator – Column separator or delimiter to use for creating the CSV file.
quoting – Puts quotes around cells that contain the separator character.
escapechar – Escapes the separator character.
validate – Whether to make sure that input parameters are valid.
kwargs – Any parameter supported by pandas.DataFrame.to_csv() can be passed.
- class scrawler.attributes.SearchAttributes(*args: scrawler.data_extractors.BaseExtractor, validate: bool = True)[source]¶
Specify which data to collect/search for in the website.
- Parameters
args – Data extractors specifying which data to extract from websites (see the built-in data extractors for possibilities, or define a custom data extractor).
validate – Whether to make sure that input parameters are valid.
- extract_all_attrs_from_website(website: scrawler.website.Website, index: Optional[int] = None) list [source]¶
Extract data from a website using the data extractors specified in the SearchAttributes definition.
- Parameters
website – Website object to collect the specified data points from.
index – Optionally pass an index for data extractors that index into passed parameters. See this explanation for details.
crawling¶
- class scrawler.crawling.Crawler(urls: Union[str, List[str]], search_attributes: scrawler.attributes.SearchAttributes, export_attributes: Optional[scrawler.attributes.ExportAttributes] = None, crawling_attributes: scrawler.attributes.CrawlingAttributes = <scrawler.attributes.CrawlingAttributes object>, user_agent: Optional[str] = None, timeout: Optional[Union[int, aiohttp.client.ClientTimeout]] = None, backend: str = 'asyncio', parallel_processes: int = 4, validate_input_parameters: bool = True)[source]¶
Crawl a domain or multiple domains in parallel.
- Parameters
urls – Start URL of domain to crawl or list of all URLs to crawl.
search_attributes – Specify which data to collect/search for in websites.
export_attributes – Specify how and where to export the collected data (as CSV).
crawling_attributes – Specify how to conduct the crawling, e.g. how to filter irrelevant URLs or limit the number of URLs crawled.
user_agent – Optionally specify a user agent for making the HTTP request.
timeout – Timeout to be used when making HTTP requests. Note that the values specified here apply to each request individually, not to an entire session. When using the asyncio_backend, you can pass an aiohttp.ClientTimeout object in which you can specify detailed timeout settings. Alternatively, you can pass an integer that will be interpreted as the total timeout for one request in seconds. If nothing is passed, a default timeout will be used.
backend – “asyncio” to use the asyncio_backend (faster when crawling many domains at once, but less stable and may hang). “multithreading” to use the multithreading_backend (more stable, but most likely slower). See also Why are there two backends?
parallel_processes – Number of concurrent processes/threads to use. Can be very large when using the asyncio_backend. When using the multithreading_backend, it should not exceed 2x the CPU count of the machine running the crawl.
validate_input_parameters – Whether to validate input parameters. Note that this validates that all URLs work and that the various attributes work together. However, the attributes themselves are also validated independently. You will need to also pass validate=False to the attributes individually to completely disable input validation.
- export_data(export_attrs: Optional[scrawler.attributes.ExportAttributes] = None) None [source]¶
Export data previously collected from crawling task.
- Parameters
export_attrs – ExportAttributes object specifying export parameters.
- run(export_immediately: bool = False) List[List[List[Any]]] [source]¶
Execute the crawling task and return the results.
- Parameters
export_immediately – May be used when crawling many sites at once. In order to prevent a MemoryError, data will be exported as soon as it is ready and then discarded to make room for the next domains.
- Returns
The result is a list with three layers. The first layer has one entry per crawled domain (result = [domain1, domain2, …]). The second layer (representing each crawled domain) is a list with one entry per processed URL (domain = [url1, url2, …]). The third layer (representing each URL) is a list with one entry per extracted datapoint (url = [datapoint1, datapoint2, …]).
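The three-layer structure can be walked with plain nested loops. A sketch with dummy data standing in for an actual Crawler.run() result (the URL strings and datapoints are invented for illustration):

```python
# Dummy result: 2 domains, each with 2 processed URLs,
# each URL with 2 extracted datapoints.
result = [
    [["https://a.com/", "Title A"], ["https://a.com/x", "Title AX"]],
    [["https://b.com/", "Title B"], ["https://b.com/y", "Title BY"]],
]

rows = []
for domain in result:            # first layer: one entry per crawled domain
    for url_data in domain:      # second layer: one entry per processed URL
        rows.append(url_data)    # third layer: datapoints for that URL
```

Flattening the first two layers like this yields one row per processed URL, which is the shape typically written out as CSV.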
- run_and_export(export_attrs: Optional[scrawler.attributes.ExportAttributes] = None) None [source]¶
Shorthand for Crawler.run(export_immediately=True).
- Parameters
export_attrs – ExportAttributes object specifying export parameters.
data_extractors¶
- class scrawler.data_extractors.BaseExtractor(*args, dynamic_parameters: bool = False, n_return_values: Optional[int] = None, **kwargs)[source]¶
Provides the basic architecture for each data extractor. Every data extractor has to inherit from BaseExtractor.
- Parameters
args – Positional arguments to be used by children inheriting from BaseExtractor.
dynamic_parameters – Set this to True when you would like to pass a list to a certain parameter, and have each URL/scraping target use a different value from that list based on an index. See also here.
n_return_values – Specifies the number of values that will be returned by the extractor. This is almost always 1, but there are cases such as DateExtractor which may return more values. See also here.
kwargs – Keyword arguments to be used by children inheriting from BaseExtractor.
- run(website: scrawler.website.Website, index: Optional[int] = None)[source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- scrawler.data_extractors.supports_dynamic_parameters(func) Callable [source]¶
Function decorator to select correct parameter based on index when using dynamic parameters.
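A minimal sketch of how such a decorator can resolve list-valued ("dynamic") parameters by index. This is an illustrative stand-in written from the documented behavior, not the library's actual implementation; the names supports_dynamic_parameters_sketch and DemoExtractor are hypothetical:

```python
import functools

def supports_dynamic_parameters_sketch(func):
    """Illustrative stand-in: if dynamic parameters are enabled and an index is
    given, replace every list argument with its index-th element before calling
    the wrapped method."""
    @functools.wraps(func)
    def wrapper(self, *args, index=None, **kwargs):
        if getattr(self, "dynamic_parameters", False) and index is not None:
            args = tuple(a[index] if isinstance(a, list) else a for a in args)
            kwargs = {k: (v[index] if isinstance(v, list) else v)
                      for k, v in kwargs.items()}
        return func(self, *args, index=index, **kwargs)
    return wrapper

class DemoExtractor:
    dynamic_parameters = True

    @supports_dynamic_parameters_sketch
    def run(self, value, index=None):
        return value

# The second scraping target (index=1) receives the second list entry.
selected = DemoExtractor().run(["a", "b", "c"], index=1)  # "b"
```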
- class scrawler.data_extractors.AccessTimeExtractor(**kwargs)[source]¶
Returns the current time as time of access. To be exact, the time of processing.
- run(website: scrawler.website.Website, index: Optional[int] = None) datetime.datetime [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.CmsExtractor(**kwargs)[source]¶
Extract the Content Management System (CMS) used for building the website.
Note: This method uses the HTML generator meta tag and some hard-coded search terms. Therefore, not all systems will be identified correctly.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.ContactNameExtractor(tag_types: tuple = 'div', tag_attrs: dict = {'class': 'employee_name'}, separator: str = ';', **kwargs)[source]¶
Find contact name(s) for a given website.
- Parameters
tag_types – Specifies which kind of tags to look at (e. g., div or span).
tag_attrs – Provide additional attributes in a dictionary, e. g. {"class": "contact"}.
separator – When more than one contact is found, they are separated by the string given here.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.CustomStringPutter(string: Union[str, list], **kwargs)[source]¶
Simply returns a given string or entry from a list of strings. Background: Sometimes, a column should be appended with a custom label for a given website (for example, an external ID).
- Parameters
string – The string to be returned by the run() method. Can optionally pass a list here and use a different value for different URLs/domains that are scraped. In that case, remember to also pass use_index=True.
- Raises
IndexError – May raise an IndexError if the parameter string is passed a list and use_index=True. This may occur when you pass a list of custom strings shorter than the list of URLs crawled.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
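The index-based behavior and the documented IndexError case can be sketched in plain Python. The function put_string below is a hypothetical stand-in for the putter's core logic, not the real class:

```python
def put_string(string, index=None):
    """Illustrative stand-in: return a static string, or the index-th entry
    when a list of per-URL/per-domain labels is passed."""
    if isinstance(string, list):
        return string[index]  # IndexError if the list is shorter than the URL list
    return string

labels = ["id-1", "id-2"]          # custom labels for two crawled domains
first = put_string(labels, index=0)            # "id-1"
static = put_string("static-label", index=5)   # "static-label" regardless of index

# A third URL with only two labels triggers the documented IndexError:
handled = False
try:
    put_string(labels, index=2)
except IndexError:
    handled = True
```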
- class scrawler.data_extractors.DateExtractor(tag_types: tuple = 'meta', tag_attrs: dict = {'name': 'pubdate'}, return_year_month_day: bool = False, **kwargs)[source]¶
Get dates by looking at passed tag. Can optionally parse dates to year, month and day.
- Parameters
tag_types – Describes the tag types to find, e. g. meta.
tag_attrs – Specifies HTML attributes and their values in a key-value dict format. Example: {"name": "pubdate"}.
return_year_month_day – If True, returns the date as 3 integers: year (YYYY), month (MM) and day (dd).
- run(website: scrawler.website.Website, index: Optional[int] = None) Union[datetime.datetime, Tuple[int, int, int]] [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.DescriptionExtractor(**kwargs)[source]¶
Get website description (the one shown in search engine results) using two common description fields.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.DirectoryDepthExtractor(**kwargs)[source]¶
Returns the directory level that a given document is in.
For example, https://www.sub.example.com/dir1/dir2/file.html returns 3.
- run(website: scrawler.website.Website, index: Optional[int] = None) int [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
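The documented counting rule can be reproduced with the standard library's urllib.parse. This is an illustrative reimplementation of the behavior described above (the function name directory_depth is hypothetical); the actual extractor may differ in edge cases:

```python
from urllib.parse import urlparse

def directory_depth(url: str) -> int:
    """Count non-empty path segments, matching the documented example
    (https://www.sub.example.com/dir1/dir2/file.html -> 3)."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])

depth = directory_depth("https://www.sub.example.com/dir1/dir2/file.html")  # 3
root_depth = directory_depth("https://example.com/")                        # 0
```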
- class scrawler.data_extractors.ExpiryDateExtractor(return_year_month_day: bool = False, **kwargs)[source]¶
Get website expiry date from HTTP header or HTML Meta tag.
- run(website: scrawler.website.Website, index: Optional[int] = None) Union[datetime.datetime, Tuple[int, int, int]] [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.GeneralHtmlTagExtractor(tag_types: tuple, tag_attrs: dict, attr_to_extract: str, fill_empty_field: bool = True, **kwargs)[source]¶
General purpose extractor for extracting HTML tags and then extracting a single attribute from the tag.
- Parameters
tag_types – Describes the tag types to find, e. g. div.
tag_attrs – Specifies the HTML attributes used to find the relevant HTML tag in a key-value dict format. Example: {"class": ["content", "main-content"]}. See also this explanation of HTML tag attributes.
attr_to_extract – The attribute that should be extracted from the found HTML tag.
fill_empty_field – Used in cases where the specified attribute in the HTML tag exists but is empty. If True, returns the value specified in DEFAULT_EMPTY_FIELD_STRING. Otherwise, returns an empty string.
kwargs –
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
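The idea of finding a tag by its attributes and pulling out a single attribute value can be sketched with the standard library's html.parser (the library itself works on BeautifulSoup objects, so this AttrGrabber class is only an illustrative stand-in):

```python
from html.parser import HTMLParser

class AttrGrabber(HTMLParser):
    """Find the first tag of a given type whose attributes match, then
    extract one attribute from it (stand-in for GeneralHtmlTagExtractor)."""
    def __init__(self, tag_type, match_attrs, attr_to_extract):
        super().__init__()
        self.tag_type = tag_type
        self.match_attrs = match_attrs
        self.attr_to_extract = attr_to_extract
        self.result = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (self.result is None and tag == self.tag_type
                and all(attrs.get(k) == v for k, v in self.match_attrs.items())):
            self.result = attrs.get(self.attr_to_extract, "")

html = '<html><head><meta name="pubdate" content="2021-06-01"></head></html>'
parser = AttrGrabber("meta", {"name": "pubdate"}, "content")
parser.feed(html)
# parser.result is now "2021-06-01"
```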
- class scrawler.data_extractors.GeneralHttpHeaderFieldExtractor(field_to_extract: str, fill_empty_field: bool = True, **kwargs)[source]¶
General purpose extractor for extracting HTTP header fields.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.HttpStatusCodeExtractor(**kwargs)[source]¶
Get status code of HTTP request.
- run(website: scrawler.website.Website, index: Optional[int] = None) int [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.KeywordsExtractor(**kwargs)[source]¶
Get keywords from HTML keyword meta tag (if present).
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.LanguageExtractor(**kwargs)[source]¶
Get language of a given website from its HTML tag lang attribute.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.LastModifiedDateExtractor(return_year_month_day: bool = False, **kwargs)[source]¶
Get website last-modified date from HTTP header or HTML Meta tag.
- run(website: scrawler.website.Website, index: Optional[int] = None) Union[datetime.datetime, Tuple[int, int, int]] [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.LinkExtractor(**kwargs)[source]¶
Find all links from a website (without duplicates).
- run(website: scrawler.website.Website, index: Optional[int] = None) set [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.MobileOptimizedExtractor(**kwargs)[source]¶
Checks whether website is optimized for mobile usage by looking up the HTML viewport meta tag.
- run(website: scrawler.website.Website, index: Optional[int] = None) int [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.ServerProductExtractor(**kwargs)[source]¶
Get website Server info from HTTP header.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.StepsFromStartPageExtractor(**kwargs)[source]¶
Returns the number of links that have to be followed from the start page to arrive at this website.
- run(website: scrawler.website.Website, index: Optional[int] = None) int [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.TermOccurrenceCountExtractor(terms: Union[List[str], str], ignore_case: bool = False, **kwargs)[source]¶
Count the number of times the given terms occur in the website’s HTML text.
- Parameters
terms – Term or list of terms to search for.
ignore_case – Whether to ignore the text’s casing (upper-/lowercase).
- Returns
Total sum of all occurrences.
- run(website: scrawler.website.Website, index: Optional[int] = None) int [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
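The counting logic can be sketched in plain Python. The function count_term_occurrences below is a hypothetical stand-in reconstructed from the documented parameters, not the extractor's actual code:

```python
def count_term_occurrences(text: str, terms, ignore_case: bool = False) -> int:
    """Illustrative stand-in: total number of times all terms occur in the text."""
    if isinstance(terms, str):
        terms = [terms]
    if ignore_case:
        text = text.lower()
        terms = [t.lower() for t in terms]
    return sum(text.count(t) for t in terms)

html_text = "Python is great. I love python."
exact = count_term_occurrences(html_text, "Python")                    # 1
folded = count_term_occurrences(html_text, "Python", ignore_case=True) # 2
```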
- class scrawler.data_extractors.TermOccurrenceExtractor(terms: Union[List[str], str], ignore_case: bool = False, **kwargs)[source]¶
Checks if the given terms occur in the website’s HTML text. Returns 0 if no term occurs in the soup’s text, 1 if at least one occurs.
- Parameters
terms – Term or list of terms to search for.
ignore_case – Whether to ignore the text’s casing (upper-/lowercase).
- run(website: scrawler.website.Website, index: Optional[int] = None) int [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.TitleExtractor(**kwargs)[source]¶
Get the title of a website (the one shown in the browser tab).
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.UrlBranchNameExtractor(branch_name_position: int = 1, **kwargs)[source]¶
Extract sub-domain names from URLs like subdomain.example.com, which often refer to an entity’s sub-branches.
- Parameters
branch_name_position – Where in the URL to look for the name. If 0, the domain will be used. Otherwise, indexes into all available sub-domains: 1 would retrieve the first sub-domain from the right, 2 the second, and so on.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
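The positional indexing described above can be sketched as follows. This stand-in (the function branch_name is hypothetical) naively assumes a two-label registered domain such as example.com, so multi-part TLDs like .co.uk would need extra handling:

```python
def branch_name(hostname: str, branch_name_position: int = 1) -> str:
    """Illustrative stand-in: 0 returns the domain itself, 1 the first
    sub-domain from the right, 2 the second, and so on."""
    labels = hostname.split(".")   # e.g. ["a", "b", "example", "com"]
    subdomains = labels[:-2]       # everything left of the registered domain
    if branch_name_position == 0:
        return labels[-2]          # the domain, e.g. "example"
    return subdomains[-branch_name_position]

first = branch_name("subdomain.example.com")                      # "subdomain"
second = branch_name("a.b.example.com", branch_name_position=2)   # "a"
```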
- class scrawler.data_extractors.UrlCategoryExtractor(category_position: int = 2, **kwargs)[source]¶
Try to identify the category of a given URL as the directory specified by category_position.
- Parameters
category_position – Specify at which position in the path the category can be found.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.UrlExtractor(**kwargs)[source]¶
Returns the website’s URL.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
- class scrawler.data_extractors.WebsiteTextExtractor(mode: str = 'auto', min_length: int = 30, tag_types: tuple = 'div', tag_attrs: dict = {'class': ['content']}, allowed_string_types: List[bs4.element.NavigableString] = [<class 'bs4.element.NavigableString'>], separator: str = '[SEP]', **kwargs)[source]¶
Get readable website text, excluding <script>, <style>, <template> and other non-readable text. Several modes are available to make sure to only capture relevant text.
- Parameters
mode – Default mode is auto, which uses the readability algorithm to only extract a website’s article text. If all_strings, all readable website text (excluding script, style and other tags as well as HTML comments) will be retrieved. See also the BeautifulSoup documentation for the get_text() method. If by_length, the min_length parameter will be used to determine the minimum length of HTML strings to be included in the text. If search_in_tags, the tags dictionary will be used to identify the tags that include text.
min_length – If using mode by_length, this is the minimum length of a string to be considered. Shorter strings will be discarded.
tag_types – Describes the tag types to find, e. g. div.
tag_attrs – Specifies HTML attributes and their values in a key-value dict format. Example: {"class": ["content", "main-content"]}.
allowed_string_types – List of types that are considered to be readable. This makes sure that scripts and similar types are excluded. Note that the types passed here have to inherit from bs4.NavigableString.
separator – String to be used as separator when concatenating all found strings.
- run(website: scrawler.website.Website, index: Optional[int] = None) str [source]¶
Runs the extraction and returns the extracted data.
- Parameters
website – Website object that data is extracted from.
index – Used for extractors that should behave differently for each domain/site if multiple are processed. Usually, the extractor will be passed a list of values and use only the value relevant to the currently processed domain/site (for example, CustomStringPutter may put a different string for each domain). See also here.
scraping¶
- class scrawler.scraping.Scraper(urls: Union[list, str], search_attributes: scrawler.attributes.SearchAttributes, export_attributes: Optional[scrawler.attributes.ExportAttributes] = None, user_agent: Optional[str] = None, timeout: Optional[Union[int, aiohttp.client.ClientTimeout]] = None, backend: str = 'asyncio', validate_input_parameters: bool = True)[source]¶
Scrape website or multiple websites in parallel.
- Parameters
urls – Website URL or list of all URLs to scrape.
search_attributes – Specify which data to collect/search for in websites.
export_attributes – Specify how and where to export the collected data (as CSV).
user_agent – Optionally specify a user agent for making the HTTP request.
timeout – Timeout to be used when making HTTP requests. Note that the values specified here apply to each request individually, not to an entire session. When using the asyncio_backend, you can pass an aiohttp.ClientTimeout object where you can specify detailed timeout settings. Alternatively, you can pass an integer that will be interpreted as the total timeout for one request in seconds. If nothing is passed, a default timeout will be used.
backend – “asyncio” to use the asyncio_backend (faster when crawling many domains at once, but less stable and may hang). “multithreading” to use the multithreading_backend (more stable, but most likely slower). See also Why are there two backends?
validate_input_parameters – Whether to validate input parameters. Note that this validates that all URLs work and that the various attributes work together. However, the attributes themselves are also validated independently. You will need to also pass validate=False to the attributes individually to completely disable input validation.
- export_data(export_attrs: Optional[scrawler.attributes.ExportAttributes] = None, export_as_one_file: bool = True) None [source]¶
Export data previously collected from scraping task.
- Parameters
export_attrs – ExportAttributes object specifying export parameters.
export_as_one_file – If True, the data will be exported in one CSV file, each line representing one scraped URL.
- run(export_immediately: bool = False) List[List[Any]] [source]¶
Execute the scraping task and return the results.
- Parameters
export_immediately – May be used when scraping many sites at once. In order to prevent a MemoryError, data will be exported as soon as it is ready and then discarded to make room for the next sites.
- Returns
The result is a list with one entry per processed URL (result = [url1, url2, …]). Each URL entry is a list with one entry per extracted datapoint (url = [datapoint1, datapoint2, …]).
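Unlike the crawler's three-layer result, the scraper returns two layers. Again with hypothetical data for illustration:

```python
# Hypothetical result of Scraper.run() for three URLs (illustrative data only).
result = [
    ["https://example.com/", "Example Title", 200],   # one entry per datapoint
    ["https://example.org/", "Another Title", 200],
    ["https://example.net/", "Third Title", 404],
]

# Indexing follows result[url][datapoint]:
status_of_third = result[2][2]  # 404
```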
- run_and_export(export_attrs: Optional[scrawler.attributes.ExportAttributes] = None) None [source]¶
Shorthand for Scraper.run(export_immediately=True).
- Parameters
export_attrs – ExportAttributes object specifying export parameters.
website¶
- class scrawler.website.Website(url: str, steps_from_start_page: Optional[int] = None)[source]¶
The Website object is a wrapper around a BeautifulSoup object from a website’s HTML text, while adding additional information such as the URL and the HTTP response when fetching the website.
- Parameters
url – Website URL.
steps_from_start_page – Specifies number of steps from start URL to reach the given URL. Note that this is an optional parameter used in conjunction with the Crawler object.
- Raises
Exceptions raised during URL parsing.
- fetch(**kwargs)[source]¶
Fetch website from given URL and construct BeautifulSoup from response data.
- Parameters
kwargs – Are passed on to get_html().
- Raises
Exceptions from making the request (using requests.get()) and HTML parsing.
- Returns
Website object with BeautifulSoup properties.
- async fetch_async(session: aiohttp.client.ClientSession, **kwargs)[source]¶
Asynchronously fetch website from given URL and construct BeautifulSoup from response data.
- Parameters
session – aiohttp.ClientSession to be used for making the request asynchronously.
kwargs – Are passed on to async_get_html().
- Raises
Exceptions from making the request (using aiohttp.ClientSession.get()) and HTML parsing.
- Returns
Website object with BeautifulSoup properties.
- html_text¶
Website’s HTML text as a string. Only available after retrieving the Website using fetch() or fetch_async().
- http_response¶
HTTP response as requests.Response or aiohttp.ClientResponse (depending on whether the website was fetched with fetch() or fetch_async()). Only available after retrieving the Website using fetch() or fetch_async().
- steps_from_start_page¶
Number of steps from the start URL to reach this URL during crawling. This has to be passed during object initialization, which is done automatically in crawl_domain() and async_crawl_domain().
- url¶
Website URL.