Getting Started¶
To get started, have a look at the templates folder. It contains four files, each one doing a different task. All templates include three sections:
- Imports retrieves all code dependencies from various files.
- Setup is where all parameters are specified.
- In Execution, an instance of the respective Python object is created and its run() method is executed.
As a starting point, you can copy-and-paste a template and make any adjustments you would like.
Let’s have a closer look at the Setup section.
First, the URL(s) to be processed are specified.
Then, the attributes that define how to accomplish the tasks are specified:
- Search attributes: specify which data to collect/search for in the website.
- Export attributes: specify how and where to export the collected data.
- Crawling attributes: specify how to conduct the crawling, including filtering irrelevant URLs or limiting the number of crawled URLs.
For more details, see the section Attributes.
In the section Execution, these parameters are then passed to the relevant object (see next section).
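To make this structure concrete, here is a minimal sketch of how such a template might look for a crawling task. It only uses classes and parameters introduced in the sections below; the actual template files are more complete and should be preferred as a starting point.

# Imports: retrieve all code dependencies
from scrawler import Crawler
from scrawler.attributes import SearchAttributes, ExportAttributes, CrawlingAttributes
from scrawler.data_extractors import UrlExtractor, TitleExtractor

# Setup: the URL(s) to be processed and all parameters
URL = "https://example.com"
search_attrs = SearchAttributes(UrlExtractor(), TitleExtractor())
export_attrs = ExportAttributes(directory=r"C:\Users\USER\Documents", fn="crawled_data")
crawling_attrs = CrawlingAttributes(max_no_urls=100)

# Execution: create the object and run it
crawler = Crawler(URL,
                  search_attributes=search_attrs,
                  export_attributes=export_attrs,
                  crawling_attributes=crawling_attrs)
results = crawler.run()
crawler.export_data()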
Basic Objects¶
The basic functionality of scrawler is contained in two classes, Scraper and Crawler.
Functionality¶
The objects are passed all relevant parameters during object initialization. Then, three methods can be applied to them:
- run(): Execute the task and return the results.
- run_and_export(): This may be used when scraping/crawling many sites at once, generating huge amounts of data. In order to prevent a MemoryError, data will be exported as soon as it is ready and then discarded to make room for the next sites/domains (see the sketch after this list).
- export_data(): Export the collected data to CSV file(s).
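As an illustration, the memory-saving variant might look like the following sketch. The attribute objects are placeholders, as in the examples below, and it is assumed here that run_and_export() needs no further arguments because the export settings are already contained in the attributes.

from scrawler import Crawler

search_attrs, export_attrs, crawling_attrs = ..., ..., ...

crawler = Crawler("https://example.com",
                  search_attributes=search_attrs,
                  export_attributes=export_attrs,
                  crawling_attributes=crawling_attrs)

# Export each chunk of data as soon as it is ready and then discard it,
# instead of keeping everything in memory via run() + export_data().
crawler.run_and_export()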
Example Crawling¶
Let’s have a look at an example of crawling https://example.com. For the moment, you can ignore the variables search_attrs, export_attrs and crawling_attrs. We will get to them later.
from scrawler import Crawler
search_attrs, export_attrs, crawling_attrs = ..., ..., ...
crawler = Crawler("https://example.com",
                  search_attributes=search_attrs,
                  export_attributes=export_attrs,
                  crawling_attributes=crawling_attrs)
results = crawler.run()
crawler.export_data()
Example Scraping¶
Here, multiple sites are scraped at once.
from scrawler import Scraper
search_attrs, export_attrs = ..., ...
scraper = Scraper(["https://www.example1.com", "https://www.example2.com", "https://www.example3.com"],
                  search_attributes=search_attrs,
                  export_attributes=export_attrs)
results = scraper.run()
scraper.export_data()
Attributes¶
Now that we know the objects that will perform our tasks, we would like to specify exactly how to go about it.
Search Attributes¶
The SearchAttributes specify which data to collect/search for in the website (and how to do it). This is done by passing data extractor objects to SearchAttributes during initialization.
There are many data extractors already built into the project; see the built-in data extractors. You can also specify your own custom data extractors.
In this example, we set up SearchAttributes that will extract three different data points from websites, specified using the built-in UrlExtractor, TitleExtractor and DateExtractor data extractors. Note how parameters for the data extractors are passed directly during initialization.
from scrawler.attributes import SearchAttributes
from scrawler.data_extractors import *
search_attrs = SearchAttributes(
    UrlExtractor(),  # returns URL
    TitleExtractor(),  # returns website <title> tag content
    DateExtractor(tag_types="meta", tag_attrs={"name": "pubdate"})  # returns publication date from pubdate meta tag
)
See also: SearchAttributes for more detailed documentation.
Export Attributes¶
The ExportAttributes specify how and where to export the collected data. Data is always exported in CSV format, therefore the various parameters are geared towards that format.
Two parameters must be specified here:
- directory: The directory (folder) that the file(s) will be saved to.
- fn: Filename(s) of the exported CSV files containing the crawled data. You don’t have to specify the file extension .csv, since the files will always be CSV files (for example, use crawled_data instead of crawled_data.csv).
Here’s an example of creating an ExportAttributes object:
from scrawler.attributes import ExportAttributes
export_attrs = ExportAttributes(
    directory=r"C:\Users\USER\Documents",
    fn=["example1_crawled_data", "example2_crawled_data", "example3_crawled_data"],
    header=["URL", "Title", "Publication Date"],
    separator="\t"
)
See also: ExportAttributes for more detailed documentation.
Crawling Attributes¶
The CrawlingAttributes specify how to conduct the crawling, e.g. how to filter irrelevant URLs or limit the number of crawled URLs. As implied by their name, they are only relevant for crawling tasks.
Some commonly adjusted parameters include:
- filter_foreign_urls: This parameter defines how the crawler knows that a given URL is still part of the target domain. For example, one may only want to crawl a subdomain, not the entire domain (only URLs from subdomain.example.com vs. the entire example.com domain). Details on valid input values can be found in the documentation for CrawlingAttributes. By default, this is set to auto, which means that the correct mode will be inferred by looking at the passed base/start URL. For example, if the start URL contains a subdomain, only links from the subdomain will be crawled. For details, refer to the documentation for the extract_same_host_pattern() function. Note that you can also pass your own comparison function here. It has to take two parameters, url1 and url2. The first URL is the one to be checked, and the second is the reference (the crawling start URL). The function should return True for URLs that belong to the same host, and False for foreign URLs.
- filter_media_files: Controls whether to filter out (ignore) media files. Media files can be quite large and make the crawling process significantly longer, while not adding any new information, because media file data can’t be parsed and processed. The crawler therefore filters media files by looking at the URL (e.g. URLs ending in .pdf or .jpg), as well as the content-type response header.
- blocklist: Some directories might not be interesting for the crawling process (e.g., /media/). The blocklist parameter makes it possible to pass a list of strings that might occur in a URL. If a URL contains any of the given strings, it is filtered out.
- max_no_urls: Some domains contain many webpages. This parameter can be passed an integer as the maximum total number of URLs to be crawled.
Here’s an example of creating a CrawlingAttributes object:
from scrawler.attributes import CrawlingAttributes
DOMAIN_TO_CRAWL = "https://www.blog.example.com"
crawling_attrs = CrawlingAttributes(
    filter_foreign_urls="subdomain1",  # only crawling the `blog` subdomain
    filter_media_files=True,
    blocklist=("git.", "datasets.", "nextcloud."),
    max_no_urls=1000
)
Another example with a custom foreign URL filter:
import tld.exceptions
from scrawler.attributes import CrawlingAttributes
from scrawler.utils.web_utils import ParsedUrl
DOMAIN_TO_CRAWL = "https://www.blog.example.com/my_directory/index.html"
def should_be_crawled(url1: str, url2: str) -> bool:
    """Custom foreign URL filter: Crawl all URLs from host `www.blog.example.com` inside the directory `my_directory`."""
    try:  # don't forget exception handling
        url1 = ParsedUrl(url1)
        url2 = ParsedUrl(url2)
    except (tld.exceptions.TldBadUrl, tld.exceptions.TldDomainNotFound):  # URL couldn't be parsed
        return False
    return ((url1.hostname == url2.hostname)  # hostname is `www.blog.example.com`
            and ("my_directory" in url1.path) and ("my_directory" in url2.path))
crawling_attrs = CrawlingAttributes(
    filter_foreign_urls=should_be_crawled,  # pass custom foreign URL filter here
    filter_media_files=True,
    blocklist=("git.", "datasets.", "nextcloud."),
    max_no_urls=1000
)
See also: CrawlingAttributes for more detailed documentation.
Other Settings¶
Most parameters are covered by the three attribute objects above. However, some additional settings are available for special cases.
If you look at the templates’ Setup section again, it includes a USER_AGENT parameter that sets the user agent to be used during scraping/crawling.
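For example, the Setup section might define the user agent as a plain string. The value below is only an illustration; how the template passes it on to the scraper/crawler can be seen in the template files themselves.

# Setup (excerpt from a template): user agent string sent with all HTTP requests
USER_AGENT = "Mozilla/5.0 (compatible; ExampleBot/1.0; +https://www.example.com/bot)"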
Finally, defaults.py contains standard settings that are used throughout the project.
FAQ¶
Why are there two backends?¶
The module backends contains two files with the same functions for scraping and crawling, but built on different technologies for parallelization: one uses asyncio and the other multiprocessing, more precisely multithreading by means of multiprocessing.dummy.
In general, asyncio_backend is preferable because more sites can be processed in parallel. However, on very large sites, scrawler may get stuck and the entire crawling process will hang. Also, aiohttp.ServerDisconnectedError may occur frequently. If you expect or experience these issues, it is preferable to use the multithreading_backend, which is slower but more robust.