Built-in Data Extractors
The extractors listed below are already built into scrawler. If they don’t cover your needs, have a look at the documentation for custom data extractors.
Returns the current time as time of access.
Extract the Content Management System (CMS) used for building the website.
Find contact name(s) for a given website.
Simply returns a given string or entry from a list of strings.
Get dates by looking at passed tag.
Get website description (the one shown in search engine results) using two common description fields.
Returns the directory level that a given document is in.
Get website
General purpose extractor for extracting HTML tags and then extracting a single attribute from the tag.
General purpose extractor for extracting HTTP header fields.
Get status code of HTTP request.
Get keywords from HTML keyword meta tag (if present).
Get language of a given website from its HTML tag
Get website
Find all links from a website (without duplicates).
Get website
Returns the number of links that have to be followed from the start page to arrive at this website.
Checks whether website is optimized for mobile usage by looking up HTML
Checks if the given terms occur in the website’s HTML text.
Count the number of times the given terms occur in the website’s HTML text.
Get title of a website (the same that is shown in a browser in the tabs tray).
Returns the website’s URL.
Extract sub-domain names from URLs like
Try to identify the category of a given URL as the directory specified by
Get readable website text, excluding
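
To make a few of the descriptions above more concrete, here is a minimal sketch of the kind of work such extractors do, written with the Python standard library only. It does not use scrawler's API: `PageMetadataParser`, `directory_depth` and `count_terms` are hypothetical names chosen for illustration, and the exact rules scrawler applies (for example, how it counts directory levels or which description fields it checks) may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageMetadataParser(HTMLParser):
    """Hypothetical helper (not part of scrawler): collects title,
    meta description, language and de-duplicated links from HTML."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self.language = None
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            # Language as declared on the <html> tag, if any.
            self.language = attrs.get("lang")
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag == "a":
            href = attrs.get("href")
            # Keep each link only once, preserving first-seen order.
            if href and href not in self.links:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data


def directory_depth(url):
    """Assumed convention: number of non-empty directory segments
    in the URL path, not counting the document itself."""
    segments = urlparse(url).path.split("/")[1:-1]
    return sum(1 for segment in segments if segment)


def count_terms(text, terms):
    """Case-insensitive count of each term in the given text."""
    lowered = text.lower()
    return {term: lowered.count(term.lower()) for term in terms}


if __name__ == "__main__":
    sample_html = (
        '<html lang="en"><head><title>Example Page</title>'
        '<meta name="description" content="A demo page."></head>'
        '<body><a href="/a">A</a><a href="/a">A again</a>'
        '<a href="/b">B</a></body></html>'
    )
    parser = PageMetadataParser()
    parser.feed(sample_html)
    print(parser.title, parser.description, parser.language, parser.links)
    print(directory_depth("https://www.example.com/blog/2021/post.html"))
    print(count_terms("Scrawler crawls; scrawler scrapes.", ["scrawler"]))
```

The built-in extractors package this kind of logic so it does not have to be rewritten for each project; the sketch is only meant to show what "title", "description", "links without duplicates", "directory level" and "term count" can mean in practice.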