Built-in Data Extractors

The extractors listed below are already built into scrawler. If they don’t cover your needs, have a look at the documentation for custom data extractors.

AccessTimeExtractor

Returns the current time as the time of access.

CmsExtractor

Extract the Content Management System (CMS) used for building the website.

ContactNameExtractor

Find contact name(s) for a given website.

CustomStringPutter

Simply returns a given string or entry from a list of strings.

DateExtractor

Get dates by looking at the passed tag.

DescriptionExtractor

Get website description (the one shown in search engine results) using two common description fields.

DirectoryDepthExtractor

Returns the directory level that a given document is in.
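To illustrate what a directory level can mean, here is a minimal sketch using only the standard library. It is not scrawler's implementation: the function name `directory_depth` is hypothetical, and the sketch assumes that a document directly under the domain root has level 0 and that a path not ending in a slash ends in a file name.

```python
from urllib.parse import urlparse


def directory_depth(url: str) -> int:
    """Count the directories between the domain root and the document."""
    path = urlparse(url).path
    # Split on "/" and drop empty pieces caused by leading or doubled slashes.
    segments = [part for part in path.split("/") if part]
    # Assumption: a path that does not end in "/" ends in a file name,
    # which is not itself a directory level.
    if segments and not path.endswith("/"):
        segments.pop()
    return len(segments)


print(directory_depth("https://example.com/a/b/page.html"))
```

Whether the built-in extractor starts counting at 0 or 1 is not stated here, so check its output on a known URL before relying on absolute values.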

ExpiryDateExtractor

Get the website’s expiry date from the HTTP header or an HTML meta tag.

GeneralHtmlTagExtractor

General-purpose extractor that finds HTML tags and then extracts a single attribute from the tag.

GeneralHttpHeaderFieldExtractor

General-purpose extractor for HTTP header fields.

HttpStatusCodeExtractor

Get status code of HTTP request.

KeywordsExtractor

Get keywords from HTML keyword meta tag (if present).

LanguageExtractor

Get the language of a given website from the lang attribute of its <html> tag.
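The lookup described here can be sketched with Python's built-in `html.parser`. The class `_LangParser` and the function `html_lang` are hypothetical names for this illustration and are not part of scrawler.

```python
from html.parser import HTMLParser
from typing import Optional


class _LangParser(HTMLParser):
    """Records the lang attribute of the first <html> tag encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.lang: Optional[str] = None
        self._seen_html = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "html" and not self._seen_html:
            self._seen_html = True
            self.lang = dict(attrs).get("lang")


def html_lang(html: str) -> Optional[str]:
    parser = _LangParser()
    parser.feed(html)
    return parser.lang


print(html_lang('<!DOCTYPE html><html lang="en-GB"><head></head></html>'))
```

A page without a lang attribute yields `None` in this sketch; detecting the language from the text itself would be a different technique.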

LastModifiedDateExtractor

Get the website’s last-modified date from the HTTP header or an HTML meta tag.
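For the HTTP header half of this lookup, a rough sketch follows. It assumes the response headers are available as a plain mapping of field names to values; the function name `last_modified` is hypothetical and the meta tag fallback is left out.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def last_modified(headers: Mapping[str, str]) -> Optional[datetime]:
    """Parse the Last-Modified header field, if present and well formed."""
    for name, value in headers.items():
        # Header field names are case-insensitive, so compare in lowercase.
        if name.lower() == "last-modified":
            try:
                return parsedate_to_datetime(value)
            except (TypeError, ValueError):
                # Older Python versions raise TypeError, newer ones ValueError.
                return None
    return None


print(last_modified({"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}))
```

The same parsing approach would apply to an Expires header, since both fields use the HTTP date format.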

LinkExtractor

Find all links from a website (without duplicates).
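As a sketch of link collection without duplicates (not scrawler's own code), the following resolves every `href` against a base URL and keeps the first occurrence of each result. The names `_LinkParser` and `unique_links` are hypothetical, and treating only `<a>` tags as links is an assumption of this example.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class _LinkParser(HTMLParser):
    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links so that "/a" and the full URL compare equal.
        absolute = urljoin(self.base_url, href)
        if absolute not in self._seen:
            self._seen.add(absolute)
            self.links.append(absolute)


def unique_links(html: str, base_url: str) -> List[str]:
    parser = _LinkParser(base_url)
    parser.feed(html)
    return parser.links


page = '<a href="/a">A</a><a href="https://example.com/a">A again</a><a href="b.html">B</a>'
print(unique_links(page, "https://example.com/dir/"))
```

Keeping a list alongside the set preserves the order in which links appear on the page, which a plain set would not.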

ServerProductExtractor

Get the website’s server information from the HTTP header.

StepsFromStartPageExtractor

Returns the number of links that have to be followed from the start page to arrive at this website.

MobileOptimizedExtractor

Checks whether a website is optimized for mobile usage by looking for the HTML viewport meta tag.
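The viewport check can be approximated in a few lines. This is a simplified sketch with hypothetical names (`_ViewportParser`, `has_viewport_meta`): it only tests for the presence of a `<meta name="viewport">` tag and does not inspect its content.

```python
from html.parser import HTMLParser


class _ViewportParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.has_viewport = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # A missing or valueless name attribute is treated as empty.
            name = dict(attrs).get("name") or ""
            if name.strip().lower() == "viewport":
                self.has_viewport = True


def has_viewport_meta(html: str) -> bool:
    parser = _ViewportParser()
    parser.feed(html)
    return parser.has_viewport


print(has_viewport_meta('<meta name="viewport" content="width=device-width, initial-scale=1">'))
```

The presence of the tag is a heuristic: a page can declare a viewport and still render poorly on small screens.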

TermOccurrenceExtractor

Checks if the given terms occur in the website’s HTML text.

TermOccurrenceCountExtractor

Count the number of times the given terms occur in the website’s HTML text.
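Both term extractors can be pictured with one small helper. The sketch below assumes case-insensitive, non-overlapping substring matching, which may differ from how scrawler matches terms; `count_terms` and `terms_occur` are hypothetical names.

```python
import re
from typing import Dict, Iterable


def count_terms(text: str, terms: Iterable[str]) -> Dict[str, int]:
    """Return the number of matches for each term."""
    counts = {}
    for term in terms:
        # re.escape keeps characters such as "." or "+" literal.
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        counts[term] = len(pattern.findall(text))
    return counts


def terms_occur(text: str, terms: Iterable[str]) -> Dict[str, bool]:
    """Reduce the counts to a yes/no answer per term."""
    return {term: n > 0 for term, n in count_terms(text, terms).items()}


sample = "Data protection matters. Our data is safe."
print(count_terms(sample, ["data", "cookie"]))
print(terms_occur(sample, ["data", "cookie"]))
```

Substring matching also counts a term inside longer words (for example "data" in "database"); adding word boundaries to the pattern would avoid that.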

TitleExtractor

Get the title of a website (the same one that is shown in the browser’s tab bar).
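A possible way to read the title from raw HTML with the standard library is sketched below. The names `_TitleParser` and `page_title` are hypothetical, and only the first `<title>` element is considered, so titles inside inline SVG are ignored.

```python
from html.parser import HTMLParser
from typing import Optional


class _TitleParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self._parts.append(data)


def page_title(html: str) -> Optional[str]:
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser._parts).strip()
    return title or None


print(page_title("<html><head><title> Example Domain </title></head></html>"))
```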

UrlExtractor

Returns the website’s URL.

UrlBranchNameExtractor

Extract sub-domain names from URLs like subdomain.example.com, which often refer to an entity’s sub-branches.
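For a URL such as subdomain.example.com, the sub-domain part can be separated as in the sketch below. It makes two labeled assumptions that scrawler itself may not share: the registered domain is taken to be the last two host labels, and a leading "www" is not treated as a branch. The function name `branch_name` is hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def branch_name(url: str) -> Optional[str]:
    """Return the sub-domain part of a URL's host, or None if there is none."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Assumption: the last two labels ("example", "com") are the main domain.
    sub_labels = labels[:-2]
    # Assumption: "www" is a technical prefix, not an organisational branch.
    if sub_labels and sub_labels[0] == "www":
        sub_labels = sub_labels[1:]
    return ".".join(sub_labels) or None


print(branch_name("https://subdomain.example.com/page"))
```

The two-label assumption fails for suffixes such as co.uk; handling those correctly requires a public suffix list, which is outside the scope of this sketch.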

UrlCategoryExtractor

Try to identify the category of a given URL as the directory specified by category_position.
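One way to picture `category_position` is as an index into the directories of the URL path. The sketch below assumes a 0-based index and assumes that a path not ending in a slash ends in a document name; neither detail is confirmed here, and the function name `url_category` is hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def url_category(url: str, category_position: int = 0) -> Optional[str]:
    """Return the directory at the given position in the URL path, if any."""
    path = urlparse(url).path
    # "/news/2021/a.html" -> ["news", "2021", "a.html"]
    segments = [part for part in path.split("/") if part]
    # Assumption: the final segment is a document unless the path ends in "/".
    directories = segments if path.endswith("/") else segments[:-1]
    if 0 <= category_position < len(directories):
        return directories[category_position]
    return None


print(url_category("https://example.com/news/2021/article.html"))
print(url_category("https://example.com/news/2021/article.html", category_position=1))
```

Returning `None` for an out-of-range position mirrors the "try to identify" wording: not every URL has a directory at the requested position.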

WebsiteTextExtractor

Get readable website text, excluding <script>, <style>, <template> and other non-readable text.
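The exclusion of non-readable elements can be sketched with `html.parser` as follows. This is an illustration under stated assumptions, not the library's implementation: only `<script>`, `<style>` and `<template>` are skipped, and the names `SKIPPED_TAGS`, `_TextParser` and `readable_text` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these three tags stand in for "non-readable" content.
SKIPPED_TAGS = {"script", "style", "template"}


class _TextParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def readable_text(html: str) -> str:
    parser = _TextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<p>Hello</p><script>var x = 1;</script>"
    "<template><p>hidden</p></template><p>world</p></body></html>"
)
print(readable_text(page))
```

The depth counter matters for `<template>`, whose children are parsed as ordinary tags and would otherwise leak into the output.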