Built-in Data Extractors

The extractors listed below are already built into scrawler. If they don’t cover your needs, have a look at the documentation for custom data extractors.

`AccessTimeExtractor`	Returns the current time as time of access.
`CmsExtractor`	Extract the Content Management System (CMS) used for building the website.
`ContactNameExtractor`	Find contact name(s) for a given website.
`CustomStringPutter`	Simply returns a given string or entry from a list of strings.
`DateExtractor`	Get dates by looking at the passed tag.
`DescriptionExtractor`	Get website description (the one shown in search engine results) using two common description fields.
`DirectoryDepthExtractor`	Returns the directory level that a given document is in.
`ExpiryDateExtractor`	Get website `expiry` date from HTTP header or HTML meta tag.
`GeneralHtmlTagExtractor`	General-purpose extractor for finding HTML tags and then extracting a single attribute from the tag.
`GeneralHttpHeaderFieldExtractor`	General-purpose extractor for HTTP header fields.
`HttpStatusCodeExtractor`	Get status code of HTTP request.
`KeywordsExtractor`	Get keywords from HTML keyword meta tag (if present).
`LanguageExtractor`	Get language of a given website from the `lang` attribute of its HTML tag.
`LastModifiedDateExtractor`	Get website `last-modified` date from HTTP header or HTML meta tag.
`LinkExtractor`	Find all links from a website (without duplicates).
`ServerProductExtractor`	Get website `Server` info from HTTP header.
`StepsFromStartPageExtractor`	Returns the number of links that have to be followed from the start page to arrive at this website.
`MobileOptimizedExtractor`	Checks whether website is optimized for mobile usage by looking up HTML `viewport` meta tag.
`TermOccurrenceExtractor`	Checks if the given terms occur in the website’s HTML text.
`TermOccurrenceCountExtractor`	Count the number of times the given terms occur in the website’s HTML text.
`TitleExtractor`	Get title of a website (the same one that a browser shows in its tab bar).
`UrlExtractor`	Returns the website’s URL.
`UrlBranchNameExtractor`	Extract sub-domain names from URLs like `subdomain.example.com`, which often refer to an entity’s sub-branches.
`UrlCategoryExtractor`	Try to identify the category of a given URL as the directory specified by `category_position`.
`WebsiteTextExtractor`	Get readable website text, excluding `<script>`, `<style>`, `<template>` and other non-readable text.
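
To make some of the descriptions above more concrete, here is a minimal, standalone sketch of the kind of logic that the URL-based and HTML-based extractors describe. It uses only the Python standard library and does not call scrawler. All function names (`directory_depth`, `url_category`, `url_branch_name`, `term_occurrence_counts`, `is_mobile_optimized`) and their exact semantics are assumptions made for illustration; the built-in extractors may define depth, category position, or term matching differently.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional
from urllib.parse import urlsplit


def _path_segments(url: str) -> List[str]:
    """Split the URL path into its non-empty segments."""
    return [segment for segment in urlsplit(url).path.split("/") if segment]


def directory_depth(url: str) -> int:
    """Count the directories above the document (assumed definition)."""
    segments = _path_segments(url)
    # Assumption: a last segment containing a dot is the document itself.
    if segments and "." in segments[-1]:
        segments = segments[:-1]
    return len(segments)


def url_category(url: str, category_position: int = 1) -> Optional[str]:
    """Return the path segment at a 1-based position, or None if absent."""
    segments = _path_segments(url)
    if 1 <= category_position <= len(segments):
        return segments[category_position - 1]
    return None


def url_branch_name(url: str) -> Optional[str]:
    """Return the left-most host label for hosts like subdomain.example.com."""
    labels = (urlsplit(url).hostname or "").split(".")
    # Naive assumption: two-label hosts and "www" carry no branch name.
    # Multi-part suffixes such as "co.uk" are not handled here.
    if len(labels) > 2 and labels[0] != "www":
        return labels[0]
    return None


def term_occurrence_counts(text: str, terms: List[str]) -> Dict[str, int]:
    """Count case-insensitive, non-overlapping occurrences of each term."""
    lowered = text.lower()
    return {term: lowered.count(term.lower()) for term in terms}


class _ViewportFinder(HTMLParser):
    """Record whether a <meta name="viewport"> tag appears in the HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None (e.g. <meta name>), hence the `or ""`.
        if tag == "meta" and (dict(attrs).get("name") or "").lower() == "viewport":
            self.found = True


def is_mobile_optimized(html: str) -> bool:
    """Check for a viewport meta tag as a rough mobile-optimization signal."""
    finder = _ViewportFinder()
    finder.feed(html)
    return finder.found
```

Under these assumptions, `directory_depth("https://example.com/a/b/page.html")` returns `2`, `url_branch_name("https://subdomain.example.com/")` returns `"subdomain"`, and `is_mobile_optimized` is `True` for a page whose head contains `<meta name="viewport" content="width=device-width">`.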