# API reference

## crawlSite

Instantiates a `Spider` object, initializes it based on your config file and settings, then invokes its `crawl` method.

`crawlSite` options:

| Property | Required | Type | Description |
| --- | --- | --- | --- |
| `configFilePath` | N | `string` | Path to your config JSON file (see the sample config: https://github.com/anansi-js/anansi/blob/main/config.sample.json, or the reference below) |
| `config` | N | `CrawlSiteOptionsCrawlerConfig` | Alternatively to passing a config file path, you can pass the config file's properties here |
| `searchEngineOpts` | N | `SearchEngineOpts` | Search engine settings |
| `logLevel` | N | `"debug"` / `"warn"` / `"error"` | Log level |
| `diagnostics` | N | `boolean` | Whether or not to output diagnostics |
| `diagnosticsFilePath` | N | `string` | Path to the file where diagnostics will be written |
| `timeout` | N | `number` | Timeout in ms |
| `maxIndexedRecords` | N | `number` | Maximum number of records to index. If reached, the crawling jobs will terminate |
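
For illustration, a minimal sketch of invoking `crawlSite` with a file-based config. The import path is an assumption based on the anansi-js GitHub org; adjust it to the package name you actually install:

```typescript
import { crawlSite } from 'anansi-js'; // assumed package name (GitHub: anansi-js/anansi)

async function main() {
  await crawlSite({
    configFilePath: './config.json', // see the sample config linked above
    logLevel: 'debug',
    diagnostics: true,
    diagnosticsFilePath: './diagnostics.json',
    timeout: 30_000,           // timeout in ms
    maxIndexedRecords: 10_000, // terminate the crawling jobs once 10k records are indexed
  });
}

main().catch(console.error);
```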

## CrawlSiteOptionsCrawlerConfig

| Property | Required | Type | Description |
| --- | --- | --- | --- |
| `startUrls` | Y | `string` / `string[]` | List of URLs that the crawler will start from |
| `scraperSettings` | Y | `ScraperSettings` | HTML selectors telling the crawler which content to scrape for indexing |
| `allowedDomains` | N | `string` / `string[]` | List of allowed domains. When not specified, defaults to the domains of your `startUrls` |
| `ignoreUrls` | N | `string` / `string[]` | List of URL patterns to ignore |
| `maxConcurrency` | N | `number` | Maximum number of concurrent Puppeteer clusters to run |
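
Equivalently, the same properties can be passed inline via the `config` option instead of a file. A sketch, with placeholder URLs and a minimal `scraperSettings` (see the fuller example in the ScraperSettings section below):

```typescript
import { crawlSite } from 'anansi-js'; // assumed package name, as above

await crawlSite({
  config: {
    startUrls: ['https://docs.example.com'],
    allowedDomains: ['docs.example.com'],          // optional; defaults to the startUrls domains
    ignoreUrls: ['https://docs.example.com/v1/*'], // pattern syntax assumed to be glob-like
    maxConcurrency: 2,
    scraperSettings: {
      shared: { urlPattern: '*', hierarchySelectors: { content: 'main p' }, metadataSelectors: {} },
      default: { urlPattern: '*', hierarchySelectors: { l0: 'h1', l1: 'h2', content: 'main p' }, metadataSelectors: {} },
    },
  },
});
```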

## ScraperSettings

All of the scraper settings groups. Each group except `default` and `shared` is tied to a specific URL pattern.

| Property | Required | Type | Description |
| --- | --- | --- | --- |
| `default` | Y | `ScraperPageSettings` | Default scraper page settings. Applied when the scraped URL doesn't match any other scraper page settings group |
| [your scraper page-level settings group name] | N | `ScraperPageSettings` | Page-level settings group. You can add as many as you want. Each group is applied to a given URL pattern: during crawling, the settings for each page are chosen based on which group's `urlPattern` field matches the page URL, and the default group is chosen if no match was found |
| `shared` | Y | `ScraperPageSettings` | Shared scraper settings. Settings defined here are applied to all pages unless overridden in the default group or in the specific settings group that matches the current page |
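
A sketch of a `ScraperSettings` object with one custom group. The shapes follow the tables in this reference; the group name, URL patterns, and selectors are illustrative:

```typescript
const scraperSettings = {
  // shared: applied to every page unless the matching group overrides a field
  shared: {
    urlPattern: '*',
    hierarchySelectors: { content: 'main p' },
    metadataSelectors: {},
    excludeSelectors: ['nav', 'footer'],
  },
  // default: used when no named group's urlPattern matches the page URL
  default: {
    urlPattern: '*',
    hierarchySelectors: { l0: 'h1', l1: 'h2', l2: 'h3', content: 'main p, main li' },
    metadataSelectors: {},
  },
  // a named group: applied to pages whose URL matches its urlPattern
  blogPages: {
    urlPattern: 'https://docs.example.com/blog/*',
    hierarchySelectors: { l0: 'h1.post-title', l1: 'h2', content: 'article p' },
    metadataSelectors: { "meta[name='author']": 'author' }, // index the author meta tag under a custom field
    pageRank: 10, // rank blog pages above the default of 0
  },
};
```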

## ScraperPageSettings

A group of scraper settings - mostly hierarchy and metadata selectors - applied to a specific URL pattern.

| Property | Required | Type | Description |
| --- | --- | --- | --- |
| `hierarchySelectors` | Y | `HierarchySelectors` | Selectors hierarchy (see below) |
| `metadataSelectors` | Y | `Record<string, string>` | Metadata selectors: a mapping from HTML selectors to custom additional fields in the index. E.g. you can scrape meta tags of a certain content pattern and store them under a custom field |
| `urlPattern` | Y | `string` | URL pattern. During crawling, the settings group for each page is chosen based on which group's `urlPattern` field matches the page URL. The default group is chosen if no match was found |
| `pageRank` | N | `number` | Custom ranking for the matched pages. Defaults to 0 |
| `respectRobotsMeta` | N | `boolean` | Whether or not the crawler should respect the `noindex` meta tag. Defaults to false |
| `excludeSelectors` | N | `string[]` | List of HTML selectors to exclude from being scraped |
| `userAgent` | N | `string` | Custom user agent to set when running Puppeteer |
| `headers` | N | `Record<string, string>` | Request headers to include when crawling the site |
| `basicAuth` | N | `{ user: string; password: string }` | Basic auth credentials |
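
For instance, a single settings group targeting an authenticated section of a site might look like this. All values are illustrative, and the environment variable names are placeholders:

```typescript
const internalDocsGroup = {
  urlPattern: 'https://docs.example.com/internal/*',
  hierarchySelectors: { l0: 'h1', l1: 'h2', content: 'article p' },
  metadataSelectors: { "meta[name='description']": 'description' },
  pageRank: 5,             // rank matched pages above the default of 0
  respectRobotsMeta: true, // skip pages carrying a noindex meta tag
  excludeSelectors: ['.sidebar', '.toc'],
  userAgent: 'my-crawler/1.0',
  headers: { 'X-Crawl-Token': process.env.CRAWL_TOKEN ?? '' },
  basicAuth: { user: 'crawler', password: process.env.CRAWL_PASSWORD ?? '' },
};
```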

## HierarchySelectors

Hierarchy selectors: essentially a mapping from HTML selectors to indexed hierarchy levels.

| Property | Required | Type | Description |
| --- | --- | --- | --- |
| `l0` | N | `string` | HTML selectors for matching l0, e.g. `span[class='myclass'], .myclass2` |
| `l1` | N | `string` | HTML selectors for matching l1 |
| `l2` | N | `string` | HTML selectors for matching l2 |
| `l3` | N | `string` | HTML selectors for matching l3 |
| `l4` | N | `string` | HTML selectors for matching l4 |
| `content` | N | `string` | HTML selectors for matching content |
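
For example, a mapping for a typical docs page layout (selectors are illustrative):

```typescript
const hierarchySelectors = {
  l0: 'header h1',                  // page title becomes the top hierarchy level
  l1: 'article h2',                 // section headings
  l2: 'article h3',
  l3: 'article h4',
  content: 'article p, article li', // body text is indexed under the nearest heading
};
```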