Creating a config.json file
In the repo where you will be running the spider from, create a new config.json file.
This file holds the spider's configuration. For a full reference of the available options, see the API reference section. In this Getting Started tutorial, we'll create a simplified config for demonstration purposes.
Populate your config.json with the following content:
{
  "maxConcurrency": 1,
  "startUrls": [
    "https://your-site-url.top-level-domain"
  ],
  "allowedDomains": [
    "your.domain"
  ],
  "scraperSettings": {
    "default": {
      "hierarchySelectors": {
        "l0": "title",
        "l1": "main h1",
        "l2": "main h2",
        "l3": "main h3",
        "l4": "main h4",
        "content": "main p"
      }
    },
    "shared": {
      "onlyContentLevel": true
    }
  }
}
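Before running the spider, it can help to sanity-check the file. The following is a minimal sketch, not part of the spider itself: the function name check_config and the rules it applies are assumptions made for illustration. In particular, it assumes that a start URL is "allowed" when its host equals an entry in allowedDomains or is a subdomain of one; the spider's actual matching rules may differ, so consult the API reference.

```python
import json
from urllib.parse import urlparse


def check_config(config: dict) -> list[str]:
    """Return a list of human-readable problems found in a spider config.

    Hypothetical helper: the checks below are illustrative assumptions,
    not the spider's own validation.
    """
    problems = []

    start_urls = config.get("startUrls", [])
    allowed = config.get("allowedDomains", [])
    if not start_urls:
        problems.append("startUrls is empty or missing")

    # Assumption: a host is allowed if it equals a listed domain
    # or is a subdomain of one.
    for url in start_urls:
        host = urlparse(url).hostname or ""
        if not any(host == d or host.endswith("." + d) for d in allowed):
            problems.append(f"{url} is not covered by allowedDomains")

    selectors = (
        config.get("scraperSettings", {})
        .get("default", {})
        .get("hierarchySelectors", {})
    )
    if "content" not in selectors:
        problems.append("hierarchySelectors has no 'content' selector")

    return problems


def check_config_file(path: str) -> list[str]:
    """Load a JSON config from disk and run the checks above."""
    with open(path, encoding="utf-8") as f:
        return check_config(json.load(f))
```

Calling check_config_file("config.json") returns an empty list when none of these checks fail. A malformed file raises json.JSONDecodeError, which also points at the offending line and column.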
If your site has basic auth enabled, add the basicAuth config option to the shared settings group:
"shared": {
  "onlyContentLevel": true,
  "basicAuth": {
    "user": "myuser",
    "password": "mypass"
  }
}
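If you would rather not commit credentials to the repo, one option is to add the basicAuth block programmatically before the spider runs. The sketch below is an illustration only: with_basic_auth is a hypothetical helper, and where the user and password come from (for example, environment variables) is left to you. It only relies on the key names shown above.

```python
import copy


def with_basic_auth(config: dict, user: str, password: str) -> dict:
    """Return a copy of the config with basicAuth set in the shared group.

    Hypothetical helper: it creates scraperSettings.shared if it is
    missing and leaves the original config dict untouched.
    """
    updated = copy.deepcopy(config)
    shared = updated.setdefault("scraperSettings", {}).setdefault("shared", {})
    shared["basicAuth"] = {"user": user, "password": password}
    return updated
```

The result can be written back with json.dump to produce the config.json the spider reads. Existing keys in the shared group, such as onlyContentLevel, are preserved.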