
Web Spiders

On the Web Spiders screen, you can view, add, and update the list of created web spiders.

Web Spiders List

Accessing the screen

Click on [AI/RAG] -> [Web Spiders].

Item Description

| Item | Description |
| --- | --- |
| Enabled | Indicates whether the web spider is enabled. |
| Title | Displays the title of the web spider. |
| The Source for Crawling | Displays the target to crawl. |
| History | Click to view the crawl history. |
| Updated on | Displays the date and time when the web spider was last updated. |

Web Spider Editor

Accessing the screen

Click on [AI/RAG] -> [Web Spiders].

From the Web Spiders list page, click the [Title] of the web spider you want to edit.

Basic Settings

| Item | Description |
| --- | --- |
| Title | Set the title of the web spider. |
| Memo | Enter a memo. |
| Data Import API | Select the endpoint for data import:<br>• Security: Dynamic Access Token<br>• Model/Operation: Topics::insert<br>• Lightweight Mode: Enabled<br>• Upsert by Columns: Slug<br>*The content structure specified in topics_group_id must include certain required fields; refer to Content Structure Required for Saving Crawl Data below for the necessary fields. A sample import request is sketched after this table. |
| The Source for Crawling | Select the target for crawling. Currently supported targets:<br>• Crawl Web Pages |
| Crawl Limit | Set the crawl limit (maximum number of pages to crawl). Specify 0 for unlimited. |
| Collecting Images | Enable to collect images during the crawl. |
| Force Update | Enable to force updates. |
| Status | Select whether the web spider is enabled. |
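
For orientation, the sketch below shows the kind of upsert request a crawler could send to the Data Import API configured above: one record per crawled page, keyed by slug so that a re-crawl updates existing content instead of duplicating it. The endpoint URL, auth header, and request shape here are assumptions for illustration (they depend on the endpoint you generate); the field names mirror the slugs listed under Content Structure Required for Saving Crawl Data below.

```typescript
// Illustrative only: the endpoint path and auth header are placeholders for the
// endpoint you create under Data Import API (Topics::insert, Upsert by Columns: Slug).
const ENDPOINT = "https://example.g.kuroco.app/rcms-api/1/crawl-import"; // hypothetical path
const ACCESS_TOKEN = "YOUR_DYNAMIC_ACCESS_TOKEN"; // dynamic access token (placeholder)

// One record per crawled page; property names follow the slugs of the required
// content structure (ymd, data, url, etag, lang, ...).
interface CrawledPage {
  slug: string;             // upsert key ("Upsert by Columns: Slug")
  ymd: string;              // updated date of the page
  data: string;             // main content converted to markdown
  url: string;
  etag: string;             // hash value used to detect changes
  lang: string;
  response_status: number;
  "content-length": number;
  "content-type": string;
}

async function importPage(page: CrawledPage): Promise<void> {
  const res = await fetch(ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // The header name is an assumption; use whatever the generated endpoint expects.
      Authorization: `Bearer ${ACCESS_TOKEN}`,
    },
    body: JSON.stringify(page),
  });
  if (!res.ok) throw new Error(`Import failed: ${res.status}`);
}
```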

Website crawling settings

General

| Item | Description |
| --- | --- |
| Start Urls | Enter the URL(s) to start crawling from. Multiple entries can be made, separated by line breaks. |
| Allowed Urls | Enter the URLs allowed for crawling. Multiple entries can be made, separated by line breaks. (How the URL lists and allowed languages interact is sketched after this table.) |
| Sitemap Urls | Enter the sitemap URL. |
| Denied Urls | Enter the URLs to exclude from crawling. |
| Allowed langs | Enter the languages to allow when the site has multiple languages. |
| Follow the links | Enable to crawl by following HTML links. |
| Follow the secondary links | Enable to follow secondary links. |
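
The exact matching rules of these lists are not documented here, but a common interpretation is prefix matching on the allowed/denied URL lists plus a language check. The sketch below illustrates that interpretation only; treat it as a way to reason about how the settings interact, not as the crawler's actual implementation.

```typescript
// Illustration of how the URL lists above could interact.
// Prefix matching is an assumption; the crawler's real matching rules may differ.
interface CrawlScope {
  allowedUrls: string[];  // "Allowed Urls", one prefix per line
  deniedUrls: string[];   // "Denied Urls"
  allowedLangs: string[]; // "Allowed langs", e.g. ["en", "ja"]
}

function shouldCrawl(url: string, pageLang: string, scope: CrawlScope): boolean {
  // A denied prefix always wins.
  if (scope.deniedUrls.some((prefix) => url.startsWith(prefix))) return false;
  // If an allow list is given, the URL must match one of its prefixes.
  if (scope.allowedUrls.length > 0 &&
      !scope.allowedUrls.some((prefix) => url.startsWith(prefix))) return false;
  // If languages are restricted, the page language must be listed.
  if (scope.allowedLangs.length > 0 && !scope.allowedLangs.includes(pageLang)) return false;
  return true;
}

// Example: allow only the /docs section in English, excluding a private area.
shouldCrawl("https://example.com/docs/intro", "en", {
  allowedUrls: ["https://example.com/docs/"],
  deniedUrls: ["https://example.com/docs/private/"],
  allowedLangs: ["en"],
}); // -> true
```
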
Data transformation and import settings

| Item | Description |
| --- | --- |
| CSS selector for identifying main content | Enter the CSS selector that identifies the main content. (An extraction example follows this table.) |
| CSS selector for identifying categories | Enter the CSS selector that identifies categories. |
| Strings to remove from title tag | Enter the strings to remove from the title tag. |
| CSS selector for the part to be removed from the main content | Enter the CSS selector for elements to remove from the main content. |
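
To make the selector settings above concrete, here is a minimal sketch of how such a transformation step could be applied to a fetched page, using cheerio as a stand-in HTML parser. The selectors and title strings are example values, and the real crawler's pipeline may differ.

```typescript
import * as cheerio from "cheerio"; // stand-in HTML parser for this illustration

// Example values; in practice these come from the settings above.
const mainSelector = "article.main";              // CSS selector for identifying main content
const categorySelector = ".breadcrumb a";         // CSS selector for identifying categories
const titleSuffixes = [" | Example Docs"];        // Strings to remove from title tag
const removeSelectors = ["nav", ".ad", "footer"]; // CSS selector for the part to be removed

function transform(html: string) {
  const $ = cheerio.load(html);

  // Strip unwanted elements before extracting the main content.
  removeSelectors.forEach((sel) => $(sel).remove());

  // Clean the <title> text.
  let title = $("title").text();
  titleSuffixes.forEach((s) => { title = title.replace(s, ""); });

  return {
    title: title.trim(),
    categories: $(categorySelector).map((_, el) => $(el).text().trim()).get(),
    mainHtml: $(mainSelector).first().html() ?? "",
  };
}
```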

Content Structure Required for Saving Crawl Data

To save crawl results as content, the following content structure must be included. A field-by-slug summary follows the table.

| Item Name (Optional) | Repetition | Item Setting | Slug | Annotation (Optional) |
| --- | --- | --- | --- | --- |
| Date | | Date picker<br>Also include seconds (hh:mm:ss): Enabled | ymd | The updated date will be set. |
| Contents | 1 | HTML<br>Allow all tags: Enabled | data | Contains content converted to markdown format. |
| URL | 1 | Single-line text | url | |
| Hash Value | 1 | Single-line text | etag | Used to check for updates to the content. |
| Language | 1 | Single-line text | lang | |
| Main Content CSS Selector | 1 | Single-line text | selector | Specifies the content to extract from the page. |
| Response Status | 1 | Number | response_status | |
| Content Size | 1 | Number | content-length | |
| Content Type | 1 | Single-line text | content-type | |
| Manual Adjustment Flag | 1 | Single choice<br>0: Disabled (Default)<br>1: Enabled | manual_override_flag | When enabled, the crawler will not overwrite. |
| Domain | 1 | Single-line text | domain | |
| Description | 1 | Single-line text | description | |
| Icon URL | 1 | Single-line text | icon_url | |
| OGP Image URL | 1 | Single-line text | ogp_image_url | |
| Images | 20 | Grouping of the 3 items below | images | |
| - Image URL | | File (from File manager) | image_url | |
| - Image src | | Single-line text | image_src | |
| - Alt Tag | | Single-line text | alt | |
| Last Modified | 1 | Date picker<br>Also include time (hh:mm): Enabled | last-modified | |
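
As a convenience for anyone consuming the imported content through the API, the sketch below restates the table above as a TypeScript type keyed by slug. It adds nothing beyond the table; the image repetition count and the 0/1 flag values are taken directly from it.

```typescript
// Convenience summary of the required content structure above, keyed by slug.
// This is a documentation aid only, not an official type definition.
interface CrawledImage {
  image_url: string;   // File (from File manager)
  image_src: string;   // original src attribute
  alt: string;         // alt text
}

interface CrawlContent {
  ymd: string;                  // Date: the updated date will be set
  data: string;                 // Contents: page converted to markdown
  url: string;
  etag: string;                 // hash value used to check for content updates
  lang: string;
  selector: string;             // main-content CSS selector used for the page
  response_status: number;
  "content-length": number;
  "content-type": string;
  manual_override_flag: 0 | 1;  // 1 = the crawler will not overwrite this content
  domain: string;
  description: string;
  icon_url: string;
  ogp_image_url: string;
  images: CrawledImage[];       // grouping of the 3 image items, up to 20 entries
  "last-modified": string;      // Date picker with time (hh:mm)
}
```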

Crawl History

Accessing the screen

Click on [AI/RAG] -> [Web Spiders].

From the Web Spiders list page, click [History] for the web spider whose crawl history you want to view.

Item Description

| Item | Description |
| --- | --- |
| Status | Displays the current state of the crawl. |
| The Source for Crawling | Displays the target of the crawl. |
| Content | Displays the content definition name where the crawled pages are registered. |
| Start Urls | Displays the URL where the crawl started. |
| Start Date and Time | Displays the date and time when the crawl started. |
| End Date and Time | Displays the date and time when the crawl ended. |
| Processing time | Displays the processing time of the crawl. |
| Reason for Termination | Displays the reason the crawl ended. |
| Crawled count | Displays the number of pages processed during the crawl. |
| Log | Click to view logs related to the crawl. |
| Rerun | Click to rerun the crawl. |

Support

If you have any other questions, please contact us or check out Our Slack Community.