Web Spiders

On the Web Spiders screen, you can view, add, and update the web spiders that have been created.

Web Spiders List

Accessing the screen

Click on [AI/RAG] -> [Web Spiders].

Item Description

| Item | Description |
| --- | --- |
| Enabled | Indicates whether the web spider is enabled. |
| Title | Displays the title of the web spider. |
| The Source for Crawling | Displays the target to crawl. |
| History | Click to view the crawl history. |
| Updated on | Displays the date and time when the web spider was last updated. |

Web spider editor

Accessing the screen

Click on [AI/RAG] -> [Web Spiders].

From the Web Spiders list page, click on the [Title] of the web spider you want to edit.

Basic Settings

| Item | Description |
| --- | --- |
| Title | Set the title of the web spider. |
| Status | Toggle the enabled status of the web spider. |
| Memo | Enter a memo. |
| Linked Content structures | (Release version: β / RC version) Displays the list of content structures linked to this crawler (shown only when editing an existing web spider). Linkage is configured from the content structure edit page under "Crawler settings". |
| The Source for Crawling | Select the target for crawling. Currently supported targets: Crawl Web Pages; Crawl Files in the S3 Bucket of Kuroco RAG; Crawl Files in the specified S3 Bucket. |
| Target directory of S3 bucket | Displayed when the crawl target is an S3 folder. Enter the target directory within the S3 bucket. |
| Crawl Limit | Set the crawl limit. Specify 0 for unlimited. |
| Schedule | (Release version: β / RC version) Configure scheduled crawl execution. When "Daily" is enabled, the crawl runs automatically at the specified time. If no time is specified, 03:00 is used as the default. |
| Collect Texts | Enable to collect text data. Displayed when the crawl target is "Crawl Web Pages". |
| Collect Documents (PDF & Office Files) | Enable to collect PDF and Office files (.pdf/.xlsx/.xls/.docx/.pptx). Displayed when the crawl target is "Crawl Web Pages". |
| Collecting Images | Enable to collect images. |
| Force Update | Enable to force an update. |
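
The Crawl Limit and Schedule rules above can be sketched in a few lines of Python. This is only an illustration of the documented behavior (0 means unlimited; a daily run defaults to 03:00 when no time is given), not Kuroco's implementation. The function names `limit_reached` and `next_daily_run` are hypothetical, and time zone handling is left out because the source does not describe it.

```python
from datetime import datetime, time, timedelta
from typing import Optional

# Default start time taken from the Schedule description above.
DEFAULT_RUN_TIME = time(3, 0)


def limit_reached(crawled_count: int, crawl_limit: int) -> bool:
    """Return True once the crawl limit is hit; a limit of 0 means unlimited."""
    if crawl_limit == 0:
        return False
    return crawled_count >= crawl_limit


def next_daily_run(now: datetime, run_at: Optional[time] = None) -> datetime:
    """Return the next daily start time strictly after `now` (hypothetical helper).

    Assumption: naive local datetimes; the real scheduler's time zone is not
    stated in the source.
    """
    if run_at is None:
        run_at = DEFAULT_RUN_TIME
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate
```

For instance, calling `next_daily_run(datetime(2024, 1, 1, 2, 0))` with no time configured yields 03:00 on the same day under these assumptions.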

Website crawling settings

Displayed when the crawl target is "Crawl Web Pages".

General
| Item | Description |
| --- | --- |
| Start Urls | Enter the URLs at which to start crawling. Enter multiple URLs on separate lines. |
| Allowed Urls | Enter the URLs that may be crawled. Enter multiple URLs on separate lines. |
| Sitemap Urls | Enter the sitemap URL. |
| Denied Urls | Enter the URLs that must not be crawled. Enter multiple URLs on separate lines. |
| Allowed Next Urls | Enter the URLs that may be followed as secondary links. Enter multiple URLs on separate lines. |
| Denied Next Urls | Enter the URLs that must not be followed as secondary links. Enter multiple URLs on separate lines. |
| Allowed langs | Enter the languages to allow when the site is available in multiple languages. |
| Follow the links | Enable to crawl by following HTML links. |
| Follow the secondary links | Enable to follow secondary links from the allowed URLs. |
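
As a rough illustration of how allow and deny lists of this kind typically interact, the sketch below filters candidate URLs in Python. The matching rules are assumptions, since this page does not specify them: each entry is treated as a regular expression, a denied match takes precedence over an allowed one, and an empty allow list permits every URL. The helper names `parse_entries` and `is_allowed` are hypothetical.

```python
import re
from typing import Iterable, List


def parse_entries(text: str) -> List[str]:
    """Split a multi-line field value into stripped, non-empty entries."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def is_allowed(url: str, allowed: Iterable[str], denied: Iterable[str]) -> bool:
    """Decide whether `url` may be crawled.

    Assumptions (not confirmed by the source): entries are regular
    expressions, deny wins over allow, and an empty allow list allows all.
    """
    allowed = list(allowed)
    if any(re.search(pattern, url) for pattern in denied):
        return False
    if not allowed:
        return True
    return any(re.search(pattern, url) for pattern in allowed)
```

The same check could be applied a second time with the "Allowed Next Urls" and "Denied Next Urls" values when secondary link following is enabled.
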
Data transformation and import settings

This section contains settings related to HTML crawling. In the admin panel, it is displayed as a collapsible section with an "HTML" badge. (Release version: β / RC version)

| Item | Description |
| --- | --- |
| CSS selector for identifying main content | Enter the CSS selector that identifies the main content. |
| CSS selector for identifying categories | Enter the CSS selector that identifies the categories. |
| Strings to remove from title tag | Enter the strings to remove from the title tag. |
| CSS selector for the part to be removed from the main content | Enter the CSS selector for the part to remove from the main content. |
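
To make the "Strings to remove from title tag" setting concrete, here is a minimal Python sketch. It assumes the configured strings are removed as literal substrings and that leftover whitespace is collapsed; the source does not say whether Kuroco matches literally or by pattern. The function name `clean_title` and the sample title are hypothetical.

```python
from typing import List


def clean_title(title: str, strings_to_remove: List[str]) -> str:
    """Remove each configured string from the title (literal match assumed)."""
    for unwanted in strings_to_remove:
        title = title.replace(unwanted, "")
    # Collapse any whitespace left behind by the removals.
    return " ".join(title.split())


# A site-wide suffix is a typical candidate for removal.
print(clean_title("Getting Started | Example Docs", [" | Example Docs"]))
```

Removing a shared suffix this way keeps stored titles specific to each page rather than repeating the site name.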

Content Structure Required for Saving Crawl Data

To save crawl results as content, the content structure must include the following items.

| Item Name (Optional) | Repetition | Item Setting | Slug | Annotation (Optional) |
| --- | --- | --- | --- | --- |
| Date | | Date picker. Also include seconds (hh:mm:ss): Enabled | ymd | The updated date will be set. |
| Contents | 1 | HTML. Allow all tags: Enabled | data | Contains content converted to markdown format. |
| URL | 1 | Single-line text | url | |
| Hash Value | 1 | Single-line text | etag | Used to check for updates to the content. |
| Language | 1 | Single-line text | lang | |
| Main Content CSS Selector | 1 | Single-line text | selector | Specifies the content to extract from the page. |
| Response Status | 1 | Number | response_status | |
| Content Size | 1 | Number | content-length | |
| Content Type | 1 | Single-line text | content-type | |
| Manual Adjustment Flag | 1 | Single choice. 0: Disabled (Default), 1: Enabled | manual_override_flag | When enabled, the crawler will not overwrite. |
| Domain | 1 | Single-line text | domain | |
| Description | 1 | Single-line text | description | |
| Icon URL | 1 | Single-line text | icon_url | |
| OGP Image URL | 1 | Single-line text | ogp_image_url | |
| Images | 20 | Grouping of the 3 Items Below | images | |
| - Image URL | | File (from File manager) | image_url | |
| - Image src | | Single-line text | image_src | |
| - Alt Tag | | Single-line text | alt | |
| Last Modified | 1 | Date picker. Also include time (hh:mm): Enabled | last-modified | |
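
The roles of the `etag` and `manual_override_flag` slugs can be sketched as a small decision function. This is an illustrative reading of the annotations above (the hash is used to detect changes, and an enabled manual flag stops the crawler from overwriting), not the actual crawler logic. The function name `should_overwrite`, the dictionary representation of a stored record, and the flag being stored as "0" or "1" are assumptions.

```python
from typing import Mapping, Optional


def should_overwrite(existing: Optional[Mapping[str, str]], new_etag: str) -> bool:
    """Decide whether a freshly crawled page should replace stored content.

    `existing` is a hypothetical dict keyed by the slugs in the table above,
    or None when the URL has not been saved yet.
    """
    if existing is None:
        return True  # nothing stored yet, so save the page
    if str(existing.get("manual_override_flag", "0")) == "1":
        return False  # manually adjusted content is left untouched
    # Overwrite only when the stored hash differs from the new one.
    return existing.get("etag") != new_etag


stored = {"url": "https://example.com/docs/a", "etag": "abc", "manual_override_flag": "0"}
print(should_overwrite(stored, "abc"))  # unchanged hash, so no overwrite
print(should_overwrite(stored, "def"))  # changed hash, so overwrite
```

How the Force Update setting interacts with these two fields is not described on this page, so it is deliberately left out of the sketch.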

Crawl History

Accessing the screen

Click on [AI/RAG] -> [Web Spiders].

From the Web Spiders list page, click [History] for the web spider whose crawl history you want to view.

Item Description

| Item | Description |
| --- | --- |
| Status | Displays the current state of the crawl. |
| The Source for Crawling | Displays the target of the crawl. |
| Content | Displays the content definition name where the crawled pages are registered. |
| Start Urls | Displays the URL where the crawl started. |
| Start Date and Time | Displays the date and time when the crawl was started. |
| End Date and Time | Displays the date and time when the crawl ended. |
| Processing time | Displays the processing time of the crawl. |
| Reason for Termination | Displays the reason for the crawl ending. |
| Crawled count | Displays the number of pages processed during the crawl. |
| Log | Click to view logs related to the crawl. |
| Rerun | Click to rerun the crawl. |

Support

If you have any other questions, please contact us or check out Our Slack Community.