
Web Spiders

On the Web Spiders screen, you can view, add, and update the list of created web spiders.

Web Spiders List

Accessing the screen

Click on [AI/RAG] -> [Web Spiders].

Item Description

| Item | Description |
| --- | --- |
| Enabled | Indicates whether the web spider is enabled. |
| Title | Displays the title of the web spider. |
| The Source for Crawling | Displays the target to crawl. |
| History | Click to view the crawl history. |
| Updated on | Displays the date and time when the web spider was last updated. |

Web Spider Editor

Accessing the screen

Click on [AI/RAG] -> [Web Spiders].

From the Web Spiders list page, click the [Title] of the web spider you want to edit.

Basic Settings

| Item | Description |
| --- | --- |
| Title | Set the title of the web spider. |
| Memo | Enter a memo. |
| Data Import API | Select the endpoint for data import:<br>• Security: Dynamic Access Token<br>• Model/Operation: Topics::insert<br>• Lightweight Mode: Enabled<br>• Upsert by Columns: Slug<br>*The content structure specified in topics_group_id must include certain required fields; refer to Content Structure Required for Saving Crawl Data below for the necessary fields. A sample import request is sketched after this table. |
| The Source for Crawling | Select the target for crawling. Currently supported targets:<br>• Crawl Web Pages |
| Crawl Limit | Set the crawl limit (maximum number of pages to crawl). Specify 0 for unlimited. |
| Collecting Images | Enable to collect images during the crawl. |
| Force Update | Enable to force updates. |
| Status | Select whether the web spider is enabled. |
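
For orientation, the sketch below shows the kind of upsert request a crawler could send to the Data Import API configured above: one record per crawled page, keyed by slug so that a re-crawl updates existing content instead of duplicating it. The endpoint URL, auth header, and request shape here are assumptions for illustration (they depend on the endpoint you generate); the field names mirror the slugs listed under Content Structure Required for Saving Crawl Data below.

```typescript
// Illustrative only: the endpoint path and auth header are placeholders for the
// endpoint you create under Data Import API (Topics::insert, Upsert by Columns: Slug).
const ENDPOINT = "https://example.g.kuroco.app/rcms-api/1/crawl-import"; // hypothetical path
const ACCESS_TOKEN = "YOUR_DYNAMIC_ACCESS_TOKEN"; // dynamic access token (placeholder)

// One record per crawled page; property names follow the slugs of the required
// content structure (ymd, data, url, etag, lang, ...).
interface CrawledPage {
  slug: string;             // upsert key ("Upsert by Columns: Slug")
  ymd: string;              // updated date of the page
  data: string;             // main content converted to markdown
  url: string;
  etag: string;             // hash value used to detect changes
  lang: string;
  response_status: number;
  "content-length": number;
  "content-type": string;
}

async function importPage(page: CrawledPage): Promise<void> {
  const res = await fetch(ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // The header name is an assumption; use whatever the generated endpoint expects.
      Authorization: `Bearer ${ACCESS_TOKEN}`,
    },
    body: JSON.stringify(page),
  });
  if (!res.ok) throw new Error(`Import failed: ${res.status}`);
}
```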

Website crawling settings

General

| Item | Description |
| --- | --- |
| Start Urls | Enter the URL(s) to start crawling from. Multiple entries can be made, separated by line breaks. |
| Allowed Urls | Enter the URLs allowed for crawling. Multiple entries can be made, separated by line breaks. (How the URL lists and allowed languages interact is sketched after this table.) |
| Sitemap Urls | Enter the sitemap URL. |
| Denied Urls | Enter the URLs to exclude from crawling. |
| Allowed langs | Enter the languages to allow when the site has multiple languages. |
| Follow the links | Enable to crawl by following HTML links. |
| Follow the secondary links | Enable to follow secondary links. |
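
The exact matching rules of these lists are not documented here, but a common interpretation is prefix matching on the allowed/denied URL lists plus a language check. The sketch below illustrates that interpretation only; treat it as a way to reason about how the settings interact, not as the crawler's actual implementation.

```typescript
// Illustration of how the URL lists above could interact.
// Prefix matching is an assumption; the crawler's real matching rules may differ.
interface CrawlScope {
  allowedUrls: string[];  // "Allowed Urls", one prefix per line
  deniedUrls: string[];   // "Denied Urls"
  allowedLangs: string[]; // "Allowed langs", e.g. ["en", "ja"]
}

function shouldCrawl(url: string, pageLang: string, scope: CrawlScope): boolean {
  // A denied prefix always wins.
  if (scope.deniedUrls.some((prefix) => url.startsWith(prefix))) return false;
  // If an allow list is given, the URL must match one of its prefixes.
  if (scope.allowedUrls.length > 0 &&
      !scope.allowedUrls.some((prefix) => url.startsWith(prefix))) return false;
  // If languages are restricted, the page language must be listed.
  if (scope.allowedLangs.length > 0 && !scope.allowedLangs.includes(pageLang)) return false;
  return true;
}

// Example: allow only the /docs section in English, excluding a private area.
shouldCrawl("https://example.com/docs/intro", "en", {
  allowedUrls: ["https://example.com/docs/"],
  deniedUrls: ["https://example.com/docs/private/"],
  allowedLangs: ["en"],
}); // -> true
```
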
Data transformation and import settings

| Item | Description |
| --- | --- |
| CSS selector for identifying main content | Enter the CSS selector that identifies the main content. (An extraction example follows this table.) |
| CSS selector for identifying categories | Enter the CSS selector that identifies categories. |
| Strings to remove from title tag | Enter the strings to remove from the title tag. |
| CSS selector for the part to be removed from the main content | Enter the CSS selector for elements to remove from the main content. |
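
To make the selector settings above concrete, here is a minimal sketch of how such a transformation step could be applied to a fetched page, using cheerio as a stand-in HTML parser. The selectors and title strings are example values, and the real crawler's pipeline may differ.

```typescript
import * as cheerio from "cheerio"; // stand-in HTML parser for this illustration

// Example values; in practice these come from the settings above.
const mainSelector = "article.main";              // CSS selector for identifying main content
const categorySelector = ".breadcrumb a";         // CSS selector for identifying categories
const titleSuffixes = [" | Example Docs"];        // Strings to remove from title tag
const removeSelectors = ["nav", ".ad", "footer"]; // CSS selector for the part to be removed

function transform(html: string) {
  const $ = cheerio.load(html);

  // Strip unwanted elements before extracting the main content.
  removeSelectors.forEach((sel) => $(sel).remove());

  // Clean the <title> text.
  let title = $("title").text();
  titleSuffixes.forEach((s) => { title = title.replace(s, ""); });

  return {
    title: title.trim(),
    categories: $(categorySelector).map((_, el) => $(el).text().trim()).get(),
    mainHtml: $(mainSelector).first().html() ?? "",
  };
}
```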

Content Structure Required for Saving Crawl Data

To save crawl results as content, the following content structure must be included. A field-by-slug summary follows the table.

| Item Name (Optional) | Repetition | Item Setting | Slug | Annotation (Optional) |
| --- | --- | --- | --- | --- |
| Date | | Date picker<br>Also include seconds (hh:mm:ss): Enabled | ymd | The updated date will be set. |
| Contents | 1 | HTML<br>Allow all tags: Enabled | data | Contains content converted to markdown format. |
| URL | 1 | Single-line text | url | |
| Hash Value | 1 | Single-line text | etag | Used to check for updates to the content. |
| Language | 1 | Single-line text | lang | |
| Main Content CSS Selector | 1 | Single-line text | selector | Specifies the content to extract from the page. |
| Response Status | 1 | Number | response_status | |
| Content Size | 1 | Number | content-length | |
| Content Type | 1 | Single-line text | content-type | |
| Manual Adjustment Flag | 1 | Single choice<br>0: Disabled (Default)<br>1: Enabled | manual_override_flag | When enabled, the crawler will not overwrite. |
| Domain | 1 | Single-line text | domain | |
| Description | 1 | Single-line text | description | |
| Icon URL | 1 | Single-line text | icon_url | |
| OGP Image URL | 1 | Single-line text | ogp_image_url | |
| Images | 20 | Grouping of the 3 items below | images | |
| - Image URL | | File (from File manager) | image_url | |
| - Image src | | Single-line text | image_src | |
| - Alt Tag | | Single-line text | alt | |
| Last Modified | 1 | Date picker<br>Also include time (hh:mm): Enabled | last-modified | |
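
As a convenience for anyone consuming the imported content through the API, the sketch below restates the table above as a TypeScript type keyed by slug. It adds nothing beyond the table; the image repetition count and the 0/1 flag values are taken directly from it.

```typescript
// Convenience summary of the required content structure above, keyed by slug.
// This is a documentation aid only, not an official type definition.
interface CrawledImage {
  image_url: string;   // File (from File manager)
  image_src: string;   // original src attribute
  alt: string;         // alt text
}

interface CrawlContent {
  ymd: string;                  // Date: the updated date will be set
  data: string;                 // Contents: page converted to markdown
  url: string;
  etag: string;                 // hash value used to check for content updates
  lang: string;
  selector: string;             // main-content CSS selector used for the page
  response_status: number;
  "content-length": number;
  "content-type": string;
  manual_override_flag: 0 | 1;  // 1 = the crawler will not overwrite this content
  domain: string;
  description: string;
  icon_url: string;
  ogp_image_url: string;
  images: CrawledImage[];       // grouping of the 3 image items, up to 20 entries
  "last-modified": string;      // Date picker with time (hh:mm)
}
```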

Crawl History

Accessing the screen

Click on [AI/RAG] -> [Web Spiders].

From the Web Spiders list page, click [History] for the web spider whose crawl history you want to view.

Item Description

| Item | Description |
| --- | --- |
| Status | Displays the current state of the crawl. |
| The Source for Crawling | Displays the target of the crawl. |
| Content | Displays the content definition name where the crawled pages are registered. |
| Start Urls | Displays the URL where the crawl started. |
| Start Date and Time | Displays the date and time when the crawl started. |
| End Date and Time | Displays the date and time when the crawl ended. |
| Processing time | Displays the processing time of the crawl. |
| Reason for Termination | Displays the reason the crawl ended. |
| Crawled count | Displays the number of pages processed during the crawl. |
| Log | Click to view logs related to the crawl. |
| Rerun | Click to rerun the crawl. |

Support

If you have any other questions, please contact us or check out Our Slack Community.