Web Spiders
In Web Spiders, you can view, add, and update the web spiders you have created.
Web Spiders List
Accessing the screen
Click on [AI/RAG] -> [Web Spiders].
Item Description
Item | Description |
---|---|
Enabled | Indicates whether the web spider is enabled. |
Title | Displays the title of the web spider. |
The Source for Crawling | Displays the target to crawl. |
History | Click to view the crawl history. |
Updated on | Displays the date and time when the web spider was last updated. |
Web spiders editor
Accessing the screen
Click on [AI/RAG] -> [Web Spiders].
From the Web Spiders list page, click on the [Title] of the web spider you want to edit.
Basic Settings
Item | Description |
---|---|
Title | Set the title of the web spider. |
Memo | Enter a memo. |
Data Import API | Select the endpoint for data import:
|
The Source for Crawling | Select the target for crawling. Currently supported targets:
|
Crawl Limit | Set the crawl limit. Specify 0 for unlimited. |
Collecting Images | Enable if you want to collect images. |
Force Update | Enable to force an update. |
Status | Select whether the web spider is enabled. |
Website crawling settings
General
Item | Description |
---|---|
Start Urls | Enter the URLs at which to start crawling. To enter multiple URLs, put each one on its own line. |
Allowed Urls | Enter the URLs that the crawler is allowed to crawl. To enter multiple URLs, put each one on its own line. |
Sitemap Urls | Enter the sitemap URL. |
Denied Urls | Enter the URLs to deny crawling. |
Allowed langs | If the site is available in multiple languages, enter the languages to crawl. |
Follow the links | Enable to crawl by following HTML links. |
Follow the secondary links | Enable to follow secondary links. |
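As a rough illustration of how the Allowed Urls and Denied Urls settings might interact, the following Python sketch filters candidate links. It assumes prefix matching, that denied entries take precedence, and that an empty allowed list permits everything. The matching rules the web spider actually applies are not described on this page, and the function names and example URLs are hypothetical.

```python
def parse_lines(field_value: str) -> list[str]:
    """Split a multi-line form field into non-empty, trimmed entries."""
    return [line.strip() for line in field_value.splitlines() if line.strip()]


def should_crawl(url: str, allowed: list[str], denied: list[str]) -> bool:
    """Assumed rule: denied prefixes win; otherwise the URL must match an allowed prefix.

    An empty allowed list is treated as "allow everything" (an assumption).
    """
    if any(url.startswith(prefix) for prefix in denied):
        return False
    if not allowed:
        return True
    return any(url.startswith(prefix) for prefix in allowed)


# Hypothetical field values, one entry per line as in the settings form.
allowed_urls = parse_lines("https://example.com/docs/\nhttps://example.com/blog/\n")
denied_urls = parse_lines("https://example.com/docs/private/")

candidates = [
    "https://example.com/docs/intro",
    "https://example.com/docs/private/notes",
    "https://example.com/shop/item-1",
]
crawlable = [u for u in candidates if should_crawl(u, allowed_urls, denied_urls)]
print(crawlable)
```

Under these assumptions, only the first candidate is kept: the second matches a denied prefix and the third matches no allowed prefix.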
Data transformation and import settings
Item | Description |
---|---|
CSS selector for identifying main content | Enter the CSS selector that identifies the main content. |
CSS selector for identifying categories | Enter the CSS selector to identify categories. |
Strings to remove from title tag | Enter the strings to remove from the title tag. |
CSS selector for the part to be removed from the main content | Enter the CSS selector for the part to remove from the main content. |
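The "Strings to remove from title tag" setting can be pictured with a minimal Python sketch. It assumes each configured string is removed by plain substring replacement and the result is trimmed. The helper name `clean_title` and the sample title are hypothetical, and the web spider's actual processing may differ.

```python
def clean_title(raw_title: str, strings_to_remove: list[str]) -> str:
    """Remove each configured string from the title, then trim whitespace."""
    cleaned = raw_title
    for fragment in strings_to_remove:
        cleaned = cleaned.replace(fragment, "")
    return cleaned.strip()


# Hypothetical example: strip a site-name suffix that every page title shares.
title = clean_title("Getting Started | Example Docs", [" | Example Docs"])
print(title)  # Getting Started
```

Removing a shared suffix like this keeps the stored titles specific to each page.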
Content Structure Required for Saving Crawl Data
To save crawl results as content, the following content structure must be included.
Item Name (Optional) | Repetition | Item Setting | Slug | Annotation (Optional) |
---|---|---|---|---|
Date | | Date picker Also include seconds (hh:mm:ss): Enabled | ymd | The updated date will be set. |
Contents | 1 | HTML Allow all tags: Enabled | data | Contains content converted to markdown format. |
URL | 1 | Single-line text | url | |
Hash Value | 1 | Single-line text | etag | Used to check for updates to the content. |
Language | 1 | Single-line text | lang | |
Main Content CSS Selector | 1 | Single-line text | selector | Specifies the content to extract from the page. |
Response Status | 1 | Number | response_status | |
Content Size | 1 | Number | content-length | |
Content Type | 1 | Single-line text | content-type | |
Manual Adjustment Flag | 1 | Single choice 0: Disabled (Default) 1: Enabled | manual_override_flag | When enabled, the crawler will not overwrite the content. |
Domain | 1 | Single-line text | domain | |
Description | 1 | Single-line text | description | |
Icon URL | 1 | Single-line text | icon_url | |
OGP Image URL | 1 | Single-line text | ogp_image_url | |
Images | 20 | Grouping of the 3 Items Below | images | |
- Image URL | | File (from File manager) | image_url | |
- Image src | | Single-line text | image_src | |
- Alt Tag | | Single-line text | alt | |
Last Modified | 1 | Date picker Also include time (hh:mm): Enabled | last-modified | |
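To show how the `etag` and `manual_override_flag` items in the table above might be used together, here is a minimal Python sketch of an overwrite decision. It is not the crawler's implementation. The SHA-256 hash, the order of the checks, and the idea that Force Update bypasses the hash comparison are all assumptions, and `StoredPage`, `content_hash`, and `should_overwrite` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class StoredPage:
    """Subset of the content structure, keyed by slugs from the table above."""
    url: str
    etag: str                  # "Hash Value" item
    manual_override_flag: int  # 0: Disabled (default), 1: Enabled


def content_hash(body: str) -> str:
    """Hypothetical hash function; the real etag derivation may differ."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def should_overwrite(stored: StoredPage, new_body: str, force_update: bool = False) -> bool:
    """Assumed order: manual flag blocks, then force update, then hash comparison."""
    if stored.manual_override_flag == 1:
        return False  # manually adjusted content is never overwritten
    if force_update:
        return True   # assumption: Force Update bypasses the hash check
    return stored.etag != content_hash(new_body)


page = StoredPage(
    url="https://example.com/docs/intro",
    etag=content_hash("old body"),
    manual_override_flag=0,
)
print(should_overwrite(page, "old body"))  # False: hash unchanged
print(should_overwrite(page, "new body"))  # True: hash differs
```

Setting `manual_override_flag` to 1 on a record makes `should_overwrite` return `False` regardless of the other inputs, which mirrors the annotation for the Manual Adjustment Flag item.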
Crawl History
Accessing the screen
Click on [AI/RAG] -> [Web Spiders].
From the Web Spiders list page, click on the [History] of the web spider whose crawl history you want to view.
Item Description
Item | Description |
---|---|
Status | Displays the current state of the crawl. |
The Source for Crawling | Displays the target of the crawl. |
Content | Displays the content definition name where the crawled pages are registered. |
Start Urls | Displays the URL where the crawl started. |
Start Date and Time | Displays the date and time when the crawl was started. |
End Date and Time | Displays the date and time when the crawl ended. |
Processing time | Displays the processing time of the crawl. |
Reason for Termination | Displays the reason for the crawl ending. |
Crawled count | Displays the number of pages processed during the crawl. |
Log | Click to view logs related to the crawl. |
Rerun | Click to rerun the crawl. |
Support
If you have any other questions, please contact us or check out our Slack community.