Creates a new web crawl data source and starts ingestion.
Required roles: All
API key for authentication
Name of the web crawl data source (maximum length: 200)
Start URL of the web crawl (maximum length: 2000)
Maximum crawl depth (1 <= x <= 10)
Maximum number of files to crawl (1 <= x <= 100000)
Path filters for crawling. The total number of characters across all elements in the array must be 2000 or fewer.
Content patterns for filtering. The total number of characters across all elements in the array must be 2000 or fewer.
When true, only HTML files will be downloaded
Whether to use a headless browser for crawling
File extensions to include (e.g. ".pdf", ".docx"). The total number of characters across all elements in the array must be 2000 or fewer, with at most 10 elements. For supported file extensions, please refer to https://developer.qaip.com/docs/datasources#%E5%AF%BE%E5%BF%9C%E3%81%97%E3%81%A6%E3%81%84%E3%82%8B%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E5%BD%A2%E5%BC%8F
Recurrence rule (RFC 5545 RRULE)
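To make the parameters above concrete, here is a minimal sketch of building and client-side-validating a request body. The JSON key names, endpoint path, and authentication header are assumptions for illustration only; this page does not specify them.

```python
def build_payload(name, start_url, max_depth=1, max_files=1000,
                  path_filters=None, crawl_patterns=None,
                  html_only=False, headless=False,
                  include_extensions=None, rrule=None):
    """Build a create-web-crawl request body, enforcing the documented
    limits client-side before sending. Key names are assumed, not confirmed."""
    if not 1 <= max_depth <= 10:
        raise ValueError("max_depth must be between 1 and 10")
    if not 1 <= max_files <= 100000:
        raise ValueError("max_files must be between 1 and 100000")
    if len(name) > 200:
        raise ValueError("name exceeds 200 characters")
    if len(start_url) > 2000:
        raise ValueError("start_url exceeds 2000 characters")
    # Array fields: total characters across all elements must be 2000 or fewer.
    for field, arr in (("path_filters", path_filters),
                       ("crawl_patterns", crawl_patterns),
                       ("include_extensions", include_extensions)):
        if arr and sum(len(s) for s in arr) > 2000:
            raise ValueError(f"total characters in {field} must be 2000 or fewer")
    payload = {"name": name, "start_url": start_url,
               "max_depth": max_depth, "max_files": max_files,
               "html_only": html_only, "headless": headless}
    if path_filters:
        payload["path_filters"] = path_filters
    if crawl_patterns:
        payload["crawl_patterns"] = crawl_patterns
    if include_extensions:
        payload["include_extensions"] = include_extensions
    if rrule:
        payload["rrule"] = rrule  # RFC 5545 RRULE, e.g. "FREQ=WEEKLY;BYDAY=MO"
    return payload

payload = build_payload("Docs crawl", "https://example.com/docs",
                        max_depth=3, include_extensions=[".pdf", ".docx"],
                        rrule="FREQ=WEEKLY;BYDAY=MO")

# Sending the request (endpoint path and header name are assumptions):
# requests.post("https://api.qaip.com/v1/datasources/webcrawls",
#               headers={"X-API-Key": api_key}, json=payload)
```

The validation mirrors the constraints listed above so that a malformed request fails fast locally rather than with a server-side error.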
Successfully created web crawl data source
Web crawl data source ID
Name of the web crawl ingestion setting
Start URL of the web crawl
Job status. One of: unknown, queued, not_started, managed, starting, started, success, failure, canceling, canceled, deleting, delete_job_failure
Web crawl ingestion setting ID
Creation time (Unix timestamp in seconds)
Job start time (Unix timestamp in seconds)
Job end time (Unix timestamp in seconds)
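A sketch of handling the response fields above: converting the Unix-second timestamps to aware UTC datetimes and checking whether the job status is terminal. The response key names and the choice of which statuses count as terminal are assumptions for illustration.

```python
from datetime import datetime, timezone

# Statuses after which the job will not change again (an assumption drawn
# from the status enum above, not stated by this page).
TERMINAL = {"success", "failure", "canceled", "delete_job_failure"}

# Hypothetical response body; key names are assumed from the field list above.
response = {
    "datasource_id": "ds_123",   # Web crawl data source ID
    "name": "Docs crawl",
    "start_url": "https://example.com/docs",
    "status": "queued",
    "created_at": 1700000000,    # Unix timestamp in seconds
    "started_at": None,          # not started yet
    "ended_at": None,
}

def to_utc(ts):
    """Convert a Unix timestamp in seconds to an aware UTC datetime (None passes through)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc) if ts is not None else None

created = to_utc(response["created_at"])
is_done = response["status"] in TERMINAL
print(created.isoformat(), is_done)
```

Because `started_at` and `ended_at` are unset until the job actually runs, the converter passes `None` through rather than raising.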