Website Crawling

Website crawling lets you extract content from any public website. KnowStack crawls the pages it can find on the domain you specify, up to your plan's page limit, and saves the text content for use in Knowledge Base generation.

How to Crawl a Website

  1. Go to Data Collection and select the Websites tab.
  2. In the 'Crawl Website' field, enter a domain name (e.g., example.com) or a full URL (e.g., https://example.com/docs).
  3. Click 'Start Crawling'. The system will begin extracting content from all pages it can find on that domain.
  4. A status message shows the crawl's progress. Crawling runs in the background, so you can leave the page and come back later.
  5. Once complete, the crawled website will appear in the 'Crawled Websites' list below the form.
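
As an illustration of step 2, the sketch below shows one way a bare domain or a full URL could be turned into a starting address and a domain to stay within. It is not KnowStack's actual implementation: the function names, the HTTPS default for bare domains, and the exact-host matching rule are all assumptions made for the example.

```python
from urllib.parse import urlparse


def normalize_crawl_target(entry: str) -> tuple[str, str]:
    """Return (start_url, domain) for a bare domain or a full URL.

    Hypothetical helper: assumes HTTPS when no scheme is given.
    """
    entry = entry.strip()
    if not entry:
        raise ValueError("enter a domain name or a URL")
    # A bare domain such as "example.com" has no scheme, so add one.
    if "://" not in entry:
        entry = "https://" + entry
    parts = urlparse(entry)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not a crawlable web address: {entry!r}")
    return parts.geturl(), parts.hostname.lower()


def on_same_domain(url: str, domain: str) -> bool:
    """Assumed rule: a page belongs to the crawl only if its host matches exactly."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain


start_url, domain = normalize_crawl_target("example.com")
print(start_url, domain)
print(on_same_domain("https://example.com/docs", domain))
```

Whether subdomains such as docs.example.com count as part of the same crawl is not specified above; the exact-match rule here is only the simplest choice for a sketch.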

Viewing Crawled Content

After a crawl completes, click on any website in the Crawled Websites list to view its individual pages. Each page shows the URL, the extracted text content, and the crawl status. You can view page content, delete individual pages you do not want to include, or re-crawl the website to pick up new content.

Managing Crawls

  • Re-crawl -- Open a crawled website and start a new crawl job to pick up pages that are new or have changed since the last crawl
  • Delete individual pages -- Remove specific pages from the crawl results if they contain irrelevant content
  • Delete all crawls -- Use the 'Delete All' button to remove all crawled websites at once
  • View crawl job history -- Each website shows its crawl jobs with status (completed, failed, in progress) and the number of pages found

The number of pages you can crawl depends on your plan. Professional plans allow up to 50 pages per crawl. Business and Enterprise plans have unlimited crawling.
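
The plan limits above can be pictured as a simple lookup. The sketch below is illustrative only: the lowercase plan keys and the function name are hypothetical, and it assumes a crawl simply stops saving pages once the limit is reached, which this article does not confirm.

```python
from typing import Optional

# Page limits per crawl as stated above; None means unlimited.
# The lowercase plan keys are hypothetical identifiers for illustration.
PAGE_LIMITS: dict[str, Optional[int]] = {
    "professional": 50,
    "business": None,
    "enterprise": None,
}


def pages_saved(plan: str, pages_found: int) -> int:
    """Assumed behavior: pages beyond the plan limit are not saved."""
    limit = PAGE_LIMITS[plan]
    return pages_found if limit is None else min(pages_found, limit)


print(pages_saved("professional", 120))
print(pages_saved("business", 120))
```

Under that assumption, a Professional crawl of a 120-page site would show fewer pages than the site contains, which is worth ruling out before concluding that the site is blocking crawlers.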

KnowStack can only crawl publicly accessible pages. Pages behind login walls or paywalls, and pages that require JavaScript rendering, may not be extracted correctly. If a crawl finds fewer pages than expected, the site may be blocking crawlers.
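
One common way sites restrict crawlers is a robots.txt file. This article does not say whether KnowStack honors robots.txt, so treat the sketch below purely as a diagnostic you could run yourself when a crawl returns fewer pages than expected. It uses Python's standard urllib.robotparser module on a sample rules file; the helper name and the sample rules are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against the text of a robots.txt file (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, as they might appear at https://example.com/robots.txt
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(sample_rules, "https://example.com/docs"))
print(allowed_by_robots(sample_rules, "https://example.com/private/page"))
```

Sites can also block crawlers in ways robots.txt does not reveal, such as rate limiting or bot detection, so a permissive robots.txt does not guarantee a complete crawl.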