Overview
The Spider Scrape Node provides a powerful web scraping tool for extracting content from websites. This node enables you to:
- Extract text content from single web pages
- Crawl multiple subpages automatically
- Parse structured data from HTML
- Include metadata for citations
- Handle dynamic and JavaScript-rendered content
Note: This node is deprecated. Please use the Web Search Node instead.
Configuration Parameters
Node Configuration
- Target Site URL: The URL of the webpage you want to scrape (e.g., https://www.example.com)
- Crawl Subpages: Enable crawling to automatically read multiple webpages linked from the target URL
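As a rough sketch, the two settings above could be represented as a simple mapping. The key names here are illustrative assumptions, not the node's actual schema:

```python
# Hypothetical configuration for a Spider Scrape Node.
# Key names are assumptions for illustration, not the node's real schema.
spider_config = {
    "target_site_url": "https://www.example.com",  # the page to scrape
    "crawl_subpages": True,  # follow links from the target URL
}

def validate_config(config: dict) -> bool:
    """Basic sanity check: the target URL must be present and use http(s)."""
    url = config.get("target_site_url", "")
    return url.startswith(("http://", "https://"))
```

A check like `validate_config` is a reasonable guard before running the node, since a missing or non-HTTP URL will cause the scrape to fail.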
Advanced Settings
- Include document metadata for citations: When enabled, includes XML-formatted metadata with source URLs for proper attribution
Expected Inputs and Outputs
- Inputs:
  - The node accepts text input that can be used to format the target URL dynamically
- Outputs:
  - content: Extracted text content from the webpage(s)
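The dynamic URL formatting described above can be sketched as follows. The `{query}` placeholder and the helper function are assumptions for illustration; the node's actual templating syntax may differ:

```python
from urllib.parse import quote

def format_target_url(template: str, text_input: str) -> str:
    """Substitute upstream text input into a URL template.

    The {query} placeholder convention is a hypothetical example;
    check the node's documentation for its real templating syntax."""
    # quote() percent-encodes the input so it is safe inside a URL
    return template.format(query=quote(text_input, safe=""))

# Example: build a search URL from an upstream text input.
url = format_target_url("https://www.example.com/search?q={query}", "web scraping")
```

Percent-encoding the input before substitution matters: raw spaces, ampersands, or slashes in upstream text would otherwise produce a malformed or unintended URL.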
Use Case Examples
- Content Aggregation: Extract articles, blog posts, or documentation from websites for analysis or archiving.
- Competitive Intelligence: Monitor competitor websites for changes in pricing, features, or content.
- Data Collection: Gather structured data from multiple pages for market research or database population.
Error Handling and Troubleshooting
- Website Access Blocked: Some websites block scraping attempts. Respect robots.txt files and website terms of service.
- JavaScript Rendering Issues: If content isn’t loading properly, the website may require JavaScript execution which Spider handles automatically.
- Rate Limiting: Avoid making too many requests too quickly to prevent being blocked by the target website.
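Spider applies its own handling for these cases, but if you want to pre-check a target yourself, the robots.txt and rate-limiting advice above can be sketched with the Python standard library. The helper names here are illustrative, not part of the node:

```python
import time
from urllib.robotparser import RobotFileParser

def can_fetch(base_url: str, path: str, user_agent: str = "*") -> bool:
    """Check a site's robots.txt before scraping a path on it."""
    parser = RobotFileParser()
    parser.set_url(f"{base_url}/robots.txt")
    parser.read()  # fetches and parses robots.txt over the network
    return parser.can_fetch(user_agent, f"{base_url}{path}")

def polite_fetch(urls, fetch, delay_seconds: float = 1.0):
    """Fetch URLs with a fixed delay between requests to avoid rate limiting."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay_seconds)  # simple fixed-interval throttle
    return results
```

A fixed delay is the simplest throttle; for larger crawls, exponential backoff on HTTP 429 responses is a common refinement.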