Overview
The Webpage Scraper node enables automated extraction of content from web pages, providing access to multiple scraping services and output formats. With token-based usage tracking and intelligent content extraction, this node serves as a powerful tool for web data collection and analysis within your workflows.
Usage Monitoring
Token Tracking
The node displays a "Tokens used" counter at the top (shown as "Tokens used: 0" before any runs), providing real-time monitoring of your scraping service usage:
Usage Transparency: Track consumption of scraping service tokens or credits
Resource Management: Monitor usage to stay within service limits
Cost Control: Understand the resource impact of your scraping operations
URL Configuration
Target URL Input
The URL field accepts the web page address you want to scrape:
Direct URL Entry: Enter complete URLs including the protocol (https://)
Dynamic URLs: Connect the URL input to other workflow outputs to generate addresses dynamically
URL Validation: Ensures the URL is properly formed before attempting to scrape (see the sketch below for a comparable structural check)
Example Format: Shows "https://fluxprompt.ai" as a sample URL structure
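Because malformed addresses waste tokens on failed runs, it can help to pre-check dynamically generated URLs before they reach the node. Below is a minimal Python sketch of such a structural check; it is illustrative only, not the node's internal validation logic:

```python
from urllib.parse import urlparse

def is_valid_url(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(is_valid_url("https://fluxprompt.ai"))  # True
print(is_valid_url("fluxprompt.ai"))          # False: missing protocol
```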
Content Extraction Options
Output Format Selection
The node provides toggle buttons for different content extraction formats:
Text Option
Plain Text Extraction: Extracts readable text content from the webpage
Clean Content: Removes HTML tags and formatting, providing pure text output
Accessibility: Ideal for content analysis, text processing, and readability-focused applications
Links Option
URL Extraction: Extracts all hyperlinks found on the target webpage
Link Analysis: Provides a comprehensive list of internal and external links
Navigation Mapping: Useful for site structure analysis and link discovery
Both options can be selected simultaneously to extract both text content and links from the same webpage.
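To make the two toggles concrete, the sketch below shows what text and link extraction look like when done by hand with Python's standard-library HTML parser. The node's services perform this server-side; the class here is purely illustrative:

```python
from html.parser import HTMLParser

class TextAndLinkExtractor(HTMLParser):
    """Collects visible text and hyperlink targets, mirroring the
    node's Text and Links toggles (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.text_parts, self.links = [], []
        self._skip = 0  # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

extractor = TextAndLinkExtractor()
extractor.feed('<p>Hello <a href="https://fluxprompt.ai">FluxPrompt</a></p>')
print(" ".join(extractor.text_parts))  # Hello FluxPrompt
print(extractor.links)                 # ['https://fluxprompt.ai']
```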
Scraping Service Providers
Service Selection Dropdown
The node offers multiple professional scraping services to ensure reliable content extraction:
ScrapingBee (Default)
Reliable Service: Professional web scraping with high success rates
Anti-Bot Protection: Handles websites with anti-scraping measures
JavaScript Rendering: Supports dynamic content loaded via JavaScript
BuiltWith API
Technology Detection: Specialized in identifying website technologies and tools
Website Analysis: Provides insights into the technology stack of target websites
Professional Data: Reliable technology intelligence and website profiling
Oxy Lab
Enterprise-Grade: High-performance scraping infrastructure
Global Network: Distributed scraping from multiple geographic locations
Advanced Features: Sophisticated handling of complex websites and anti-bot systems
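For context, ScrapingBee (the default) also exposes a plain HTTP API outside of the node. A minimal sketch of a direct call, with YOUR_API_KEY as a placeholder and render_js enabling JavaScript rendering; the endpoint and parameter names follow ScrapingBee's public API, and the node handles all of this for you:

```python
import requests  # third-party: pip install requests

# Direct call to ScrapingBee's HTTP API (the node manages this internally).
response = requests.get(
    "https://app.scrapingbee.com/api/v1/",
    params={
        "api_key": "YOUR_API_KEY",      # placeholder credential
        "url": "https://fluxprompt.ai",  # target page
        "render_js": "true",             # render dynamic JavaScript content
    },
    timeout=60,
)
response.raise_for_status()
html = response.text  # raw page HTML, ready for text/link extraction
```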
Service-Specific Features
Different services may provide unique capabilities, such as keyword extraction:
Enhanced Content Analysis
Some services, such as the BuiltWith API, provide additional content analysis features:
Keyword Extraction: Automatic identification of relevant keywords from scraped content
Content Categories: Classification of content into relevant categories
Technology Keywords: Identification of technology-related terms and concepts
Example Keywords Displayed:
"free" - General content indicators
"domain" - Website-related terms
"technology" - Technical content identification
"relationship" - Content relationship mapping
"keywords" - Meta-content analysis
"trends" - Trending topic identification
"companyToUrl" - Business relationship mapping
"trust" - Trust and credibility indicators
Execution and Output
Run Prompt Button
The "Run Prompt" button initiates the scraping operation:
Service Execution: Triggers the selected scraping service to extract content
Real-time Processing: Processes the webpage and returns results immediately
Error Handling: Provides feedback on failed scraping attempts
Output Display
The output section shows extracted content based on your configuration:
Formatted Results: Clean, organized presentation of scraped content
Multiple Formats: Displays both text and links when both options are selected
Structured Data: Organized output suitable for further processing in subsequent workflow nodes
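Downstream nodes consume this output like any other structured data. The exact schema depends on your configuration; assuming a hypothetical record with text and links fields (both toggles enabled), a subsequent step might filter it like this:

```python
# Hypothetical output shape; field names are assumptions for illustration.
scraped = {
    "text": "FluxPrompt builds AI workflows...",
    "links": ["https://fluxprompt.ai/docs", "https://fluxprompt.ai/pricing"],
}

internal = [u for u in scraped["links"] if u.startswith("https://fluxprompt.ai")]
print(f"{len(scraped['text'].split())} words, {len(internal)} internal links")
```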
Configuration Best Practices
URL Selection
Valid URLs: Ensure URLs are complete and accessible
Public Content: Focus on publicly accessible web pages
Stable URLs: Use permanent URLs rather than session-specific or temporary links
Target Specificity: Choose URLs that contain the specific content you need
Service Selection
Service Capabilities: Match the scraping service to your specific needs
Content Complexity: Use more advanced services for JavaScript-heavy or protected sites
Geographic Considerations: Some services offer better performance for specific regions
Rate Limiting: Consider service-specific rate limits and usage policies
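To stay under service rate limits when orchestrating many scrapes, exponential backoff with jitter is a common client-side pattern. A sketch, where fetch is a placeholder for whatever callable performs your scrape:

```python
import random
import time

def fetch_with_backoff(fetch, url, max_attempts=4):
    """Retry a scrape with exponential backoff plus jitter to avoid
    hammering a rate-limited service after a failure."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep((2 ** attempt) + random.random())  # ~1s, 2s, 4s + jitter
```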
Output Format Planning
Text Extraction: Choose text format for content analysis and processing
Link Extraction: Select links format for site mapping and navigation analysis
Combined Output: Use both formats when you need comprehensive page analysis
Use Cases
The Webpage Scraper node excels in various scenarios:
Content Monitoring: Track changes in website content over time
Competitive Analysis: Analyze competitor websites and content strategies
Research Automation: Collect information from multiple sources automatically
Link Building: Discover link opportunities and analyze site structures
Technology Tracking: Identify technologies used by target websites
Market Research: Gather market intelligence from public web sources
Compliance and Ethics
Responsible Scraping
Robots.txt Compliance: Respect website robots.txt directives (see the sketch after this list)
Rate Limiting: Avoid overwhelming target servers with excessive requests
Terms of Service: Review and comply with target website terms of service
Data Privacy: Handle scraped data in accordance with privacy regulations
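Python's standard library includes a robots.txt parser, which makes a compliance pre-check straightforward. A minimal sketch, using the sample URL from earlier:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    """Check the site's robots.txt before scraping the given URL."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # fetches and parses the live robots.txt
    return parser.can_fetch(user_agent, url)

print(allowed_by_robots("https://fluxprompt.ai/"))
```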
Legal Considerations
Public Content: Focus on publicly available information
Copyright Respect: Avoid scraping copyrighted content for commercial use
Attribution: Provide proper attribution when required
Data Usage: Use scraped data responsibly and ethically
Troubleshooting
Common Issues
Access Denied: Some websites block scraping attempts; try different services
JavaScript Content: Dynamic content may require services with JavaScript rendering
Rate Limiting: Excessive requests may trigger temporary blocks
Service Availability: Different services may have varying uptime and reliability
Optimization Tips
Service Rotation: Test different services to find the best fit for your targets
URL Validation: Verify URLs are accessible before running large-scale operations (see the pre-flight sketch below)
Output Testing: Test with sample URLs to verify output format meets your needs
Token Management: Monitor token usage to optimize resource consumption
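For the URL-validation tip above, a lightweight pre-flight check can filter out dead targets before a large run, saving tokens on pages that would fail anyway. A sketch using a HEAD request (requests is a third-party library; the second URL is deliberately unreachable):

```python
import requests  # third-party: pip install requests

def url_reachable(url: str) -> bool:
    """Pre-flight check: a HEAD request confirms the URL resolves and
    responds without downloading the page body."""
    try:
        resp = requests.head(url, timeout=10, allow_redirects=True)
        return resp.status_code < 400
    except requests.RequestException:
        return False

urls = ["https://fluxprompt.ai", "https://example.invalid"]
print([u for u in urls if url_reachable(u)])
```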
The Webpage Scraper node provides professional-grade web scraping capabilities, enabling reliable content extraction while respecting web standards and ethical guidelines.