Web Scraper & Sitemap Generator - 网页抓取和站点地图生成

Three-in-One Web Analysis Tool: A comprehensive web scraping and sitemap generation solution that combines content extraction, site structure mapping, and link organization. Features dual-mode operation with both a user-friendly Web UI (port 7861) and an MCP Server API (port 7862), making it perfect for content migration, SEO audits, and AI training data preparation.

三合一网页分析工具: 功能全面的网页抓取和站点地图生成解决方案,结合内容提取、站点结构映射和链接组织功能。提供双模式操作,包括用户友好的 Web 界面(端口 7861)和 MCP Server API(端口 7862),非常适合内容迁移、SEO 审计和 AI 训练数据准备。

Overview | 概述

Web Scraper & Sitemap Generator is a hackathon project that emerged from the Agents MCP Hackathon, offering a unique three-pronged approach to web content analysis. Unlike traditional scrapers that focus solely on content extraction, this tool provides:

  1. Content Scraping: Extracts clean text content from web pages and converts it to Markdown format
  2. Sitemap Generation: Creates organized site maps based on discovered page links
  3. Link Classification: Automatically distinguishes between internal and external links for better site structure understanding

What sets this server apart is its dual-mode architecture: it runs both a Gradio web interface for manual exploration and an MCP Server for programmatic access by AI assistants. This makes it equally useful for human operators performing one-off analyses and AI agents conducting automated content workflows.

Web Scraper & Sitemap Generator 是一个来自 Agents MCP Hackathon 的项目,提供了独特的三管齐下的网页内容分析方法。与传统的只专注于内容提取的爬虫不同,该工具提供:

  1. 内容抓取:从网页中提取干净的文本内容并转换为 Markdown 格式
  2. 站点地图生成:基于发现的页面链接创建组织化的站点地图
  3. 链接分类:自动区分内部链接和外部链接,以便更好地理解站点结构

该服务器的独特之处在于其双模式架构:它同时运行 Gradio Web 界面供手动探索和 MCP Server 供 AI 助手进行程序化访问。这使得它对执行一次性分析的人工操作员和进行自动化内容工作流的 AI 代理都同样有用。

Key Statistics | 关键数据

  • Popularity: 49 likes on Hugging Face
  • Platform: Hugging Face Space (Gradio SDK)
  • Language: Python
  • Project Type: Hackathon Project
  • Transport: HTTP SSE (Server-Sent Events)
  • Dual Ports: 7861 (Web UI) + 7862 (MCP Server)

Core Features | 核心特性

1. Web Content Scraping | 网页内容抓取

The scraping functionality extracts text content from any publicly accessible website and converts it into clean, readable Markdown format. This process:

  • Removes HTML tags and preserves semantic structure
  • Maintains headers, lists, links, and formatting
  • Filters out scripts, styles, and navigation elements
  • Produces clean Markdown suitable for documentation or analysis

抓取功能从任何公开访问的网站提取文本内容,并将其转换为干净、可读的 Markdown 格式。这个过程:

  • 删除 HTML 标签并保留语义结构
  • 维护标题、列表、链接和格式
  • 过滤脚本、样式和导航元素
  • 生成适合文档或分析的干净 Markdown
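
As a rough sketch of how this pipeline can look in practice, the example below fetches a page, strips non-content elements, and converts the rest to Markdown. It is built on the "likely dependencies" listed later in this page (requests, BeautifulSoup4, markdownify) and is illustrative only; the project's actual implementation may differ.

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md

def scrape_to_markdown(url):
    """Fetch a page, drop non-content elements, and return Markdown."""
    html = requests.get(url, timeout=15).text
    soup = BeautifulSoup(html, "html.parser")
    # Remove scripts, styles, and navigation before conversion
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    return md(str(soup), heading_style="ATX")

print(scrape_to_markdown("https://example.com")[:500])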

2. Sitemap Generation | 站点地图生成

The sitemap generator crawls through page links to create a comprehensive map of website structure. It:

  • Discovers all linked pages from a starting URL
  • Organizes links in a hierarchical structure
  • Maps the navigation paths between pages
  • Identifies the site’s content architecture

站点地图生成器通过页面链接爬取创建网站结构的全面地图。它:

  • 从起始 URL 发现所有链接页面
  • 以层次结构组织链接
  • 映射页面之间的导航路径
  • 识别站点的内容架构

3. Link Classification | 链接分类

The link analysis feature automatically categorizes discovered links into:

  • Internal Links: Links pointing to pages within the same domain
  • External Links: Links pointing to external domains and resources
  • Resource Types: Distinguishes between pages, images, documents, etc.

This classification is essential for SEO audits, understanding site structure, and identifying outbound link patterns.

链接分析功能自动将发现的链接分类为:

  • 内部链接:指向同一域内页面的链接
  • 外部链接:指向外部域和资源的链接
  • 资源类型:区分页面、图片、文档等

这种分类对于 SEO 审计、理解站点结构和识别出站链接模式至关重要。
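
A minimal sketch of this internal/external split, using only the Python standard library (the project's own classification logic may be more elaborate):

from urllib.parse import urljoin, urlparse

def classify_links(base_url, hrefs):
    """Split discovered hrefs into internal and external absolute URLs."""
    base_domain = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(base_url, href)               # resolve relative links
        domain = urlparse(absolute).netloc.lower()
        (internal if domain == base_domain else external).append(absolute)
    return {"internal": internal, "external": external}

links = classify_links("https://example.com/docs/",
                       ["/docs/install", "https://github.com/example", "#top"])
print(len(links["internal"]), "internal,", len(links["external"]), "external")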

4. Dual-Mode Architecture | 双模式架构

Web Interface (Port 7861):

  • User-friendly Gradio interface
  • Manual URL input and instant analysis
  • Visual display of results
  • Suitable for exploratory work and one-off analyses

MCP Server (Port 7862):

  • Programmatic API access via MCP protocol
  • Integration with AI assistants like Claude
  • Batch processing capabilities
  • Ideal for automated workflows

Web 界面(端口 7861)

  • 用户友好的 Gradio 界面
  • 手动 URL 输入和即时分析
  • 结果的可视化展示
  • 适合探索性工作和一次性分析

MCP Server(端口 7862)

  • 通过 MCP 协议进行程序化 API 访问
  • 与 Claude 等 AI 助手集成
  • 批处理能力
  • 适合自动化工作流
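
For orientation, here is a hedged sketch of how a Gradio app can expose a function both as a web UI and as an MCP tool. It assumes a recent Gradio release with built-in MCP support; the file name, function body, and port wiring are illustrative and not taken from the project's source.

# mcp_server.py - illustrative sketch only, not the project's actual source
import gradio as gr

def scrape_content(url: str) -> str:
    """Fetch a page and return its content as Markdown (real logic omitted)."""
    return f"# Content from {url}\n\n(placeholder)"

demo = gr.Interface(fn=scrape_content, inputs="text", outputs="text")

if __name__ == "__main__":
    # In recent Gradio releases, mcp_server=True exposes the wrapped function
    # as an MCP tool at /gradio_api/mcp/sse on the chosen port.
    demo.launch(server_port=7862, mcp_server=True)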

5. Markdown Conversion | Markdown 转换

HTML to Markdown conversion maintains document structure while producing clean, portable text:

  • Headers: HTML heading tags (h1-h6) → Markdown headers (#, ##, ###)
  • Lists: ul/ol elements → Markdown bullet/numbered lists
  • Links: <a> tags → [text](url) format
  • Emphasis: <strong>, <em> → **bold**, *italic*
  • Code blocks: <pre>, <code> → fenced code blocks

HTML 到 Markdown 的转换在保持文档结构的同时产生干净、可移植的文本:

  • 标题:HTML 标题标签(h1-h6)→ Markdown 标题(#、##、###)
  • 列表:ul/ol 元素 → Markdown 项目符号/编号列表
  • 链接:<a> 标签 → [text](url) 格式
  • 强调:<strong>、<em> → **粗体**、*斜体*
  • 代码块:<pre>、<code> → 围栏代码块
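
To make the mapping concrete, a small example using markdownify (one of the likely dependencies listed later in this page); the exact output whitespace may vary by library version:

from markdownify import markdownify as md

html = (
    "<h2>Install</h2>"
    "<p>Use <a href='https://pypi.org'>pip</a> to install.</p>"
    "<ul><li><strong>Fast</strong></li><li><em>Simple</em></li></ul>"
)
print(md(html, heading_style="ATX"))
# Expected shape (whitespace may differ):
# ## Install
# Use [pip](https://pypi.org) to install.
# * **Fast**
# * *Simple*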

MCP Tools Documentation | MCP 工具文档

The Web Scraper MCP Server provides three powerful tools that can be accessed through any MCP-compatible client.

Tool 1: scrape_content

Description: Extracts and formats website content into clean Markdown format.

描述:提取网站内容并格式化为干净的 Markdown 格式。

Parameters | 参数:

{
  "url": {
    "type": "string",
    "description": "The URL of the website to scrape",
    "required": true,
    "example": "https://example.com/article"
  }
}

Returns | 返回:

{
  "content": "string",    // The extracted content in Markdown format
  "title": "string",      // The page title
  "url": "string",        // The scraped URL
  "timestamp": "string"   // When the content was scraped
}

Example Usage | 使用示例:

# Via MCP Client
result = await mcp_client.call_tool(
    "scrape_content",
    {
        "url": "https://docs.python.org/3/tutorial/"
    }
)

# Result will contain:
# - Clean Markdown of the Python tutorial
# - Page title
# - Scraped timestamp

Use Cases | 适用场景:

  • Extracting blog articles for content analysis
  • Converting documentation to Markdown for migration
  • Collecting text data for AI training
  • Creating offline readable versions of web content

Common Scenarios | 常见场景:

  • 提取博客文章进行内容分析
  • 将文档转换为 Markdown 以进行迁移
  • 收集文本数据用于 AI 训练
  • 创建网页内容的离线可读版本

Tool 2: generate_sitemap

Description: Generates a comprehensive sitemap of all links found on the website, organized hierarchically.

描述:生成网站上发现的所有链接的全面站点地图,按层次组织。

Parameters | 参数:

{
  "url": {
    "type": "string",
    "description": "The URL of the website to analyze",
    "required": true,
    "example": "https://example.com"
  }
}

Returns | 返回:

{
  "sitemap": [
    {
      "url": "string",
      "title": "string",
      "level": "number",    // Depth in site hierarchy
      "parent": "string"    // Parent page URL
    }
  ],
  "total_pages": "number",
  "base_url": "string",
  "generated_at": "string"
}

Example Usage | 使用示例:

# Via MCP Client
sitemap = await mcp_client.call_tool(
    "generate_sitemap",
    {
        "url": "https://docs.example.com"
    }
)

# Sitemap structure:
# - Hierarchical list of all pages
# - Navigation structure
# - Page relationships
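
Assuming the returned entries carry the fields shown in the schema above, a small helper can render the sitemap as an indented outline:

def print_sitemap_outline(sitemap):
    """Render generate_sitemap output as an indented outline; assumes each
    entry has the 'url', 'title', and 'level' fields shown in the schema."""
    # Sorting by URL keeps children near their parent, since child paths
    # share the parent's prefix.
    for page in sorted(sitemap, key=lambda p: p["url"]):
        indent = "  " * int(page.get("level", 0))
        print(f"{indent}- {page.get('title') or page['url']} ({page['url']})")

# After a generate_sitemap call:
# result = await mcp_client.call_tool("generate_sitemap", {"url": "https://docs.example.com"})
# print_sitemap_outline(result["sitemap"])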

Use Cases | 适用场景:

  • Understanding website structure before migration
  • Creating navigation documentation
  • SEO audits to identify orphaned pages
  • Mapping documentation hierarchies

Common Scenarios | 常见场景:

  • 在迁移前了解网站结构
  • 创建导航文档
  • SEO 审计以识别孤立页面
  • 映射文档层次结构

Tool 3: analyze_website

Description: Performs a complete website analysis, combining content extraction, sitemap generation, and link classification.

描述:执行完整的网站分析,结合内容提取、站点地图生成和链接分类。

Parameters | 参数:

{
  "url": {
    "type": "string",
    "description": "The URL of the website to analyze comprehensively",
    "required": true,
    "example": "https://example.com"
  }
}

Returns | 返回:

{
  "content": {
    "markdown": "string",
    "title": "string"
  },
  "sitemap": {
    "pages": ["array of page objects"],
    "total_pages": "number"
  },
  "links": {
    "internal": ["array of internal links"],
    "external": ["array of external links"],
    "internal_count": "number",
    "external_count": "number"
  },
  "analysis_summary": {
    "domain": "string",
    "analyzed_at": "string",
    "content_size": "number"
  }
}

Example Usage | 使用示例:

# Via MCP Client
analysis = await mcp_client.call_tool(
    "analyze_website",
    {
        "url": "https://blog.example.com"
    }
)

# Complete analysis includes:
# - Full content in Markdown
# - Complete site structure
# - Internal vs external link breakdown
# - Summary statistics
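
Assuming the field names shown in the schema above, a short helper can reduce the combined result to a one-line summary:

def summarize_analysis(analysis):
    """One-line summary of an analyze_website result (field names as above)."""
    links = analysis["links"]
    total = links["internal_count"] + links["external_count"]
    share = links["internal_count"] / total if total else 0
    return (f"{analysis['sitemap']['total_pages']} pages, "
            f"{links['internal_count']} internal / {links['external_count']} external links "
            f"({share:.0%} internal)")

# print(summarize_analysis(analysis))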

Use Cases | 适用场景:

  • Comprehensive site audits
  • Documentation migration planning
  • Content strategy analysis
  • Link profile evaluation for SEO

Common Scenarios | 常见场景:

  • 全面的站点审计
  • 文档迁移规划
  • 内容策略分析
  • SEO 的链接配置文件评估

Installation & Configuration | 安装与配置

Method 1: Local Web Interface | 方式 1:本地 Web 界面

This method runs the Gradio web interface for manual, interactive web scraping.

此方法运行 Gradio Web 界面以进行手动交互式网页抓取。

Requirements | 前置要求:

  • Python 3.8 or higher
  • Git

Installation Steps | 安装步骤:

# Clone the repository from Hugging Face
git clone https://huggingface.co/spaces/Agents-MCP-Hackathon/web-scraper
cd web-scraper

# Install dependencies
pip install -r requirements.txt

# Run the web interface
python app.py

# The web interface will be available at:
# http://localhost:7861

Using the Web Interface | 使用 Web 界面:

  1. Open your browser to http://localhost:7861
  2. Enter the URL you want to scrape in the input field
  3. Choose the operation:
    • Scrape Content: Extract and convert to Markdown
    • Generate Sitemap: Create site structure map
    • Analyze Website: Perform complete analysis
  4. Click the appropriate button to start the operation
  5. View results directly in the interface

Method 2: Local MCP Server | 方式 2:本地 MCP Server

This method runs the MCP Server for programmatic access by AI assistants.

此方法运行 MCP Server 以供 AI 助手进行程序化访问。

Installation Steps | 安装步骤:

# Clone and install (same as Method 1)
git clone https://huggingface.co/spaces/Agents-MCP-Hackathon/web-scraper
cd web-scraper
pip install -r requirements.txt

# Run the MCP Server
python mcp_server.py

# The MCP Server will be available at:
# http://localhost:7862/gradio_api/mcp/sse

Claude Desktop Configuration | Claude Desktop 配置:

On macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
On Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "web-scraper": {
      "url": "http://localhost:7862/gradio_api/mcp/sse"
    }
  }
}

After updating the configuration:

  1. Restart Claude Desktop
  2. The web scraper tools will appear in Claude’s tool list
  3. You can now ask Claude to scrape websites, generate sitemaps, or analyze sites

配置更新后:

  1. 重启 Claude Desktop
  2. 网页抓取工具将出现在 Claude 的工具列表中
  3. 现在可以要求 Claude 抓取网站、生成站点地图或分析站点

Method 3: Remote Hugging Face Space | 方式 3:远程 Hugging Face Space

Use the hosted version on Hugging Face without local installation.

使用 Hugging Face 上的托管版本,无需本地安装。

Configuration | 配置:

{
  "mcpServers": {
    "web-scraper-remote": {
      "type": "sse",
      "url": "https://agents-mcp-hackathon-web-scraper.hf.space/gradio_api/mcp/sse"
    }
  }
}

Advantages | 优势:

  • No local setup required
  • Always up-to-date
  • No resource consumption on local machine

Considerations | 注意事项:

  • Requires internet connection
  • May have rate limits
  • Shared resource with other users

Use Cases & Workflows | 使用场景与工作流

1. Content Migration Workflow | 内容迁移工作流

Scenario: Migrating a legacy documentation site to a modern documentation platform (e.g., moving from WordPress to Docusaurus).

场景:将传统文档站点迁移到现代文档平台(例如,从 WordPress 迁移到 Docusaurus)。

Workflow | 工作流程:

Step 1: Analyze the site structure
- Use analyze_website tool on the documentation root URL
- Review the sitemap to understand content organization
- Identify internal link patterns

Step 2: Extract content
- Use scrape_content for each documentation page
- Content is automatically converted to Markdown
- Preserve the original URL structure for reference

Step 3: Organize content
- Use the sitemap to recreate directory structure
- Place Markdown files in appropriate directories
- Update internal links to match new structure

Step 4: Validate migration
- Check that all pages were migrated
- Verify internal links work correctly
- Ensure external links are preserved
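
A hedged sketch of Steps 2-3: each scraped page is written to a local Markdown file whose path mirrors the source URL (the output directory name is arbitrary):

from pathlib import Path
from urllib.parse import urlparse

def save_page(url, markdown, out_dir="migrated-docs"):
    """Write one scraped page to a .md file whose path mirrors the source URL."""
    slug = urlparse(url).path.strip("/") or "index"
    target = Path(out_dir) / f"{slug}.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(f"<!-- source: {url} -->\n\n{markdown}", encoding="utf-8")
    return target

# save_page("https://old-docs.example.com/guides/setup", "# Setup\n...")
# -> migrated-docs/guides/setup.md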

Example Claude Conversation | Claude 对话示例:

User: I need to migrate our documentation from https://old-docs.example.com to a new Docusaurus site. Can you help analyze the structure first?

Claude: I'll analyze the documentation site structure for you.

[Uses analyze_website tool]

I've found:
- 47 documentation pages organized in 3 main sections
- 152 internal links connecting pages
- 23 external reference links
- Average content size: 3.5KB per page

The site has this structure:
- Getting Started (12 pages)
- API Reference (28 pages)
- Guides (7 pages)

Would you like me to start extracting the content from the Getting Started section?

2. SEO Audit Workflow | SEO 审计工作流

Scenario: Performing a comprehensive SEO audit to identify link issues and site structure problems.

场景:执行全面的 SEO 审计以识别链接问题和站点结构问题。

Audit Checklist | 审计清单:

  1. Site Structure Analysis

    • Generate sitemap to visualize site hierarchy
    • Identify orphaned pages (pages with no internal links)
    • Check navigation depth (pages more than 3 clicks from home)
  2. Internal Linking

    • Count internal links per page
    • Identify pages with few inbound links
    • Check for broken internal links
  3. External Links

    • List all external links
    • Categorize by domain
    • Identify potential link opportunities
  4. Content Quality

    • Extract content from key pages
    • Analyze content length and structure
    • Identify thin content pages

Example Analysis | 分析示例:

SEO Audit Results for https://example.com

Site Structure:
✓ 89 pages total
⚠ 5 orphaned pages found (no internal links)
⚠ 12 pages at depth 4+ (too deep)

Internal Linking:
✓ Average 8.3 internal links per page
⚠ 7 pages with <3 internal links
✗ 2 broken internal links detected

External Links:
✓ 34 external links total
✓ Mix of authority domains
⚠ 3 external links to deprecated resources

Content:
✓ Average content: 1,250 words
⚠ 4 thin content pages (<300 words)
✓ Good header structure on most pages
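
One way the orphaned-pages figure can be computed, assuming the analyze_website return shape documented earlier (a page is treated as orphaned if it never appears as an internal link target; the start URL may show up here unless other pages link back to it):

def find_orphan_pages(analysis):
    """Pages in the sitemap that are never the target of an internal link."""
    all_pages = {p["url"] for p in analysis["sitemap"]["pages"]}
    linked_to = set(analysis["links"]["internal"])
    return sorted(all_pages - linked_to)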

3. AI Training Data Preparation | AI 训练数据准备

Scenario: Collecting high-quality text data from curated websites for training or fine-tuning language models.

场景:从精选网站收集高质量文本数据用于训练或微调语言模型。

Workflow | 工作流程:

# Pseudo-code for data collection workflow
# (analyze_website / scrape_content stand in for the MCP tool calls;
#  extract_domain, categorize_content, and export_to_jsonl are your own helpers)

urls_to_scrape = [
    "https://docs.python.org",
    "https://docs.djangoproject.com",
    "https://flask.palletsprojects.com",
]

collected_data = []

for url in urls_to_scrape:
    # Step 1: Analyze to get all pages
    analysis = analyze_website(url)

    # Step 2: Extract content from each page
    for page in analysis['sitemap']['pages']:
        content = scrape_content(page['url'])

        # Step 3: Clean and structure data
        collected_data.append({
            'text': content['markdown'],
            'source': content['url'],
            'title': content['title'],
            'domain': extract_domain(content['url']),
            'category': categorize_content(content),
        })

# Step 4: Export for training
export_to_jsonl(collected_data, 'training_data.jsonl')

Data Quality Checks | 数据质量检查:

  • Remove boilerplate (headers, footers, navigation)
  • Filter out short content (<100 words)
  • Deduplicate similar content
  • Preserve code examples and formatting
  • Maintain source attribution
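
A small sketch of the length filter and exact-duplicate removal from the checklist above, assuming records shaped like the pseudo-code output:

import hashlib

def clean_training_records(records, min_words=100):
    """Drop thin pages and exact duplicates; each record's 'text' field holds
    the scraped Markdown, as in the pseudo-code above."""
    seen, kept = set(), []
    for rec in records:
        text = rec["text"].strip()
        if len(text.split()) < min_words:
            continue  # thin content
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        seen.add(digest)
        kept.append(rec)
    return kept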

4. Documentation Monitoring | 文档监控

Scenario: Regularly monitoring documentation sites for changes, broken links, or structural issues.

场景:定期监控文档站点的更改、损坏的链接或结构问题。

Monitoring Workflow | 监控工作流:

Daily Check:
1. Generate current sitemap
2. Compare with previous day's sitemap
3. Identify:
- New pages added
- Pages removed
- Changes in link structure
4. Alert on significant changes

Weekly Deep Check:
1. Scrape all pages
2. Check for broken external links
3. Validate internal link structure
4. Generate change report

Automated Monitoring Script | 自动化监控脚本:

import json

def monitor_documentation(url, previous_state_file):
    # Get current state (analyze_website wraps the MCP tool call;
    # send_alert is your own notification helper)
    current = analyze_website(url)

    # Load previous state
    with open(previous_state_file, 'r') as f:
        previous = json.load(f)

    # Compare
    changes = {
        'new_pages': [],
        'removed_pages': [],
        'structure_changes': []
    }

    current_urls = {p['url'] for p in current['sitemap']['pages']}
    previous_urls = {p['url'] for p in previous['sitemap']['pages']}

    changes['new_pages'] = list(current_urls - previous_urls)
    changes['removed_pages'] = list(previous_urls - current_urls)

    # Save current state for next run
    with open(previous_state_file, 'w') as f:
        json.dump(current, f)

    # Send alert if significant changes
    if len(changes['new_pages']) > 5 or len(changes['removed_pages']) > 0:
        send_alert(changes)

    return changes

5. Knowledge Base Construction | 知识库构建

Scenario: Building a searchable knowledge base from multiple documentation sources.

场景:从多个文档来源构建可搜索的知识库。

Construction Steps | 构建步骤:

  1. Source Identification | 识别来源

    • List all documentation sources
    • Prioritize by relevance and authority
  2. Content Extraction | 内容提取

    • Use analyze_website for each source
    • Extract all pages to Markdown
    • Maintain source metadata
  3. Content Processing | 内容处理

    • Clean and normalize Markdown
    • Extract key sections and topics
    • Generate embeddings for semantic search
  4. Indexing | 索引

    • Index content in search engine (e.g., Elasticsearch)
    • Create hierarchical navigation
    • Link related content across sources
  5. Presentation | 呈现

    • Build search interface
    • Display content with source attribution
    • Maintain links to original sources

Technical Architecture | 技术架构

System Components | 系统组件

┌─────────────────────────────────────────────┐
│          Gradio Application Layer           │
│  ┌──────────────┐        ┌──────────────┐   │
│  │    Web UI    │        │  MCP Server  │   │
│  │  Port 7861   │        │  Port 7862   │   │
│  └──────┬───────┘        └──────┬───────┘   │
└─────────┼───────────────────────┼───────────┘
          │                       │
          └───────────┬───────────┘
                      │
           ┌──────────▼──────────┐
           │     Core Engine     │
           │  ┌───────────────┐  │
           │  │  HTML Fetcher │  │
           │  └───────┬───────┘  │
           │  ┌───────▼───────┐  │
           │  │  HTML Parser  │  │
           │  └───────┬───────┘  │
           │  ┌───────▼───────┐  │
           │  │ MD Converter  │  │
           │  └───────┬───────┘  │
           │  ┌───────▼───────┐  │
           │  │ Link Analyzer │  │
           │  └───────────────┘  │
           └─────────────────────┘

Processing Pipeline | 处理管道

Content Scraping Pipeline | 内容抓取管道:

URL Input → HTTP Request → HTML Response → HTML Parsing
→ Content Extraction → Markdown Conversion → Output

Sitemap Generation Pipeline | 站点地图生成管道:

Start URL → Page Fetch → Extract Links → Filter Links
→ Categorize (Internal/External) → Recursive Crawl
→ Build Hierarchy → Generate Sitemap → Output

Full Analysis Pipeline | 完整分析管道:

URL Input → [Content Pipeline] + [Sitemap Pipeline]
→ Link Classification → Aggregate Results
→ Generate Summary → Output Combined Analysis

Technology Stack | 技术栈

Core Framework | 核心框架:

  • Gradio: Web UI and MCP server framework
  • Python: Primary implementation language

Likely Dependencies | 可能的依赖:

  • requests or httpx: HTTP client for fetching web pages
  • BeautifulSoup4 or lxml: HTML parsing
  • html2text or markdownify: HTML to Markdown conversion
  • urllib: URL parsing and manipulation

MCP Integration | MCP 集成:

  • Transport: Server-Sent Events (SSE)
  • Endpoint: /gradio_api/mcp/sse
  • Protocol: MCP (Model Context Protocol)

Performance Considerations | 性能考虑

Optimization Strategies | 优化策略:

  1. Caching | 缓存

    • Cache fetched pages to avoid redundant requests
    • Store parsed results for repeated analysis
    • Implement TTL for cache invalidation
  2. Rate Limiting | 速率限制

    • Respect robots.txt directives
    • Implement polite crawling delays
    • Limit concurrent requests
  3. Parallel Processing | 并行处理

    • Fetch multiple pages concurrently
    • Process content in parallel threads
    • Use async/await for I/O operations
  4. Resource Management | 资源管理

    • Limit crawl depth to prevent runaway scraping
    • Set maximum page count per analysis
    • Implement timeouts for slow sites
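
A minimal sketch of the caching and rate-limiting ideas above; the TTL and delay values are arbitrary placeholders:

import time
import requests

_cache = {}          # url -> (fetched_at, html)
CACHE_TTL = 600      # seconds before a cached page is considered stale
CRAWL_DELAY = 1.0    # polite pause between outgoing requests

def fetch_cached(url):
    """Return cached HTML while fresh; otherwise wait briefly, fetch, and cache."""
    now = time.time()
    if url in _cache and now - _cache[url][0] < CACHE_TTL:
        return _cache[url][1]
    time.sleep(CRAWL_DELAY)              # simple rate limiting
    html = requests.get(url, timeout=15).text
    _cache[url] = (now, html)
    return html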

Best Practices | 最佳实践

For Content Scraping | 内容抓取

  1. Respect Website Policies | 尊重网站政策

    • Check and honor robots.txt
    • Follow crawl-delay directives
    • Don’t overwhelm servers with requests
  2. Handle Errors Gracefully | 优雅处理错误

    • Implement retry logic for failed requests
    • Handle timeouts appropriately
    • Log errors for debugging
  3. Clean Content Effectively | 有效清理内容

    • Remove navigation elements
    • Strip advertisements and sidebars
    • Preserve meaningful structure
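
For the robots.txt point above, a standard-library check can gate every fetch; the user-agent string is a placeholder:

from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, user_agent="web-scraper-example"):
    """Check the site's robots.txt before scraping a URL (standard library only)."""
    root = f"{urlparse(url).scheme}://{urlparse(url).netloc}"
    parser = RobotFileParser()
    parser.set_url(urljoin(root, "/robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, url)

# if allowed_by_robots("https://example.com/docs/"):
#     ...fetch the page...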

For Sitemap Generation | 站点地图生成

  1. Set Appropriate Limits | 设置适当限制

    • Limit crawl depth (e.g., max 3-4 levels)
    • Cap total pages analyzed (e.g., 500 pages)
    • Set timeout for entire operation
  2. Handle Redirects | 处理重定向

    • Follow redirects automatically
    • Track redirect chains
    • Use final URL in sitemap
  3. Normalize URLs | 规范化 URL

    • Remove query parameters when appropriate
    • Handle trailing slashes consistently
    • Resolve relative URLs correctly
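
A small normalization helper along these lines, using only the standard library:

from urllib.parse import urljoin, urlparse, urlunparse

def normalize_url(base, href):
    """Resolve a relative link and normalize it for deduplication:
    drop the fragment, lowercase the host, strip a trailing slash."""
    parts = urlparse(urljoin(base, href))
    path = parts.path.rstrip("/") or "/"
    return urlunparse((parts.scheme, parts.netloc.lower(), path, "", parts.query, ""))

# normalize_url("https://Example.com/docs/", "install/#setup")
# -> "https://example.com/docs/install"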

For Link Analysis | 链接分析

  1. Accurate Classification | 准确分类

    • Use domain matching for internal/external
    • Handle subdomains correctly
    • Consider protocol differences (http vs https)
  2. Link Validation | 链接验证

    • Check for broken links (optional)
    • Identify redirect chains
    • Flag suspicious links
  3. Context Preservation | 上下文保留

    • Maintain anchor text
    • Record link position in content
    • Note link relationships

Troubleshooting | 故障排除

Common Issues | 常见问题

Issue 1: MCP Server Not Connecting | MCP Server 无法连接

Symptom: Claude can't see the web scraper tools

Solutions:
1. Verify server is running: check http://localhost:7862/gradio_api/mcp/sse
2. Check configuration file syntax (valid JSON)
3. Restart Claude Desktop after config changes
4. Check firewall isn't blocking port 7862

Issue 2: Scraping Returns Empty Content | 抓取返回空内容

Symptom: scrape_content returns blank or minimal text

Possible causes:
1. JavaScript-heavy site (content loaded dynamically)
2. Content behind authentication
3. Anti-scraping measures (captcha, rate limiting)
4. Invalid URL or network issues

Solutions:
- Check if content is visible when disabling JavaScript
- Verify URL is publicly accessible
- Add delays between requests
- Check for error messages in logs

Issue 3: Sitemap Generation Timeout | 站点地图生成超时

Symptom: generate_sitemap operation times out

Possible causes:
1. Site has too many pages
2. Site has slow response times
3. Infinite redirect loops
4. Complex site structure

Solutions:
- Start with a more specific URL (not homepage)
- Set lower crawl depth limits
- Implement caching
- Use smaller site sections

Issue 4: Broken Internal Links in Output | 输出中的内部链接损坏

Symptom: Internal links in Markdown don't work

Possible causes:
1. Relative URLs not resolved correctly
2. Base URL handling issues
3. Fragment links not preserved

Solutions:
- Use absolute URLs in configuration
- Check URL normalization logic
- Verify base URL is set correctly

Debug Mode | 调试模式

Enable detailed logging to troubleshoot issues:

启用详细日志记录以排除问题:

# Add to configuration or environment
DEBUG = True
LOG_LEVEL = "DEBUG"

# This will show:
# - HTTP requests and responses
# - Parsing steps
# - Link extraction details
# - Error stack traces
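
If the project reads standard Python logging settings (an assumption; only the flags above are documented here), an equivalent stdlib setup looks like this:

import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
# If the scraper uses requests, urllib3's logger carries the HTTP details:
logging.getLogger("urllib3").setLevel(logging.DEBUG)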

Limitations & Considerations | 限制与注意事项

Technical Limitations | 技术限制

  1. JavaScript-Rendered Content | JavaScript 渲染的内容

    • Cannot scrape content loaded by JavaScript
    • Single-page applications may not work well
    • Dynamic content might be missed
  2. Authentication-Protected Content | 身份验证保护的内容

    • Cannot access content behind login
    • No session management support
    • Public pages only
  3. Rate Limiting | 速率限制

    • May trigger anti-scraping measures
    • Shared Hugging Face space has limits
    • Concurrent requests limited
  4. Large Sites | 大型站点

    • Full site analysis may be time-consuming
    • Memory usage increases with site size
    • May need to analyze in sections

Ethical & Legal Considerations | 道德与法律注意事项

  1. Copyright | 版权

    • Respect content copyright
    • Don’t republish scraped content without permission
    • Use for analysis and personal purposes only
  2. Terms of Service | 服务条款

    • Check website’s terms of service
    • Some sites prohibit scraping
    • Respect robots.txt directives
  3. Privacy | 隐私

    • Don’t scrape personal information
    • Be cautious with user-generated content
    • Follow GDPR and other privacy regulations
  4. Server Load | 服务器负载

    • Don’t overwhelm target servers
    • Use polite crawl delays
    • Consider impact on site performance

Comparison with Alternatives | 与替代方案的比较

vs. Traditional Scrapers | 与传统爬虫对比

Web Scraper MCP Server | 本工具:

  • ✓ Dual-mode operation (UI + API)
  • ✓ Three-in-one analysis (content + sitemap + links)
  • ✓ MCP integration for AI assistants
  • ✓ Clean Markdown output
  • ✗ No JavaScript rendering
  • ✗ Limited to public content

Scrapy:

  • ✓ Highly customizable
  • ✓ Powerful crawling engine
  • ✓ Large ecosystem of extensions
  • ✗ Steeper learning curve
  • ✗ No built-in Markdown conversion
  • ✗ Requires coding

BeautifulSoup:

  • ✓ Simple and flexible
  • ✓ Excellent HTML parsing
  • ✗ No crawling capabilities
  • ✗ Manual implementation needed
  • ✗ No sitemap generation

vs. Commercial Tools | 与商业工具对比

Web Scraper MCP Server | 本工具:

  • ✓ Free and open source
  • ✓ Self-hosted
  • ✓ Customizable
  • ✗ Limited features compared to commercial
  • ✗ Manual setup required

Screaming Frog SEO Spider:

  • ✓ Comprehensive SEO analysis
  • ✓ Advanced crawling capabilities
  • ✓ Detailed reporting
  • ✗ Commercial license required
  • ✗ Desktop application only
  • ✗ No MCP integration

import.io:

  • ✓ Visual scraping tool
  • ✓ Cloud-based
  • ✓ API access
  • ✗ Expensive
  • ✗ Proprietary
  • ✗ Limited free tier

FAQ | 常见问题

General Questions | 一般问题

Q: What types of websites work best with this scraper?

A: Static HTML sites work best. This includes:

  • Documentation sites
  • Blogs and news sites
  • Corporate websites
  • Educational resources
  • Markdown-based sites

Sites that DON’T work well:

  • Single-page applications (SPAs) with heavy JavaScript
  • Sites requiring authentication
  • Content behind paywalls
  • Sites with aggressive anti-scraping

问:哪些类型的网站最适合这个爬虫?

答:静态 HTML 站点效果最好。包括:

  • 文档站点
  • 博客和新闻站点
  • 企业网站
  • 教育资源
  • 基于 Markdown 的站点

不太适合的站点:

  • 使用大量 JavaScript 的单页应用程序(SPA)
  • 需要身份验证的站点
  • 付费内容
  • 具有激进反爬虫措施的站点

Q: Can I use this for commercial projects?

A: Check the project’s license on Hugging Face. Generally:

  • Personal use: Usually fine
  • Commercial use: May have restrictions
  • Always respect website terms of service
  • Don’t violate copyright

问:我可以将其用于商业项目吗?

答:查看 Hugging Face 上的项目许可证。通常:

  • 个人使用:通常可以
  • 商业使用:可能有限制
  • 始终尊重网站服务条款
  • 不要违反版权

Q: How is this different from using curl or wget?

A: This tool provides:

  • Automatic Markdown conversion
  • Sitemap generation
  • Link classification and analysis
  • Clean content extraction (removes navigation, ads)
  • MCP integration for AI assistants
  • User-friendly interface

curl/wget provide:

  • Raw HTML only
  • No content processing
  • No analysis features
  • More manual work required

问:这与使用 curl 或 wget 有何不同?

答:本工具提供:

  • 自动 Markdown 转换
  • 站点地图生成
  • 链接分类和分析
  • 干净的内容提取(删除导航、广告)
  • AI 助手的 MCP 集成
  • 用户友好界面

curl/wget 提供:

  • 仅原始 HTML
  • 无内容处理
  • 无分析功能
  • 需要更多手动工作

Technical Questions | 技术问题

Q: Can I scrape multiple pages at once?

A: Yes, through different approaches:

  • Use analyze_website to get all pages at once
  • Call scrape_content multiple times via MCP
  • For batch processing, use a script with the MCP client

问:我可以一次抓取多个页面吗?

答:可以,通过不同方法:

  • 使用 analyze_website 一次获取所有页面
  • 通过 MCP 多次调用 scrape_content
  • 对于批处理,使用带有 MCP 客户端的脚本

Q: Does this respect robots.txt?

A: Implementation may vary. Best practices:

  • Check robots.txt before scraping
  • Honor crawl-delay directives
  • Don’t scrape disallowed paths
  • Add your own checks if needed

问:这个工具是否遵守 robots.txt?

答:实现可能有所不同。最佳实践:

  • 抓取前检查 robots.txt
  • 遵守 crawl-delay 指令
  • 不要抓取禁止的路径
  • 如需要添加自己的检查

Q: Can I modify the Markdown output format?

A: Yes, since this is open source:

  • Clone the repository
  • Modify the Markdown conversion logic
  • Adjust the output format to your needs
  • Run your customized version

问:我可以修改 Markdown 输出格式吗?

答:可以,因为这是开源的:

  • 克隆存储库
  • 修改 Markdown 转换逻辑
  • 调整输出格式以满足您的需求
  • 运行您的自定义版本

Q: How do I handle sites with pagination?

A: Strategies:

  1. Use generate_sitemap to find all paginated URLs
  2. Scrape each page individually with scrape_content
  3. Combine results in your application
  4. Look for “next page” links in the sitemap

问:如何处理带分页的站点?

答:策略:

  1. 使用 generate_sitemap 查找所有分页 URL
  2. 使用 scrape_content 单独抓取每个页面
  3. 在您的应用程序中组合结果
  4. 在站点地图中查找“下一页”链接

Setup & Configuration | 设置和配置

Q: Do I need to install this locally or can I use the Hugging Face Space directly?

A: Both options work:

Local Installation:

  • More control and customization
  • No external dependencies
  • Better for high-volume usage
  • Privacy (data stays local)

Hugging Face Space:

  • No installation needed
  • Always up-to-date
  • Shared resources
  • May have usage limits

问:我需要在本地安装还是可以直接使用 Hugging Face Space?

答:两种选项都可以:

本地安装

  • 更多控制和自定义
  • 无外部依赖
  • 更适合大量使用
  • 隐私(数据保留在本地)

Hugging Face Space

  • 无需安装
  • 始终保持最新
  • 共享资源
  • 可能有使用限制

Q: Which port should I use for MCP - 7861 or 7862?

A: Always use port 7862 for MCP connections.

  • Port 7861: Web UI (for human use)
  • Port 7862: MCP Server (for AI assistants)

问:我应该使用哪个端口连接 MCP - 7861 还是 7862?

答:始终使用 端口 7862 进行 MCP 连接。

  • 端口 7861:Web 界面(供人类使用)
  • 端口 7862:MCP Server(供 AI 助手使用)

Q: Can I run both the Web UI and MCP Server at the same time?

A: Yes! They run on different ports:

  • Start both servers simultaneously
  • Use Web UI for manual exploration
  • Use MCP Server for automated workflows
  • No conflict between them

问:我可以同时运行 Web 界面和 MCP Server 吗?

答:可以!它们运行在不同端口上:

  • 同时启动两个服务器
  • 使用 Web 界面进行手动探索
  • 使用 MCP Server 进行自动化工作流
  • 它们之间没有冲突

Advanced Usage | 高级用法

Custom Scraping Workflows | 自定义抓取工作流

For advanced users who want to build custom workflows on top of the MCP tools:

对于想要在 MCP 工具基础上构建自定义工作流的高级用户:

import asyncio
# mcp_client / MCPClient is an illustrative MCP client wrapper, not a specific
# published package; substitute whichever MCP client library you use.
from mcp_client import MCPClient

async def advanced_documentation_scraper():
    """
    Advanced workflow: Scrape documentation site,
    extract code examples, and organize by topic
    """
    client = MCPClient("http://localhost:7862/gradio_api/mcp/sse")

    # Step 1: Get site structure
    analysis = await client.call_tool(
        "analyze_website",
        {"url": "https://docs.example.com"}
    )

    # Step 2: Filter for API reference pages
    api_pages = [
        page for page in analysis['sitemap']['pages']
        if '/api/' in page['url']
    ]

    # Step 3: Extract content from each API page
    api_docs = []
    for page in api_pages:
        content = await client.call_tool(
            "scrape_content",
            {"url": page['url']}
        )

        # Parse Markdown to extract code examples
        code_examples = extract_code_blocks(content['markdown'])

        api_docs.append({
            'url': page['url'],
            'title': content['title'],
            'examples': code_examples
        })

    # Step 4: Organize by API category
    organized = organize_by_category(api_docs)

    # Step 5: Generate index and save
    # (generate_index and save_to_files are left to the reader)
    generate_index(organized)
    save_to_files(organized)

    return organized

def extract_code_blocks(markdown_text):
    """Extract fenced code blocks from Markdown"""
    import re
    pattern = r'```[\w]*\n(.*?)```'
    return re.findall(pattern, markdown_text, re.DOTALL)

def organize_by_category(docs):
    """Organize docs by URL path category"""
    categories = {}
    for doc in docs:
        # extract_category (e.g. a urlparse-based helper) is left to the reader
        category = extract_category(doc['url'])
        if category not in categories:
            categories[category] = []
        categories[category].append(doc)
    return categories

# Run the workflow
asyncio.run(advanced_documentation_scraper())

Integration with Vector Databases | 与向量数据库集成

Use scraped content to populate a vector database for semantic search:

使用抓取的内容填充向量数据库以进行语义搜索:

from mcp_client import MCPClient  # illustrative MCP client wrapper (see above)
from sentence_transformers import SentenceTransformer
import chromadb

def split_into_chunks(text, chunk_size=500):
    """Naive whitespace-based chunker; replace with a smarter splitter as needed."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

async def build_searchable_knowledge_base(urls):
    """
    Scrape multiple URLs and index in vector database
    """
    client = MCPClient("http://localhost:7862/gradio_api/mcp/sse")
    model = SentenceTransformer('all-MiniLM-L6-v2')
    chroma_client = chromadb.Client()
    collection = chroma_client.create_collection("docs")

    for url in urls:
        # Scrape content
        result = await client.call_tool(
            "scrape_content",
            {"url": url}
        )

        # Split into chunks
        chunks = split_into_chunks(result['markdown'], chunk_size=500)

        # Generate embeddings
        embeddings = model.encode(chunks)

        # Add to vector database
        collection.add(
            embeddings=embeddings.tolist(),
            documents=chunks,
            metadatas=[{"source": url, "title": result['title']}] * len(chunks),
            ids=[f"{url}_{i}" for i in range(len(chunks))]
        )

    return collection

Contributing & Community | 贡献与社区

How to Contribute | 如何贡献

This is a hackathon project hosted on Hugging Face. Ways to contribute:

这是一个托管在 Hugging Face 上的黑客马拉松项目。贡献方式:

  1. Feedback

    • Report bugs or issues
    • Suggest new features
    • Share use cases
  2. Code Contributions

    • Fork the space
    • Implement improvements
    • Submit pull requests
  3. Documentation

    • Improve usage examples
    • Add tutorials
    • Translate to other languages
  4. Community

    • Share your workflows
    • Help other users
    • Write blog posts about use cases

Roadmap & Future Enhancements | 路线图与未来增强

Potential improvements for future versions:

未来版本的潜在改进:

  • JavaScript Support: Add headless browser for JS-rendered content
  • Authentication: Support for basic auth and session management
  • Advanced Filtering: Custom rules for content extraction
  • Export Formats: Additional output formats (PDF, DOCX, HTML)
  • Scheduling: Built-in scheduled scraping for monitoring
  • Diff Detection: Automatic change detection between scrapes
  • API Extensions: More granular control over scraping behavior
  • Performance: Improved caching and parallel processing

Resources & References | 资源与参考

Learning Resources | 学习资源

Web Scraping Best Practices:

  • Respect robots.txt and website terms
  • Implement rate limiting
  • Handle errors gracefully
  • Use appropriate user agents

MCP Development:

  • MCP Specification
  • Gradio MCP integration guide
  • Building custom MCP servers

Markdown Format:

  • CommonMark specification
  • Markdown guide
  • Converting HTML to Markdown

Conclusion | 结论

Web Scraper & Sitemap Generator is a versatile tool that bridges the gap between web content and structured, analyzable data. Its three-in-one approach combining content extraction, sitemap generation, and link analysis provides comprehensive website insights, while the dual-mode architecture makes it accessible for both manual and automated workflows.

Whether you’re migrating documentation, conducting SEO audits, preparing AI training data, or building knowledge bases, this tool provides a solid foundation for web content analysis. The MCP integration enables AI assistants like Claude to autonomously scrape and analyze websites, opening up new possibilities for intelligent content processing.

As a hackathon project with 49 likes on Hugging Face, it demonstrates the power of combining modern technologies (Gradio, MCP) to create practical, user-friendly tools. While it has limitations (no JavaScript rendering, public content only), it excels at its core mission: transforming web content into clean, structured Markdown with comprehensive site analysis.

Web Scraper & Sitemap Generator 是一个多功能工具,在网页内容和结构化、可分析的数据之间架起桥梁。它将内容提取、站点地图生成和链接分析结合在一起的三合一方法提供全面的网站洞察,而双模式架构使其适用于手动和自动化工作流。

无论您是在迁移文档、进行 SEO 审计、准备 AI 训练数据还是构建知识库,此工具都为网页内容分析提供了坚实的基础。MCP 集成使 Claude 等 AI 助手能够自主抓取和分析网站,为智能内容处理开辟了新的可能性。

作为在 Hugging Face 上获得 49 个赞的黑客马拉松项目,它展示了结合现代技术(Gradio、MCP)创建实用、用户友好工具的力量。虽然它有局限性(无 JavaScript 渲染、仅公共内容),但它在其核心使命上表现出色:将网页内容转换为干净、结构化的 Markdown,并提供全面的站点分析。


Last Updated: 2025-06-04
Project Status: Active (Hackathon Project)
Platform: Hugging Face Space
License: Check Hugging Face Space for license details
