# Design a Web Crawler System

## Question

How would you design an efficient web crawler system in Python? What is Scrapy's core architecture?

## Answer
### Scrapy Architecture

Scrapy is driven by a central engine that moves requests and responses between a handful of cooperating components:

- **Engine**: coordinates the data flow between all other components.
- **Scheduler**: queues and deduplicates the requests it receives from the engine.
- **Downloader**: fetches pages and returns responses to the engine.
- **Spiders**: parse responses, yield scraped items, and yield follow-up requests.
- **Item Pipeline**: cleans, validates, and persists the yielded items.
- **Downloader / Spider middlewares**: hooks around the downloader and spiders for rewriting requests and responses (User-Agent rotation, proxies, retries, etc.).
### Scrapy Spider Example

`spiders/product_spider.py`:
```python
import scrapy

from items import ProductItem


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products?page=1"]

    # Spider-specific overrides of the project-wide settings
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,
        "DOWNLOAD_DELAY": 0.5,
        "RETRY_TIMES": 3,
    }

    def parse(self, response):
        # Parse the product listing page
        for card in response.css("div.product-card"):
            item = ProductItem()
            item["name"] = card.css("h3::text").get()
            item["price"] = card.css("span.price::text").get()
            detail_url = card.css("a::attr(href)").get()
            # Follow the link to the product detail page
            yield response.follow(detail_url, self.parse_detail, meta={"item": item})

        # Pagination
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_detail(self, response):
        item = response.meta["item"]
        item["description"] = response.css("div.detail::text").get()
        yield item
```
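
A minimal `items.py` to go with the spider above (a sketch; the field set is inferred from the fields the spider populates):

```python
# items.py -- field container assumed by the spider above
import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
```

The spider can then be run with `scrapy crawl products -o products.json`.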
### Asynchronous Crawler (aiohttp)

`async_crawler.py`:
```python
import asyncio
from urllib.parse import urljoin

import aiohttp
from bs4 import BeautifulSoup


class AsyncCrawler:
    def __init__(self, max_concurrent: int = 10):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.visited: set[str] = set()
        self.results: list[dict] = []

    async def fetch(self, session: aiohttp.ClientSession, url: str) -> str | None:
        # The semaphore caps how many requests are in flight at once
        async with self.semaphore:
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                    if resp.status == 200:
                        return await resp.text()
                    return None
            except (aiohttp.ClientError, asyncio.TimeoutError):
                return None

    async def crawl(self, start_url: str, max_pages: int = 100):
        async with aiohttp.ClientSession() as session:
            queue: asyncio.Queue[str] = asyncio.Queue()
            await queue.put(start_url)
            while not queue.empty() and len(self.visited) < max_pages:
                # Drain the current frontier, skipping URLs we have already seen
                batch: list[str] = []
                while not queue.empty() and len(self.visited) + len(batch) < max_pages:
                    url = await queue.get()
                    if url not in self.visited and url not in batch:
                        batch.append(url)
                self.visited.update(batch)
                # Fetch the whole batch concurrently, bounded by the semaphore
                pages = await asyncio.gather(*(self.fetch(session, u) for u in batch))
                for url, html in zip(batch, pages):
                    if html:
                        self.parse(html, url, queue)

    def parse(self, html: str, base_url: str, queue: asyncio.Queue):
        # Extract links with BeautifulSoup/lxml and enqueue unvisited ones
        soup = BeautifulSoup(html, "lxml")
        for link in soup.find_all("a", href=True):
            abs_url = urljoin(base_url, link["href"])
            if abs_url not in self.visited:
                queue.put_nowait(abs_url)
```
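
A usage sketch (the start URL is a placeholder):

```python
if __name__ == "__main__":
    crawler = AsyncCrawler(max_concurrent=20)
    asyncio.run(crawler.crawl("https://example.com", max_pages=50))
    print(f"Visited {len(crawler.visited)} pages")
```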
### Dealing with Anti-Scraping Measures

`middlewares.py`:
```python
import random


class RandomUserAgentMiddleware:
    """Rotate the User-Agent header on every outgoing request."""

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)


class ProxyMiddleware:
    """Route each request through a proxy taken from a proxy pool."""

    def process_request(self, request, spider):
        proxy = get_proxy_from_pool()  # fetch from the proxy pool; placeholder for your pool client
        request.meta["proxy"] = f"http://{proxy}"
```
### Common Interview Questions

**Q1: How do you handle pages rendered by JavaScript?**

Answer:

- Splash: a lightweight browser-rendering service, integrated via scrapy-splash
- Playwright/Selenium: drive a headless browser that executes the JS (see the sketch below)
- Reverse-engineer the API: inspect the page's XHR/Fetch requests and call the backend API directly (preferred)
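
A minimal sketch of the headless-browser option using Playwright's async API (assumes the `playwright` package and a Chromium build are installed; the `networkidle` wait condition is one reasonable choice):

```python
import asyncio

from playwright.async_api import async_playwright


async def render(url: str) -> str:
    """Load a page in headless Chromium and return the fully rendered HTML."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")  # wait for XHR/Fetch traffic to settle
        html = await page.content()
        await browser.close()
        return html


# Example: html = asyncio.run(render("https://example.com/products"))
```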
**Q2: What deduplication strategies can a crawler use?**

Answer:

- URL deduplication: a Bloom filter (small memory footprint, tolerates a small false-positive rate; see the sketch below)
- Content deduplication: SimHash / MinHash to measure page similarity
- Scrapy built-in: `RFPDupeFilter`, which deduplicates on request (URL) fingerprints
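
A toy in-memory Bloom filter to illustrate the URL-deduplication idea (the bit-array size and hash count are illustrative; a production crawler would usually share a Redis-backed filter across workers):

```python
import hashlib


class BloomFilter:
    """Approximate membership test: no false negatives, a small rate of false positives."""

    def __init__(self, size_bits: int = 1 << 24, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-1 digests of the item
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# Usage: seen = BloomFilter(); if url not in seen: crawl it, then seen.add(url)
```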
**Q3: How do you implement a distributed crawler?**

Answer:

- Scrapy-Redis: use Redis as a shared scheduling queue consumed by multiple workers (see the sketch below)
- URL sharding: hash URLs by domain and assign each shard to a different worker
- Shared deduplication: a Redis-backed Bloom filter
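
A sketch of the Scrapy-Redis wiring, assuming the `scrapy-redis` package and a reachable Redis instance (the Redis URL and key name are placeholders):

```python
# settings.py -- move scheduling and dedup into shared Redis structures
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                 # keep the queue and fingerprints across restarts
REDIS_URL = "redis://localhost:6379/0"   # placeholder address


# spiders/distributed_spider.py -- workers block on a Redis list instead of start_urls
from scrapy_redis.spiders import RedisSpider


class DistributedProductSpider(RedisSpider):
    name = "products_distributed"
    redis_key = "products:start_urls"    # seed with: LPUSH products:start_urls <url>

    def parse(self, response):
        ...
```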
**Q4: How do you avoid getting your IP banned?**

Answer:

- Lower the request rate (`DOWNLOAD_DELAY`); see the settings sketch below
- Randomize the User-Agent
- Rotate through an IP proxy pool
- Maintain a cookie pool
- Respect `robots.txt`
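
A Scrapy settings sketch that combines these knobs (the specific values are illustrative, not recommendations):

```python
# settings.py -- politeness / anti-ban related settings (values are illustrative)
DOWNLOAD_DELAY = 1.0                    # base delay between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True         # jitter the delay between 0.5x and 1.5x
AUTOTHROTTLE_ENABLED = True             # adapt the delay to observed server latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
CONCURRENT_REQUESTS_PER_DOMAIN = 4
ROBOTSTXT_OBEY = True                   # honour robots.txt
COOKIES_ENABLED = True                  # reuse session cookies (pair with a cookie pool if needed)
```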