Python Web Scraping in Practice - Collecting Data in Bulk

Published 2026-02-28

An introductory Python web scraping tutorial, with a hands-on example and ways to handle anti-scraping measures.

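Web scraping is a core skill for data analysts: it lets you collect data that is not available as a ready-made download.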

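Contents

How a scraper works
Environment setup
A minimal scraper
Hands-on example: scraping news
Handling anti-scraping measures
Ethics and law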

1. How a scraper works

Client → request → server → response → parse → store
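
The flow above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `fetch`, `TitleParser`, `parse_titles`, and `store` are hypothetical names, the sample HTML is made up, and the built-in `html.parser` stands in for the BeautifulSoup parser used later in this article.

```python
import json
import urllib.request
from html.parser import HTMLParser


def fetch(url: str, timeout: float = 10.0) -> str:
    """Request -> response: download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class TitleParser(HTMLParser):
    """Parse: collect the text inside every <h2> element."""

    def __init__(self) -> None:
        super().__init__()
        self.titles = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.titles[-1] += data


def parse_titles(html: str) -> list:
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [t.strip() for t in parser.titles]


def store(titles: list) -> str:
    """Store: serialize the extracted data (here, to a JSON string)."""
    return json.dumps(titles, ensure_ascii=False)


# Made-up response body, so the parse and store stages run without a network call.
sample_html = "<html><body><h2>First</h2><p>text</p><h2>Second</h2></body></html>"
titles = parse_titles(sample_html)
stored = store(titles)
print(stored)
```

In a real run, `parse_titles(fetch(url))` would replace the hard-coded sample, and `store` would write to a file or database instead of returning a string.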

2. Environment setup

pip install requests beautifulsoup4 lxml

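Common libraries

Library	Purpose
requests	Sends HTTP requests
BeautifulSoup (package: beautifulsoup4)	Parses HTML
lxml	Fast XML/HTML parser, usable as a BeautifulSoup backend
Scrapy	Full-featured crawling framework (not installed by the command above)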
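
Since lxml is an optional backend for BeautifulSoup, a script may want to fall back to Python's built-in parser when lxml is missing. A minimal sketch, assuming a hypothetical helper named `choose_parser`:

```python
from importlib.util import find_spec


def choose_parser() -> str:
    """Return 'lxml' if the lxml package is importable, else the built-in 'html.parser'."""
    return "lxml" if find_spec("lxml") is not None else "html.parser"


parser_name = choose_parser()
print(parser_name)
```

The result can be passed as the second argument to `BeautifulSoup(html, parser_name)`; both names are parser identifiers that BeautifulSoup accepts.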

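3. A minimal scraper

import requests
from bs4 import BeautifulSoup

# 1. Send the request (the timeout keeps a stalled server from hanging the script)
url = "https://example.com"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses

# 2. Parse the content
soup = BeautifulSoup(response.text, 'lxml')

# 3. Extract the data
titles = soup.find_all('h2')
for title in titles:
    print(title.get_text(strip=True))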

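4. Hands-on example: scraping news

import requests
from bs4 import BeautifulSoup
import time

def get_news(page=1):
    """Fetch one listing page and return a list of {'title', 'link'} dicts."""
    url = f"https://news.example.com?page={page}"

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    # Fail fast on timeouts and HTTP errors instead of parsing an error page
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')

    news_list = []
    articles = soup.find_all('article')

    for article in articles:
        heading = article.find('h2')
        anchor = article.find('a', href=True)
        # Skip articles that lack a heading or a link instead of raising an error
        if heading is None or anchor is None:
            continue
        news_list.append({
            'title': heading.get_text(strip=True),
            'link': anchor['href']
        })

    return news_list

# Fetch the first 5 pages
all_news = []
for i in range(1, 6):
    all_news.extend(get_news(i))
    time.sleep(1)  # be polite: pause between requests

print(f"Fetched {len(all_news)} news items")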
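
The example stops at printing a count, so here is one possible way to finish the "store" step. It is a sketch under assumptions: `clean_news` and `write_csv` are hypothetical helpers, the sample records are made up to match the shape that `get_news()` returns, and the base URL is the same placeholder host as above.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = "https://news.example.com"  # placeholder host from the example above


def clean_news(items, base_url=BASE_URL):
    """Resolve relative links against base_url and drop duplicate links."""
    seen = set()
    cleaned = []
    for item in items:
        link = urljoin(base_url, item["link"])
        if link in seen:
            continue
        seen.add(link)
        cleaned.append({"title": item["title"].strip(), "link": link})
    return cleaned


def write_csv(items, fileobj):
    """Write the items as CSV with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(items)


# Made-up records shaped like the output of get_news().
sample = [
    {"title": " Story A ", "link": "/a"},
    {"title": "Story B", "link": "https://news.example.com/b"},
    {"title": "Story A again", "link": "/a"},
]
cleaned = clean_news(sample)

# An in-memory buffer keeps the sketch side-effect free; for a real file use
# open("news.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_csv(cleaned, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Deduplicating on the resolved link matters when paging through listings, because the same story often appears on more than one page.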

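5. Handling anti-scraping measures

Anti-scraping measure	Countermeasure
User-Agent detection	Rotate the User-Agent header
IP rate limiting	Add delays between requests; use proxies
Login requirement	Simulate login; reuse session cookies
CAPTCHA	CAPTCHA-solving services; machine-learning recognition
AJAX-loaded content	Selenium / Playwright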
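
The first two rows of the table can be sketched as follows. All names here (`USER_AGENTS`, `build_headers`, `backoff_delay`, `fetch_with_retry`) are hypothetical, the User-Agent strings are shortened samples, and `flaky_fetch` is a stand-in that simulates failures so the sketch runs without a network.

```python
import random
import time

# Hypothetical pool; real values should be current, complete browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]


def build_headers(rng=random):
    """Pick a User-Agent at random for each request."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Exponential backoff with jitter: between 0 and min(cap, base * 2**attempt) seconds."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


def fetch_with_retry(fetch, url, retries=3, sleep=time.sleep):
    """Call fetch(url, headers) until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url, build_headers())
        except OSError as exc:  # requests.RequestException is a subclass of OSError
            last_error = exc
            sleep(backoff_delay(attempt))
    raise last_error


calls = []


def flaky_fetch(url, headers):
    """Stand-in for a real HTTP call: fails twice, then succeeds."""
    calls.append(headers["User-Agent"])
    if len(calls) < 3:
        raise ConnectionError("simulated rate limit")
    return "ok"


# sleep is replaced with a no-op so the demonstration finishes instantly.
result = fetch_with_retry(flaky_fetch, "https://example.com", sleep=lambda seconds: None)
print(result, len(calls))
```

With requests, the `fetch` argument could be `lambda url, headers: requests.get(url, headers=headers, timeout=10)`. Injecting the fetch and sleep functions keeps the retry logic easy to test.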

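6. Ethics and law

✅ Generally acceptable to scrape

Publicly available data
Pages accessible without logging in
Cases where your requests do not disrupt the site's normal operation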

❌ Do not scrape

Private or personal data
Trade secrets
Copyright-protected content
Sites that explicitly prohibit scraping

Recommendations

Follow robots.txt
Throttle your request rate
Credit the data source
Use the data for learning and research only
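
The first two recommendations can be checked in code with the standard library's `urllib.robotparser`. This is a sketch: the robots.txt content and the crawler name `my-learning-bot` are made up, and the rules are parsed from a string so that it runs offline.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would call
# robots.set_url("https://example.com/robots.txt") followed by robots.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-learning-bot"  # hypothetical crawler name


def allowed(url):
    """Return True if the parsed rules let USER_AGENT fetch this URL."""
    return robots.can_fetch(USER_AGENT, url)


public_ok = allowed("https://example.com/news/1")
private_ok = allowed("https://example.com/private/data")
delay = robots.crawl_delay(USER_AGENT)  # None when no Crawl-delay is declared
print(public_ok, private_ok, delay)
```

A scraper can call `allowed(url)` before each request and pass `delay` (when it is not None) to `time.sleep` between requests.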

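Summary

Scraping is an important way to obtain data, but keep these points in mind:

1. Comply with laws and regulations
2. Respect each site's rules
3. Limit how often you send requests
4. Protect private data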

Tags: #Python #WebScraping #DataCollection

Published by suisui