Price Monitoring with a Web Crawler

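Project Overview

This project shows how to use Python and the Selenium automation tool to monitor price data on one website and automatically enter that data into another website. The technique is useful for hotel rate synchronization, e-commerce price monitoring, data aggregation, and similar scenarios.

Hotel data used in this project:

Source page to scrape: https://www.byhy.net/cdn2/pages/crawler/s001/price_ori.html

My hotel agency page: https://www.byhy.net/cdn2/pages/crawler/s001/price_my.html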

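Development Prerequisites

This program uses Python and the Selenium library to implement the automation.

Python is the development language.

Selenium is the automation testing library. It can simulate a user's actions in a browser. In this project we use Selenium to:

Open and navigate web pages
Extract data from pages
Interact with page elements (click buttons, fill in forms, and so on)

You therefore need some background in both areas.

Both topics can be self-studied on this website.

Code and Explanation

Source Code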

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

def get_source_prices(driver):
    """Extract all prices from the source page (price_ori.html)."""
    driver.switch_to.window(driver.window_handles[0])
    print("Scraping prices from price_ori.html...")
    price_elements = driver.find_elements(By.CSS_SELECTOR, 'tbody .price')

    prices = [p.text for p in price_elements]
    print(f"Scraped prices: {prices}")
    return prices

def update_target_prices(driver, prices):
    """Update the prices on the target page (price_my.html)."""
    driver.switch_to.window(driver.window_handles[1])
    print("Updating prices on price_my.html...")

    # Enter edit mode
    edit_button = driver.find_element(By.ID, 'editBtn')

    if edit_button.is_displayed():
        edit_button.click()
        print("Entered edit mode.")

    # Give the input fields time to become ready
    time.sleep(1)
    price_inputs = driver.find_elements(By.CSS_SELECTOR, '.price-input')

    if len(price_inputs) != len(prices):
        print(f"Error: price count mismatch! Source: {len(prices)}, target: {len(price_inputs)}")
        return

    # Fill in a price only when it has changed
    updated_count = 0
    for i, new_price in enumerate(prices):
        current_input_price = price_inputs[i].get_attribute('value')
        if current_input_price != new_price:
            price_inputs[i].clear()
            price_inputs[i].send_keys(new_price)
            print(f"  - Updated price: {current_input_price} -> {new_price}")
            updated_count += 1

    if updated_count > 0:
        print(f"{updated_count} price(s) updated in total.")
    else:
        print("No price changes; nothing to update.")

    # Save the changes
    save_button = driver.find_element(By.ID, 'saveBtn')
    save_button.click()
    print("Prices saved.")

# Main script
source_url = 'https://www.byhy.net/cdn2/pages/crawler/s001/price_ori.html'
target_url = 'https://www.byhy.net/cdn2/pages/crawler/s001/price_my.html'

driver = webdriver.Edge()
driver.implicitly_wait(5)  # Implicit wait when locating elements

try:
    # 1. Open the source page
    print(f"Opening source page: {source_url}")
    driver.get(source_url)

    # 2. Open the target page in a new tab
    driver.execute_script(f"window.open('{target_url}', '_blank');")
    print(f"Opening target page: {target_url}")
    time.sleep(1)  # Give the new tab time to open

    # 3. Initial price synchronization
    print("--- Starting initial price sync ---")
    current_prices = get_source_prices(driver)
    update_target_prices(driver, current_prices)
    print("--- Initial sync complete ---")

    # 4. Start monitoring for changes
    print("\n--- Monitoring price changes (checking every 8 seconds) ---")
    while True:
        time.sleep(8)  # Poll every 8 seconds
        new_prices = get_source_prices(driver)

        if new_prices != current_prices:
            print("\n! Price change detected!")
            update_target_prices(driver, new_prices)
            current_prices = new_prices
            print("--- Monitoring continues ---")
        else:
            # End with a carriage return so the status line is overwritten
            # instead of flooding the console
            print("No price changes, still monitoring...", end='\r')

except KeyboardInterrupt:
    print("\nScript interrupted by user.")
except Exception as e:
    print(f"\nAn error occurred: {e}")
finally:
    print("Closing browser.")
    driver.quit()

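The code proceeds in the following steps:

Initialize the browser: use webdriver to start a browser instance
Open the source and target pages: open the page to monitor and the page to fill in, each in its own tab
Initial data sync: read the data from the source page and enter it into the target page
Continuous monitoring: check the source page at regular intervals and update the target page whenever the data changes

Main Functional Modules

1. Getting Price Data from the Source Page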

def get_source_prices(driver):
    """Extract all prices from the source page (price_ori.html)."""
    driver.switch_to.window(driver.window_handles[0])
    print("Scraping prices from price_ori.html...")
    price_elements = driver.find_elements(By.CSS_SELECTOR, 'tbody .price')

    prices = [p.text for p in price_elements]
    print(f"Scraped prices: {prices}")
    return prices

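This function extracts the price data from the source page:

Switches to the source page tab with switch_to.window()
Locates all price elements with the CSS selector tbody .price
Relies on the implicit wait configured on the driver (driver.implicitly_wait(5)) rather than an explicit wait: find_elements returns once at least one matching element is present, or an empty list after the timeout
Extracts the price text and returns it

2. Updating Price Data on the Target Page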

def update_target_prices(driver, prices):
    """Update the prices on the target page (price_my.html)."""
    driver.switch_to.window(driver.window_handles[1])
    print("Updating prices on price_my.html...")

    # Enter edit mode
    edit_button = driver.find_element(By.ID, 'editBtn')

    if edit_button.is_displayed():
        edit_button.click()
        print("Entered edit mode.")

    # Give the input fields time to become ready
    time.sleep(1)
    price_inputs = driver.find_elements(By.CSS_SELECTOR, '.price-input')

    if len(price_inputs) != len(prices):
        print(f"Error: price count mismatch! Source: {len(prices)}, target: {len(price_inputs)}")
        return

    # Fill in a price only when it has changed
    updated_count = 0
    for i, new_price in enumerate(prices):
        current_input_price = price_inputs[i].get_attribute('value')
        if current_input_price != new_price:
            price_inputs[i].clear()
            price_inputs[i].send_keys(new_price)
            print(f"  - Updated price: {current_input_price} -> {new_price}")
            updated_count += 1

    if updated_count > 0:
        print(f"{updated_count} price(s) updated in total.")
    else:
        print("No price changes; nothing to update.")

    # Save the changes
    save_button = driver.find_element(By.ID, 'saveBtn')
    save_button.click()
    print("Prices saved.")

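This function enters the price data into the target page:

Switches to the target page tab
Clicks the Edit button to enter edit mode, if the button is displayed
Pauses for one second (time.sleep(1)) so the price input fields can become ready, and stops without saving if the number of fields does not match the number of prices
Compares each existing value with the new price and rewrites only the fields that differ
Clicks the Save button to save the changes (this happens even when no field was rewritten)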
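
The comparison step can be separated from the browser calls so that it is easy to check on its own. The following is a minimal sketch, not part of the original script: plan_updates is a hypothetical helper name, and raising ValueError on a count mismatch is an assumption (the original prints an error and returns).

```python
def plan_updates(current_values, new_prices):
    """Return (index, old, new) for every field whose value differs.

    Hypothetical helper: it only computes what should change and never
    touches the browser.
    """
    if len(current_values) != len(new_prices):
        raise ValueError(
            f"price count mismatch: source {len(new_prices)}, "
            f"target {len(current_values)}"
        )
    return [
        (i, old, new)
        for i, (old, new) in enumerate(zip(current_values, new_prices))
        if old != new
    ]


# Sample values invented for illustration
changes = plan_updates(["100", "200", "300"], ["100", "250", "300"])
print(changes)
```

With a plan like this, the Selenium part would reduce to looping over the returned tuples and calling clear() and send_keys() on the matching input.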

3. Main Monitoring Loop

source_url = 'https://www.byhy.net/cdn2/pages/crawler/s001/price_ori.html'
target_url = 'https://www.byhy.net/cdn2/pages/crawler/s001/price_my.html'

driver = webdriver.Edge()
driver.implicitly_wait(5)  # Implicit wait when locating elements

try:
    # 1. Open the source page
    print(f"Opening source page: {source_url}")
    driver.get(source_url)

    # 2. Open the target page in a new tab
    driver.execute_script(f"window.open('{target_url}', '_blank');")
    print(f"Opening target page: {target_url}")
    time.sleep(1)  # Give the new tab time to open

    # 3. Initial price synchronization
    print("--- Starting initial price sync ---")
    current_prices = get_source_prices(driver)
    update_target_prices(driver, current_prices)
    print("--- Initial sync complete ---")

    # 4. Start monitoring for changes
    print("\n--- Monitoring price changes (checking every 8 seconds) ---")
    while True:
        time.sleep(8)  # Poll every 8 seconds
        new_prices = get_source_prices(driver)

        if new_prices != current_prices:
            print("\n! Price change detected!")
            update_target_prices(driver, new_prices)
            current_prices = new_prices
            print("--- Monitoring continues ---")
        else:
            # End with a carriage return so the status line is overwritten
            # instead of flooding the console
            print("No price changes, still monitoring...", end='\r')

except KeyboardInterrupt:
    print("\nScript interrupted by user.")
except Exception as e:
    print(f"\nAn error occurred: {e}")
finally:
    print("Closing browser.")
    driver.quit()

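The main script (top-level code, not wrapped in a function) implements the complete monitoring flow:

Initializes the WebDriver (the Edge browser in this example)
Opens the source page and the target page
Performs the initial data sync
Enters an infinite loop that checks for changes every 8 seconds and updates the target page when the prices differ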
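
The loop structure above can be sketched independently of Selenium. In this hedged example, monitor, fetch, on_change, max_polls and sleep are all hypothetical names introduced for illustration; in the real script fetch would wrap get_source_prices(driver) and on_change would wrap update_target_prices(driver, ...).

```python
import time


def monitor(fetch, on_change, interval=8.0, max_polls=None, sleep=time.sleep):
    """Poll fetch() and call on_change(new) whenever the result differs.

    max_polls=None loops forever, like the original script; passing a
    number bounds the loop so it can be exercised in a test.
    Returns the number of changes detected after the initial sync.
    """
    current = fetch()
    on_change(current)  # initial synchronization
    changes = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        sleep(interval)
        polls += 1
        new = fetch()
        if new != current:
            on_change(new)
            current = new
            changes += 1
    return changes
```

Passing the sleep function as a parameter is a design choice that keeps the timing logic testable without waiting eight real seconds per poll.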

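Key Points for Crawler Development

1. Waiting

Waiting is a key concept in web automation:

Implicit wait: driver.implicitly_wait(5) sets a global timeout that applies whenever an element is looked up
Explicit wait: use WebDriverWait together with expected_conditions to wait until a specific condition is met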
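
The idea behind an explicit wait is to poll a condition until it holds or a deadline passes. The sketch below illustrates that idea in plain Python; wait_until is a hypothetical helper written for this explanation and is not Selenium's implementation. In Selenium itself, WebDriverWait performs comparable polling against the browser.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

Compared with a fixed time.sleep(1), this approach returns as soon as the condition is satisfied and fails loudly when it never is.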

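2. Exception Handling

Good exception handling is key to keeping a crawler running reliably:

Catch exceptions with try/except
Handle user interruption (Ctrl+C)
Release resources (close the browser) in the finally block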
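
The three points above can be demonstrated without a real browser. This is a small sketch under stated assumptions: FakeBrowser is a hypothetical stand-in for a WebDriver that only records whether quit() was called, and run_session is an illustrative name.

```python
class FakeBrowser:
    """Hypothetical stand-in for a WebDriver; tracks whether quit() ran."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


def run_session(browser, task):
    """Run task(browser) and always quit the browser afterwards.

    Returns a short status string describing how the session ended.
    """
    try:
        task(browser)
        return "finished"
    except KeyboardInterrupt:
        # Raised when the user presses Ctrl+C
        return "interrupted"
    except Exception as exc:
        return f"error: {exc}"
    finally:
        # Runs on every path, so the browser is never left open
        browser.quit()
```

KeyboardInterrupt needs its own except clause because it does not inherit from Exception.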

3. Comparing Data

Update only when the data has actually changed, to avoid unnecessary operations:

current_input_price = price_inputs[i].get_attribute('value')
if current_input_price != new_price:
    # Update only when the price has changed
    price_inputs[i].clear()
    price_inputs[i].send_keys(new_price)
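
The comparison above works on raw strings, so values such as "100" and "100.00" would count as different. If that matters, the strings could be normalized first. The sketch below is an assumption-laden illustration: the price format of the demo pages is not stated in this tutorial, parse_price and same_price are hypothetical names, and commas are assumed to be thousands separators.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract the numeric part of a price string as a Decimal.

    Assumes commas are thousands separators. Raises ValueError when the
    text contains no number.
    """
    cleaned = text.replace(",", "")
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    if match is None:
        raise ValueError(f"no numeric price in {text!r}")
    return Decimal(match.group())


def same_price(a, b):
    """Compare two price strings by numeric value instead of by text."""
    return parse_price(a) == parse_price(b)
```

Decimal is used rather than float so that values like 0.10 and 0.1 compare exactly.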

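Practical Applications

This monitoring technique can be applied in scenarios such as:

Hotel price synchronization: monitor competitors' prices and adjust your own automatically
E-commerce price monitoring: track product price changes and adjust selling prices automatically
Data aggregation: collect data from multiple websites and consolidate it in your own system
Inventory synchronization: monitor supplier stock levels and update your own inventory records

Summary

This project demonstrates the basic approach to monitoring website data and entering it automatically with Python and Selenium. By simulating user actions, data can be synchronized across websites, which is useful in many business scenarios.

In real-world use, the following also need to be considered:

Anti-crawling measures on the websites
Retrying after errors
Logging and monitoring
More complex data-processing logic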
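
As one possible starting point for the retry and logging items, the sketch below retries a failing operation with an exponentially growing delay and records each failure through the standard logging module. The names retry, operation and base_delay, the logger name "price_monitor", and the choice of three attempts are assumptions made for illustration.

```python
import logging
import time

logger = logging.getLogger("price_monitor")


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() up to `attempts` times, doubling the delay each time.

    Returns the first successful result; re-raises the last exception
    when every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == attempts:
                logger.error("giving up after %d attempts: %s", attempts, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay
            )
            sleep(delay)
```

In the monitoring script, a call such as retry(lambda: get_source_prices(driver)) could replace the direct call inside the loop, so that one transient failure does not end the whole session.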
