Python Example: Scraping a WordPress Resource Site | 飞飞媒体

Author: 飞飞媒体 (ffmt.com) | Published: March 27, 2024, 15:34

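The demo below collects the posts published yesterday from the first three listing pages. Feel free to debug and adapt it yourself.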

demo.py
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
headers={
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
#'Referer':'https://bba.xyxmh.cn/',
# Use your own cookie here; how to obtain it is covered in the previous post
'Cookie':'_zb_site_notify_auto=1; wordpress_test_cookie=WP%20Cookie%20check; wordpress_logged_in_e3dea29603180b10e64085872ec43081=fei123%7C1712652580%7Cc20cA1U4gDMQEOgPpERprJZ3ILZ0btldE9wq863RXVz%7C580643d7d80196'
}

pages = ["1", "2", "3"]
end_list = []

# Set of hrefs already collected, used to de-duplicate results across all pages
output_hrefs = set()

# Yesterday's date, formatted as YYYY-MM-DD
yesterday = datetime.now() - timedelta(days=1)
yesterday_str = yesterday.strftime("%Y-%m-%d")

for page in pages:
    # A timeout keeps the script from hanging if the site does not respond
    res = requests.get("https://bba.xyxmh.cn/page/" + page, headers=headers, timeout=10)
    soup = BeautifulSoup(res.text, 'html.parser')

    # Iterate over every "col" element
    for col_div in soup.find_all('div', class_='col'):
        end_dict = {}
        # Find the <a> tag with class "media-img" and read its href and title
        # (.get returns None instead of raising KeyError when an attribute is missing)
        a_tag = col_div.find('a', class_='media-img')
        href_attribute = a_tag.get('href') if a_tag else None
        title_attribute = a_tag.get('title') if a_tag else None

        # Skip this entry if there is no href
        if href_attribute is None:
            continue
    
        # Find the <time> tag with class "pub-date"
        time_tag = col_div.find('time', class_='pub-date')

        # Read its datetime attribute
        datetime_attribute = time_tag.get('datetime') if time_tag else None

        # Skip this entry if there is no datetime attribute
        if datetime_attribute is None:
            continue
    
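        # Parse datetime_attribute into a datetime object
        try:
            datetime_obj = datetime.strptime(datetime_attribute, "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            # The string carries a timezone suffix (e.g. "+08:00", "-05:00" or "Z");
            # keep only the first 19 characters, i.e. "YYYY-MM-DDTHH:MM:SS"
            datetime_obj = datetime.strptime(datetime_attribute[:19], "%Y-%m-%dT%H:%M:%S")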
    
        # Skip entries that were not published yesterday
        if datetime_obj.strftime("%Y-%m-%d") != yesterday_str:
            continue

        # Skip hrefs that have already been collected (duplicates)
        if href_attribute in output_hrefs:
            continue

        # Remember this href
        output_hrefs.add(href_attribute)
    
        # Optional debug output; uncomment the prints you need
        #print("title_attribute:", title_attribute)
        #print("href_attribute:", href_attribute)
        #print("datetime_attribute:", datetime_attribute)
        end_dict['title_attribute'] = title_attribute
        end_dict['href_attribute'] = href_attribute
        end_dict['datetime_attribute'] = datetime_attribute
        end_list.append(end_dict)

print(end_list)
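
The demo decides whether a post is from yesterday by parsing the `datetime` attribute of the `<time>` tag. As a minimal sketch, that check could be pulled into a small helper built on `datetime.fromisoformat`, which accepts offsets such as `+08:00` directly. The helper names `parse_pub_date` and `is_from_yesterday` and the optional `now` parameter are assumptions made for illustration; they are not part of the original script.

```python
from datetime import datetime, timedelta


def parse_pub_date(datetime_attribute):
    """Return the calendar date written in an ISO 8601 string such as
    "2024-03-26T10:00:00+08:00" (hypothetical helper, not in the demo)."""
    text = datetime_attribute.strip()
    # Older Python versions do not accept a trailing "Z", so normalise it
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    # .date() keeps the date as written; no timezone conversion is applied
    return datetime.fromisoformat(text).date()


def is_from_yesterday(datetime_attribute, now=None):
    """True if the post date equals the day before `now` (defaults to the current local time)."""
    now = now or datetime.now()
    return parse_pub_date(datetime_attribute) == (now - timedelta(days=1)).date()
```

Passing `now` explicitly makes the check easy to try out against fixed dates. Like the demo, this compares the date as it is written in the attribute and ignores the offset itself.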

The next step is to fetch each article's detail page and extract its resource links.
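
The post does not show that next step, so the following is only a rough sketch of one way to approach it: given the HTML of a detail page, collect every link it contains as an absolute URL. To stay dependency-free it uses the standard library's `html.parser` rather than BeautifulSoup; the names `LinkCollector` and `extract_links` are hypothetical, and which links actually point to resources depends on the site's markup, which is not described here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against base_url (illustrative helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Keep first-seen order and drop duplicates
        if absolute not in self.links:
            self.links.append(absolute)


def extract_links(html, base_url):
    """Return the de-duplicated absolute links found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

In the demo's terms, the HTML for each entry could come from `requests.get(item['href_attribute'], headers=headers, timeout=10).text`, with `item['href_attribute']` also serving as `base_url`. Filtering the result down to actual download links would still need a rule specific to the site.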

This article was collected from the web and does not represent the views of 飞飞媒体. When reposting, please credit the source: https://www.ffmt.com/boke/qboke/2740.html
