Scraping a BiliBili Video's Historical Danmaku with BeautifulSoup

I came across an interesting little scraper online: just two functions that fetch a video's historical danmaku (bullet comments) for the dates you choose.

First, import the libraries:
import requests
import time
from bs4 import BeautifulSoup
import pandas as pd
Define the URL for fetching historical danmaku

The interesting part: BiliBili serves historical danmaku via an AJAX request of this form: https://api.bilibili.com/x/v2/dm/history?type=1&oid=162446150&date=2020-11-09
More on this point here.
The query carries two parameters of interest: oid, the video's ID, and date, the day being requested.
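The same URL can be assembled from a params dict instead of manual string formatting; a quick standalone sketch using requests' Request/prepare machinery (the oid and date are just the example values from above):

```python
import requests

# build the history-danmaku URL without hand-writing the query string
req = requests.Request(
    'GET',
    'https://api.bilibili.com/x/v2/dm/history',
    params={'type': 1, 'oid': 162446150, 'date': '2020-11-09'},
).prepare()
print(req.url)
```

This is only URL construction; actually fetching the data still needs the logged-in cookie shown later.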

First, build the list of URLs:

def get_url(oid, start, end):
    url_list = []
    date_list = [i for i in pd.date_range(start, end).strftime('%Y-%m-%d')]
    for date in date_list:
        url = f"https://api.bilibili.com/x/v2/dm/history?type=1&oid={oid}&date={date}"
        url_list.append(url)
    return url_list
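pd.date_range does the real work here: it yields every day in the inclusive range, and strftime turns each timestamp into the string the API expects. A standalone check, using the same dates as the run below:

```python
import pandas as pd

# pandas builds the inclusive daily range; strftime renders plain date strings
dates = pd.date_range('11/9/2020', '11/10/2020').strftime('%Y-%m-%d')
print(list(dates))  # ['2020-11-09', '2020-11-10']
```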
Fetch with requests, parse with BeautifulSoup

This part is entirely routine.

def get_data(url_list, name):

    headers = {"cookie": "<paste your own logged-in bilibili cookie here - the history API requires SESSDATA>",
               "origin": "https://www.bilibili.com",
               "referer": "https://www.bilibili.com/video/BV1gW411b735",
               "sec-fetch-dest": "empty",
               "sec-fetch-mode": "cors",
               "sec-fetch-site": "same-site",
               "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"}

    with open(f"{name}.txt", 'w', encoding='utf-8') as file:

        for url in url_list:
            res = requests.get(url, headers=headers)
            res.encoding = 'utf-8'
            # print(res.text)  # uncomment to inspect the raw XML response
            soup = BeautifulSoup(res.text, 'lxml')
            danmu = [d.text for d in soup.find_all("d")]
            for item in danmu:
                file.write(item + '\n')
            time.sleep(2)  # pause between requests to stay polite
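The API answers with XML, one <d> element per danmaku. The code above keeps only the text, but the p attribute also carries metadata (video-time offset, mode, font size, colour, send timestamp, ...). A standalone sketch on a hand-made sample — the attribute values and comment texts here are invented for illustration:

```python
from bs4 import BeautifulSoup

# a trimmed imitation of the history API's XML: one <d> tag per danmaku;
# the first field of the p attribute is the offset (seconds) into the video
sample = """<i>
  <d p="11.802,1,25,16777215,1604899200,0,abc123,42">晴天真好听</d>
  <d p="35.500,1,25,16777215,1604899321,0,def456,43">前奏一响就泪目</d>
</i>"""

soup = BeautifulSoup(sample, 'lxml')
for d in soup.find_all('d'):
    offset = float(d['p'].split(',')[0])  # seconds into the video
    print(f'{offset:7.2f}s  {d.text}')
```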

Run it

if __name__ == '__main__':
    start = '11/9/2020'
    end = '11/10/2020'
    name = '周杰伦 - 晴天MV'
    oid = '162446150'

    url_list = get_url(oid, start, end)
    get_data(url_list, name)

Here's a fun one: counting which historical danmaku appear most often in a video. Taking one video as an example, the "Gaia, destroyer of the Earth" meme is clearly spammed the most — Ultraman Tiga evidently still has plenty of fans.

import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['font.family'] = 'STSong'  # a font with CJK glyphs

data = pd.read_csv('超奥特八兄弟.txt', header=None, names=['danmu'])
a = data['danmu'].value_counts()[:20]  # the 20 most frequent danmaku
pd.DataFrame(a).plot(kind='barh', color=(0.8666666666666667, 0.5176470588235295, 0.3215686274509804))
plt.show()
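value_counts is doing the real work in that snippet: it tallies identical danmaku and sorts them by frequency. On a toy series (the danmaku strings here are made up), it looks like this:

```python
import pandas as pd

# toy data standing in for the scraped danmaku file
danmu = pd.Series(['名场面', '名场面', '前方高能', '名场面', '前方高能', '哈哈哈'])
top = danmu.value_counts()[:2]  # the two most frequent danmaku
print(top)
```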

Keeping up with current events 🐶🐶🐶🐶🐶🐶

A round of Teacher Ma. Updated November 25, 2020.

Video source: https://www.bilibili.com/video/BV1dK411G73T?from=search&seid=14377148875392576157

def plot(filename):
    import matplotlib.pyplot as plt
    import seaborn as sns
    plt.rcParams['font.family'] = ['Arial Unicode MS']  # a font with CJK glyphs
    data = pd.read_csv(filename + '.txt', header=None, names=['danmu'])
    a = data['danmu'].value_counts()[:20]  # the 20 most frequent danmaku
    data = data[data['danmu'].isin(a.index)]
    plt.figure(figsize=(8, 8))
    sns.countplot(data=data, y='danmu', order=a.index)
    plt.title(filename + '弹幕')
    plt.tight_layout()
    plt.savefig(filename + '.png')

name = 'TeacherMa'
plot(name)

Swap in another video source and go again (a hundred million more times): https://www.bilibili.com/video/BV1ky4y1B7DW?from=search&seid=14377148875392576157

In short: "Coming to trick me, to sneak-attack me, a sixty-nine-year-old old comrade. Is that okay? That's not okay. Young people these days have no martial ethics. I advise you to hao zi wei zhi!!"

Summary

The main takeaway was the mechanics of how the data gets transferred — the browser's request mechanism behind the page; the requests and BeautifulSoup parts themselves were nothing special.

Categories: Python

2 Comments

Anonymous · November 11, 2020 at 13:31

The meow-meow version of Awu was here, nyeh nyeh nyeh~

Teacher Ma, Part Two – What Did the Danmaku Actually Say – xinzipanghuang.home · December 1, 2020 at 22:09

[…] After People's Daily called him "sensationalist", bilibili took down all of Teacher Ma's videos. The historical danmaku I scraped earlier turned out to be the only record from that period. See the earlier post: Scraping a BiliBili Video's Historical Danmaku with BeautifulSoup […]
