
YouTube Crawler


Reference link

Crawling video comments with the Google API

Using the Google API is very convenient. Follow Google's official quick-start documentation.

  1. Install the required packages in the conda environment:
pip install --upgrade google-api-python-client
pip install --upgrade google-auth-oauthlib google-auth-httplib2
  2. Set up credentials: just follow the steps below.
    • Open that page and first create a project (if you do not already have one).
    • Open the corresponding option in the sidebar, go into it, then find YouTube Data API v3 and enable it.
    • Go back to the previous page, open the Credentials page, and create an API key and an OAuth 2.0 client ID.
      • The API key is meant for web access (I am not sure of the details).
      • To call the Google API from local Python code, you need to create an OAuth client ID.
  3. Create the OAuth client ID: be sure to choose the Desktop app client type (the Web client I picked the first time did not work). Then download the JSON file; it will be used in the code.
  4. Code: change the JSON filename in the code to the path of the file you just downloaded. Set the VPN/proxy to rule mode.
import os
import numpy as np
import google_auth_oauthlib.flow
import googleapiclient.discovery
import googleapiclient.errors
from googleapiclient.errors import HttpError
import pandas as pd
import json
import socket
import socks
import requests

## Proxy settings: if you need a proxy to reach YouTube, route traffic through it here; otherwise you can skip this block.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 10080)
socket.socket = socks.socksocket
##

scopes = ["https://www.googleapis.com/auth/youtube.force-ssl"]


def main():
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

    api_service_name = "youtube"
    api_version = "v3"
    # This is the OAuth client secret file you downloaded after registering your own client.
    client_secrets_file = "client_secret_303198188639-ptcmpb7m0urubvl0mvoip8tp05tp8lv6.apps.googleusercontent.com.json"
    # Get credentials and create an API client
    flow = google_auth_oauthlib.flow.InstalledAppFlow.from_client_secrets_file(
        client_secrets_file, scopes)
    credentials = flow.run_console()

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, credentials=credentials)

    videoId = '5YGc4zOqozo'
    request = youtube.commentThreads().list(
        part="snippet,replies",
        videoId=videoId,
        maxResults=100
    )
    response = request.execute()
    # print(response)

    totalResults = int(response['pageInfo']['totalResults'])

    count = 0
    nextPageToken = ''
    comments = []
    first = True
    further = True
    while further:
        halt = False
        if first == False:
            print('..')
        try:
            response = youtube.commentThreads().list(
                part="snippet,replies",
                videoId=videoId,
                maxResults=100,
                textFormat='plainText',
                pageToken=nextPageToken
            ).execute()
            totalResults = int(response['pageInfo']['totalResults'])
        except HttpError as e:
            print("An HTTP error %d occurred:\n%s" % (e.resp.status, e.content))
            halt = True

        if halt == False:
            count += totalResults
            for item in response["items"]:
                # These are only some of the available fields; print the response first
                # to see what you can get, then extract whatever you need.
                comment = item["snippet"]["topLevelComment"]
                author = comment["snippet"]["authorDisplayName"]
                text = comment["snippet"]["textDisplay"]
                likeCount = comment["snippet"]['likeCount']
                publishtime = comment['snippet']['publishedAt']
                comments.append([author, publishtime, likeCount, text])
            if totalResults < 100:
                further = False
                first = False
            else:
                further = True
                first = False
                try:
                    nextPageToken = response["nextPageToken"]
                except KeyError as e:
                    print("A KeyError occurred: %s" % (e))
                    further = False

    print('get data count: ', str(count))

    ### write to csv file
    data = np.array(comments)
    df = pd.DataFrame(data, columns=['author', 'publishtime', 'likeCount', 'comment'])
    df.to_csv('google_comments.csv', index=False, encoding='utf-8')

    ### write to json file
    result = []
    for name, time, vote, comment in comments:
        temp = {}
        temp['author'] = name
        temp['publishtime'] = time
        temp['likeCount'] = vote
        temp['comment'] = comment
        result.append(temp)
    print('result: ', len(result))

    json_str = json.dumps(result, indent=4)
    with open('google_comments.json', 'w', encoding='utf-8') as f:
        f.write(json_str)


if __name__ == "__main__":
    main()
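
Note: newer releases of google-auth-oauthlib have dropped flow.run_console() together with the deprecated out-of-band flow, so that call may fail with an AttributeError. A minimal sketch of the replacement, assuming such a newer library version, is the local-server flow:

# Assumes a google-auth-oauthlib version that provides run_local_server().
# It starts a temporary local web server, opens the browser for consent,
# and returns the credentials, replacing the removed run_console() flow.
flow = google_auth_oauthlib.flow.InstalledAppFlow.from_client_secrets_file(
    client_secrets_file, scopes)
credentials = flow.run_local_server(port=0)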

Text Analysis

Preprocessing pipeline for English text mining

Reference links

  1. Remove the non-text parts, such as emoji and non-English characters.
  2. Tokenize the English text (not strictly necessary, but multiword names such as New York are easy to get wrong without it).
  3. Spell checking and correction: pyenchant.
  4. Stemming and lemmatization, i.e. reducing each word to its base form: nltk.
  5. Convert to lowercase, which is needed for counting word frequencies.
  6. Remove stop words, e.g. "a", "to", other short words, and punctuation marks that we do not want to carry into the analysis.
  7. Analyze the text: sentiment analysis, part-of-speech tagging (verbs, adjectives, etc.), word-frequency analysis, LDA topic modeling with gensim, and TF-IDF clustering.
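
As a rough illustration of steps 2 through 6, here is a minimal sketch using NLTK; the sample sentence is an assumed example, and the punkt, stopwords, and wordnet resources must be downloaded once beforehand.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads (uncomment on the first run):
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Tokenize, lowercase, drop non-alphabetic tokens and stop words, then lemmatize.
    tokens = word_tokenize(text)
    tokens = [t.lower() for t in tokens if t.isalpha()]
    tokens = [t for t in tokens if t not in stop_words]
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("I really LOVED the new videos, watching them in New York!!!"))
# prints the cleaned, lemmatized token list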

Sentiment analysis

Use TextBlob to determine the sentiment of individual words.
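
A minimal sketch with TextBlob (the word list is just an assumed example):

from textblob import TextBlob

# polarity ranges from -1 (negative) to +1 (positive);
# subjectivity ranges from 0 (objective) to 1 (subjective).
for word in ["great", "terrible", "okay"]:
    sentiment = TextBlob(word).sentiment
    print(word, sentiment.polarity, sentiment.subjectivity)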

Topic analysis

Reference links
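
For the LDA part, a minimal sketch with gensim, assuming the documents are preprocessed token lists like those produced above (the toy docs here are made up):

from gensim import corpora
from gensim.models import LdaModel

# Each document is a list of preprocessed tokens.
docs = [
    ["video", "music", "love", "song"],
    ["camera", "lens", "video", "quality"],
    ["song", "music", "dance", "beat"],
]

dictionary = corpora.Dictionary(docs)               # map tokens to integer ids
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)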

Semantic network: reference link
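
A co-occurrence (semantic) network can be sketched roughly as follows with networkx, assuming the same kind of token lists as above; words that appear in the same comment are linked, and the edge weight counts how often they co-occur.

from itertools import combinations
import networkx as nx

docs = [
    ["video", "music", "love", "song"],
    ["song", "music", "dance", "beat"],
]

G = nx.Graph()
for doc in docs:
    for w1, w2 in combinations(sorted(set(doc)), 2):
        # Increment the edge weight for each co-occurrence within a comment.
        if G.has_edge(w1, w2):
            G[w1][w2]['weight'] += 1
        else:
            G.add_edge(w1, w2, weight=1)

# The most strongly connected word pairs.
print(sorted(G.edges(data='weight'), key=lambda e: e[2], reverse=True)[:5])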