Before I knew it, the National Day holiday at home was almost over.

0x00. Preface

I started filling in this overdue post at 2017-10-21 21:21:13 and finished it at 2017-10-21 22:44:23.

0x01. Storing the data

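When I saw this event page, I wanted to save the "confession" messages into a database. As usual, F12 plus the XHR filter turned up two relevant endpoints:
http://www.bilibili.com/activity/likes/list/10156?t=1508592478785&page=1&pagesize=1
http://www.bilibili.com/activity/likes/random/10156?t=1508592478788&count=100
Judging by the names, I think the first endpoint suits us better, since the second one returns random entries and is not guaranteed to ever cover the full set. After some testing, I built the following crawl-friendly URL:
http://www.bilibili.com/activity/likes/list/10156?pagesize=49&page=<page number goes here>
From the start of the event to its end, 104368 entries remain, as shown in the figure: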

Here is the row count of the table, to show that I really did save every one of them:

At 49 entries per page, 104368 / 49 ≈ 2129.96, so 2130 pages are enough to fetch everything.
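The page count is just a ceiling division of the total by the page size. Below is a minimal Python 3 sketch of that arithmetic using the numbers quoted above; the helper names `page_count` and `page_url` are my own and not part of any API.

```python
BASE_URL = "http://www.bilibili.com/activity/likes/list/10156"


def page_count(total, pagesize):
    """Number of pages needed to cover `total` entries (ceiling division)."""
    return (total + pagesize - 1) // pagesize


def page_url(page, pagesize=49):
    """Build the list URL for one page, in the same form as the URL above."""
    return "{}?pagesize={}&page={}".format(BASE_URL, pagesize, page)


if __name__ == "__main__":
    print(page_count(104368, 49))  # 2130
    print(page_url(1))
```

The last page is not full: 2129 pages hold 104321 entries, which leaves 47 for page 2130.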
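For fast crawling (only suitable for API servers other than Bilibili's that have no anti-crawling measures), you can do it like this: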

# -*- coding: utf-8 -*-
# Python 2 script: fetch every page of the list API and store each entry in MySQL.
import requests
import json
import pymysql
from multiprocessing.dummy import Pool as ThreadPool


def get_source(page):
    url = "http://www.bilibili.com/activity/likes/list/10156?pagesize=49&page=" + str(page)
    response = requests.get(url).text
    jsDict = json.loads(response)
    if jsDict['code'] == 0:
        list_1 = jsDict['data']['list']
        for each in list_1:
            # Top-level fields of one entry
            id = each['id']
            print id
            sid = each['sid']
            state = each['state']
            type = each['type']
            mid = each['mid']
            wid = each['wid']
            ctime = each['ctime']
            likes = each['likes']
            liked = each['liked']
            message = each['message']
            device = each['device']
            image = each['image']
            plat = each['plat']
            reply = each['reply']
            link = each['link']

            # Fields of the user who posted the entry
            owner_mid = each['owner']['mid']
            owner_name = each['owner']['name']
            owner_face = each['owner']['face']
            owner_sex = each['owner']['sex']

            # Account level of that user
            owner_level_info_current_level = each['owner']['level_info']['current_level']
            owner_level_info_current_min = each['owner']['level_info']['current_min']
            owner_level_info_current_exp = each['owner']['level_info']['current_exp']
            owner_level_info_next_exp = each['owner']['level_info']['next_exp']

            connection = None  # stays None if connect() itself fails
            try:
                connection = pymysql.connect(
                    host='localhost', user='root', passwd='***', db='bilibili', charset='utf8')
                with connection.cursor() as cursor:
                    sql = "INSERT INTO `yourname` (`id`,`sid`,`state`,`type`,`mid`,`wid`,`ctime`,`likes`,`liked`, `message`,`device`,`image`,`plat`, `reply`,`link`,`owner_mid`,`owner_name`,`owner_face`, `owner_sex`,`owner_level_info_current_level`,`owner_level_info_current_min`,`owner_level_info_current_exp`,`owner_level_info_next_exp`) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
                    cursor.execute(sql, (
                        id, sid, state, type, mid, wid, ctime, likes, liked, message, device, image, plat, reply, link,
                        owner_mid, owner_name, owner_face, owner_sex, owner_level_info_current_level,
                        owner_level_info_current_min, owner_level_info_current_exp, owner_level_info_next_exp))

                connection.commit()
            except Exception as e:
                print e
            finally:
                # Only close a connection that was actually opened
                if connection is not None:
                    connection.close()
    else:
        print "Error"


# The sample URL starts at page=1, and 2130 pages of 49 entries cover all
# 104368 rows (the original range(0, 2030) stopped 100 pages short).
pages = range(1, 2131)

pool = ThreadPool(500)
try:
    results = pool.map(get_source, pages)
except Exception as e:
    print e
finally:
    # Always release the worker threads, whether or not map() failed
    pool.close()
    pool.join()
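The script copies all 23 fields into local variables one by one before inserting them. As a possible alternative, here is a small Python 3 sketch that flattens one entry into a row tuple in the same column order. The field names come from the script; `flatten_entry` and the placeholder sample entry are my own inventions for illustration, not real API data.

```python
# Column order matches the INSERT statement in the script.
TOP_LEVEL = ["id", "sid", "state", "type", "mid", "wid", "ctime", "likes", "liked",
             "message", "device", "image", "plat", "reply", "link"]
OWNER = ["mid", "name", "face", "sex"]
LEVEL_INFO = ["current_level", "current_min", "current_exp", "next_exp"]


def flatten_entry(entry):
    """Turn one nested entry from the list API into a flat 23-value tuple."""
    owner = entry["owner"]
    level = owner["level_info"]
    return tuple([entry[key] for key in TOP_LEVEL]
                 + [owner[key] for key in OWNER]
                 + [level[key] for key in LEVEL_INFO])


# Hypothetical entry filled with placeholder strings, only to show the shape.
sample = {key: "top-" + key for key in TOP_LEVEL}
sample["owner"] = {key: "owner-" + key for key in OWNER}
sample["owner"]["level_info"] = {key: "level-" + key for key in LEVEL_INFO}

row = flatten_entry(sample)
print(len(row))  # 23
```

A tuple built this way can be passed as the second argument of `cursor.execute`, and a list of them to `cursor.executemany`, so one connection per page would be enough instead of one per row.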

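0x02. Exporting plain text from MySQL

Following the article on exporting a single MySQL column to a txt file (从mysql中导出一列数据到txt), save the entire message column of the yourname table in the bilibili database to a text file (C:/DATA/out.txt in this example).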

mysql> use bilibili;
Database changed
mysql> select message into outfile "c:/DATA/out.txt" lines terminated by "\r\n" from yourname;
1290 - The MySQL server is running with the --secure-file-priv option so it cannot execute this statement

Edit my.ini and append secure_file_priv="C:/DATA/" at the end (the option belongs to the [mysqld] section), then save and restart the database.

mysql> show variables like '%secure%';
+------------------+----------+
| Variable_name    | Value    |
+------------------+----------+
| secure_auth      | OFF      |
| secure_file_priv | C:\DATA\ |
+------------------+----------+
2 rows in set

If the output looks like the above, the setting has taken effect.

mysql> select message into outfile "c:/DATA/out.txt" lines terminated by "\r\n" from yourname;
Query OK, 104368 rows affected

Running the statement again produces out.txt, which I then transferred back to my local machine over FTP.
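If changing secure_file_priv is not an option, the same column can probably be exported from the client side instead. Below is a rough Python 3 sketch that assumes the connection parameters of the crawler script. The function names are my own, and `export_from_mysql` needs a reachable server, so treat it as untested; only `write_messages` is exercised by the demo.

```python
import io
import os
import tempfile


def write_messages(messages, path, terminator="\r\n"):
    """Write one message per line and return the number of lines written."""
    count = 0
    # newline="" stops Python from translating the terminator a second time.
    with io.open(path, "w", encoding="utf-8", newline="") as f:
        for message in messages:
            f.write(message + terminator)
            count += 1
    return count


def export_from_mysql(path):
    """Dump the message column of `yourname` to `path` (needs a live server)."""
    import pymysql  # imported lazily so write_messages works without it
    connection = pymysql.connect(host="localhost", user="root", password="***",
                                 database="bilibili", charset="utf8")
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT message FROM yourname")
            return write_messages((row[0] for row in cursor.fetchall()), path)
    finally:
        connection.close()


if __name__ == "__main__":
    # Demo on two literal messages, written to a throwaway directory.
    demo_path = os.path.join(tempfile.mkdtemp(), "out.txt")
    print(write_messages(["喜欢", "我爱你"], demo_path))
```

This writes the file on the client machine, so the FTP step would no longer be needed.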

0x03. Chinese word segmentation with jieba

To extract the keywords, you can do it like this:

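import jieba.analyse
import cPickle as pickle

# Read the exported messages (Python 2 script)
with open("out.txt", 'r') as f:
    content = f.read()

# Extract the top 100 keywords together with their weights
tags = jieba.analyse.extract_tags(content, topK=100, withWeight=True)
print "Finished extraction."
for tag in tags:
    print tag[0], "\t", tag[1]

# cPickle defaults to the ASCII protocol 0, so text mode works here;
# the ./assets directory must already exist
with open("./assets/tags.pickle", "w") as f:
    pickle.dump(tags, f)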

The output is as follows:

Building prefix dict from the default dictionary ...
Loading model from cache c:\users\yuange~1\appdata\local\temp\jieba.cache
Loading model cost 0.462 seconds.
Prefix dict has been built succesfully.
Finished extraction.
喜欢 	0.324251093864
我爱你 	0.141492171794
希望 	0.0872011423752
一起 	0.0645035960844
永远 	0.0624222569269
我会 	0.0570180337732
一直 	0.0555175407499
名字 	0.054767164132
我们 	0.0483923629279
告白 	0.0480520036241
遇见 	0.0404470021514
知道 	0.0380303658285
表白 	0.0375291125829
真的 	0.0346544513713
相遇 	0.0333705541785
还是 	0.0325035732867
忘记 	0.0282951578971
二次元 	0.0268951184777
虽然 	0.0267818329394
不会 	0.0255878815482
未来 	0.0251641649911
但是 	0.024407187094
幸福 	0.0238904170345
一定 	0.0235970363729
一个 	0.0222749620653
那个 	0.0220862023796
自己 	0.0208379896841
现在 	0.0208044554451
我要 	0.0199511389164
没有 	0.0193877103187
遇到 	0.0188617945085
如果 	0.0182736407689
再见 	0.0177817180204
找到 	0.0176360616325
世界 	0.0169725262572
七夕 	0.0168677168831
... 	0.0165303584246
加油 	0.0165085656864
可以 	0.0164258453739
谢谢 	0.0162835181917
一生 	0.0160791235199
啊啊啊 	0.0160787253266
记得 	0.0156403210769
三年 	0.0154236874264
好好 	0.0152625773588
身边 	0.015161018891
即使 	0.0148554953566
一辈子 	0.0145913427404
努力 	0.0143755498617
此生 	0.0143429949614
你们 	0.0142989803287
什么 	0.0139712428593
不管 	0.0138434591696
就是 	0.0137978195567
以后 	0.0133039175692
时候 	0.0131537926656
单身 	0.0130653440919
一天 	0.013018358808
开心 	0.0125408371338
我能 	0.0124950664672
无论 	0.0124649435768
守护 	0.0124503674775
一次 	0.0118370516219
陪伴 	0.0117597817649
女朋友 	0.0117143186911
不能 	0.0116363826448
三叶 	0.0112544888653
安好 	0.0111032197697
可能 	0.0107953851099
祝你幸福 	0.0107196615408
因为 	0.0106407939045
也许 	0.010609106987
感谢 	0.0106036173337
已经 	0.010543736305
哔哩 	0.0104262111997
心意 	0.0104215537926
暗恋 	0.0103705181058
只是 	0.0102918636606
快乐 	0.0102163069019
就算 	0.0101965684284
爱着 	0.0101303157723
不想 	0.0100458353843
下去 	0.00991074980489
一年 	0.0098914282562
相见 	0.00975838837469
对不起 	0.00967898614789
不要 	0.00959938169197
美好 	0.00956585198602
告诉 	0.00952197220922
一切 	0.00946857669345
那么 	0.00945186230346
哪里 	0.00934967618021
想要 	0.00915397039592
不是 	0.00903258766246
看到 	0.00883566669778
电影 	0.00882967071275
曾经 	0.00870491366138
时间 	0.00867462160313
相信 	0.00867211421126
愿意 	0.00843351469339
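The second column holds TF-IDF weights. As far as I understand it, `jieba.analyse.extract_tags` multiplies each word's share of the text by an inverse document frequency taken from a table bundled with jieba, then sorts by the product. Below is a toy Python 3 sketch of that idea. The pre-segmented word list, the IDF values and the name `tfidf_tags` are all made up for illustration and are not jieba's actual data or code.

```python
from collections import Counter


def tfidf_tags(words, idf, default_idf, top_k=3):
    """Rank words by term frequency times inverse document frequency."""
    counts = Counter(words)
    total = float(sum(counts.values()))
    weights = {word: (n / total) * idf.get(word, default_idf)
               for word, n in counts.items()}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical, already segmented text and a made-up IDF table.
words = ["喜欢", "你", "喜欢", "的", "的", "的", "永远"]
idf = {"喜欢": 5.0, "永远": 6.0, "你": 1.0, "的": 0.1}

tags = tfidf_tags(words, idf, default_idf=3.0)
for word, weight in tags:
    print(word, "\t", weight)
```

In this toy input "的" is the most frequent word, yet it ranks last because its IDF is tiny. The same effect would explain why common function words sit low in, or are missing from, the list above.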

0x04. Drawing the word cloud with WordCloud

Basic usage is simple; you can do it like this:

# -*- coding:utf-8 -*-
""" Generate the word cloud image (Python 2) """
from wordcloud import WordCloud, ImageColorGenerator
import cPickle as pickle
import numpy as np
from PIL import Image

if __name__ == "__main__":
    # Load the keyword weights and scale them to integer frequencies
    with open("./assets/tags.pickle", "r") as f:
        tags = pickle.load(f)
        frequencies = {}
        for tag in tags:
            frequencies[tag[0]] = int(10000 * tag[1])

    wc = WordCloud(font_path='./assets/simhei.ttf',  # font to use
                   background_color="black",  # background color
                   max_words=100,  # maximum number of words shown in the cloud
                   max_font_size=500,  # maximum font size
                   # random_state=42,
                   width=1366,
                   height=768)

    wc.generate_from_frequencies(frequencies)

    # Recolor the words with colors sampled from a reference image
    rainbow_coloring = np.array(Image.open("./assets/rainbow.jpg"))
    image_colors = ImageColorGenerator(rainbow_coloring)
    wc.recolor(color_func=image_colors)

    # Save the image
    wc.to_file("./assets/word_cloud1.png")
    print "saved at ./assets/word_cloud1.png"
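The script turns each floating-point weight into an integer frequency by multiplying it by 10000 and truncating. Here is a small Python 3 sketch of just that step, run on three weights copied from the output in 0x03; the name `to_frequencies` is my own.

```python
def to_frequencies(tags, scale=10000):
    """Map (word, weight) pairs to {word: int(scale * weight)}."""
    return {word: int(scale * weight) for word, weight in tags}


# First, second and last weights from the extraction output.
tags = [("喜欢", 0.324251093864),
        ("我爱你", 0.141492171794),
        ("愿意", 0.00843351469339)]

frequencies = to_frequencies(tags)
print(frequencies)
```

Even the smallest of the 100 weights still maps to 84, so with this scale no word is truncated to zero.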

0x05. Resulting images

The upper image is the word cloud; the lower one is the ranking by likes.

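0x06. References

Generating word clouds: usage of the WordCloud package in Python (生成词云之python中WordCloud包的用法)
Drawing 3 million Taobao bra reviews as a word cloud (把300W淘宝文胸评论绘制成词云)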