What Words Do Renren's Most Popular Blog Posts Use?

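While reading today I got the idea of writing a program to filter out inane articles on Xiaonei (the former name of Renren), but after writing part of it I found the workload was a bit large. I happened to recall an interesting article I once read called 《东风何处是人间》, so on second thought I took the partly written program and used it to count word frequencies in the most-shared blog posts on Xiaonei.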

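I wrote a Python program that fetches the 120 most-shared articles listed in Xiaonei's sharing section, counts how often every two-character sequence (each pair of adjacent characters) occurs in them, and sorts the result, which I then screened by hand. And so this post, "What Words Do Renren's Most Popular Blog Posts Use?", was born! The results are given below, and I make no comment on them. The source code is attached at the end of the post, and you are welcome to dig further from there (also, I don't guarantee my code is free of bugs...).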
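To make the counting step concrete, here is a minimal sketch in Python 3 (the script at the end of this post is Python 2). It slides a two-character window over a string and tallies each pair with `collections.Counter`. The function name `count_bigrams` and the sample string are made up for illustration; they are not part of the original script.

```python
from collections import Counter


def count_bigrams(text):
    """Count every pair of adjacent characters in text (hypothetical helper)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


# Made-up sample text, not taken from any real post.
sample = "帅哥喜欢帅哥"
counts = count_bigrams(sample)

# Show the most frequent pairs first.
for word, n in counts.most_common(3):
    print(word, n)
```

Note that the window does not know where real words begin or end, so pairs such as "哥喜" are counted too. That is why the raw output still needs manual screening.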

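Top 15 content nouns:

1, 帅哥 (handsome guy), 295 times

2, 男人 (man), 184 times

3, 中国 (China), 178 times

4, 孩子 (child), 174 times

5, 蟑螂 (cockroach), 171 times

6, 女人 (woman), 140 times

7, 韩国 (South Korea), 136 times

8, 朋友 (friend), 135 times

9, 世界 (world), 118 times

10, 时间 (time), 113 times

11, 咖啡 (coffee), 108 times

11, 妈妈 (mom), 108 times

13, 生活 (life), 97 times

14, 永远 (forever), 96 times

15, 幸福 (happiness), 95 times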

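Note: although these words are counted here as content nouns, they do not necessarily occur as nouns in the text. "永远" (forever), for instance, is presumably not a noun in most of its occurrences. "生活" (life / to live) can be either a noun or a verb, so it appears on both this list and the next one.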

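Top 10 content verbs:

1, 喜欢 (to like), 184 times

2, 觉得 (to feel, to think), 116 times

3, 开始 (to begin), 107 times

4, 生活 (to live), 97 times

5, 到了 (arrived, reached), 89 times

6, 看到 (to see), 88 times

6, 发现 (to discover), 88 times

8, 需要 (to need), 85 times

9, 起来 (to get up), 80 times

10, 出来 (to come out), 78 times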

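Note: as with the previous list, many of these words are not necessarily used as verbs in the text, and may not even occur as the word on its own. "起来", for example, may appear in the original text as part of "站起来" (stand up) or "坐起来" (sit up).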
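A small Python 3 sketch of that caveat: given a text, it separates the occurrences of a two-character word that fall inside some longer phrase from all the others. The function name `split_occurrences`, the phrase list, and the sample sentence are all hypothetical and only meant to illustrate the point. The sketch also assumes that no two of the longer phrases can cover the same occurrence.

```python
def split_occurrences(text, word, longer_phrases):
    """Return (inside, other): how many occurrences of `word` sit inside one
    of `longer_phrases`, and how many do not.

    Assumes each phrase contains `word` exactly once and that no two phrases
    can match at the same position.
    """
    total = text.count(word)
    inside = sum(text.count(phrase) for phrase in longer_phrases)
    return inside, total - inside


# Made-up sentence containing "起来" four times.
sample = "他站起来，又坐起来，看起来很累。起来吧。"
inside, other = split_occurrences(sample, "起来", ["站起来", "坐起来"])
print(inside, other)
```

A plain two-character count would report all four occurrences as "起来", even though two of them belong to the phrases listed.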

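Top 15 across all words:

1, 自己 (oneself), 537 times

2, 我们 (we), 517 times

3, 没有 (not have, there is no), 469 times

4, 什么 (what), 383 times

5, 时候 (time, when), 319 times

6, 可以 (can), 307 times

7, 不是 (is not), 299 times

8, 帅哥 (handsome guy), 295 times

9, 因为 (because), 275 times

10, 知道 (to know), 264 times

11, 个人 (person, individual), 257 times

12, 你的 (your), 246 times

13, 就是 (is exactly), 243 times

14, 如果 (if), 208 times

15, 这个 (this), 205 times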

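Top 5 most frequent words in titles:

1, 什么 (what), 9 times

2, 我们 (we), 8 times

2, 男人 (man), 8 times

4, 感谢 (thanks), 7 times

5, 一个 (one, a), 6 times

5, 没有 (not have, there is no), 6 times

5, 世界 (world), 6 times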
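The lists in this post give tied words the same rank and then skip ahead (for example 11, 11, 13 in the noun list). Here is a minimal Python 3 sketch of that tie-handling rule, often called standard competition ranking. The function name `competition_ranks` is made up for illustration, and the input is the title counts from the list above.

```python
def competition_ranks(sorted_counts):
    """Assign ranks to (word, count) pairs already sorted by count descending.

    Words with equal counts share a rank; the next distinct count takes the
    rank equal to its 1-based position in the list.
    """
    ranked = []
    previous_count = None
    rank = 0
    for position, (word, count) in enumerate(sorted_counts, start=1):
        if count != previous_count:
            rank = position
            previous_count = count
        ranked.append((rank, word, count))
    return ranked


title_words = [("什么", 9), ("我们", 8), ("男人", 8), ("感谢", 7),
               ("一个", 6), ("没有", 6), ("世界", 6)]

for rank, word, count in competition_ranks(title_words):
    print(rank, word, count)
```

Because of the three-way tie at 6, a "top 5" cut under this rule keeps seven words.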

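Finally, the promised source code. Note that you may need to download BeautifulSoup (link), and before running it, fill in the email and password fields of post_data with your own Xiaonei username and password. Also, if you are reading this post via RSS, the code may not display properly here.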

#encoding=utf-8
# Python 2 script; needs the BeautifulSoup 3 package.
import urllib, urllib2, cookielib, re
from BeautifulSoup import BeautifulSoup

# Log in once; the cookie jar keeps the session for all later requests.
myCookie = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(myCookie)
post_data = {
        'email':'email',        # replace with your own username
        'password':'passwd',    # replace with your own password
        'origURL':'http://www.renren.com/home',
        'domain':'renren.com'
        }
req = urllib2.Request('http://www.renren.com/PLogin.do', urllib.urlencode(post_data))
html_src = opener.open(req).read()

# Collect the URLs of the most-shared posts from 6 pages of the hot list.
posts = []
pattern = re.compile(r'.*href="(.*?)".*?', re.I | re.X)
for i in range(6):
    theURL = 'http://share.renren.com/share/hotlist?curpage=' + str(i) + '&t=1&__view=async-html-reload'
    req = urllib2.Request(theURL)
    html_src = opener.open(req).read()
    parser = BeautifulSoup(html_src)
    posts_list = parser.findAll('li','share')
    for post in posts_list:
        # Serialize the <a> tag and pull the link out of its href attribute.
        post = post.h3.a.__str__('GB18030')
        m = pattern.match(post)
        if m:
            posts.append(m.group(1))

print "Got all posts"

# Maps each two-character sequence to its number of occurrences.
statistic = {}

def doStatistic(txt):
    # Slide a two-character window over the text and count every pair.
    for i in range(len(txt) - 1):
        word = txt[i] + txt[i + 1]
        statistic[word] = statistic.get(word, 0) + 1

# Fetch every post and count the pairs in its title and body.
# (The leftover debugging limit "if i > 10: break" never fired, because its
# counter was commented out, so it has been removed.)
for post in posts:
    req = urllib2.Request(post)
    html_src = opener.open(req).read()
    parser = BeautifulSoup(html_src)
    title = parser.find('h3','title-article')
    if title is not None:
        title = title.strong.text
        content = parser.find('div', 'text-article').text
        doStatistic(title)
        doStatistic(content)

print "Statistic Finished"

# Sort by count, most frequent first, and print pairs seen more than 4 times.
word_list = statistic.keys()
word_list.sort(key = lambda w: statistic[w], reverse = True)

for word in word_list:
    if statistic[word] > 4:
        print word + ": " + str(statistic[word])

print "Finished"
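The script above counts every pair of adjacent characters, so pairs containing punctuation, spaces, or Latin letters land in the tally as well. As one possible way to cut down the manual screening, here is a Python 3 sketch that keeps only pairs made of two characters from the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). The names `CJK_PAIR` and `count_cjk_bigrams` and the sample string are my own assumptions; this is not what the original script does.

```python
import re
from collections import Counter

# Two consecutive characters from the basic CJK Unified Ideographs block.
CJK_PAIR = re.compile(r"[\u4e00-\u9fff]{2}")


def count_cjk_bigrams(text):
    """Count adjacent character pairs, skipping any pair that is not pure CJK."""
    pairs = (text[i:i + 2] for i in range(len(text) - 1))
    return Counter(pair for pair in pairs if CJK_PAIR.fullmatch(pair))


# Made-up sample mixing Chinese text, punctuation, and Latin letters.
counts = count_cjk_bigrams("帅哥, OK! 帅哥。")
print(counts)
```

This only removes non-Chinese noise. Pairs that straddle two real words would still need to be weeded out by hand.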


优哉·幽斋

