Tokenizing on Punctuation and Counting Word Frequencies

Suppose we have the following passage of text:

As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests.

 

I want to find out which words in it occur most often and which occur least often.

 

I need to do two things:

1. First, tokenize the text, splitting it on punctuation and whitespace.

2. Then count how often each word occurs.
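As a quick illustration of the two steps, here is a minimal sketch on a short, made-up sentence. The sample string and the variable names are only for demonstration and are not part of the original text.

```python
import re
from collections import Counter

# Hypothetical sample sentence, used only to illustrate the two steps.
sample = "The cat saw the dog, and the dog ran."

# Step 1: split on runs of non-word characters (punctuation and whitespace).
tokens = re.split(r"\W+", sample.lower())
print(tokens)

# Step 2: count how often each token occurs.
counts = Counter(tokens)
print(counts["the"], counts["dog"])
```

Because the sample ends with a period, the token list ends with an empty string. The same effect appears whenever the input starts or ends with a separator.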

 

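import re
from collections import Counter


def count_words(text):
    """Count how many times each word occurs in text, ignoring case."""
    # Convert to lower case so that "He" and "he" count as the same word.
    text_lower = text.lower()
    # Split on runs of non-word characters (punctuation and whitespace).
    # Note: if the text starts or ends with such a character, re.split
    # also returns an empty string, which is counted like any other token.
    tokens = re.split(r'\W+', text_lower)
    counts = Counter(tokens)
    return counts


def test_run():
    with open("text.txt", "r") as f:
        text = f.read()

    counts = count_words(text)
    # Sort the (word, count) pairs by count, highest first.
    sorted_counts = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

    print("10 most common words:\nWord\tCount")
    for word, count in sorted_counts[:10]:
        print("{}\t{}".format(word, count))

    print("\n10 least common words:\nWord\tCount")
    for word, count in sorted_counts[-10:]:
        print("{}\t{}".format(word, count))


if __name__ == '__main__':
    test_run()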

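The output is as follows:

10 most common words:
Word	Count
a	9
he	6
the	6
and	5
as	4
was	4
with	3
i	2
of	2
his	2

10 least common words:
Word	Count
merry	1
word	1
or	1
slap	1
on	1
for	1
more	1
favoured	1
guests	1
	1

Process finished with exit code 0

The last row has an empty word. The text ends with a period, so re.split returns an empty string after it, and that empty string is counted once.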
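re.split returns an empty string whenever the text begins or ends with a separator, which is why a blank word with a count of 1 can show up among the least common words. One possible way to avoid it, as a sketch rather than the original author's method, is to match the words directly with re.findall instead of splitting on what lies between them. The function name count_words_no_empty and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def count_words_no_empty(text):
    """Count lowercase word tokens by matching words rather than separators."""
    # findall returns only the matched runs of word characters, so no
    # empty strings are produced at the start or end of the text.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


# Hypothetical sample; it ends with a period, like the passage above.
counts = count_words_no_empty("A merry word or a slap on the shoulder.")

# most_common(n) returns the n highest counts as (word, count) pairs.
print(counts.most_common(3))
```

Counter.most_common also replaces the manual sort for the "most common" half of the report. For the least common words, sorting counts.items() as in the original code still works.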
