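The Python stop word list has been updated alongside a hot word list that reflects the latest trending terms. Stop words can skew the results of text analysis, so they need to be filtered out before the hot words are counted.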
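1. Get the stop word list
First we need a Chinese stop word list downloaded from the web. jieba does not expose a stop word list as jieba.analyse.stop_words (that attribute does not exist), so here we load the downloaded list from a local file with one word per line.
    # Load the stop word list, one word per line.
    # 'stopwords.txt' is an assumed local file name for the downloaded list.
    with open('stopwords.txt', 'r', encoding='utf8') as f:
        stopwords = {line.strip() for line in f if line.strip()}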
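If the stop words come from more than one downloaded file, they can be merged into a single set. The following is a minimal sketch, not a fixed recipe: the helper name load_stopwords, its extra parameter, and the demo file name are all hypothetical, and the files are assumed to hold one word per line in UTF-8.

```python
import os
import tempfile


def load_stopwords(*paths, extra=()):
    """Merge one-word-per-line stop word files and extra words into one set."""
    stopwords = set(extra)
    for path in paths:
        with open(path, "r", encoding="utf8") as f:
            for line in f:
                word = line.strip()
                if word:  # skip blank lines
                    stopwords.add(word)
    return stopwords


# Demo: a temporary file stands in for a downloaded list (hypothetical name).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "stopwords_demo.txt")
    with open(demo_path, "w", encoding="utf8") as f:
        f.write("的\n了\n\n和\n")
    demo_stopwords = load_stopwords(demo_path, extra=["the", "of"])

print(sorted(demo_stopwords))
```

Using a set keeps the later membership test (word not in stopwords) fast even for long lists.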
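2. Read the text data
Next we read the text to analyze. Here we assume it is stored in a file named text_data.txt.
    # Read the whole file into a single string
    with open('text_data.txt', 'r', encoding='utf8') as f:
        text = f.read()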
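3. Segment the text and remove stop words
Use jieba to segment the text into words, then drop every word that appears in the stop word list.
    import jieba.posseg as pseg

    # Segment the text and remove stop words (the part-of-speech flag is not used here)
    words = [word for word, flag in pseg.cut(text) if word not in stopwords]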
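The filtering step can be tried without jieba by feeding it a hand-written token list. This is only an illustrative sketch: remove_stopwords is a hypothetical helper, and it additionally drops whitespace-only tokens, which the step above does not do.

```python
def remove_stopwords(tokens, stopwords):
    """Keep tokens that are non-blank and not in the stop word set."""
    kept = []
    for token in tokens:
        token = token.strip()
        if token and token not in stopwords:
            kept.append(token)
    return kept


# Hand-written tokens standing in for the output of a segmenter.
sample_tokens = ["人工智能", "和", "大数据", "的", "发展", " ", "很", "快"]
sample_stopwords = {"和", "的", "很"}
filtered = remove_stopwords(sample_tokens, sample_stopwords)
print(filtered)  # ['人工智能', '大数据', '发展', '快']
```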
4. Count word frequencies
Use the Counter class from the collections module to count how often each word occurs.
    from collections import Counter

    # Count word frequencies
    word_freq = Counter(words)
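When new text arrives later, an existing Counter can be extended rather than rebuilt, which is one way to keep a hot word list up to date. A small sketch with made-up sample words (the variable name word_freq_demo is illustrative):

```python
from collections import Counter

# Counts from an earlier batch of words.
word_freq_demo = Counter(["疫情", "5G", "疫情"])

# Counter.update adds the counts from a new batch to the existing ones.
word_freq_demo.update(["5G", "芯片", "疫情"])

print(word_freq_demo["疫情"])    # 3
print(word_freq_demo["5G"])      # 2
print(word_freq_demo["区块链"])  # 0, missing words count as zero
```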
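5. Update the hot word list
Sort the words by frequency in descending order and take the top N as hot words.
    # Update the hot word list; N is the number of hot words to keep
    # (20 is an assumed example value)
    N = 20
    hotwords = word_freq.most_common(N)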
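Taking the top N alone can let very rare words through when the text is short. One possible refinement, shown here as a sketch with a hypothetical helper top_hotwords and a hypothetical min_count parameter, is to require a minimum frequency as well:

```python
from collections import Counter


def top_hotwords(word_freq, n, min_count=1):
    """Return up to n (word, count) pairs with count >= min_count, highest first."""
    ranked = word_freq.most_common()  # all items, sorted by count descending
    return [(word, count) for word, count in ranked if count >= min_count][:n]


freq = Counter({"疫情": 5, "5G": 3, "芯片": 1, "云计算": 3})
print(top_hotwords(freq, n=3, min_count=2))
# [('疫情', 5), ('5G', 3), ('云计算', 3)]
```

Words with equal counts keep the order in which they were first added to the Counter.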
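6. Output the hot word list
Write the hot word list to a file.
    # Write the hot word list, one "word: frequency" pair per line
    with open('hotwords.txt', 'w', encoding='utf8') as f:
        for word, freq in hotwords:
            f.write(f'{word}: {freq}\n')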
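To check that the output file can be consumed later, the "word: frequency" lines can be read back into pairs. This is a sketch under the assumption that each line ends with ": " followed by an integer; save_hotwords and read_hotwords are hypothetical helper names.

```python
import os
import tempfile


def save_hotwords(hotwords, path):
    """Write (word, freq) pairs as 'word: freq' lines."""
    with open(path, "w", encoding="utf8") as f:
        for word, freq in hotwords:
            f.write(f"{word}: {freq}\n")


def read_hotwords(path):
    """Parse 'word: freq' lines back into (word, freq) pairs."""
    hotwords = []
    with open(path, "r", encoding="utf8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last ': ' so words containing ': ' still parse.
            word, freq = line.rsplit(": ", 1)
            hotwords.append((word, int(freq)))
    return hotwords


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "hotwords_demo.txt")
    save_hotwords([("疫情", 5), ("5G", 3)], demo_path)
    restored = read_hotwords(demo_path)

print(restored)  # [('疫情', 5), ('5G', 3)]
```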
At this point, we have finished using Python to filter out stop words and update the hot word list.
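The six steps can also be combined into one function. The sketch below is an assumption-laden illustration rather than the article's own code: build_hotwords is a hypothetical name, and the tokenizer defaults to whitespace splitting so the example runs without jieba. For Chinese text, a segmenter such as jieba.lcut could be passed as the tokenizer instead.

```python
from collections import Counter


def build_hotwords(text, stopwords, n, tokenizer=str.split):
    """Tokenize text, drop stop words, and return the n most frequent words."""
    tokens = (token.strip() for token in tokenizer(text))
    counts = Counter(t for t in tokens if t and t not in stopwords)
    return counts.most_common(n)


demo_text = "5G and cloud and 5G chips and cloud 5G"
print(build_hotwords(demo_text, {"and"}, n=2))
# [('5G', 3), ('cloud', 2)]
```

Passing the tokenizer as a parameter keeps the counting logic independent of any particular segmentation library.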
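The following simple table has two columns: one for the Python stop word list and one for the updated hot word list.
Stop word list | Updated hot word list |
a | 新冠病毒 (novel coronavirus) |
about | 疫情 (epidemic) |
above | 云计算 (cloud computing) |
after | 5G |
again | 人工智能 (artificial intelligence) |
all | 大数据 (big data) |
almost | 区块链 (blockchain) |
along | 芯片 (chips) |
also | 无人驾驶 (autonomous driving) |
always | 虚拟现实 (virtual reality) |
among | 生物技术 (biotechnology) |
an | 量子计算 (quantum computing) |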
and | |
any | |
are | |
as | |
at | |
be | |
because | |
been | |
before | |
being | |
below | |
between | |
both | |
but | |
by | |
can | |
could | |
did | |
do | |
does | |
doing | |
down | |
during | |
each | |
few | |
for | |
from | |
further | |
had | |
has | |
have | |
having | |
he | |
her | |
here | |
hers | |
herself | |
him | |
himself | |
his | |
how | |
however | |
i | |
if | |
in | |
into | |
is | |
it | |
its | |
itself | |
just | |
kg | |
km | |
lb | |
left | |
like | |
ln | |
ltd | |
m | |
mg | |
might | |
ml | |
mm | |
more | |
most | |
mr | |
mrs | |
ms | |
much | |
must | |
my | |
myself | |
n | |
no | |
nor | |
not | |
of | |
off | |
often | |
on | |
once | |
only | |
or | |
other | |
our | |
ours | |
ourselves | |
out | |
over | |
own | |
part | |
per | |
perhaps | |
put | |
rather | |
re | |
s | |
same | |
she | |
should | |
since | |
so | |
some | |
such | |
t | |
than | |
that | |
the | |
their | |
theirs | |
them | |
themselves | |
then | |
there | |
these | |
they | |
thick | |
thin | |
this | |
those | |
through | |
to | |
too | |
under | |
until | |
up | |
very | |
was | |
we | |
well | |
were | |
what | |
when | |
where | |
which | |
while | |
who | |
whom | |
why | |
with | |
within | |
without | |
would | |
yet | |
you | |
your | |
yours | |
yourself | |
yourselves |
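Note that the stop word list above is in English while the hot word list is in Chinese. The table is only an example, and in practice the contents of both lists can be adjusted to fit your needs. A stop word list usually contains common words that carry little meaning on their own, while a hot word list contains currently trending topics or keywords.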
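Adjusting the lists can be done with ordinary set operations. The sketch below uses a few words from the table plus made-up additions; the names custom_additions and words_to_keep are hypothetical.

```python
# A few entries from the example stop word list.
base_stopwords = {"a", "about", "above", "after"}

# Hypothetical project-specific adjustments.
custom_additions = {"etc", "via"}   # extra words to filter out
words_to_keep = {"above"}           # words that should not be filtered

adjusted_stopwords = (base_stopwords | custom_additions) - words_to_keep
print(sorted(adjusted_stopwords))  # ['a', 'about', 'after', 'etc', 'via']

# Candidate hot words are then checked against the adjusted stop word list.
candidate_hotwords = ["疫情", "about", "5G"]
checked_hotwords = [w for w in candidate_hotwords if w not in adjusted_stopwords]
print(checked_hotwords)  # ['疫情', '5G']
```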