NLP Basics: Other Tools | 编程电脑技术交流

May 22, 2022, 19:18:43

Much the same as the Peking University (PKU) processing tool.

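spaCy: commercial-grade open-source software and among the fastest, but the setup used here does not support Chinese (spaCy v2.3 and later do ship Chinese pipelines).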

8. Gensim: vector representations of text and feature extraction.

TF-IDF, word2vec, and the bag-of-words (BoW) model.

pip install gensim
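
To make the bag-of-words idea concrete, the sketch below builds a vocabulary and per-document (token id, count) pairs in plain Python. It only illustrates the concept and is not Gensim's implementation; the helper names `build_vocab` and `to_bow` and the toy documents are made up for this example.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Turn one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Toy, pre-tokenized documents (hypothetical data for illustration)
docs = [
    ["nlp", "is", "fun"],
    ["nlp", "is", "nlp"],
]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Word order is discarded: each document is reduced to how often each vocabulary entry occurs, which is the shape of data that a TF-IDF model takes as input.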

A bug with spacy.load('en'), and a workaround:

python -m spacy download en -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com

Then load en_core_web_sm. When installing, it is also best to use that name in place of en.
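
One way to tell whether a load failure comes from a missing model rather than from spaCy itself is to check if the model package can be found. The sketch below relies on spaCy models being installed as ordinary Python packages and uses only the standard library; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("en_core_web_sm"):
    print("en_core_web_sm is installed; spacy.load('en_core_web_sm') should find it")
else:
    print("en_core_web_sm is missing; download it before calling spacy.load")
```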

Other code:

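import codecs

import thulac


def ReadFile(filePath, encoding="utf-8"):
    """Read a whole text file using the given encoding."""
    with codecs.open(filePath, "r", encoding) as f:
        return f.read()


def WriteFile(filePath, content, encoding="gbk"):
    """Write text to a file using the given encoding."""
    with codecs.open(filePath, "w", encoding) as f:
        f.write(content)


def UTF8_2_GBK(src, dst):
    """Copy a UTF-8 file to a GBK-encoded file."""
    content = ReadFile(src, encoding="utf-8")
    WriteFile(dst, content, encoding="gbk")


# 1. Segment a single sentence
thu1 = thulac.thulac(seg_only=True)  # segmentation only, no POS tagging
text = thu1.cut("我爱北京天安门", text=True)  # returns a space-separated string
print(text)

# 2. Segment a whole file
thul_f = thulac.thulac()  # default mode: segmentation plus POS tagging
UTF8_2_GBK("input.txt", "input2.txt")  # cut_f is given a GBK copy of the input
thul_f.cut_f("input2.txt", "output2.txt")  # output2.txt is written by cut_f
print("File segmentation succeeded!")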
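
One caveat with the UTF-8 to GBK conversion above: GBK cannot represent every Unicode character, so writing such text raises UnicodeEncodeError. A possible variant, sketched below with the built-in `open`, writes "?" for unencodable characters instead of failing. The function name `utf8_to_gbk` and the temporary file names are assumptions for illustration.

```python
import os
import tempfile


def utf8_to_gbk(src, dst, errors="replace"):
    """Re-encode a UTF-8 text file as GBK.

    With errors="replace", characters that GBK cannot represent are
    written as "?" instead of raising UnicodeEncodeError.
    """
    with open(src, "r", encoding="utf-8") as fin:
        content = fin.read()
    with open(dst, "w", encoding="gbk", errors=errors) as fout:
        fout.write(content)


# Small demonstration using a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    dst = os.path.join(tmp, "input_gbk.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("我爱北京天安门 \U0001F600")  # the emoji is not in GBK
    utf8_to_gbk(src, dst)
    with open(dst, "r", encoding="gbk") as f:
        converted = f.read()
    print(converted)
```

Whether silently replacing characters is acceptable depends on the text; passing errors="strict" restores the fail-fast behavior.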

# -*- coding:utf-8 -*-
from gensim import corpora
from gensim import models
import jieba
raw_documents = [
    '0无偿居间介绍买卖毒品的行为应如何定性',
    '1吸毒男动态持有大量毒品的行为该如何认定',
    '2如何区分是非法种植毒品原植物罪还是非法制造毒品罪',
    '3为毒贩贩卖毒品提供帮助构成贩卖毒品罪',
    '4将自己吸食的毒品原价转让给朋友吸食的行为该如何认定',
    '5为获报酬帮人购买毒品的行为该如何认定',
    '6毒贩出狱后再次够买毒品途中被抓的行为认定',
    '7虚夸毒品功效劝人吸食毒品的行为该如何认定',
    '8妻子下落不明丈夫又与他人登记结婚是否为无效婚姻',
    '9一方未签字办理的结婚登记是否有效',
    '10夫妻双方1990年按农村习俗举办婚礼没有结婚证 一方可否起诉离婚',
    '11结婚前对方父母出资购买的住房写我们二人的名字有效吗',
    '12身份证被别人冒用无法登记结婚怎么办？',
    '13同居后又与他人登记结婚是否构成重婚罪',
    '14未办登记只举办结婚仪式可起诉离婚吗',
    '15同居多年未办理结婚登记，是否可以向法院起诉要求离婚'
]
# Tokenize each document with jieba (precise mode)
texts = [list(jieba.cut(document, cut_all=False)) for document in raw_documents]
# texts=[]
# for document in raw_documents:
#     for word in jieba.cut(document):
#         texts.append(word)
# print(texts)

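# Build the dictionary (token -> integer id)
dictionary = corpora.Dictionary(texts)
print(dictionary)
# Build the corpus: one bag-of-words vector per document
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
# Train the TF-IDF model and apply it to the corpus
tfidf_model = models.TfidfModel(corpus)
corpus_tfidf = tfidf_model[corpus]
for item in corpus_tfidf:
    print(item)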
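
To show what the printed weights mean, here is a small plain-Python sketch of TF-IDF. It assumes the weighting tf × log2(N / df) followed by L2 normalization of each document vector, which is my understanding of Gensim's default TfidfModel settings; treat that as an assumption and check the Gensim documentation. The function name `tfidf` and the toy corpus are hypothetical.

```python
import math


def tfidf(corpus):
    """Compute L2-normalized TF-IDF weights for a bag-of-words corpus.

    corpus: list of documents, each a list of (token_id, count) pairs.
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents each token id occurs
    df = {}
    for doc in corpus:
        for token_id, _ in doc:
            df[token_id] = df.get(token_id, 0) + 1

    weighted = []
    for doc in corpus:
        vec = [(tid, count * math.log2(n_docs / df[tid])) for tid, count in doc]
        # Tokens that occur in every document get weight 0 and are dropped
        vec = [(tid, w) for tid, w in vec if w > 0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted


# Toy corpus: token 0 appears in both documents, tokens 1 and 2 in one each
toy_corpus = [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
result = tfidf(toy_corpus)
for row in result:
    print(row)
```

In the toy corpus, token 0 occurs in every document, so its weight is zero and it disappears from the output. The same effect explains why very common words carry little weight in the legal-question corpus.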

# -*- coding:utf-8 -*-
import spacy

# Load the English model by its full package name; the 'en' shortcut is
# unreliable and was removed in spaCy 3. This model does not handle Chinese.
nlp = spacy.load("en_core_web_sm")

# 1. Tokenization
text = "I love coco! 5G is coming."
test_words = nlp(text)
print(8 * "*", "Tokenization", 8 * "*")
for word in test_words:
    print(word)

# 2. Named entity recognition
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    # Entity text, character offsets in the sentence, and entity type
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
