Practice parsing text in NLP with Python

Learn the basic concepts behind natural language processing.

Image by:

WOCinTech Chat. Modified by Opensource.com. CC BY-SA 4.0

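Natural language processing (NLP) is a specialized field for analyzing and generating human languages. Human language, aptly called natural language, is highly context-sensitive and often ambiguous. (Remember the joke where the wife asks the husband to "get a carton of milk and if they have eggs, get six," so he gets six cartons of milk because they had eggs.) NLP provides the ability to comprehend natural language input and produce natural language output appropriately.

Computational linguistics (CL) is the larger field of language understanding and modeling. NLP is a subset of CL that deals with the engineering aspects of language understanding and generation. NLP is an interdisciplinary domain that touches on multiple fields, including artificial intelligence (AI), machine learning (ML), deep learning (DL), mathematics, and statistics.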

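Some of the applications you can build with NLP include

Machine translation: With more than 6,000 languages in the world, NLP combined with neural machine translation can ease translating text from one language to another.
Chatbots: Personal assistants like Alexa, Siri, and the open source Mycroft are part of our lives today. NLP is at the core of these chatbots, helping machines analyze, learn, and understand speech and respond by voice.
Voice enablement: NLP makes it possible to serve customers in healthcare, travel, retail, and other industries in a friendly way.
Sentiment analysis: Businesses always want a finger on the customer's pulse so they can act as soon as they sense discontent. NLP makes this possible.
HR productivity: Human resources professionals handle a mountain of documents, and NLP can lighten part of the load through document process automation.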

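NLP building blocks

Just as a skyscraper is built brick by brick, you can build large applications like the ones above from the basic and essential building blocks of NLP.

Several open source NLP libraries are available, such as spaCy and Gensim in Python, and Stanford CoreNLP, Apache OpenNLP, and GateNLP in Java and other languages.

To demonstrate what the NLP building blocks do, I'll use Python and its primary NLP library, Natural Language Toolkit (NLTK). NLTK was created at the University of Pennsylvania. It is a widely used and convenient starting point for getting into NLP. After learning its concepts, you can explore other libraries to build your "skyscraper" NLP applications.

The fundamental building blocks covered in this article are

Tokenization into sentences and words
Stopwords
Collocations
Parts of speech identification
Stemming and lemmatization
Corpus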

Setup

This article assumes you are familiar with Python. Once you have Python installed, download and install NLTK

pip install nltk

Then install NLTK Data

python -m nltk.downloader popular

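If you have lots of storage space and good bandwidth, you can also use python -m nltk.downloader all. See NLTK's installation page for help.

There is also a user interface for selecting the data to download, which you can start from the Python shell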

Python 3.8.2 ...
Type "help", ...

>>> import nltk
>>> nltk.download()

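Tokenization into sentences and words

The first step in text analysis and processing is to split the text into sentences and words, a process called tokenization. Tokenizing a text makes further analysis easier. Almost all text analysis applications start with this step.

Here are some examples with this line of text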

text = "Computers don't speak English. So, we've to learn C, C++, ,C#, Java, Python and the like! Yay!"

Sentence tokenization

from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
print(len(sentences), 'sentence(s):', sentences)

Word tokenization

from nltk.tokenize import word_tokenize
words = word_tokenize(text)
print(len(words), 'word(s):', words)

29 word(s): ['Computers', 'do', "n't", 'speak', 'English', '.', 'So', ',', 'we', "'ve", 'to', 'learn', 'C', ',', 'C++', ',', ',', 'C', '#', ',', 'Java', ',', 'Python', 'and', 'the', 'like', '!', 'Yay', '!']

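NLTK uses regular expressions internally for tokenization. A keen reader may ask whether you can tokenize without NLTK. Yes, you can. However, NLTK is designed to handle many of the variations found in real text; for example, something like nltk.org should remain one word ['nltk.org'] rather than ['nltk', 'org']

text = "I love nltk.org"

If you tokenize using the code above, nltk.org is kept as one word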

1 sentence(s): ['I love nltk.org']
3 word(s): ['I', 'love', 'nltk.org']
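
To see why a hand-rolled tokenizer falls short, here is a minimal sketch that uses only Python's re module. The function name naive_tokenize and its pattern are assumptions made for illustration, not NLTK's actual implementation; the pattern simply treats runs of word characters and single punctuation marks as tokens.

```python
import re

def naive_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical helper for illustration; this is not how NLTK tokenizes.
    """
    return re.findall(r"\w+|[^\w\s]", text)

tokens = naive_tokenize("I love nltk.org")
print(tokens)  # ['I', 'love', 'nltk', '.', 'org']
```

The domain name is broken into three tokens here, which is the kind of case word_tokenize takes care of for you.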

NLTK does not offer a way to replace contractions like "don't" with "do not" or "we've" with "we have," but the pycontractions library can help.
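
If you only need to handle a handful of contractions, a lookup table may be enough. The sketch below is an assumption-laden illustration using only the standard library; the CONTRACTIONS mapping and the expand_contractions function are hypothetical names and do not reflect the pycontractions API.

```python
import re

# Hypothetical, deliberately tiny mapping; a real list would be much longer.
CONTRACTIONS = {"don't": "do not", "we've": "we have"}

# One pattern that matches any known contraction as a whole word
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(0).lower()], text)

expanded = expand_contractions("Computers don't speak English. So, we've to learn C.")
print(expanded)  # Computers do not speak English. So, we have to learn C.
```

Note that this simple version always emits the lowercase expansion, so a sentence-initial "Don't" would become "do not".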

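Try it yourself

Using Python libraries, download Wikipedia's page on open source and tokenize the text.

Stopwords

A language like English has many "fluff" words (technically called "stopwords") that are necessary in speech and writing but carry no value in analysis. NLTK can identify and remove these stopwords to help text processing focus on the words that matter.

See the words considered stopwords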

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(len(stop_words), "stopwords:", stop_words)

179 stopwords: ['i', 'me', 'my', 'myself', 'we', ..., "wouldn't"]

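Tokenize the text first and then filter out the stopwords. The contractions are written out in full here (for example, "do not" rather than "don't") so that the stopword filter catches them

text = "Computers do not speak English. So, we have to learn C, C++, Java, Python and the like! Yay!"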

from nltk.tokenize import word_tokenize
words = word_tokenize(text)

print(len(words), "words in original text:", words)

25 words in original text: ['Computers', 'do', 'not', 'speak', 'English', '.', 'So', ',', 'we', 'have', 'to', 'learn', 'C', ',', 'C++', ',', 'Java', ',', 'Python', 'and', 'the', 'like', '!', 'Yay', '!']

words = [word for word in words if word not in stop_words]
print(len(words), "words without stopwords:", words)

18 words without stopwords: ['Computers', 'speak', 'English', '.', 'So', ',', 'learn', 'C', ',', 'C++', ',', 'Java', ',', 'Python', 'like', '!', 'Yay', '!']

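The text still has punctuation marks, which add noise. To remove them, use Python's string module. Keep in mind that some punctuation marks, such as the question mark, can be significant for analysis. The following approach removes punctuation without using NLTK.

See the characters considered punctuation marks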

import string
punctuations = list(string.punctuation)
print(punctuations)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']

Remove the punctuation

words = [word for word in words if word not in punctuations]
print(len(words), "words without stopwords and punctuations:", words)

11 words without stopwords and punctuations: ['Computers', 'speak', 'English', 'So', 'learn', 'C', 'C++', 'Java', 'Python', 'like', 'Yay']

Try it yourself

Using Python libraries, download Wikipedia's page on open source and remove the stopwords. What percentage of the page is stopwords?
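
One possible way to compute that percentage is sketched below. The function name stopword_percentage and the sample data are hypothetical; with the real page you would pass in the output of word_tokenize and stopwords.words('english') instead. Unlike the filter shown earlier, this sketch lowercases each token before the lookup, which is an assumption about how you want to count words such as "So".

```python
def stopword_percentage(tokens, stop_words):
    """Return the share of tokens (0-100) that are stopwords, ignoring case."""
    if not tokens:
        return 0.0
    stop_set = set(stop_words)  # set lookup is faster than scanning a list
    hits = sum(1 for token in tokens if token.lower() in stop_set)
    return 100.0 * hits / len(tokens)

# Hypothetical sample data standing in for the tokenized page
sample_tokens = ["Open", "source", "is", "a", "way", "of", "sharing", "code"]
sample_stop_words = ["is", "a", "of", "the"]
print(stopword_percentage(sample_tokens, sample_stop_words))  # 37.5
```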

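Collocations

Collocations are two (or more) words that tend to appear together frequently. Collocations help in understanding text formation and aid in text search and similarity comparison.

Use a longer text file from Project Gutenberg for this example. (Project Gutenberg is an initiative to digitize books.)

Download the text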

# coding: utf-8

import urllib.request

# Download text and decode
# Note: Set proxy if behind a proxy (https://docs.pythonlang.cn/2/library/urllib.html)
url = "http://www.gutenberg.org/files/1342/1342-0.txt"
text = urllib.request.urlopen(url).read().decode()
print(text)

The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen
This eBook is for the use of anyone anywhere at no cost and with
...
      Chapter 1
      It is a truth universally acknowledged, that a single man in
      possession of a good fortune
...
      bringing her into Derbyshire, had been the means of
      uniting them.

Preprocess (tokenize, remove stopwords, and remove punctuation)

# Tokenize
from nltk.tokenize import word_tokenize
text = word_tokenize(text)

# Remove stopwords
from nltk.corpus import stopwords
stops = stopwords.words('english')
# print(stops)
words = [word for word in text if word not in stops]

# Remove punctuations
import string
punctuations = list(string.punctuation)
# print(punctuations)

words = [word for word in words if word not in punctuations]
print("Preprocessed:", words)

Preprocessed: ['The', 'Project', 'Gutenberg', 'EBook', 'Pride', 'Prejudice', 'Jane', 'Austen', ...

Bigrams (two words that appear together)

# Bigrams
from nltk.metrics import BigramAssocMeasures
from nltk.collocations import BigramCollocationFinder
bigram_collocation = BigramCollocationFinder.from_words(words)
# Top 10 collocations, ranked by likelihood ratio
print("Bigrams:", bigram_collocation.nbest(BigramAssocMeasures.likelihood_ratio, 10))

Bigrams: [('”', '“'), ('Mr.', 'Darcy'), ('Lady', 'Catherine'), ('”', 'said'), ('Mrs.', 'Bennet'), ('Mr.', 'Collins'), ('Project', 'Gutenberg-tm'), ('“', 'I'), ('Sir', 'William'), ('Miss', 'Bingley')]

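Keen readers may observe that the curly double-quote characters “ (code point 8220) and ” (code point 8221) still appear in the text after punctuation removal. string.punctuation contains only the standard double quote " (code point 34), so these different characters are not detected. To handle them, add these characters to the punctuations list.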
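
A minimal sketch of that fix follows. The sample token list is hypothetical and only stands in for the preprocessed novel text; the two escape sequences are the curly quotes at code points 8220 and 8221.

```python
import string

# ASCII punctuation plus the two curly double quotes (code points 8220 and 8221)
punctuations = list(string.punctuation) + ["\u201c", "\u201d"]

# Hypothetical token sample resembling the preprocessed novel text
sample = ["\u201c", "I", "hope", "\u201d", "said", "Mr.", "Darcy", ","]
cleaned = [word for word in sample if word not in punctuations]
print(cleaned)  # ['I', 'hope', 'said', 'Mr.', 'Darcy']
```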

Trigrams (three words that appear together)

# Trigrams
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures
# Use the preprocessed words, not the raw tokens in text
trigram_collocation = TrigramCollocationFinder.from_words(words)
# Top 10 collocations, ranked by likelihood ratio
print("Trigrams:", trigram_collocation.nbest(TrigramAssocMeasures.likelihood_ratio, 10))

Trigrams: [('late', 'Mr.', 'Darcy'), ('Mr.', 'Darcy', 'returned'), ('saw', 'Mr.', 'Darcy'), ('friend', 'Mr.', 'Darcy'), ('Mr.', 'Darcy', 'walked'), ('civility', 'Mr.', 'Darcy'), ('Mr.', 'Darcy', 'looked'), ('said', 'Mr.', 'Darcy'), ('surprised', 'Mr.', 'Darcy'), ('Mr.', 'Darcy', 'smiled')]

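"Mr. Darcy" is almost everywhere! You can infer that he is the protagonist of the novel. This is an example of information extraction using NLP.

Try it yourself

Using Python libraries, download Wikipedia's page on open source. You can hypothesize that "open source" is the most frequent bigram and "open source code" is the most frequent trigram. See if you can confirm this.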
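
Because this exercise asks about raw frequency rather than likelihood ratio, one way to check it is to count n-grams directly. This is a sketch using only the standard library; ngram_counts is a hypothetical helper and the sample sentence is made up for illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the list yields each consecutive n-tuple
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Hypothetical sample text; replace with the preprocessed page tokens
tokens = "open source code is open source software and open source code".split()

top_bigram = ngram_counts(tokens, 2).most_common(1)
top_trigram = ngram_counts(tokens, 3).most_common(1)
print(top_bigram)   # [(('open', 'source'), 3)]
print(top_trigram)  # [(('open', 'source', 'code'), 2)]
```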

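Parts of speech identification

NLTK can identify words' parts of speech (POS). Identifying POS is necessary because a word has different meanings in different contexts. The word "code" as a noun could mean "a system of words for the purposes of secrecy" or "program instructions," and as a verb, it could mean "convert a message into secret form" or "write instructions for a computer." This context awareness is necessary for correct text comprehension.

Here is an example using this text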

text = "Computers don't speak English. So, we've to learn C, C++, Java, Python and the like! Yay!"

Tokenize the text as you did before

import nltk
from nltk.tokenize import word_tokenize

words = word_tokenize(text)

Identify the POS tags

pos_tagged_text = nltk.pos_tag(words)
print(pos_tagged_text)

[('Computers', 'NNS'), ('do', 'VBP'), ("n't", 'RB'), ('speak', 'VB'), ('English', 'NNP'), ('.', '.'), ('So', 'RB'), (',', ','), ('we', 'PRP'), ("'ve", 'VBP'), ('to', 'TO'), ('learn', 'VB'), ('C', 'NNP'), (',', ','), ('C++', 'NNP'), (',', ','), ('Java', 'NNP'), (',', ','), ('Python', 'NNP'), ('and', 'CC'), ('the', 'DT'), ('like', 'JJ'), ('!', '.'), ('Yay', 'NN'), ('!', '.')]

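NNS, VBP, etc. are POS codes from the Penn Treebank tag set, defined at the University of Pennsylvania, and you can also see them programmatically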

nltk.help.upenn_tagset()

NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
...

You can look up the definition of the POS tag assigned to each word in the sentence

for pos_tag_word in pos_tagged_text:
    print(pos_tag_word[0], ":")
    nltk.help.upenn_tagset(pos_tag_word[1])

Computers :
NNS: noun, common, plural
	...
do :
VBP: verb, present tense, not 3rd person singular
	...
n't :
RB: adverb
	...
speak :
VB: verb, base form
	...
English :
NNP: noun, proper, singular
	...
. :
.: sentence terminator
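
Once you have the (word, tag) pairs, a common next step is to group words by tag. The sketch below uses only the standard library and a subset of the pairs copied from the pos_tag output shown earlier; the variable name words_by_tag is an assumption.

```python
from collections import defaultdict

# A subset of the (word, tag) pairs from the pos_tag output shown earlier
pos_tagged_text = [
    ("Computers", "NNS"), ("do", "VBP"), ("n't", "RB"), ("speak", "VB"),
    ("English", "NNP"), (".", "."), ("So", "RB"), (",", ","), ("we", "PRP"),
    ("'ve", "VBP"), ("to", "TO"), ("learn", "VB"), ("C", "NNP"), ("C++", "NNP"),
    ("Java", "NNP"), ("Python", "NNP"),
]

# Collect the words under each tag, preserving their order in the text
words_by_tag = defaultdict(list)
for word, tag in pos_tagged_text:
    words_by_tag[tag].append(word)

print(words_by_tag["NNP"])  # ['English', 'C', 'C++', 'Java', 'Python']
print(words_by_tag["VB"])   # ['speak', 'learn']
```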

Try it yourself

Using Python libraries, download Wikipedia's page on open source and identify the POS of all words in the text.

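Stemming and lemmatization

Words are typically inflected (e.g., letters suffixed, affixed, etc.) to express their forms (e.g., plural, tense, etc.). Dog -> Dogs is an example of inflection. Usually, words must be compared in their native forms for effective text matching.

Stemming and lemmatization are two methods to convert a word to a non-inflected form. The essence of both is the same: to reduce a word to its most native form. But they differ in how they do it.

Stemming uses a simple mechanism that removes or modifies inflections to form the root word, but the root word may not be a valid word in the language.
Lemmatization also removes or modifies the inflections to form the root word, but the root word is a valid word in the language.

Lemmatization uses a word dataset (called a corpus, discussed in the next section) to arrive at root words; hence, it is slower than stemming. In some cases stemming will suffice, and in others lemmatization will be required.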
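
To make the first point concrete, here is a toy suffix-stripping stemmer. It is a deliberately crude sketch, not the algorithm used by any NLTK stemmer; the SUFFIX_RULES table and the naive_stem function are hypothetical and exist only to show how simple rules can yield a root that is not a real word.

```python
# Hypothetical suffix rules, checked in order; real stemmers use far more
# careful rule sets than this.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]

def naive_stem(word):
    """Strip the first matching suffix, keeping at least a two-letter stem."""
    lowered = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 2:
            return lowered[: len(lowered) - len(suffix)] + replacement
    return lowered

for word in ["Dogs", "building", "studies"]:
    print(word, "->", naive_stem(word))
# Dogs -> dog
# building -> build
# studies -> studi
```

The last result, "studi," is not an English word, whereas a lemmatizer would be expected to return "study."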

NLTK has several stemmers and lemmatizers, including RegexpStemmer, LancasterStemmer, PorterStemmer, WordNetLemmatizer, and RSLPStemmer; see the nltk.stem package for the full set.

To compare them, try out PorterStemmer and WordNetLemmatizer.

Create an instance of PorterStemmer

import nltk
stemmer = nltk.stem.PorterStemmer()

Stem the word "building"

word = "building"
print("Stem of", word, ":", stemmer.stem(word))

Stem of building : build

Stemming has no POS awareness, so the word "building," in noun or verb form, is stemmed to "build."

This is not the case with lemmatization using WordNetLemmatizer

lemmatizer = nltk.stem.WordNetLemmatizer()
word = "building"
pos = 'n'
print("Lemmatization of", word, "(" , pos, "):", lemmatizer.lemmatize(word, pos))
pos = 'v'
print("Lemmatization of", word, "(" , pos, "):", lemmatizer.lemmatize(word, pos))

Lemmatization of building ( n ): building
Lemmatization of building ( v ): build

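Lemmatization takes more time than stemming (only slightly more in this example, but noticeably so).

Try it yourself

Using Python libraries, download Wikipedia's page on open source, then preprocess the text and convert it to its native forms. Try it with various stemming and lemmatizing modules. Use Python's timeit module to measure their performance.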
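
One possible shape for the timing harness is sketched below with the standard library's timeit module. The time_normalizer helper and the two stand-in functions are hypothetical; to compare the real implementations, swap in stemmer.stem and lemmatizer.lemmatize.

```python
import timeit

def time_normalizer(normalize, words, number=5):
    """Return the total seconds taken to apply normalize to every word, number times."""
    return timeit.timeit(lambda: [normalize(word) for word in words], number=number)

# Hypothetical stand-ins; replace them with stemmer.stem and
# lemmatizer.lemmatize to compare the real NLTK implementations.
candidates = {
    "lowercase only": str.lower,
    "strip trailing s": lambda word: word[:-1] if word.endswith("s") else word,
}

words = ["Dogs", "buildings", "studies"] * 1000
timings = {name: time_normalizer(func, words) for name, func in candidates.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Timings vary from run to run and machine to machine, so compare the relative numbers rather than the absolute ones.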

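Corpus

A corpus in NLTK is a dataset of text. NLTK makes several corpora available. Corpora aid in text processing with out-of-the-box data. For example, a corpus of US presidents' inaugural addresses can help with the analysis and preparation of speeches.

Several corpus readers are available in NLTK. Depending on the text you are processing, you can choose the most appropriate one. The required corpus must be installed with NLTK Data (see the Setup section above).

There are several types of corpora that indicate the structure and type of data the corpus provides. The list of available corpora can be found in the nltk_data UI (see Setup).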

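A corpus is accessed through a reader. The reader to use depends on the type of corpus. For example, the Gutenberg corpus holds text in plain text format and is accessed with PlaintextCorpusReader. The Brown corpus has categorized, tagged text and is accessed with CategorizedTaggedCorpusReader. The reader classes form a tree-like hierarchy. Here are some corpora and their readers.

Figure: Various corpora and their associated readers (Opensource.com, CC BY-SA 4.0)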

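Here's how to access a corpus.

First, create a utility function to show corpus information based on the corpus reader type

import nltk

def corpus_info(corpus):
    # Print the reader, README, and file list, then details of the first file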
    print(corpus)
    print()
    print("README:", corpus.readme())
    print()
    files = corpus.fileids()
    print(len(files), "files:")
    print(files)
    print()
    file = files[0]
    text = corpus.raw(file)
    print("File", file, len(corpus.paras(file)), "paras", len(corpus.sents(file)), "sentences", len(corpus.words(file)), "words", ":")
    print(text.encode("utf-8"))
    print()
    if isinstance(corpus, nltk.corpus.TaggedCorpusReader):
        tagged_words = corpus.tagged_words()
        print(len(tagged_words), "tags:")
        print(tagged_words)
        print()
    if isinstance(corpus, nltk.corpus.CategorizedTaggedCorpusReader):
        categories = corpus.categories()
        print(len(categories), "categories:")
        print(categories)
        print()
        category = categories[-1]
        files = corpus.fileids(category)
        print(len(files), "files in category", category, ":")
        print(files)
        print()
        file = files[0]
        print("File:", file, len(corpus.paras(file)), "paras", len(corpus.sents(file)), "sentences", len(corpus.words(file)), "words")
        print()
        print("Raw text:")
        text = corpus.raw(file)
        print(text)
        print()
        print("Tagged text:")
        tagged_words = corpus.tagged_words(file)
        print(tagged_words)
        print()

Here are two examples of corpora

ABC is a collection of news from the Australian Broadcasting Commission. This is a basic plaintext corpus

corpus_info(nltk.corpus.abc)

<PlaintextCorpusReader in '.../corpora/abc' (not loaded yet)>
    
README: b'Australian Broadcasting Commission 2006\nhttp://www.abc.net.au/\n\nContents:\n* Rural News    http://www.abc.net.au/rural/news/\n* Science News  http://www.abc.net.au/science/news/\n\n'
    
2 files:
['rural.txt', 'science.txt']
    
File rural.txt 2425 paras 13015 sentences 345580 words :
b'PM denies knowledge of AWB kickbacks\nThe Prime Minister has denied ...

The Brown corpus has approximately one million words of contemporary American English, compiled by Brown University

corpus_info(nltk.corpus.brown)

<CategorizedTaggedCorpusReader in '.../corpora/brown' (not loaded yet)>
    
README: BROWN CORPUS
A Standard Corpus of Present-Day Edited American
...
    
500 files:
['ca01', 'ca02', 'ca03', ...]
    
File ca01 67 paras 98 sentences 2242 words :
b"\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl ...
    
1161192 tags:
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
    
15 categories:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
    
6 files in category science_fiction :
['cm01', 'cm02', 'cm03', 'cm04', 'cm05', 'cm06']
    
File: cm01 57 paras 174 sentences 2486 words
    
Raw text:
Now/rb that/cs he/pps ...
    
Tagged text: 
[('Now', 'RB'), ('that', 'CS'), ('he', 'PPS'), ...]

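Imagine what you could do with such corpora at your disposal! For example, with the Brown corpus, you can train a model to categorize and tag text so that a chatbot can better understand human intent. You can also create your own corpus.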
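
As a rough sketch of that last point, PlaintextCorpusReader can be pointed at a directory of your own text files. The file names and contents below are hypothetical and are written to a temporary directory purely for demonstration; this example sticks to fileids() and words(), on the assumption that sentence-level access would additionally need the tokenizer data from the Setup section.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical file names and contents, written to a temporary directory
documents = {
    "greeting.txt": "Hello, corpus world!",
    "note.txt": "Open source is fun.",
}

with tempfile.TemporaryDirectory() as root:
    for name, content in documents.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write(content)

    # Treat every .txt file under root as one document of the corpus
    my_corpus = PlaintextCorpusReader(root, r".*\.txt")
    file_ids = my_corpus.fileids()
    greeting_words = list(my_corpus.words("greeting.txt"))

print(file_ids)        # ['greeting.txt', 'note.txt']
print(greeting_words)  # ['Hello', ',', 'corpus', 'world', '!']
```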