python Python–使用附魔过滤文本

Python–使用附魔过滤文本

原文:https://www . geesforgeks . org/python-过滤-文本-使用-附魔/

Enchant是 Python 中的一个模块，用于检查单词的拼写，给出纠正单词的建议。同时，给出了单词的反义词和同义词。它检查字典中是否有一个单词。

Enchant还提供了 enchant.tokenize 模块对文本进行标记化。标记化包括从正文中拆分单词。但有时并不是所有的单词都需要进行标记。假设我们正在进行拼写检查，那么通常会忽略电子邮件地址和网址。这可以通过使用过滤器修改令牌化过程来实现。目前实施的过滤器有:

电子邮件过滤器
URL filter(URL 过滤)
WikiWordFilter

示例 1 : 电子邮件过滤器

# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import EmailFilter

# the text to be tokenized
text = "The email is abc@gmail.com"

# getting tokenizer class
tokenizer = get_tokenizer("en_US")

# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)

# getting tokenizer class with filter
tokenizer_filter = get_tokenizer("en_US", [EmailFilter])

# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)

输出:

不过滤打印令牌: [('The '，0)、(' email '，4)、(' is '，10)、(' abc '，13)、(' gmail '，17)、(' com '，23)]

过滤后打印令牌: [('The '，0)、(' email '，4)、(' is '，10)

示例 2 : URLFilter

# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import URLFilter

# the text to be tokenized
text = "This is an URL: https://www.geeksforgeeks.org/"

# getting tokenizer class
tokenizer = get_tokenizer("en_US")

# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)

# getting tokenizer class with filter
tokenizer_filter = get_tokenizer("en_US", [URLFilter])

# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)

输出:

不过滤打印令牌: [('This '，0)、(' is '，5)、(' an '，8)、(' URL '，11)、(' https '，16)、(' www '，24)、(' geeksforgeeks '，28)、(' org '，42)]

过滤后打印令牌: [('This '，0)、(' is '，5)、(' an '，8)、(' URL '，11)]

例 3:WikiWord filter WikiWord 是由两个或两个以上以大写字母开头的单词组成的单词，一起运行。

# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import WikiWordFilter

# the text to be tokenized
text = "VersionFiveDotThree is an example of WikiWord"

# getting tokenizer class
tokenizer = get_tokenizer("en_US")

# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)

# getting tokenizer class with filter
tokenizer_filter = get_tokenizer("en_US", [WikiWordFilter])

# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)

输出:

在不过滤的情况下打印令牌: [(' versionivedotthree '，0)、(' is '，20)、(' an '，23)、(' example '，26)、(' of '，34)、(' WikiWord '，37)]

过滤后打印令牌: [('is '，20)、(' an '，23)、(' example '，26)、(' of '，34)]

版权属于：月萌API www.moonapi.com，转载请注明出处

本文链接：https://moonapi.com/news/5844.html

python 查看更多书籍

《GeeksForGeeks Python 中文教程 2022-06-03》

分类

最近更新

python Python–使用附魔过滤文本

Python–使用附魔过滤文本

留言

联系客服

数据知识

系统公告

开发文档

python查看更多书籍

《GeeksForGeeks Python 中文教程 2022-06-03》

python Python–使用附魔过滤文本

Python–使用附魔过滤文本

留言

联系客服

python 查看更多书籍