Python pseudo-original articles: near-synonym replacement

Introduction

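For search-engine SEO, many site operators need to produce articles quickly and add them to a website, hoping that search engines will index the pages quickly and rank them higher. In reality, even Li Bai could turn out a hundred poems only after finishing a dou of wine. What operators and editors need is a way to generate pseudo-original articles from existing articles in a short time.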

Approach:

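First, the Scan function gets the list of file names in the input folder. The Read function then reads each file and uses regular expressions to extract the title and the article body, dropping the parts that are not needed. Next, the jieba library segments the text into words, and words are replaced with near-synonyms depending on their part of speech. Finally, the rewritten title and body are written to a file. The run function is called under if __name__ == '__main__':, so the processing starts when the file is run directly.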

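The near-synonym replacement relies on the synonyms package. synonyms can be used for many natural-language-understanding tasks: text alignment, recommendation algorithms, similarity computation, semantic shift, keyword extraction, concept extraction, automatic summarization, search engines, and so on.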

Code walkthrough:

def Scan():
    dir_list = os.listdir('./demo_txt')
    return dir_list

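Scan() uses the listdir() method of the os module to get the names of all files and folders in the ./demo_txt directory (relative to the current working directory), stores those names in a list called dir_list, and returns that list.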
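Because os.listdir() returns sub-directories as well as files, a stray folder inside ./demo_txt would later be passed to open() and fail. The following is a minimal sketch of a stricter scan, assuming the inputs are .txt files; the function name scan_txt and the sample file names are made up for illustration and are not part of the original script.

```python
import os
import tempfile


def scan_txt(folder):
    """Return the sorted names of regular .txt files directly inside folder."""
    names = []
    for name in os.listdir(folder):
        full_path = os.path.join(folder, name)
        # os.listdir also returns sub-directories, so keep regular files only.
        if os.path.isfile(full_path) and name.endswith('.txt'):
            names.append(name)
    return sorted(names)


# Demonstration in a throwaway directory (the file names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, 'drafts'))
    for name in ('b.txt', 'a.txt', 'notes.md'):
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write('sample')
    print(scan_txt(tmp))  # ['a.txt', 'b.txt']
```

Sorting the names is optional; it only makes the processing order predictable from run to run.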

def Read(path):
    with open('./demo_txt/' + path, 'r', encoding='utf-8') as f:
        content = f.read()
    # The title is stored as <title=...>
    title_list = re.findall(r'<title=(.*?)>', content)
    title = title_list[0] if len(title_list) != 0 else None
    # The body is stored as <neirong=...>; drop newlines and opening <p> tags first
    article_list = re.findall(r'<neirong=([\s\S]*)>', (content.replace('\n', '')).replace('<p>', ''))
    article = article_list[0] if len(article_list) != 0 else None
    if title is None or article is None:
        return None, None  # avoid calling split() on None
    # Split on </p> and keep the non-empty paragraphs
    words_list = [string for string in article.split('</p>') if string != '']
    if len(words_list) > 0:
        return title, words_list
    return None, None
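Read() expects each input file to carry the title in a <title=...> marker and the body in a <neirong=...> marker whose paragraphs are wrapped in <p>...</p>. The sketch below applies the same two regular expressions to an in-memory sample, so the extraction can be followed without touching the disk. The function name parse and the sample text are illustrative assumptions, not part of the original script.

```python
import re

# Hypothetical sample in the layout the script reads and writes.
SAMPLE = """<title=Learning Python>
<neirong=
<p>First paragraph.</p>
<p>Second paragraph.</p>
>"""


def parse(content):
    """Return (title, paragraphs), or (None, None) if either part is missing."""
    title_match = re.search(r'<title=(.*?)>', content)
    # Same clean-up as Read(): remove newlines and opening <p> tags.
    flat = content.replace('\n', '').replace('<p>', '')
    body_match = re.search(r'<neirong=([\s\S]*)>', flat)
    if title_match is None or body_match is None:
        return None, None
    # Splitting on </p> leaves a trailing empty string, which is filtered out.
    paragraphs = [p for p in body_match.group(1).split('</p>') if p != '']
    if not paragraphs:
        return None, None
    return title_match.group(1), paragraphs


print(parse(SAMPLE))
# ('Learning Python', ['First paragraph.', 'Second paragraph.'])
```

The body pattern is greedy on purpose: it runs to the last > in the file, so the > characters inside the </p> tags do not cut the match short.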

def write(path,content):
    with open('./new_txt/' + path, 'a+', encoding='utf-8') as f:
        f.write(content)

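This function writes to a file. It takes two parameters, path and content: path is a file name given as a string, and content is the text to write.
The function opens the file with open() inside a with statement, which ensures the file is closed properly after use. open() receives the file path, the file mode, and the encoding. The mode used here is append ('a+'): the file is opened for appending and is created if it does not exist. The encoding is set to utf-8.
The file object is bound to the variable f, through which the file can be manipulated. f.write(content) writes content at the end of the file and advances the file position past the written text.
Finally, the full file path is built by concatenating './new_txt/' and path with the string + operator. When the function returns, content has been appended to the file at that path. Because the mode is append, running the script a second time adds another copy to an existing output file rather than replacing it.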
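The difference between appending and overwriting is easy to check in a temporary directory. This is a small sketch, not the original write() function: the helper names write_text and read_text, the mode parameter, and the file name out.txt are assumptions made for the demonstration.

```python
import os
import tempfile


def write_text(path, content, mode='a'):
    """Write content to path; mode 'a' appends, mode 'w' overwrites."""
    with open(path, mode, encoding='utf-8') as f:
        f.write(content)


def read_text(path):
    """Return the whole file as a string."""
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'out.txt')
    write_text(target, 'first run\n')
    write_text(target, 'second run\n')       # appended after the first line
    appended = read_text(target)
    write_text(target, 'fresh\n', mode='w')  # truncates the file, then writes
    overwritten = read_text(target)

print(repr(appended))     # 'first run\nsecond run\n'
print(repr(overwritten))  # 'fresh\n'
```

The script needs append mode because it builds each output file from several write() calls. One way to avoid duplicated output on a rerun would be to clear the output folder first.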

def words_change(words):  # take a sentence, return the rewritten sentence
    words_tuple = pseg.lcut(words)  # list of (word, part-of-speech flag) pairs
    print(words_tuple)  # debug output
    word_list = []
    for word, flag in words_tuple:
        if flag in ('a', 'ad', 'v'):  # part-of-speech check
            seg_list = (synonyms.nearby(word))[0]  # candidate words
            if len(seg_list) > 1:
                word = seg_list[1]  # otherwise keep the original word
        word_list.append(word)
    return "".join(word_list)

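This function replaces some of the words in the input sentence with near-synonyms. It first segments the sentence with pseg.lcut, which returns a list of (word, part-of-speech flag) pairs, and then iterates over those pairs. If the flag is 'a', 'ad' or 'v' (adjective, adverbial adjective or verb), it looks up candidates for the word with synonyms.nearby. If the candidate list has one entry or fewer, the original word is kept; otherwise the word is replaced with the second entry in the list, which skips the first entry, normally the input word itself. Finally, all the processed words are joined into one string and returned.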
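The replacement rule can be tried without installing jieba or synonyms by feeding it pre-tagged words and a stubbed lookup. The sketch below makes several assumptions: FAKE_NEARBY and fake_nearby are hypothetical stand-ins that only mimic the (candidates, scores) shape used above, the English words and their flags are toy data, and the sep parameter exists only so the English demo prints with spaces.

```python
# Hypothetical stand-in for synonyms.nearby: word -> (candidates, scores).
FAKE_NEARBY = {
    'quick': (['quick', 'fast', 'rapid'], [1.0, 0.8, 0.7]),
    'jumps': (['jumps'], [1.0]),
}

# Part-of-speech flags that are eligible for replacement.
REPLACEABLE_FLAGS = {'a', 'ad', 'v'}


def fake_nearby(word):
    """Return (candidates, scores); empty lists for unknown words."""
    return FAKE_NEARBY.get(word, ([], []))


def replace_tagged(tagged_words, nearby=fake_nearby, sep=''):
    """Replace eligible words with the second candidate, if there is one."""
    result = []
    for word, flag in tagged_words:
        if flag in REPLACEABLE_FLAGS:
            candidates = nearby(word)[0]
            if len(candidates) > 1:
                word = candidates[1]  # index 0 is the word itself in this stub
        result.append(word)
    return sep.join(result)


tagged = [('the', 'r'), ('quick', 'a'), ('fox', 'n'), ('jumps', 'v')]
print(replace_tagged(tagged, sep=' '))  # the fast fox jumps
```

Passing the lookup in as a parameter keeps the rule testable. In the real script the equivalent call would be synonyms.nearby, and the tagged pairs would come from pseg.lcut.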

Full code

# coding=utf-8
import os, re
import synonyms
from jieba import posseg as pseg


def Scan():
    dir_list = os.listdir('./demo_txt')
    return dir_list


def Read(path):
    with open('./demo_txt/' + path, 'r', encoding='utf-8') as f:
        content = f.read()
    # The title is stored as <title=...>
    title_list = re.findall(r'<title=(.*?)>', content)
    title = title_list[0] if len(title_list) != 0 else None
    # The body is stored as <neirong=...>; drop newlines and opening <p> tags first
    article_list = re.findall(r'<neirong=([\s\S]*)>', (content.replace('\n', '')).replace('<p>', ''))
    article = article_list[0] if len(article_list) != 0 else None
    if title is None or article is None:
        return None, None  # avoid calling split() on None
    # Split on </p> and keep the non-empty paragraphs
    words_list = [string for string in article.split('</p>') if string != '']
    if len(words_list) > 0:
        return title, words_list
    return None, None


def write(path,content):
    with open('./new_txt/' + path, 'a+', encoding='utf-8') as f:
        f.write(content)


def words_change(words):  # take a sentence, return the rewritten sentence
    words_tuple = pseg.lcut(words)  # list of (word, part-of-speech flag) pairs
    print(words_tuple)  # debug output
    word_list = []
    for word, flag in words_tuple:
        if flag in ('a', 'ad', 'v'):  # part-of-speech check
            seg_list = (synonyms.nearby(word))[0]  # candidate words
            if len(seg_list) > 1:
                word = seg_list[1]  # otherwise keep the original word
        word_list.append(word)
    return "".join(word_list)


def run():
    # open() does not create missing directories, so make sure the output folder exists
    os.makedirs('./new_txt', exist_ok=True)
    dir_list = Scan()
    for path in dir_list:
        title, words_list = Read(path)
        if title is not None and words_list is not None:
            title = words_change(title)
            write(path, '<title={}>'.format(title) + '\n')
            write(path, '<neirong=')
            for words in words_list:
                word = words_change(words)
                write(path, '\n<p>' + word + '</p>')
            write(path, '>')


if __name__ == '__main__':
    run()