堀　浩一

A program that extracts haiku- and tanka-like word sequences (5-7-5, 5-7-5-7-7, and so on) from text

(Programming diary of a doddering old AI researcher)

October 17, 2022

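There is something I want to try, and as a warm-up for it I wrote this little program for fun.
It extracts word sequences that resemble haiku, tanka, or dodoitsu (575, 57577, 7775, and so on) from whatever plain text you have on hand.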

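A former student told me afterwards that bots doing this kind of thing already exist. But, as that former student also said, it might be nice to be able to enjoy it on your own text, so I am publishing my program here.
Younger people could no doubt write more elegant code. Mine is not so much elegant (華麗, karei) as showing its age (加齢, also karei), which is embarrassing. Still, not many colleagues of my generation are still coding, so I thought it might help lure in those of you who are on your way to becoming old-timers. Ha ha. :-)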

 
import sys

import MeCab

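def rhythmicDocument(rhythm_pattern, document):
    """Extract phrases from document whose reading lengths match rhythm_pattern.

    rhythm_pattern is a list of ints such as [5, 7, 5]. The matched phrases
    are returned one per line, with a blank line after each complete set.
    """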
    tagger = MeCab.Tagger()
    tagger.parse("") #parse once a null string to avoid the error caused by garbage collection
    node = tagger.parseToNode(document)
    buffered_node = node
    rhythmic_text = ''
    while node:
        node = buffered_node
        # print('buffered_node starts with', buffered_node.surface)
        first_success = False
        for pattern in rhythm_pattern:
            # print('Now Searching a phrase whose length = ', pattern)
            candidate_phrase = ''
            candidate_length = 0
            success = False
            abandon = False
            head = True
            temp_node = node
            while not success:
                tango = temp_node.surface  # tango: the word as written
                meta = temp_node.feature.split(",")
                hinshi = meta[0]  # hinshi: part of speech
                # print('Now checking the word: ', tango)
                # print('  Hinshi is ', hinshi)
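                # With an IPAdic-style dictionary, a known word has 9 feature
                # fields and the last one is its katakana pronunciation (yomi).
                if len(meta) > 8:
                    yomi = meta[-1]
                    # nagasa: length in sound units. Small kana such as ャ, ュ, ョ
                    # merge with the preceding kana, so they are not counted.
                    nagasa = len(yomi) - sum(yomi.count(c) for c in 'ャュョァィゥェォヮ')
                else:
                    abandon = True
                    # print('Abandon because of no yomi.')
                # print('  Nagasa is ', nagasa)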
                if hinshi == "記号":
                    abandon = True
                    # print('Abandon because this is 記号.')
                if head:
                    # A phrase must start with a content word: noun, verb,
                    # adjective, adjectival noun, or adverb.
                    if hinshi not in ["名詞", "動詞", "形容詞", "形容動詞","副詞"]:
                        abandon = True
                        # print('Abandon')
                if not abandon:
                    if nagasa <= pattern:
                        candidate_phrase += tango
                        candidate_length += nagasa
                        # print('    candidate_phrase = ', candidate_phrase)
                        # print('    candidate_length = ', candidate_length)
                        head = False
                        if candidate_length == pattern:
                            success = True
                            if not first_success:
                                first_success = True
                                buffered_node = node.next
                            node = temp_node.next
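                            # If the phrase ends with a small っ, borrow the first
                            # character of the next word, when there is one.
                            if candidate_phrase[-1] == 'っ' and node and node.surface:
                                candidate_phrase += node.surface[0]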
                            rhythmic_text += candidate_phrase + "\n"
                            # print('SUCCESS! The new phrase is ', candidate_phrase)
                            break
                        elif candidate_length > pattern:
                            abandon = True
                            # print('ABANDON!')
                    else:
                        abandon = True
                        # print('Abandon!')
                if abandon:
                    candidate_phrase = ''
                    candidate_length = 0
                    node = node.next
                    temp_node = node
                    head = True
                    abandon = False
                else:
                    temp_node = temp_node.next
                    if not temp_node:
                        candidate_phrase = ''
                        candidate_length = 0
                        node = node.next
                        temp_node = node
                        head = True
                        abandon = False
                if not node:
                    return rhythmic_text
        rhythmic_text += '\n'
    return rhythmic_text


def convert(patternfilename, documentfilename):
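    # Assumption: both input files are UTF-8 encoded.
    with open(documentfilename, "r", encoding="utf-8") as documentfile:
        document = documentfile.read()
    with open(patternfilename, "r", encoding="utf-8") as rhythmfile:
        rhythm_spec = rhythmfile.readline().strip()
    try:
        rhythm_pattern = [int(ch) for ch in rhythm_spec]
    except ValueError:
        print('An error occurred while reading the rhythm spec file.')
        print('The rhythm specification should consist of only digits.')
        sys.exit(1)
    if not rhythm_pattern:
        # An empty pattern would make the search loop forever.
        print('The rhythm spec file is empty.')
        sys.exit(1)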

    rhythmic_document = rhythmicDocument(rhythm_pattern, document)

    return rhythmic_document


def convertfile(patternfilename, documentfilename, outputfilename):
    rhythmic_document = convert(patternfilename, documentfilename)
    with open(outputfilename, "w", encoding="utf-8") as outputfile:
        outputfile.write(rhythmic_document)
        outputfile.write('\n')
    

if __name__ == '__main__':
    argc = len(sys.argv)
#    print('argc = ', argc)
    if argc == 4 :
        patternfilename = sys.argv[1]
        documentfilename = sys.argv[2]
        outputfilename = sys.argv[3]
        convertfile(patternfilename, documentfilename, outputfilename)
    else:
        print('usage: python3 thistestprogram.py pattern_file input_file output_file')
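
The length check in the program comes down to counting sound units in a word's katakana reading. The following is a minimal standalone sketch of that one step. The name `count_morae` and the exact set of small kana are assumptions made for this illustration; it needs no MeCab and can be tried on its own.

```python
# Small kana that combine with the preceding kana into one sound unit.
# Assumption: this set is illustrative. ッ and ー are deliberately absent,
# because each of them counts as a unit of its own.
SMALL_KANA = "ャュョァィゥェォヮ"


def count_morae(yomi: str) -> int:
    """Return the number of sound units in a katakana reading."""
    return sum(1 for ch in yomi if ch not in SMALL_KANA)


if __name__ == "__main__":
    for yomi in ["ニンゲン", "キョウ", "ネット", "ファイル"]:
        print(yomi, count_morae(yomi))
```

Counting this way treats キョ as one unit and ッ as a full unit, which is how haiku are usually counted.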

Installation:
python3, MeCab, and mecab-python3 must be installed.
Instructions are easy to find; there are plenty on the internet.
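
One thing worth checking after installation is which dictionary MeCab is using, because the program reads the pronunciation from the comma-separated feature string of each word. The sketch below shows that step in isolation, assuming an IPAdic-style layout with nine fields whose last field is the katakana pronunciation. `reading_from_feature` is a hypothetical helper, and the sample strings are hand-written for illustration rather than captured MeCab output. Other dictionaries such as UniDic lay out their fields differently, so the indices would probably need adjusting there.

```python
from typing import Optional, Tuple


def reading_from_feature(feature: str) -> Optional[Tuple[str, str]]:
    """Return (part of speech, reading), or None when no reading is present.

    Assumes the IPAdic layout: nine comma-separated fields, the first being
    the part of speech and the last the katakana pronunciation.
    """
    meta = feature.split(",")
    if len(meta) < 9:
        return None  # unknown words typically carry fewer fields
    return meta[0], meta[-1]


if __name__ == "__main__":
    # Hand-written sample in the assumed IPAdic layout.
    print(reading_from_feature("名詞,一般,*,*,*,*,人間,ニンゲン,ニンゲン"))
    # A word without a reading (fewer fields) is skipped.
    print(reading_from_feature("名詞,一般,*,*,*,*,*"))
```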

Usage:
First, copy the program above into a file with a name such as thistestprogram.py.

Next, prepare a file that specifies the pattern, such as 575 or 57577. For example:


% echo "575" > pattern575.txt
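
The pattern file holds a single string of digits, and each digit is the length of one line. As a rough sketch of how such a string can be turned into a list of lengths, here is a small standalone function. The name `parse_rhythm_spec` and its error message are assumptions made for this illustration.

```python
from typing import List


def parse_rhythm_spec(spec: str) -> List[int]:
    """Turn a digit string such as "575" into [5, 7, 5]."""
    spec = spec.strip()
    # str.isdecimal() also accepts full-width digits such as ５.
    if not spec or not all(ch.isdecimal() for ch in spec):
        raise ValueError("the rhythm spec must consist of digits only: %r" % spec)
    return [int(ch) for ch in spec]


if __name__ == "__main__":
    print(parse_rhythm_spec("575"))
    print(parse_rhythm_spec("57577\n"))
```

Because every digit is read as one line length, a line of ten or more sound units cannot be expressed in this format.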

Then run it as follows.


% python3 thistestprogram.py pattern575.txt filename_of_any_text_file_you_have temp_output_file.txt

Example results:
When I tried it on an old survey article of mine, I got the results below.

For 57577:

人間と
機械の両者
合わさって
構成されて
知的活動

文学も
人工知能
研究の
一部になって
人工知能

芸術も
人工知能
研究の
一部になって
人工知能

法学も
人工知能
研究の
一部になって
人工知能

For 7775:

製品の夢
人工知能
ネットビジネス
誰のため

相手の心
読み取ることは
自分の心
探索に

与えていない
心とは言え
言わばまぼろし
心なの
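
For anyone who wants to experiment with the idea behind these results without installing MeCab, here is a simplified sketch of the search on a hand-tokenized word list. It is a simplification rather than a port of the program above: every line must start exactly where the previous one ended, and there is no part-of-speech filtering. The names `find_verse` and `find_all` are hypothetical, and the sample words with their lengths were prepared by hand from Basho's well-known frog haiku.

```python
from typing import List, Optional, Tuple

Word = Tuple[str, int]  # (surface form, number of sound units)


def find_verse(words: List[Word], pattern: List[int], start: int) -> Optional[List[str]]:
    """Try to split words[start:] into consecutive lines matching pattern."""
    lines = []
    i = start
    for target in pattern:
        phrase, length = "", 0
        # Keep adding whole words until the line is full or words run out.
        while length < target and i < len(words):
            surface, units = words[i]
            phrase += surface
            length += units
            i += 1
        if length != target:
            return None  # ran out of words or overshot the target
        lines.append(phrase)
    return lines


def find_all(words: List[Word], pattern: List[int]) -> List[List[str]]:
    """Collect every verse found when starting from each word in turn."""
    found = []
    for start in range(len(words)):
        verse = find_verse(words, pattern, start)
        if verse is not None:
            found.append(verse)
    return found


if __name__ == "__main__":
    # Hand-tokenized sample; the unit counts are supplied by hand.
    sample = [("さて", 2), ("古池", 4), ("や", 1), ("蛙", 3),
              ("飛び込む", 4), ("水", 2), ("の", 1), ("音", 2)]
    for verse in find_all(sample, [5, 7, 5]):
        print("\n".join(verse))
```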

To the extent possible under law, the person who associated CC0 with this work has waived all copyright and related or neighboring rights to this work.