Koichi Hori

A small program that extracts rhythmic word sequences such as Tanka (57577) or Haiku (575) from plain text

(Diary of an Old AI Researcher who is still Programming)

17 Oct 2022

In preparation for my next studies, I have written a short program that extracts rhythmic word sequences such as Tanka or Haiku from plain text.
As you may know, a Japanese Haiku is composed of 5, 7, and 5 syllables (strictly speaking, morae, roughly one per kana), a Tanka of 5, 7, 5, 7, and 7, and a Dodoitsu of 7, 7, 7, and 5.
The program below extracts, from plain text, word sequences that may sound like Haiku, Tanka, Dodoitsu, and so on, according to the rhythm pattern specified by the user.
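
To make the counting concrete: the program below measures each word by the kana in its katakana reading, with small ャ, ュ, ョ merged into the preceding kana. The following is a minimal sketch of that rule in isolation. The function name count_kana_units and the sample readings are my own illustration, not part of the program.

```python
# Hypothetical helper illustrating the length rule used in the program:
# one unit per kana of the katakana reading, except that small ャ, ュ, ョ
# are merged with the kana before them.
SMALL_KANA = "ャュョ"


def count_kana_units(yomi):
    """Return the rhythmic length of a katakana reading."""
    return len(yomi) - sum(yomi.count(small) for small in SMALL_KANA)


if __name__ == "__main__":
    # Hand-written readings for 人工知能, 研究の, and 人間と.
    for yomi in ["ジンコウチノウ", "ケンキュウノ", "ニンゲント"]:
        print(yomi, count_kana_units(yomi))
```

Under this rule the three readings come out as 7, 5, and 5. Note that other small kana (for example the ァ in ファ) are counted as a full unit here, exactly as in the program. Adding them to SMALL_KANA would be a possible refinement, but it would change the results.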

My ex-student has informed me that a bot that tweets Tanka-like phrases already exists on the internet. Still, as my ex-student also says, it may be interesting to test our own program with our own texts, so I show my program here. (Sorry, the program works only for Japanese texts.)

 
import MeCab
import sys

def rhythmicDocument(rhythm_pattern, document):
    tagger = MeCab.Tagger()
    tagger.parse("") #parse once a null string to avoid the error caused by garbage collection
    node = tagger.parseToNode(document)
    buffered_node = node
    rhythmic_text = ''
    while node:
        node = buffered_node
        # print('buffered_node starts with', buffered_node.surface)
        first_success = False
        for pattern in rhythm_pattern:
            # print('Now Searching a phrase whose length = ', pattern)
            candidate_phrase = ''
            candidate_length = 0
            success = False
            abandon = False
            head = True
            temp_node = node
            while not success:
                tango = temp_node.surface
                meta = temp_node.feature.split(",")
                hinshi = meta[0]
                # print('Now checking the word: ', tango)
                # print('  Hinshi is ', hinshi)
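                if len(meta) > 8:
                    # The last feature field is taken as the katakana reading
                    # (this assumes an IPAdic-style feature layout).
                    yomi = meta[-1]
                    # One unit per kana; small ャ, ュ, ョ merge with the kana before them.
                    nagasa = len(yomi) - yomi.count('ャ') - yomi.count('ュ') - yomi.count('ョ')
                else:
                    abandon = True
                    # print('Abandon because of no yomi.')
                # print('  Nagasa is ', nagasa)
                if hinshi == "記号" or tango == '':
                    # Symbols never belong to a phrase. The BOS/EOS boundary nodes
                    # have an empty surface and must not be counted either.
                    abandon = True
                    # print('Abandon because this is 記号 or BOS/EOS.')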
                if head:
                    if hinshi not in ["名詞", "動詞", "形容詞", "形容動詞","副詞"]:
                        abandon = True
                        # print('Abandon')
                if not abandon:
                    if nagasa <= pattern:
                        candidate_phrase += tango
                        candidate_length += nagasa
                        # print('    candidate_phrase = ', candidate_phrase)
                        # print('    candidate_length = ', candidate_length)
                        head = False
                        if candidate_length == pattern:
                            success = True
                            if not first_success:
                                first_success = True
                                buffered_node = node.next
                            node = temp_node.next
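                            # If the phrase ends with a small っ, borrow the first
                            # character of the next word, when there is one.
                            if candidate_phrase.endswith('っ') and node and node.surface:
                                candidate_phrase += node.surface[0]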
                            rhythmic_text += candidate_phrase + "\n"
                            # print('SUCCESS! The new phrase is ', candidate_phrase)
                            break
                        elif candidate_length > pattern:
                            abandon = True
                            # print('ABANDON!')
                    else:
                        abandon = True
                        # print('Abandon!')
                if abandon:
                    candidate_phrase = ''
                    candidate_length = 0
                    node = node.next
                    temp_node = node
                    head = True
                    abandon = False
                else:
                    temp_node = temp_node.next
                    if not temp_node:
                        candidate_phrase = ''
                        candidate_length = 0
                        node = node.next
                        temp_node = node
                        head = True
                        abandon = False
                if not node:
                    return rhythmic_text
        rhythmic_text += '\n'
    return rhythmic_text


def convert(patternfilename, documentfilename):
    with open(documentfilename, "r") as documentfile:
        document = documentfile.read()
    with open(patternfilename, "r") as rhythmfile:
        rhythm_spec = rhythmfile.readline().strip()
    # Each digit is the length of one phrase, so "575" becomes [5, 7, 5].
    try:
        rhythm_pattern = [int(digit) for digit in rhythm_spec]
    except ValueError:
        print('Some error has occurred while reading the rhythm spec file.')
        print('The rhythm specification should consist of only numbers.')
        raise
    if not rhythm_pattern:
        # An empty pattern would make rhythmicDocument loop forever.
        raise ValueError('The rhythm spec file is empty.')

    rhythmic_document = rhythmicDocument(rhythm_pattern, document)

    return rhythmic_document


def convertfile(patternfilename, documentfilename, outputfilename):
    rhythmic_document = convert(patternfilename, documentfilename)
    with open(outputfilename, "w") as outputfile:
        outputfile.write(rhythmic_document)
        outputfile.write('\n')
    

if __name__ == '__main__':
    argc = len(sys.argv)
    if argc == 4:
        patternfilename = sys.argv[1]
        documentfilename = sys.argv[2]
        outputfilename = sys.argv[3]
        convertfile(patternfilename, documentfilename, outputfilename)
    else:
        print('usage: python3 thistestprogram.py pattern_file input_file output_file')
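
The search in the program above can be hard to follow because of the MeCab node handling, so here is a much simplified sketch of the same idea on a hand-written word list. It is not the program itself: it has no part-of-speech checks, no symbol handling, and it extracts only one sequence instead of restarting repeatedly. The names find_phrase, extract_once, and WORDS, and the word segmentation and lengths in WORDS, are my own assumptions for illustration, not MeCab output.

```python
# Hand-written (surface, length) pairs; a real run would get these from MeCab.
WORDS = [("人間", 4), ("と", 1), ("機械", 3), ("の", 1),
         ("両者", 3), ("合わさっ", 4), ("て", 1)]


def find_phrase(words, start, target):
    """Find the first run of consecutive words at or after `start` whose
    lengths sum to exactly `target`.

    Returns (begin, end) with `end` exclusive, or None if there is no such run.
    """
    for begin in range(start, len(words)):
        total = 0
        for end in range(begin, len(words)):
            total += words[end][1]
            if total == target:
                return begin, end + 1
            if total > target:
                break  # overshoot: give up on this starting word
    return None


def extract_once(words, pattern, start=0):
    """Collect one phrase per length in `pattern`, each one searched for
    after the end of the previous phrase. Returns None on failure."""
    phrases = []
    position = start
    for target in pattern:
        found = find_phrase(words, position, target)
        if found is None:
            return None
        begin, position = found
        phrases.append("".join(surface for surface, _ in words[begin:position]))
    return phrases


if __name__ == "__main__":
    print(extract_once(WORDS, [5, 7, 5]))
```

With the list above, the pattern 5, 7, 5 gives 人間と, 機械の両者, 合わさって. As in the program, a phrase must end exactly on a word boundary, and a later phrase may start some words after the previous one ends.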

Installation:
You should first install Python 3, MeCab, and mecab-python3.
I hope you can easily find out how to do this on the internet.

Usage:
Copy and paste the above program into a file with a name such as thistestprogram.py.

Prepare a file that specifies the rhythm pattern you want, written as a string of digits. For example,


% echo "575" > pattern575.txt
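
The program reads this file one digit at a time, so 575 means three phrases of lengths 5, 7, and 5, and a length of 10 or more cannot be written in this format. A small sketch of that parsing step follows; the function name parse_rhythm_spec is hypothetical and does not appear in the program.

```python
def parse_rhythm_spec(rhythm_spec):
    """Turn a digit string such as "575" into a list of phrase lengths."""
    spec = rhythm_spec.strip()
    if not spec:
        raise ValueError("the rhythm specification is empty")
    # int() raises ValueError if a character is not a digit.
    return [int(digit) for digit in spec]


if __name__ == "__main__":
    print(parse_rhythm_spec("575"))     # [5, 7, 5]
    print(parse_rhythm_spec("57577"))   # [5, 7, 5, 7, 7]
    print(parse_rhythm_spec("7775\n"))  # [7, 7, 7, 5]
```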

Then run the program as follows.


% python3 thistestprogram.py pattern575.txt filename_of_any_text_file_you_have temp_output_file.txt

Example:
Applying the program to an old paper of mine, I got the following results.

In the case of 57577:

人間と
機械の両者
合わさって
構成されて
知的活動

文学も
人工知能
研究の
一部になって
人工知能

芸術も
人工知能
研究の
一部になって
人工知能

法学も
人工知能
研究の
一部になって
人工知能

In the case of 7775:

製品の夢
人工知能
ネットビジネス
誰のため

相手の心
読み取ることは
自分の心
探索に

与えていない
心とは言え
言わばまぼろし
心なの

To the extent possible under law, the person who associated CC0 with this work has waived all copyright and related or neighboring rights to this work.