Koichi Hori


Unicode decode error "'utf-8' codec can't decode byte 0xfa in position 0: invalid start byte" when using MeCab

(Diary of an Old AI Researcher who is still Programming)

14 May 2019

I am using mecab-python3, the Python binding for MeCab, a morphological analyzer for Japanese text.

I do not know why, but mecab-python3 sometimes raises the error "'utf-8' codec can't decode byte 0xfa in position 0: invalid start byte".
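The message itself can be reproduced without MeCab, which at least shows what it means: Python was asked to decode bytes as UTF-8, and the first byte (0xfa) cannot start any UTF-8 sequence. The sketch below only illustrates the message; it does not explain where MeCab gets such bytes from. The helper name decode_error_message is hypothetical and is not part of mecab-python3.

```python
# 0xfa (binary 11111010) is not a valid first byte of any UTF-8 sequence,
# so decoding data that begins with it fails at position 0.
def decode_error_message(data):
    """Return the UnicodeDecodeError message for data, or None if it decodes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as e:
        return str(e)
    return None


# A lone 0xfa byte produces the error message quoted above.
print(decode_error_message(b"\xfa"))

# Properly encoded Japanese text decodes without error, so this prints None.
print(decode_error_message("名詞".encode("utf-8")))
```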

Searching the internet, I found no explanation of the cause, but some people say that the error can be avoided by parsing an empty string before carrying out the actual parsing tasks.
I tried this workaround and found that it does work.

Here is an example:


import MeCab

def extractNouns(text):
    tagger = MeCab.Tagger()

    # Workaround: parse an empty string once before the real parse.
    # No one seems to know why this works, but it avoids the
    # Unicode decoding error in the parsing that follows.
    tagger.parse("")

    node = tagger.parseToNode(text)
    keywords = []
    while node:
        try:
            word = node.surface
        except Exception as e:
            # Report the error, then skip this node and keep going.
            print(str(e))
            print('parsing error occurred but ignored')
            word = None
        if word is not None and word.isalpha():
            # The first field of the feature string is the part of speech.
            meta = node.feature.split(",")
            if meta[0] == '名詞':  # '名詞' means "noun"
                keywords.append(word)
        node = node.next
    return keywords
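
When many texts are processed, it may be preferable to create the tagger once and reuse it instead of building a new one on every call. The following is a minimal sketch of that variation. It assumes that applying the empty-string workaround once per tagger is enough, which I have not verified. The names prime_tagger and extract_nouns are hypothetical; the only tagger calls used are parse, parseToNode, surface, feature, and next, as in the example above.

```python
def prime_tagger(tagger):
    """Apply the empty-string workaround once and return the same tagger."""
    tagger.parse("")
    return tagger


def extract_nouns(tagger, text):
    """Collect alphabetic noun surfaces from text using an existing tagger."""
    keywords = []
    node = tagger.parseToNode(text)
    while node:
        try:
            word = node.surface
        except UnicodeDecodeError as e:
            # Report the decoding error, then skip this node.
            print(e)
            word = None
        if word is not None and word.isalpha():
            # The first field of the feature string is the part of speech.
            if node.feature.split(",")[0] == "名詞":  # "noun"
                keywords.append(word)
        node = node.next
    return keywords
```

With mecab-python3 installed, this would be used as tagger = prime_tagger(MeCab.Tagger()), followed by extract_nouns(tagger, text) for each text.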

CC0
To the extent possible under law, the person who associated CC0 with this work has waived all copyright and related or neighboring rights to this work.



