Today I released version 0.3 of TinySegmenter, a Japanese tokenizer in pure Python (released under the New BSD license), with a single minor fix for proper installation on systems not using UTF-8 (apparently those still exist! :P). Thanks to Mišo Belica for the patch. Apparently some of his Japanese users use it with Sumy, his software for extracting summaries from texts.
About TinySegmenter and Japanese tokenization
It’s not much of a release, but it is a good occasion to talk about TinySegmenter. This is a “tokenizer” for Japanese. What is a tokenizer? Basically, it breaks sentences into words. For people who don’t know Japanese: the language doesn’t use spaces or any other symbol to separate words. Theybasicallywritelikethis. Yet there are ways to break these sentences into words, usually based on statistical analysis (like most things in Natural Language Processing, and Artificial Intelligence in general). For anyone who wants to know a bit more, this message from the developer of Kytea (another tokenizer, which is great) explains the 2 main methods, with links to software using them (TinySegmenter among them) and, especially, keywords that let you search further.
The reason why you want to “tokenize” Japanese or Chinese is that it is often a first step for further natural language analysis (for instance automatic translation, grammar analysis, pronunciation and hence speech synthesis, etc.).
Now for the obligatory example: “my name is Jehan” in Japanese is 私の名前はJehanです。 TinySegmenter breaks it up like this:
In : import tinysegmenter
In : segmenter = tinysegmenter.TinySegmenter()
In : segmenter.tokenize(u"私の名前はJehanです。")
Out: ['私', 'の', '名前', 'は', 'Jehan', 'です', '。']
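To give an idea of why statistics are needed at all, here is a minimal sketch of a much cruder approach: cut the text every time the script changes (kanji, hiragana, katakana, Latin letters, anything else). To be clear, this is not how TinySegmenter works; the names char_class and naive_tokenize are made up for this illustration, and the Unicode ranges only cover the most common blocks.

```python
from itertools import groupby


def char_class(ch):
    """Return a coarse script label for a single character."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        # Only the main CJK Unified Ideographs block (an assumption
        # made to keep the sketch short).
        return "kanji"
    if ch.isalnum():
        # Latin letters, digits and any other alphanumeric character.
        return "latin"
    return "other"


def naive_tokenize(text):
    """Split text wherever the script label changes."""
    return ["".join(group) for _, group in groupby(text, key=char_class)]


print(naive_tokenize("私の名前はJehanです。"))
print(naive_tokenize("これはペンです"))
```

On the sentence above, this happens to give the same split as TinySegmenter. On これはペンです (“this is a pen”), it returns 'これは' as a single token, although これ (“this”) and the particle は are two separate words written in the same script. Getting such cases right is where the statistical methods come in.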
I am not planning on hacking much on TinySegmenter anymore. I never was planning to; at the time I took over maintainership, I just wanted to use it for a project (which never went through) and the original developers were not answering. So I just packaged it properly, made minor changes (for instance better support of European words using Latin1 and extended Latin Unicode characters), added some tests, and that’s it. I don’t even use it anymore. Yet if more people are interested and want to use it, feel free to send me patches. I could also give commit rights, and even co-maintainership after a few patches. I just wanted to get these words out. 🙂
I also discovered today the existence of a TinySegmenter3 on pypi, with fewer downloads than TinySegmenter (the older one, which I maintain; yes, I know that’s a bit confusing, why would they keep the same name and just add a 3?) but worth looking at since they apparently improved performance a good deal (I haven’t checked, but that’s what it says). Maybe I should look at their code and merge their commits at some point, after talking to them?
Anyway have fun! 🙂
Reminder: my Free Software coding can be supported in USD on Patreon or in EUR on Tipeee.