Monday, July 29, 2013

Mandarin, a stream of syllables with meaning


There is something special about Chinese compared with many other languages. Essentially every syllable has a meaning, and because the number of possible sounds in the language is smaller than in English, there is a relatively limited number of syllable sounds that need to be learned.

Learning Chinese lends itself to one mode of attack that doesn't seem to work so well in many other languages: it is possible to listen to audio and, even when most of the words are not understood, to analyse and decipher most of the content armed with a good dictionary and an ear for the syllables.

The syllables are highly phonetic and, aside from the famous (or infamous) Chinese writing system, there is pinyin, a writing system that uses the Latin alphabet.

This mode of attack has helped my Chinese learning a lot, but there are problems. I intend to write some software to make this task easier. I have been working on an early prototype as a proof of concept for myself and will soon be working on a more generally accessible prototype.


There is a proliferation of words that share the same sound, especially if you still can't decipher the correct tone of the syllable. A tonal difference tells a Chinese speaker the difference between the words for buy and sell, for example, but a European may just hear the sound "mai" without capturing the tone.
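A minimal sketch of a tone-insensitive lookup, to make the "mai" problem concrete. It assumes pinyin written with tone numbers (mai3, mai4); the sample entries and the names strip_tones, build_toneless_index and lookup are made up for illustration and are not taken from the prototype.

```python
import re
from collections import defaultdict

# Tiny hand-made sample in numbered pinyin; a real dictionary would supply these.
ENTRIES = [
    ("mai3", "to buy"),
    ("mai4", "to sell"),
    ("mai2", "to bury"),
    ("ma3", "horse"),
]


def strip_tones(pinyin):
    """Remove tone digits 1-5 from numbered pinyin, e.g. 'mai3' -> 'mai'."""
    return re.sub(r"[1-5]", "", pinyin.lower())


def build_toneless_index(entries):
    """Group entries under their toneless spelling."""
    index = defaultdict(list)
    for pinyin, meaning in entries:
        index[strip_tones(pinyin)].append((pinyin, meaning))
    return index


INDEX = build_toneless_index(ENTRIES)


def lookup(heard):
    """Return every entry whose toneless spelling matches what was heard."""
    return INDEX.get(strip_tones(heard), [])


for pinyin, meaning in lookup("mai"):
    print(pinyin, meaning)
```

Typing just "mai" brings back buy, sell and bury together, which is exactly the ambiguity a learner faces when the tone was not caught.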

Some sounds that are distinct to a Chinese speaker may sound very similar to a non-Chinese speaker (effectively homophones for the learner).

Speakers may not be distinct

As with other languages there are many distinct local variations: shi sounds may be pronounced as si, r may sound like l, n may sound like l, and so on.

Add people speaking fast, children and old people, and the accuracy of listening becomes highly variable for the learner.

Word frequency

The dictionary you are using may not order results by word usage frequency, resulting in a long list of words (especially if you omit the tones) starting with "the term to denote the Emperor's left big toenail". Context will help you if you have it, but ideally you want to see the more frequent words at the top of the results returned.
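Ordering by frequency can be sketched in a few lines. The candidate entries and the rank numbers below are invented purely for illustration (a real list would come from a frequency corpus), and sort_by_frequency is a hypothetical helper name.

```python
# Candidate results for the toneless query "mai", in arbitrary dictionary order.
CANDIDATES = [
    ("mai2", "haze"),
    ("mai2", "to bury"),
    ("mai4", "to sell"),
    ("mai3", "to buy"),
]

# Invented ranks keyed by (pinyin, meaning); lower means more frequent.
# Entries missing from the table are treated as rare.
FREQUENCY_RANK = {
    ("mai3", "to buy"): 1,
    ("mai4", "to sell"): 2,
    ("mai2", "to bury"): 3,
}


def sort_by_frequency(candidates, ranks):
    """Most frequent first; unranked entries keep their order at the end."""
    return sorted(candidates, key=lambda entry: ranks.get(entry, float("inf")))


for pinyin, meaning in sort_by_frequency(CANDIDATES, FREQUENCY_RANK):
    print(pinyin, meaning)
```

Because Python's sort is stable, anything the frequency list does not know about simply sinks to the bottom in its original order rather than being dropped.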

The sweet spot

The sweet spot for listening and learning from audio Chinese is probably when you are hearing sentences that you can translate to something like "she/he/it gave me XXXX and YYYY me very ZZZZ". Being able to quickly pick up the meaning of XXXX and the other words from sound alone is a terrific learning experience, and this is easier to do than in languages with inconsistent spelling, a bewildering number of possible sounds and word boundaries that are harder to determine.

The quest

At the moment the best solution I have found has been to use the Pleco dictionary on a portable device. Aside from the cost (it is a very good product and well worth the price, I feel), the main omission from my desired experience is the element of fuzziness.

The fuzziness I want would give the option to search for "gan tai" and see a list of options and result counts that included "gan cai", "gang cai", "gang tai" etc.
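One possible way to generate such fuzzy alternatives is to split each syllable into an initial and a final and swap in sounds that are easily confused. This is only a sketch under stated assumptions: the confusion groups are illustrative guesses based on the examples in this post (t/c, an/ang, sh/s, r/l, n/l), and every function name is hypothetical.

```python
from itertools import product

# Pinyin initials, longest first so "sh" is matched before "s".
INITIALS = sorted(
    ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g", "k",
     "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"],
    key=len, reverse=True,
)

# Assumed groups of sounds a learner might confuse (illustrative only).
INITIAL_GROUPS = [{"t", "c"}, {"sh", "s"}, {"zh", "z"}, {"ch", "c"},
                  {"r", "l"}, {"n", "l"}]
FINAL_GROUPS = [{"an", "ang"}, {"en", "eng"}, {"in", "ing"}]


def split_syllable(syllable):
    """Split toneless pinyin into (initial, final); the initial may be empty."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable


def alternatives(part, groups):
    """The part itself plus everything that shares a confusion group with it."""
    found = {part}
    for group in groups:
        if part in group:
            found |= group
    return sorted(found)


def syllable_variants(syllable):
    initial, final = split_syllable(syllable)
    return [i + f
            for i in alternatives(initial, INITIAL_GROUPS)
            for f in alternatives(final, FINAL_GROUPS)]


def fuzzy_queries(query):
    """All combinations of per-syllable variants for a spaced pinyin query."""
    per_syllable = [syllable_variants(s) for s in query.lower().split()]
    return [" ".join(combo) for combo in product(*per_syllable)]


print(fuzzy_queries("gan tai"))
```

Some generated spellings will not be real pinyin syllables, so in practice each variant would be looked up in the dictionary and only those with results (and their counts) shown.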

My first prototype requires the Scripting Layer for Android (SL4A), specifically with Python installed. It is based on the CC-CEDICT Chinese dictionary and I have integrated frequency data from the Leeds University frequency list. The code is available on GitHub and should work, assuming you have SL4A and Python running on a reasonably powerful Android device (it is not optimised for performance).
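For anyone curious about the dictionary data, CC-CEDICT is a plain text file with one entry per line in the form "Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/", with comment lines starting with "#". The following is a rough sketch of reading one such line, not the code in the repository; parse_cedict_line and the sample line are illustrative assumptions.

```python
import re

# Traditional, simplified, [numbered pinyin], then /gloss/gloss/.../
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict_line(line):
    """Parse one CC-CEDICT style line into a dict, or return None to skip it."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or comment
    match = CEDICT_LINE.match(line)
    if match is None:
        return None  # not in the expected format
    traditional, simplified, pinyin, glosses = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin.lower(),  # lowercased so searches ignore case
        "glosses": glosses.split("/"),
    }


# A sample line written in the CC-CEDICT style.
SAMPLE = "買 买 [mai3] /to buy/to purchase/"

entry = parse_cedict_line(SAMPLE)
print(entry["pinyin"], entry["glosses"])
```

Once every line is parsed this way, the pinyin field is what a sound-based search would be matched against.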

I am just going to write up the documentation over the next week or so and then start on a better prototype (probably JavaScript-based, to run standalone on Android).

If by some miracle you do try out this version (it might be best to await the documentation) and it works, then questions or feedback will be gratefully received.