March 06, 2012

Chinese word frequency list - News

In the last post I analyzed 80 news articles from Taiwan published over a period of 6 weeks, provided some basic statistics and tried to come up with a Chinese character frequency list by counting the occurrences of unique characters in these articles. In this post I would like to write about the word frequency analysis of the same articles.

Research method

I again analyzed the same 80 articles which were divided into 4 areas: 國際 (international), 政治 (domestic politics), 社會 (society) and 財經 (economics) with 20 articles in each area.

Throughout the word frequency analysis, the biggest problem was actually separating Mandarin words from each other. As I mentioned in the previous post, and as most people who study or speak Chinese know, words are not separated by spaces in Chinese. Counting the occurrences of unique words therefore requires much more work than counting the occurrences of unique characters: unless you want to count with pen and paper, something has to separate the words from one another so that a computer program knows what to count. There are fairly complicated programs that can do this kind of word segmentation for Mandarin automatically, but since I didn't have any of those, I had to segment the text manually.

Counting the occurrences of unique words in an English article, for instance, would be much easier. Spaces in English texts mark very clearly where a word starts and where it ends, so a computer program can use them as markers and treat everything between two spaces as a separate word unit. In Mandarin this is unfortunately not possible.
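The difference can be shown with a minimal Python sketch. It is only an illustration, not the procedure used for this study: the function name count_words and all three sample sentences are made up for the example. Splitting on spaces works for English, returns the whole sentence as a single "word" for unsegmented Chinese, and works again once the Chinese text has been segmented by hand with spaces.

```python
from collections import Counter

def count_words(text, sep=None):
    """Count the units obtained by splitting text on sep (any whitespace by default)."""
    return Counter(text.split(sep))

# Hypothetical sample sentences, not taken from the analyzed articles.
english = "the cat saw the dog"
chinese = "我喜歡學中文"        # no spaces: the program sees one long unit
segmented = "我 喜歡 學 中文"   # the same sentence with spaces inserted by hand

print(count_words(english))    # 'the' is counted twice, the other words once
print(count_words(chinese))    # the whole sentence is counted as one "word"
print(count_words(segmented))  # four separate words, each counted once
```

Once spaces (or any other separator) have been inserted manually, the counting itself is no harder than for English; all of the effort goes into the segmentation.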

Take the following sentence for example:

February 21, 2012

Chinese character frequency list - News articles

I think many of those studying Mandarin Chinese have sooner or later started to wonder how many characters one really needs in order to function normally in a Chinese-only world, or what, for instance, the 500 most frequent Chinese characters are. I have personally heard a lot of numbers and seen several Chinese character frequency lists, but I often didn't understand why this or that character made it into the top 500, or why a list said I needed this or that number of characters to read something when I had the feeling the number was either overstated or understated. So I decided to do a little study of my own.

I tried to analyze approximately how many characters and words are necessary to read news in Mandarin. I chose four sections of Taiwanese news (politics, international, society and finance), all written in traditional Chinese characters during a 6-week sample period.
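A rough Python sketch of this kind of count follows. It is an assumption about how such an analysis could be done, not a description of the method actually used: the names char_frequencies and chars_needed and the sample string are hypothetical, and only the basic CJK block (U+4E00 to U+9FFF) is treated as a character, so rarer characters from the extension blocks would be skipped.

```python
from collections import Counter

def char_frequencies(text):
    """Count Chinese characters, skipping punctuation, digits and Latin letters."""
    return Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")

def chars_needed(freqs, coverage):
    """Number of most frequent characters needed to cover the given share of the text."""
    total = sum(freqs.values())
    running = 0
    for rank, (_, count) in enumerate(freqs.most_common(), start=1):
        running += count
        if running >= coverage * total:
            return rank
    return len(freqs)

# Hypothetical sample text, not one of the analyzed articles.
sample = "台灣新聞,台灣政治。"
freqs = char_frequencies(sample)
print(freqs.most_common(3))
print(chars_needed(freqs, 0.5))
```

Run over a whole collection of articles, chars_needed(freqs, 0.95) would give the number of distinct characters that account for 95% of all characters in the text, which is one way to put a number on "how many characters you need".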

If I'm correct, the field of computational linguistics deals with projects of this kind, and I'm sure that several teams of experts at linguistics departments worldwide have done similar research using much more sophisticated methods than mine. After the amount of effort it took me to analyze these few articles, I have a lot of respect for what they do.

January 06, 2012

Efficiency of Chinese characters

By Vladimir Skultety, M.A., B.A.

A lot of people say that Chinese characters are inefficient, because they are too complicated and there are too many of them. By contrast, they say that western alphabetic scripts are much easier to learn, much easier to write and thus much more efficient.

In this article, I tried to analyze the situation somewhat objectively, which was a bit hard because I like Chinese characters a lot. Either way I looked at it, I still think that characters are at least as efficient as, and in some cases even much more efficient than, western alphabetic scripts.

Negatives:
  • There are a lot of them. I don’t like numbers, but it is true that you need to know at least 2500 – 3000 characters to read something. (Edit 5.5.2012 - strangely enough, after my study I found that you would actually only need about 2180 characters to read the newspaper)
  • It’s much more difficult to remember characters compared to the simple 35 or so letters of an alphabetic script
  • They are easy to forget
  • They are easy to confuse
  • You not only need to learn how to recognize them, you need to learn how to write them by hand which doubles your effort
  • They are impractical when you need to look something up in a list (dictionary, telephone list)
Positives:

December 11, 2011

矛楯 - Lances and shields

Hello everyone,

it's been a while since I did any Classical Chinese text analyses, and with nothing to do on this misty Sunday afternoon I thought I'd write a short one just to practice a little. It might not look like it, but thanks to Google some people find their way to my blog because they are looking for translations of sentences or expressions in Classical Chinese. This post is mostly for those who are already interested in Classical Chinese for one reason or another, or for anyone who might fall in love with it like the rest of us have.

I say this all the time, but I am not an expert on Classical Chinese; I merely have a Bachelor’s degree in Chinese studies. Most of these analyses are based on our classes at the Chinese department, and whenever I run into something whose explanation I don’t remember, I try to translate and explain it based on what I remember about Classical Chinese grammar and sentence structure. This might not always be correct, and I apologize for any mistakes in advance.

November 14, 2011

Remembering Farsi

After almost a year I have finally picked up my studies of Farsi where I left off. This amazing language has been lying dormant on my wish list for a long time; I have started learning it twice already and twice I have failed to carry on. The fact that there are, or were, no Farsi speakers around was one of the reasons for my pause, but it is not a very good excuse for the pause having lasted so long.

Either way, not having anyone around to practice the language with at the beginning does not help motivation. I now know how some people really have no choice but to rely on course books, and I have deep respect for those who live in places with little chance of meeting speakers of their target language, make the most of their studies and still learn a foreign language to fluency this way.

Farsi really is a wonderful language, and the sheer thought that I could freely converse or read in it one day is very exciting, so a month ago I decided that I just have to force myself into my studies and be persistent. I wanted to write a short blog entry about where I stand and what ideas I have about the language now.

October 19, 2011

Mandarin Chinese tones – sound only approach

By Vladimir Skultety M.A., B.A.

I would like to talk about and build on a concept I wrote about in my earlier posts: developing a system in which students would remember Mandarin words without consciously knowing which tones or tonal combinations are in them, and would pronounce them correctly with less effort.

As the topic is quite complex, I would first like to go back to 2 earlier articles I wrote about tones and develop the thought from there.

Post from 11.30.2011 (edited):

When I first came to Taiwan, I remember being tired after even a 10-15 minute Mandarin conversation. I was unable to use the words I had learned before effortlessly, even after I had used them a hundred times in conversation practice. Each time I wanted to use these words I had to make at least some effort to recall them and constantly think of the tones, which was very tiring.

September 29, 2011

Learning an intermediate language - Italian

Hello all,

On my blog I have written articles about difficult and simple languages before, and I realized that I haven’t written anything about intermediate languages yet, so I will try to dedicate an entire post to them now. As I mentioned earlier, there are probably much better ways of dividing languages by difficulty. I do not challenge them, but I find that up until now, all the languages that I’ve learned fall into three simple categories: simple, intermediate, difficult – depending on how far they are from a language I already speak at a native/advanced fluency level.

For me an intermediate language (or a language that I find to be intermediately difficult to learn) is:
  • A language that is outside of my native language group, or outside the language group of a language that I already speak well, but still within the same general family[1]
  • The grammar is at least 50% identical with the languages I already speak at an advanced/native fluency level
  • Another 30% of concepts present in the grammar are concepts that can also be found in the languages I already speak but are used rarely or formulated in a different way
  • At least 10% of grammar concepts are completely alien to me
  • There is a large number of cognates in the language, but different pronunciation might leave them unrecognized at first
  • The sound system is at least 50% identical[2] with the languages I already speak
  • Literal translations are often possible
  • Cultural difference is not a substantial issue
From a strictly analytical point of view, if you look at English and Italian for instance, you can almost go as far as saying that they are two distant dialects of Indo-European. They share large amounts of Latin or Greek based vocabulary, Italian vocabulary has received a lot of influence from English, numerous grammar concepts overlap, and a lot of expressions in Italian can be translated directly into English, often literally.

September 25, 2011

Hiking in Slovakia – High Tatras

Waterfalls, mountain lakes, mountain streams, amazing views and weather
I went back to Slovakia during the summer and, after a very long time, visited the High Tatras. My blog is mostly about languages and I know that posts about climbing mountains or river tracing might not be interesting to all readers, but this trip was so amazing that I decided to at least share some of the pictures we took. My high school friend bought a flat in Tatranska Strba, only a couple of rack railway stations away from Strbske pleso (the main tourist hub), so we decided to go there on Friday evening and start the trip on Saturday morning.


The High Tatras are amazing in any weather and any season, but that Saturday there were almost no clouds in the sky and the temperature was around 25 degrees, making the conditions very suitable for a good hike. When we arrived at Strbske pleso, we decided to take the yellow route to the Furkotsky peak, which had a lot of interesting sights on the way. The entire hike was about 22 km long, but was definitely worth the effort.

September 16, 2011

Interview with Luca Lampariello


Dear all,

a few weeks ago my good friend Luca Lampariello was kind enough to do an interview with me on his blog, and I am very happy to say that I can now return the favor and interview him. I met Luca about 3 years ago and, based on our mutual passion for foreign languages and I think mutual respect as well, we became good friends. He speaks several languages at a C2 level and has been proclaimed by many people to be one of the best polyglots on YouTube – a statement to which I subscribe.

I thought for a while about the topic that would suit our interview best, since I didn’t want to talk about motivation or general language learning strategies, but rather about something more specific, something that would be interesting and useful at the same time. I know very well that I have lost the capability to acquire a 95-100% native pronunciation in a foreign language, but I think Luca is one of those people who still can do it. Since this is something that interests me very much and something I can personally learn a lot from, I decided to ask Luca questions related mainly to his accent acquisition techniques and native-like pronunciation development.

August 26, 2011

有獻不死之藥於荊王者

The following text is a short story from the book of 韓非子. Unlike the previous stories, it carries no moral message. The storyline is quite simple, there is not much difficult vocabulary, and a lot of it is repeated throughout the text. I have already covered most of the grammar in my previous posts, so I will probably skip a lot of things this time and make the analysis more straightforward. I still feel that one of the greatest things about reading these texts is that I can directly read and understand something that was written such a long time ago and absorb the atmosphere, and I hope that once you are able to read them at natural speed you will feel the same.


Text:

有獻不死之藥於荊王者

有獻不死之藥於荊王者。謁者操之以入。中射之士問曰。可食乎。曰。可。因奪而食之。王大怒。使人殺中射之士。中射之士使人說王曰。臣問謁者。曰。可食。臣故食之。是臣無罪而罪在謁者也。且客獻不死之藥。臣食之而王殺臣。是死藥也。是客欺王也。夫殺無罪之臣而明人之欺王也。不如釋臣。王乃不殺。