February 21, 2012

Chinese character frequency list - News articles

I think many people studying Mandarin Chinese sooner or later start to wonder how many characters one really needs in order to function normally in a Chinese-only world, or what, for instance, the 500 most frequent Chinese characters are. I have heard a lot of numbers and seen several Chinese character frequency lists, but I often didn't understand why this or that character made it into the top 500, or why a list said I needed a certain number of characters to read something when I had the feeling the number was either overstated or understated. So I decided to do a little study of my own.

I tried to estimate approximately how many characters and words are necessary to read the news in Mandarin. I chose four sections of Taiwanese news (politics, international, society and finance), all written in traditional Chinese characters, over a six-week sample period.

If I'm correct, the field of computational linguistics deals with projects of this kind, and I'm sure that several teams of experts at linguistics departments worldwide have done similar research using much more sophisticated methods than mine. After the amount of effort it took me to analyze these few articles, I have a lot of respect for what they do.


I just tried to do a little research on my own. The one point where I think I might have achieved a slightly better result than research done with more sophisticated computer programs is the analysis of Mandarin word frequency, because I separated the words manually. This, however, also leaves my study with a higher margin of error.


In this post I will present the results of the Mandarin character frequency analysis. In the posts to follow I will talk about the Mandarin word frequency analysis.

Research method

I analyzed 80 articles in 4 sections (politics, international, society and finance), 20 articles per section. This is a very small amount of data any way you look at it, but some interesting things came out of the analysis anyway. One of the obvious problems with sampling such a small amount of data over a short period of time is that a lot of 'seasonal' characters and expressions appear in it. In this small study these were expressions related to the elections in Taiwan, Greece, the Chinese New Year, the European debt crisis and the Makiyo case.

In my opinion, sampling characters over a very long period of time also has its flaws. Suppose someone wanted to know approximately how many characters are necessary to read the news in Mandarin in Taiwan and sampled a very large number of news articles over the course of one year instead of 6 weeks, as I did. The number of unique characters occurring in those articles would of course be much higher than in my study, simply because the probability of rare characters appearing is much higher over a longer period of time. In my opinion, that doesn't make the study more accurate in telling you approximately how many characters you need to know, but it does make it more complete. In my little study I simply tried to map out the number of unique characters and words, and their frequency, in 4 sections of the news in Taiwan over the course of 6 weeks.

Analyzing character frequency was relatively easy; analyzing word frequency was much more difficult. To get the character frequencies and the total number of unique characters in the 80 articles, I asked my good friend (many thanks to Martin again) to write a computer program which basically did the work for me. All I had to do was put any amount of text, in any format, into a text file and run the program, which produced a list of unique characters ordered from the most frequent to the least frequent, with a count of unique characters at the end of the file.
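
For readers who would like to try this themselves, a program of this kind can be sketched in a few lines of Python. This is only an illustration and not Martin's actual program: the function names and the sample sentence are made up, and the check for what counts as a Chinese character is an assumption (it only covers the basic CJK block and Extension A).

```python
from collections import Counter


def is_han(ch: str) -> bool:
    """Rough test for a Chinese character (assumption: basic CJK block + Extension A only)."""
    code = ord(ch)
    return 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF


def character_frequencies(text: str) -> list[tuple[str, int]]:
    """Return (character, count) pairs, most frequent first; punctuation and Latin text are skipped."""
    counts = Counter(ch for ch in text if is_han(ch))
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical sample text; in practice this would be read from the text file.
    sample = "今年新年,馬上去看新聞。"
    ranking = character_frequencies(sample)
    for ch, n in ranking:
        print(ch, n)
    # Count of unique characters at the end, as in the list described above.
    print("unique characters:", len(ranking))
```

To run it on a real file, the sample string could be replaced with the contents of a UTF-8 text file read via `open(path, encoding="utf-8").read()`.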

Analyzing word frequency took a lot more effort and time. As most people learning Mandarin know, words are not separated by spaces in Mandarin, so I had to separate them manually. In addition, some constructions like 在..內, which have other characters in between them, had to be manually rewritten as 在內 in order to be counted as separate words.

I think this is the second most questionable part of my little research (the first being the small amount of data), because I decided by feel, and on my Taiwanese friends' suggestions, what should and should not be considered a separate word.

Also, in the character frequency analysis there was practically no margin for error, as the computer program did everything for me. For the manual word separation, however, I had to allow for a 5% margin of error: as I gradually worked my way through the articles, I changed my approach several times regarding what I did and did not consider a separate word, and even though I edited the input data over and over again to even out these changes, I might have overlooked something.

As for the technical part, I put all 80 articles into MS Word, manually moved each word or construction onto a new line and then ran my friend's program on the file.
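
Counting words from a file prepared this way (one word or construction per line) is the easy part. A minimal sketch, assuming the manually separated text is available as a plain string; the function name and the sample data are hypothetical and this is not the program that was actually used:

```python
from collections import Counter


def word_frequencies(segmented_text: str) -> list[tuple[str, int]]:
    """Count entries in a text that has one word or construction per line."""
    words = (line.strip() for line in segmented_text.splitlines())
    # Empty lines (for example, paragraph breaks left over from the articles) are ignored.
    counts = Counter(word for word in words if word)
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical hand-segmented sample; 在內 stands in for a rewritten 在..內 construction.
    segmented = "今年\n新聞\n在內\n今年\n\n馬上\n"
    for word, n in word_frequencies(segmented):
        print(word, n)
```

The hard part, deciding where one word ends and the next begins, still happens by hand before the text ever reaches this step.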

Once I had 8 frequency files (4 character frequency files + 4 word frequency files), I transferred them to OpenOffice Calc and started doing some statistics.

Data analysis


This data table contains the most important information: the total number of sampled characters and the number of unique characters, per section and for the 80 analyzed articles as a whole. To my surprise, only 2105 unique characters were found in the 80 articles I analyzed.

There are several reasons for this:
  1. The amount of data I sampled was very small.
  2. Direct speech is rarely used in news articles; it appears only in the form of quotations. As I will show later, characters typical of dialogue, such as 我, 你, 去 or 回, are nowhere close to the top 100 (我 - 183rd, 你 - 982nd, 去 - 204th, 回 - 238th), and a lot of them didn't make it onto the list at all.
  3. In my opinion, the Mandarin used in the news is very different from the Mandarin used in books, for instance. Books contain completely different expressions and sentence patterns and a lot of direct speech, and would therefore have a different character count, with a lot of characters that do not overlap with those used in the news. Many of those characters didn't make it onto the list either.


As can be seen from the table, a higher total number of characters in a section does not necessarily mean a higher number of unique characters in that section. The total number of characters in the financial section 財經 is 20% higher than in the international section 國際, but the number of unique characters in 財經 is 4% lower than in 國際. The reason is that in the financial section the same expressions were used over and over again across many articles (loan, debt, stock, growth etc.), whereas in the international section most of the articles were different, talking about different countries and people, with a lot of unique personal and country names appearing in them.


This chart is a visual representation of the relationship between the number of unique characters you know and the percentage of characters you should recognize in the 80 articles. As can be seen from the chart, by knowing the most frequent 50% of the unique characters, you should be able to recognize about 93% of all the characters in the 80 articles I analyzed.
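
The idea behind such a chart can be expressed as a small calculation: sort the characters by frequency, take the most frequent share of them, and see what fraction of all character occurrences they account for. The sketch below is only an illustration; the function name is hypothetical and the counts are made-up toy numbers, not the data from my 80 articles.

```python
def coverage(counts: list[int], share_known: float) -> float:
    """Fraction of all occurrences covered by the most frequent `share_known` of unique characters."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    # Number of unique characters the reader is assumed to know.
    known = round(len(ordered) * share_known)
    return sum(ordered[:known]) / total


if __name__ == "__main__":
    # Toy data: 6 unique characters with these occurrence counts (100 occurrences in total).
    toy_counts = [50, 30, 10, 5, 3, 2]
    print(coverage(toy_counts, 0.5))  # the top 3 characters cover 90 of 100 occurrences
```

Calling the function for a range of shares (10%, 20%, and so on) gives the points of the curve.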

There are several reasons why this is not enough to understand news articles flawlessly. A full explanation would be very lengthy, as these reasons are related to how written Chinese, and Chinese in general, works. One of the reasons (and this will be even more obvious in the Word frequency analysis section) can be better understood by looking at the following table:


In this table I list the number of characters that appeared only once, twice or three times in the texts I sampled (characters that are rare) and the percentage of all the unique characters that they make up.
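
A tally like this is easy to derive from a character frequency list. A minimal sketch, with a hypothetical function name and made-up counts rather than the real figures from the table:

```python
from collections import Counter


def rare_character_summary(counts: dict[str, int], max_count: int = 3) -> dict[int, tuple[int, float]]:
    """For n = 1..max_count, return (number of characters seen exactly n times, their % of unique characters)."""
    unique_total = len(counts)
    by_count = Counter(counts.values())
    summary = {}
    for n in range(1, max_count + 1):
        share = 100 * by_count[n] / unique_total if unique_total else 0.0
        summary[n] = (by_count[n], share)
    return summary


if __name__ == "__main__":
    # Toy frequency list (illustrative numbers only).
    toy = {"的": 10, "年": 3, "龍": 1, "刊": 1, "馬": 2}
    for n, (how_many, share) in rare_character_summary(toy).items():
        print(f"appeared {n}x: {how_many} characters ({share:.1f}% of unique characters)")
```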

This table and the previous graph show the obvious: even if a reader can recognize 93% of all the characters in these 80 articles, when he comes across a character from the remaining 7% (which he will), it is highly probable that he will not recognize it, even if he knows some random characters from the 'rare' 7% section of the graph.

Another obvious conclusion is that learning to recognize the remaining 7% of the characters found in the texts will take the reader about the same amount of effort as learning the first 93%. Or, to look at it another way, by learning the remaining 50% of all the unique characters found in the texts, one increases the share of characters one recognizes by only 7%.

For reference, I also list the characters that have a very high frequency compared to other characters. As you can see, only 15 characters appeared more than 200 times and only 68 characters appeared more than 100 times in the texts I sampled.



Character frequency table: 


Again for reference, I'm posting the list of the 80 most frequent characters in the articles I sampled, starting with the most frequent one. My small study was influenced by a lot of seasonal characters, but this table still shows pretty nice results. As I mentioned earlier, typical direct speech characters like 我, 你, 去, 回, 來 etc. are nowhere to be seen.

Another thing that I will talk about a little more in the Word frequency analysis section is that knowing, for instance, the first 90% of all the unique characters really is not the same thing as being able to understand 90% of the information in the text. Chinese characters represent morphemes rather than words, and even though a lot of Chinese words consist of only one character (or one morpheme, if you like), the majority consist of 2 characters. Doing a character frequency analysis is therefore essentially the same thing as doing a morpheme frequency analysis.

One way to look at it: if someone did the same frequency study by sampling 80 English news articles, there might be an entry saying that the English morpheme tele appeared 220 times in those articles. That says a lot about how often the morpheme appears, but nothing about the words it appeared in (telephone, telegraph, television, teleport etc.).

The same applies to the character frequency analysis. You can take basically any character from the above list and not know what it meant in the articles I sampled. For instance:

年 - year
今年 - this year
去年 - last year
龍年 - the year of the dragon
青少年 - teenagers/teenage years

新 - new
新年 - new year
新聞 - news
重新 - repeat, again

Or a more extreme example:

馬 - horse
馬上 - immediately
馬英九 - Ma Yingjiu (president of Taiwan)
歐巴馬 - Barack Obama
馬路 - road
馬政府 - abbreviation for 馬英九政府 - the government of Ma Yingjiu

etc.
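
Given a word list, the words hiding behind a single character can be collected with a simple index from each character to the words that contain it. This is a minimal sketch using the example words from above; the function name is hypothetical.

```python
from collections import defaultdict


def words_by_character(words: list[str]) -> dict[str, set[str]]:
    """Map every character to the set of words in which it occurs."""
    index: dict[str, set[str]] = defaultdict(set)
    for word in words:
        # set(word) ensures a character repeated within one word is only handled once.
        for ch in set(word):
            index[ch].add(word)
    return index


if __name__ == "__main__":
    words = ["馬上", "馬英九", "歐巴馬", "馬路", "今年", "去年", "新年", "新聞"]
    index = words_by_character(words)
    print(index["馬"])  # every sample word containing 馬
    print(index["年"])  # every sample word containing 年
```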

You can download the full character frequency list in the download section of this blog.

End of Part I. To be continued in Part II - Word frequency analysis. 

16 comments:

  1. It's a pity this wasn't part of a master's thesis or some other similar type of project.

    1. Ryan,

      I'm sure there are a lot of people out there who have done this sort of research many times before using more sophisticated methods. It was a lot of fun to do though.

  2. Vladik,

    amazing stuff my friend. Great sense for detail and considering it to be a lot of fun - and I believe that you really enjoyed it - is wonderful. How are you doing anyway?
    Cheers
    Duri

    1. Duri,

      thank you for the nice comment my friend. Doing the statistics was a lot of fun, sampling the data not so much:)

      all is going well here, call you when I get the chance

      Vlad

  3. Very interesting study, Vlad!

    I agree with you when you say that single character frequency lists are not an indication of which words one should actually know in order to develop reading ability in Chinese. As far as I am concerned (as a teacher/translator from Japanese), I came across the same issue in Japanese, where (even in the official Proficiency Test) they always speak about the number of characters you need to know at each level but not the number of actual "kanji words" (compounds) you need to know, which would by far be mooore useful...

    Looking forward to reading your next posts!

    Luca :-)

    1. Dear Luca,

      thank you for the nice comment.

      I have the Words frequency post almost finished, I will post it later next week.

      I don't know about the HSK tests, but the official Chinese language tests here in Taiwan called Test of proficiency Chinese do use word lists instead of character lists.

      Are you studying Chinese as well?

      Vlad

      P.S.: Have you read GTO? :)

  4. Hi Vlad,
    thanks for your reply. I'm glad in Taiwan they use word lists instead of dealing with "number of characters"...

    I studied basic Chinese at college, then lived in Taiwan for almost 8 months, and finally 10 years in Japan... My Chinese is somewhere between low intermediate in speaking but higher in reading ability... I intend to take it to a higher level with time :-)

    Yes, I have read GTO... in both Italian and Japanese, and liked it a lot :-)

    Luca

    P.S. Congratulations on your blog and above all on the language skills you have acquired. You are an inspiration for all those who, like me, love studying foreign languages and getting to know other cultures. Until next time ;)

    1. Dear Luca,

      thank you very much for the compliments.

      The Chinese I speak is not perfect, but I would be happy to help you, if you felt like it of course:)

      If you like, we can get in touch and do some kind of language exchange. I would very much like to learn Japanese one of these days, but unfortunately I don't know how or where to start.

      best regards

      Vladimir

      P.S.: I am reading GTO in Chinese, I still have 3 books left:)

  5. Hi Vlad,
    thanks... I have sent you a private email :-)
    Luca

  6. Hey Vlad,
    Just stumbled upon this post and I was wondering if you might have the complete list of 5000+ words (not characters, hehe) most commonly used in the Taiwanese newspapers. I'm studying traditional Chinese and would like to improve my Chinese by beginning to read Chinese literature and would like to try and learn some new words before beginning. Thanks so much! Also, you should totally learn Kiswahili! It's a beautiful language :-)

    1. Hello dear Libby,

      actually a few days ago I was thinking that I have to learn an African language one day:) I don't know when this will happen since I am constantly doing other things that I should be doing, but I have a plan now:) I am ashamed to admit this but I know very little about Africa, its history, culture and languages and I really hope I can change this one day.

      What you can try to look at as far as traditional character word lists go (general language, not news as far as I can tell) is the word list for the Test of proficiency Chinese - advanced test:

      http://www.tw.org/top/word_adv.pdf

      If this is not good, let me know and I can send you the word list that I came up with while doing my analysis.

      kind regards

      Vladimir

  7. Interesting articles. You ended up making a little six week corpus!

    One point I disagree with:
    >>In my opinion, sampling characters over a very long period of time also has its flaws...

    Research has shown that when dealing with language, bigger is always better. More data collected over a longer period ends up rounding out issues like words for New Years and elections being more common.

    Sure, you end up with more characters to analyze, but at the end of the day, you end up with a more accurate picture of the language which you can use for understanding which characters (or words) Chinese learners should study.

    If you ever need any tools for helping with future analysis, let me know. I have a bunch of them, and I'd be glad to help out wherever I can.

  8. Dear Steven,

    thank you for your comment.

    I agree that if you have more data, you pretty much solve your 'seasonal' character problem. I am no expert, so these are only my wild guesses and unprofessional reasoning, but at least when it comes to the scope of the data, I would disagree that more is always better. I would agree that it would be better from a purely statistical point of view, but less so from a linguistic point of view and even less so from a student’s point of view.

    For instance, you could include Classical Chinese texts for a more complete analysis, but this way you would only simply bulk up your sample data by adding a new field to it. You would come up with a more accurate mathematical frequency list, but it would not mean that it would be more useful for a linguist or a student. It would only give you a much better overview about which characters are used and how often they are used in a huge variety of texts in absolute measures, but it would also mess up the character frequency order to an extent at which it would not be useful to students anymore.

    For example, in a frequency list that I found online, the character 刊, which in my opinion (empirically) is a relatively unimportant character, was in 535th position. Not that it is unimportant, but it is far less important than, say, 站,空,報,保,錄,維,永,納,演,山,遇,缺,制,勢,千,照 - characters that were found at position 535 and lower in a study that I did on manually selected and edited interview articles. I'm guessing the reason is that a lot of newspapers have 刊 in their name, so this character can be found either at the beginning or at the end of a lot of news articles (as in, for instance, 周杰倫商業週刊報導), and if you sample, say, 50000 news articles, this character will have a high frequency. But for a beginner student (500 characters - beginner), 刊 is an unimportant character. This is only one example out of many others, and so I think that a bulkier absolute frequency list is not something that would make a lot of sense to a beginner/intermediate student of Chinese, and he or she would not find it very useful. What I want to say, I guess, is that more is not always better and that the quantity of the input material is not as important as its quality.

    How do you go about analyzing word frequency in Chinese texts? Do you compare your words to a word database, or do you set up word individuating algorithms? I did my analysis manually and it was terribly exhausting… and while I was doing it I kept wondering how the pros were doing it:) I’d guess that big linguistic departments have their own huge elaborate word databases that they compare sampled texts to and as soon as the program that they run to sample the data matches a part of the text with the database, it counts it as one unit. But it’s only my guess.

    Kind regards

    Vladimir

  9. Hi dear Vladimir,
    My name is Konstantinos, and I am Greek.
    Sorry for choosing Italian as the language to introduce myself in, but it is the foreign language I speak best.
    I find your writing lessons on Chinese very interesting.
    I would like to know whether you have a profile on FACEBOOK. I find it a more direct way of communicating.... If so, could you give me the link to your profile so that I can send you a friend request? Thank you !!!
    So far you have shown us more than 60 different characters... but you told us that you have identified more than 2000 !!! Will you gradually upload all the videos covering the rest of the characters you found ???
    You have really done a wonderful piece of research !!! My compliments !!!!
    I wish you a happy and peaceful new year !

    Konstantinos

  10. Dear Konstantinos,

    Thank you very much for the kind words, and I wish you a happy new year too.

    I am glad that you liked my videos on Chinese characters. I would like to keep making these videos, but I am not sure how many I will be able to make in total. I will do my best:)

    I have a profile on Facebook, but if possible I would prefer to use Skype. If you send me an e-mail with your Skype name, I can add you.

    best regards

    Vladimir

  11. Perfect Vladimir, I'll give you my Skype. Could you write me your personal email address? In any case, I have requested to follow your blog via my email !

    Thanks
