Monday, March 19, 2012

Amount of characters and words necessary to read news articles

Abstract
Hello everyone and welcome back to my never-ending study. In the last two posts I tried to count the number of unique Chinese characters and words in Taiwanese news by analyzing 80 news articles from Taiwan over a period of six weeks. I found a total of 2105 unique characters and 5901 unique words in the 80 articles, which were split into four sections: 國際 (international), 政治 (domestic politics), 社會 (society) and 財經 (economics). As I said, though, 80 articles was not enough, so I tried to extend the study. Using the sampled data I did some calculations and tried to predict what the number of unique characters and words would be in any given number of articles. I found that there would be a total of 2174 unique characters and 8424 unique words, so a person would need to know this many characters and words to recognize 100% of any given number of news articles, provided these articles came from the same 4 news sections I analyzed.

Introduction   

The main task was to predict how the unique character and word charts would evolve and at what point on the y-axis they would stop ascending. The x-axis value corresponding to that point would be the total number of characters a person needs to know in order to recognize 100% of a random news article, as long as it came from one of the 4 sampled news sections. As you can see from the following two charts, both have ascending trends, and the Word knowledge chart ends in a sharp ascent that does not seem to approach any particular number.




I therefore looked at the source data again and, with the help of OpenOffice Calc, tried to calculate at what number the two charts would 'stall' (the mathematical term for this is an asymptote; I only vaguely remembered the Slovak word for it, 'asymptota') and what the total number of unique Mandarin characters and words in any given number of Mandarin articles would be.


Research method

Since my research method was the same for both the Character and the Word prediction analyses, I will only describe the former. I noticed that in the last 50% of the character chart there were small fluctuations that could help determine the trend of the curve.



As you can see these fluctuations are too small in order to make a reasonable prediction possible. I had to come up with a way to augment these fluctuations and in order to do that, I had to choose a completely different approach, which turned out to be a pretty complicated thing to do.

The first two charts in this article only show that by knowing, for instance, the 800 most frequent characters, one will be able to recognize about 90% of the 80 analyzed articles. To calculate the trend for an X number of articles, and since I could only work with the data I had at hand, I had to turn the whole study around: calculate how many unique characters there would have been had I only analyzed 40, 45 or 50 articles, and try to calculate a reasonable prediction based on that.

Since I had the data already processed and all in one file, I no longer knew where one article finished and the next one started. Because the articles were also of different sizes, I chose to work with percentages of the total data rather than article by article, since I thought it would be more precise.

What I basically did was take all the sampled data (80 articles) that I had used for the word analysis (the text file where I manually put each word onto a new line), put it into the first column in OpenOffice Calc (the Microsoft Excel equivalent), work out exactly 50% of that amount and run my friend's program to tell me how many unique characters were in that 50% of the data. In the next column I put 55% of the total data, ran the program again and got the number of unique characters for 55% of the data. In the third column I put 60% of the data and ran the program again, and so on until I got to 100%. What I had now was the unique character count over the last 50% of the data from the original 80 articles, in 5% steps.
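The procedure above can be sketched in a few lines of Python. This is only an illustration, not my friend's program: the function name `unique_counts_by_prefix` and the toy data are made up, and the real input would be the full character stream of the 80 articles.

```python
def unique_counts_by_prefix(tokens, percents):
    """For each percentage, count the distinct items in that leading share of the data."""
    results = {}
    for pct in percents:
        cutoff = round(len(tokens) * pct / 100)  # how many tokens fall inside this share
        results[pct] = len(set(tokens[:cutoff]))
    return results


# Toy data standing in for the real character stream (hypothetical).
toy_tokens = list("abcdeabcdfabcdgabcdh")
toy_result = unique_counts_by_prefix(toy_tokens, [50, 75, 100])
print(toy_result)
```

With the real data, `percents` would be `range(50, 101, 5)` to get the eleven columns of the table.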

Percent of data 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
Unique characters 1750 1813 1850 1892 1928 1966 2000 2013 2058 2077 2105


I now looked at the increases in these unique occurrences as the amount of data grew and tried to figure out how to calculate the trend past the 100% mark. As you can see, the number of unique characters was increasing by amounts that were, in general, getting smaller and smaller with each additional 5%. To my great delight, OpenOffice Calc has a function that can calculate trends for you, and although it took me a while to figure out exactly how to do it, I managed to produce the following chart.


I assigned the value of '1' to 55% (since the value of 0 represented 50%), '2' to 60%, '3' to 65% etc., with the value of '10' representing 100%. I plotted these values on the x-axis. On the y-axis I plotted the incremental number of unique characters corresponding to the increasing amount of data. This means, for instance, that at value 1, which is 55% of the data, there were 63 new unique characters compared to the previous 50% of the data.
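The increments plotted on the y-axis follow directly from the table above. A small Python sketch of that step (variable names are mine, the numbers are the ones from the table):

```python
# Unique character counts at 50%, 55%, ..., 100% of the data (from the table above).
percents = list(range(50, 101, 5))
unique_chars = [1750, 1813, 1850, 1892, 1928, 1966, 2000, 2013, 2058, 2077, 2105]

# Increment at step k = unique characters gained compared with the previous 5% step.
increments = [later - earlier for earlier, later in zip(unique_chars, unique_chars[1:])]
steps = list(range(1, len(increments) + 1))  # 1 stands for 55%, 10 stands for 100%

for step, pct, inc in zip(steps, percents[1:], increments):
    print(step, f"{pct}%", inc)
```

The first line printed is step 1 at 55% with 63 new characters, which is the example mentioned above.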

Then I simply continued past the value of 10, which was equal to 100% of the original data and let OpenOffice calculate the trend for me. Everything you see on the chart that goes past the value of  10 on the x-axis is the trend calculated by OpenOffice Calc. As you can see on the chart, the graph hit zero at the value slightly over 17 (17,24 to be precise) which is equal to 141.2%, which means that at this point there would be no more incremental unique characters occurring regardless of whether the amount of data would continue increasing or not. 
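For anyone who wants to redo this without a spreadsheet, here is a minimal Python sketch. It assumes the trend was an ordinary least-squares straight line through the ten increments; that is my assumption about what the trend function did, and the helper `fit_line` is a made-up name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


unique_chars = [1750, 1813, 1850, 1892, 1928, 1966, 2000, 2013, 2058, 2077, 2105]
increments = [b - a for a, b in zip(unique_chars, unique_chars[1:])]
steps = list(range(1, 11))  # 1 = 55% of the data, 10 = 100%

slope, intercept = fit_line(steps, increments)
zero_at = -intercept / slope  # x value where the fitted line reaches zero
print(round(slope, 2), round(intercept, 2), round(zero_at, 2))  # -3.02 52.13 17.24
```

Under that assumption the line crosses zero at about 17.24 on the x-axis, the same value the chart shows.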

Analysis results 

Percent of data 100% 105% 110% 115% 120% 125% 130% 140% 141%
Incremental unique characters 0 18,87 15,84 12,82 9,79 6,77 3,75 0,72 0


The number 18,87 in the table means, for instance, that had I sampled 105% of the data instead of the original 100%, there would have been an additional 18,87 unique characters in it. All that was left to do was to take the original number of unique characters found in 100% of the original articles (2105) and add the sum of the predicted increases to it. The total predicted number I arrived at was 2174. This means that, according to my data and trend calculations, if you were to keep reading only these 4 news sections from Taiwan, you would not find more than 2174 unique characters in them, no matter how many articles you read. I can thus say that, based on my data and trend calculations, one needs to know 2174 characters in order to be able to recognize 100% of any given number of news articles, provided that they come from the 4 news sections I analyzed.
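The adding-up step can be checked with a short Python sketch. As before, it assumes a least-squares straight line through the ten observed increments (my assumption about the spreadsheet trend), and all variable names are made up for the example.

```python
unique_chars = [1750, 1813, 1850, 1892, 1928, 1966, 2000, 2013, 2058, 2077, 2105]
increments = [b - a for a, b in zip(unique_chars, unique_chars[1:])]
steps = list(range(1, 11))  # 1 = 55% of the data, 10 = 100%

# Least-squares straight line through (step, increment).
mean_x = sum(steps) / len(steps)
mean_y = sum(increments) / len(increments)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(steps, increments))
         / sum((x - mean_x) ** 2 for x in steps))
intercept = mean_y - slope * mean_x

# Walk past step 10 (100% of the data) while the predicted increment is still positive.
predicted = []
x = 11
while intercept + slope * x > 0:
    predicted.append(intercept + slope * x)
    x += 1

estimated_total = unique_chars[-1] + sum(predicted)
print([round(p, 2) for p in predicted])
print(round(estimated_total))  # 2174
```

The predicted increments start at about 18.87 and shrink by roughly 3 per step, and their sum on top of 2105 rounds to 2174.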

Only for reference, the number of total characters at which the predicted graph hit zero, and thus the number of characters after which no more new unique characters occurred was 53 776 (141.2% of the original total amount of characters in the 80 sampled articles), which roughly corresponds to 113 news articles. This would mean, that after reading 113 news articles in the 4 sections I analyzed one would not come across any new characters.

Word prediction chart


I did the same trend calculations with the Unique word occurrence chart as I did with the Character prediction chart, but as I mentioned before, because of the 5% error margin, the number that I came up with is really just a very rough estimate. On top of that, it is evident from the Word knowledge Vs. Text recognition chart at the beginning of this article that the curve was still in sharp ascent at the end of the data and could develop in a lot of unpredictable ways, so the following chart is really only a very rough estimate.



After doing the calculations and plotting the data on the chart, I found that there would be a total of 8424 unique words that one would need to know in order to recognize all words in any given number of news articles in the 4 sections of Taiwanese news I analyzed. The predicted trend chart hit zero at the 36,54 value, which corresponds to 235,69% of the original sampled data or 49 705 words. These would be found in 189 articles, which means that according to my trend predictions, you would not encounter any new words after having read 189 articles.

Conclusion

Below is a table with the end results of the entire study. Even though the number of articles I sampled was really small and the predictions I made were only estimates, in my opinion the results were quite precise, at least for the unique character calculations. To my great surprise, it really seems that one does not need to know more than 2174 unique characters in order to recognize 100% of any number of news articles, as long as they come from one of the 4 news sections I analyzed.



                          Characters   Words
Original articles         80           80
Total amount              38085        21089
Unique                    2105         5901
Unique estimated          2174         8424
Estimated at              53776        49705
Estimated at (articles)   113          189


My guess is that the number of unique characters necessary to read the news is so small because many of these characters would be absent or rare in books, while books in turn contain characters that are absent or rare in news articles. Another interesting conclusion is that you would need to know almost 4 times as many unique words as unique characters, which again shows that the number of unique words one knows is more important than the number of unique characters.

Finally, since this was only a study of text recognition, in the future I would also like to do a study on text understanding and come up with a number of unique characters and words one would need to know in order to understand news articles, books or MSN chats. If everything goes well, I would like to do similar analyses of book texts and real life speech.

Tuesday, March 6, 2012

Chinese word frequency list - News

In the last post I analyzed 80 news articles from Taiwan over the period of 6 weeks, provided some basic statistics and tried to come up with a Chinese characters frequency list, by counting the occurrence of unique characters in these articles. In this post I would like to write about the word frequency analysis of these articles. 

Research method

I again analyzed the same 80 articles which were divided into 4 areas: 國際 (international), 政治 (domestic politics), 社會 (society) and 財經 (economics) with 20 articles in each area.

During the whole word frequency analysis, the biggest problem was actually separating Mandarin words from each other. As I mentioned in the previous post, and as most people studying or speaking Chinese know, words are not separated by spaces in Chinese. Counting unique words therefore requires much more work than counting unique characters: unless you want to count word frequency with pen and paper, you need a computer program to do the work for you, and for the program to know what to count there has to be something that separates words from one another. There are fairly complicated computer programs that can do this sort of indexation for Mandarin automatically, but since I didn't have any of those, I had to do the indexation manually.

In order to count the occurrence of unique words in an English article for instance, the process would be much easier, because spaces between words in English texts mark very clearly where a word starts and where a word ends and a computer program can thus use these spaces as index markers to count words and consider everything in between those spaces to be separate word units. In Mandarin this is unfortunately not possible.

Take the following sentence for example:

伊拉克西部兩個警察辦公室遭到攻擊.
Two police stations in a western part of Iraq were attacked.

There is a total of 8 words in the Mandarin sentence, but as you can see, none of the words are separated by spaces, so simple indexation based on spaces between words is not possible. As I said, since I didn't have any program to index Mandarin words for me, I had to do it manually. What I did was put my 80 articles into a text editor, read them and put each word onto a new line in the document. Since I had to press Enter every time I did that, each word was indexed with the 'enter' symbol. It was very time consuming, but I didn't come up with anything more intelligent.
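To make the difference concrete, here is a tiny Python sketch using the example sentence above. It only shows what splitting on spaces does to each sentence; it is an illustration, not part of the study.

```python
english = "Two police stations in a western part of Iraq were attacked."
mandarin = "伊拉克西部兩個警察辦公室遭到攻擊."

# Splitting on whitespace works as a rough word separator for English...
english_tokens = english.split()
# ...but the Mandarin sentence contains no spaces, so it comes back as one single chunk.
mandarin_tokens = mandarin.split()

print(len(english_tokens), english_tokens)
print(len(mandarin_tokens), mandarin_tokens)
```

The English sentence splits into 11 pieces, while the Mandarin sentence stays in one piece even though it contains 8 words.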

After I did that, I used the same program my friend wrote for counting unique characters to count the unique words, which were now indexed by the return symbol. I put the data the program produced into OpenOffice Calc and started to do some statistical work.
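I can't share my friend's program, but counting words in a one-word-per-line file can be sketched in Python roughly like this. The function name `count_words` is made up, and the sample is the example sentence split the way I would split it under my rules, with 警察 added once more at the end just to show a repeated word.

```python
from collections import Counter


def count_words(segmented_text):
    """Count how often each word occurs in text that has one word per line."""
    words = [line.strip() for line in segmented_text.splitlines() if line.strip()]
    return Counter(words)


# Hypothetical sample: the example sentence, one word per line, plus one repeated word.
sample = "伊拉克\n西部\n兩\n個\n警察\n辦公室\n遭到\n攻擊\n警察\n"
counts = count_words(sample)

total_words = sum(counts.values())
unique_words = len(counts)
print(total_words, unique_words)
print(counts.most_common(3))
```

For a real file, the text would be read with something like `open(path, encoding="utf-8").read()`, where `path` is wherever the segmented articles are stored.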

The most questionable part of the whole process was actually the word indexation itself because I subjectively decided what should and what should not be indexed as a unique word. Often I was wondering for instance, whether or not to further separate words, whether to index 遭到 as a unique word or index 遭 and 到 separately. I kept changing my own 'word separation rules' as I was going through the articles and even though I always tried to reedit the entire data file with every change I made, there most probably are some things that I overlooked. 

Here are some of the rules that I applied for word separation:

  1. I counted every numeral as a separate word, regardless of whether it was a cardinal number, ordinal number or a part of a larger number. (i.e.:第三百六十二 = 第+三+百+六+十+二)
  2. I counted every measure word separately from its numeral (i.e.: 一個 = 一 + 個; 一些 = 一 + 些)
  3. I counted the ordinal prefix 第 separately (i.e.:第一 = 第 + 一)
  4. I counted 的 separately (他的 = 他+的)
  5. I counted every word in a name or compound that could be used as a separate word separately. (i.e.: 中央廣播電台 = 中央 + 廣播 + 電台.  中央廣播電台 means Central broadcasting station, which is the official name of the station, could be used as one word, but since  中央 - central, 廣播 - to broadcast and 電台 - station can be used separately, I decided to count them separately)
  6. I didn't count the word city (市) or county (縣) in a name of a city or county separately (i.e.:台北市 = 台北市;台北縣 = 台北縣)
  7. I separated expressions such as 在開會前 in the following way: 在前 + 開會
  8. Where 上, 下 acted as verbs, I counted them as separate words
  9. Where 上,下 were parts of constructions such as: 在路上 I separated them in the following way: 在上 + 路
  10. I counted 上 or 一 used in fixed expressions such as 馬上 or 一樣 as one word
  11. I counted personal names, names of places, names of foreign organisations etc. as separate words (i.e.: 馬英九 = 馬英九) unless rule 5 could be applied.
  12. I counted ad hoc abbreviations as separate words (if 美牛 (originally a non-existent word) was used to represent 美國牛肉, I counted 美牛 as a separate word, and did not further separate it as 美 + 牛)

Data analysis


This is the basic word count table, showing the total number of words and unique number of words occurring in the four analyzed sections, as well as their total count.

As you can see, I found a total of 5901 unique words in these 80 articles. The first conclusion is that there were more unique words in these 80 articles (5901) than there were unique characters (2105 - please see last post for more information). This is simply due to the fact that the same character can be a part of several different words and there are therefore more unique character combinations (words) than there are unique characters.

Another fairly intuitive conclusion is that the total number of words (21089) was lower than the total number of characters (38085) in the articles. This is because, as mentioned before, a lot of words are built up of 2 or more characters, so there were fewer words than characters in these 80 articles in total.

A third conclusion supports the argument that the number of characters you know is less important than the number of words you know. As I mentioned in my previous post, characters represent first and foremost morphemes and not words. It is true that there are many single character words, but most Mandarin words are made up of two or more characters. Based on my character frequency count, I found that there were only 2105 unique characters in the 80 articles I analyzed. One could say that learning 2105 characters is not such a terribly hard thing to do, but as I said, characters don't matter that much: I found a total of 5901 unique words in the articles I analyzed, which is almost 3 times as many as the unique characters I found.

It is true, that once you know a lot of characters and what meaning they usually have in multiple character words it is easier to guess the meaning of a multiple character word that you've never seen before, if this word consists of characters that you already know. If for instance you know the character for ice 冰 and the character for box 箱 and you see the word 冰箱, which means refrigerator, you will probably guess correctly its meaning. You might wonder if it's a freezer or just a refrigerator, or some sort of a cooling box for food, but looking at the context in which this word appeared, it should be fairly easy to guess what it means. This however is not necessarily always the case. If you know the characters for electricity - 電 and to look - 視 and see the word 電視, even after looking at the context it appeared in you might have a hard time guessing that it means television. 

These are all only very simple examples. Mandarin words and sentences are sometimes ridiculously complicated, or are of foreign origin and simply cannot be guessed. On top of that, while reading news, if you are a translator for instance, a rough guess is often not enough. Needless to say, that as with any language that you learn, knowing every word and every character in a given sentence does not necessarily mean  that you will understand the sentence as a whole.

After plotting the data onto the x and y axes, the Word knowledge Vs. Text recognition looked something like this:

This chart is similar to the character frequency chart, but as you can see, there is quite a difference  on the x axis at around the 50% point, where the chart sort of takes off in a linear fashion. The reason for this is that this is the point on the chart where very rare words start to occur (words that occurred only once in the sampled 80 articles) and the chart thus turns into a linear ascending line. 

Again, as with the character analysis, if you look at this chart, you can see, that by knowing the first 50% of the most frequent words in the data I sampled, you should be able to recognize around 85% of the text I sampled but as I mentioned in my earlier post, even though 85% is a relatively high amount, should you come across a word from the remaining 15% of the chart (which you will), it is highly probable that you will not recognize it. 15% might not sound like a lot, but if you even it out, it actually accounts for almost every 6th word and this every 6th word might be evenly distributed across the whole article you read or could be jammed into the most important sentence in the article.  
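The chart itself is just a running total over the frequency list. Here is a minimal Python sketch of how such a coverage curve can be computed; the function name `coverage_curve` and the toy counts are made up for the illustration and are not my real data.

```python
from collections import Counter


def coverage_curve(counts):
    """Return (words known, share of text recognized) for the most frequent words first."""
    total = sum(counts.values())
    covered = 0
    curve = []
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq  # knowing one more word covers all of its occurrences
        curve.append((rank, covered / total))
    return curve


# Toy frequency list (hypothetical numbers).
toy_counts = Counter({"的": 5, "表示": 3, "台灣": 1, "警方": 1})
curve = coverage_curve(toy_counts)
for known, share in curve:
    print(known, f"{share:.0%}")
```

In the toy example, knowing the 2 most frequent words out of 4 (50% of the unique words) already covers 80% of the text, and each of the remaining words adds only a little, which is the same shape as in my chart.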
For reference, as with the character analysis, I also made a rare word occurrence table, just to give you a sense of how much of the total number of unique words the rare words (words that occurred only once, twice or three times) account for:



As you can see, this number is huge. 80% of all unique words that occurred in the articles I sampled occurred only once, twice or three times, while there were only 35 words in total that occurred more than 50 times. Rare words thus accounted for more than 80% of the unique words in the text I sampled. This sadly means that you simply have to learn a very large number of words to be able to recognize a large part of a text. Another way to look at it is that in order to cover the remaining 30% of the text (roughly the share that the rare words account for) you will have to learn the remaining 80% of unique words, which is 4729 words in our case, and each new word you learn will cover only slightly more of the text, since each of them appeared only once, twice or three times in the texts I sampled.
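The two shares in this paragraph (share of unique words versus share of the text) are easy to mix up, so here is a small Python sketch of how both can be computed from a frequency list. The function name `rare_word_share` and the toy counts are made up; they are not my real numbers.

```python
from collections import Counter


def rare_word_share(counts, max_freq=3):
    """Share of unique words, and share of the running text, made up of rare words."""
    rare = {word: freq for word, freq in counts.items() if freq <= max_freq}
    unique_share = len(rare) / len(counts)
    text_share = sum(rare.values()) / sum(counts.values())
    return unique_share, text_share


# Toy frequency list (hypothetical numbers).
toy_counts = Counter({"的": 12, "表示": 5, "台灣": 2, "警方": 1, "美牛": 1})
unique_share, text_share = rare_word_share(toy_counts)
print(f"{unique_share:.0%} of unique words, {text_share:.0%} of the text")
```

In the toy list the rare words are 3 of the 5 unique words but only 4 of the 21 words of text, the same kind of gap as the 80% versus roughly 30% in my data.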

Maybe another way to look at it would be, that should you substitute every rare word with a blank space, 30% of the articles I analyzed would be made up of blank spaces and in order to fill them up, you would need to learn 4 times as many words as you already know. 

Again for reference I am posting the list of the 80 most frequent words that occurred in the articles I sampled


For reference, here is the list of the most frequent words of two or more characters:

表示 - to say, to express
中央社 - Central news agency
報導 - to report, report
經濟 - economy
總統 - president
台北 - Taipei
下午 - afternoon
台灣 - Taiwan
大陸 - Mainland China
美國 - USA
上午 - morning
指出 - to point out, to say
警方 - police, police officials 
中國 - China
記者 - reporter
政府 - government
今天 - today
今年 - this year
國民黨 - Kuomintang political party
馬總統 - President Ma Ying-jeou
歐巴馬 - Barack Obama
國家 - country
投資 - to invest

You can download the full word frequency list in the download section of this blog.

End of Part II. Since I based my study only on 80 articles, in Part III, I will try to make some estimations about how many characters and words one should know to be able to read news in general.