In the last post I analyzed 80 news articles from Taiwan over a period of 6 weeks, provided some basic statistics, and tried to come up with a Chinese character frequency list by counting the occurrence of unique characters in these articles. In this post I would like to write about the word frequency analysis of the same articles.
Research method
I again analyzed the same 80 articles, which were divided into 4 areas: 國際 (international), 政治 (domestic politics), 社會 (society) and 財經 (economics), with 20 articles in each area.
Throughout the word frequency analysis, the biggest problem was actually separating Mandarin words from each other. As I mentioned in the previous post, and as most people who study or speak Chinese know, words are not separated by spaces in Chinese. Counting the occurrence of unique words therefore requires much more work than counting the occurrence of unique characters: unless you want to count word frequency with pen and paper, a computer program has to do the counting, and for the program to know what to count, something has to separate the words from one another. There are fairly complicated computer programs that can do this sort of indexation (word segmentation) for Mandarin automatically, but since I didn't have any of those, I had to do the indexation manually.
Counting the occurrence of unique words in an English article, for instance, would be much easier. Spaces in English texts mark very clearly where a word starts and where it ends, so a computer program can use these spaces as index markers and treat everything between two spaces as a separate word unit. In Mandarin this is unfortunately not possible.
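To make the difference concrete, here is a minimal Python sketch of what a space-based word counter does. It is only an illustration, not the tool I actually used: the function name `count_words` and all three sample strings are made up for this example. It shows that splitting on spaces works for English, fails on unsegmented Chinese, and works again once spaces have been inserted by hand.

```python
from collections import Counter


def count_words(text: str) -> Counter:
    """Count tokens, treating any run of whitespace as a word boundary."""
    return Counter(text.split())


# Hypothetical sample strings, not taken from the analyzed articles.
english = "the cat saw the dog"
unsegmented = "我喜歡學習中文"       # no spaces: the program sees one long token
segmented = "我 喜歡 學習 中文"      # spaces inserted manually between words

english_counts = count_words(english)
unsegmented_counts = count_words(unsegmented)
segmented_counts = count_words(segmented)

print(english_counts)       # "the" is counted twice
print(unsegmented_counts)   # the whole sentence is counted as a single "word"
print(segmented_counts)     # four separate words, each counted once
```

The counting logic itself is identical in all three cases. The only thing that changes is whether the text already contains markers telling the program where one word ends and the next begins.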
Take the following sentence for example: