Forever a student: Chinese character frequency list

Abstract

In this study I analyzed the Chinese characters used in about 60 interview articles from two Taiwanese online magazines. I evaluated the data, produced a character frequency chart and a chart of character knowledge vs. text recognition, estimated the absolute number of characters needed, and compared the results with my previous analyses. I sampled a total of 45 235 characters and found 1865 unique characters in this sample. Based on my calculations, in order to recognize 100% of any given number of interview articles, one needs to know 2084 unique Chinese characters (I use 'to recognize' rather than 'to understand' on purpose throughout the article). Compared with my previous news character analyses, the interview character frequency list contains many more direct speech elements than the news list does, and my calculations indicate that interview articles are easier to read for beginner and intermediate students of Mandarin Chinese than news articles are.

Introduction

In previous posts I analyzed the frequency of words and characters in data that I sampled over a period of 6 weeks from 4 sections of Taiwanese news (please see the Character frequency analysis, Word frequency analysis and Character prediction analysis articles for more information).

In that study I found a total of 2105 unique characters and 5901 unique words in the 80 articles I analyzed, which were drawn from four sections: 國際 (international), 政治 (domestic politics), 社會 (society) and 財經 (economics). I then extended the research to predict the number of unique characters and words in any given number of articles. I found that, for articles from the same four news sections, there would be a total of 2174 unique characters and 8424 unique words, so a person would need to know this many characters and words to recognize 100% of any given number of news articles from these four sections.

In this post I would like to present the results of my analysis of interview articles found in two Taiwanese magazines, Cheers and 天下雜誌. I chose to analyze interviews because they come very close to spoken Mandarin, and the words and expressions used in them differ greatly from the language of the news articles that I analyzed previously.

Research method

The data analysis method and character end ratio prediction method used in this paper are the same as in my previous posts. For more information, please see Character frequency analysis, Word frequency analysis and Character prediction analysis.

My data sampling method was slightly different in this case, since I only had to analyze character frequency and not word frequency. I entered as many interviews as I could find on the websites of 天下雜誌 and Cheers magazine into a text file, cleaned it up by removing any numerals, commas, Roman letters etc., and produced a raw file containing 45 235 characters. This sample is 18.77% larger than the one in the News character frequency analysis (38 085 characters), and I used this ratio where necessary when comparing the results of the two analyses to make up for the difference.
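As a rough illustration, the cleaning and counting step could be sketched in Python as follows. This is not the script used for this post. The function names are made up for the example, and the filter is an assumption: it keeps only characters from the basic CJK Unified Ideographs block, which drops numerals, punctuation and Roman letters in one pass.

```python
from collections import Counter

def is_cjk(ch: str) -> bool:
    """Return True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def character_frequencies(raw_text: str) -> Counter:
    """Drop everything that is not a Chinese character, then count what is left."""
    return Counter(ch for ch in raw_text if is_cjk(ch))

# A made-up sentence standing in for the raw interview file.
sample = "我覺得,2012年的 interview 很有趣。我也覺得很好!"
freq = character_frequencies(sample)

total = sum(freq.values())   # size of the cleaned sample
unique = len(freq)           # number of unique characters in it
print(total, unique)
```

Sorting `freq` by count (for example with `freq.most_common()`) gives a frequency list of the kind presented in the charts.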

Analysis results

* numbers adjusted by 18.77% for comparison with the news article data

The first 100 most frequent interview characters

One thing you can notice right away is that the characters in this chart seem very basic and familiar, even to beginner students. This is to be expected, because the data comes from interview transcriptions, which, even if edited for print, still represent direct speech and spoken Mandarin much better than the Mandarin found in news articles or books does (see 白話 for more information).

Character knowledge Vs. Interview text recognition

In the above chart you can see the relationship between the number of characters you know and the percentage of interview text you recognize. As in my previous analyses, the most frequent characters account for much more of the text than the less frequent ones do. This means that by knowing a relatively small number of characters you are able to recognize a relatively large share of the text, but learning to recognize the remaining 5% of the text will take you about as long as learning to recognize the first 95% did.
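A curve like this can be derived from a frequency list by sorting the counts and accumulating them. The sketch below assumes a `Counter` of character counts as input. The toy counts are invented so that the long tail is easy to see, and `coverage_curve` and `characters_needed` are names chosen for the example.

```python
from collections import Counter
from itertools import accumulate

def coverage_curve(freq: Counter) -> list[float]:
    """curve[n-1] is the share of the text covered by the n most frequent characters."""
    counts = sorted(freq.values(), reverse=True)
    total = sum(counts)
    return [running / total for running in accumulate(counts)]

def characters_needed(curve: list[float], target: float) -> int:
    """Smallest number of top characters whose coverage reaches the target share."""
    for n, share in enumerate(curve, start=1):
        if share >= target:
            return n
    raise ValueError("target share is never reached by this curve")

# Invented counts: a few very common characters and a thin tail.
toy = Counter({"的": 50, "我": 30, "是": 15, "趣": 4, "龜": 1})
curve = coverage_curve(toy)

print(characters_needed(curve, 0.90))  # characters needed for 90% of the toy text
print(characters_needed(curve, 1.00))  # characters needed for all of it
```

In this toy data three characters cover 95% of the text and the last two are needed only for the remaining 5%, which is the same shape as the chart above on a much smaller scale.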

Character prediction chart

Based on the sampled data and using the methods from the Character prediction analysis, I calculated that one would need to know a total of 2084 characters to be able to recognize 100% of any given number of interview articles. The 'Total sample' row represents the total number of characters in my interview article data sample. The 'Unique' row represents the total number of unique characters found in this data sample. The 'Unique estimated' row represents the estimated total number of characters one needs to know in order to recognize 100% of any given number of interview articles. The 'Estimated at' row represents the sample size at which all of the 'Unique estimated' characters would have appeared. This means that, based on my calculations, after having read interview articles containing a total of 82 803 characters, you would not encounter any new unique characters. This estimate is only a mathematical calculation and serves for orientation purposes only. Please see Character prediction analysis for more info.
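The prediction method itself is described in the Character prediction analysis post and is not reproduced here. What can be sketched without it is the observed side of the table: how the number of unique characters grows as more text is read. The function below is a hypothetical helper for that, run on a made-up string, and it does no extrapolation.

```python
def unique_growth(text: str, step: int) -> list[tuple[int, int]]:
    """Return (characters read, unique characters seen) every `step` characters.

    The final position is always included so the last point matches the
    'Total sample' and 'Unique' figures of the whole text.
    """
    seen: set[str] = set()
    points = []
    for position, ch in enumerate(text, start=1):
        seen.add(ch)
        if position % step == 0 or position == len(text):
            points.append((position, len(seen)))
    return points

# Made-up, already cleaned text standing in for the interview sample.
toy_text = "我是學生我是老師我們是朋友"
growth = unique_growth(toy_text, step=4)
print(growth)
```

On a real sample the curve flattens as the text gets longer, and an estimate such as the 'Unique estimated' figure is an attempt to say where it levels off.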

News and Interview data comparison

The most interesting part of this study was the comparison of results between the News article analysis and this analysis. Following are the results from the News character frequency analysis from my previous posts:

The first obvious thing you notice is that only 1867 unique characters were found in the interview data compared to 2105 unique characters in the news data, which is interesting, because the interview sample was 18.77% larger than the news sample. The reason for this will be explained later in this article.

The first 100 most frequent news characters

The first main difference when comparing this table with the interview frequency chart is the lack of typical direct speech elements: the personal pronouns 我 and 你/妳 are missing, as are verbs used for describing feelings and opinions, typical direct speech conjunctions etc.
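A comparison like this can also be made mechanically by listing the characters that appear in one top-100 list but not in the other. The two short lists below are placeholders invented for the example, not the actual top characters from either analysis, and `only_in_first` is a name chosen for the sketch.

```python
def only_in_first(first: list[str], second: list[str]) -> list[str]:
    """Characters ranked in `first` but absent from `second`, kept in rank order."""
    other = set(second)
    return [ch for ch in first if ch not in other]

# Placeholder lists, NOT the real frequency data from this post.
interview_top = ["的", "我", "是", "你", "不"]
news_top = ["的", "在", "是", "國", "不"]

interview_only = only_in_first(interview_top, news_top)
news_only = only_in_first(news_top, interview_top)
print(interview_only, news_only)
```

With the full frequency lists from the download section in place of the placeholders, `interview_only` would show which direct speech characters drop out of the news list.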

As you might notice, the number of times individual characters occurred in the news articles is lower than in the interview articles (e.g. News: 的=737, Interviews: 的=1826). This is partly because I sampled more data in the interview analysis, although for 的 the gap is larger than the 18.77% difference in sample size alone would produce. The influence of sample size on the character frequency order is relatively small, since we're dealing with two completely different sets of data. In other calculations, where the difference would have changed the results significantly, I adjusted the figures by 18.77% (the difference in the amount of sampled data between the two analyses) to make up for it.
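The arithmetic behind the 18.77% figure, and one possible way to compare raw counts across samples of different sizes, can be checked with a few lines of Python. Converting counts to occurrences per 1000 characters is an assumption made for this sketch. It is one common way to normalize, and not necessarily the adjustment applied in the charts.

```python
INTERVIEW_TOTAL = 45_235  # characters in the interview sample
NEWS_TOTAL = 38_085       # characters in the news sample

# How much larger the interview sample is than the news sample.
scale = INTERVIEW_TOTAL / NEWS_TOTAL
print(f"interview sample is {100 * (scale - 1):.2f}% larger")

def per_thousand(count: int, total: int) -> float:
    """Occurrences per 1000 characters, independent of sample size."""
    return 1000 * count / total

# Raw counts for 的 quoted in the text above.
de_news = per_thousand(737, NEWS_TOTAL)
de_interview = per_thousand(1826, INTERVIEW_TOTAL)
print(f"的: news {de_news:.2f}, interviews {de_interview:.2f} per 1000 characters")
```

Expressed this way, a character's figures in the two lists can be compared directly, whatever the size of each sample.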

News and Interview character knowledge Vs. text recognition

Based on the above chart and some other calculations, I found that in the beginning stages of your studies (up to about 1000 known characters) you need to know fewer characters to recognize a given percentage of interview articles in Mandarin than you do for news articles. Later on, at around 1500 characters, this advantage becomes marginal. Since interviews represent direct speech pretty well, one might infer that you would also need to know fewer characters, or better yet fewer morphemes, to speak or write spoken-style Mandarin than to write in news-article-style Mandarin (please see 白話 for more information).

In my study, the 100 most frequent characters accounted for 56% of the sampled interview text, while the 100 most frequent characters in news articles accounted for only 38% of the sampled news text. This means that by knowing the same number of characters in the beginning stages, one will recognize much more of the interview articles than of the news articles.

Based on my calculations, the advantage is greatest at the 87 most frequent characters. By knowing the first 87 characters from the interview frequency list you will be able to recognize 52.54% of the text found in interview articles, while knowing the same number of characters from the news frequency list will only let you recognize 35.37% of the news text, a difference of 17.17 percentage points.
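Finding the point of greatest advantage amounts to subtracting one coverage curve from the other and looking for the largest gap. A minimal sketch, assuming both curves are lists in which position n-1 holds the share of text covered by the top n characters. The two short curves are invented for the example, and `largest_gap` is a hypothetical helper name.

```python
def largest_gap(curve_a: list[float], curve_b: list[float]) -> tuple[int, float]:
    """Return (n, gap): the number of known characters at which curve_a leads most."""
    gaps = [a - b for a, b in zip(curve_a, curve_b)]
    best = max(range(len(gaps)), key=gaps.__getitem__)
    return best + 1, gaps[best]

# Invented coverage shares, NOT the curves from the chart above.
interview_curve = [0.10, 0.25, 0.40, 0.50, 0.58]
news_curve = [0.08, 0.15, 0.22, 0.35, 0.45]

n, gap = largest_gap(interview_curve, news_curve)
print(n, round(gap, 2))

# The difference quoted in the text, in percentage points.
quoted_gap = 52.54 - 35.37
print(round(quoted_gap, 2))
```

Run on the two full coverage curves, the same function would return the position and size of the real maximum.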

While a difference of 17.17 percentage points might not seem like much, and is a little hard to imagine for a concept as abstract as the perceived relative difficulty of two different types of text, I tried to present it in another way: contrasting two shades of blue, one of which is 17.17% less blue than the other, to give you a feel for what a difference of this size looks like in another context.

I know this example is a little far-fetched, but since we're talking about a relative difference in perception (here the perceived difficulty of news articles and interview articles, paralleled with the perception of the same color in two different shades), I thought I could try this parallel and see how it works. The color on the left is 17.17% less blue than the color on the right:

Absolute character estimation comparison

Interview character estimation News character estimation

The above two tables compare the predicted absolute numbers of characters necessary to read any given number of articles (for more information on this research method please see Character prediction analysis). As you can see, the predictions differ by only 90 unique characters.

In order to recognize 100% of any given number of news articles, based on my calculations you would need to know 2174 unique characters and in order to recognize 100% of any given number of interview articles, you would need to know 2084 unique characters.

An interesting thing that I found was that even though the difference between the absolute character predictions is relatively small, based on this study you would still need fewer characters to recognize a much greater amount of data in the case of interviews than in the case of news articles (only 2084 unique characters to recognize 82 803 characters of interview data, compared with 2174 unique characters to recognize only 53 776 characters of news data).

My explanation for this is that, as can be seen in the chart above, the most frequent characters in interviews (about the first 1000) are used much more often throughout the interview articles and thus cover more of the data.
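One way to put a number on this is to divide the 'Estimated at' figure by the 'Unique estimated' figure for each text type, which gives the average amount of text covered per character learned. The ratio is only an illustration added here, not part of the original prediction method. The four input figures are the ones quoted above.

```python
# Figures quoted in the text above.
interview_unique, interview_at = 2084, 82_803
news_unique, news_at = 2174, 53_776

# Difference between the two absolute predictions.
prediction_gap = news_unique - interview_unique
print(prediction_gap)

# Average characters of running text covered per unique character known.
interview_ratio = interview_at / interview_unique
news_ratio = news_at / news_unique
print(f"interviews: {interview_ratio:.2f}, news: {news_ratio:.2f}")
```

By this rough measure each known character covers noticeably more running text in interviews than in news, in line with the explanation above.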

Conclusion

You need to learn fewer characters to read interview articles than to read news articles in the beginner/intermediate stages of your studies.
Interviews are easier to read for beginner/intermediate students.
The estimated absolute number of characters for reading news and interviews is roughly the same (news 2174, interviews 2084). The difference is that you can recognize more data by knowing fewer characters when reading interview articles at the beginner/intermediate stages, because the most frequent characters account for more of the text in interview articles.
The composition of the interview and news frequency charts differs: typical direct speech elements are much less frequent in the news character frequency chart than in the interview frequency chart.
I would recommend reading interviews rather than news to intermediate students.

You can download the full interview character frequency list in the download section of this blog.

November 05, 2012

Chinese character frequency list - Interview articles
