March 19, 2012

Amount of characters and words necessary to read news articles

Abstract

Hello everyone and welcome to my never ending study again. In the last two posts I was trying to count the number of unique Chinese characters and words in Taiwanese news by analyzing 80 news articles from Taiwan over the period of six weeks. In my study I found there there was a total of 2105 unique characters and 5901 unique words in the 80 articles I analyzed which were separated into four sections: 國際 (international), 政治 (domestic politics), 社會 (society) and 財經 (economics), but as I said, 80 articles was not enough and I tried to extend the study. Using the sampled data I did some calculations and tired to predict what the number of unique characters and words in any given number of articles would be. I found that there would be a total of 2174 unique characters and 8424 unique words and a person would thus need to know this many characters and words to recognize 100% of any given number of news articles, if these news articles were from the same 4 news sections I analyzed.

Introduction

The main task was to predict what the evolution of the unique character and word charts would be and at what point on the y-axis they'd stop ascending. The corresponding x-axis value to that point would be the total amount of characters necessary for a person to know in order to recognize 100% of a random news article as long as it would be from one of the 4 sampled news sections. As you can see by looking at the following two charts, both of them have ascending trends with the Word knowledge chart having a sharply ascending ending with seemingly no approximation to any number.