Abstract
Hello everyone and welcome once again to my never-ending study.
In the last two posts I tried to count the number of unique Chinese
characters and words in Taiwanese news by analyzing 80 news articles from
Taiwan over a period of six weeks. In that study I found a
total of 2105 unique characters and 5901 unique words in the 80 articles I
analyzed, which were separated into four sections: 國際
(international), 政治 (domestic politics), 社會
(society) and 財經 (economics). But as I
said, 80 articles were not enough, so I tried to extend the study. Using the
sampled data, I did some calculations and tried to predict how many
unique characters and words any given number of articles would contain.
I found that there would be a total of 2174 unique characters and 8424 unique
words, so a person would need to know that many characters and words to
recognize 100% of any given number of news articles, provided those articles came
from the same four news sections I analyzed.
Introduction
The main task was to predict how the unique character and unique word charts would evolve and at what point on the y-axis they would stop ascending. The x-axis value corresponding to that point would be the total number of characters a person needs to know in order to recognize 100% of a random news article, as long as it comes from one of the four sampled news sections. As you can see in the following two charts, both have ascending trends, and the Word knowledge chart ends with a sharp ascent that does not seem to approach any particular number.
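To make the idea of a "still ascending" curve a bit more concrete, here is a minimal Python sketch of how a running count of unique characters could be built up article by article. This is not the exact procedure used in the study. The function names (is_cjk, cumulative_unique_chars) and the sample strings are made up for illustration, and it assumes that only code points in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) count as characters.

```python
def is_cjk(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as a
    # "character"; punctuation, digits and Latin letters are ignored.
    return "\u4e00" <= ch <= "\u9fff"


def cumulative_unique_chars(articles):
    """Return the running count of distinct characters after each article."""
    seen = set()
    counts = []
    for text in articles:
        seen.update(ch for ch in text if is_cjk(ch))
        counts.append(len(seen))
    return counts


# Made-up sample strings, not data from the study.
sample = ["國際新聞", "政治新聞", "社會新聞"]
print(cumulative_unique_chars(sample))  # [4, 6, 8]
```

If the last few values of such a list barely change, the curve is flattening out. If they keep growing, as in this tiny sample, the sample is probably still too small to read off a final number.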


