Author |
Ho Li-hsing, Yue Ching-syang, Cheng Wen-huei
Abstract |
Can computers tell whether a text was written in classical Chinese or in vernacular modern Chinese? Can new developments in the digital humanities help trace the transformation of written Chinese during the late Qing and early Republic? As previous scholars have pointed out, in the early stage of the history of modern Chinese, missionaries and reformists used the vernacular only as a tool to enlighten the public. Classical Chinese remained the standard written language until the May Fourth Movement in 1919, when Xin Qing Nian became the most influential publication. Throughout the last century, scholars have scrutinized the theoretical arguments and creative writing practices in Xin Qing Nian and several other progressive magazines to delineate the changing history of the language. But questions such as how long it took for the literati and the general public to adopt the vernacular as the written standard, or how the new standard spread from radical revolutionary magazines to other publications such as entertainment magazines and newspapers, remain unanswered. If we can teach computers to distinguish between classical and modern Chinese, it becomes possible to bring many more digitized texts from that period into the study and to answer those questions. To achieve this goal, we adopt the concept of “genome mapping” to differentiate between classical and modern Chinese. We propose two approaches, supervised learning and unsupervised learning, to compare the writing styles of classical Chinese and modern Chinese. In addition to concepts and methods used in lexical analysis, we also adapt ideas from ecology. Supervised learning has long been used in linguistics to attribute authorship via keywords. We choose ten function words each for classical and modern Chinese as the keywords, and we use Gini’s index for volumes 1 and 11 of Xin Qing Nian to demonstrate the comparison.
There are no standard operating procedures for applying unsupervised learning, which is the main reason this type of approach is difficult to implement. In this study, we choose diversity indices for unsupervised learning, for example Gini’s index, entropy, and Simpson’s index, to measure the statistical dispersion and evenness (or equality) of the words used. Our analyses suggest that the later volumes (such as Volume 11) have lower species diversity, indicating that readers can follow the articles without recognizing many words, which matches the purpose of the May 4th Movement.
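As an illustration of the diversity indices named above, the following minimal Python sketch computes them from a list of tokens. It is not the authors' implementation. It assumes that Simpson's index is the sum of squared relative frequencies, that Gini's index is its complement (the Gini-Simpson form), and that entropy is Shannon entropy with natural logarithms; the abstract does not state which definitions were used. The function name `diversity_indices` and the choice of token unit (characters or segmented words) are likewise hypothetical.

```python
from collections import Counter
import math


def diversity_indices(tokens):
    """Return diversity measures for a sequence of tokens.

    Assumed definitions (not confirmed by the source):
      simpson      = sum(p_i ** 2)        higher means less diverse
      gini_simpson = 1 - simpson          higher means more diverse
      entropy      = -sum(p_i * ln p_i)   higher means more diverse
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("tokens must not be empty")

    # Relative frequency of each distinct token ("species").
    probs = [c / total for c in counts.values()]

    simpson = sum(p * p for p in probs)
    entropy = -sum(p * math.log(p) for p in probs)

    return {
        "types": len(counts),          # number of distinct tokens
        "simpson": simpson,
        "gini_simpson": 1.0 - simpson,
        "entropy": entropy,
    }


if __name__ == "__main__":
    # Toy input only; a real study would pass the tokens of one volume.
    even = diversity_indices(list("abcd"))
    skewed = diversity_indices(list("aaab"))
    print(even["gini_simpson"] > skewed["gini_simpson"])
```

Under these assumptions, a text that concentrates its usage on fewer distinct tokens yields a lower Gini-Simpson value and lower entropy, which is the sense in which a volume can be said to have lower species diversity.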
Keywords |
Stylistic Analysis, May 4th Movement, La Jeunesse, Function Words Analysis, Species Diversity