Gender Classification of Japanese Authors
David Edwards & Cybelle Smith
Gendered Speech in Japanese
Gender of speaker may be overtly marked:
• Gender-specific first-person pronouns: 僕 (boku, male); 俺 (ore, male); 私 (watashi, female or neutral)
Question: Does gender have less-overt effects on Japanese texts as well?
Can word choice, morphology, or writing style indicate gender, even in noisy environments like fiction writing?
Corpora
“Peace” Corpus
• 29 personal essays by middle school students
• Topic: “Peace”
• 29 authors: 22 female, 7 male
“Bookstudio” Corpus
• 485 installments of online novels
• Genre: Fantasy
• 40 authors: 20 female, 20 male
• Also collected ~181 installments from authors of unknown gender (for future research)
Classifiers Used
Naïve Bayes:
• Build conditional probabilities of features given gender
• Calculate probability of test data given a particular gender
• Select highest-probability gender
SVM:
• Used LIBSVM, a free SVM classification tool
• Find dividing hyperplane in a space with one dimension per feature
• Requires problem-specific parameters, chosen via cross-validation
• Apply hyperplane to test data
Also attempted Logistic Regression
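The three Naïve Bayes steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the add-one (Laplace) smoothing, the function names, and the toy feature names "x", "y", "z" are all assumptions made for the example, and the training data is invented purely to exercise the code.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """docs: list of feature lists; labels: parallel list of gender labels."""
    class_counts = Counter(labels)           # documents per gender
    feature_counts = defaultdict(Counter)    # feature counts per gender
    vocab = set()
    for features, label in zip(docs, labels):
        feature_counts[label].update(features)
        vocab.update(features)
    return class_counts, feature_counts, vocab

def classify(model, features):
    """Return the label with the highest log-probability for the features."""
    class_counts, feature_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)             # log prior
        total = sum(feature_counts[label].values())
        for f in features:
            if f not in vocab:
                continue                                  # skip unseen features
            # Add-one smoothing (an assumption, not stated on the slide).
            p = (feature_counts[label][f] + 1) / (total + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data with abstract feature names, for illustration only.
toy_docs = [["x", "x", "y"], ["x", "z"], ["y", "y", "z"], ["y", "z", "z"]]
toy_labels = ["female", "female", "male", "male"]
model = train_naive_bayes(toy_docs, toy_labels)
```

Working in log space avoids numeric underflow when a document contributes many features, which matters for long texts such as novel installments.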
ChaSen: Segmenter and POS-tagger
Stem          Pronunciation   Lemma   Part of Speech
(whitespace)                          記号-空白
光            ヒカリ          光      名詞-一般
が            ガ              が      助詞-格助詞-一般
彷徨          ホウコウ        彷徨    名詞-サ変接続
う            ウ              うい    形容詞-自立 形容詞・アウオ段 ガル接続
よう          ヨウ            よう    名詞-接尾-一般
な            ナ              だ      助動詞 特殊・ダ 体言接続
暗き          クラキ          暗い    形容詞-自立
闇            ヤミ            闇      名詞-一般
Features
Stem      Pron      Lemma     POS
暗き      クラキ    暗い      形容詞-自立
KURAki    kuraki    KURAi     adjective - independent
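As a rough sketch of how the four fields above could become classifier features, the snippet below parses one tagged token and emits one feature string per field. It assumes the tagger output is one tab-separated line per token in the column order shown on the slide; that layout, the function names, and the "name=value" feature format are assumptions for illustration, not a description of the authors' pipeline.

```python
def parse_token_line(line):
    """Split one tab-separated tagger line into the four slide fields."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (4 - len(fields))       # pad short lines with empty fields
    stem, pron, lemma, pos = fields[:4]
    return {"stem": stem, "pron": pron, "lemma": lemma, "pos": pos}

def token_features(token):
    """Emit one 'name=value' feature per non-empty field."""
    return [f"{name}={value}" for name, value in token.items() if value]

# The example token from the slide, written as a tab-separated line.
token = parse_token_line("暗き\tクラキ\t暗い\t形容詞-自立")
features = token_features(token)
```

Keeping each field as its own feature type makes it possible to evaluate stem, pronunciation, lemma, and part of speech separately or in combination.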
Features
私: Kanji (Chinese character)
わたし: Hiragana (phonetic)
ワタシ: Katakana (phonetic, like italics)
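A script-based feature like this can be computed from Unicode code points alone. The sketch below labels each character by its Unicode block (Hiragana, Katakana, CJK Unified Ideographs) and collapses runs of the same script into a shape string. The function names and the "+"-joined shape format are assumptions for illustration; the slides do not say how the authors encoded this feature.

```python
def char_script(ch):
    """Classify one character by its Unicode block."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:       # CJK Unified Ideographs
        return "kanji"
    return "other"

def word_shape(token):
    """Collapse consecutive characters of the same script into one label."""
    shape = []
    for ch in token:
        script = char_script(ch)
        if not shape or shape[-1] != script:
            shape.append(script)
    return "+".join(shape)

# The three spellings of "watashi" from the slide, plus a mixed-script word.
shapes = {w: word_shape(w) for w in ["私", "わたし", "ワタシ", "暗き"]}
```

Because all three spellings share one pronunciation and lemma, only a shape feature of this kind can tell them apart.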
[Figure: Single-feature performance on Naïve Bayes]
[Figure: Multi-feature performance on Naïve Bayes]
SVM Performance
Optimizations:
• Scaling counts to avoid swamping low-frequency features
• Selecting optimal error rate and kernel parameters
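One common way to keep high-count features from swamping rare ones is linear per-feature scaling, which is what LIBSVM's svm-scale tool performs. Whether the authors used that tool, and the [0, 1] target range used here, are assumptions; the function names are hypothetical and the count vectors are invented for the example.

```python
def fit_scaler(rows):
    """rows: equal-length count vectors. Returns per-feature (min, max)."""
    columns = zip(*rows)
    return [(min(col), max(col)) for col in columns]

def scale(row, ranges):
    """Map each count linearly into [0, 1] using the training-set range."""
    scaled = []
    for value, (lo, hi) in zip(row, ranges):
        # A constant feature carries no information; map it to 0.0.
        scaled.append(0.0 if hi == lo else (value - lo) / (hi - lo))
    return scaled

# Toy counts: a frequent feature (column 0) and a rare one (column 1).
train = [[120, 0], [80, 2], [200, 1]]
ranges = fit_scaler(train)
scaled_train = [scale(row, ranges) for row in train]
```

The ranges are fitted on training data only and then reused for test data, so both sets are mapped with the same transformation.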
Conclusion
• Performance was similar even when gendered pronouns were not considered
• Most-indicative feature: word shape (use of kanji vs. hiragana vs. katakana, etc.), especially where multiple options exist
• Point of interest: male and female Japanese authors differ not just in the words they use, but in how they choose to write those words