In this paper, we start modestly, by attempting to derive just the gender of the authors automatically, purely on the basis of the content of their tweets, using author profiling techniques. For our experiment, we selected 600 authors for whom we were able to determine with a high degree of certainty a) that they were human individuals and b) what gender they were.
However, as with any collection that is harvested automatically, its usability is reduced by a lack of reliable metadata. In this case, the Twitter profiles of the authors are available, but these consist of freeform text rather than fixed information fields.
[...] 2006)), containing about 700,000 posts to [...] (in total about 140 million words) by almost 20,000 bloggers. For each blogger, metadata is present, including the blogger's self-provided gender, age, industry and astrological sign. The creators themselves used it for various classification tasks, including gender recognition (Koppel et al. [...]). Slightly more information seems to be coming from content (75.1% accuracy) than from style (72.0% accuracy). We see the women focusing on personal matters, leading to important content words like love and boyfriend, and important style words like I and other personal pronouns. The men, on the other hand, seem to be more interested in computers, leading to important content words like software and game, and correspondingly more determiners and prepositions.
Two other machine learning systems, Linguistic Profiling and TiMBL, come close to this result, at least when the input is first preprocessed with PCA.
Introduction
In the Netherlands, we have a rather unique resource in the form of the TwiNL data set: a daily updated collection that probably contains at least 30% of the Dutch public tweet production since 2011 (Tjong Kim Sang and van den Bosch 2013).
These statistics are derived from the users' profile information by way of some heuristics. For gender, the system checks the profile for about 150 common male and 150 common female first names, as well as for gender-related words, such as father, mother, wife and husband.
We also varied the recognition features provided to the techniques, using both character and token n-grams.
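To make the notion of character and token n-gram features concrete, the following minimal Python sketch counts both kinds of n-grams for a single text. It is an illustration only and not the feature extraction used in the experiments: the function names, the whitespace tokenization, the lowercasing, the chosen n-gram orders and the Dutch example sentence are all assumptions made for this sketch.

```python
from collections import Counter


def token_ngrams(text, n):
    """Count token n-grams, using naive whitespace tokenization (an assumption)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text, n):
    """Count character n-grams over the lowercased text, spaces included."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def extract_features(text, token_orders=(1, 2), char_orders=(3,)):
    """Merge token and character n-gram counts into one feature dictionary.

    The prefixes 'tok<n>:' and 'chr<n>:' are hypothetical naming choices that
    keep the two feature families apart.
    """
    features = Counter()
    for n in token_orders:
        for gram, count in token_ngrams(text, n).items():
            features["tok%d:%s" % (n, " ".join(gram))] += count
    for n in char_orders:
        for gram, count in char_ngrams(text, n).items():
            features["chr%d:%s" % (n, gram)] += count
    return features


# Hypothetical example text; a real run would use all tweets of one author.
example = extract_features("Ik hou van jou")
```

Feature dictionaries of this kind, one per author, could then be passed to any of the recognition techniques, optionally after a dimensionality reduction step such as PCA.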
For all techniques and features, we ran the same 5-fold cross-validation experiments in order to determine how well they could be used to distinguish between male and female authors of tweets.
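The cross-validation procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the experimental code: the authors are shuffled with a fixed seed and dealt into five folds, the folds are not balanced by gender, and `train_fn` and `predict_fn` are hypothetical placeholders for whichever recognition technique is being evaluated.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists so that every item is held out exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same folds for every technique
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(authors, labels, train_fn, predict_fn, k=5):
    """Return the accuracy over all held-out authors.

    train_fn(train_authors, train_labels) -> model    (hypothetical interface)
    predict_fn(model, author) -> predicted label      (hypothetical interface)
    """
    correct = 0
    for train, test in k_fold_splits(range(len(authors)), k):
        model = train_fn([authors[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_fn(model, authors[i]) == labels[i] for i in test)
    return correct / len(authors)
```

With 600 authors and k=5, each fold holds out 120 authors and trains on the remaining 480. Reusing the same seed for every technique and feature type keeps the folds identical, which is what makes the resulting accuracies comparable.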
The authors do not report the set of slang words, but the non-dictionary words appear to be more related to style than to content, showing that purely linguistic behaviour can contribute information for gender recognition as well.