In text analysis, of which I am a student, a few concepts sit at the forefront of the field. Stylometrics is one of them. Stylometrics is the study of linguistic markers in written language; in text analysis it is most often used for author attribution. The idea is that the high-frequency words (the, of, and, at) carry a sort of unconscious “signal” for the writer, because the author is most likely not thinking about how often they use those words as they write.
This was one of the first concepts I learned in my studies with Dr. Matthew Jockers at UNL, and I feel like it is one of the most fun places to begin. I hope this brief example helps start some small adventure into text analysis. (We start from where we left off in “creating a list object”.)
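For readers who do not have the earlier post at hand, here is a minimal sketch of how a vector like un.lowercase.novel might be built. The toy sentence and the intermediate names (toy.text.v, lowercase.text.v, novel.words.l) are assumptions for illustration only, and the earlier post may have done this differently.

```r
# Hypothetical stand-in text; the original example works from a full novel.
toy.text.v <- "The whale and the sea. The ship, the crew, and the captain of the ship."

# Lowercase everything so "The" and "the" count as the same word.
lowercase.text.v <- tolower(toy.text.v)

# Split on every non-word character; strsplit() returns a list.
novel.words.l <- strsplit(lowercase.text.v, "\\W")

# Flatten the list into a plain character vector.
un.lowercase.novel <- unlist(novel.words.l)

# Adjacent punctuation and spaces leave empty strings ("") behind.
un.lowercase.novel
```

Splitting on a single non-word character is what leaves blank elements in the vector, which is why the next step filters them out.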
Find the positions that are not blank, then view the first ten to verify them:
not.blanks.v <- which(un.lowercase.novel != "")
not.blanks.v[1:10]
Keep only the non-blank elements, then check your work by counting the unique words with length:
words.v <- un.lowercase.novel[not.blanks.v]
length(unique(words.v))
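It may help to see what that length check is counting. The sketch below uses a hypothetical toy vector (toy.words.v, not part of the original example) to contrast the total number of words with the number of distinct words.

```r
# Hypothetical toy vector standing in for words.v.
toy.words.v <- c("the", "whale", "and", "the", "sea", "the", "ship")

# Every occurrence counts once (often called tokens).
n.tokens <- length(toy.words.v)

# Each distinct word counts once (often called types).
n.types <- length(unique(toy.words.v))

n.tokens
n.types
```

Here "the" appears three times but contributes only one distinct word, so the second number is smaller than the first.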
Table the results and sort them in decreasing order (set decreasing=FALSE to see the least frequent words first):
word.freqs.t <- table(words.v)
sorted.freqs.t <- sort(word.freqs.t, decreasing=TRUE)
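Raw counts depend on how long the text is, so one common way to compare texts is to convert counts to percentages of all words. This is a minimal sketch on a hypothetical toy vector (toy.words.v and the other toy.* names are assumptions, not part of the original example).

```r
# Hypothetical toy vector standing in for words.v.
toy.words.v <- c("the", "whale", "and", "the", "sea",
                 "the", "ship", "of", "the", "sea")

# Count and sort, as in the step above.
toy.freqs.t <- sort(table(toy.words.v), decreasing = TRUE)

# Express each count as a percentage of all words in the text.
toy.rel.freqs.t <- 100 * toy.freqs.t / sum(toy.freqs.t)

# Look up a few high-frequency function words by name.
toy.rel.freqs.t[c("the", "of", "and")]
```

On a real novel, these percentages for words like "the", "of", and "and" are the kind of numbers one would compare across authors.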
Grab the top ten in the list and plot!
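nums.v <- sorted.freqs.t[1:10]
# as.numeric() drops the table class so plot() draws a plain point-and-line chart
plot(as.numeric(nums.v), type="b", main="Top 10 words", xlab="Top 10 words", ylab="Occurrence", xaxt="n", cex=2, col="blue")
# label the x axis with the words themselves
axis(1, 1:10, labels=names(nums.v))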