I am so freaking pleased with this idea, and I'm still going to probably argue against it. I'm the privacy freak. I know that it sounds terrific, and I'm sure it will be used responsibly, but my inner paranoia bells are going off something awful.
But I gotta say I love the idea of affecting dictionary verbiage for all eternity. Or, you know, ten years. Whichever comes first.
Corpsified. Definitely Corpsified.
Wow, glad to see so many responses already.
It would be the words in context -- otherwise they're not data, they're just anecdotes.
We could certainly put the kibosh on releasing certain threads -- I was thinking Bitches, for example, has more personal info than any of the others. Natter, the Music, Fic, and Movie threads, and the show/spoiler threads would probably be the most valuable.
The data can be anonymized so that no user name, personal name, or place name appears. So that "I hung out in Somerville with Emily and VWbug last night" would appear as "I hung out in PLACENAME with PERSONNAME and PERSONNAME last night." Actual replacement strings would vary.
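To make that concrete, here's a rough sketch of the kind of replacement I mean. It assumes someone has already compiled lists of the user names, personal names, and place names to scrub; the `anonymize` function and the list contents are made up for illustration, and this is not the corpus programmers' actual tool.

```python
import re

def anonymize(text, person_names, place_names):
    """Replace known names with placeholder tokens (whole words, any case)."""
    def sub_all(s, names, token):
        # Longest names first, so a full name is replaced before a shorter
        # name contained in it. Empty strings are skipped.
        for name in sorted((n for n in names if n), key=len, reverse=True):
            pattern = r"\b" + re.escape(name) + r"\b"
            s = re.sub(pattern, token, s, flags=re.IGNORECASE)
        return s

    text = sub_all(text, place_names, "PLACENAME")
    text = sub_all(text, person_names, "PERSONNAME")
    return text

example = "I hung out in Somerville with Emily and VWbug last night"
print(anonymize(example, ["Emily", "VWbug"], ["Somerville"]))
```

A list-based pass like this only catches names somebody thought to put on the list, and the case-insensitive match will also hit ordinary words that happen to double as names, so a real pass would need a human check on top.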
Let me know what other questions you have! Remember, I can't put foamy in the dictionary until I can show use ... like, in a major corpus of American English ...
The corpus researchers (and lexicographers) want this data specifically because it has not been professionally edited, and because it's so wide-ranging. Linguists go to great lengths to get this kind of data -- one project gave free phone calls to grad students as long as they let themselves be recorded, in order to get spoken language data.
This reads like such a fascinating project! Shiny.
This is what I'm wondering after catching up: if we decide to go along with this, and several Buffistas would like to be excluded, would that be possible? Or can it be done on an "opt in" basis only, to prevent the use of words of people who wish otherwise or are no longer posting (and therefore can't have a say)? Will that be enough to answer the privacy questions raised above? [Edit: this question is directed at everybody, I guess, not just erin]
Nilly, that is a very good question, and I don't know the answer. I will find out. Technically, I think it must be possible, but practically, if having this constraint means that the corpora programmer has to do a lot of fancy post-processing, it may mean that we can't be used.
It might be a while before I can know this -- a week or so.
I was thinking Bitches, for example, has more personal info than any of the others. Natter, the Music, Fic, and Movie threads, and the show/spoiler threads would probably be the most valuable.
Um, Natter can get fairly personal, too. Though I like the idea.
And, smiggle!
I like the idea, too. Do we need to do a lightbulbs vote on this?
Considering the anonymousness (that's not a word, is it?), I honestly wouldn't care if even very personal things I'd discussed were used. It's not like other people haven't had albino children or evil, but well-meaning, parents.
anonymousness (that's not a word, is it?)
Anonymity. (Which doesn't look like a word now that I've typed it.)
I'm in favor of the project.
It's not like other people haven't had albino children or evil, but well-meaning, parents.
Or asshead bosses.
I'm very much in favor of the project.