Unexpected Challenges Result in Important and Informative Discussions: a transparent discussion about stripping content and stopwords

19 Feb, 2014 · by admin
hacking-that-camps-yack

This post is the fourth in a set of 5 written by the Digital History Fellows at the Roy Rosenzweig Center for History and New Media. The original post by Jannelle Legg can be found on the Digital History Fellowship Blog.

As described in previous posts, the first year Digital Fellows at CHNM have been working on a project under the Research division that involves collecting, cleaning, and analyzing data from a corpus of THATCamp content. Having overcome the hurdles of writing some python script and using MySQL to grab content from tables in the backend of a Wordpress install, we moved on to the relatively straightforward process of running our stripped text files through MALLET.

As we opened the MALLET output files, excited to see the topic models it produced, we were confronted with a problem we didn’t reasonably anticipate and this turned into a rather important discussion about data and meaning.

As a bit of background: topic modeling involves a process of filtering “stopwords” from a data set. Frequently a list of stopwords includes function words, or terms that appear repeatedly in discourse, like: “a, an, the”. These are filtered out because they serve a grammatical purpose but have little lexical meaning. Additionally, errors, misspellings, and lines of code that were skipped in the previous steps can also be filtered out at this stage.

As we opened the file of keys produced by MALLET, we found that some terms appeared that raised questions about what should or should not be included in our analysis. In particular, the discussion centered around spelling errors and function words in Spanish and French.

The conversation that followed, reproduced below, was significant and as people look through the results of this project or consider their own efforts reproduce something like this elsewhere, we’d like to be transparent about the decisions we made and, perhaps, spur a discussion about how to address scenarios like this in the future.

Take a look at what MALLET spit out: there are some errors.
Stuff like “xe, xc, zijn, xb, en, la”.

Yeah, I saw that.
We can make a custom stoplist to remove those.
Make a list and we’ll add it.

Interesting keys though - did you see that "women" came up in #0?
I’m excited to see this once it’s all graphed.

There’s some stuff though that I’m not sure we want to remove: Is CAA an error or an abbreviation?
“socal” - is that social misspelled or Southern California, abbreviated?
Hmm, that could be an organization...

Yeah, SoCal, as in Southern California.
There was a camp there.
This adds a larger question: do we remove misspellings?
For clarity?

I have mixed feelings.

Me too.

I think it’s appropriate to remove backend stuff-tags and metadata, but content is not something we should modify.

I agree.
We don’t want to skew the results.
Some of it occurred when we stripped all the alphanumeric stuff out.
It took out apostrophes- causing words like “I’ve” to become “I ve”.

The errors in themselves are telling about the nature of THATCamps.
That the content is generated spontaneously
lends itself to deviations from appropriate spelling ect.

I agree
Look at 17 and 18.

Whoa, where did “humanidades” come from?
Oh, right! There are international conference posts in here too!

How do we handle this?
Its possible to strip out the camps that are not in English
or even to run analysis on them separately.
I don’t want to skew the results but this also throws things off.

I know.
I say we leave it. It shows a growing international influence.
We’ll be able to see the emergence of International THATCamps.

I’ve never run into this before.
It brings up some interesting issues -
I wonder if there is standard procedure for something like this

What about things like “en” which is Spanish for “is”- that would have been removed on an English stoplist.
And now function words in Spanish and French seem to appear more frequently because the English terms have been filtered out.
How do you do topic modeling with multiple languages?

What about special characters?
We’ve stripped stuff out, how about how that would affect the appearance of words?

We have to find a stoplist for each of the languages.
To strip out the function words of all of them.

Good call. This got complicated quick!

Agreed.

As outlined above, when we opened the text file with keys, new questions were raised about the relevance and complications of running a particular stoplist on a corpus of texts. Similarly, we were forced to rethink how we handle misspellings and unfamiliar abbreviations. In the end, we tracked down stoplists for Spanish (and French) terms, so that no function words in any language would skew the results of our analysis. We also carefully examined the keys to identify abbreviations and misspellings and decided that they are a significant contribution to the analysis.

A few questions remained for us: how might removing non-alphanumeric characters (a-z,A-Z,0-9) alter the meaning of special characters used in languages other than English? How have others responded to spelling errors? How significant are errors?

Hopefully a post of this nature will foster discussion and produce a stronger, more complete analysis on this and other documents.