Scrub the Data
Data comes from different sources and in different formats.
Community traits identify which posts should
be collected. If location is a factor, the latitude and longitude of the post may be available. The
name of the city, or a location defined in the post, can be checked against the boundaries that define
the constituency. User profiles, and the profiles of the people who influence those users, may also indicate
membership in the desired community. For each source, conversion and integration play a role, as does
maintaining the integrity of each post, since the posts make up the data from which information and insights are gleaned.
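
The boundary check described above can be sketched as a simple bounding-box test. This is a minimal illustration, not a definitive method: the function name `in_bounds`, the `BoundingBox` type, and the coordinates for the box are all assumptions made for the example (the box is only a rough rectangle around Vancouver BC), and a real constituency would normally need a proper polygon boundary.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """A rectangular region in decimal degrees (hypothetical helper type)."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float


# Rough, illustrative rectangle around Vancouver BC (assumed values,
# not an official boundary).
VANCOUVER_BC = BoundingBox(min_lat=49.19, max_lat=49.32,
                           min_lon=-123.27, max_lon=-123.02)


def in_bounds(lat: float, lon: float, box: BoundingBox) -> bool:
    """Return True if the point lies inside the box, edges included."""
    return (box.min_lat <= lat <= box.max_lat
            and box.min_lon <= lon <= box.max_lon)


# A post geotagged in downtown Vancouver BC falls inside the box;
# one geotagged in Vancouver WA (about 45.64, -122.66) does not.
print(in_bounds(49.28, -123.12, VANCOUVER_BC))
print(in_bounds(45.64, -122.66, VANCOUVER_BC))
```

A rectangle is cheap to test and good enough for a first pass; posts near the edge would need a more precise boundary before being kept or dropped.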
Corrupt and meaningless entries are outliers that offer little value yet carry significant processing costs. Ambiguity, sarcasm, typos and online chat abbreviations require consistent intervention before a post can take part in the data set. An entry from 'Vancouver' should not be included with Vancouver BC if it is from Vancouver WA, USA. Retweets, repostings, links and blogs have unique importance as communication channels. Their sink, skip or ripple effects indicate leverage, channel value and community character. Communication paths between clusters, the presence of lurkers, and 'post and blow' versus 'share and know' behaviour help identify community dynamics. A consistent approach preserves how the wording is used when it is taken apart for classification and cluster analysis.
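
One way to keep the intervention consistent is to pass every post through the same normalization step before classification. The sketch below is illustrative only: the `ABBREVIATIONS` table, `normalize`, and `scrub` are hypothetical names, the abbreviation list is a tiny assumed sample, and the approach does nothing for sarcasm or ambiguity, which need more than token substitution.

```python
import re

# Small, assumed sample of chat abbreviations; a real table would be
# much larger and tuned to the community being studied.
ABBREVIATIONS = {
    "u": "you",
    "gr8": "great",
    "thx": "thanks",
    "imo": "in my opinion",
}


def normalize(text: str) -> str:
    """Lowercase the text, keep word-like tokens, expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def scrub(posts: list[str]) -> list[str]:
    """Normalize each post and drop entries left with no usable text."""
    cleaned = []
    for post in posts:
        text = normalize(post)
        if text:  # corrupt or empty entries normalize to ""
            cleaned.append(text)
    return cleaned


print(scrub(["Thx, u r GR8!", "###", ""]))
```

Because every post goes through the same function, the same abbreviation always expands the same way, which keeps later word counts and clusters comparable across sources.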