Organizing data for evaluation is over 80% of a project's work

Scrub the Data

Data comes from different sources and in different formats.

Community traits identify which posts should be collected. If location is a factor, the latitude and longitude of the post may be available. The city name, or a location defined in the post, can be checked against the boundaries that define the constituency. User profiles, and the profiles of the people who influence those users, may also indicate membership in the desired community. For each source, conversion and integration play a role, as does preserving the integrity of each post, since the posts make up the data from which information and insights are gleaned.
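As a minimal sketch of the boundary check described above, a post's coordinates can be tested against a rectangle standing in for the constituency. The `BoundingBox` class, the `in_constituency` helper, the `lat` and `lon` field names and the rectangle's corner values are all assumptions made for illustration; a real constituency would normally need a proper polygon boundary rather than a rectangle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned rectangle in decimal degrees (hypothetical helper)."""
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        return self.south <= lat <= self.north and self.west <= lon <= self.east


# Illustrative rectangle only, not a surveyed boundary.
CONSTITUENCY = BoundingBox(south=49.0, west=-123.5, north=49.5, east=-122.5)


def in_constituency(post: dict, box: BoundingBox = CONSTITUENCY) -> Optional[bool]:
    """Return True/False when coordinates exist, None when they are missing.

    None means the post needs another check, such as its city name or
    the author's profile, before it is kept or dropped.
    """
    lat, lon = post.get("lat"), post.get("lon")
    if lat is None or lon is None:
        return None
    return box.contains(lat, lon)
```

Returning `None` for posts without coordinates keeps "unknown" separate from "outside", so those posts can fall through to the city-name or profile checks instead of being silently discarded.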

Corrupt and meaningless entries are outliers that offer little value yet carry a significant processing cost. Ambiguity, sarcasm, typos and online chat abbreviations require consistent intervention before a post can take part in the data set. An entry from 'Vancouver' should not be grouped with Vancouver, BC if it comes from Vancouver, WA, USA. Retweets, reposts, links and blogs each have their own importance as communication channels. Their sink, skip or ripple effects indicate leverage, channel value and community character. Communication paths between clusters, lurkers, and 'post and blow' versus 'share and know' behaviour help identify community dynamics. A consistent approach keeps word usage stable as the text is taken apart for classification and cluster analysis.
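One way to keep the intervention consistent is to route every post through the same small normalization step before classification. The sketch below is an assumption-laden illustration, not a prescribed method: the `ABBREVIATIONS` table, the `normalize` and `is_retweet` helpers, and the "RT @" prefix convention for spotting retweets are all hypothetical choices.

```python
import re
from typing import List

# Hypothetical lookup table; a real project would maintain its own list.
ABBREVIATIONS = {
    "u": "you",
    "gr8": "great",
    "thx": "thanks",
    "imo": "in my opinion",
}

# Lowercase letters, digits and apostrophes form a token.
TOKEN = re.compile(r"[a-z0-9']+")


def normalize(text: str) -> List[str]:
    """Lowercase the text, split it into tokens and expand known abbreviations."""
    tokens = TOKEN.findall(text.lower())
    expanded: List[str] = []
    for token in tokens:
        # An abbreviation may expand to several words, so split the result.
        expanded.extend(ABBREVIATIONS.get(token, token).split())
    return expanded


def is_retweet(text: str) -> bool:
    """Flag posts that begin with the conventional 'RT @' prefix."""
    return text.lstrip().upper().startswith("RT @")
```

For example, `normalize("Thx, u r gr8!")` yields `["thanks", "you", "r", "great"]`: known abbreviations are expanded, while the unknown token `r` passes through unchanged for a later rule or manual review. Flagging retweets rather than deleting them keeps their channel information available for the ripple analysis.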