Data Mining Portfolio

Problems

Several problems were encountered in the course of mining this dataset, and some of them remain unresolved.

Preprocessing

As mentioned earlier, several problems had to be overcome to allow RapidMiner to read the files.

Each flat file contained two spaces between each measurement. The second space acted as a "sign bit", so it was left empty only for positive numbers and held the minus sign for negative ones. A simple search-and-replace that turned every single whitespace into a comma would therefore produce two commas before a positive value and a single comma before a negative value.

Since the PSD file contained only positive numbers, the solution reduced to detecting the raw files and looking for the necessary features.
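The delimiter problem can be illustrated with a minimal Python sketch. This is not the script used in the project; the function names and the sample values are hypothetical, and it assumes that measurements are separated only by spaces. Splitting on runs of whitespace, rather than on each single space, yields one field boundary whether or not the sign slot is filled.

```python
def naive_line_to_csv(line: str) -> str:
    """Replace every single space with a comma (the approach that fails)."""
    return line.strip().replace(" ", ",")


def flat_line_to_csv(line: str) -> str:
    """Convert one space-delimited line of measurements to a CSV line.

    str.split() with no arguments splits on runs of whitespace, so an
    empty sign slot (two spaces) and a filled one (space plus minus sign)
    both produce exactly one field boundary.
    """
    return ",".join(line.split())


# Hypothetical sample line: positive values are preceded by two spaces,
# negative values by one space and a minus sign.
sample = "1.5  2.3 -4.1  0.7"
print(naive_line_to_csv(sample))  # doubled commas before positive values
print(flat_line_to_csv(sample))   # one comma between every pair of values
```

With this approach the PSD and raw files would not need separate handling for the delimiter itself, although the project distinguished them for other reasons.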

No labels were given in the flat files. While the label describing the sensor location was trivial to add, the class label was harder to handle because only some flat files included it as the final column.

Again, the solution was to check the type of file being analyzed at the time. This is very easy when an entire folder is being parsed, as the folder name indicates the types of files it stores.
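The folder-based check could look roughly like the following Python sketch. The folder name "labeled", the column names "sensor_N" and "class", and the function name are all hypothetical placeholders, since the actual folder names and which file types carry the class label are not stated here.

```python
from pathlib import Path

# Hypothetical: names of folders whose files carry the class label
# as the final column. The real folder names are not given in the text.
FOLDERS_WITH_CLASS_LABEL = {"labeled"}


def column_names(file_path: str, n_columns: int) -> list[str]:
    """Build a header row for a flat file based on its parent folder name."""
    folder = Path(file_path).parent.name.lower()
    has_class_label = folder in FOLDERS_WITH_CLASS_LABEL

    # Every column is a sensor measurement, except the last one when the
    # file type includes the class label.
    n_sensors = n_columns - 1 if has_class_label else n_columns
    names = [f"sensor_{i}" for i in range(1, n_sensors + 1)]
    if has_class_label:
        names.append("class")
    return names


print(column_names("data/labeled/run01.txt", 4))
print(column_names("data/psd/run01.txt", 3))
```

Keeping the folder-to-type mapping in one place means a new file type only requires a change to that set, not to the parsing logic.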

Processing

RapidMiner does not perform statistical analysis via a widget. While this may seem like a minor inconvenience, it is difficult to identify patterns without one. In addition, loading too much data causes RapidMiner to crash. This issue is significant because trimming the data may decrease the accuracy of the model. The problem is made worse by the fact that all the data points are measurements and, as a result, are immensely important to the creation of an accurate model.
