CSCI 568: Data Mining

Fall/Winter 2009

Lectures

Dec 4: libSVM, Ensemble Methods

What is libSVM?

What is the difference between multi-label and multi-class classification?

What are two assumptions about classifiers used in a successful ensemble approach?

What are four general approaches to ensemble classification?

What are boosting and bagging and how do they work?

Assignment: Complete project 5.1.

Dec 2: Linear Classifiers, Kernel Functions, SVMs

What is a linear classifier?

What is a limitation of linear classifiers on non-trivial datasets?

What is the "kernel trick"?

What are kernel functions?

What is meant by "maximum margin hyperplane"?

What are the two main goals of an SVM's function? (find the support vectors, determine the maximum-margin hyperplane)

Assignment: Complete project 5.1.
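
As a study aid for the kernel questions above, here is a minimal Python sketch of the kernel trick for one kernel. It assumes 2-D inputs and the degree-2 polynomial kernel K(x, y) = (x . y)^2; the function names (dot, phi, poly_kernel) are illustrative and not taken from the course materials or libSVM.

```python
import math

def dot(x, y):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def phi(x):
    """Explicit feature map for 2-D input into 3-D space (assumed for illustration)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Degree-2 polynomial kernel: computed entirely in the original 2-D space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(y))   # map to 3-D first, then take the dot product
implicit = poly_kernel(x, y)     # same value, without ever building phi(x)
```

The two values agree (up to floating-point rounding), which is the point of the trick: a linear classifier can work in the higher-dimensional space while only ever evaluating the kernel on the original vectors.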

Nov 30: Matchmaker Data, DM Process

Discussion: Collective Intelligence chapter 9

Assignment: Complete project 5.1.

Nov 23: Project Discussion

Assignment: Complete project 5.1.

Nov 20: Feature Subset Selection

What is feature subset selection?

What is one general algorithmic approach to feature selection?

Assignment: Complete project 5.1.

Nov 18: NMF, Support Vector Machines, Mining Process

Research outline (NMF, SVMs).

What is a support vector machine?

For what kind of dataset is an SVM proven to be effective?

Assignment: Complete project 5.1.

Nov 16: Document Mining & Non-Negative Matrix Factorization

What is non-negative matrix factorization?

For what kind of dataset is NMF proven to be an effective strategy?

Assignment: Complete project 5.0.

Nov 13: Student Dataset Presentations

You should have finished project 4.0 and project 5.0.

You and your partner will briefly present your dataset and mining ideas in class.

Assignment: Complete reading assignment 13.

Nov 11: Stochastic Optimization & Data Mining

You should hand in the midterm and be working on project 4.0 and project 5.0.

What are some common optimization techniques?

What is the "hill-climbing" algorithm? What are its drawbacks?

What about "random-restart hill-climbing" and its drawbacks?

What is simulated annealing, and how might you use it in a DM context?

How do simple genetic algorithms work?

Assignment: Complete project 5.0.
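
The hill-climbing, random-restart, and simulated annealing questions above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the problem is minimization, the search space is a small hypothetical 1-D landscape (LANDSCAPE), and the parameter names and defaults (temp, cooling, min_temp, restarts) are illustrative choices, not values from the course.

```python
import math
import random

def hill_climb(cost, start, neighbors):
    """Move to the cheapest neighbor until no neighbor improves on the current state."""
    current = start
    while True:
        best = min(neighbors(current), key=cost)
        if cost(best) >= cost(current):
            return current  # a local minimum, not necessarily the global one
        current = best

def random_restart(cost, random_start, neighbors, restarts=20):
    """Run hill_climb from several random starts and keep the cheapest result."""
    results = [hill_climb(cost, random_start(), neighbors) for _ in range(restarts)]
    return min(results, key=cost)

def simulated_annealing(cost, start, neighbors, temp=10.0, cooling=0.95, min_temp=0.01):
    """Like hill climbing, but sometimes accepts a worse move while the temperature is high."""
    current = best = start
    while temp > min_temp:
        candidate = random.choice(neighbors(current))
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept worse moves with probability exp(-delta / temp).
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current = candidate
        if cost(current) < cost(best):
            best = current
        temp *= cooling
    return best

# Hypothetical landscape: index 1 is a local minimum, index 5 is the global minimum.
LANDSCAPE = [5, 3, 4, 2, 6, 1, 7]

def cost(i):
    return LANDSCAPE[i]

def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]

def random_start():
    return random.randrange(len(LANDSCAPE))
```

Starting from index 0, hill_climb(cost, 0, neighbors) stops at index 1, a local minimum, which illustrates the main drawback of plain hill climbing. Random restarts and annealing both trade extra evaluations for a better chance of escaping such minima.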

Nov 9: Simple Datacubes, Midterm

You should be working on project 4.0.

What is a datacube?

How might you transform a flat set of data into a three-dimensional datacube?

Assignment: Complete midterm exam.
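
One possible way to turn a flat set of records into a three-dimensional datacube is sketched below in Python. The records, the dimension names (product, region, quarter), and the helper names (build_cube, roll_up) are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical flat data: one row per sale.
RECORDS = [
    {"product": "widget", "region": "east", "quarter": "Q1", "sales": 10},
    {"product": "widget", "region": "east", "quarter": "Q1", "sales": 5},
    {"product": "widget", "region": "west", "quarter": "Q1", "sales": 7},
    {"product": "gadget", "region": "east", "quarter": "Q2", "sales": 3},
]

def build_cube(records, dims, measure):
    """Sum the measure into cells keyed by one value per dimension."""
    cube = defaultdict(float)
    for record in records:
        key = tuple(record[d] for d in dims)
        cube[key] += record[measure]
    return dict(cube)

def roll_up(cube, keep):
    """Aggregate away every dimension whose position is not listed in keep."""
    out = defaultdict(float)
    for key, value in cube.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

cube = build_cube(RECORDS, dims=("product", "region", "quarter"), measure="sales")
by_product = roll_up(cube, keep=(0,))  # collapse region and quarter
```

Each cell of the cube is addressed by a (product, region, quarter) tuple, and rolling up simply re-aggregates over the dimensions that are dropped.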

Nov 6: Tools & Process

You should have finished assignment 12.

What are the two main problems facing any data miner?

What is the solution?

Assignment: Complete project 4.0.

Nov 4: Anomaly Detection

What are some common contexts for anomaly detection?

What are common causes of anomalies?

What are some issues to keep in mind when choosing your detection strategy?

How does a statistical approach to anomaly detection work?

How does a proximity-based approach to anomaly detection work?

How does a density-based approach to anomaly detection work?

How does a clustering-based approach to anomaly detection work?

What are some issues to keep in mind when using each of the above strategies?

Assignment: Complete assignment 12.

Nov 2: ANN Backpropagation, Association Analysis

You should have finished project 3.1.

What is backpropagation and how does it work?

What is association analysis?

How are association rules similar to classification rules?

What are the two major steps in association analysis?

How do we generate frequent itemsets?

What is the apriori principle?
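
The frequent-itemset and apriori questions above can be illustrated with a minimal Python sketch. It assumes transactions are represented as sets of item names; the market-basket data and the function names (support, confidence, frequent_itemsets) are illustrative, and this is a simplified level-wise search rather than a tuned implementation.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """How often the consequent appears in transactions containing the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise search: build size-k candidates only from frequent (k-1)-itemsets."""
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support}
    frequent = set()
    k = 1
    while level:
        frequent |= level
        k += 1
        # Join step: merge pairs of frequent itemsets into size-k candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (apriori principle): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c for c in candidates if support(c, transactions) >= min_support}
    return frequent

# Hypothetical market-basket transactions.
TRANSACTIONS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer", "eggs"}),
    frozenset({"milk", "diapers", "beer", "cola"}),
    frozenset({"bread", "milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers", "cola"}),
]

result = frequent_itemsets(TRANSACTIONS, min_support=0.6)
```

The prune step is where the apriori principle pays off: a candidate is discarded without counting its support if any of its subsets is already known to be infrequent.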

Oct 30: No class

You should have finished assignment 11 and be working on assignment 10.

Assignment: Finish project 3.1.

Oct 28: No class

You should have finished assignment 10.

Assignment: Complete assignment 11 and project 3.1.

Oct 26: Deeper into kNN, Rule-Based and Bayesian Classifiers

You should have finished assignment 9 and Project 2.2.

What is the general algorithm for directly generating rulesets?

What are the concepts behind the "sequential covering" and "Learn-One-Rule" ideas?

What are two rule-growing strategies?

How is rule quality evaluated?

How are rulesets extracted from decision trees?

What are some characteristics of rule-based classifiers?

What are some characteristics of the kNN decision boundaries?

What two things are important to keep in mind when using kNN classifiers?

What's a simple way of representing Bayesian conditional probability for classification?

Assignment: Complete assignment 10.

Oct 23: Bayesian Classifiers, Artificial Neural Networks

You should have finished assignment 8 and Project 2.1.

What is Bayes' theorem? How is it applied to classification?

What is an artificial neural network?

What is a perceptron?

How is an ANN applied to classification?

Assignment: Complete assignment 9 and Project 2.2.

Oct 21: Rule-Based Classifiers, Nearest-Neighbor Classifiers

You should have completed Project 1.6.

Introduction of new dataset (weather data).

What is a rule-based classifier?

What is a nearest-neighbor classifier?

Assignment: Begin reading assignment 8 and complete Project 2.1.

Oct 16: Classification Trees

You should have read about decision trees and be working on Project 1.6.

What are some characteristics of decision trees?

What is model overfitting?

What is k-fold cross-validation?

Assignment: Finish Project 1.6.

Oct 14: (abbreviated class)

You should be reading about classification and working on Project 1.6.

Classification quiz.

Assignment: Continue reading about classification and begin Project 1.6.

Oct 12: Discussion, Mining Medical Data with WEKA

You should be finishing Project 1.5.

Discussion of classifiers' benefits and drawbacks.

Discussion of key terms to be found in our readings.

Intro to decision tree classifiers and Hunt's Algorithm.

Assignment: Begin reading assignment and finish Project 1.5.

Oct 9: No class

You should be working on Project 1.5.

Assignment: Continue Project 1.5.

Oct 7: Discussion, Mining Medical Data with WEKA

You should have finished Project 1.4 and Homework 6, and be prepared to discuss your findings.

Discussion and application of clustering.

Discussion of results.

Assignment: Project 1.5.

Oct 5: Discussion, Mining Medical Data with WEKA

You should be working on Project 1.3 and Homework 6, and be prepared to discuss your findings.

Discussion of summary statistics.

Discussion of clustering theory, application, expectations, strategy.

Assignment: Project 1.4, Homework 6.

Oct 2: Discussion, Mining Medical Data with WEKA

You should be exploring the data and be prepared to discuss your summary statistics.

Discussion of problem clarification.

Brief discussion of mining strategies (clustering, classification trees).

Assignment: Project 1.3, Homework 6.

Sep 30: Discussion, Mining Medical Data with WEKA

You should be working on Project 1.2 and be prepared to discuss your findings.

What does the Instances class in Weka represent?

About how many records are in train.all?

What's a "longitudinal sequence"?

What issues did you encounter when importing train.all into your database/tool of choice?

Discussion of attributes and values.

What pattern did we see with diagnosis codes? What is their datatype?

Sep 28: Discussion, Mining Medical Data with WEKA

You should be finished with Project 1.1 and be prepared to discuss your findings.

Problem definition

Starting with what we know

Clarification of unknowns, next steps

Assignment: Project 1.2.

Sep 25: Lecture 14, Mining Medical Data with WEKA

You should definitely have WEKA working properly in your dev env.

Project overview (Informs 2009 Data Mining contest challenge)

Initial project & data discussion

Assignment: Project 1.1.

Sep 23: Lecture 13, Cluster Analysis with WEKA

How might you run a cluster analysis with WEKA?

Sep 21: Lecture 12, Exploring the WEKA Toolkit

You should be finishing Homework 5.

What is WEKA?

What data mining tools does it provide?

How do you preprocess data with Weka?

Sep 18: Lecture 11, Classifiers, Associations, Clusters, Anomalies

You should be working on Homework 5.

What is the most popular approach for building a classifier?

What is an association rule?

What are support and confidence?

What are the two main steps in association analysis?

What is cluster analysis? What are two popular algorithms for determining clusters?

What are two general approaches in detecting anomalies?

Sep 16: Lecture 10, Summary Statistics, Visualizations

You should have completed Homework 4.

What are some general summary statistics we should create for all datasets?

What is visualization? Why do we care?

How do you create a visualization using Processing?

How do you get Processing to use your dataset?

Assignment: Read DM chapter 3. Start Homework 5.

Sep 14: Lecture 9, Exploring Data, Visualization

You should have completed Homework 4.

What is visualization? Why do we care?

What are some modern trends in Information Visualization?

Assignment: Read DM chapter 3. Start Homework 5.

Sep 11: Lecture 8, Similarity in Depth, Intro to Visualization slides

You should definitely have Processing and git working properly in your dev env.

How does Euclidean distance represent similarity?

How does the Simple Matching Coefficient represent similarity?

When should you use SMC?

How does the Jaccard Coefficient represent similarity?

When should you use Jaccard?

How is Cosine Similarity calculated?

When should you use Cosine Similarity?

How does the Extended-Jaccard/Tanimoto Coefficient work?

When should you use the Tanimoto Coefficient?

How does Pearson Correlation represent similarity?

When should you use Pearson?

Assignment: Re-read (again, really!) DM chapter 2.4 and CI chapter 2. Homework 4.
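
As a study aid for the similarity questions above, the measures can be written out in a few lines of Python. This is a minimal sketch, assuming equal-length numeric vectors (binary 0/1 vectors for SMC and Jaccard) with non-zero length and variance; the function names are illustrative and not taken from DM or CI.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def euclidean(x, y):
    """Straight-line distance; a smaller distance means more similar objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def smc(x, y):
    """Simple Matching Coefficient: matching attributes (0-0 and 1-1) over all attributes."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def jaccard(x, y):
    """Jaccard Coefficient: 1-1 matches over attributes that are not 0-0 matches."""
    both = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return both / either if either else 0.0

def cosine(x, y):
    """Cosine of the angle between the vectors; ignores their lengths."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def tanimoto(x, y):
    """Extended Jaccard: reduces to Jaccard when both vectors are binary."""
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))

def pearson(x, y):
    """Linear correlation in [-1, 1]; insensitive to shifting and scaling of either vector."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Sparse binary vectors: SMC counts the shared zeros, Jaccard does not.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
```

For p and q, smc(p, q) is 0.7 while jaccard(p, q) is 0.0, which hints at when each measure is appropriate: SMC treats shared absences as evidence of similarity, Jaccard ignores them.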

Sep 9: Lecture 7, Similarity, Dissimilarity slides

You should have read DM chapter 2.4, CI chapter 2.

What do we mean by similarity and dissimilarity?

How do we convey this meaning to a machine?

How do we measure dissimilarity between two attribute values?

How do we measure dissimilarity between two data objects?

What are some important issues surrounding dissimilarity measurements?

What is collaborative filtering?

What's the difference between user-based collaborative filtering and item-based collaborative filtering?

Assignment: Re-read (really!) DM chapter 2.4 and CI chapter 2.

Sep 7: Lecture 6, Less Talk, More Mining

You should have finished Homework 3.

Clustering example using the Orange Data Mining Toolkit.

Example of SQL Server Data Mining tools.

Review: Data Preprocessing.

Assignment: Prepare for quiz (data, preprocessing). Read DM chapter 2.4, CI chapter 2.

Sep 4: Lecture 5, Data: Preprocessing Techniques slides

You should be working on Homework 3.

Homework 3 / toolkit demonstration

What is aggregation?

What is sampling?

What is dimensionality reduction?

What is feature subset selection?

What is feature creation?

What are discretization and binarization?

What is variable transformation?

Assignment: Read DM chapter 2.3 (pages 44 - 65).

Sep 2: Lecture 4, Data: Fundamentals slides

You should be working on Homework 3.

What can we assume about almost all non-random data?

What are some types of data?

What are some types of data sets?

What are some problems that affect data quality?

What is the simple definition of high quality data?

Assignment: Read DM chapter 2.1, 2.2 (pages 19 - 44).

Aug 31: Lecture 3, Data Mining in Context, Main Categories of Algorithms

You should have completed Homework 2.

What are the two main categories of Data Mining strategies? (review)

What are the five general types of strategies we will use in this class?

Toolkit overview and explanation of homework 3

Assignment: Homework 3

Aug 28: Lecture 2, Data Mining in Context, Main Categories of Algorithms

You should have completed Homework 1.

Where does Data Mining fit in the big picture of "Knowledge Discovery"?

What are some limits to machine learning?

In what industries or contexts is data mining used? What are some examples?

What are the two main categories of Data Mining strategies?

Assignment: Homework 2

Aug 26: Lecture 1, Introduction

You should have skimmed the table of contents and the entire contents of DM and CI.

Semester overview.

What is data mining?

Why does data mining even exist? What's the point?

Assignment: Homework 1
