CSCI 568: Data Mining

Fall/Winter 2010

Lectures

Dec 8: Data Mining Review

You should be working on the Final Project & Final Portfolio.

Final project dataset discussion & hints.

Assignment: Finish the Final Project & Final Portfolio

Dec 6: Beautiful Data: Connecting Data

You should be working on the Final Project & Final Portfolio.

What are some current challenges in the vast data landscape?

What does it mean to have a data silo?

What are some problems with identical data in multiple silos?

What are some current approaches (and research) being pursued to "connect" data?

Assignment: Finish the Final Project & Final Portfolio

Dec 3: Current Papers and Domain-Specific Examples of Algorithm Use

You should have completed Reading 29.

A survey of domain-specific and algorithm-specific papers

Assignment: Finish the Final Project & Final Portfolio

Dec 1: Additional Resources, Final Project/Portfolio Overview

You should have completed Reading 29.

What are some additional, current texts in data mining?

Final project overview

Assignment: Final Project, Final Portfolio

Nov 29: Non-Negative Matrix Factorization

You should have finished Project 7.

How does non-negative matrix factorization (NMF) work?

Assignment: Reading 29.

Nov 22: A Survey of Visualization

Nov 19: Ensemble Classifiers

Nov 17: Ensemble Classifiers, Boosting, Bagging, Random Forests

Nov 15: Advanced Stratagems, Intro Non-Negative Matrix Factorization

You should have finished Reading 25, Portfolio 3 and be working on Project 7.

What is non-negative matrix factorization? (NMF)

What are some "advanced" mining strategies?

Assignment: Reading 26, Portfolio 4 and continue Project 7.

Nov 12: Support Vector Machines, Binary Classifiers & Multi-Label Classifiers

You should have finished Reading 24 and be working on Project 7.

How does an SVM work?

What are the criteria for kernel functions used with an SVM?

How might you apply a binary classifier to multi-label problems?

How is exploratory data analysis related to data mining?

Assignment: Reading 25, Portfolio 3 and continue Project 7.

Nov 10: (no lecture)

You should have finished Reading 24 and be working on Project 7.

Assignment: Continue Project 7.

Nov 8: Linear Classifiers, Kernel Functions, SVMs

You should have finished Reading 23 and be working on Project 7.

What is a linear classifier?

What is a major limitation of linear classifiers on non-trivial datasets?

What is the "kernel trick?"

What are kernel functions?

What is meant by "maximum margin hyperplane?"

What are the two main goals of an SVM's function? (find support vectors, determine MM hyperplane)

Assignment: Complete Reading 24 and continue Project 7.

Nov 5: Intro to Support Vector Machines

You should be working on Project 7.

Challenge: what are kernel functions, kernel methods, the kernel trick and Support Vector Machines?

Challenge: what is non-negative matrix factorization?

Assignment: Reading 23 and Project 7.

Nov 3: Portfolio 2 Review, Knime Workflow Example

You should have finished Reading 22 and Portfolio 2.

Discussion: Knime workflows, nodes, problems, transformations, tool limitations.

Discussion: Project 6.1 portfolio review.

Challenge: what are kernel functions, kernel methods, the kernel trick and Support Vector Machines?

Challenge: what is non-negative matrix factorization?

Assignment: Begin Project 7 (declare your dataset).

Nov 1: Bioinformatics in Practice, Numeric Prediction w/ kNN, Case Study: Heterogeneous Similarity

You should have finished Reading 21 and be working on Portfolio 2.

Discussion: Life in Data

How might you use kNN to predict numeric values?

How might you choose the right k?

What are some properties of common weighting functions?

Case study: heterogeneous similarity, delivering a solution.

Assignment: Reading 22 and finish Portfolio 2.
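
For the Nov 1 questions on numeric prediction with kNN and on weighting functions, here is a minimal sketch. It is not the course's reference implementation: the function names, the `eps` and `sigma` defaults, and the toy data in the usage note are all illustrative assumptions. It averages the target values of the k nearest training points, optionally weighting each neighbor by its distance.

```python
import math

def distance(x, y):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def inverse_weight(d, eps=0.1):
    """Closer neighbors count more; eps (an assumed constant) avoids division by zero."""
    return 1.0 / (d + eps)

def gaussian_weight(d, sigma=1.0):
    """Weight decays smoothly with distance and never goes negative."""
    return math.exp(-(d * d) / (2.0 * sigma * sigma))

def knn_predict(train, query, k=3, weight=None):
    """Predict a numeric value as the (weighted) mean of the k nearest targets.

    train is a list of (features, value) pairs; weight maps a distance to a
    non-negative weight. With weight=None every neighbor counts equally.
    """
    if weight is None:
        weight = lambda d: 1.0
    by_distance = sorted(
        ((distance(x, query), y) for x, y in train), key=lambda pair: pair[0]
    )
    nearest = by_distance[:k]
    total_weight = sum(weight(d) for d, _ in nearest)
    return sum(weight(d) * y for d, y in nearest) / total_weight
```

With the made-up training set `[((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]`, a query at `(2.0,)` with `k=3` ignores the distant point at 10.0. Raising k to 4 pulls the unweighted prediction toward 100.0, which is one way to see why the choice of k and of the weighting function matters.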

Oct 29: Applications of Decision Trees, Case Study: Heterogeneous Similarity

You should have applied mining strategies against your dataset via your toolkit.

What additional characteristic do decision trees provide that other classifiers do not?

For what types of attributes are decision trees often a better fit?

Can a decision tree provide a probabilistic assignment of classification?

For what problems are decision trees a poor choice?

How might you devise a plan for creating a system that displays "users most like you" given heterogeneous attributes and missing data?

Assignment: Reading 21 and Portfolio 2.

Oct 25: Project Discussion

You should be experimenting with applying mining strategies against your dataset via your toolkit.

Reading 20 discussion

Mining strategy discussion

Assignment: Continue mining your dataset.

Oct 22: Stochastic Optimization & Data Mining

You should be experimenting with applying mining strategies against your dataset via your toolkit.

What are some common optimization techniques?

What is the "hill-climbing" algorithm? What are its drawbacks?

What about "random-restart hill-climbing" and its drawbacks?

What is simulated annealing, and how might you use it in a DM context?

How do simple genetic algorithms work?

Assignment: Reading 20 and continue mining your dataset.
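
For the Oct 22 questions on hill-climbing and simulated annealing, the following is a minimal sketch under stated assumptions: the cost table `COSTS`, the neighbor function, and the cooling parameters are all made up for illustration. It minimizes a cost over integer positions, where each position's neighbors are the positions immediately to its left and right.

```python
import math
import random

# Hypothetical cost landscape: index 1 is a local minimum, index 6 the global one.
COSTS = [5, 3, 4, 6, 2, 1, 0, 4]

def cost(i):
    return COSTS[i]

def neighbors(i):
    """Adjacent positions that stay inside the table."""
    return [j for j in (i - 1, i + 1) if 0 <= j < len(COSTS)]

def hill_climb(cost, start, neighbors):
    """Move to the best neighbor until no neighbor improves on the current state."""
    current = start
    while True:
        best = min(neighbors(current), key=cost)
        if cost(best) >= cost(current):
            return current  # local optimum, not necessarily the global one
        current = best

def simulated_annealing(cost, start, neighbors, temp=10.0, cooling=0.95,
                        min_temp=0.01, rng=None):
    """Accept worse moves with probability exp(-delta / temp), cooling over time."""
    rng = rng or random.Random()
    current = best = start
    while temp > min_temp:
        candidate = rng.choice(neighbors(current))
        delta = cost(candidate) - cost(current)
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
        if cost(current) < cost(best):
            best = current  # remember the best state seen so far
        temp *= cooling
    return best
```

Starting from position 0, `hill_climb(cost, 0, neighbors)` stops at position 1 because both of its neighbors are worse, which illustrates the local-optimum drawback. Simulated annealing can step over the ridge at position 3 while the temperature is high, though its result depends on the random draws. Random-restart hill-climbing would instead call `hill_climb` from several random starting positions and keep the best result.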

Oct 20: Project Discussion

You should have finished Project 6.7 - 6.9.

(midterm review)

(project discussion)

Assignment: Experiment with applying mining strategies against your dataset via your toolkit.

Oct 15: Data Preprocessing Techniques

You should be working on Project 6.7 - 6.9.

What is data preprocessing?

What are some common preprocessing tasks?

What are the most common dataset transformations that usually take place?

What is aggregation?

What is sampling?

What is dimensionality reduction?

What is feature subset selection?

What is feature creation?

What are discretization and binarization?

What is variable transformation?

Project discussion: summary statistics with Knime

Assignment: Complete Reading 19 and continue Project 6.7 - 6.9.

Oct 13: Midterm

You should be working on Project 6.7 - 6.9.

Assignment: Continue Project 6.7 & 6.9.

Oct 11: Lecture 19, Midterm Tips, Project Discussion, Data Preprocessing

You should be finished with Project 6.1 - 6.6.

Project Discussion: data preprocessing, dependencies

Assignment: Study for the midterm and begin Project 6.7 - 6.9.

Oct 8: Lecture 19, Project Discussion

You should be finished with Reading 17 and be working on Project 6.

Project Discussion: your dataset(s) and tool choices, tinkering and data exploration

Assignment: Continue Project 6.

Oct 6: Lecture 18, Exploring Data, Project Discussion

You should be working on Reading 17 and Project 6.

What are some general summary statistics we should create for all datasets?

What is visualization? Why do we care?

What are some modern trends in Information Visualization?

Project Discussion: your dataset(s) and tool choice

Assignment: Continue Reading 17 and Project 6.

Oct 4: Lecture 17, Anomaly Detection Review, Project Requirements

You should have finished Reading 16 and be finishing Project 5.

What are some common approaches in anomaly detection?

Project preliminaries & expectations.

Assignment: Finish Project 5, begin Reading 17 and Project 6.

Oct 1: Lecture 16, Intro to Anomaly Detection, Tools & Applied Theory I

You should be working on Project 5.

What is an anomaly?

What is anomaly detection?

What are some common causes of anomalies?

What are some common contexts and approaches in anomaly detection?

What are some common issues to be aware of when detecting anomalies?

Everyone should master Excel: Analysis ToolPak, optimization

Statisticians love R: histograms, scatterplots, linear regression, correlation, error

Assignment: Reading 16 and finish Project 5.

Sep 29: Lecture 15, Association Analysis

You should have finished Reading 15 and be working on Project 5.

What are the two main steps in a general association analysis strategy?

What types of datasets are traditional association analyses run against?

What is an association rule? A frequent itemset?

What are support and confidence? How are they used to evaluate rules and itemsets?

What is the apriori principle and how is it applied to frequent itemset generation and ruleset generation?

What are some general algorithms for generating frequent itemsets?

What are some general algorithms for generating association rules?

What are most frequent itemset generation algorithms sensitive to?

What are some general ways of evaluating association rules / patterns?

What are some current research topics in the area of association analysis?

Assignment: Continue Project 5.
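
For the Sep 29 question on support and confidence, here is a minimal sketch. The function names and the market-basket data in the usage note are illustrative assumptions, not course material. Support is the fraction of transactions that contain an itemset, and the confidence of a rule X -> Y is support(X and Y together) divided by support(X).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent appears in transactions containing the antecedent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / antecedent_support

# Hypothetical transactions for illustration.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
```

In this toy data, `{"milk", "diapers", "beer"}` appears in 2 of 5 transactions (support 0.4), and `{"milk", "diapers"}` appears in 3, so the rule {milk, diapers} -> {beer} has confidence 2/3. The apriori principle follows from the same definition: any superset of an itemset can only have equal or lower support.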

Sep 27: Lecture 14, Bayesian Classifiers, Intro to Association Analysis

You should have finished Reading 14 and Project 4.

What is Bayes' Theorem? How is it used to calculate dependent probabilities?

What is a Bayesian Classifier and how does it work?

What is association analysis?

Assignment: Reading 15 and begin Project 5.
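
For the Sep 27 question on Bayes' Theorem, here is a minimal sketch of the calculation P(A|B) = P(B|A) P(A) / P(B), with P(B) expanded by the law of total probability. The function and parameter names, and the numbers in the usage note, are illustrative assumptions.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) for a binary hypothesis A and an observation B.

    prior               = P(A)
    likelihood          = P(B|A)
    false_positive_rate = P(B|not A)
    """
    # P(B) by total probability over A and not-A.
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    if evidence == 0:
        raise ValueError("observation has zero probability; posterior is undefined")
    return likelihood * prior / evidence
```

With made-up numbers, a prior of 0.01, a likelihood of 0.9 and a false positive rate of 0.05 give 0.009 / (0.009 + 0.0495), or roughly 0.154. A naive Bayesian classifier applies the same rule per class, treating attributes as conditionally independent given the class.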

Sep 24: Lecture 13, Rule-Based Classifiers, k-Nearest Neighbor Classifiers

You should have finished Reading 13 and be working on Project 4.

What is a Rule-Based Classifier and how does it work?

What are some common Rule-Based Classifier algorithm 'brands'?

What two measurements are typically used for all rule evaluation?

What are two general approaches to building a Rule-Based Classifier?

What is a k-Nearest Neighbors Classifier and how does it work?

What are some characteristics of a kNN classifier?

Assignment: Reading 14 and finish Project 4.

Sep 22: Lecture 12, Decision Tree Induction

You should have finished Reading 12 and be working on Project 4.

What is Hunt's Algorithm?

What are two important issues regarding attribute test conditions?

How do you split different attribute types?

How do you determine which attribute to test, and what values to use?

How do you measure the quality of a split? As compared to what?

How do you stop the tree growth?

What are some qualities of decision trees you should remember?

Assignment: Reading 13 and continue Project 4.
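
For the Sep 22 question on measuring the quality of a split, the following is a minimal sketch using two common impurity measures, Gini and entropy. The function names are illustrative assumptions. The quality of a split is measured against the impurity of the parent node before splitting.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, higher for mixed class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: 0 for a pure node, 1 for an even two-class mix."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def split_gain(parent, children, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted
```

For a parent with labels `["a", "a", "b", "b"]`, a split into `["a", "a"]` and `["b", "b"]` yields a Gini gain of 0.5, while a split into `["a", "b"]` and `["a", "b"]` yields 0. A tree builder would typically choose the attribute test with the largest gain.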

Sep 20: Lecture 11, Perceptrons & Multi-Layer Artificial Neural Networks, Intro to Decision Trees

You should have finished Reading 11.

How do you treat bias in a perceptron?

What is the general learning algorithm for a perceptron?

Why are perceptrons limited to linearly separable classifications?

What is a multi-layer Artificial Neural Network (ANN) and how does it work?

To discover (via Project 4): what is backpropagation and why is it necessary when training a multi-layer ANN?

What are five issues to keep in mind when designing an ANN?

What are five general characteristics of ANNs?

What is decision tree induction and how does it work?

Assignment: Reading 12 and Project 4.
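
For the Sep 20 questions on perceptron bias and the general learning algorithm, here is a minimal sketch. The function names, the step activation, the learning rate and the AND training data are illustrative assumptions, not taken from the lecture. The bias is treated as an extra weight attached to a constant input of 1, and weights are adjusted only when the prediction is wrong.

```python
def predict(weights, features):
    """Step activation over the weighted sum; weights[0] is the bias weight."""
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1 if total > 0 else 0

def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    """Perceptron learning rule: w += rate * (target - output) * input.

    samples is a list of (features, target) pairs with targets 0 or 1.
    Training stops early once a full pass produces no errors.
    """
    n_features = len(samples[0][0])
    weights = [0.0] * (n_features + 1)
    for _ in range(max_epochs):
        errors = 0
        for features, target in samples:
            error = target - predict(weights, features)
            if error != 0:
                errors += 1
                weights[0] += learning_rate * error  # bias input is always 1
                for i, x in enumerate(features):
                    weights[i + 1] += learning_rate * error * x
        if errors == 0:
            break
    return weights

# Logical AND is linearly separable, so the rule converges on it.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
```

Training on `AND_SAMPLES` converges after a handful of passes. Replacing the targets with XOR (`0, 1, 1, 0`) never reaches an error-free pass, because no single line separates the two classes; that limitation is what multi-layer networks address.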

Sep 17: Lecture 10, Cluster Review, Intro to Classifiers, Artificial Neural Networks

You should have finished Reading 10, Portfolio 1 and Project 3.

Discussion review regarding clustering.

What is classification?

What is the difference between descriptive modeling and predictive modeling?

What is the general process of building a classifier?

How are training datasets and test datasets used to build a classifier?

What two basic metrics are always used to evaluate classifier performance?

What is a perceptron and how does it work?

Assignment: Reading 11.

Sep 15: Discussion, DBSCAN, Cluster Analysis, Comparisons

You should have finished Reading 9 and be finishing Project 3.

How does the DBSCAN algorithm work?

How can we use cohesion and separation to describe clusters?

How can we determine the 'correct' number of clusters?

What comparisons can we extract from K-means and DBSCAN?

What are some important characteristics of clusters and clustering algorithms?

Assignment: Reading 10, finish Portfolio 1 and continue Project 3.

Sep 13: Lecture 9, Comparing Clustering Strategies

You should have finished Reading 8 and be working on Project 3.

What are some important issues regarding k-means clustering?

What are some important issues regarding hierarchical clustering?

How does density-based clustering work? What are some important issues about this strategy?

In what situations might you use these particular clustering strategies?

Discussion: Information Platforms; "Data Scientist"

Assignment: Reading 9, Portfolio 1 and continue Project 3.

Sep 10: Lecture 8, Clustering

You should have finished Reading 7 and be working on Project 3.

How does clustering fit in an overall data mining strategy?

How might we use k-means clustering to answer, "Do meaningful groups exist?"

What is hierarchical clustering?

What is K-Means clustering?

What are the differences between the two clustering algorithms?

Discussion: cloud storage

Assignment: Reading 8 and continue Project 3.

Sep 8: Lecture 7, Data Quality, Intro to Clustering

You should have finished Reading 6 and be working on Project 3.

What are some problems that affect data quality?

What is the simple definition of high quality data?

Pearson correlation review: is our implementation valid? When do you use Bessel's correction? Object correlation vs. attribute correlation. When should you normalize attribute values?

What are two types of structures for clusterings? (hierarchical, partitional)

Assignment: Reading 7 and continue Project 3.

Sep 6: Lecture 6, Data: Fundamentals

You should have finished Reading 5 and Project 2.

Challenge: calculate Pearson correlation for two objects and two attributes.

What can we assume about almost all non-random data?

How can we categorize and remember the fundamental data types?

What are some common dataset types?

Assignment: Reading 6 and begin Project 3.

Sep 3: Lecture 5, Correlation as Similarity, Attribute Issues in Similarity, Choosing a Metric

You should have completed Reading 4 and be working on Project 2.

Proximity metrics review (Minkowski/Euclidean, cosine similarity, SMC, Jaccard)

How does the Extended-Jaccard/Tanimoto Coefficient work?

When should you use the Tanimoto Coefficient?

How does Pearson Correlation represent similarity?

How do you calculate Pearson correlation (without a calculator)?

When should you use Pearson?

What are three common attribute issues when measuring similarity, and how might you handle those issues?

How do you choose a similarity metric?

Assignment: Reading 5 and finish Project 2.
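
For the Sep 3 question on calculating Pearson correlation by hand, the following is a minimal sketch that follows the same steps: subtract the means, sum the products of the deviations, and divide by the square root of the product of the summed squared deviations. The function name is an illustrative assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    squares_x = sum((a - mean_x) ** 2 for a in x)
    squares_y = sum((b - mean_y) ** 2 for b in y)
    if squares_x == 0 or squares_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cross / math.sqrt(squares_x * squares_y)
```

For `x = [1, 2, 3]` and `y = [2, 4, 6]` the deviations are (-1, 0, 1) and (-2, 0, 2), so the result is 4 / sqrt(2 * 8) = 1. The 1/n or 1/(n-1) factors cancel between numerator and denominator, so the coefficient is the same either way. Passing two objects' attribute vectors gives object correlation; passing two attributes' columns gives attribute correlation.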

Sep 1: Lecture 4, Similarity in Depth

You should have completed Reading 3 and started Project 2.

How does Euclidean distance represent similarity?

How does the Simple Matching Coefficient represent similarity?

When should you use SMC?

How does the Jaccard Coefficient represent similarity?

When should you use Jaccard?

How is Cosine Similarity calculated?

When should you use Cosine Similarity?

Assignment: Reading 4 and continue Project 2.
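
For the Sep 1 questions on Euclidean distance, SMC, Jaccard and cosine similarity, here is a minimal sketch of each measure. The function names are illustrative assumptions, and returning 0.0 for Jaccard when neither vector has a 1 is a convention chosen here, not a rule from the course.

```python
import math

def euclidean(x, y):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def smc(x, y):
    """Simple Matching Coefficient for binary vectors: 1-1 and 0-0 matches both count."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

def jaccard(x, y):
    """Jaccard coefficient for binary vectors: 0-0 matches are ignored."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0  # assumed convention for all-zero input

def cosine(x, y):
    """Cosine of the angle between two vectors; vector length is ignored."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```

For the sparse binary vectors `[1,0,0,0,0,0,0,0,0,0]` and `[0,0,0,0,0,0,1,0,0,1]`, SMC is 0.7 because the seven shared zeros count as matches, while Jaccard is 0 because the vectors share no 1s. That contrast is the usual reason to prefer Jaccard for asymmetric binary attributes.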

Aug 30: Lecture 3, Data Mining in Context, Algorithm Types, Dissimilarity

You should have completed Reading 2 and Project 1.

What are the two main categories of Data Mining strategies?

What are the five general types of strategies we will use in this class?

Discussion: CI ch01, BD ch01, "Mining Crime Data."

Project 1 & 2 overview.

Assignment: Reading 3 and Project 2.

Aug 27: Lecture 2, Data Mining in Context, Main Categories of Algorithms

You should have completed Reading 1.

Where does Data Mining fit in the big picture of "Knowledge Discovery?"

What are some limits to machine learning?

In what industries or contexts is data mining used? What are some examples?

Project 1 overview.

Assignment: Reading 2 and Project 1.

Aug 25: Lecture 1, Introduction

You should have skimmed the table of contents and the entire contents of DM, CI and BD.

Semester overview.

What is data mining?

Why does data mining even exist? What's the point?

Assignment: Reading 1
