CSCI 568: Data Mining

Fall/Winter 2008

Lectures

Dec 3: Lecture 36, The Data Mining Process, Semester Review

You should be working on your final project

Example: Data Mining Attrition Analysis

Discussion: What have we covered this semester?

Dec 1: Lecture 35, The Data Mining Process

You should be working on your final project

Example: Mining and Modeling Criminal Behavior

Nov 24: Lecture 34, Additional Data Mining Techniques

You should be working on your final project

Nov 21: Lecture 33, The Data Mining Process: Rapid Miner

You should have read DM p349 - 353 and be working on your final project

Example: Rapid Miner and Declarative Data Mining

Example: Performance Analysis w/ Rapid Miner

Assignment: Continue your final project

Nov 19: Lecture 32, The Data Mining Process, Association Analysis Rule Generation

You should be working on your final project

Example: Feature Subset Selection w/ Orange

Assignment: Read DM p 349 - 353 and continue your final project

Nov 17: Lecture 31, The Data Mining Process, Association Analysis

You should have reviewed DM p327 - 331, read DM p332 - 349, and be working on your final project

Example: Association Rules w/ Orange

Assignment: Continue your final project

Nov 14: Lecture 30, The Data Mining Process

You should be working on your final project

Case Study: Functional Genomics w/ Orange

Assignment: Review DM ch6 p327 - 331, read DM ch6 p332 - 349, continue your final project

Nov 12: Lecture 29, The Data Mining Process

You should have started your final project

Case Study: Chemogenomical Analysis w/ Orange

Case Study: Defining a Gene Set Network w/ Orange

Assignment: Continue your final project

Nov 10: Lecture 28, Visualization with Processing

You should have downloaded and installed the Orange toolkit

What is Processing? What's the big deal?

How does Processing work?

Historically, what has been a unifying challenge across the fields of data mining, statistics, and graphic design?

Assignment: Begin working on your final project

Nov 7: Lecture 27, Tools

What are a few of the more popular and free tools for data mining today?

What are some modern tools for data visualization?

What are some good 'rules of thumb' when choosing tools?

Assignment: Download and install the Orange toolkit

Nov 5: Lecture 26, Anomaly Detection

Nov 3: Lecture 25, Intro to Anomaly Detection

Oct 31: Lecture 24, Rule-Based Classifiers, Nearest-Neighbor Classifiers

You should have read DM p208 - 223 and finished portfolio assignment 13

What are two ways to order rules?

What are two ways to build a rule-based classification model?

What are two ways for a rule-based classifier to learn a rule?

What are some characteristics of rule-based classifiers?

What is a nearest-neighbor classifier and how does one work?

What are two things the quality of a nearest-neighbor classifier is dependent on?

What are some characteristics of nearest-neighbor classifiers?

Assignment: Read DM ch5 p223 - 227, CI ch8 and begin portfolio assignment 14

Oct 29: Lecture 23, Intro to Rule-Based Classifiers

You should have read CI ch7 p142 - 158 and begun portfolio assignment 13

Assignment: Read DM p208 - 223 and finish portfolio assignment 13

Oct 24: Lecture 22, Model Overfitting, Evaluating Classification Models

You should have finished reading DM ch 4 and finished portfolio assignment 12

Assignment: Read CI ch7 p142 - 158 and begin portfolio assignment 13

Oct 22: Lecture 21, Decision Tree Classifiers

You should have read DM ch 4.4 - p193 and begun portfolio assignment 12

What's the general algorithm for decision tree induction?

What are some characteristics of decision tree classifiers?

Assignment: finish reading DM ch 4 and finish portfolio assignment 12

Oct 20: Lecture 20, Decision Tree Classifiers

You should have read DM ch 4 - p172, and finished portfolio assignment 11

How do attribute types affect test conditions?

How do you measure the 'best split' for a dataset, given a particular attribute test?

What are entropy, gini impurity, and classification error? How do you measure them?

Assignment: Read DM ch 4.4 - p193 and begin portfolio assignment 12
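
As a rough companion to the Lecture 20 question on entropy, Gini impurity, and classification error, the sketch below computes all three for a list of class labels at a single node. It uses the standard textbook definitions; the function names are illustrative and are not taken from DM or CI.

```python
from collections import Counter
from math import log2


def class_probabilities(labels):
    """Return the fraction of records belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def entropy(labels):
    """Entropy: -sum(p * log2(p)) over the classes present."""
    return -sum(p * log2(p) for p in class_probabilities(labels) if p > 0)


def gini(labels):
    """Gini impurity: 1 - sum(p^2) over the classes present."""
    return 1 - sum(p * p for p in class_probabilities(labels))


def classification_error(labels):
    """Classification error: 1 - (fraction of the majority class)."""
    return 1 - max(class_probabilities(labels))
```

All three measures are zero for a node containing a single class and reach their maximum when the classes are evenly mixed, which is why a lower value after a split indicates a "better" split.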

Oct 17: Lecture 19, Fisher's Method, Intro to Decision Tree Classifiers

You should have read DM 5.3 - p240, read CI ch6, and begun portfolio assignment 11

How is Fisher's method different from Bayes' theorem?

How does the computational cost of a naïve Bayesian classifier compare to that of an ANN?

What is the most common approach to building a classification model?

What is a decision tree?

How do you build a decision tree and what are the primary design challenges?

What is Hunt's algorithm?

Assignment: read DM ch 4 - p172, and finish portfolio assignment 11

Oct 15: Lecture 18, Bayes' Theorem, Naïve Bayesian Classifiers

What is conditional probability?

What is Bayes' Theorem?

What is a naïve Bayesian classifier and how does it work?

Assignment: read DM 5.3 - p240, read CI ch6, begin portfolio assignment 11
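
For the Lecture 18 question on Bayes' Theorem, a minimal worked sketch may help. It applies P(A|B) = P(B|A) P(A) / P(B) to a two-class case, with P(B) expanded by the law of total probability. The function name and the spam-filter numbers in the usage note are hypothetical and were chosen only for illustration.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Return P(A|B) for a binary event A using Bayes' theorem.

    prior:                P(A)
    likelihood:           P(B|A)
    likelihood_given_not: P(B|not A)
    """
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    if evidence == 0:
        raise ValueError("P(B) is zero, so P(A|B) is undefined")
    return likelihood * prior / evidence
```

With made-up numbers, if 20% of messages are spam, a given word appears in 50% of spam and in 10% of non-spam, then `posterior(0.2, 0.5, 0.1)` gives the probability that a message containing the word is spam: 0.10 / 0.18, or about 0.56.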

Oct 10: Lecture 17, Midterm review, Genetic Algorithms in General

You should have finished portfolio assignment 10

What is a genetic algorithm?

In general, how does a genetic algorithm work?

Assignment: enjoy fall break

Oct 8: Midterm

Oct 6: Midterm review

Assignment: prepare for the midterm

Oct 1: Lecture 16, Stochastic Optimization & Data Mining

You should have begun reading CI ch 5

What is optimization and how is it related to data mining?

What is the simple Hill-Climbing algorithm and what is its main fault?

What is Simulated Annealing and how does it work?

Assignment: finish reading CI ch 5, complete portfolio assignments 9 and 10
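
To make the Lecture 16 hill-climbing question concrete, here is a small sketch, assuming a one-dimensional integer search space and a cost function to minimize. The function name, the neighbor rule (one step left or right), and the toy cost table are all assumptions made for illustration; this is not the CI ch 5 implementation.

```python
def hill_climb(cost, start, lower, upper):
    """Greedy descent: move to the cheapest neighbor until none improves."""
    current = start
    while True:
        # Neighbors are one step left or right, kept inside the bounds.
        neighbors = [n for n in (current - 1, current + 1) if lower <= n <= upper]
        best = min(neighbors, key=cost)
        if cost(best) >= cost(current):
            return current  # no neighbor is strictly better: stop here
        current = best


# Hypothetical cost landscape with a local minimum at index 2 (cost 1)
# and the global minimum at index 7 (cost 0).
COSTS = [5, 3, 1, 3, 5, 4, 2, 0, 2, 4]


def toy_cost(i):
    return COSTS[i]
```

Starting from index 0 the search stops at index 2, a local minimum, while starting from index 9 it reaches the global minimum at index 7. The result depends entirely on the starting point, which is the weakness that random restarts and simulated annealing try to address.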

Sep 26 & 29: Lecture 15, Artificial Neural Networks and Backpropagation

You should have read DM ch 5.4 and finished reading CI ch 4

What is an artificial neural network?

What is a perceptron?

What distinguishes a feed-forward, multi-layer ANN?

What is backpropagation?

Assignment: continue working on portfolio assignments 9 and 10

Sep 24: Lecture 14, Intro to Association Analysis, Artificial Neural Networks

You should have read DM ch 4 to p150, DM ch 6 to p332, and begun reading CI ch 4

In general, how does the PageRank algorithm work?

What is association analysis?

What are two main challenges in association analysis?

What are support and confidence?

What two main tasks are involved in the process of association analysis?

What is a perceptron and how does it work?

Assignment: read DM ch 5.4, finish reading CI ch 4, begin portfolio assignment 9
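
For the Lecture 14 question on support and confidence, the following sketch computes both over a list of transactions, using the usual definitions: support is the fraction of transactions containing an itemset, and confidence of X -> Y is support(X and Y together) divided by support(X). The function names and the market-basket data are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule X -> Y."""
    support_x = support(antecedent, transactions)
    if support_x == 0:
        return 0.0  # the rule never applies in this data
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support_x


# Hypothetical market-basket data, for illustration only.
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
```

In this toy data, {milk, diapers, beer} appears in 2 of 5 baskets, so its support is 0.4, and the rule {milk, diapers} -> {beer} holds in 2 of the 3 baskets that contain milk and diapers, so its confidence is about 0.67.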

Sep 22: Lecture 13, Intro to Classification and Web Mining

You should have completed portfolio assignments 7 & 8

What is "information retrieval" and how does it relate to data mining?

How does a modern Web search engine work?

What is classification and how is it used?

In plain terms, how might you describe classification algorithms?

How might you measure the quality of a classification model?

Assignment: read DM ch 4 to p150, DM ch 6 to p332, begin reading CI ch 4

Sep 19: Lecture 12, Assignment 7/8 (clustering) Review, Similarity of Binary Data

You should have read DM section 3.4, and have begun portfolio assignment 8

How would you explain the hcluster() and kcluster() functions as implemented in Python?

What is one limit to using Euclidean distances or Pearson coefficients as a similarity metric?

What are two ways to measure similarity of binary data?

What is the Simple Matching Coefficient?

What is the Jaccard Coefficient?

What is the 'real' Tanimoto Coefficient?

Assignment: complete portfolio assignment 8
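
As an illustration for the Lecture 12 questions on binary similarity, the sketch below computes the simple matching coefficient and the Jaccard coefficient for two equal-length 0/1 vectors, using the standard definitions. The function names are illustrative, and the vectors in the usage note are made up.

```python
def match_counts(a, b):
    """Count attribute pairs: (both 1, both 0, mismatched)."""
    f11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    f00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    mismatched = len(a) - f11 - f00
    return f11, f00, mismatched


def simple_matching(a, b):
    """SMC: matching attributes (1-1 and 0-0) over all attributes."""
    f11, f00, _ = match_counts(a, b)
    return (f11 + f00) / len(a)


def jaccard(a, b):
    """Jaccard: 1-1 matches over attributes where at least one is 1."""
    f11, _, mismatched = match_counts(a, b)
    denominator = f11 + mismatched
    return f11 / denominator if denominator else 0.0
```

For `a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]` and `b = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]`, the simple matching coefficient is 0.7 because seven positions are 0 in both, while the Jaccard coefficient is 0.0 because no position is 1 in both. The contrast shows why Jaccard is often preferred for sparse, asymmetric binary data.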

Sep 17: Lecture 11, Exploring Data, Multidimensional Analysis

You should have read DM ch 3 through p 132 and have been working on assignment 7

What are some fundamental summary statistics used to describe a dataset?

How might you convert record-based datasets into multi-dimensional arrays?

What benefit does multi-dimensional analysis provide?

Assignment: read DM section 3.4, begin portfolio assignment 8

Sep 15: Lecture 10, Clustering, Intro to Data Visualization

You should have read DM section 3.3 and have been reading and hacking CI ch. 3

How does clustering fit in an overall data mining strategy?

How might we use k-means clustering to answer, "Do meaningful groups exist?"

What is visualization? Why do we care?

Assignment: portfolio assignment 7

Sep 12: Lecture 9, Clustering

You should have been skimming the first half of DM ch. 3 and be ready to implement code from CI ch. 3

What is hierarchical clustering?

What is K-Means clustering?

What are the differences between the two clustering algorithms?

Assignment: Read DM section 3.3 (visualization), continue reading and hacking CI ch 3

Sep 10: Lecture 8, Data, Distance & Similarity

You should have read DM ch. 2 p36 - 88 and completed portfolio assignment 6

What are some realistic issues about 'dirty' data?

Ultimately, what does it mean to have data of 'high quality?'

What are some common approaches to data preprocessing?

On what bases might you choose a particular proximity measure?

Assignment: Begin skimming DM ch 3, begin reading and hacking CI ch 3

Sep 8: Lecture 7, Data Theory, Similarity and Filtering

You should have read DM to p36 and been hacking the examples in CI ch2.

What is filtering?

What are some ways of defining 'similarity' for a machine?

What properties of numbers are commonly used to describe attributes of data?

What are some general characteristics of datasets?

What are some types of datasets?

Assignment: Read DM ch. 2 p36 - 88, portfolio assignment 6

Sep 3: Lecture 6, Data Theory

You should have completed the Python tutorial.

What is "collective intelligence" and what does it have to do with data mining?

What can we assume about almost all non-random data?

What is Euclidean distance?

What is the Pearson correlation coefficient?

Assignment: Read DM ch. 2 through p36. Begin implementing examples from CI ch. 2
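
As a rough reference for the two Lecture 6 questions on Euclidean distance and the Pearson correlation coefficient, the sketch below implements both for plain Python lists of equal length. The function names are illustrative and this is not the CI ch. 2 code, which works on nested preference dictionaries.

```python
from math import sqrt


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    if spread_a == 0 or spread_b == 0:
        return 0.0  # a constant vector has no defined correlation
    return covariance / sqrt(spread_a * spread_b)
```

Note the difference in behavior: `[1, 2, 3]` and `[2, 4, 6]` are perfectly correlated (Pearson of 1) even though their Euclidean distance is not zero, because Pearson ignores differences in scale and offset.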

Aug 29: Lecture 5, Python Exceptions, Classes & Objects, and the stdlib

You should have completed portfolio assignment 4

How can you handle exceptions in Python?

How do you declare classes in Python?

How do Python namespaces work?

What are some common Python packages?

Assignment: portfolio assignment 5

Aug 27: Lecture 4, Python Data Structures, Modules and I/O

You should have completed portfolio assignment 3

How do you create and manipulate common data structures in Python?

What is a module? What is a package?

What are some simple ways of manipulating text output?

How do you read and write files in Python?

What is pickle?

Assignment: portfolio assignment 4

Aug 25: Lecture 3, Python Fundamentals, Flow Control, and Functions

You should have completed portfolio assignment 2

How do you use Subversion and Python?

(Due to time, these lecture topics will be condensed in the following lecture)

What is Python good for and why do we care?

What are some fundamental Python syntax constructs?

How are strings represented in Python?

What are the main Python flow control statements?

How do you define a function in Python?

How do you create lambda expressions in Python?

Assignment: portfolio assignment 3

Aug 22: Lecture 2, Data Mining in Context, Main Categories of Algorithms

You should have installed svn and python, and read DM ch 1

Where does Data Mining fit in the big picture of "Knowledge Discovery?"

What are the main categories of Data Mining strategies?

Assignment: portfolio assignment 2

Aug 20: Lecture 1, Introduction

You should have skimmed the table of contents and the entire contents of DM and CI

Semester overview.

What is data mining?

Why does data mining even exist? What's the point?

Assignment: read DM through ch 1, portfolio assignment 1
