CSCI 568: Data Mining

Fall/Winter 2011

Discussion Highlights

Nov 30: Discussion 37, Support Vector Machines

You should be working on Project 9.

Assignment: Continue Project 9 and work on your portfolio.

Nov 28: Discussion 36, Intro to Non-Linear Classifiers

You should have finished Project 8.

What characterizes linearly separable classification problems?

What are some drawbacks to linear classifiers?

Midterm exam review

Project 9 introduction

Assignment: Reading 25 and Project 9.

Nov 18: Midterm Exam

Nov 16: Discussion 35, Unstructured Data Mining, Mining for Sentiment (guest lecturer)

You should have finished Reading 24, be reading the Knime workbook and working on Project 8.

Assignment: Prepare for the midterm exam and continue Project 8 and reading the Knime workbook.

Nov 14: Discussion 34, Building Models with Optimization and Genetic Algorithms

You should be working on Project 8 and reading the Knime workbook.

What characterizes optimization problems?

What is "hill climbing" and what are its benefits and drawbacks?

What is "random restart hill climbing"?

What is simulated annealing? How is it better than hill climbing?

How might you apply optimization methods in data analysis?

What are genetic algorithms?

How can optimization solutions be presented as "generations" in genetic algorithms?

Assignment: Reading 24, continue Project 8 and continue reading the Knime workbook.

Nov 11: Discussion 33, Data Preprocessing

You should be working on Project 8.

What is dimensionality reduction?

What are some approaches to dimensionality reduction?

What is feature subset selection?

What are some approaches to feature subset selection?

What is feature creation?

What are discretization, binarization and variable transformation?

Assignment: Continue Project 8.

Nov 9: Discussion 32, Data Preprocessing

You should be working on Project 8.

What is data preprocessing?

What are aggregation and sampling?

What is dimensionality reduction?

What is feature subset selection?

What is feature creation?

What are discretization, binarization and variable transformation?

Assignment: Reading 23.

Nov 7: Discussion 31, Association Rule Learners, Popular Data Mining Tools

You should have completed Project 7.

How might you test an association rule learner with Weka?

What are some common "brands" of data mining tools?

How are most data mining toolkits meant to be used?

What challenges does big data introduce to the use of common mining toolkits?

Assignment: Finish Project 8.

Nov 4: Discussion 30, Anomaly Detection

You should have completed Reading 22 and be working on Project 7.

What is an anomaly?

What are some common generators of anomalies?

What are the classic approaches to anomaly detection?

What are some important issues to consider when detecting anomalies?

Assignment: Finish Project 7.

Nov 2: Discussion 29, Rule Evaluation

You should have completed Reading 21 and be working on Project 7.

What are some common approaches to evaluating association rules?

What metrics are often used when evaluating association rules?

How is the data domain important when assessing the "worth" of association rules?

Assignment: Reading 22.

Oct 31: Discussion 28, Rule Generation

You should have completed Reading 20 and be working on Project 7.

What is the F_{k-1} × F_1 approach to frequent itemset generation?

What is the F_{k-1} × F_{k-1} approach to frequent itemset generation?

How might you efficiently generate association rules given all frequent itemsets?

Assignment: Reading 21.

Oct 28: Discussion 27, Frequent Itemset Generation

You should have completed Reading 19 and be working on Project 7.

What is a frequent itemset?

What are some brute-force approaches to generating frequent itemsets?

What is the Apriori approach to frequent itemset generation?

What is the classic itemset generation algorithm?

What are some efficient approaches to counting itemset support?

Assignment: Reading 20.

Oct 26: Discussion 26, Intro to Association Analysis

You should have completed Reading 18.

What is association analysis?

What do we mean by an "asymmetric binary dataset"?

What is an itemset?

What is an itemset support count?

What is an association rule?

How do you calculate the support and confidence of a rule?

What are the challenges in generating association rules?

What are the two general steps in efficiently generating association rules?

Assignment: Reading 19.

Oct 24: Discussion 25, Bayesian Classifiers & Bayes' Theorem

You should have completed Reading 17.

What is Bayes' Theorem?

How do Naive Bayesian Classifiers use probabilities to make predictions?

Assignment: Project 7 and Reading 18.

Oct 21: Discussion 24, Artificial Neural Network Classifiers

You should have completed Reading 16.

How do perceptrons work?

What problems can perceptrons not be properly trained to solve?

How do multiple layers allow us to solve non-linearly separable classification problems?

What is "hidden" about a hidden layer?

What is the importance of weights on input nodes?

What are some common activation functions?

What is backpropagation, in general?

Assignment: Reading 17.

Oct 19: Discussion 23

Oct 12: Discussion 22

Oct 10: Discussion 21, Rule-Based Classifiers, Intro to Naive Bayesian Classifiers and ANNs

You should have reviewed Reading 14.

What is a rule-based classifier? How does one work?

How can probability be used to create a classification model?

What is an ANN? How might one be used to create a classification model?

Assignment: Reading 15.

Oct 7: Discussion 20, Nearest-Neighbor Classifiers

You should be finished with Project 6.

What is a Nearest-Neighbor classifier and how does it work?

What are some benefits and drawbacks of an NN classifier?

Project 6 summary & discussion.

Assignment: No homework (but be prepared on Monday to discuss Reading 14).

Oct 5: Discussion 19, Evaluating Classifiers

You should be working on Project 6.

How do we evaluate classifier performance?

What are some mechanisms for testing the validity of a classifier?

What is cross-validation?

Assignment: Reading 14 and finish Project 6.

Oct 3: Discussion 18, Best Splits, Clustering Implementation Review

You should be working on Project 6.

What is the general algorithm for decision tree induction? (review)

How do we calculate the "best split"?

Clustering implementation review

Assignment: Continue Project 6.

Sep 30: Discussion 17, Decision Tree Algorithm

You should be working on Project 6.

What is the general algorithm for decision tree induction?

How do we determine the "best split"?

What is model overfitting? What are some common causes?

How do we evaluate classifier performance?

Assignment: Reading 13 and continue Project 6.

Sep 28: Discussion 16, Decision Trees

You should have completed Project 5.

What is decision tree induction?

How does a decision tree work?

How do we create decision trees? How are they used?

What is Hunt's algorithm?

How do attribute types influence attribute tests?

Assignment: Project 6.

Sep 26: Discussion 15, Intro to Classification

You should be working on Project 5.

In what way is clustering an "unsupervised classifier"?

What do we mean by "classification"?

What are two general uses of classification models?

What is the general approach to solving a classification problem?

What is a training set? What is a test set? How are they used?

What is the general approach for measuring the quality of a classifier?

What is a confusion matrix?

Assignment: Reading 12 and continue Project 5.

Sep 23: Discussion 14, DBSCAN

You should have finished Reading 11 and be working on Project 5.

Clustering quiz

How might you choose an appropriate Eps and MinPts?

How have the history of database systems, data warehouses and an explosion of data led to the "Rise of the Data Scientist"?

What is meant by data warehousing, ETL, and BI systems?

Assignment: Continue Project 5.

Sep 21: Discussion 13, DBSCAN, Strengths & Weaknesses

You should have finished Reading 10 and be working on Project 5.

What is density-based clustering?

What do we mean by density?

What are core, border and noise points?

What is DBSCAN? How does it work?

What are some strengths and weaknesses of DBSCAN?

Assignment: Reading 11 and continue Project 5.

Sep 19: Discussion 12, K-Means and Hierarchical Clustering

You should have finished Reading 9 and Project 4.

What are the strengths and weaknesses of K-Means clustering? Why?

What are the strengths and weaknesses of hierarchical clustering? Why?

Assignment: Reading 10 and Project 5.

Sep 16: Discussion 11, Intro to Clustering

You should be working on Project 4.

What are some ways you might try to implement "groups"?

What is the K-Means clustering algorithm? How does it work?

What is the classic hierarchical clustering algorithm? How does it work?

What are some qualities and drawbacks of these algorithms?

Assignment: Reading 9 and finish Project 4.

Sep 14: Discussion 10, Big Data and NoSQL/Distributed Databases

You should have finished Reading 8 and be working on Project 4.

What is PNUTS?

What are some scalability issues inherent to the relational model when handling "big data"?

What is the difference between scaling up vs. scaling out?

What are some interesting requirements and features of PNUTS?

How does PNUTS compare to other databases?

Assignment: Continue Project 4.

Sep 12: Discussion 9, Similarity Metrics

You should be finished with Project 3.

Are there any flaws in our calculation of Euclidean distance from Friday?

What are some fundamental proximity metrics used repeatedly in data mining algorithms?

What is SMC? In what contexts is it often used? How do you calculate it?

What is Jaccard? In what contexts is it often used? How do you calculate it?

What is Cosine Similarity? In what contexts is it often used? How do you calculate it?

What is the Pearson Correlation Coefficient? How might it be used? How do you calculate it?

Assignment: Reading 8 and begin Project 4.

Sep 9: Discussion 8, Hacking Similarity

You should be working on Project 3.

Case Study: How might you implement movie recommendations using similarity?

Assignment: Reading 7 and continue Project 3.

Sep 7: Discussion 7, Visualization Wrap-Up, Intro to Similarity Metrics

You should have completed Reading 6 and be working on Project 3.

How does visualization fit in the data mining process?

Who is Edward Tufte?

What do we mean by similarity, dissimilarity and distance?

How might you represent similarity between objects with a single attribute?

How do attribute types affect how similarity is computed/represented?

How do you compute data object dissimilarity with Euclidean distance?

What are some other proximity metrics?

Assignment: Review DM 2.4 and continue Project 3.

Sep 5: Discussion 6, Data Visualization & Exploration with Processing

You should have completed Reading 5 and Project 2.

What is Processing?

Why is it important for the data analyst to have a productive, flexible data visualization tool?

What are some examples of postmodern visualization? Why are they effective?

Assignment: Reading 6, Project 3 and Portfolio 1.

Sep 2: Discussion 5, History & Overview of Data Visualization

You should have completed Reading 4 and be working on Project 2.

What are some classic data visualization techniques?

What are some current trends in data visualization?

Why is data visualization an important part of the data mining process?

What are some historic and current pieces of literature from the visualization body of knowledge?

Assignment: Reading 5, finish Project 2.

Aug 31: Discussion 4, Data

You should have completed Reading 3 and be working on Project 2.

What are some common data "formats" encountered in the Data Mining wilderness?

What are PEIR and YFD? How is data influencing the world around us?

Why is knowing your data important?

What are some common characteristics of datasets and data?

What are some data quality issues?

What should you first do once you have obtained a dataset?

What are some common summary statistics? What are some easy ways of generating them?

Weka summary stats demonstration

Assignment: Reading 4, continue Project 2 and post one cool visualization on Piazza.

Aug 29: Discussion 3, Data Mining Road Map

You should have completed Reading 2 and Project 1.

What are two general categories of tasks in data mining?

What are some "families" of algorithms and tasks in data mining?

In general, what are some common examples of these algorithms?

What is an important, yet often overlooked, aspect in data mining?

Repository / project overview

Assignment: Reading 3 and Project 2.

Aug 26: Discussion 2, What Is Data Mining?

You should have completed Reading 1 and be working on Project 1.

What are some common words in the Data Mining vocabulary?

What are the names of some strategies or algorithms used in data mining?

Given a particular context, how might you apply the ideas behind some of these algorithms?

Assignment: Reading 2 and continue Project 1.

Aug 24: Discussion 1, What Isn't Data Mining?

You should have skimmed the table of contents and the entire contents of DM, CI and BD.

Semester overview.

What is data mining, in general?

What isn't data mining?

Why does data mining even exist? What's the point?

Assignment: Reading 1 and Project 1.
