What is libSVM?
What is the difference between multi-label and multi-class classification?
What are two assumptions about classifiers used in a successful ensemble approach?
What are four general approaches to ensemble classification?
What are boosting and bagging and how do they work?
Assignment: Complete project 5.1.
What is a linear classifier?
What is a limitation of linear classifiers on non-trivial datasets?
What is the "kernel trick?"
What are kernel functions?
What is meant by "maximum margin hyperplane?"
What are the two main goals of an SVM's function? (find the support vectors, determine the maximum margin hyperplane)
Assignment: Complete project 5.1.
Discussion: Collective Intelligence chapter 9
Assignment: Complete project 5.1.
Assignment: Complete project 5.1.
What is feature subset selection?
What is one general algorithmic approach to feature selection?
Assignment: Complete project 5.1.
Research outline (NMF, SVMs).
What is a support vector machine?
For what kind of dataset is an SVM proven to be effective?
Assignment: Complete project 5.1.
What is non-negative matrix factorization?
For what kind of dataset is NMF proven to be an effective strategy?
Assignment: Complete project 5.0.
You should have finished project 4.0 and project 5.0.
You and your partner will briefly present your dataset and mining ideas in class.
Assignment: Complete reading assignment 13.
You should hand in the midterm and be working on project 4.0 and project 5.0.
What are some common optimization techniques?
What is the "hill-climbing" algorithm? What are its drawbacks?
What about "random-restart hill-climbing" and its drawbacks?
What is simulated annealing, and how might you use it in a DM context?
How do simple genetic algorithms work?
Assignment: Complete project 5.0.
You should be working on project 4.0.
What is a datacube?
How might you transform a flat set of data into a three-dimensional datacube?
Assignment: Complete midterm exam.
You should have finished assignment 12.
What are the two main problems facing any data miner?
What is the solution?
Assignment: Complete project 4.0.
What are some common contexts for anomaly detection?
What are common causes of anomalies?
What are some issues to keep in mind when choosing your detection strategy?
How does a statistical approach to anomaly detection work?
How does a proximity-based approach to anomaly detection work?
How does a density-based approach to anomaly detection work?
How does a clustering-based approach to anomaly detection work?
What are some issues to keep in mind when using each of the above strategies?
Assignment: Complete assignment 12.
You should have finished project 3.1.
What is backpropagation and how does it work?
What is association analysis?
How are association rules similar to classification rules?
What are the two major steps in association analysis?
How do we generate frequent itemsets?
What is the apriori principle?
You should have finished assignment 11 and be working on assignment 10.
Assignment: Finish project 3.1.
You should have finished assignment 10.
Assignment: Complete assignment 11 and project 3.1.
You should have finished assignment 9 and Project 2.2.
What is the general algorithm for directly generating rulesets?
What are the concepts behind the "sequential covering" and "Learn-One-Rule" ideas?
What are two rule-growing strategies?
How is rule quality evaluated?
How are rulesets extracted from decision trees?
What are some characteristics of rule-based classifiers?
What are some characteristics of the kNN decision boundaries?
What two things are important to keep in mind when using kNN classifiers?
What's a simple way of representing Bayesian conditional probability for classification?
Assignment: Complete assignment 10.
You should have finished assignment 8 and Project 2.1.
What is Bayes Theorem? How is it applied to classification?
What is an artificial neural network?
What is a perceptron?
How is an ANN applied to classification?
Assignment: Complete assignment 9 and Project 2.2.
You should have completed Project 1.6.
Introduction of new dataset (weather data).
What is a rule-based classifier?
What is a nearest-neighbor classifier?
Assignment: Begin reading assignment 8 and complete Project 2.1.
You should have read about decision trees and be working on Project 1.6.
What are some characteristics of decision trees?
What is model overfitting?
What is k-fold cross-validation?
Assignment: Finish Project 1.6.
You should be reading about classification and working on Project 1.6.
Classification quiz.
Assignment: Continue reading about classification and begin Project 1.6.
You should be finishing Project 1.5.
Discussion of classifier benefits and drawbacks.
Discussion of key terms to be found in our readings.
Intro to decision tree classifiers and Hunt's Algorithm.
Assignment: Begin reading assignment and finish Project 1.5.
You should have finished Project 1.4 and Homework 6, and be prepared to discuss your findings.
Discussion and application of clustering.
Discussion of results.
Assignment: Project 1.5.
You should be working on Project 1.3 and Homework 6, and be prepared to discuss your findings.
Discussion of summary statistics.
Discussion of clustering theory, application, expectations, strategy.
Assignment: Project 1.4, Homework 6.
You should be exploring the data and be prepared to discuss your summary statistics.
Discussion of problem clarification.
Brief discussion of mining strategies (clustering, classification trees).
Assignment: Project 1.3, Homework 6.
You should be working on Project 1.2 and be prepared to discuss your findings.
What does the Instances class in Weka represent?
About how many records are in train.all?
What's a "longitudinal sequence?"
What issues did you encounter when importing train.all into your database/tool of choice?
Discussion of attributes and values.
What pattern did we see with diagnosis codes? What is their datatype?
You should be finished with Project 1.1 and be prepared to discuss your findings.
Problem definition
Starting with what we know
Clarification of unknowns, next steps
Assignment: Project 1.2.
You should definitely have WEKA working properly in your dev env.
Project overview (Informs 2009 Data Mining contest challenge)
Initial project & data discussion
Assignment: Project 1.1.
How might you run a cluster analysis with WEKA?
You should be finishing Homework 5.
What is WEKA?
What data mining tools does it provide?
How do you preprocess data with Weka?
You should be working on Homework 5.
What is the most popular approach for building a classifier?
What is an association rule?
What are support and confidence?
What are the two main steps in association analysis?
What is cluster analysis? What are two popular algorithms for determining clusters?
What are two general approaches in detecting anomalies?
You should have completed Homework 4.
What are some general summary statistics we should create for all datasets?
What is visualization? Why do we care?
How do you create a visualization using Processing?
How do you get Processing to use your dataset?
Assignment: Read DM chapter 3. Start Homework 5.
You should have completed Homework 4.
What is visualization? Why do we care?
What are some modern trends in Information Visualization?
Assignment: Read DM chapter 3. Start Homework 5.
You should definitely have Processing and git working properly in your dev env.
How does Euclidean distance represent similarity?
How does the Simple Matching Coefficient represent similarity?
When should you use SMC?
How does the Jaccard Coefficient represent similarity?
When should you use Jaccard?
How is Cosine Similarity calculated?
When should you use Cosine Similarity?
How does the Extended-Jaccard/Tanimoto Coefficient work?
When should you use the Tanimoto Coefficient?
How does Pearson Correlation represent similarity?
When should you use Pearson?
Assignment: Re-read (again, really!) DM chapter 2.4 and CI chapter 2. Homework 4.
You should have read DM chapter 2.4, CI chapter 2.
What do we mean by similarity and dissimilarity?
How do we convey this meaning to a machine?
How do we measure dissimilarity between two attribute values?
How do we measure dissimilarity between two data objects?
What are some important issues surrounding dissimilarity measurements?
What is collaborative filtering?
What's the difference between user-based collaborative filtering and item-based collaborative filtering?
Assignment: Re-read (really!) DM chapter 2.4 and CI chapter 2.
You should have finished Homework 3.
Clustering example using the Orange Data Mining Toolkit.
Example of SQL Server Data Mining tools.
Review: Data Preprocessing.
Assignment: Prepare for quiz (data, preprocessing). Read DM chapter 2.4, CI chapter 2.
You should be working on Homework 3.
Homework 3 / toolkit demonstration
What is aggregation?
What is sampling?
What is dimensionality reduction?
What is feature subset selection?
What is feature creation?
What are discretization and binarization?
What is variable transformation?
Assignment: Read DM chapter 2.3 (pages 44 - 65).
You should be working on Homework 3.
What can we assume about almost all non-random data?
What are some types of data?
What are some types of data sets?
What are some problems that affect data quality?
What is the simple definition of high quality data?
Assignment: Read DM chapter 2.1, 2.2 (pages 19 - 44).
You should have completed Homework 2.
What are the two main categories of Data Mining strategies? (review)
What are the five general types of strategies we will use in this class?
Toolkit overview and explanation of homework 3
Assignment: Homework 3
You should have completed Homework 1.
Where does Data Mining fit in the big picture of "Knowledge Discovery?"
What are some limits to machine learning?
In what industries or contexts is data mining used? What are some examples?
What are the two main categories of Data Mining strategies?
Assignment: Homework 2
You should have skimmed the table of contents and the entire contents of DM and CI.
Semester overview.
What is data mining?
Why does data mining even exist? What's the point?
Assignment: Homework 1