You should be working on the Final Project & Final Portfolio.
Final project dataset discussion & hints.
Assignment: Finish the Final Project & Final Portfolio
You should be working on the Final Project & Final Portfolio.
What are some current challenges in the vast data landscape?
What does it mean to have a data silo?
What are some problems with identical data in multiple silos?
What are some current approaches (and research) being pursued to "connect" data?
Assignment: Finish the Final Project & Final Portfolio
You should have completed Reading 29.
A survey of domain-specific and algorithm-specific papers
Assignment: Finish the Final Project & Final Portfolio
You should have completed Reading 29.
What are some additional, current texts in data mining?
Final project overview
Assignment: Final Project, Final Portfolio
You should have finished Project 7.
How does non-negative matrix factorization (NMF) work?
Assignment: Reading 29.
You should have finished Reading 25, Portfolio 3 and be working on Project 7.
What is non-negative matrix factorization (NMF)?
What are some "advanced" mining strategies?
Assignment: Reading 26, Portfolio 4 and continue Project 7.
You should have finished Reading 24 and be working on Project 7.
How does an SVM work?
What are the criteria for kernel functions used with an SVM?
How might you apply a binary classifier to multi-label problems?
How is exploratory data analysis related to data mining?
Assignment: Reading 25, Portfolio 3 and continue Project 7.
You should have finished Reading 24 and be working on Project 7.
Assignment: Continue Project 7.
You should have finished Reading 23 and be working on Project 7.
What is a linear classifier?
What is a major limitation of linear classifiers on non-trivial datasets?
What is the "kernel trick"?
What are kernel functions?
What is meant by "maximum margin hyperplane"?
What are the two main goals of an SVM's function? (find support vectors, determine MM hyperplane)
Assignment: Complete Reading 24 and continue Project 7.
You should be working on Project 7.
Challenge: what are kernel functions, kernel methods, the kernel trick and Support Vector Machines?
Challenge: what is non-negative matrix factorization?
Assignment: Reading 23 and Project 7.
You should have finished Reading 22 and Portfolio 2.
Discussion: KNIME workflows, nodes, problems, transformations, tool limitations.
Discussion: Project 6.1 portfolio review.
Challenge: what are kernel functions, kernel methods, the kernel trick and Support Vector Machines?
Challenge: what is non-negative matrix factorization?
Assignment: Begin Project 7 (declare your dataset).
You should have finished Reading 21 and be working on Portfolio 2.
Discussion: Life in Data
How might you use kNN to predict numeric values?
How might you choose the right k?
What are some properties of common weighting functions?
Case study: heterogeneous similarity, delivering a solution.
Assignment: Reading 22 and finish Portfolio 2.
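A minimal sketch for the kNN questions above, assuming numeric feature vectors and Euclidean distance. The function name `knn_predict`, the `weight` parameter, the inverse-distance weighting function and the toy training data are all illustrative assumptions, not course material.

```python
import math

def knn_predict(train, query, k, weight=lambda d: 1.0):
    """Predict a numeric value as the weighted average of the k nearest targets.

    train  -- list of (feature_vector, numeric_target) pairs
    query  -- feature vector to predict for
    weight -- maps a distance to a vote weight (default: every neighbor counts equally)
    """
    # Sort training items by distance to the query and keep the k closest.
    by_distance = sorted((math.dist(x, query), y) for x, y in train)
    nearest = by_distance[:k]
    weights = [weight(d) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

def inverse_weight(d, offset=0.1):
    """Hypothetical weighting function: closer neighbors count more.
    The offset avoids division by zero when d == 0."""
    return 1.0 / (d + offset)

# Toy data (made up): one feature, one numeric target.
train = [([1.0], 10.0), ([2.0], 20.0), ([3.0], 30.0), ([10.0], 100.0)]
plain = knn_predict(train, [2.1], k=2)
weighted = knn_predict(train, [2.1], k=2, weight=inverse_weight)
```

With k=2 the two nearest neighbors have targets 20 and 30, so the unweighted prediction is their mean, while the inverse-distance version is pulled toward the closer neighbor. Trying several values of k against held-out data is one common way to approach the "right k" question.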
You should have applied mining strategies against your dataset via your toolkit.
What additional characteristic do decision trees provide that other classifiers do not?
For what types of attributes are decision trees often a better fit?
Can a decision tree provide a probabilistic assignment of classification?
For what problems are decision trees a poor choice?
How might you devise a plan for creating a system that displays "users most like you" given heterogeneous attributes and missing data?
Assignment: Reading 21 and Portfolio 2.
You should be experimenting with applying mining strategies against your dataset via your toolkit.
Reading 20 discussion
Mining strategy discussion
Assignment: Continue mining your dataset.
You should be experimenting with applying mining strategies against your dataset via your toolkit.
What are some common optimization techniques?
What is the "hill-climbing" algorithm? What are its drawbacks?
What about "random-restart hill-climbing" and its drawbacks?
What is simulated annealing, and how might you use it in a DM context?
How do simple genetic algorithms work?
Assignment: Reading 20 and continue mining your dataset.
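A minimal sketch for the hill-climbing questions above, assuming a one-dimensional integer search space and a cost to minimize. The names `hill_climb` and `best_of_restarts` and the toy cost landscape are hypothetical. Restart points are passed in explicitly here to keep the example deterministic; in practice they would usually be drawn at random.

```python
def hill_climb(cost, start, lo, hi):
    """Move to the cheapest neighbor until no neighbor improves the cost."""
    current = start
    while True:
        neighbors = [n for n in (current - 1, current + 1) if lo <= n <= hi]
        best = min(neighbors, key=cost)
        if cost(best) >= cost(current):
            return current  # stuck: a local (not necessarily global) minimum
        current = best

def best_of_restarts(cost, starts, lo, hi):
    """Run hill climbing from several starting points and keep the best result."""
    results = [hill_climb(cost, s, lo, hi) for s in starts]
    return min(results, key=cost)

# Toy cost landscape (made up): local minimum at index 1, global minimum at index 5.
landscape = [5, 3, 4, 6, 2, 0, 1, 7]
cost = lambda i: landscape[i]

stuck = hill_climb(cost, 0, 0, 7)              # stops at the local minimum
better = best_of_restarts(cost, [0, 3], 0, 7)  # a second start finds the global minimum
```

The run from index 0 illustrates the main drawback: the search stops at the first point with no better neighbor. Restarts reduce that risk at the price of more evaluations, and they still give no guarantee of finding the global minimum.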
You should have finished Project 6.7 - 6.9.
(midterm review)
(project discussion)
Assignment: Experiment with applying mining strategies against your dataset via your toolkit.
You should be working on Project 6.7 - 6.9.
What is data preprocessing?
What are some common preprocessing tasks?
What are the most common dataset transformations that usually take place?
What is aggregation?
What is sampling?
What is dimensionality reduction?
What is feature subset selection?
What is feature creation?
What are discretization and binarization?
What is variable transformation?
Project discussion: summary statistics with KNIME
Assignment: Complete Reading 19 and continue Project 6.7 - 6.9.
You should be working on Project 6.7 - 6.9.
Assignment: Continue Project 6.7 & 6.9.
You should be finished with Project 6.1 - 6.6.
Project Discussion: data preprocessing, dependencies
Assignment: Study for the midterm and begin Project 6.7 - 6.9.
You should be finished with Reading 17 and be working on Project 6.
Project Discussion: your dataset(s) and tool choices, tinkering and data exploration
Assignment: Continue Project 6.
You should be working on Reading 17 and Project 6.
What are some general summary statistics we should create for all datasets?
What is visualization? Why do we care?
What are some modern trends in Information Visualization?
Project Discussion: your dataset(s) and tool choice
Assignment: Continue Reading 17 and Project 6.
You should have finished Reading 16 and be finishing Project 5.
What are some common approaches in anomaly detection?
Project preliminaries & expectations.
Assignment: Finish Project 5, begin Reading 17 and Project 6.
You should be working on Project 5.
What is an anomaly?
What is anomaly detection?
What are some common causes of anomalies?
What are some common contexts and approaches in anomaly detection?
What are some common issues to be aware of when detecting anomalies?
Everyone should master Excel: Analysis ToolPak, optimization
Statisticians love R: histograms, scatterplots, linear regression, correlation, error
Assignment: Reading 16 and finish Project 5.
You should have finished Reading 15 and be working on Project 5.
What are the two main steps in a general association analysis strategy?
What types of datasets are traditional association analyses run against?
What is an association rule? A frequent itemset?
What are support and confidence? How are they used to evaluate rules and itemsets?
What is the apriori principle and how is it applied to frequent itemset generation and ruleset generation?
What are some general algorithms for generating frequent itemsets?
What are some general algorithms for generating association rules?
What are most frequent itemset generation algorithms sensitive to?
What are some general ways of evaluating association rules / patterns?
What are some current research topics in the area of association analysis?
Assignment: Continue Project 5.
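A minimal sketch for the support and confidence questions above, assuming transactions are stored as Python sets. The function names and the market-basket data are illustrative assumptions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """How often the consequent appears in transactions that contain the antecedent.
    Assumes the antecedent occurs at least once (non-zero support)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Toy market-basket data (made up for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

s = support({"milk", "diapers", "beer"}, transactions)       # 2 of 5 transactions
c = confidence({"milk", "diapers"}, {"beer"}, transactions)  # 2 of the 3 with milk and diapers
```

The apriori principle follows from how `support` is defined: adding an item to an itemset can never raise its support, so any superset of an infrequent itemset can be pruned without counting it.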
You should have finished Reading 14 and Project 4.
What is Bayes' Theorem? How is it used to calculate dependent probabilities?
What is a Bayesian Classifier and how does it work?
What is association analysis?
Assignment: Reading 15 and begin Project 5.
You should have finished Reading 13 and be working on Project 4.
What is a Rule-Based Classifier and how does it work?
What are some common Rule-Based Classifier algorithm 'brands'?
What two measurements are typically used for all rule evaluation?
What are two general approaches to building a Rule-Based Classifier?
What is a k-Nearest Neighbors Classifier and how does it work?
What are some characteristics of a kNN classifier?
Assignment: Reading 14 and finish Project 4.
You should have finished Reading 12 and be working on Project 4.
What is Hunt's Algorithm?
What are two important issues regarding attribute test conditions?
How do you split different attribute types?
How do you determine which attribute to test, and what values to use?
How do you measure the quality of a split? As compared to what?
How do you stop the tree growth?
What are some qualities of decision trees you should remember?
Assignment: Reading 13 and continue Project 4.
You should have finished Reading 11.
How do you treat bias in a perceptron?
What is the general learning algorithm for a perceptron?
Why are perceptrons limited to linearly separable classifications?
What is a multi-layer Artificial Neural Network (ANN) and how does it work?
To discover (via Project 4): what is back-propagation and why is it necessary when training a multi-layer ANN?
What are five issues to keep in mind when designing an ANN?
What are five general characteristics of ANNs?
What is decision tree induction and how does it work?
Assignment: Reading 12 and Project 4.
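A minimal sketch for the perceptron questions above, assuming a step activation, a learning rate of 1 and zero initial weights. The function names and the choice of the logical AND function as training data are illustrative assumptions. The bias is kept as a separate term here, which is equivalent to a weight on an extra input that is always 1.

```python
def predict(weights, bias, x):
    """Step activation: output 1 if the weighted sum plus bias is positive."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0

def train_perceptron(samples, epochs=20, rate=1.0):
    """Perceptron learning rule: nudge weights by rate * error * input on each mistake."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)
            if error != 0:
                errors += 1
                weights = [w + rate * error * xi for w, xi in zip(weights, x)]
                bias += rate * error
        if errors == 0:
            break  # every sample classified correctly
    return weights, bias

# Logical AND is linearly separable, so the rule converges on it.
and_samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_samples)
```

Replacing the targets with XOR (0, 1, 1, 0) leaves the loop unable to reach an error-free epoch, because no single line separates those classes. That limitation is what motivates multi-layer networks.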
You should have finished Reading 10, Portfolio 1 and Project 3.
Discussion review regarding clustering.
What is classification?
What is the difference between descriptive modeling and predictive modeling?
What is the general process of building a classifier?
How are training datasets and test datasets used to build a classifier?
What two basic metrics are always used to evaluate classifier performance?
What is a perceptron and how does it work?
Assignment: Reading 11.
You should have finished Reading 9 and be finishing Project 3.
How does the DBSCAN algorithm work?
How can we use cohesion and separation to describe clusters?
How can we determine the 'correct' number of clusters?
What comparisons can we extract from K-means and DBSCAN?
What are some important characteristics of clusters and clustering algorithms?
Assignment: Reading 10, finish Portfolio 1 and continue Project 3.
You should have finished Reading 8 and be working on Project 3.
What are some important issues regarding k-means clustering?
What are some important issues regarding hierarchical clustering?
How does density-based clustering work? What are some important issues about this strategy?
In what situations might you use these particular clustering strategies?
Discussion: Information Platforms; "Data Scientist"
Assignment: Reading 9, Portfolio 1 and continue Project 3.
You should have finished Reading 7 and be working on Project 3.
How does clustering fit in an overall data mining strategy?
How might we use k-means clustering to answer, "Do meaningful groups exist?"
What is hierarchical clustering?
What is K-Means clustering?
What are the differences between the two clustering algorithms?
Discussion: cloud storage
You should have finished Reading 6 and be working on Project 3.
What are some problems that affect data quality?
What is the simple definition of high quality data?
Pearson correlation review: is our implementation valid? When do you use Bessel's correction? Object correlation vs. attribute correlation. When should you normalize attribute values?
What are two types of structures for clusterings? (hierarchical, partitional)
You should have finished Reading 5 and Project 2.
Challenge: calculate Pearson correlation for two objects and two attributes.
What can we assume about almost all non-random data?
How can we categorize and remember the fundamental data types?
What are some common dataset types?
You should have completed Reading 4 and be working on Project 2.
Proximity metrics review (Minkowski/Euclidean, cosine similarity, SMC, Jaccard)
How does the Extended-Jaccard/Tanimoto Coefficient work?
When should you use the Tanimoto Coefficient?
How does Pearson Correlation represent similarity?
How do you calculate Pearson correlation (without a calculator)?
When should you use Pearson?
What are three common attribute issues when measuring similarity, and how might you handle those issues?
How do you choose a similarity metric?
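A minimal sketch for the Pearson and Tanimoto questions above, assuming equal-length numeric lists. The function names and sample vectors are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations.
    The 1/(n-1) factors cancel, so plain sums of deviations are enough.
    Undefined (division by zero) if either vector is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def tanimoto(x, y):
    """Extended Jaccard (Tanimoto) coefficient: x.y / (|x|^2 + |y|^2 - x.y).
    Assumes the vectors are not both all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

r_up = pearson([1, 2, 3], [2, 4, 6])     # perfectly linear, increasing
r_down = pearson([1, 2, 3], [6, 4, 2])   # perfectly linear, decreasing
t_same = tanimoto([1, 0, 1], [1, 0, 1])  # identical vectors
t_part = tanimoto([1, 1, 0], [0, 1, 1])  # one shared attribute out of three present
```

On binary vectors the Tanimoto formula reduces to the ordinary Jaccard coefficient, which is one way to check a hand calculation.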
You should have completed Reading 3 and started Project 2.
How does Euclidean distance represent similarity?
How does the Simple Matching Coefficient represent similarity?
When should you use SMC?
How does the Jaccard Coefficient represent similarity?
When should you use Jaccard?
How is Cosine Similarity calculated?
When should you use Cosine Similarity?
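A minimal sketch of the four proximity measures raised in this session, assuming plain Python lists as vectors. The function names, the sample vectors and the choice to return 0.0 from `jaccard` when neither vector contains a 1 are illustrative assumptions.

```python
import math

def euclidean(x, y):
    """Euclidean distance: a dissimilarity, so smaller means more alike."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def smc(x, y):
    """Simple Matching Coefficient for binary vectors: matching attributes / all attributes."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

def jaccard(x, y):
    """Jaccard coefficient for binary vectors: like SMC but ignores 0-0 matches."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0  # convention chosen for this sketch

def cosine(x, y):
    """Cosine similarity: dot product over the product of vector lengths.
    Assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Sparse binary vectors (made up): they agree mostly on absent attributes.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
```

For `p` and `q`, SMC counts the seven shared zeros as agreement while Jaccard sees no shared 1s at all. That gap is why Jaccard is usually preferred for sparse, asymmetric binary data.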
You should have completed Reading 2 and Project 1.
What are the two main categories of Data Mining strategies?
What are the five general types of strategies we will use in this class?
Discussion: CI ch01, BD ch01, "Mining Crime Data."
Project 1 & 2 overview.
You should have completed Reading 1.
Where does Data Mining fit in the big picture of "Knowledge Discovery?"
What are some limits to machine learning?
In what industries or contexts is data mining used? What are some examples?
Project 1 overview.
You should have skimmed the table of contents and the entire contents of DM, CI and BD.
Semester overview.
What is data mining?
Why does data mining even exist? What's the point?
Assignment: Reading 1