Jensen - Courses — CMPSCI 591Y (Spring 2005)

Knowledge Discovery and Data Mining


CMPSCI 591Y • Spring 2005 • Tuesday & Thursday 1:00-2:15 • Lederle A339


David Jensen
238 Computer Science Building
Office hours are Wednesday 1:00-2:00 and by appointment

Teaching Assistant Matthew Rattigan
339 Computer Science Building
Office hours are Monday 3:00-4:00, Friday 2:00-3:00


Officially, the prerequisites are junior, senior or graduate standing in CMPSCI and at least one statistics course. However, I also admit students on a case-by-case basis, depending on interests and prior coursework.


Knowledge discovery is the process of discovering useful regularities in large and complex data sets. The field encompasses techniques from artificial intelligence (representation and search), statistics (inference), and databases (data storage and access). When integrated into useful systems, these techniques can help human analysts make sense of vast stores of digital information. This course presents the fundamental principles of the field, familiarizes you with the technical details of representative algorithms, and connects these concepts to applications in industry, science, and government, including fraud detection, marketing, scientific discovery, and web mining. The course assumes that you are familiar with basic concepts and algorithms from probability and statistics.


The text for the course will be D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001. It is available from many online retailers, including Amazon, Barnes & Noble, and Abebooks. We will be using the Weka 3 Data Mining Software in Java, as well as the R statistics package. Other online resources will be made available as the semester progresses.

Learning goals

In this course, you can expect to learn how to:

  • Identify the elements of knowledge discovery (KD) algorithms
  • Understand how those elements can interact in beneficial and pathological ways
  • Recognize instances of known KD tasks
  • Reformulate new problems so that they can be addressed by existing algorithms
  • Apply known algorithms in practical situations
  • Compose successful new algorithms from basic elements
  • Evaluate the performance of algorithms and entire systems

Educational context

This course can be seen as a 'capstone' to several other courses in the Computer Science and Mathematics curricula, including programming, databases, AI, and statistics. Full exploitation of systems for knowledge discovery and data mining requires integrating knowledge from all of these areas, and this integration provides powerful tools that are now widely used in business, government, and science. Successfully completing this course will give you a versatile and general toolbox for problems requiring analysis of large datasets, including problems in marketing, fraud detection, finance, fault diagnosis, engineering design, healthcare, and many other fields.

Teaching approach

For a course of this kind, I serve primarily as an organizer, guide, and resource, rather than as the sole source of information and the provider of correct answers. Learning this material requires your active participation, and that is one of the reasons for the course's structure. Throughout the semester, most homework assignments will contribute toward a final project in which you will synthesize a novel knowledge discovery algorithm and apply it to a novel data set. The results of some homework assignments will be new discoveries about previously unanalyzed data sets. Discussion of these results, and the methods used to obtain them, will be a substantial portion of some class periods. You will be expected to participate actively in class discussions and to pursue analyses of data sets for which there is no "right" answer.

What to expect during class

Lecture will take about 40 minutes of each class. The remainder will be spent discussing homework assignments, research papers, and case studies. Some classes will also include brief quizzes covering material drawn from the course text. For each class meeting, you are expected to have read the assigned selections from the text and any additional assigned readings. The additional readings will be discussed directly in class. Readings from the course text will be necessary to understand the lecture, but they will rarely be reviewed directly in the lecture.

Grading

Grades are based on 30% homework and quizzes, 30% two exams (at roughly the one-third and two-thirds points of the semester), 20% final project, and 20% class participation.

How to excel in this course

Work consistently — If, twice a week, you do the reading, complete the homework, and participate in class, then it will be hard not to succeed in this course. This is probably the best advice I can give you. You are unlikely to succeed if you expect to do the work of the course in a few short periods of work before exams and the final project deadline.

Participate actively — Passive participation during lectures and class discussions is unlikely to help you learn, and a passive approach will serve you even worse on your homework assignments and project. Participate actively in class and actively seek ways to satisfy the goals of the homework. For many of the homework assignments and in-class activities, there is more than one right answer.

Policies

Absence — Absences for approved medical, religious, or family reasons will not impact class participation grades. Absences for other reasons will result in a grade of zero for class participation that day. However, you can drop up to three class participation grades without penalty. If you have fewer than three missing class participation grades at the end of the semester, I will drop the days with the lowest grade(s). You can also elect to use one or more of your "free passes" to attend class, but not participate in discussion, provided that you notify me at the beginning of class.

Late work — Homework assignments are due at the beginning of class on the designated date. Late homework assignments will be accepted only if you have an approved medical, religious, or family absence on the day that the assignment is due. You can drop up to three homework assignments without penalty. If you have fewer than three missing homework assignments at the end of the semester, I will drop the assignments with the lowest grade(s).

Academic honesty — Incidents of academic dishonesty will be reported to the University Academic Honesty Board. You are responsible for knowing the University Academic Honesty Policy.

Working together — Discussion of homework assignments is not considered cheating and is strongly encouraged. If you receive substantial help from another person, you must acknowledge that person in your work. If you use any published or unpublished source in your own work, you must cite it in full.

If you have questions about these policies, please contact me.