CSE 8803 Special Topics: Big Data for Health Informatics

Course Creator and Instructor

Jimeng Sun
Creator, Instructor

Course Developer

David Joyner
Course Developer


Data science plays an important role in many industries. Faced with massive amounts of heterogeneous data, data scientists increasingly depend on scalable machine learning and data mining algorithms and systems. The growth in the volume, complexity, and speed of data drives the need for scalable data analytic algorithms and systems. In this course, we study such algorithms and systems in the context of healthcare applications.

In healthcare, large amounts of heterogeneous medical data have become available in various healthcare organizations (payers, providers, pharmaceuticals). Those data could be an enabling resource for deriving insights that improve care delivery and reduce waste. The enormity and complexity of these datasets present great challenges for analysis and for subsequent application in a practical clinical environment.

In this course, we introduce the characteristics of medical data and the data mining challenges of dealing with such data. We cover various algorithms and systems for big data analytics.

We focus on studying those big data techniques in the context of concrete healthcare analytic applications such as:

  1. Predictive modeling: e.g., how to predict disease risks for individual patients?
  2. Computational phenotyping: e.g., how to convert patient data from electronic health records into meaningful clinical concepts (phenotypes)?
  3. Patient similarity: e.g., how to measure similarity between patients within a specific context?
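As a rough illustration of the patient similarity question above, the sketch below compares patients by the cosine similarity of small feature vectors. Everything in it is an assumption made for illustration: the patient names, the feature values, the helper `most_similar`, and the choice of cosine similarity itself, which is one common measure and not necessarily the one used in the course.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical data: each vector holds counts of four made-up event types
# (for example, diagnosis or medication codes) for one patient.
patients = {
    "patient_a": [2, 0, 1, 3],
    "patient_b": [1, 0, 1, 2],
    "patient_c": [0, 4, 0, 0],
}


def most_similar(query, records):
    """Return the name of the patient most similar to `query`."""
    others = (name for name in records if name != query)
    return max(others, key=lambda name: cosine_similarity(records[query], records[name]))


print(most_similar("patient_a", patients))
```

In practice the "specific context" mentioned above would determine which features go into the vectors and how they are weighted.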

We also study big data analytic technology:

  1. Scalable machine learning algorithms such as online learning and fast similarity search
  2. Big data analytic systems
     a. Hadoop family (MapReduce, Hive, Pig, HBase)
     b. Spark (SparkSQL, MLlib and GraphX)


To succeed in this class, please ensure that you can answer "yes" to each of the following questions:

  • Have you acquired basic machine learning and data mining concepts like classification and clustering (such as you would find in the OMS Machine Learning class)?
  • Are you proficient in programming in Python, Java, C++, and/or Scala?
  • Are you proficient with dealing with data in SQL and NoSQL?

This course involves two major evaluations:

  • Three homework assignments, due roughly every three weeks throughout the first half of the semester.
  • One group project spanning the second half of the semester.


  • Homework - 45% (3 homework assignments, each worth 15%)
  • Participation - 5%
  • Projects - 50%
     a. Group Formation - 1%
     b. Proposal - 5%
     c. Mid-Presentation - 9%
     d. Final Presentation - 15%
     e. Final Paper - 15% (5% from peer evaluation and 10% from instructor and TA evaluation)
     f. Peer Review - 5%
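To make the arithmetic of the grading breakdown above concrete, here is a minimal sketch that combines the stated weights into a final percentage. The weights come from the list above; the component names, the `final_grade` helper, and the example scores are hypothetical and only for illustration.

```python
# Weights taken from the grading breakdown, in percent of the final grade.
# The dictionary keys are illustrative names, not official labels.
WEIGHTS = {
    "homework_1": 15,
    "homework_2": 15,
    "homework_3": 15,
    "participation": 5,
    "project_group_formation": 1,
    "project_proposal": 5,
    "project_mid_presentation": 9,
    "project_final_presentation": 15,
    "project_final_paper": 15,
    "project_peer_review": 5,
}


def final_grade(scores):
    """Return the weighted course grade in percent.

    `scores` maps each component name to a fraction in [0, 1];
    components missing from `scores` count as zero.
    """
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())


# Sanity check: the listed components account for the whole grade.
assert sum(WEIGHTS.values()) == 100

# Hypothetical student: full marks everywhere except 80% on the final paper.
example = {name: 1.0 for name in WEIGHTS}
example["project_final_paper"] = 0.8
print(final_grade(example))
```

Under these weights, the project components together carry half of the grade, so a weak final paper or presentation costs as much as an entire homework assignment.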

Required Course Readings

There are no required readings besides those supplied as a part of the class.

Minimum Technical Requirements

  • Browser and connection speed: An up-to-date version of Google Chrome, Mozilla Firefox, or Internet Explorer is strongly recommended. A download speed of 2+ Mbps is recommended; 1 Mbps is the minimum.
  • Cloud Computing - Amazon Web Services (AWS) or Microsoft Azure
  • Virtual Machine - Docker or another virtual machine environment will be needed

Operating System:

  • Windows XP or higher with latest updates.
  • Mac OS X 10.6 or higher with latest updates.
  • Linux - Any recent distribution that has supported browsers installed

Other Info

Academic Honesty

All Georgia Tech students are expected to uphold the Georgia Tech Academic Honor Code.