Course Syllabus

Instructor Information

Tyler Moore

Office: S141 Science Center (next to the Leaky Beaker)

Email: tmoore@cs.wellesley.edu

Office Hours: Monday 3PM - 4PM, Wednesday 4PM - 5:30PM in Micro-Focus, Thursday 4PM - 5:30PM in Micro-Focus, and by appointment

Email Hours: I strive to respond to course-related emails within 24 hours on weekdays. Inevitably I may overlook some messages; if more than 24 hours have passed, feel free to send me a reminder. If you need immediate help, try posting your question on Piazza.

Learning Objectives

Upon completing this course, you will have learned how to collect, analyze and visualize data for a variety of applications. You will acquire skills in script-based programming, where the goal is to write code that solves tasks quickly. This is subtly different from the paradigm introduced in CS 111 and 230, where the code you wrote created an artifact that solves a problem (e.g., sorting a list, designing a game). By contrast, the code you will be writing will help you gather and analyze data relevant to a topic of interest. The one exception will occur toward the end of the semester, once you have learned techniques for analyzing data. Then you will also learn how to construct user interfaces to data that enable others to glean insights without having to write code.

There are four primary learning objectives for the course, outlined in more detail below. The syllabus goes into this much detail because this is a new course, so it is important for everyone to be on the same page about what we are aiming to learn this semester. Please be advised that, depending on how the course progresses, we may not cover all the material listed here.

Goal 1: Learn how to collect data to answer questions quantitatively

  1. Formulating questions data can answer
    • Explore potential topics of interest
    • What types of questions can data answer?
    • Understanding available data
      • Categorical vs. numerical data
      • Time-based or not
      • What does ideal data look like?
    • Sources of bias
    • Collecting control and treatment data
    • Combining disparate data sources
  2. Write scripts to automate collection
    • Methodologies for data collection
    • Direct data set downloads
    • Web page scraping
    • API access
    • Skills acquired: Python, HTML, crontab, regular expressions (grep)
  3. Understand how to store and query data
    • Data collection process
      1. Store raw collected data
      2. Create processed data designed for analysis
        • Flat files (CSVs)
        • Database entries (MongoDB)
    • Skills acquired: Python, crontab, MongoDB
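
As a rough illustration of the workflow in item 3 (keep the raw collected data, then derive a processed file designed for analysis), here is a minimal Python sketch. The HTML snippet, the regular expression, and the function names are hypothetical, invented for this example. A real script would download the page, save it to disk unchanged, and be scheduled with crontab rather than run by hand.

```python
import csv
import io
import re

# Hypothetical raw page content; a real script would fetch this from the web
# and store it unchanged before any processing.
RAW_HTML = """
<ul>
  <li class="price">Widget: $4.50</li>
  <li class="price">Gadget: $12.00</li>
</ul>
"""

# Assumed pattern for this made-up page: an item name, then a dollar amount.
PRICE_PATTERN = re.compile(r'<li class="price">(\w+): \$(\d+\.\d{2})</li>')


def extract_rows(raw_html):
    """Return (item, price) tuples found in the raw HTML."""
    return [(name, float(price)) for name, price in PRICE_PATTERN.findall(raw_html)]


def to_csv(rows):
    """Serialize processed rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["item", "price"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(RAW_HTML)
processed_csv = to_csv(rows)
print(processed_csv)
```

Keeping the raw data separate from the processed CSV means the extraction step can be rerun later if the pattern turns out to miss some records.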

Goal 2: Learn techniques of analysis to yield your own insights

  1. Using R to study data
    • Introduction to R syntax and data structures
      • Vectors, matrices, data frames and lists
      • I/O
    • Techniques for accessing data
      • The many ways to select particular rows and columns of data frames
      • Using logical vectors to select data subsets using conditionals
    • Data aggregation
      • Using functional programming to transform data
        • Entire datasets transformed (sapply, lapply)
        • Transform based on categorical variables (tapply, aggregate)
      • Creating categorical variables from numerical variables
    • Transforming data for analysis
    • Skills acquired: R
  2. Data exploration
    • Summary statistics
    • Box plots
    • Histograms
    • Cumulative distribution functions
    • Skewness and kurtosis
    • Comparing values split by categorical variables
    • Skills acquired: data processing, manipulation and plotting in R
  3. Analyzing a single variable
    • Summarizing variables
    • Matching empirically-observed distributions to theoretical distributions
      • Representative distributions
      • Kolmogorov-Smirnov test
    • Categorical variables
      • Testing for differences using the Chi-squared test
    • Part-to-whole analysis
    • Deviation analysis
  4. Comparing multiple variables
    • Response variables
    • Explanatory variables
    • Linear regression
    • Logistic regression
    • Survival analysis
    • Visualizing more than two variables
      • Trellises
      • Heat maps
  5. Time series analysis
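
The course uses R for this goal. Purely as an illustration, and written in Python rather than R, the sketch below shows the idea behind transforming data based on a categorical variable (the job that tapply or aggregate does in R): split a numerical variable by category, then summarize each group. The data values and the function name are made up for the example.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up observations: (categorical variable, numerical variable).
observations = [
    ("east", 10.0),
    ("west", 4.0),
    ("east", 14.0),
    ("west", 6.0),
    ("west", 8.0),
]


def summarize_by_group(pairs):
    """Group values by category and compute summary statistics per group."""
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)
    return {
        category: {"n": len(values), "mean": mean(values), "median": median(values)}
        for category, values in groups.items()
    }


summary = summarize_by_group(observations)
for category, stats in sorted(summary.items()):
    print(category, stats)
```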

Goal 3: Learn how to facilitate analysis for others

  1. Architecture for making data available online
    • Python CGI scripting
    • Storing data on a web server
    • Presenting a subset of available data to users
    • Using JavaScript APIs to present data
  2. Designing a web interface to data
    • Selecting subset of interest to user (optional)
    • Trade-off between presenting known insights and letting users find their own
  3. Google Charts API
    • Selecting appropriate charts
      • Scatter plots
      • Bar charts
      • Motion charts
      • Annotated timelines
    • User-driven filters
    • Presenting data to Google Charts API
      • Trade-off between what is kept on the web server and what is made available to plot
      • Queries from MongoDB or loading flat-file CSVs
      • Creating a web interface to R using Google Charts API

Goal 4: Learn how to summarize the findings of your analysis

Unlike many programming tasks, your code is not necessarily the primary deliverable. Instead, the coding is a means to an end: delivering better understanding of a question that can be answered by collected data. We will discuss and practice techniques for:

What's not covered

Unfortunately, there is only so much information that can be covered in a semester-long project-based course. I have deliberately chosen to err on the side of covering too little rather than too much. Worthy topics not covered include:

  1. Network analysis
  2. Data mining techniques using artificial intelligence and machine learning
  3. Analysis of huge datasets (peta-scale and beyond)

Fortunately for Wellesley students, the first two topics are covered in CS315: Web Search and Data Mining. The second topic is also covered in CS349A: The Intelligent Web and CS232: Artificial Intelligence. Finally, we only scratch the surface in terms of statistical analysis. For those students who want to dig deeper, particularly those with an economic bent, I encourage you to take ECON 203: Econometrics and ECON 242: The Information Economy.

Textbook and Readings

We are using a single physical textbook this semester:

We will also use web-based resources and book chapters. Readings will appear in the schedule next to the day you are expected to have completed the reading.

Course announcements in Google Groups

Course announcements will be made to the CS349B Announcements Google Group. By default, you will receive these messages in your Wellesley email.

Asking Questions Using Piazza

This term we will be using Piazza for class discussion. The system is designed to get you help quickly and efficiently from classmates, the tutors, and instructors. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza. Find our class page at: http://www.piazza.com/wellesley/spring2012/cs349b. Reading the Piazza forums on a regular basis is a requirement for the course.

Coursework

Assignments

There are 5 assignments, each equally weighted. For full details, see the assignments H0-H4 linked to from the schedule.

The assignments must be done in pairs. When working in a pair, both members must collaborate closely on the assignment and turn in a single assignment with both names listed. You may not split the tasks up between members; instead, all tasks must be completed together. All programming should be done with both members sitting at the same computer. Members should take turns actually typing in the code at the computer.

Furthermore, you can only work together as a pair for one assignment. After that, you should each find a new partner for subsequent assignments. Working in pairs is mandatory. While finding time to work with your partner can be difficult, in my experience working in pairs can substantially reduce the total amount of time required for working on assignments. This is because it is easier to squash bugs with two pairs of eyes, particularly when you are learning a new language.

Project

The goal of the semester-long course project is to give students the opportunity to apply new skills in the context of a real-world topic, from beginning to end. Projects should be carried out in teams of 3 students, including at least one sophomore or junior. The project incorporates several intermediate milestones during the course of the semester. Most milestones will require groups to apply skills learned in the preceding individual homework assignments to the project.

Deliverables for each project milestone will include working code, as in any other project-based CS course. However, deliverables will also include well-reasoned written explanations of the tasks completed. For more details, see the project page.

Blog

One primary goal of the course is to improve your skills in summarizing the findings of your analysis. Communicating technical topics clearly and succinctly can be hard. To get more practice, you will maintain a blog during the semester. You will write six separate blog posts, mostly in the context of the class project.

In addition to improving writing skills, the blog posts are designed to foster collaboration between students. Because groups will be working on a diverse range of topics, this presents an opportunity to learn from each other about the different approaches that are suitable. Students will be expected to read a number of blog posts from classmates before specified classes, so that they can be discussed in lecture. For full details, see the blog post assignments linked to from the schedule.

Finally, because most blog posts will be project-related, they are designed to make writing up the project report P5 a bit easier. Blog posts can be adapted into sections of the project report.

Grade Distribution

I use standard percentage cut-offs when determining letter grades (e.g., [93-100] is an A, [90-93) is an A-, [87-90) is a B+, etc.). I do not use a curve in assigning grades, as I believe grading on a curve discourages collaboration among students. Occasionally, though, a particular assignment may be too difficult and so I reserve the right to adjust the score appropriately.

Quantitative Reasoning Overlay

This course satisfies the Quantitative Reasoning overlay requirement.

Attendance and Participation Policy

I expect you to attend classes and participate in class discussions. I understand that occasionally circumstances may arise so that you must miss class. This is OK, but I would appreciate it if you sent me an email in advance letting me know that you won't be able to attend class. Chronically missing class is not acceptable, and I reserve the right to penalize the course grade in the event of persistent absence.

I also expect that you will keep up with the reading. In particular, I expect that you will do the reading assigned for the day on the schedule prior to attending class.

Late Work

The assignments are designed to prepare you for tasks on the course project, and often build on concepts introduced in earlier assignments. Consequently, it is essential that you do not fall too far behind. As a result, assignments and project tasks really are due at the time stated in the course schedule.

There are three exceptions to this policy. First, if you have an emergency (e.g., serious illness, death in the family), please let me know as soon as possible so we can work out an accommodation.

Second, students are given 5 lateness coupons for assignments to use throughout the semester, with one coupon equal to a 24-hour extension. Project teams are given 3 lateness coupons, which grant an extension for all project tasks except the blog post portion of an assignment. Blog posts must be turned in on time so that other students can review what has been written and give timely feedback. Lateness coupons cannot be used for the final presentations either. Finally, no lateness coupons can be used for the final report, as there is a strict deadline: work must be turned in by the end of the exam period.

To redeem a lateness coupon, you must send an email to qtw@cs.wellesley.edu with subject "Lateness coupon" BEFORE the assignment is due. In the body of the email, please let me know how many coupons you wish to redeem.

The third exception to the strict deadline policy is for unforeseen circumstances that affect everyone: the power goes out two hours before an assignment is due, for example. In this case, I will extend the deadline in a reasonable manner (e.g., extend by 24 hours after power is restored).
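
To make the lateness-coupon arithmetic concrete, here is a small Python sketch, assuming each redeemed coupon simply adds 24 hours to the original due time. The helper name, its parameters, and the due date are hypothetical and chosen only for illustration. The rules stated in this section remain the authority.

```python
from datetime import datetime, timedelta

COUPON_EXTENSION = timedelta(hours=24)  # one coupon = one 24-hour extension
STUDENT_COUPONS = 5  # per-student allowance for assignments, per the policy above


def extended_deadline(due, coupons_redeemed, coupons_remaining=STUDENT_COUPONS):
    """Return the new deadline after redeeming coupons (hypothetical helper)."""
    if not 0 <= coupons_redeemed <= coupons_remaining:
        raise ValueError("cannot redeem more coupons than you have left")
    return due + coupons_redeemed * COUPON_EXTENSION


# Hypothetical due date, chosen only for the example.
due = datetime(2012, 3, 1, 23, 59)
new_due = extended_deadline(due, coupons_redeemed=2)
print(new_due)
```

Redeeming two coupons on an assignment due at 11:59PM moves the deadline to 11:59PM two days later.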

Collaboration and Attribution

I encourage collaboration between students on assignments and when studying. Collaboration is an essential skill for software development, not to mention life in general. Unless I say otherwise, feel free to discuss assignments and the project with your classmates, including ideas for how to solve problems. Please do not, however, share code that solves an assignment directly with other students. Solutions to homeworks should be written from scratch and not pieced together from other students' work.

It is also important to give credit to others when appropriate. If you implement an idea that you got from another student (or students), please say so. Furthermore, if you consult a web resource that directly assists you, please say so. As a reminder, it is also not acceptable to copy code directly from a web resource that solves a problem on an assignment.

Extra Credit

It is my policy to not offer extra credit assignments on a per-student basis. To ensure fairness, extra credit may only be offered to all students, and would most likely take the form of a modest reward for attending an optional lecture, not an extra assignment.

Special Needs

If you have any special needs, please come see me to discuss how to best accommodate you.