Group Project Overview

During the semester, students are expected to work on a capstone project within a group. The capstone project is meant to showcase the ability to implement an analytical data pipeline. That is to say, we would like each student project to show evidence of the following stages:

R4DS Chapter 1: Data Science Project Overview

Timeline

There are five assignments for the project. Their due dates are:

  • Group Choice - Friday, March 30, 11:59 PM
  • Project Proposal - Friday, April 13, 11:59 PM
  • Project Demo Video - Monday, April 30th, 11:59 PM
  • Final Report - Tuesday, May 8, 10:00 PM (in place of the final exam)
  • Peer Evaluation - Tuesday, May 8, 10:00 PM

Analytical Data Pipeline

The final product of this project will consist of:

  1. a written report that details the construction of the data pipeline and
  2. code that can be run on a member of the course staff’s computer.
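One way to make code easier to run on someone else's computer is to check dependencies up front and avoid hard-coded absolute paths. The following is a minimal sketch, not a course requirement: the package list and the `data/hospital_data.csv` location are assumptions chosen for illustration.

```r
# Assumption: replace this vector with the packages your project actually uses.
required_packages <- c("stats", "utils")

# requireNamespace() returns FALSE (instead of raising an error) when a
# package is not installed, so all missing packages can be reported at once.
is_installed <- vapply(required_packages, requireNamespace,
                       logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]

if (length(missing_packages) > 0) {
  stop("Install these packages first: ",
       paste(missing_packages, collapse = ", "))
}

# Build paths relative to the project folder rather than a personal directory.
# Assumption: the data lives in a "data" subfolder of the repository.
data_path <- file.path("data", "hospital_data.csv")
```

Placing a check like this at the top of the main script lets a grader see immediately what needs to be installed before anything else runs.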

Data Selection

You may use or construct any dataset of your choice under two conditions:

  1. there is a minimum of 500 observations and 10 variables;
  2. data is not from either UC Irvine Machine Learning Repository or Kaggle and was not used as a data set in the course.

The dataset may be relevant to research outside of this course, another field, or some other interest of the group. If you have any questions about whether your data is appropriate, do not hesitate to ask. If you plan to use data from either a research project or a current job, be sure to obtain permission from the data controller.
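The size condition above (at least 500 observations and 10 variables) can be checked quickly once the data is loaded into a data frame. This is a sketch only: `meets_size_requirement()` is a hypothetical helper, and the simulated data stands in for a real dataset.

```r
# Hypothetical helper: TRUE when the data frame satisfies the minimum size.
meets_size_requirement <- function(data, min_rows = 500, min_cols = 10) {
  stopifnot(is.data.frame(data))
  nrow(data) >= min_rows && ncol(data) >= min_cols
}

# Simulated stand-in for a real dataset: 600 observations of 12 variables.
set.seed(385)
candidate <- as.data.frame(matrix(rnorm(600 * 12), nrow = 600, ncol = 12))

meets_size_requirement(candidate)            # TRUE
meets_size_requirement(candidate[1:100, ])   # FALSE: only 100 observations
```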

Group Project Checklist

The most successful projects tend to exhibit the following characteristics:

  • describe the problem or need succinctly;
  • explain the benefits of the project;
  • complete the project before the course ends;
  • approach the problem or need in multiple ways;
  • have fun working on the project.

Try to make sure these characteristics can be found in your project.

Task Specifics

Group Choice

For this project, you must work in groups of at least three students and at most four students. A portion of your grade will come from your ability to work in a group setting.

Please submit a roster of your group members by Friday, March 30, 11:59 PM. Send a single email to the STAT 385 e-mail box (stat-385@illinois.edu) that contains:

  • Name of each group member
  • NetID
  • Team Name

Please also CC all group members’ University emails to verify that each student is in the correct group.

If you would like to be assigned to a group, send an email to the STAT 385 e-mail box (stat-385@illinois.edu) and state in the body of the email that you would like to be assigned to a group. You must do so by Friday, March 30, 11:59 PM, but if you do so earlier, you may be assigned to a group earlier.

You may not be a group of one. Groups of one only exist if a student has been fired from their original group, at which point they must complete the group project themselves. To trigger this measure, the student’s group must:

  • Build a case against the team member that documents their inability to contribute to the project.
  • Schedule a meeting with the instructor to try to resolve this.
  • After one week, if the issues have not been addressed, the team member will be discharged from the group and will have to complete the group project by themselves.

Project Proposal

The project proposal is due on Friday, April 13, 11:59 PM.

It should be submitted via the group’s GitHub repository in stat385-sp2017/group-project-<team-name>.

After review of the proposal, it will be evaluated in one of two ways:

  • Approved: Your group may proceed with your plans for the data and project.
  • Pending: We will provide suggestions, concerns, or needed information that must be addressed before the proposal will be approved.

Within the group project proposal, students should have the following sections:

  • Introduction
  • Related Work
  • Method
  • Feasibility
  • Conclusion
  • References
  • Appendix

Content within every section except the Appendix should contain text only. All figures, tables, or supporting material should be placed within an Appendix and referenced from within the proposal.

All reports should be written in size 12 font, Times New Roman, and be single-spaced. Color is permitted for section headers. However, the use of color should be used sparingly within the body of the document. Label each section of the proposal clearly, e.g. “Introduction”, “Related Work”, …

Expectations regarding the contents of each section are outlined next. Make sure to answer each question fully within the given section.

Note: The proposal style is loosely based on the IMRAD style. The IMRAD structure is largely used to organize scholarly articles. Check out Carnegie Mellon University’s IMRD Cheat Sheet for more details.

Introduction

The introduction section provides a preview of the project’s focus. Within this section, provide an overview on the selected topic for the consumption of a manager. In essence, the manager must be able to understand what the project is and why they should support the endeavor. You are allowed to make the assumption that the manager is knowledgeable in base R concepts. Make sure to answer the following questions:

  • What problem or topic are you addressing?
  • Why is it interesting or important? In particular, what evidence supports this conclusion?
    • Cite papers or reputable sources that back up this claim. (You may want to find material using Google Scholar.)
  • Where did the problem or topic come from?
  • What is your idea for addressing the problem or topic?
  • How does your idea match with the course’s focus on statistical programming?

Related Work

The Related Work section must provide an overview of pre-existing solutions. In essence, please credit those who enabled you to consider embarking on this project, or, as Isaac Newton more aptly put it in a letter to Robert Hooke on 15 February 1676:

If I have seen further it is by standing on the shoulders of Giants.

Address the following questions:

  • What other ideas have been attempted?
  • Why is your team’s idea original compared to prior work?

Method

The Methods section should contain the overall details of the project including any preliminary work. In particular, the implementation details behind the approach should be explained at length here. The more details you can provide, the better feedback your group can receive. As a result, the section serves as a roadmap of what features are going to be developed and any external dependencies that are required. To satisfy this section, provide detailed responses for the following:

  • What packages will you use in your implementation?
  • What code will the group need to write for the project?
  • Provide sketches in the Appendix of:
    • Visualisations
      • Show sample graphs you plan to make.
    • Interface
      • How can someone use your product?
  • What have you done or learned so far for the project?

We primarily want to ensure that your project meets the criteria of the data science pipeline. In essence, we want to see evidence that your project includes:

  • Reading data into R or accessing data via an API.
  • Data transformations (e.g. Tidying (tidyr), Summarizing (dplyr), et cetera.)
  • Data visualization (e.g. ggplot2, plotly, gganimate)
  • R functions either in external packages or included in a new R package
  • Interactive Interface (e.g. shiny)
  • Reproducibility
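To make the stages above concrete, here is a minimal base R sketch of a read, transform, and visualize pipeline. It is an illustration under stated assumptions, not a template to copy: the inline data, the column names (`clinic`, `visits`, `wait_minutes`), and the `summarise_wait()` helper are all hypothetical, and a real project would likely use packages such as dplyr, ggplot2, and shiny instead.

```r
# Assumption: a tiny inline CSV stands in for the group's real data source.
raw_text <- "clinic,visits,wait_minutes
A,10,32
A,12,28
B,7,45
B,9,41
C,15,20"

# 1. Read data into R.
visits <- read.csv(text = raw_text, stringsAsFactors = FALSE)

# 2. Transform: mean wait time per clinic, wrapped in a reusable function.
#    (dplyr's group_by() and summarise() play the same role in the tidyverse.)
summarise_wait <- function(data) {
  aggregate(wait_minutes ~ clinic, data = data, FUN = mean)
}
wait_by_clinic <- summarise_wait(visits)

# 3. Visualize: write a bar chart to a temporary PDF file.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
barplot(wait_by_clinic$wait_minutes,
        names.arg = wait_by_clinic$clinic,
        ylab = "Mean wait (minutes)")
dev.off()

# 4. Reproducibility: record the R version and attached packages.
session_details <- sessionInfo()
```

Keeping each stage in its own function or script makes it easier to describe in the Method section and to assign to a specific group member.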

Feasibility

The Feasibility section is meant to act as a way to reflect upon the proposal. Generally speaking, there will be three weeks of heavy development time afforded to the group. Building a detailed ecosystem or heavily scripting in a different language will likely not lead your team to success. Hence, please provide a project management overview of who on your team will be doing what and when by answering:

  • Is this project able to be completed before the end of the semester?
  • What steps must occur to complete the project before the end of the semester?
  • What is the work plan to accomplish the necessary tasks before the end of the semester?
    • Specify who is doing what and when.
    • Consider making a Gantt chart to highlight each stage of the project.

Conclusion

The Conclusion section provides a summary of the entire proposal. This acts as the final paragraph that can be used to justify the work being proposed. In general, this means you should make one last push to identify the problem, potential solution, and its novelty.

References

The References section acts as a bibliography for all papers referenced in the Introduction, Related Work, and Method sections. The references should be formatted in Chicago author-date format, which is the default for RMarkdown.

  • Provide a list (5+) of papers or items you have read to write this proposal.
  • Please list all R packages or software referenced.

To acquire software citation information, R has a built-in command that creates a BibTeX entry and an in-line text citation. To generate the citation of an installed R package, type:

# In R
citation(package="pkg_name")
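As a concrete illustration, the sketch below cites the `stats` package, which ships with every R installation; substitute the packages your project actually uses. `toBibtex()` from the utils package converts the citation into BibTeX entries that can be pasted into a `.bib` file.

```r
# Citation object for a package that is always installed with R.
cit <- citation(package = "stats")

# Plain-text form, suitable for copying into a reference list.
print(cit, style = "text")

# BibTeX form, suitable for a .bib file used by RMarkdown.
bib <- toBibtex(cit)
print(bib)
```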

Appendix

The Appendix section contains figures, sample data, and other miscellaneous entries. Generally, this section is meant to hold all of your planning information.

  • Provide the sketches of visualisations or the shiny application.
  • Provide an overview on the desired functions.
    • What is a function’s input? Output? How are functions related to each other?
    • For example, read_data("hospital_data.csv") must be called before tidy_hospital(), et cetera.
  • Provide a sample of the data set you intend to use (~10 observations).
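A function overview like the one requested above might be sketched as follows. The bodies of `read_data()` and `tidy_hospital()` are hypothetical: only the function names and the file name come from the example above, while the arguments, column names, and cleaning rules are assumptions made for illustration.

```r
# Input:  path to a CSV file.
# Output: a data frame of raw records.
read_data <- function(path) {
  stopifnot(file.exists(path))
  read.csv(path, stringsAsFactors = FALSE)
}

# Input:  the data frame returned by read_data(), so read_data() runs first.
# Output: the same data with lower-case column names and incomplete rows dropped.
tidy_hospital <- function(raw) {
  stopifnot(is.data.frame(raw))
  names(raw) <- tolower(names(raw))
  raw[complete.cases(raw), , drop = FALSE]
}

# Usage with a small temporary file standing in for "hospital_data.csv".
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Ward,Beds", "ICU,12", "ER,", "Surgery,20"), csv_path)

raw <- read_data(csv_path)
hospital <- tidy_hospital(raw)

# A sample of up to 10 observations for the Appendix.
head(hospital, 10)
```

Writing the input and output of each function as a comment above its definition doubles as the overview requested for the Appendix.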

Project Demo Video

The project demo video is due by Monday, April 30th, 11:59 PM.

The goal behind the demo video is to provide an overview of the project and show how the solution presently works. The video should be between 3 and 7 minutes long.

For the overview, we would like to see 3 to 5 slides that briefly describe the problem/data, the method, and results. After this, please demo the solution in its entirety. Be wary of the time limit though!

The videos should be uploaded to either Google Drive or Box Sync and a download link sent in a single email to the STAT 385 e-mail box (stat-385@illinois.edu).

Suggestions

Below are suggestions on FREE software that can be used to record your screen and create a short demo video. If you need additional assistance, please visit the Media Commons @ UGL.

You do not have to use these software suggestions.

Screen Recording Software

To record your screen, you can use the following free screen recorders:

Editing Tools

To piece together different video clips, add sounds, or title cards, you can use one of the following movie editors:

Final Report

The final report for your project is due by Tuesday, May 8, 10:00 PM.

This report is largely an update of the initial project proposal. The goal here is to clean up the methods section so that it resembles the actual project methodology, and to include a results section and a discussion section that clearly convey the takeaways from working with the data.

Please see the rubric for more details.

Peer Evaluation

Peer evaluation of all group members is due by Tuesday, May 8, 10:00 PM.

The peer evaluation will involve individually rating each member of your group, suggesting a grade, and indicating whether any issues arose.

Please fill out the peer evaluation form:

https://goo.gl/forms/0Bb9Mlsn9VOxn3Po1

For difficulty logging into the form, please see the FAQ entry: Why is Google Forms telling me I’m outside of the organization when I try to fill out a survey?

Scoring Rubric

Proposal

  • Percent of Final Grade: 5%

The group proposal is structured to provide a moment for intervention or for bringing clarity to the project. The proposal largely serves as the basis for the final report, so time spent on the proposal will pay off significantly when the final report is due.

Having said this, you will be graded on whether each portion of the proposal is answered, the clarity of the content, appropriateness of data, formatting, et cetera.

  • Introduction
    • [2] Topic/Problem is explained.
    • [2] Motivation for solving the problem is described
    • [2] Data set selected is described and the source is given.
    • [2] Project’s focus is on statistical programming.
  • Related Work
    • [2] Familiarity with prior work.
    • [2] Proposed work has not been attempted before.
  • Methods
    • [6] At least one entry exists for each of the stages of the data science workflow.
    • [2] Preliminary work done to undertake the project
    • [2] Sketches provide ample evidence of forethought.
    • [2] Interface provides multiple input controls and output areas.
  • Feasibility
    • [2] Project can be completed within the time frame. (e.g. Not a reimplementation of existing methods.)
    • [2] Gantt chart or breakdown of work displayed
    • [2] Project lists specific contributors for steps in the method.
  • Conclusion
    • [2] Group has provided an executive summary of the proposal.
  • References
    • [2] At least 5 content references are included
    • [2] Citations are appropriately listed
    • [1] Packages being used are cited (citation(package = "pkg_name"))
  • General
    • [3] Grammar and Spelling
      • Free from spelling mistakes
      • Content follows a logical ordering
      • Audience considerations are accounted for (e.g. explain for the layperson / manager.)
    • [3] The appropriate formatting is followed
      • Report includes project title.
      • Team Name, Team Members, and NetIDs are included in the report.
      • Report is appropriately named and submitted.

Points      Status
> 35        Approved+
(30, 35]    Approved
(25, 30]    Approved-
< 20        Pending

Video

  • Percent of Final Grade: 5%

Guide to be Released

Final Report

  • Percent of Final Grade: 13.75%

The CAs will use the following point breakdown on the final report:

  • Introduction
    • [5] Problem statement.
      • What is the issue that has arisen?
    • [5] Relevance to audience.
      • Why should we be interested in the project?
    • [5] Description of data
      • What is the data and how is it related to the goal?
      • Please place the code book (e.g. description of each variable) in the appendix.
    • [5] Course connection
      • How does your idea match with the course’s focus on statistical programming?
  • Related Work
    • [5] Previous approaches
      • Who has done what so far on this problem?
    • [5] Novelty of Approach
      • How is your view original in comparison to this body of work?
  • Methods
    • [5] Appropriate methods from class are used.
    • [10] Methods are used correctly.
  • Results
    • [5] Results are clearly organized either visually or as a table.
  • Discussion
    • [5] Correct conclusions are drawn from the results.
    • [5] How the results relate to the goal is discussed.
    • [10] Results are connected to the motivation of the project.
  • Conclusion
    • [5] Appropriately summarize the project and end outcomes.
  • References
    • [5] Citations are appropriately listed
  • Code
    • [25] R is used appropriately.
      • Does your code perform the desired tasks?
      • Is your code readable?
      • Is your style consistent?
      • Does your code work on a different computer?
  • General
    • [5] Grammar and Spelling
      • Free from spelling mistakes
      • Content follows a logical ordering
      • Audience considerations are accounted for (e.g. explain for the layperson / manager.)
    • [5] The appropriate formatting is followed
      • Report includes project title.
      • Team Name, Team Members, and NetIDs are included in the report.
      • Report is appropriately named and submitted.

Peer Evaluation

  • Percent of Final Grade: 1.25%

When writing the peer evaluations, you will be asked to grade how well you did and, in turn, how well each other member of the group functioned. As a result, you should put thought into the reviews of each team member. Evaluations that are simplistic in nature, e.g. scoring all members as 100%, will likely result in a reduced peer evaluation grade.

The instructor reserves the right to further reduce a student’s overall project grade if their team members report that they did not attempt to make a significant contribution to the project.

FAQ

This section will likely be updated as we progress through the remainder of the semester.

What do you mean we cannot embed code in the report sections?

The goal here is to write the final report as a form of documentation for future students or employers to look over. As a result, there is a need to emphasize what the end project’s outcomes are to a general audience that is not as keyed into your project.

How long should the written reports be?

Reports should emphasize brevity and conciseness. As a result, there is no “minimum” page requirement; however, there is a “maximum” threshold. That is, avoid cluttering the report with extended text when simpler sentences would suffice.

Keep in mind that the group project is intentionally open-ended to see what your group will do without being given explicit steps, so have fun!

