Course Information
Course description
Ten years ago, who would have thought that R, the “environment for statistical computing and graphics”, would become one of the most popular programming languages for data scientists?
The impressive growth of R is not a coincidence. As a free and open-source alternative to expensive, proprietary software like SPSS, Matlab and Excel, R has always been strong in statistical data analysis and in creating powerful, aesthetically appealing graphics and charts.
While R attracted an almost exclusively academic audience in the 1990s and 2000s, the R community has since grown not only in sheer numbers but also in diversity, as people from different industries and backgrounds discover R’s usefulness for a wide range of applications. As of February 2020, more than 15,000 (!) packages had been published to CRAN, about half of them since 2015.
Especially in the last decade, the functionality and versatility of R have grown rapidly. Among the most popular R packages are:
In Data Science with R (DataSciR), you will learn the fundamentals of R and how to use the following packages for data science:
- the “tidyverse”, which includes packages like dplyr and tidyr for data manipulation and ggplot2 for data visualization,
- rmarkdown and knitr for reproducible & automated reporting,
- shiny for creating interactive web applications, and
- tidymodels for inferential and predictive modeling.
You will demonstrate your proficiency in these packages on a semester-long graded data science project.
Prerequisites
There are no mandatory prerequisites for DataSciR. However, you are expected to have a solid knowledge of fundamental data mining techniques, such as classification, regression and clustering. Hence, it is recommended that you have attended at least one of the following courses (or a comparable one):
Also, you should have basic programming and statistics knowledge. For example, you will learn the most important vector types and classes in R, but you will not learn what a vector or a class is in general. Likewise, you should know what terms such as mean, standard deviation and probability mean.
Recommended Reading
Data Mining / Statistical Analysis:
- Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Second edition. Springer, 2021.
- Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar. Introduction to Data Mining. Second edition. Pearson, 2020.
R-specific:
- Hadley Wickham, and Garrett Grolemund. R for Data Science. O’Reilly, 2017.
- Max Kuhn, and Julia Silge. Tidy Modeling with R. O’Reilly, 2022.
- Hadley Wickham. ggplot2 - Elegant Graphics for Data Analysis. 3rd edition. Draft version.
- Hadley Wickham. Mastering Shiny. O’Reilly, 2021.
- Yihui Xie, J. J. Allaire, and Garrett Grolemund. R Markdown: The Definitive Guide. Chapman & Hall/CRC, 2018.
- Hadley Wickham. Advanced R. 2nd edition, Chapman & Hall/CRC, 2019.
- Max Kuhn, and Kjell Johnson. Applied Predictive Modeling. Springer, 2013.
- Max Kuhn, and Kjell Johnson. Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press, 2019.
- Bradley Boehmke. Hands-on Machine Learning with R. Chapman and Hall/CRC, 2019.
Other:
- Introduction to Data Science course held by Mine Çetinkaya-Rundel at the University of Edinburgh.
- Jenny Bryan, and others. Happy Git and GitHub for the useR. 2018.
- Jeffrey Leek. Organizing Data Science Projects. Leanpub.com.
- RStudio cheat sheets
- RStudio primers
- RStudio webinars
- Quick-R (short tutorials on various topics, e.g. data import, statistics and graph generation)
Software
By the end of the first week, you should have installed the following software on your own laptop:
Also, please check whether you can successfully install packages. To do so, click on the Packages tab in the bottom-right pane in RStudio. Then, click on the Install button and specify an arbitrary package, e.g. dplyr. Finally, click on Install. Alternatively, you can install a package from the console with install.packages("dplyr"). If everything is set up correctly, no error messages should be displayed when you load the installed package with library(dplyr).
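The console check described above can also be scripted. The following is a minimal sketch, assuming you want to check all packages named in the course description at once; the package list and the helper name missing_packages are illustrative assumptions, not an official course script.

```r
# Packages named in the course description (an assumed list; adjust as needed).
course_packages <- c("tidyverse", "rmarkdown", "knitr", "shiny", "tidymodels")

# Return the subset of `pkgs` that cannot be loaded on this machine.
# `missing_packages` is a hypothetical helper, not part of any package.
missing_packages <- function(pkgs) {
  is_available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_available]
}

to_install <- missing_packages(course_packages)

if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # Uncomment the next line to install the missing packages from CRAN:
  # install.packages(to_install)
} else {
  message("All course packages are installed.")
}
```

The install.packages() call is left commented out so that running the script only reports what is missing and does not change your library without you noticing.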
FAQ
Q: Do I have to show up to regular course meetings?
A: Attendance at the general course meetings is not compulsory, although attending and actively participating is recommended. However, your project involves several deliverables during the semester, and some of them require attendance.
Q: Where can I find interesting datasets for my project?
A: Here is a list of websites with various real-world datasets:
- curated list of “awesome” public datasets
- companies:
- governments:
- List of R data packages that access various web APIs
- FiveThirtyEight
- “100+ Interesting Data Sets for Statistics”
Don’t use a small or built-in dataset like iris, mtcars or Titanic in your project! Also, very popular data sources like Kaggle or the UCI ML repository are discouraged because most of their datasets have already been studied extensively.
Q: I have no prior programming experience with R. Is that a problem?
A: Programming experience with R is not a mandatory prerequisite. An introduction to the fundamentals of the R language will be given in the second course week. However, please take into account that this introduction cannot cover all details. You are expected to work through additional resources and materials on your own or together with your team. Most of the referenced materials are freely available.
Q: Where can I get some inspiration for our project?
A: Please have a look at the hall of fame for an overview of student projects from 2019, 2020, and 2021.