Understanding the World with Data

STAT 20: Introduction to Probability and Statistics

Adapted by Gaston Sanchez

Agenda

  • Introductions
  • The Data Science Lifecycle
  • Types of Claims with Practice
  • Course Structure and Syllabus
  • Intro to R and RStudio
  • Looking forward

Introductions

  • Let us first introduce ourselves!

Introductions

  • In groups of 3, take turns introducing yourselves to one another by providing the info listed on the handout (your name, hometown, etc.).

  • Each person should finish with a handout filled in with info on their groupmates. Make sure you save this for next week!

05:00

The Data Science Lifecycle

Two Years Ago …

What’s going on with crashes in California?

01:00


Understand
the World

Data


Takeaways from this exercise

We can call the following process the data science lifecycle:

  • having a question,
  • finding data to investigate that question,
  • reaching a conclusion,
  • and then thinking of a next step, which starts everything over again.

This lifecycle involves constructing and critiquing claims made using data, which is the main goal of our course!

Types of Claims

Course Goal

To learn to critique and construct
claims made using data.



Summary

A numerical, graphical, or verbal description of an aspect of data that is on hand.



Example
Using data collected from students in Stat 20 (Fall 2025), the proportion of students—in this class—born in California is 75%.
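A minimal sketch of how a proportion like this one could be computed in R. The vector born_in_ca is hypothetical (four made-up responses, not the actual class data); the only fact relied on is that mean() of a logical vector returns the proportion of TRUE values.

```r
# Hypothetical responses (made up for illustration, not real class data):
# TRUE means the student was born in California.
born_in_ca <- c(TRUE, TRUE, FALSE, TRUE)

# The mean of a logical vector is the proportion of TRUE values.
prop_ca <- mean(born_in_ca)
prop_ca
# [1] 0.75
```

The number describes only the four students in the vector, that is, the data on hand.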


Generalization

A numerical, graphical, or verbal description of a broader set of units than those on which data was recorded.



Example
Using data collected from students in Stat 20 (Fall 2025), the proportion of UC Berkeley students born in California is 75%.


Causal Claim

A claim that changing the value of one variable will influence the value of another variable.



Example
Data from a Randomized Controlled Experiment shows that lab scores of STAT 20 students who attend Group Tutoring sessions are 20% higher than those of students who don’t.


Prediction

A guess about the value of an unknown variable, based on other known variables.



Example
Based on STAT 20 data from the past three semesters, I predict that the median score on quiz 1 will be 80%.
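One simple way a guess like this might be formed is to take the median of past scores and use it as the guess for the new, not yet observed quiz. This is only a sketch: the vector past_scores is hypothetical (made-up numbers, not real STAT 20 scores), and using the past median is an assumed approach, not necessarily how the example above was produced.

```r
# Hypothetical quiz 1 scores (in percent) from earlier semesters,
# made up for illustration only.
past_scores <- c(72, 85, 80, 91, 78, 80, 66)

# Use the median of the known past scores as the guess
# for the unknown median of the upcoming quiz.
predicted_median <- median(past_scores)
predicted_median
# [1] 80
```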

Practice Concept Questions

Practice Concept Questions

We will now re-examine a few pathways in the data science lifecycle:

  • Forming a question -> collecting data
  • Collecting data -> making a claim

The AeroFlex Shoe

“Peak Performance” is a fictional company that sells athletic gear. It is launching a brand-new shoe, the AeroFlex, which is designed for track-and-field runners and which the company claims improves performance.

AeroFlex shoe prototype.

The AeroFlex Shoe (cont’d)

The company’s marketing team has access to data from various sources:

  • Market Research: Data from a survey of 500 local runners before the shoe’s official launch.

  • Experiment data: Results from a controlled experiment with a test group of 40 university track-and-field athletes.

  • Sales data: Figures from the first year of sales, categorized by region.

  • Customer feedback: Online reviews and ratings submitted by purchasers.

Claims

Based on the survey results of 500 local runners, the company states:

“We are confident that if all local runners were to try the AeroFlex prototype, their average satisfaction rating would be between 4.0 and 4.4 stars.”

What kind of claim is this: a summary, a prediction, a generalization, or a causal claim?

01:00

Claims

Based on the sales data and customer feedback, the company states:

“For every 0.1-star increase in the average online rating for the AeroFlex shoe, monthly sales will increase by 500 units.”

What kind of claim is this: a summary, a prediction, a generalization, or a causal claim?

01:00

Claims

Based on the survey results of 500 local runners, the company states:

“The average satisfaction score for the AeroFlex prototype was 4.2 out of 5 stars.”

What kind of claim is this: a summary, a prediction, a generalization, or a causal claim?

01:00

Break

05:00

Course Structure



  • Read lecture notes
  • Work through reading questions
  • Work through concept questions solo / in groups / as a class
  • Make progress on assignments

All of the materials and links for the course can be found at:

https://stat20.berkeley.edu/fall-2025/

Syllabus

Take 4 minutes to read through the syllabus and jot down at least one question that you have.

04:00

Ed Discussion Forum

  • Forum for asking questions, answering questions, and posting course announcements
  • Please answer each other’s questions!

Practice by asking/answering a question on the “Syllabus Discussion” thread on Ed via the link at the top right of https://stat20.berkeley.edu/fall-2025/.

Intro to R and RStudio

Intro to R and RStudio (cont’d)

We will now

  • Demo the four parts of RStudio
  • Show you how to work with a Quarto Document
  • Walk through the first few questions of Lab 1

Components of RStudio

  1. Console

  2. Environment

  3. Editor

  4. File Directory

Now we are going to switch over to RStudio to understand these 4 components a bit better.

Components of RStudio

  1. Console: Where the live R session lives. Type commands into the prompt > and press enter/return to run them. The Console is in the lower-left pane.

  2. Environment: The space that keeps track of all of the data and objects that you have created or loaded and have access to. Found in the upper right pane.

  3. Editor: Used to compose and edit text (.qmd files) and R code (.r files). Found in the upper left pane.

  4. File Directory: Used to navigate between the files and folders on your RStudio account. You can move, copy, rename, and delete them here. Found in the lower right pane.
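A small sketch of how the Console and the Environment relate. The object name my_number is hypothetical; any name would do. Running these lines at the Console prompt creates an object, and that object should then show up in the Environment pane.

```r
# Store the result of a calculation under the (hypothetical) name my_number.
my_number <- 1 + 2

# Typing the name prints the stored value.
my_number
# [1] 3

# ls() lists the names of the objects R is currently keeping track of;
# "my_number" should be among them, mirroring the Environment pane.
ls()
```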

R as a calculator

R allows all of the standard arithmetic operations.

Addition

1 + 2
[1] 3

Subtraction

1 - 2
[1] -1

Multiplication

1 * 2 
[1] 2

Division

1 / 2
[1] 0.5

R as a calculator, cont.

R allows all of the standard arithmetic operations.

Exponents

2 ^ 3
[1] 8

Parentheses for Order of Ops.

2 ^ 3 + 1
[1] 9
2 ^ (3 + 1)
[1] 16
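A few more order-of-operations cases, as a sketch in the same spirit as the exponent example above. These lines use only standard R arithmetic; the particular numbers are chosen purely for illustration.

```r
# Multiplication is evaluated before addition.
1 + 2 * 3
# [1] 7

# Parentheses force the addition to happen first.
(1 + 2) * 3
# [1] 9

# Exponentiation is evaluated before the leading minus sign.
-2 ^ 2
# [1] -4

# Parentheses make the base negative two.
(-2) ^ 2
# [1] 4
```

When in doubt, adding parentheses makes the intended order explicit.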

Your turn

What is three times one point two raised to the quantity thirteen divided by six?

01:00

Looking forward

  • This Friday 8/29 I’ll be holding OH in class to help you with Lab-1.
  • Read the lecture notes for Taxonomy of Data.
  • If you have any questions, you may leave a comment/question on the Taxonomy of Data thread on Ed.
  • Reading Questions for Taxonomy of Data are due on Gradescope by 11:59 pm on Tuesday.
  • Lab-1 and WSP-1 will also be due on Gradescope next Tuesday at 8am!

End of Lecture