Data.Frames

  • Run code, learn more
  • ~45 Minutes
  • Followed By: Good Bye
    • Or Lunch
Photo of the Titanic

Learning Objectives

  • What is a data.frame?
  • Creation
  • Visualization
  • Import CSV data
  • Indexing
  • Deletion
  • Further Learning
Drawing of the Titanic gymnasium

What is a Data.Frame

  • 2-dimensional object: rows and columns
    • Normal people call this a table
  • All entries in a column have the same class
  • Rows can have entries of different classes
  • Each column must be the same length (missing data represented as NA)
  • A single data.frame is easier to manage than a collection of vectors
  • Relationship between the entries is made explicit
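
You can check these rules yourself. This is only a sketch: it assumes a tiny, made-up data.frame called demo, not the workshop data.

```r
## A tiny, made-up data.frame (hypothetical values).
demo <- data.frame(age      = c(22, 38, NA),
                   sex      = c("male", "female", "female"),
                   survived = c(FALSE, TRUE, TRUE))

## Every column has exactly one class.
sapply(demo, class)

## A single row mixes classes, so R hands it back as a one-row data.frame.
demo[1, ]

## Columns of 3 and 2 entries do not fit together: this is an error.
## try() lets the script keep running after the error message.
try(data.frame(a = 1:3, b = 1:2))
```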
Paperboy with newspaper of Titanic's sinking

Paperboy Ned Parfett, Soldier and Paperboy

Creation[1]

Works best when you copy/paste


## First we need some vectors
age <- c(22,38,26,35,35,NA,54,2,27,14)
gender <- c("male","female","female","female","male",
            "male","male","male","female","female")
survived <- c(FALSE,TRUE,TRUE,TRUE,FALSE,
              FALSE,FALSE,FALSE,TRUE,TRUE)
    

Creation[2]


## Then we put them together
passengers <- data.frame(age, gender, survived)

## The capital V is important: R is case-sensitive.
View(passengers)
    

   age gender survived
1   22   male    FALSE
2   38 female     TRUE
3   26 female     TRUE
4   35 female     TRUE
5   35   male    FALSE
6   NA   male    FALSE
7   54   male    FALSE
8    2   male    FALSE
9   27 female     TRUE
10  14 female     TRUE

## Those numbers on the left: row numbers.
    

Memory Utilization In R

  • Statistical tools such as SAS & SPSS can manage datasets larger than total system RAM
  • Base R cannot do this
  • Once upon a time - this was a practical concern
  • Today - RAM is cheap
    • Other Options: RDBMS tools, Spark, and Hadoop
  • Few useRs experience this limitation
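
Curious how much RAM an object uses? object.size() will tell you. A small sketch, assuming a made-up vector named big_vector:

```r
## One million numbers (made up for illustration).
big_vector <- numeric(1e6)

## Size in bytes. Expect roughly 8 million: 8 bytes per number.
object.size(big_vector)

## The same size, printed in megabytes.
print(object.size(big_vector), units = "Mb")
```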

Photo of Captain Smith

Your Turn!


## What does this tell us?
dim(passengers)
    

[1] 10  3

## 10 == Number of Rows
##  3 == Number of Columns
## This is ALWAYS the order!
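
If you only want one of the two numbers, nrow() and ncol() give them separately. A sketch on a made-up data.frame named demo:

```r
## Four rows, two columns (hypothetical values).
demo <- data.frame(x = 1:4, y = c("a", "b", "c", "d"))

dim(demo)     # [1] 4 2   rows first, then columns
nrow(demo)    # [1] 4
ncol(demo)    # [1] 2
dim(demo)[1]  # [1] 4     same as nrow(demo)
```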
    

Import CSV data

  • R can import data from CSV, Excel, databases, etc.

## Import data from CSV.
## This WILL overwrite the data.frame we made a few minutes ago!
url <- "http://choens.github.io/titanic/public/data/passengers.csv"
passengers <- read.csv(url)
head(passengers)
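
read.csv() works the same way on a local file, which helps if the URL is unreachable. A minimal sketch: the file contents and the name demo are made up, and the file lives in a temporary location.

```r
## Write a tiny, made-up CSV to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,age,survived",
             "Ann,22,1",
             "Bob,NA,0"), tmp)

## Read it back. The text NA in the file becomes a missing value.
demo <- read.csv(tmp, stringsAsFactors = FALSE)
str(demo)

## Remove the temporary file.
unlink(tmp)
```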
    

Your Turn!


## Spend some quality time with the following commands.
## ? is a great way to read the documentation, e.g. ?dim
dim(passengers)
names(passengers)
summary(passengers)
str(passengers)
View(passengers)
head(passengers)
tail(passengers)
    

Let us know if you need help!

Reuse What You Know!

  • Functions adapt to the data type (Object Oriented)
  • Both of the following code segments work

head(survived)
            

[1] FALSE  TRUE  TRUE  TRUE FALSE FALSE
            

head(passengers)
            

  passengerid survived pclass                                                 name    sex age embarked 
1           1        0      3                              Braund, Mr. Owen Harris   male  22        S 
2           2        1      1  Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38        C 
3           3        1      3                               Heikkinen, Miss. Laina female  26        S 
4           4        1      1         Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35        S 
5           5        0      3                             Allen, Mr. William Henry   male  35        S 
6           6        0      3                                     Moran, Mr. James   male  NA        Q 
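
head() is not the only function that adapts. A sketch with summary(), using a made-up vector v and data.frame d:

```r
## A logical vector and a data.frame that contains it (hypothetical values).
v <- c(FALSE, TRUE, TRUE)
d <- data.frame(v = v, n = c(1.5, 2.5, 3.5))

## Two different classes . . .
class(v)
class(d)

## . . . one function name. R picks the right behavior for each.
summary(v)
summary(d)
```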
            

Indexing

  • Back to the land of square brackets
  • Critical: passengers[ROW,COLUMN]

## Returns one row by row #.
passengers[10, ]
            

   passengerid survived pclass                                name    sex age embarked
10          10        1      2 Nasser, Mrs. Nicholas (Adele Achem) female  14        C
            

## Returns one column by column name.
passengers[ ,"age"]
            

  [1] 22.00 38.00 26.00 35.00 35.00    NA 54.00  2.00 27.00 14.00  4.00 58.00
 [13] 20.00 39.00 14.00 55.00  2.00    NA 31.00    NA 35.00 34.00 15.00 28.00
 [25]  8.00 38.00    NA 19.00    NA    NA 40.00    NA    NA 66.00 28.00 42.00
 [37]    NA 21.00 18.00 14.00 40.00 27.00    NA  3.00 19.00    NA    NA    NA
 [49]    NA 18.00  7.00 21.00 49.00 29.00 65.00    NA 21.00 28.50  5.00 11.00
 [61] 22.00 38.00 45.00  4.00    NA    NA 29.00 19.00 17.00 26.00 32.00 16.00
 [73] 21.00 26.00 32.00 25.00    NA    NA  0.83 30.00 22.00 29.00    NA 28.00
 [85] 17.00 33.00 16.00    NA 23.00 24.00 29.00 20.00 46.00 26.00 59.00    NA
 [97] 71.00 23.00 34.00 34.00
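
The same [ROW,COLUMN] pattern covers several rows, several columns, or a single cell. A sketch on a made-up data.frame named demo:

```r
## Five made-up passengers.
demo <- data.frame(id  = 1:5,
                   age = c(22, 38, NA, 35, 2),
                   sex = c("male", "female", "female", "female", "male"))

demo[2, ]                     # one row
demo[c(1, 3), ]               # several rows
demo[ , c("id", "age")]       # several columns
demo[2, "age"]                # one cell: 38
demo[ , "age", drop = FALSE]  # one column, kept as a data.frame
```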
            

Just Give Me A Vector!


## There is an easier way to refer to a single column
## A single column IS a Vector.
passengers$age
    

passengers$age
  [1] 22.00 38.00 26.00 35.00 35.00    NA 54.00  2.00 27.00 14.00  4.00 58.00
 [13] 20.00 39.00 14.00 55.00  2.00    NA 31.00    NA 35.00 34.00 15.00 28.00
 [25]  8.00 38.00    NA 19.00    NA    NA 40.00    NA    NA 66.00 28.00 42.00
 [37]    NA 21.00 18.00 14.00 40.00 27.00    NA  3.00 19.00    NA    NA    NA
 [49]    NA 18.00  7.00 21.00 49.00 29.00 65.00    NA 21.00 28.50  5.00 11.00
 [61] 22.00 38.00 45.00  4.00    NA    NA 29.00 19.00 17.00 26.00 32.00 16.00
 [73] 21.00 26.00 32.00 25.00    NA    NA  0.83 30.00 22.00 29.00    NA 28.00
 [85] 17.00 33.00 16.00    NA 23.00 24.00 29.00 20.00 46.00 26.00 59.00    NA
 [97] 71.00 23.00 34.00 34.00
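
Because the column is a vector, every vector function works on it. A sketch with made-up ages in a data.frame named demo:

```r
## Four made-up ages, one of them missing.
demo <- data.frame(age = c(22, 38, NA, 35))

## $ and [[ ]] return the same vector.
identical(demo$age, demo[["age"]])  # TRUE

## One NA is enough to make the mean NA . . .
mean(demo$age)                      # NA

## . . . unless we ask R to skip missing values.
mean(demo$age, na.rm = TRUE)        # 31.66667
```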
    

The Real Power Is Boolean

  • A Boolean test on a vector returns a vector of TRUE/FALSE/NA the same length as the vector
  • We can use this vector as a filter
  • Let's do this together!

## This should tell us that passengers has 250 rows.
dim(passengers)

my_filter <- passengers$survived == 1
these_passengers_lived <- passengers[my_filter,]

## This should be less than 250.
dim(these_passengers_lived)
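
It helps to look at the filter itself before using it. A sketch on a made-up data.frame named demo:

```r
## Four made-up passengers.
demo <- data.frame(name     = c("A", "B", "C", "D"),
                   survived = c(1, 0, 1, 0))

## The test returns one TRUE/FALSE per row.
my_filter <- demo$survived == 1
my_filter                         # TRUE FALSE TRUE FALSE
length(my_filter) == nrow(demo)   # TRUE

## TRUE counts as 1, so sum() counts the rows that pass.
sum(my_filter)                    # 2

## Only the TRUE rows are kept.
demo[my_filter, ]
```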
    

Your Turn!


## Using passengers, create a new data.frame of "children".
## Define children as passengers where age < 18.
    

Answer on the next slide!

Your Turn!


## Define children as passengers where age < 18.
## Unlike last time - I did this as one step.
children <- passengers[passengers$age < 18, ]
summary(children)
    

  passengerid       survived       pclass                                      name        sex          age         embarked 
 Min.   : 8.00   Min.   :0.0   Min.   :2.000   Andersson, Miss. Erna Alexandra   : 1   female:11   Min.   : 0.830       : 0  
 1st Qu.:18.50   1st Qu.:0.0   1st Qu.:2.250   Caldwell, Master. Alden Gates     : 1   male  : 7   1st Qu.: 4.000   C   : 3  
 Median :47.50   Median :0.5   Median :3.000   Ford, Mr. William Neal            : 1   NA's  :22   Median : 9.500   Q   : 2  
 Mean   :45.50   Mean   :0.5   Mean   :2.722   Goodwin, Master. William Frederick: 1               Mean   : 9.435   S   :13  
 3rd Qu.:67.75   3rd Qu.:1.0   3rd Qu.:3.000   Goodwin, Miss. Lillian Amy        : 1               3rd Qu.:14.750   NA's:22  
 Max.   :87.00   Max.   :1.0   Max.   :3.000   (Other)                           :13               Max.   :17.000            
 NA's   :22      NA's   :22    NA's   :22      NA's                              :22               NA's   :22                
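
Notice the 22 NA's in every column. When age is missing, the test age < 18 is NA, and R keeps an all-NA row for it. A sketch of two ways to drop those rows, using a made-up data.frame named demo:

```r
## Four made-up passengers, one with a missing age.
demo <- data.frame(name = c("A", "B", "C", "D"),
                   age  = c(22, NA, 14, 2))

## The plain filter keeps an extra all-NA row.
with_na <- demo[demo$age < 18, ]
nrow(with_na)   # 3: two children plus one NA row

## which() keeps only the positions where the test is TRUE.
no_na <- demo[which(demo$age < 18), ]
nrow(no_na)     # 2

## subset() also drops rows where the test is NA.
subset(demo, age < 18)
```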
    

Your Turn!


## What does ls() do?
ls()
    

Crosstabs!

Very useful when working with categorical data.


## This command works with ONE vector . . . 
table(passengers$sex)

## Or Two
table(passengers$sex, passengers$survived)

## Or More! This one is a little weird.
table(passengers$sex, passengers$survived, passengers$pclass)

## This makes it more readable. A function inside a function!
ftable(table(passengers$sex, passengers$survived, passengers$pclass))
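
A crosstab is easier to read when the dimensions are labeled. A sketch on a made-up data.frame named demo; naming the arguments to table() is what adds the labels.

```r
## Five made-up passengers.
demo <- data.frame(sex      = c("male", "female", "female", "male", "female"),
                   survived = c(0, 1, 1, 0, 0))

## Named arguments become the row and column labels.
tbl <- table(sex = demo$sex, survived = demo$survived)
tbl

## A crosstab can be indexed like a data.frame: [ROW,COLUMN].
tbl["female", "1"]   # 2

## By default table() hides missing values. This shows them when present.
table(demo$sex, useNA = "ifany")
```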
    

Your Turn!

Were men and women equally likely to survive the Titanic?


## chisq.test() is the command to run the Chi-squared test.
## Don't worry about calculating the percents.
    

Answer on the next slide!

Your Turn!


chisq.test( passengers$sex, passengers$survived )
    
    
	Pearson's Chi-squared test with Yates' continuity correction

data:  passengers$sex and passengers$survived
X-squared = 36.585, df = 1, p-value = 1.461e-09
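
The result of chisq.test() can be saved and picked apart. A sketch using made-up counts (not the workshop data) stored in a matrix named counts:

```r
## Made-up 2 x 2 counts: rows are sex, columns are survived.
counts <- matrix(c(30, 10,
                   15, 45),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex      = c("female", "male"),
                                 survived = c("0", "1")))

## chisq.test() also accepts a table or matrix of counts.
result <- chisq.test(counts)

result$p.value          # the p-value on its own
result$expected         # counts expected if sex and survival were unrelated
result$p.value < 0.05   # TRUE for these made-up counts
```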
    

Tabular Margins and Percents

In these commands, 1 ALWAYS means "by row" and 2 ALWAYS means "by column".

Margins

## Margins are what other tools call Totals.
tbl <- table(passengers$sex, passengers$survived)
mbl <- margin.table( tbl, margin=1)
mbl
            


female   male 
   107    143 
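
Changing margin to 2 gives the column totals instead, and addmargins() attaches both. A sketch using made-up counts (not the workshop data):

```r
## Made-up 2 x 2 counts: rows are sex, columns are survived.
tbl <- as.table(matrix(c(30, 10,
                         15, 45),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(sex      = c("female", "male"),
                                       survived = c("0", "1"))))

margin.table(tbl, margin = 1)  # row totals: female 40, male 60
margin.table(tbl, margin = 2)  # column totals: 0 is 45, 1 is 55
addmargins(tbl)                # the table with both sets of totals attached
```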
            
Percents

## At least they didn't invent a new term for proportion.
tbl <- table(passengers$sex, passengers$survived)
pbl <- prop.table( tbl, margin=1)
pbl
            

               0         1
  female 0.2803738 0.7196262
  male   0.7692308 0.2307692
            

You _can_ do this without the intermediaries!
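
A sketch of the one-step version, on a made-up data.frame named demo. The table() call goes straight inside prop.table().

```r
## Six made-up passengers.
demo <- data.frame(sex      = c("male", "female", "female", "male", "female", "male"),
                   survived = c(0, 1, 1, 0, 0, 1))

## No tbl in between: table() is the first argument of prop.table().
pbl <- prop.table(table(demo$sex, demo$survived), margin = 1)
pbl

## With margin = 1, each row adds up to 1.
rowSums(pbl)
```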

Your Turn!

Can you figure out how to round the proportions?


## Try the round command!
    

Answer on the next slide!

Your Turn!

Can you figure out how to round the proportions?


## Try the round command!
tbl <- table(passengers$sex, passengers$survived)
pbl <- prop.table( tbl, margin=1)
round(100*pbl,1)
    

            0    1
  female 28.0 72.0
  male   76.9 23.1
    

Deletion

  • This is thankfully EASY

## Always clean up after yourself.
rm(children)
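
Want proof that it is gone? A sketch with a made-up object named demo_children:

```r
## Create a small, made-up data.frame.
demo_children <- data.frame(age = c(2, 14))
exists("demo_children")   # TRUE

## Delete it, then check again.
rm(demo_children)
exists("demo_children")   # FALSE
```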
    

Further Learning

  • This workshop is just . . . the tip of the iceberg.
    Sorry. I had to. I really had to.
  • Recommended R Resource: Quick-R

That's All Folks!

Photo of John Jacob Astor in 1919

John Jacob Astor IV in 1919

Before you leave, please complete the workshop survey.