Project List: HW 5: UFO Search!

Peter Sullivan

Data Description:

This Data is published by the NUFORC, the National UFO Reporting Center. This data set was published roughly one year ago. When looking at the UFO data set, there are 88875 observations, and 11 variables. This data set was posted on kagle.com.

Here are the column names: datetime, city, state, country, shape, duration (seconds), duration (hours/min), comments, date posted, latitude, longitude

Lets look into the data. When I read in the data set using read.csv, I used na.strings = "" to replace any blanks with NA. The first step thing I want to konw is how many blanks for each variable. To find this out I used the sapply, function, sum, and is.na to sum the number of NA’s in the data. I also plan to str() function, to see the data types of the variables. Letosttes see our results:

Method

sapply(UFO_data, function(Count) sum(is.na(Count)))

            datetime                 city                state 
                   0                  196                 7519 
             country                shape   duration (seconds) 
               12561                 3118                    5 
duration (hours/min)             comments          date posted 
                3101                  126                    0 
            latitude            longitude 
                   0                    0

US_data <- UFO_data %>%
  filter(country == "us") 


US_data %>% 
  select(state) %>%
table() %>%
  sort(decreasing =TRUE)

.
  ca   wa   fl   tx   ny   il   az   pa   oh   mi   or   nc   mo   co 
9575 4292 4155 3742 3234 2698 2617 2520 2464 1980 1882 1863 1569 1521 
  in   va   ga   nj   ma   wi   tn   mn   sc   ct   ky   md   nv   nm 
1404 1381 1359 1352 1341 1309 1228 1096 1078  971  931  890  858  794 
  ok   ia   al   ut   ks   ar   me   la   id   mt   nh   wv   ne   ms 
 779  735  706  659  652  642  599  597  566  531  517  493  415  414 
  ak   vt   hi   ri   sd   wy   de   nd   pr   dc 
 341  283  276  244  196  191  180  140   26    7

str(US_data)

spec_tbl_df [70,293 x 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ datetime            : chr [1:70293] "10/10/1949 20:30" "10/10/1956 21:00" "10/10/1960 20:00" "10/10/1961 19:00" ...
 $ city                : chr [1:70293] "san marcos" "edna" "kaneohe" "bristol" ...
 $ state               : chr [1:70293] "tx" "tx" "hi" "tn" ...
 $ country             : chr [1:70293] "us" "us" "us" "us" ...
 $ shape               : chr [1:70293] "cylinder" "circle" "light" "sphere" ...
 $ duration (seconds)  : num [1:70293] 2700 20 900 300 1200 180 120 300 180 1800 ...
 $ duration (hours/min): chr [1:70293] "45 minutes" "1/2 hour" "15 minutes" "5 minutes" ...
 $ comments            : chr [1:70293] "This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church"| __truncated__ "My older brother and twin sister were leaving the only Edna theater at about 9 PM&#44...we had our bikes and I "| __truncated__ "AS a Marine 1st Lt. flying an FJ4B fighter/attack aircraft on a solo night exercise&#44 I was at 50&#44000&#39 "| __truncated__ "My father is now 89 my brother 52 the girl with us now 51 myself 49 and the other fellow which worked with my f"| __truncated__ ...
 $ date posted         : chr [1:70293] "4/27/2004" "1/17/2004" "1/22/2004" "4/27/2007" ...
 $ latitude            : chr [1:70293] "29.8830556" "28.9783333" "21.4180556" "36.5950000" ...
 $ longitude           : num [1:70293] -97.9 -96.6 -157.8 -82.2 -73.4 ...
 - attr(*, "problems")= tibble [199 x 5] (S3: tbl_df/tbl/data.frame)
  ..$ row     : int [1:199] 877 1712 1814 2857 3733 4755 5388 5422 5613 5848 ...
  ..$ col     : chr [1:199] NA NA NA NA ...
  ..$ expected: chr [1:199] "11 columns" "11 columns" "11 columns" "11 columns" ...
  ..$ actual  : chr [1:199] "12 columns" "12 columns" "12 columns" "12 columns" ...
  ..$ file    : chr [1:199] "'complete.csv'" "'complete.csv'" "'complete.csv'" "'complete.csv'" ...
 - attr(*, "spec")=
  .. cols(
  ..   datetime = col_character(),
  ..   city = col_character(),
  ..   state = col_character(),
  ..   country = col_character(),
  ..   shape = col_character(),
  ..   `duration (seconds)` = col_double(),
  ..   `duration (hours/min)` = col_character(),
  ..   comments = col_character(),
  ..   `date posted` = col_character(),
  ..   latitude = col_character(),
  ..   longitude = col_double()
  .. )

There is a lot to look into. First, It seems like the country column has the most amount of NA’s. 12,757 out of the 89,000 data points. DateTime is consistent with 0 NA’s. I filtered the data down to US DATA, and created a table showing the count per state of UFO sighting. From the table, California has the largest amount of sightings (9575), so I am going to filter further into California.

When looking at the result of the str() function, I’ve noticed the the latitude column is a character type. This is going to cause trouble when trying to plot the location of each sighting. I’m going to use mutate to create a new column that is numeric. After I filter down to California, I will also create a table of the count per shape per sighting in California.

library(knitr)
US_data <- US_data %>%
  mutate(new_latitude = as.numeric(latitude))


California_data <- US_data %>%
  filter(state == "ca")



California_data %>%
  select(shape)%>%
  table()%>%
  sort(decreasing = TRUE)

.
    light    circle  triangle  fireball      disk     other    sphere 
     1937       915       863       732       685       663       642 
  unknown      oval formation  changing     cigar   chevron rectangle 
      613       416       312       294       208       158       155 
    flash   diamond  cylinder       egg  teardrop     cross      cone 
      153       143       130        96        77        34        32 
    flare     round 
        1         1

It Looks like in California, The shape most seen is the light and circle, with triangle fire ball and disk.

Visuals

Now, lets create some visuals. Since there are so many shapes, let’s use Count and Slice to grab and plot the top 5 shapes in California.

top_5_shapes <- California_data %>%
  count(shape, sort = TRUE) %>%
  slice(1:5)
top_5_shapes <- rename(top_5_shapes, Total_Count = n)



top_5_shapes %>%
ggplot()+
geom_col(aes(x = shape, y =Total_Count, fill = shape), show.legend = FALSE )+
  geom_text(aes(x = shape, y =Total_Count, label = Total_Count), vjust = 3)+
  theme_classic()+
  labs(title = "Sighting by Shape in California")

Since the data was already counted using the count function, I had to use geom_col instead of geom_bar to plot the visual.

Mapping

Now lets look into the positional data, specifically the latitude and longitude columns. As I mentioned before, I had to create a new column named “New_latitude” that was numeric instead of character type that was provided.

To plot this locational data, I plan to use ggmpa. To use ggmap, I had to create a free account with google cloud. I then had to request a key to the static map api. This allowed me to get the map that I would be plotting the location data onto.

#register_google(key = "Your API KEY")
#map <- get_googlemap(center = c(lon = -119.4179, lat = 36.7783),
             #       zoom = 6, scale = 2,
             #       maptype ='roadmap',
              #      color = "color")

#ggmap(map) + geom_point(data = US_data, aes(x = longitude, y = new_latitude ), size = .2) 

#map("state", ".*california", myborder = 0)  For one state.
california_map <- map("state", region = c("California", "Oregon", "Arizona", "Nevada"), col = "gray90", fill = TRUE)

head(US_data)

# A tibble: 6 x 12
  datetime city  state country shape `duration (seco~ `duration (hour~
  <chr>    <chr> <chr> <chr>   <chr>            <dbl> <chr>           
1 10/10/1~ san ~ tx    us      cyli~             2700 45 minutes      
2 10/10/1~ edna  tx    us      circ~               20 1/2 hour        
3 10/10/1~ kane~ hi    us      light              900 15 minutes      
4 10/10/1~ bris~ tn    us      sphe~              300 5 minutes       
5 10/10/1~ norw~ ct    us      disk              1200 20 minutes      
6 10/10/1~ pell~ al    us      disk               180 3  minutes      
# ... with 5 more variables: comments <chr>, date posted <chr>,
#   latitude <chr>, longitude <dbl>, new_latitude <dbl>

plot(california_map)

#ggplot(california_map, aes(long, lat))+
  geom_point(aes(US_data$longitude,US_data$new_latitude))

mapping: x = ~US_data$longitude, y = ~US_data$new_latitude 
geom_point: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity

  geom_polygon(data = california_map)

geom_polygon: na.rm = FALSE, rule = evenodd
stat_identity: na.rm = FALSE
position_identity

#US_data %>%
  #points(longitude, new_latitude, col = "red")

From a Quick glimpse, most of the sightings are from San Franciso, LA, and San Diego. Lets see exactly how many sightings there are from those specific cities. To acomplish this, I will do the same method as above, create a dataframe of the top 5 citys by count.

top_5_citys <- California_data %>%
  count(city, sort = TRUE) %>%
  slice(1:5)
top_5_citys <- rename(top_5_citys, Total_Count = n)



top_5_citys %>%
ggplot()+
geom_col(aes(x = city, y =Total_Count, fill = city), show.legend = FALSE)+
  geom_text(aes(x = city, y = Total_Count, label = Total_Count), vjust = 3)+
  theme_get()+
  labs(title = "Sighting by City in California")

Surprisingly It looks like Sacramento actually has more sightings from San Francisco. LA and San Diego Still have the most sightings by far.

Timeline

The last item I would like to look into is the datetime column. I’m curious to see if I can find out if there is a specific time of the year when more people report UFO sightings. Perhaps there’s a season where Californians will see more UFO’s than others.

From a quick glance, it looks like the datetime column is a char type, so I’ll first have to change it to a datetime. I then plan to add new columns for for month, day and year.

California_data <- California_data %>%
  mutate(new_datetime = as.Date(datetime,format = "%m/%d/%Y"))

head(select(California_data,new_datetime, datetime))

# A tibble: 6 x 2
  new_datetime datetime        
  <date>       <chr>           
1 1968-10-10   10/10/1968 13:00
2 1979-10-10   10/10/1979 22:00
3 1989-10-10   10/10/1989 00:00
4 1995-10-10   10/10/1995 22:40
5 1998-10-10   10/10/1998 02:30
6 1999-10-10   10/10/1999 00:01

The New datetime looks like it worked based off the first six rows. Lets just check the tail as well just in-case.

tail(select(California_data,new_datetime, datetime))

# A tibble: 6 x 2
  new_datetime datetime      
  <date>       <chr>         
1 2012-09-09   9/9/2012 13:00
2 2012-09-09   9/9/2012 20:00
3 2012-09-09   9/9/2012 20:30
4 2012-09-09   9/9/2012 21:00
5 2013-09-09   9/9/2013 09:51
6 2013-09-09   9/9/2013 22:00

OK. Everything looks fine. Now I am going to create a month column from the new_datetime column.

California_data <- California_data %>%
  mutate(month_column = month(new_datetime),
         year_column = year(new_datetime),
         day_column = day(new_datetime))
head(select(California_data,datetime, new_datetime, month_column, year_column, day_column))

# A tibble: 6 x 5
  datetime         new_datetime month_column year_column day_column
  <chr>            <date>              <dbl>       <dbl>      <int>
1 10/10/1968 13:00 1968-10-10             10        1968         10
2 10/10/1979 22:00 1979-10-10             10        1979         10
3 10/10/1989 00:00 1989-10-10             10        1989         10
4 10/10/1995 22:40 1995-10-10             10        1995         10
5 10/10/1998 02:30 1998-10-10             10        1998         10
6 10/10/1999 00:01 1999-10-10             10        1999         10

California_data <- California_data %>%
  mutate(month = case_when(
    month_column == 1 ~ "Jan",
    month_column == 2 ~ "Feb",
    month_column == 3 ~ "Mar",
    month_column == 4 ~ "Apr",
    month_column == 5 ~ "May",
    month_column == 6 ~ "Jun",
    month_column == 7 ~ "Jul",
    month_column == 8 ~ "Aug",
    month_column == 9 ~ "Sep",
    month_column == 10 ~ "Oct",
    month_column == 11 ~ "Nov",
    month_column == 12 ~ "Dec"
  ))


Month_data <- California_data %>%
  count(month) 

Month_data <- rename(Month_data, Sightings = n)
Month_data

# A tibble: 12 x 2
   month Sightings
 * <chr>     <int>
 1 Apr         697
 2 Aug         962
 3 Dec         708
 4 Feb         639
 5 Jan         844
 6 Jul         943
 7 Jun         910
 8 Mar         724
 9 May         647
10 Nov         875
11 Oct         804
12 Sep         822

Month_data$month <- factor(Month_data$month, levels = c("Jan", "Feb", "Mar", "Apr","May",
                                                        "Jun","Jul","Aug","Sep","Oct",
                                                        "Nov","Dec"))



Month_data%>%
  ggplot()+
  geom_col(aes(x = month, y = Sightings, fill = month), show.legend = FALSE)+  
  geom_text(aes(x = month, y =Sightings, label = Sightings), vjust = 3)+
  theme_classic()+
  labs(title = "Sighting by Month in California")

Results

Alright. There were a few steps that I had to go through in order to create the Sightings by Month visual. I first used mutate to create month, day, and year variables from the new_datetime column. Then I used the case_when and mutate to change the months from a number to a string (Aug, Sep, etc…). I then created an object using count, called month data. I did this because as of now, I am unaware how to use geom text with geom_bar. Geom_text with geom_col works fine. Since the months are now characters, when I plot them, they will be out of order. To solve this problem, I used factor to set up levels in the month column. After that, the data was ready to plot.

Based on this data set, if one was looking to see a UFO, I would recommend heading to San Diego, LA, or Sacramento during the Summer! If I was going to look deeper into this data set, I would look closer into the time of when these UFO’s are spotted.

HW 5: UFO Search!