This document focuses on UFO sightings in the United States over the last 70 years. The data is published by NUFORC, the National UFO Reporting Center, and was posted on kaggle.com roughly one year ago. The data set contains 88,875 observations and 11 variables.
Here are the column names: datetime, city, state, country, shape, duration (seconds), duration (hours/min), comments, date posted, latitude, longitude
Let's look into the data. When I read in the data set using read.csv, I used na.strings = "" to replace any blanks with NA. The first thing I want to know is how many blanks there are in each variable. To find this out, I used the sapply() function together with sum() and is.na() to count the NAs in each column. I also plan to use the str() function to see the data types of the variables. Let's see our results:
datetime city state
0 196 7519
country shape duration (seconds)
12561 3118 5
duration (hours/min) comments date posted
3101 126 0
latitude longitude
0 0
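The read-in and NA check described above can be sketched as follows; the file name complete.csv is taken from the parser output further down, so treat it as an assumption:

```r
library(tidyverse)

# Read the raw CSV, turning blank fields into NA
# (file name assumed from the parsing problems printed below)
UFO_data <- read.csv("complete.csv", na.strings = "")

# Count the NAs in each column
sapply(UFO_data, function(x) sum(is.na(x)))

# Check the data types of each variable
str(UFO_data)
```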
US_data <- UFO_data %>%
filter(country == "us")
US_data %>%
select(state) %>%
table() %>%
sort(decreasing =TRUE)
.
ca wa fl tx ny il az pa oh mi or nc mo co
9575 4292 4155 3742 3234 2698 2617 2520 2464 1980 1882 1863 1569 1521
in va ga nj ma wi tn mn sc ct ky md nv nm
1404 1381 1359 1352 1341 1309 1228 1096 1078 971 931 890 858 794
ok ia al ut ks ar me la id mt nh wv ne ms
779 735 706 659 652 642 599 597 566 531 517 493 415 414
ak vt hi ri sd wy de nd pr dc
341 283 276 244 196 191 180 140 26 7
str(US_data)
spec_tbl_df [70,293 x 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ datetime : chr [1:70293] "10/10/1949 20:30" "10/10/1956 21:00" "10/10/1960 20:00" "10/10/1961 19:00" ...
$ city : chr [1:70293] "san marcos" "edna" "kaneohe" "bristol" ...
$ state : chr [1:70293] "tx" "tx" "hi" "tn" ...
$ country : chr [1:70293] "us" "us" "us" "us" ...
$ shape : chr [1:70293] "cylinder" "circle" "light" "sphere" ...
$ duration (seconds) : num [1:70293] 2700 20 900 300 1200 180 120 300 180 1800 ...
$ duration (hours/min): chr [1:70293] "45 minutes" "1/2 hour" "15 minutes" "5 minutes" ...
$ comments : chr [1:70293] "This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church"| __truncated__ "My older brother and twin sister were leaving the only Edna theater at about 9 PM,...we had our bikes and I "| __truncated__ "AS a Marine 1st Lt. flying an FJ4B fighter/attack aircraft on a solo night exercise, I was at 50ꯠ' "| __truncated__ "My father is now 89 my brother 52 the girl with us now 51 myself 49 and the other fellow which worked with my f"| __truncated__ ...
$ date posted : chr [1:70293] "4/27/2004" "1/17/2004" "1/22/2004" "4/27/2007" ...
$ latitude : chr [1:70293] "29.8830556" "28.9783333" "21.4180556" "36.5950000" ...
$ longitude : num [1:70293] -97.9 -96.6 -157.8 -82.2 -73.4 ...
- attr(*, "problems")= tibble [199 x 5] (S3: tbl_df/tbl/data.frame)
..$ row : int [1:199] 877 1712 1814 2857 3733 4755 5388 5422 5613 5848 ...
..$ col : chr [1:199] NA NA NA NA ...
..$ expected: chr [1:199] "11 columns" "11 columns" "11 columns" "11 columns" ...
..$ actual : chr [1:199] "12 columns" "12 columns" "12 columns" "12 columns" ...
..$ file : chr [1:199] "'complete.csv'" "'complete.csv'" "'complete.csv'" "'complete.csv'" ...
- attr(*, "spec")=
.. cols(
.. datetime = col_character(),
.. city = col_character(),
.. state = col_character(),
.. country = col_character(),
.. shape = col_character(),
.. `duration (seconds)` = col_double(),
.. `duration (hours/min)` = col_character(),
.. comments = col_character(),
.. `date posted` = col_character(),
.. latitude = col_character(),
.. longitude = col_double()
.. )
There is a lot to look into. First, the country column has the most NAs: 12,561 out of roughly 89,000 data points. datetime is consistent, with 0 NAs. I filtered the data down to US data and created a table showing the count of UFO sightings per state. From the table, California has the largest number of sightings (9,575), so I am going to filter further into California.
When looking at the result of the str() function, I noticed that the latitude column is a character type. This is going to cause trouble when trying to plot the location of each sighting, so I'm going to use mutate to create a new column that is numeric. After I filter down to California, I will also create a table of the count of sightings per shape in California.
library(knitr)
US_data <- US_data %>%
mutate(new_latitude = as.numeric(latitude))
California_data <- US_data %>%
filter(state == "ca")
California_data %>%
select(shape)%>%
table()%>%
sort(decreasing = TRUE)
.
light circle triangle fireball disk other sphere
1937 915 863 732 685 663 642
unknown oval formation changing cigar chevron rectangle
613 416 312 294 208 158 155
flash diamond cylinder egg teardrop cross cone
153 143 130 96 77 34 32
flare round
1 1
It looks like in California, the most commonly reported shapes are light and circle, followed by triangle, fireball, and disk.
Now, let's create some visuals. Since there are so many shapes, let's use count() and slice() to grab and plot the top 5 shapes in California.
top_5_shapes <- California_data %>%
count(shape, sort = TRUE) %>%
slice(1:5)
top_5_shapes <- rename(top_5_shapes, Total_Count = n)
top_5_shapes %>%
ggplot()+
geom_col(aes(x = shape, y =Total_Count, fill = shape), show.legend = FALSE )+
geom_text(aes(x = shape, y =Total_Count, label = Total_Count), vjust = 3)+
theme_classic()+
labs(title = "Sighting by Shape in California")
Since the data was already counted using the count function, I had to use geom_col instead of geom_bar to plot the visual.
Now let's look into the positional data, specifically the latitude and longitude columns. As I mentioned before, I had to create a new column named "new_latitude" that is numeric instead of the character type that was provided.
To plot this locational data, I plan to use ggmap. To use ggmap, I had to create a free account with Google Cloud and then request a key for the Static Maps API. This allowed me to get the map that I would be plotting the location data onto.
#register_google(key = "Your API KEY")
#map <- get_googlemap(center = c(lon = -119.4179, lat = 36.7783),
# zoom = 6, scale = 2,
# maptype ='roadmap',
# color = "color")
#ggmap(map) + geom_point(data = US_data, aes(x = longitude, y = new_latitude ), size = .2)
#map("state", ".*california", myborder = 0) For one state.
california_map <- map("state", region = c("California", "Oregon", "Arizona", "Nevada"), col = "gray90", fill = TRUE)
head(US_data)
# A tibble: 6 x 12
datetime city state country shape `duration (seco~ `duration (hour~
<chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 10/10/1~ san ~ tx us cyli~ 2700 45 minutes
2 10/10/1~ edna tx us circ~ 20 1/2 hour
3 10/10/1~ kane~ hi us light 900 15 minutes
4 10/10/1~ bris~ tn us sphe~ 300 5 minutes
5 10/10/1~ norw~ ct us disk 1200 20 minutes
6 10/10/1~ pell~ al us disk 180 3 minutes
# ... with 5 more variables: comments <chr>, date posted <chr>,
# latitude <chr>, longitude <dbl>, new_latitude <dbl>
plot(california_map)
# A working ggplot version: map_data() returns the state outlines as a
# data frame that geom_polygon can draw, with the sightings overlaid
state_outlines <- map_data("state",
                           region = c("california", "oregon", "arizona", "nevada"))
ggplot() +
  geom_polygon(data = state_outlines, aes(long, lat, group = group),
               fill = "gray90", color = "black") +
  geom_point(data = US_data, aes(longitude, new_latitude),
             col = "red", size = .2)
From a quick glimpse, most of the sightings are from San Francisco, LA, and San Diego. Let's see exactly how many sightings there are from those specific cities. To accomplish this, I will use the same method as above and create a data frame of the top 5 cities by count.
top_5_citys <- California_data %>%
count(city, sort = TRUE) %>%
slice(1:5)
top_5_citys <- rename(top_5_citys, Total_Count = n)
top_5_citys %>%
ggplot()+
geom_col(aes(x = city, y =Total_Count, fill = city), show.legend = FALSE)+
geom_text(aes(x = city, y = Total_Count, label = Total_Count), vjust = 3)+
theme_get()+
labs(title = "Sighting by City in California")
Surprisingly, it looks like Sacramento actually has more sightings than San Francisco. LA and San Diego still have the most sightings by far.
The last item I would like to look into is the datetime column. I'm curious to see whether there is a specific time of the year when more people report UFO sightings. Perhaps there's a season when Californians see more UFOs than others.
From a quick glance, the datetime column is a character type, so I'll first have to change it to a date. I then plan to add new columns for month, day, and year.
California_data <- California_data %>%
mutate(new_datetime = as.Date(datetime,format = "%m/%d/%Y"))
head(select(California_data,new_datetime, datetime))
# A tibble: 6 x 2
new_datetime datetime
<date> <chr>
1 1968-10-10 10/10/1968 13:00
2 1979-10-10 10/10/1979 22:00
3 1989-10-10 10/10/1989 00:00
4 1995-10-10 10/10/1995 22:40
5 1998-10-10 10/10/1998 02:30
6 1999-10-10 10/10/1999 00:01
The new datetime column looks like it worked, based on the first six rows. Let's check the tail as well, just in case.
# A tibble: 6 x 2
new_datetime datetime
<date> <chr>
1 2012-09-09 9/9/2012 13:00
2 2012-09-09 9/9/2012 20:00
3 2012-09-09 9/9/2012 20:30
4 2012-09-09 9/9/2012 21:00
5 2013-09-09 9/9/2013 09:51
6 2013-09-09 9/9/2013 22:00
OK. Everything looks fine. Now I am going to create a month column from the new_datetime column.
California_data <- California_data %>%
mutate(month_column = month(new_datetime),
year_column = year(new_datetime),
day_column = day(new_datetime))
head(select(California_data,datetime, new_datetime, month_column, year_column, day_column))
# A tibble: 6 x 5
datetime new_datetime month_column year_column day_column
<chr> <date> <dbl> <dbl> <int>
1 10/10/1968 13:00 1968-10-10 10 1968 10
2 10/10/1979 22:00 1979-10-10 10 1979 10
3 10/10/1989 00:00 1989-10-10 10 1989 10
4 10/10/1995 22:40 1995-10-10 10 1995 10
5 10/10/1998 02:30 1998-10-10 10 1998 10
6 10/10/1999 00:01 1999-10-10 10 1999 10
California_data <- California_data %>%
mutate(month = case_when(
month_column == 1 ~ "Jan",
month_column == 2 ~ "Feb",
month_column == 3 ~ "Mar",
month_column == 4 ~ "Apr",
month_column == 5 ~ "May",
month_column == 6 ~ "Jun",
month_column == 7 ~ "Jul",
month_column == 8 ~ "Aug",
month_column == 9 ~ "Sep",
month_column == 10 ~ "Oct",
month_column == 11 ~ "Nov",
month_column == 12 ~ "Dec"
))
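As an aside, lubridate can produce abbreviated month labels directly as an ordered factor, which would cover both this case_when() and the later factor() call in one step; a minimal sketch:

```r
library(dplyr)
library(lubridate)

# label = TRUE returns an ordered factor (Jan < Feb < ... < Dec),
# so the months already sort in calendar order when plotted
California_data <- California_data %>%
  mutate(month = month(new_datetime, label = TRUE, abbr = TRUE))
```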
Month_data <- California_data %>%
count(month)
Month_data <- rename(Month_data, Sightings = n)
Month_data
# A tibble: 12 x 2
month Sightings
* <chr> <int>
1 Apr 697
2 Aug 962
3 Dec 708
4 Feb 639
5 Jan 844
6 Jul 943
7 Jun 910
8 Mar 724
9 May 647
10 Nov 875
11 Oct 804
12 Sep 822
Month_data$month <- factor(Month_data$month, levels = c("Jan", "Feb", "Mar", "Apr","May",
"Jun","Jul","Aug","Sep","Oct",
"Nov","Dec"))
Month_data%>%
ggplot()+
geom_col(aes(x = month, y = Sightings, fill = month), show.legend = FALSE)+
geom_text(aes(x = month, y =Sightings, label = Sightings), vjust = 3)+
theme_classic()+
labs(title = "Sighting by Month in California")
Alright. There were a few steps I had to go through to create the Sightings by Month visual. I first used mutate to create month, day, and year variables from the new_datetime column. Then I used case_when and mutate to change the months from a number to a string (Aug, Sep, etc.). I then created an object called Month_data using count. I did this because, as of now, I am unaware how to use geom_text with geom_bar; geom_text with geom_col works fine. Since the months are now characters, they would plot out of order, so I used factor to set up levels in the month column. After that, the data was ready to plot.
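For completeness, geom_text can be paired with geom_bar by pointing both layers at the same count statistic; a sketch of that approach, assuming a recent ggplot2 with after_stat():

```r
# geom_bar computes the counts itself; geom_text reuses the same
# stat so its labels match the bar heights
California_data %>%
  ggplot(aes(x = shape)) +
  geom_bar(aes(fill = shape), show.legend = FALSE) +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  labs(title = "Sightings by Shape in California")
```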
Based on this data set, if one were looking to see a UFO, I would recommend heading to San Diego, LA, or Sacramento during the summer! If I were going to look deeper into this data set, I would look more closely at the time when these UFOs are spotted.
For attribution, please cite this work as
Sullivan (2022, Feb. 12). Project List: HW 5: UFO Search!. Retrieved from https://pjsulliv34.github.io/Blog/posts/hw-5-ufo-search/
BibTeX citation
@misc{sullivan2022hw,
  author = {Sullivan, Peter},
  title = {Project List: HW 5: UFO Search!},
  url = {https://pjsulliv34.github.io/Blog/posts/hw-5-ufo-search/},
  year = {2022}
}