## Potential Data Set

The write-up should be more than one page but less than two pages. Think of it as an executive summary of the data, or a news brief, that you could show to a potential employer.

Data often comes unstructured, so there is a bit of work to do before you can readily apply the concepts learned in this course. To receive credit for the project you will:

1) Identify a potential data set of interest. This can be on any topic you wish. There must be

a) at least fifty observations in the data set.

b) at least five variables.

c) two or more categorical variables. It is recommended that at least one of them takes only a few values.

d) three or more quantitative variables.

I recommend restricting the data to only the observations and variables that are necessary. You can create a smaller data set from a larger one, in terms of variables or observations, using the subset() command in R.
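As a minimal sketch of that step, the code below uses R's built-in `iris` data purely as a stand-in for your own data set. The row condition and the variables kept are arbitrary choices for illustration.

```r
# Stand-in data: the built-in iris data set (150 rows, 5 variables).
data(iris)

# Keep only two species (rows) and three of the five variables (columns).
small <- subset(iris,
                Species != "setosa",
                select = c(Sepal.Length, Petal.Length, Species))

# subset() keeps unused factor levels; droplevels() removes them.
small <- droplevels(small)

dim(small)            # rows and columns that remain
levels(small$Species) # categories that remain
```

The first argument after the data frame filters observations, and `select` picks variables, so one call can shrink the data in both directions.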

*Resources to be Used*

There are many good resources to find data online:

• Google’s data set search: https://toolbox.google.com/datasetsearch

• Data Is Plural structured archive, a list of interesting data sets: https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit

• Bureau of Labor Statistics, prices, unemployment: https://www.bls.gov/data/

• Five Thirty Eight project data: https://github.com/fivethirtyeight/data

• Energy Information Administration: https://www.eia.gov/

• Census data at IPUMS: https://www.ipums.org/

• Economics data at the Federal Reserve: https://fred.stlouisfed.org/

• Economic history data: http://eh.net/databases/

• Bureau of Economic Analysis: https://www.bea.gov/data

• Agricultural data at USDA: https://www.ers.usda.gov/data-products/

• If you have a specific idea in mind, let me know and I can help you find a good dataset.

**What to Do**

2) Load the cleaned, final data set in R and summarize the data by doing the following:

a) Describe the number of observations, the number of variables, the units (feet, dollars, index) for each variable, and the unit of observation (year, month, state by year, etc.).

b) Do you see multiple observations over time, or are all the observations from the same point in time (a cross section)?

c) Make one comment on how you could improve on the structure of the data. This could be the inclusion of a new variable, a variable calculated from other variables, or a different unit of observation.
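A rough sketch of how part 2 might start in R is below. It assumes the built-in `iris` data as a stand-in for your own data set, and the derived variable `Sepal.Ratio` is a hypothetical example of the kind of improvement part c) asks about.

```r
# Stand-in data: built-in iris, copied so the original is left untouched.
data(iris)
flowers <- iris

# 2a) Number of observations and variables, and the type of each variable.
n_obs     <- nrow(flowers)
n_vars    <- ncol(flowers)
var_types <- sapply(flowers, class)
str(flowers)

# Units are not stored in the data frame. Look them up in the source
# documentation (for this stand-in, see ?iris).

# 2c) One possible structural improvement: a variable calculated
# from two existing variables (hypothetical example).
flowers$Sepal.Ratio <- flowers$Sepal.Length / flowers$Sepal.Width
head(flowers)
```

R can report counts and types, but the units and the unit of observation still have to come from reading the documentation of the data source.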

3) Create summary statistics, and interpret them, for two continuous variables. Do the summary statistics tell you anything interesting that you might not have known before?
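One way to produce the summary statistics is sketched below, again assuming the built-in `iris` data and two of its continuous variables as stand-ins for your own.

```r
# Stand-in data: two continuous variables from the built-in iris data.
data(iris)
vars <- iris[, c("Sepal.Length", "Petal.Length")]

# Quick five-number summary plus the mean for each variable.
summary(vars)

# A custom table that also includes the standard deviation.
stats <- sapply(vars, function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x),
    min = min(x), max = max(x))
})
round(stats, 2)
```

Comparing the mean to the median, and the standard deviation to the range, is often a good starting point for the interpretation.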

4) For each of the same two variables from part 3, construct a 95% confidence interval for its mean.
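The sketch below assumes the interval is for the mean and is based on the t distribution. It uses one variable from the built-in `iris` data as a stand-in, and computes the interval both by hand and with `t.test()` so you can check one against the other.

```r
# Stand-in data: one continuous variable from the built-in iris data.
data(iris)
x <- iris$Sepal.Length

# By hand: mean +/- critical t value * standard error.
n      <- length(x)
se     <- sd(x) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)   # 95% two-sided interval
ci_manual <- mean(x) + c(-1, 1) * t_crit * se

# Built in: t.test() reports the same interval.
ci_builtin <- t.test(x, conf.level = 0.95)$conf.int

ci_manual
ci_builtin
```

Repeat the same steps for the second variable.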

5) For one categorical variable, construct a pie chart or a bar graph that summarizes the distribution of the variable. If the variable has many values, you will likely need to create a new variable with all but 4 or 5 of the values coded as “other”.
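A minimal sketch of the recoding and the bar graph follows. It assumes the `carb` variable from the built-in `mtcars` data as a stand-in for a categorical variable with several values. Note that `mtcars` has fewer than fifty observations, so it is only an illustration and would not qualify for the project.

```r
# Stand-in data: number of carburetors in the built-in mtcars data,
# treated here as a categorical variable.
data(mtcars)
carb <- as.character(mtcars$carb)

# Find the four most common values.
counts <- sort(table(carb), decreasing = TRUE)
keep   <- names(counts)[1:4]

# New variable: keep the common values, code everything else as "other".
carb_grouped   <- ifelse(carb %in% keep, carb, "other")
grouped_counts <- table(carb_grouped)
grouped_counts

# Bar graph of the recoded variable.
barplot(grouped_counts,
        xlab = "Number of carburetors",
        ylab = "Number of cars")
```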

6) Create a visualization with the data set that goes beyond a histogram or a pie chart. I suggest using the R package ggplot2, https://ggplot2.tidyverse.org/, although that isn’t necessary. Ideally you would be able to show the relationship between three variables in the data set using a single picture.

How you do it depends in part on whether the variables are discrete, continuous, or categorical. Because pictures are two-dimensional, you will need to incorporate color, size, or symbol (like a dash or dot) to characterize all three variables together.
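As one possible approach, the sketch below puts two continuous variables on the axes and maps a categorical variable to color. It assumes the ggplot2 package is installed and uses the built-in `iris` data as a stand-in for your own.

```r
# Assumes ggplot2 is installed: install.packages("ggplot2")
library(ggplot2)

# Stand-in data: built-in iris.
data(iris)

# x and y carry the two continuous variables; color carries the
# categorical variable, so three variables share one picture.
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point(size = 2) +
  labs(x = "Sepal length", y = "Petal length", color = "Species")

print(p)
```

If the third variable is continuous rather than categorical, mapping it to `size` instead of `color` inside `aes()` is a common alternative.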

7) Formulate a meaningful hypothesis for this data set. It can be whether the average of a variable is equal to a meaningful value, or whether a numerical variable differs across the values of a categorical variable. Test this hypothesis and interpret the result.
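The sketch below shows the second kind of hypothesis: whether a numerical variable differs between two groups. It assumes the built-in `mtcars` data as a stand-in (too small to qualify for the project) and uses Welch's two-sample t-test, which is the default in `t.test()`. The choice of variables is only for illustration.

```r
# Stand-in data: built-in mtcars.
# mpg = miles per gallon, am = transmission (0 = automatic, 1 = manual).
data(mtcars)

# H0: mean mpg is the same for automatic and manual cars.
# H1: mean mpg differs between the two groups.
result <- t.test(mpg ~ am, data = mtcars)
result

# Decision at the 5% significance level.
alpha     <- 0.05
reject_h0 <- result$p.value < alpha
reject_h0
```

For the first kind of hypothesis, a one-sample test such as `t.test(x, mu = value)` compares the mean of `x` to a chosen value.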
