All questions
PDF version, please explain properly.
DEPARTMENT OF ECONOMICS
ECON 4041H – RESEARCH METHODOLOGY
Winter 2023, Peterborough
Assignment #1
Due date: January 31, 2023

Instructions: You must provide your own unique solution. You may work with others, but each of you is responsible for submitting your own problem set solution. Question values are listed for each question. Submit your solution through SafeAssign. Ideally you will submit your RMarkdown file, preferably in pdf format. Blackboard won't accept html files, so if submitting an html file, first zip it and submit the zipped version. If you don't like using RMarkdown, you may submit two files: your command file, and a word processor file containing results, comments and answers to questions, as well as graphs. Please bind all output together in one document file rather than submitting separate files for each question or for each graph. Your command file will be a separate file.

For questions 1–5 use the labour force survey file lfs7797.rds. For question 6, use the 2016 Census PUMF cen16.rds.

1. Some basic data descriptions of datafile lfs7797.rds [15 marks]
   a. number of observations in the dataframe
   b. number of observations for variable cowmain (class of worker)
   c. number of missing observations for variable cowmain
   d. mean wage (hrlyearn) of workers of variable cowmain category:
      i. "Public employee"
      ii. "Private employee"
   e. mean wage (hrlyearn) of workers of variable union category:
      i. "Union + agreement"
      ii. "Agreement,no union"

2. Distribution of hrlyearn (wage rate), and uhrsmain (usual weekly hours) [15 marks]
   a. summary statistics: find mean, median, maximum, minimum, standard deviation of wage rate and weekly hours
   b. plot the densities of
      i. wage rate
      ii. log of wage rate
      iii. usual weekly hours
      iv. log of usual weekly hours

3. Generate some 2x2 tables of several variables [15 marks]
   a. first recode the variables for educational attainment: ed76to89 and educ90. The first is for years prior to 1990, and the second is for 1990 on. Recode to create one variable for both periods and call it educ
      i. ed76to89
         • "0 to 8 years" and "9-10 yrs schooling": code as "less than high school"
         • "11-13 years schooling" and "Some post secondary": "high school"
         • "Post secondary certificate of diploma": "college" (note: keep spelling error)
         • "University degree": "university"
      ii. educ90
         • "0 to 8 years" and "Some secondary": "less than high school"
         • "Grade 11 to 13,grad" and "Some post secondary": "high school"
         • "College diploma": "college"
         • "Bachelors degree" or "Graduate degree": "university"
   b. now calculate the following conditional means
      i. mean hourly earnings by sex
      ii. mean hourly earnings by educational attainment
      iii. mean weekly hours by sex
      iv. mean weekly hours by educational attainment

4. Composition of labour force by year: 1977 and 1997 [15 marks]
   a. by sex (sex)
   b. by educational attainment (use variable created in previous question)
   c. by age (use variable age_12)
   Use the variable lfsstat (labour force status) to subset the labour force. Remember from macro that the labour force is composed of those employed plus those unemployed.

5. Test the central limit theorem, as we did in our demo example. You will draw repeated samples of two variables, hrlyearn (wages) and uhrsmain (usual weekly hours worked), saving the mean value of each sample. Then compare the means, standard deviations and distribution of the three samples to the "population" statistics. Note, the data are in a dataframe, so you must either extract each variable as a vector, or make sure you set your command for a dataframe. In order to replicate results, you will need to set a seed value. The seed value determines a starting point for the random number generator. To set your seed value, take your sid, drop the leading 0, then take the sum of the next three digits and the last three. For example, if my sid is 0123456, I would calculate my seed value as 123 + 456 = 579. Then draw the random sample following the example in the Sampling Distribution exercise. [20 marks]
   a. Draw a sample of 1,000 observations of wages (hrlyearn). Save the mean value. Repeat this for 2,000 repetitions. This yields 2,000 sample means. Then repeat for 5,000 observations, and again for 10,000 observations. This will give you three sets of 2,000 means. Report the mean, standard deviation, and graph the kernel density for each of these three sets.
   b. What do you see as you increase the sample size? Compare your results (mean, standard deviation, density plot) with those of the aggregate sample.
   c. Repeat parts a. and b. above, but use the weekly hours variable (uhrsmain).

6. Use the Census 2016 PUMF (cen16.rds) to test whether the relationship between age (factor variable agegrp) and employment income (variable empin) is linear. Restrict your analysis to those in the age range from 20 to 84 years old. The variable agegrp for this range consists of 5-year age groups. Generate a numeric version of this variable and use the numeric variable rather than the factor variable where appropriate. [20 marks]
   a. generate a scatter plot with employment income on the y-axis and (the numeric version of) age on the x-axis. Use a subset of the census file including only 50,000 observations. The generated plot will otherwise take up a lot of space in your output file.
   b. generate a loess plot of employment income as a function of (the numeric version of) age. Use a subset of the census file including only 50,000 observations. This command is otherwise very slow. In specifying the loess plot command, make sure to include the option "se = FALSE", otherwise the estimation is very slow, even on the subset.
   c. Run a regression of employment income on the numeric version of age. Report the results and interpret. What do they mean?
   d. Run a regression of employment income on the original factor variable version of age.
      i. Report the results and interpret. What do they mean? Do they tell you anything about whether the relationship is linear?
      ii. Using the output from the regression above, test the significance of power terms of the age variable using the contrast() command.
      iii. Generate a plot of the predicted values of employment income for each level of the age factor variable. Interpret.
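
The recode in question 3.a can be sketched with named lookup vectors in base R. This is only one possible approach, not necessarily the one used in class. The data frame below is a tiny synthetic stand-in for lfs7797.rds (the real file is not reproduced here), and the names `map76`, `map90`, `old` and `new` are illustrative. It assumes each row has a value in only one of the two education variables.

```r
# Synthetic stand-in for lfs7797.rds (assumption: one of the two education
# variables is NA in every row, depending on the survey year).
lfs <- data.frame(
  ed76to89 = c("0 to 8 years", "University degree", NA, NA,
               "Post secondary certificate of diploma"),
  educ90   = c(NA, NA, "Some secondary", "Graduate degree", NA),
  hrlyearn = c(8.5, 21.0, 9.0, 30.0, 15.0),
  stringsAsFactors = FALSE
)

# Lookup vectors: names are the original labels, values the new categories.
map76 <- c(
  "0 to 8 years"                          = "less than high school",
  "9-10 yrs schooling"                    = "less than high school",
  "11-13 years schooling"                 = "high school",
  "Some post secondary"                   = "high school",
  "Post secondary certificate of diploma" = "college",
  "University degree"                     = "university"
)
map90 <- c(
  "0 to 8 years"        = "less than high school",
  "Some secondary"      = "less than high school",
  "Grade 11 to 13,grad" = "high school",
  "Some post secondary" = "high school",
  "College diploma"     = "college",
  "Bachelors degree"    = "university",
  "Graduate degree"     = "university"
)

# as.character() guards against the columns being factors in the real file.
old <- unname(map76[as.character(lfs$ed76to89)])
new <- unname(map90[as.character(lfs$educ90)])

# Take the pre-1990 value when present, otherwise the 1990-on value.
lfs$educ <- factor(
  ifelse(is.na(old), new, old),
  levels = c("less than high school", "high school", "college", "university")
)

# Conditional mean of hourly earnings by educational attainment (3.b.ii).
mean_wage_by_educ <- tapply(lfs$hrlyearn, lfs$educ, mean, na.rm = TRUE)
print(mean_wage_by_educ)
```

The same `tapply()` call with `lfs$sex` or `lfs$uhrsmain` swapped in would cover the other conditional means in 3.b, assuming those columns are present in the real file.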
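
For question 5, the seed rule and the repeated sampling can be sketched as follows. This is a minimal illustration under stated assumptions. The `wage` vector is simulated (a right-skewed stand-in for hrlyearn, not the survey data), `sample_means()` is a hypothetical helper, and sampling with replacement is an assumption, so follow the Sampling Distribution exercise if it does otherwise.

```r
# Seed rule from the assignment: drop the leading 0, then add the next
# three digits to the last three.
sid  <- "0123456"   # the example sid given in the text
seed <- as.integer(substr(sid, 2, 4)) + as.integer(substr(sid, 5, 7))
set.seed(seed)      # 123 + 456 = 579

# Simulated, right-skewed "population" standing in for lfs7797$hrlyearn.
wage <- rlnorm(100000, meanlog = 2.5, sdlog = 0.5)

# Hypothetical helper: draw `reps` samples of size n, keep each sample mean.
sample_means <- function(x, n, reps = 2000) {
  x <- x[!is.na(x)]   # the real variable may contain missing values
  replicate(reps, mean(sample(x, size = n, replace = TRUE)))
}

sizes <- c(1000L, 5000L, 10000L)
means <- lapply(sizes, function(n) sample_means(wage, n))
names(means) <- sizes

# Mean and standard deviation of each set of 2,000 sample means, next to
# the standard error the central limit theorem predicts.
results <- data.frame(
  n             = sizes,
  mean_of_means = sapply(means, mean),
  sd_of_means   = sapply(means, sd),
  theory_se     = sd(wage) / sqrt(sizes)
)
print(results)
cat("population mean:", mean(wage), " population sd:", sd(wage), "\n")

# Kernel densities of the three sets of sample means on one plot.
plot(density(means[["10000"]]), xlim = range(means[["1000"]]),
     main = "Sampling distribution of the mean", xlab = "sample mean")
lines(density(means[["5000"]]), lty = 2)
lines(density(means[["1000"]]), lty = 3)
abline(v = mean(wage), col = "grey")
legend("topright", legend = paste("n =", rev(sizes)), lty = 1:3)
```

For part c, the same helper could be applied to the weekly hours vector, for example `sample_means(lfs$uhrsmain, 1000)` once the real data frame is loaded as `lfs`.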
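
For question 6, one way to build the numeric age variable and compare the two regressions is sketched below. Several things here are assumptions: the data are simulated, the agegrp labels of the form "20 to 24 years" are hypothetical (check `levels(cen$agegrp)` in cen16.rds), and the group midpoint is only one reasonable numeric coding. The nested-model F test shown is a stand-in for illustration and does not replace the contrast() step the question asks for.

```r
set.seed(1)

# Hypothetical 5-year age-group labels covering ages 20 to 84.
lower  <- seq(20, 80, by = 5)
labels <- paste(lower, "to", lower + 4, "years")

# Simulated stand-in for cen16.rds with a hump-shaped income profile.
n      <- 5000
age_lo <- sample(lower, n, replace = TRUE)
cen <- data.frame(
  agegrp = factor(paste(age_lo, "to", age_lo + 4, "years"), levels = labels),
  empin  = pmax(0, 60000 - 40 * (age_lo - 50)^2 + rnorm(n, sd = 15000))
)

# Numeric age: lower bound of the group plus 2, i.e. the group midpoint.
cen$age_num <- as.numeric(sub(" .*", "", as.character(cen$agegrp))) + 2

fit_lin <- lm(empin ~ age_num, data = cen)  # part c: one straight line
fit_fac <- lm(empin ~ agegrp, data = cen)   # part d: one mean per age group
print(summary(fit_lin)$coefficients)

# age_num is a function of agegrp, so the linear model is nested in the
# factor model; a small p-value is evidence against linearity.
lin_test <- anova(fit_lin, fit_fac)
print(lin_test)

# Predicted employment income for each level of the age factor (part d.iii).
pred <- predict(fit_fac,
                newdata = data.frame(agegrp = factor(labels, levels = labels)))
plot(lower + 2, pred, type = "b",
     xlab = "age (group midpoint)", ylab = "predicted employment income")
```

If the predicted values trace a curve rather than a line, that is the same message as the F test: a single slope on `age_num` misses part of the age profile.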