Data Management and Analysis
1.1 General guidelines
This assignment is provided as a PDF file, a lab notebook template and a Word template. Do all the coding work for this assignment in the enclosed notebook template, and write all your answers to the questions, the requested summaries and the project report in the enclosed assignment Word template. You will also work through the requested course notebooks and submit them as a separate zip file.
1.3 Submitting your assignment
For this assignment, you will be required to submit a compressed directory containing four different items:
1. The coding work required to support your answers, all inside the lab notebook template, as a sequence of markdown cells and well-commented, solved code cells following each question.
2. Your solved course notebooks in a compressed file.
3. Your answers, summaries, conclusions and research report, written in the enclosed assignment Word template.
4. Any required data sets in a separate directory named: data
Please note that all notebooks you submit must be in a solved state and must show all outputs. Your grader is not obliged to re-run your notebook cells.
Complete and submit all the following Jupyter notebooks in the form of a “solved” .rar or .zip file:
0.1 Scribble pad
2.2.0 Data file formats, file encodings
2.1 Pandas dataFrames
2.2.1 Data file formats – CSV
2.2.2 Data file formats – JSON
2.2.3 Data file formats – other
3.1 Cleaning data
3.2 Selecting and projecting, sorting and limiting
3.3 Combining data from multiple data sets
3.4 Handling missing data
4.1 Crosstabs and pivot tables
4.2 Descriptive statistics in pandas
4.3 Simple visualizations in pandas
4.5 Split-apply-combine with SQL and pandas
4.6 Introducing regular expressions
4.7 Reshaping data with pandas
— show at least three screenshots from OpenRefine
5.1 Anscombe’s Quartet – visualising data
5.2 Getting started with maps – folium
8.1 Movies dataset
9.1 SQL DDL
9.2 SQL DML
9.3 SQL views
10.7 Outer join operations
11.1 SQL set operations
11.2 SQL subqueries
14.1 Basic CRUD
14.2 Introduction to accidents
14.3 Using statistical tests
You will receive 1 mark for each completed notebook, including your own scribble pad notebook and screenshots of the OpenRefine tool, for a total of 30 marks.
Please note that:
• Partially completed notebooks will not be counted. All outputs must be shown.
• Please demonstrate your active interaction with each notebook by including your own additions and/or extensions to the code and/or your own additional comments. Use a double hash sign ‘##’ to distinguish your comments from those already provided in the notebook.
Your tutor may quiz you on the contents of the notebooks you provide.
Place all your coding work for this question in the lab notebook template and your 300-word summary in the assignment word processing template.
In this question, you will download the Higher Education Staff Statistics: UK, 2018/19 datasets and write a summary of your understanding of the purpose and contents of the datasets and your assessment of the quality of the data. To do this, you must develop code to explore the data programmatically in a notebook and provide it as part of your answer.
HESA, the Higher Education Statistics Agency, are the experts in UK higher education data, and the designated data body for England. HESA published details of staff employed at UK higher education (HE) providers as of 1 December 2018.
Download the following datasets which are available at https://data.gov.uk/dataset/452fa2dd-72e2-4de3-9e91-25be38dec27d/higher-education-staff-statistics-uk-2018-19
• Figure 6 – All staff (excluding atypical) by equality characteristics 2018/19
• Figure 5 – Staff by mode of employment, academic contract marker and sex 2018/19
• Figure 4 – All staff (excluding atypical) by academic contract marker, mode of employment and hourly paid marker 2018/19
In addition, read the definitions of the items specified in the datasets, which are provided at https://www.hesa.ac.uk/support/definitions/staff
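A first programmatic look at one of the downloaded files could follow the sketch below. It assumes that the CSV starts with a few descriptive lines above the real header row; the sample text, the column names and the helper load_table are hypothetical stand-ins, so check the actual files and adjust the number of skipped rows.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded file: two descriptive lines,
# then the header row and the data. The real HESA files may differ.
raw = io.StringIO(
    "Title: example staff table\n"
    "Source: hypothetical sample\n"
    "Academic Year,Mode of employment,Sex,Number\n"
    "2018/19,Full-time,Female,120\n"
    "2018/19,Full-time,Male,135\n"
    "2018/19,Part-time,Female,80\n"
    "2018/19,Part-time,Male,40\n"
)


def load_table(source, n_preamble_rows):
    """Read a CSV, skipping the descriptive lines above the header row."""
    return pd.read_csv(source, skiprows=n_preamble_rows)


df = load_table(raw, n_preamble_rows=2)

# Basic exploration: size, column types, first rows, category counts.
print(df.shape)
print(df.dtypes)
print(df.head())
print(df["Mode of employment"].value_counts())
```

With a real download, pass the file path instead of the in-memory sample, for example load_table("data/your_file.csv", n_preamble_rows=...), where both the path and the row count are yours to determine.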
Write a summary (about 300 words) in the word processing document that includes the following:
• The contents of the above datasets with detailed description
• The quality of the data with respect to validity, accuracy, completeness, consistency and uniformity
• Different types of dirty data in the datasets [You must use an analysis tool or Python code to estimate the amount of dirty data]
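One possible way to estimate dirty data with pandas is sketched below. The sample frame, its column names and the function dirty_data_report are hypothetical illustrations, and the checks shown (missing cells, duplicate rows, non-numeric or negative counts, inconsistently written labels) are only some of the dirty data types you might look for.

```python
import pandas as pd

# Hypothetical sample with typical problems; not real HESA values.
df = pd.DataFrame({
    "Sex": ["Female", "female ", "Male", "Male", None, "Male"],
    "Mode of employment": ["Full-time", "Full-time", "Part-time",
                           "Part-time", "Full-time", "Part-time"],
    "Number": ["120", "120", "85", "85", "n/a", "-5"],
})


def dirty_data_report(frame, numeric_col, text_col):
    """Count several kinds of dirty data in a DataFrame."""
    # Values that cannot be parsed as numbers become NaN.
    numbers = pd.to_numeric(frame[numeric_col], errors="coerce")
    # One assumed normal form for labels: trimmed, first letter capitalised.
    cleaned = frame[text_col].str.strip().str.capitalize()
    present = frame[text_col].notna()
    return {
        "rows": len(frame),
        "missing_cells": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "non_numeric": int((numbers.isna() & frame[numeric_col].notna()).sum()),
        "negative": int((numbers < 0).sum()),
        "inconsistent_labels": int((present & (frame[text_col] != cleaned)).sum()),
    }


report = dirty_data_report(df, numeric_col="Number", text_col="Sex")
for problem, count in report.items():
    print(f"{problem}: {count}")
```

Dividing each count by the number of rows gives a rough proportion of dirty data that you can quote in your summary.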
Question 3 – Project
Place all your coding work for this question in the lab notebook template and your project report in the assignment word processing template.
In answering this question, you will benefit from the experience you gained in the previous question.
In this question, you will formulate your own research question, investigate it and write a report of your findings in your Solution document. The research question should investigate the relationship between a selected independent variable and a selected dependent variable. The research question may depend on more than one dataset. Use correlation to show the relationships between your selected variables and provide visualizations. Visualizations can be produced using either folium or matplotlib.
For example, you may visualize “fixed-term contract female employees not on a zero hours contract”.
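As a rough illustration of showing a relationship between two variables, the sketch below computes Pearson and Spearman correlation coefficients with pandas and draws a scatter plot with matplotlib. The data values and the column names full_time_staff and fixed_term_staff are hypothetical; substitute the variables from your own research question.

```python
import matplotlib

matplotlib.use("Agg")  # write to a file; drop this line inside a notebook
import pandas as pd

# Hypothetical counts per provider; not real HESA values.
staff = pd.DataFrame({
    "full_time_staff": [100, 150, 200, 250, 300],   # independent variable
    "fixed_term_staff": [22, 28, 45, 48, 62],       # dependent variable
})

# Pearson measures linear association; Spearman measures monotonic
# association based on ranks.
pearson = staff.corr(method="pearson").loc["full_time_staff", "fixed_term_staff"]
spearman = staff.corr(method="spearman").loc["full_time_staff", "fixed_term_staff"]
print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")

# Scatter plot of the dependent variable against the independent variable.
ax = staff.plot.scatter(x="full_time_staff", y="fixed_term_staff")
ax.set_title(f"Pearson r = {pearson:.2f}")
ax.figure.savefig("staff_scatter.png")
```

A coefficient on its own does not show statistical significance; that needs a separate test chosen to suit the types of variables involved.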
Your project report should include the following:
1. Executive summary
• A brief summary of your project
2. Aims and objectives
• A brief description about the general aims of your project
• and more detailed objectives to achieve those aims
3. The Source Data
• Describe the data:
i. its sources and its, and
ii. comment on its quality
• Variable classification I:
i. Classify all variables as either dependent or independent variables.
ii. Organize your answer into a table.
4. The Research Question
• State your research question by:
i. identifying the independent variables, and
ii. the dependent variables you wish to investigate.
5. Analysis and Findings
• Produce convincing evidence of a statistically significant correlation between your chosen independent and dependent variables. You must choose a statistical method appropriate to the types of measure of the variables in your study.
• Give your critical interpretation and conclusions about those observed correlations.
• Produce tabular summaries of the data in the form of crosstabs or pivot tables, along with your critical interpretation of those tables.
• At least two relevant visualizations along with your critical interpretations of each visualization.
• Your final answer to the research question you posed
• and critical comment on your conclusions.
6. Project Description
• Describe how you:
i. planned your project
ii. went about acquiring your data,
iii. prepared it,
iv. analyzed it,
v. reported your findings
• Reflect on:
i. your experience with the project,
ii. what you learned,
iii. what went well,
iv. what went wrong,
v. and how you can benefit from this experience in future projects
• At least 6 references. All references must be in the Harvard style of referencing and must be accompanied by proper citations in the text.
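The tabular summaries and significance tests asked for under Analysis and Findings might be approached along the lines of the sketch below. All data values and column names are hypothetical, the 0.05 threshold is a common convention rather than a requirement of this assignment, and the choice of test (chi-square for two categorical variables, Pearson for two numeric variables) is an assumption you should justify for your own data.

```python
import pandas as pd
from scipy import stats

ALPHA = 0.05  # conventional significance level (an assumption)

# Hypothetical staff counts; not real HESA values.
counts = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male"],
    "mode": ["Full-time", "Part-time", "Full-time", "Part-time"],
    "number": [120, 80, 135, 40],
})

# Pivot table: one row per sex, one column per mode of employment.
table = pd.pivot_table(counts, values="number", index="sex",
                       columns="mode", aggfunc="sum")
print(table)

# Two categorical variables: chi-square test of independence.
chi2, p_chi2, dof, expected = stats.chi2_contingency(table.to_numpy())
print(f"chi2 = {chi2:.2f}, dof = {dof}, significant: {p_chi2 < ALPHA}")

# Two numeric variables: Pearson correlation with its p-value.
providers = pd.DataFrame({
    "full_time_staff": [100, 150, 200, 250, 300],
    "fixed_term_staff": [22, 28, 45, 48, 62],
})
r, p_r = stats.pearsonr(providers["full_time_staff"],
                        providers["fixed_term_staff"])
print(f"r = {r:.2f}, significant: {p_r < ALPHA}")
```

For ordinal data or clearly non-linear relationships, stats.spearmanr is a common alternative to stats.pearsonr.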