TDM 20100: Project 6 — 2023
Motivation: awk is a programming language designed for text processing. It can be a quick and efficient way to parse and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation quickly using something like awk.
Context: This is the second of three projects where we introduce awk. awk is a powerful tool that can be used to perform a variety of the tasks that we've previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner.
Scope: awk, UNIX utilities
Dataset(s)
The following questions will use the following dataset(s):
- /anvil/projects/tdm/data/restaurant/orders.csv
- /anvil/projects/tdm/data/whin/observations.csv
Questions
Question 1 (1 pt)
- How many columns and rows are in the following dataset: /anvil/projects/tdm/data/restaurant/orders.csv?
The following is example output:
rows: 12345 columns: 12345
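As a warm-up for this kind of counting, here is a minimal sketch on a tiny made-up sample (the `sample` data and its column names are hypothetical, not taken from orders.csv). It relies on two awk built-ins: NR, the number of records read so far, and NF, the number of fields in the current record. It assumes a plain comma-separated layout with no quoted fields that contain commas.

```shell
# Hypothetical sample data standing in for a real CSV file.
sample='id,city,amount
1,Lafayette,10.50
2,Chicago,7.25
3,Lafayette,3.00'

# Save NF from the first line; in the END block, NR holds the total line count.
counts=$(printf '%s\n' "$sample" |
    awk -F',' 'NR == 1 { cols = NF } END { print "rows: " NR " columns: " cols }')
echo "$counts"
```

Note that the row count here includes the header line; subtract 1 if you only want to count data rows.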
Question 2 (1 pt)
- Please list all possible values of "Location Type" in the file /anvil/projects/tdm/data/restaurant/orders.csv and how many times each value occurs.
Your output should give each location type, followed by the number of orders for that location type. Use awk to answer this question. Make sure to format the output as follows:
Location Type    Number of Orders
--------------   ----------------
AAA              12345
bb               99999
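One common way to tally values in awk is an associative array keyed by the value, with printf used to line up the columns. The sketch below uses a hypothetical two-column sample (the `kind` column is an invented stand-in), so the field number and column widths will differ for the real file.

```shell
# Hypothetical sample data; column 2 plays the role of a category column.
sample='order,kind
1,home
2,work
3,home
4,home'

table=$(printf '%s\n' "$sample" | awk -F',' '
NR > 1 { count[$2]++ }    # skip the header, tally each value of column 2
END {
    # %-10s left-justifies the text in a 10-character column
    printf "%-10s %s\n", "Kind", "Count"
    printf "%-10s %s\n", "----", "-----"
    for (k in count) printf "%-10s %d\n", k, count[k]
}')
echo "$table"
```

The order in which `for (k in count)` visits the keys is not specified by awk, so the data rows may appear in any order.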
Question 3 (2 pts)
- What is the year range for the data in the dataset: /anvil/projects/tdm/data/restaurant/orders.csv?
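A range like this can be found by tracking a running minimum and maximum. The sketch below assumes a hypothetical date column written as YYYY-MM-DD, so that the year is the first four characters; check how dates are actually stored in the real file before borrowing the substr call.

```shell
# Hypothetical sample data with dates in YYYY-MM-DD form.
sample='id,date
1,2018-03-14
2,2021-11-02
3,2019-07-30'

range=$(printf '%s\n' "$sample" | awk -F',' '
NR > 1 {
    year = substr($2, 1, 4) + 0    # first four characters, forced to a number
    if (min == "" || year < min) min = year
    if (max == "" || year > max) max = year
}
END { print min " to " max }')
echo "$range"
```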
Question 4 (2 pts)
- What is the sum of the order amounts for each year in the data set /anvil/projects/tdm/data/restaurant/orders.csv?
Please make sure the output format is the following:
Year    Summary of Orders in dollars
2019    $PUT THE TOTAL DOLLAR AMOUNT HERE
It is totally OK if you put the dollar amount in scientific notation (that will probably happen by default when you add up the dollar amounts, because there were a lot of restaurant orders).
ANOTHER NOTE: There is only 1 year (namely, 2019) in this data set.
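Grouped sums follow the same associative-array pattern as counting, except that you add a field's value instead of 1. The sample below is hypothetical (made-up dates and amounts, with the year assumed to be the first four characters of column 1). It also shows why scientific notation appears: a plain print converts non-integer numbers with awk's default OFMT of "%.6g", while printf with "%.2f" keeps ordinary decimal notation.

```shell
# Hypothetical sample data: a date column and an amount column.
sample='date,amount
2019-01-05,10.50
2019-02-11,4.25
2020-06-01,100.00'

totals=$(printf '%s\n' "$sample" | awk -F',' '
NR > 1 { sum[substr($1, 1, 4)] += $2 }    # add each amount to the total for its year
END { for (y in sum) printf "%s $%.2f\n", y, sum[y] }' | sort)
echo "$totals"
```

Piping through sort gives a stable order, since awk does not guarantee the order of `for (y in sum)`.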
Question 5 (2 pts)
- Please extract both the years and months from the file /anvil/projects/tdm/data/whin/observations.csv and how many times each year-and-month pair occurs.
Your output should give each year-and-month value, followed by the number of times that this year-and-month appears. Use awk to answer this question. You likely will need to use awk twice in a pipeline. Make sure to format the output as follows:
Month and Year   Number of Occurrences
--------------   ---------------------
2020-06          12345
2020-07          99999
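To illustrate what "awk twice in a pipeline" can look like, here is a small sketch: the first awk pulls out a year-and-month key, and the second awk counts how often each key appears. The sample data, the `observed_at` column name, and the timestamp layout are all hypothetical; the real file may store its dates in a different column or format.

```shell
# Hypothetical sample data with timestamps that start with YYYY-MM.
sample='station,observed_at
a,2020-06-01 12:00:00
b,2020-06-15 08:30:00
a,2020-07-02 09:45:00'

pairs=$(printf '%s\n' "$sample" |
    awk -F',' 'NR > 1 { print substr($2, 1, 7) }' |
    awk '{ n[$1]++ } END { for (k in n) printf "%-10s %d\n", k, n[k] }' |
    sort)
echo "$pairs"
```

Splitting the work this way keeps each awk program short: the first stage only has to worry about parsing, and the second only about counting.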
Project 06 Assignment Checklist
- Jupyter notebook with your code, comments and output for questions 1 to 5
  - firstname-lastname-project06.ipynb
- A .sh text file with all of your bash code and comments written inside of it
  - bash code and comments used to solve questions 1 through 5
Submit files through Gradescope
Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.