STAT 19000: Project 9 — Spring 2021
Motivation: We’ve covered a lot of material in a very short amount of time, and you now have many powerful tools at your disposal. Last semester, in Project 14, we used our new skills to build a beer recommendation system, although it is generous to call what we built a recommendation system. In the next couple of projects, we will use our Python skills to build a real beer recommendation system!
Context: At this point in the semester we have a solid grasp of Python basics, and we are looking to sharpen our skills with the pandas and numpy packages by building a data-driven recommendation system for beers.
Scope: python, pandas, numpy
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/beer
Load the following datasets up and assume they are always available:
beers = pd.read_parquet("/class/datamine/data/beer/beers.parquet")
breweries = pd.read_parquet("/class/datamine/data/beer/breweries.parquet")
reviews = pd.read_parquet("/class/datamine/data/beer/reviews.parquet")
Questions
Question 1
Write a function called prepare_data that accepts an argument called myDF that is a pandas DataFrame. In addition, prepare_data should accept an argument called min_num_reviews that is an integer representing the minimum number of reviews that both the user and the beer must have in order to be included in the data. The function prepare_data should return a pandas DataFrame with the following properties:
First, remove all rows where score, username, or beer_id is missing, like this:
myDF = myDF.loc[myDF.loc[:, "score"].notna(), :]
myDF = myDF.loc[myDF.loc[:, "username"].notna(), :]
myDF = myDF.loc[myDF.loc[:, "beer_id"].notna(), :]
myDF = myDF.reset_index(drop=True)
Among the remaining rows, choose the rows of myDF that have a user (username) and a beer_id that each occur at least min_num_reviews times in myDF.
train = prepare_data(reviews, 1000)
print(train.shape) # (952105, 10)
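Here is a minimal sketch of one way prepare_data could look, filtering in a single pass using value_counts; other approaches (for example, groupby combined with filter) work just as well:
def prepare_data(myDF, min_num_reviews):
    # drop rows where score, username, or beer_id is missing
    myDF = myDF.loc[myDF.loc[:, "score"].notna(), :]
    myDF = myDF.loc[myDF.loc[:, "username"].notna(), :]
    myDF = myDF.loc[myDF.loc[:, "beer_id"].notna(), :]
    myDF = myDF.reset_index(drop=True)

    # keep only users and beers that each occur at least min_num_reviews times
    user_counts = myDF.loc[:, "username"].value_counts()
    beer_counts = myDF.loc[:, "beer_id"].value_counts()
    frequent_users = user_counts[user_counts >= min_num_reviews].index
    frequent_beers = beer_counts[beer_counts >= min_num_reviews].index
    keep = myDF.loc[:, "username"].isin(frequent_users) & myDF.loc[:, "beer_id"].isin(frequent_beers)
    return myDF.loc[keep, :].reset_index(drop=True)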
We added two examples of how to do this with the election data (instead of the beer review data) in the book: cleaning and filtering data
- Python code used to solve the problem.
- Output from running your code.
Question 2
Run the function from question (1) using train = prepare_data(reviews, 1000). The basis of our recommendation system will be to "match" a user to another user with similar taste in beer. Different users will have different means and variances in their scores, so if we are going to compare users’ scores, we should standardize them. Update the train DataFrame with 1 additional column: standardized_score. To calculate the standardized_score, take each individual score, subtract the user’s average score, and divide the result by the standard deviation of the user’s scores.
In R, we have the following code:
myDF <- data.frame(a=c(1,2,3,1,2,3), b=c(6,5,4,5,5,5), c=c(9,9,9,8,8,8))
myMean = tapply(myDF$b + myDF$c, myDF$a, mean)
myMeanDF = data.frame(a=as.numeric(names(myMean)), mean=myMean)
myDF = merge(myDF, myMeanDF, by='a')
Or you could use the very handy tidyverse package in R to do the same thing:
library(tidyverse)
myDF <- data.frame(a=c(1,2,3,1,2,3), b=c(6,5,4,5,5,5), c=c(9,9,9,8,8,8))
myDF %>%
group_by(a) %>%
mutate(d=mean(b+c))
Unfortunately, there isn’t a great way to do this in Python:
def summer(data):
    data['d'] = (data['b']+data['c']).mean()
    return data

myDF = myDF.groupby(["a"]).apply(summer)
Create a new column standardized_score. Calculate the standardized_score by taking each score, subtracting the user’s average score, and dividing by the user’s standard deviation. As it may take a minute or two to create this new column, feel free to test your code on a small sample of the reviews DataFrame:
import pandas as pd
testDF = pd.read_parquet('/class/datamine/data/beer/reviews_sample.parquet')
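As a minimal sketch (assuming the train DataFrame from question (1)), the groupby method together with transform broadcasts each user's mean and standard deviation back onto the original rows, avoiding the apply pattern above:
# per-user mean and standard deviation, aligned with the rows of train
user_mean = train.groupby("username")["score"].transform("mean")
user_std = train.groupby("username")["score"].transform("std")
train["standardized_score"] = (train["score"] - user_mean) / user_std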
If you are worried about getting `NA`s, do not worry. The only way we would get `NA`s would be if there is only a single review for the user (which we took care of by limiting to users with at least 1000 reviews), or if there is no variance in a user’s scores (which doesn’t happen).
We added an example about how to do this with the election data in the book: standardizing data example
- Python code used to solve the problem.
- Output from running your code.
Question 3
Use the pivot_table method from pandas to put your train data into "wide" format. This means that each row in the new DataFrame will be a username, and each column will be a beer_id. Each cell will contain the standardized_score for the given username and beer combination. Call the resulting DataFrame score_matrix.
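A minimal sketch, assuming the train DataFrame from question (2) with its standardized_score column:
# rows are usernames, columns are beer_ids, cells are standardized scores
score_matrix = train.pivot_table(index="username", columns="beer_id", values="standardized_score")
print(score_matrix.head())
print(score_matrix.shape)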
- Python code used to solve the problem.
- Output the head and shape of score_matrix.
Question 4
The result from question (3) should be a sparse matrix (lots of missing data!). Let’s fill in the missing data. For now, fill in each beer_id’s missing values with the average score for that beer.
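A minimal sketch, assuming the score_matrix from question (3); passing a Series to fillna fills each column using the value whose index matches that column’s label:
# score_matrix.mean() is a Series indexed by beer_id (the column means),
# so fillna fills each beer_id column's missing values with that beer's average score
score_matrix = score_matrix.fillna(score_matrix.mean())
print(score_matrix.head())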
- Python code used to solve the problem.
- Output the head of score_matrix.
Congratulations! Next week, we will complete our recommendation system!