Fiction Tagging Engine: Data Transformation
Data Extraction
The extraction process is described in an earlier post, Fiction Tagging Engine: Data Extraction. The script is available here: https://github.com/jfrankbryant/ao3ETLpipeline (please note that the script still needs to be cleaned up and currently lives in a Jupyter notebook).
What each step of the script does to transform and load the dataset:
- Read the manually processed labels and the in-process dataset into dataframes
- Set up the createTagDict function to convert all label-related dataframes into dictionaries
- Call createTagDict function to create tag, fandom, and fandom genre dictionaries
- Convert tag and fandom columns in AO3 dataframe from strings to lists
- Create functions to map dictionary keys and values to dataframe columns
  - mapDictKeys2Column: loop through a column, replacing an element in a list with the dictionary key if the element matches one of the key’s values
  - mapDictValues2Column: loop through a column, replacing an element in a list with the dictionary values if the element matches the values’ key
- Call function to map processed tags over unprocessed tags to standardize tag data
- Create oneHotEncode function to one-hot encode data
- Call the function to one-hot encode the data so it can be fed into a machine learning algorithm. The 0s are converted to NaNs so the dataset can be collapsed back to its original number of rows with all relevant columns encoded in each row (instead of one column encoded per row), then the 0s are re-added so the data can later be held in integer columns in the SQLite database. Finally, drop the columns that are now duplicate or incomplete
- Create listToString function to convert a list to a string
- Call the function to convert the lists in the processedAO3 dataframe into strings, because SQLite cannot store list-type values
- Store processedAO3 dataframe, tags dataframe, fandoms dataframe, and fandom genre dataframe as tables in SQLite database
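The steps above can be sketched end to end in a few lines. This is only a rough illustration, not the notebook's actual code: it uses plain Python lists and the standard-library sqlite3 module instead of pandas dataframes so that it runs on its own, the snake_case function names are stand-ins for createTagDict, mapDictKeys2Column, oneHotEncode, and listToString, and the labels and tags are made-up sample data.

```python
import sqlite3

def create_tag_dict(rows):
    """Build {group label: [raw tags]} from (label, raw tag) pairs."""
    tag_dict = {}
    for label, raw_tag in rows:
        tag_dict.setdefault(label, []).append(raw_tag)
    return tag_dict

def map_dict_keys_to_column(column, tag_dict):
    """Replace each raw tag with the first key whose values contain it."""
    mapped = []
    for tags in column:
        new_tags = []
        for tag in tags:
            # Unmatched tags are kept as-is (an assumption of this sketch).
            replacement = next(
                (key for key, values in tag_dict.items() if tag in values), tag
            )
            if replacement not in new_tags:
                new_tags.append(replacement)
        mapped.append(new_tags)
    return mapped

def one_hot_encode(column):
    """Return the sorted labels and one row of 0/1 flags per work."""
    labels = sorted({tag for tags in column for tag in tags})
    rows = [[1 if label in tags else 0 for label in labels] for tags in column]
    return labels, rows

def list_to_string(items):
    """Join a list into one string so it fits in a SQLite TEXT column."""
    return ", ".join(items)

# Hypothetical sample data standing in for the label files and the AO3 dataset.
label_rows = [
    ("transTags", "trans gomez addams"),
    ("addamsFamilyTags", "wednesday addams"),
    ("fluffTags", "tooth-rotting fluff"),
]
raw_column = ["trans gomez addams, wednesday addams", "tooth-rotting fluff"]

tag_dict = create_tag_dict(label_rows)
# Convert the tag column from strings to lists.
tag_lists = [[t.strip() for t in cell.split(",")] for cell in raw_column]
standardized = map_dict_keys_to_column(tag_lists, tag_dict)
labels, encoded = one_hot_encode(standardized)

# Store the result: tags as a string, one integer column per label.
conn = sqlite3.connect(":memory:")
columns = ", ".join(f'"{label}" INTEGER' for label in labels)
conn.execute(f"CREATE TABLE processedAO3 (tags TEXT, {columns})")
placeholders = ", ".join("?" for _ in range(len(labels) + 1))
conn.executemany(
    f"INSERT INTO processedAO3 VALUES ({placeholders})",
    [(list_to_string(tags), *row) for tags, row in zip(standardized, encoded)],
)
stored = conn.execute("SELECT * FROM processedAO3").fetchall()
conn.close()
```

In a pandas version the same stages would operate on dataframe columns, but the order of operations (build dictionaries, standardize tags, encode, stringify, store) would stay the same.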
Improvement
I would like a tag that relates to multiple groups to be replaced by all of the group tags that apply to it. For example, the tag ‘trans gomez addams’ should be replaced by transTags, latinxCharacters, and addamsFamilyTags. The way the code is set up right now, the tag is replaced by whichever group comes first. A possible solution is to have the code search for the tag in the dictionary’s values, pull out the keys for those values, and add those keys to the list for the cell. This will probably make the code run slower, but it would result in better (and more accurate) labeling for the dataset.
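One way this could look is sketched below. It is an untested idea rather than part of the current script: the function names build_reverse_index and map_all_groups are hypothetical, and the dictionary contents are invented for the example. Building a reverse lookup (raw tag to all of its group keys) once up front may also offset some of the expected slowdown, since each tag then needs a single dictionary lookup instead of a scan over every group's values.

```python
def build_reverse_index(tag_dict):
    """Invert {group: [raw tags]} into {raw tag: [groups]} (hypothetical helper)."""
    reverse = {}
    for key, values in tag_dict.items():
        for value in values:
            reverse.setdefault(value, []).append(key)
    return reverse

def map_all_groups(tags, reverse_index):
    """Replace each raw tag with every group it belongs to, without duplicates."""
    result = []
    for tag in tags:
        # Tags with no group are kept unchanged (an assumption of this sketch).
        for label in reverse_index.get(tag, [tag]):
            if label not in result:
                result.append(label)
    return result

# Invented sample dictionary; the real one comes from the label files.
tag_dict = {
    "transTags": ["trans gomez addams", "trans character"],
    "latinxCharacters": ["trans gomez addams", "gomez addams"],
    "addamsFamilyTags": ["trans gomez addams", "gomez addams", "wednesday addams"],
}

reverse_index = build_reverse_index(tag_dict)
cell_labels = map_all_groups(["trans gomez addams", "unlisted tag"], reverse_index)
```

Here the cell for ‘trans gomez addams’ ends up with all three group tags instead of only the first match.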