Fiction Tagging Engine: Data Transformation

July 3, 2021 by Jacqui B

Data Extraction

The extraction process is described in Fiction Tagging Engine: Data Extraction. You can get the script here: https://github.com/jfrankbryant/ao3ETLpipeline (please note that this script still needs to be cleaned up and currently lives in a Jupyter notebook).

What each step of the script does to transform and load the dataset:

Read the manually processed labels and the in-process dataset into dataframes
Set up createTagDict function to convert all label-related dataframes into dictionaries
Call createTagDict function to create tag, fandom, and fandom genre dictionaries
Convert tag and fandom columns in AO3 dataframe from strings to lists
Create functions to map dictionary keys and values to dataframe columns
- mapDictKeys2Column: loop through a column, replacing an element in a list with the dictionary key if the element matches one of the key’s values
- mapDictValues2Column: loop through a column, replacing an element in a list with the dictionary values if the element matches the values’ key
Call function to map processed tags over unprocessed tags to standardize tag data
Create oneHotEncode function to one-hot encode data
Call function to one-hot encode the data so it can be fed into a machine learning algorithm; convert 0s to NaNs so the dataset can be shrunk back down to its original number of rows with all relevant columns encoded (instead of one column encoded per row); re-add the 0s (so the data can be held in an integer column in the SQLite database later); and drop columns that are now duplicate or incomplete
Create listToString function to convert a list to a string
Call function to convert lists in processedAO3 dataframe into strings because SQLite databases don’t allow for storing list type values
Store processedAO3 dataframe, tags dataframe, fandoms dataframe, and fandom genre dataframe as tables in SQLite database
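
The steps above can be sketched in a few small functions. This is only a rough illustration, not the notebook's actual code: it uses plain Python lists and dictionaries in place of pandas dataframes so it runs with the standard library alone, it skips the fandom and fandom genre tables, and it builds the one-hot rows directly instead of going through the 0s-to-NaNs step. The snake_case function names, the sample tags, and the table layout are all assumptions made for the example.

```python
import sqlite3


def create_tag_dict(label_rows):
    """Group raw tags under their processed label: {label: [raw tags]}."""
    tag_dict = {}
    for raw_tag, label in label_rows:
        tag_dict.setdefault(label, []).append(raw_tag)
    return tag_dict


def map_dict_keys_to_column(column, tag_dict):
    """Replace each raw tag with the first key whose values contain it."""
    mapped = []
    for tags in column:
        mapped.append([
            next((key for key, values in tag_dict.items() if tag in values), tag)
            for tag in tags
        ])
    return mapped


def one_hot_encode(column):
    """Return the sorted label list and one row of 0/1 flags per work."""
    labels = sorted({tag for tags in column for tag in tags})
    rows = [[1 if label in tags else 0 for label in labels] for tags in column]
    return labels, rows


def list_to_string(items, sep=", "):
    """SQLite has no list type, so a list is stored as delimited text."""
    return sep.join(items)


def store_table(conn, mapped, labels, encoded):
    """Write the tag strings plus one INTEGER column per label."""
    columns = ["tags TEXT"] + [f'"{label}" INTEGER' for label in labels]
    conn.execute(f"CREATE TABLE processedAO3 ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO processedAO3 VALUES ({placeholders})",
        [[list_to_string(tags)] + row for tags, row in zip(mapped, encoded)],
    )
    conn.commit()


# Hypothetical sample data standing in for the label sheets and the AO3 column.
label_rows = [
    ("trans gomez addams", "transTags"),
    ("transgender character", "transTags"),
    ("morticia/gomez", "addamsFamilyTags"),
]
raw_column = [["trans gomez addams", "fluff"], ["morticia/gomez"]]

tag_dict = create_tag_dict(label_rows)
mapped = map_dict_keys_to_column(raw_column, tag_dict)
# mapped == [["transTags", "fluff"], ["addamsFamilyTags"]]
labels, encoded = one_hot_encode(mapped)
# labels == ["addamsFamilyTags", "fluff", "transTags"]
# encoded == [[0, 1, 1], [1, 0, 0]]

conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
store_table(conn, mapped, labels, encoded)
```

Tags that have no entry in the dictionary (such as 'fluff' here) are left unchanged in this sketch; whether the real script keeps or drops them is not something this example tries to settle.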

Improvement

I would like a tag that relates to multiple groups to be replaced by all of the group tags for that tag. For example, the tag ‘trans gomez addams’ should be replaced by transTags, latinxCharacters, and addamsFamilyTags. The way the code is set up right now, the tag is replaced by whichever one comes first. A possible solution is to have the code search for the tag in the dictionary’s values, pull out the keys for those values, and add those keys to the list for the cell. This will probably make the code run slower, but it would result in better (and more accurate) labeling for the dataset.
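
One way this could look is sketched below, assuming the tag dictionary has the shape {group tag: [raw tags]}. To limit the slowdown, the sketch inverts the dictionary once into a raw-tag lookup rather than searching every value list for every tag. The function names and the sample group memberships are made up for illustration and are not from the notebook.

```python
def build_reverse_index(tag_dict):
    """Invert {group: [raw tags]} into {raw tag: [groups]}, built once."""
    reverse = {}
    for group, raw_tags in tag_dict.items():
        for raw_tag in raw_tags:
            reverse.setdefault(raw_tag, []).append(group)
    return reverse


def map_all_groups(tags, reverse_index):
    """Replace each tag with every group it belongs to; keep unknown tags."""
    result = []
    for tag in tags:
        for label in reverse_index.get(tag, [tag]):
            if label not in result:  # avoid repeating a group tag in one cell
                result.append(label)
    return result


# Hypothetical dictionary in which one raw tag sits in three groups.
tag_dict = {
    "transTags": ["trans gomez addams"],
    "latinxCharacters": ["trans gomez addams", "gomez addams"],
    "addamsFamilyTags": ["trans gomez addams", "morticia addams"],
}
reverse_index = build_reverse_index(tag_dict)

cell = map_all_groups(["trans gomez addams", "fluff"], reverse_index)
# cell == ["transTags", "latinxCharacters", "addamsFamilyTags", "fluff"]
```

Because the lookup is built once up front, each tag costs a single dictionary access afterwards, so the extra group tags should not make the mapping step much slower than the first-match version.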
