I'm A Data Scientist Now

The Second Best Time Is Now

The best time to plant a tree is twenty years ago. The second best time is now. — Chinese proverb (according to the internet)

[Image: a pile of books against a black background.]

A common problem with putting algorithms created by data science into practice is that they codify bad things like racism, misogyny, and classism. In Weapons of Math Destruction, Cathy O’Neil gives multiple examples of an algorithm systematizing the biases of the data it is fed, inadvertently becoming both an obstacle and a self-fulfilling prophecy. One example is an algorithm built to help lenders determine whether a person is a good loan candidate. It becomes an obstacle because the biased data it is trained on builds inequity into the system, and a self-fulfilling prophecy because a person deemed a bad loan candidate is left with only more expensive options, which are more likely to drag down their ability to become a good loan candidate.

I used the Goodreads-books dataset from the Kaggle user jealousleopard, which is a comprehensive list of all books listed on GoodReads, pulled from the GoodReads API. The dataset was last updated March 9, 2020. I made several adjustments to the dataset before exploring it, the biggest being to keep only English-language books.
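As a rough illustration of that filtering step (not necessarily how the project's code does it), here is a minimal sketch using only the Python standard library. The `language_code` column name and the set of codes treated as English are assumptions, and the rows are toy data made up for the example.

```python
import csv
import io

# Assumption: which language codes count as "English". The post does not list them.
ENGLISH_CODES = {"eng", "en-US", "en-GB", "en-CA"}


def english_only(rows, code_field="language_code"):
    """Keep only the rows whose language code is in ENGLISH_CODES."""
    return [row for row in rows if row[code_field].strip() in ENGLISH_CODES]


# Toy rows, made up for illustration; not taken from the real dataset.
SAMPLE_CSV = """title,language_code
Book A,eng
Book B,spa
Book C,en-US
Book D,fre
"""

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
english_rows = english_only(sample_rows)
print([row["title"] for row in english_rows])  # ['Book A', 'Book C']
```

With a real export, the same function could be applied to the rows read from the downloaded CSV file instead of `SAMPLE_CSV`.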

In the vein of #weneeddiversebooks and We Read Too, I was interested in seeing what the top 20 most popular books are and different information about them. I chose two metrics to measure popularity: ratings count and text review count. I chose those metrics over average rating because a book with a 5-star rating, 3 ratings, and 0 text reviews doesn’t have as much engagement as a book with a 3.5-star rating, 1 million ratings, and 10,000 text reviews.
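To make the difference between those metrics concrete, here is a small sketch of ranking by a chosen metric. The field names (`ratings_count`, `text_reviews_count`, `average_rating`) are assumptions about the dataset's columns; "Book A" and "Book B" use the numbers from the paragraph above, and "Book C" is made up.

```python
def top_n(books, metric, n=20):
    """Return the n books with the largest value for `metric`, largest first."""
    return sorted(books, key=lambda book: book[metric], reverse=True)[:n]


# Toy records: A and B mirror the example in the text, C is invented.
toy_books = [
    {"title": "Book A", "average_rating": 5.0,
     "ratings_count": 3, "text_reviews_count": 0},
    {"title": "Book B", "average_rating": 3.5,
     "ratings_count": 1_000_000, "text_reviews_count": 10_000},
    {"title": "Book C", "average_rating": 4.1,
     "ratings_count": 52_000, "text_reviews_count": 12_500},
]

most_rated = top_n(toy_books, "ratings_count", n=2)
most_reviewed = top_n(toy_books, "text_reviews_count", n=2)
best_average = top_n(toy_books, "average_rating", n=1)

print([b["title"] for b in most_rated])     # ['Book B', 'Book C']
print([b["title"] for b in most_reviewed])  # ['Book C', 'Book B']
print([b["title"] for b in best_average])   # ['Book A']
```

Ranking by average rating puts the barely rated "Book A" on top, which is the reason for preferring the count-based metrics.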

The top 20 most rated books were published between 1993 and 2006, and the top 20 most reviewed books were published between 1993 and 2007. Some of these books were first published much earlier (e.g. Romeo and Juliet, Of Mice and Men); the dates here are the publication dates for the specific ISBNs in this dataset. This suggests that a brief moment of mainstream attention only counts for so much: some of the popularity of these books comes from existing for a while and accruing ratings and reviews for well over a decade. It also suggests that while the best time to start the inclusivity work was years ago, as the saying goes, the second best time is now.

At the end of 2018, I used the ISFDB to collect a dataset of all the books Orbit Books published from its launch to November 2018. I augmented the data with my own research to add in identity traits like skin color, sexuality, gender, and disability. I didn’t use that dataset this time because I’m not comfortable making it public, even though all of the information I gathered is publicly available. Right now, given the current climate of the US, making that dataset public feels like helping to highlight people in a way that could make them targets for hate speech and harassment.

I want to use this information to push the publishing industry to get better about representation, so that a dataset of over a hundred authors doesn’t contain fewer than a dozen Black and brown people, only a handful of queer people, and no disabled people. I also wanted this project to be less about cataloging who has been published and more about who has been given room to be popular. I ultimately went with ratings over reviews because the number of ratings dwarfed the number of reviews, although I would love to look into the engagement of the fandoms of the top reviewed books versus the top rated books. I wonder if AO3 has more Jane Eyre fanfic than Catcher in the Rye fanfic. That’s a question for another time.

The questions that I am interested in answering in this dataset are:
What is the relationship between the top 20 most rated books and the top 20 most reviewed books?
What is the relationship with gender/sex in the top 20 most rated books?
What is the relationship with race in the top 20 most rated books?
What is the relationship between category, genre, and subgenres in the top 20 most rated books?
What is the relationship between books set in a universe versus solo stories for the top 20 most rated books?
What can I do about it?

[Figure: ratings/reviews stacked bar chart of the most rated books. Twilight is the most rated book.]
What is the relationship between the top 20 most rated books and the top 20 most reviewed books?
There isn’t a whole lot of movement between the top 20 most rated and the top 20 most reviewed. However, it’s interesting that the top 20 most reviewed books include the same series, for the most part, but with fewer repeats.

[Figure: ratings/reviews stacked bar chart of the most reviewed books. In some cases there are over 10x more ratings than reviews.]

Also, the books that show up in the top 20 most rated but not in the top 20 most reviewed (outside of the series) are books you get assigned to read in middle and high school here in the US (Lord of the Flies, Romeo and Juliet, Of Mice and Men, Little Women).
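One way to sketch this kind of comparison is with set operations on the two lists of titles. The helper name and the placeholder titles below are hypothetical; the real lists would come from the two top 20 rankings.

```python
def compare_top_lists(most_rated, most_reviewed):
    """Split titles into those on both lists and those unique to each list."""
    rated, reviewed = set(most_rated), set(most_reviewed)
    return {
        "both": sorted(rated & reviewed),
        "rated_only": sorted(rated - reviewed),
        "reviewed_only": sorted(reviewed - rated),
    }


# Placeholder titles, made up for illustration.
most_rated_titles = ["Title 1", "Title 2", "Title 3"]
most_reviewed_titles = ["Title 2", "Title 3", "Title 4"]

overlap = compare_top_lists(most_rated_titles, most_reviewed_titles)
print(overlap["both"])           # ['Title 2', 'Title 3']
print(overlap["rated_only"])     # ['Title 1']
print(overlap["reviewed_only"])  # ['Title 4']
```

The `rated_only` group is where the school-assigned books mentioned above would land.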

[Figure: pie chart of author gender for the top 20 most rated books. 65% of authors are cis men, 35% are cis women.]
What is the relationship with gender in the top 20 most rated books?
This turned out to be a boring pie chart with two sections: cis men and cis women, with cis men accounting for 30 percentage points more of the top 20 books than cis women (65% versus 35%). We could have had at least five sections, with trans men, trans women, and non-binary folk, but nope. I’d like to see the publishing industry do better in terms of gender representation of people who are out. I don’t think it’s necessary for anyone to be out of the closet if they are not comfortable or don’t feel safe, but I do think the onus is on the publishing industry to seek out more trans people who are out and writing own voices stories.
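The gap between a 65% share and a 35% share is 30 percentage points. As a sketch of how shares like these can be computed from one label per book, assuming the split is counted per book (13 and 7 of 20, which is an assumption and not stated in the post):

```python
from collections import Counter


def shares(labels):
    """Map each label to its percentage share of the list."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * count / total for label, count in counts.items()}


# Assumption: one label per book, 13 + 7 = 20, matching the 65% / 35% split.
author_gender = ["cis man"] * 13 + ["cis woman"] * 7

gender_shares = shares(author_gender)
point_gap = gender_shares["cis man"] - gender_shares["cis woman"]

print(gender_shares)  # {'cis man': 65.0, 'cis woman': 35.0}
print(point_gap)      # 30.0
```

Any category with no books, such as the missing sections mentioned above, simply never appears in the result.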

[Figure: pie chart of author skin color for the top 20 most rated books. 100% of authors are white.]
What is the relationship with skin color in the top 20 most rated books?
This turned out to be an even more boring graph than the gender one because it is all one section: white. The publishing industry needs to do better in terms of representation of people of color. Nalo Hopkinson has great advice about how to do that: by going to where writers of color congregate and hang out.

[Figure: pie chart of connected books for the top 20 most rated books. 50% of books are part of a series, 10% are in the same universe as other books, 30% are solo stories.]
What is the relationship between books set in a universe versus solo stories for the top 20 most rated books?
It’s interesting, although not surprising, that more than half of the top 20 most rated books are from a series or written in the same universe as other books. I also found it interesting that 50% of the same ‘verse books were the first book, and that out of the seven series/same ‘verse books shown, only three had multiple books from the same series. Also, the first book in a series was more popular than the last, except in the case of Harry Potter, where neither the first nor the last book ranked in the top 20. This supports the idea that series do better than solo stories, though in general the benefit goes to the first book of a series.

[Figure: bar chart of book genres for the top 20 most rated books.]
What is the relationship between category, genre, and subgenres in the top 20 most rated books?
This graph is loosely related to the graph above it because books that come from a series (like the Hobbit/LotR books and the Harry Potter books) are going to have roughly the same categories, genres, and subgenres. More than half the books have a genre of young adult or juvenile. The data doesn’t have information about the users leaving ratings, so there isn’t a way to know whether these books were read by a user a long time ago, remembered fondly (or harshly), and are only now being rated; whether the users of GoodReads skew younger; or whether some other factor is causing this correlation. The other popular genre is speculative fiction, with more than half of the books being from the fantasy or science fiction genres.

What can I do about it?
The gatekeepers of the publishing industry are less important because self-publishing is no longer frowned upon and people can build a career as an author by building a self-publishing business. However, with more books to read, there’s more for a customer to sort through, and Amazon’s recommendations can only take you so far (if you shop on Amazon at all). If you are a reader, you can start with the resources provided by We Need Diverse Books. If you’re in a gatekeeper role (literary agent, editor, small press publisher, etc.), you may need to step outside of a comfort zone while reading, but I can guarantee that you have stories in your slush pile that will diversify who you represent/publish. If not, you can always use the WNDB resources linked above to lead you to where Black, queer, and/or disabled writers are congregating and publishing.

If you are interested in looking at how I got to this point, my GitHub project is here.
