Part 1: Get the data from Kaggle and tokenize


1) Set up libraries

2) Set up for Kaggle API

3) Download data from Kaggle

4) Create a process to find all the PDF files on the drive, write them to a list, and count the files.
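
A minimal sketch of step 4, assuming the downloaded files sit under a single root folder. The function name `find_pdf_files` and the folder name in the usage note are hypothetical and not taken from the notebook.

```python
from pathlib import Path


def find_pdf_files(root):
    """Return a sorted list of every PDF path under root, searching recursively."""
    # Compare the suffix case-insensitively so "FILE.PDF" is also found.
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Usage might look like `pdf_files = find_pdf_files("data")` followed by `print(len(pdf_files))` to get the file count, where `"data"` is a placeholder for wherever the Kaggle download was unpacked.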

5) Run the extraction and tokenization process (read all the files and tokenize their words).

6) Specify words to exclude from difficulty-level tagging

7) Create a process to extract keywords by exclusion and ensure the words are unique
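
One way step 7 could be sketched, assuming the text has already been extracted to a plain string and that a keyword is any alphabetic token not on the exclusion list. The name `extract_keywords` and the regular expression are illustrative assumptions, not the notebook's own code.

```python
import re


def extract_keywords(text, exclude):
    """Return the unique words in text that are not excluded, in first-seen order."""
    excluded = {w.lower() for w in exclude}
    seen = set()
    keywords = []
    # Lowercase first so "Engineer" and "engineer" count as the same word.
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in excluded and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords
```

Keeping a `seen` set alongside the list preserves the order in which words first appear, which a plain `set()` conversion would not guarantee.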

8) Run the keyword extraction process

Part 2: Determine level of difficulty


Look up the level of difficulty of each word (Twinword API) and write the words and their difficulty levels to a file

9) Prompt for unirest key

10) Iterate through the words, look up the level of difficulty of each, and write the results to a list. (The example is capped at 100 for now.)

11) Create a process to sort and output the data to a .csv file with the most difficult words at the top.
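
A hedged sketch of step 11. It assumes the lookups have been collected as `(word, difficulty)` pairs and that a larger number means a more difficult word; the function name, column headers, and that ordering assumption are illustrative rather than taken from the notebook or the Twinword API.

```python
import csv


def write_difficulty_csv(rows, path):
    """Sort (word, difficulty) pairs hardest-first and write them to a CSV file."""
    # Assumption: higher difficulty values mean harder words.
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "difficulty"])  # header row
        writer.writerows(ordered)
    return ordered
```

For instance, `write_difficulty_csv([("cat", 2), ("ubiquitous", 9)], "difficulty.csv")` would place `ubiquitous` on the first data row; the scores here are made-up values for illustration only.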

12) Run the process to create the .csv

Part 3: Gender Bias


Referencing this research article and acknowledging this tool, check whether the job descriptions contain male or female bias words.

13) Import gender bias words from .csv

.csv test data

Compare the words to the list of known gender bias words and write the output to files.


14) Transform to lists for matching

Test Data for Bias

15) Process for male bias words and write matches to file

16) Process for female bias words and write matches to file
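
Steps 14 to 16 could be sketched roughly as follows, assuming exact, case-insensitive matching between the extracted words and a bias word list. The function names and the sample words in the usage note are placeholders, not the contents of the imported .csv, and the same two functions would serve both the male and the female lists.

```python
def find_bias_matches(words, bias_words):
    """Return the sorted, de-duplicated words that also appear in bias_words."""
    bias = {b.strip().lower() for b in bias_words}
    # Set intersection performs exact, case-insensitive matching.
    return sorted({w.lower() for w in words} & bias)


def write_matches(matches, path):
    """Write one matched word per line to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(matches))
```

A call such as `find_bias_matches(["Competitive", "team"], ["competitive", "dominant"])` returns `["competitive"]`. If the word list holds stems rather than whole words, prefix matching with `str.startswith` would be needed in place of the set intersection.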