Detecting PII (Personally Identifiable Information) in Text!

We are a team of talented data science and analytics students working to help students, teachers, and educational institutions detect PII in their writing.

Our Technologies

Our team is the Algorithm Allies! We are working on a Kaggle project from the site's Competitions section. Our client is Dr. Gunay.

The Kaggle competition we are participating in is PII Data Detection, hosted by The Learning Agency Lab. The objective of the project is to create an AI model that detects personally identifiable information (PII) so that it can be redacted. This matters when educational material is released to the public, because it protects the identity of students. The data consists of JSON files of student essays that were tokenized using spaCy.
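As a rough illustration of what working with this kind of data can look like, the sketch below parses one essay record, rebuilds its text from the tokens, and groups labeled tokens into PII spans. The field names (`tokens`, `trailing_whitespace`, `labels`), the BIO-style tags, and the sample sentence are assumptions made for the example, not a copy of our pipeline.

```python
import json

# Hypothetical record shaped like a tokenized essay; field names and the
# sample sentence are assumptions for illustration only.
SAMPLE = json.loads("""
[{"document": 1,
  "tokens": ["My", "name", "is", "Jane", "Doe", "."],
  "trailing_whitespace": [true, true, true, true, false, false],
  "labels": ["O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"]}]
""")


def rebuild_text(tokens, trailing_whitespace):
    """Join tokens back into text, adding a space where the flag is set."""
    return "".join(
        token + (" " if has_space else "")
        for token, has_space in zip(tokens, trailing_whitespace)
    )


def pii_spans(tokens, labels):
    """Group BIO-tagged tokens into (pii_type, start_index, text) tuples."""
    spans = []
    current = None  # (pii_type, start_index, [tokens]) for the span in progress
    for index, (token, label) in enumerate(zip(tokens, labels)):
        pii_type = label[2:] if label != "O" else None
        continues = (
            label.startswith("I-")
            and current is not None
            and current[0] == pii_type
        )
        if continues:
            current[2].append(token)
            continue
        # Any other label closes the span in progress, if there is one.
        if current is not None:
            spans.append((current[0], current[1], " ".join(current[2])))
            current = None
        if pii_type is not None:
            current = (pii_type, index, [token])
    if current is not None:
        spans.append((current[0], current[1], " ".join(current[2])))
    return spans
```

Calling `pii_spans(essay["tokens"], essay["labels"])` on a record returns one tuple per detected span, which is a convenient shape for counting PII types or marking text for redaction.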

About us

Data Sources

Our data was sourced from Kaggle. We obtained the training and test data in JSON format from the Kaggle competition, along with an example submission file.

Machine Learning Model

We are using the pre-trained BERT-base-cased model for this project. Our goal is to fine-tune it to meet the needs of the project: identifying and labeling the types and locations of PII (Personally Identifiable Information) in English text data.
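One step that usually comes up when fine-tuning a BERT-style model for token labeling is aligning word-level labels with the subword pieces the tokenizer produces. The sketch below shows one common approach, labeling only the first piece of each word and masking the rest. The label list, the `align_labels` helper, and the `word_ids` values are hypothetical stand-ins (a fast Hugging Face tokenizer would normally supply `word_ids`), so treat this as an illustration under those assumptions rather than our exact training code.

```python
# -100 is the default ignore index of PyTorch's cross-entropy loss, so
# positions carrying it do not contribute to the training loss.
IGNORE_INDEX = -100

# Illustrative label set; a real project would list every PII type it uses.
LABELS = ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "B-EMAIL", "I-EMAIL"]
LABEL2ID = {label: index for index, label in enumerate(LABELS)}


def align_labels(word_ids, word_labels, label2id=LABEL2ID):
    """Map word-level labels onto subword positions.

    word_ids holds, for each subword position, the index of the word it came
    from, or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token
        elif word_id != previous:
            aligned.append(label2id[word_labels[word_id]])  # first piece of a word
        else:
            aligned.append(IGNORE_INDEX)  # later piece of the same word
        previous = word_id
    return aligned


# Hypothetical example: "Doe" is assumed to split into two subword pieces.
example_word_labels = ["O", "B-NAME_STUDENT", "I-NAME_STUDENT"]
example_word_ids = [None, 0, 1, 2, 2, None]
example_aligned = align_labels(example_word_ids, example_word_labels)
```

Masking the later pieces keeps each word counted once in the loss. Labeling every piece with the word's tag is another reasonable choice.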

Visualizations

Our visualizations were created using a variety of technologies. Matplotlib was used in our notebook for EDA (Exploratory Data Analysis), Plotly was used to show the visualizations dynamically on our application site, and D3.js was used to create dynamic visualizations on this website.

Website

The website that you are currently viewing was created using HTML, CSS, and JS, along with Bootstrap. The application was created using Taipy, which allows us to create a web interface for interaction with our model, completely in Python.

Our Web Application

Taipy Web Application Home Page.

The goal of this page was to give the user an initial landing page from which they can open the application page with a button click, follow links to the data used to train our model, and preview our training dataset.

  • Provide buttons for direct access to the application page.
  • Provide links to the datasets on Kaggle.
  • Display our training dataset so that the user can see the inputs the model was trained on.

Our Web Application has an About Page

The About page summarizes the goal of the project and introduces the team behind it. It also lists the technologies that we leveraged in the project.

  • Give a synopsis about the project.
  • Show the teammates that worked on this project.
  • Display the technologies used in this project.

Our Model Made Usable via Web, Thanks to Taipy

The goal of this application was to enable users to identify PII in their text. The user will upload a copy of their text and have it processed by our model, which will then produce a report for the user. We wanted the user to be able to:

  • Upload a copy of their text file.
  • See a preview of their input and the report.
  • Download a copy of their report for further use.
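To make the upload, preview, and download flow more concrete, here is a minimal sketch of how a downloadable report could be assembled from model output. The CSV layout and the `build_report` helper are assumptions for illustration, and the token and label lists stand in for whatever the model actually predicts.

```python
import csv
import io


def build_report(tokens, labels):
    """Return CSV text with one row per token flagged as PII."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["token_index", "token", "label"])
    for index, (token, label) in enumerate(zip(tokens, labels)):
        if label != "O":  # "O" marks tokens that are not PII
            writer.writerow([index, token, label])
    return buffer.getvalue()


# Hypothetical model output for a short uploaded text.
report_text = build_report(
    ["Email", "jane@example.com", "now"],
    ["O", "B-EMAIL", "O"],
)
```

The returned string can be shown as a preview or written to a file for download.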

Our application is not currently hosted anywhere, so to use it you will need to clone our repo and run it locally. Instructions are in the README.md file in the repository.

Dynamic Graphs in Taipy using Plotly

All of the graphs in our Taipy web application are built with Plotly, which makes them dynamic: when the data changes, the associated graphs update as well.

Plotly also lets you drill into a visualization, so you can view each graph at the level of detail you want.

  • Interactive graphs
  • Dynamically update with data changes
  • Color scheme changes based on selected website theme

Visualizations

This section displays the graphs created during Exploratory Data Analysis, as well as the performance metrics gathered for our model.

Most Common Types of PII

PII Distribution

Location of PII in Text

Performance Metrics of Model

Length of Essays

Length of Essays (D3.js Version)

Team

The Algorithm Allies is a team of dedicated, hardworking, and knowledgeable Data Science and Analytics students. Our goal is to make a meaningful, positive impact on society. We aim to leverage Machine Learning, along with other analytical processes, to drive solutions to initiatives such as the one in this project.

Pratik Chaudhari

Project Documenter & Data Modeler

Cody Ledford

Client Liaison & Data Visualizer

Manu Achar

Project Manager & Data Analyzer