Build your own Resume Parser Using Python and NLP

admin, March 9, 2023

Let’s start by getting one thing straight. A resume is a brief, one- or two-page summary of your skills and experience, while a CV (curriculum vitae) is more detailed and gives a longer account of the candidate’s background. With that in mind, let’s dive into building a parser tool using Python and basic natural language processing techniques.

Resumes are a great example of unstructured data. There is no universally accepted resume layout: every resume has its own formatting style, and the blocks of text and even the category titles vary greatly. I don’t even need to mention how challenging it is to parse multilingual resumes.

One of the misconceptions about creating a resume parser is that it’s an easy task. No, it’s not.

Consider just one sub-problem: predicting the name of the candidate from the resume.

There are millions of personal names around the world, ranging from Björk Guðmundsdóttir to 毛泽东, and from Наина Иосифовна to Nguyễn Tấn Dũng. Many cultures are accustomed to using middle initials, as in Maria J. Sampson, while others make extensive use of prefixes, as in Ms. Maria Brown. Trying to build a database of names is a hopeless effort because you will never keep up with it.

So, now that we understand how complex this is, shall we start building our own resume parser?

Understand the tools

We will use Python 3 because of the wide range of libraries already available and because of its general acceptance in the field of data science.

We will also use nltk for NLP (natural language processing) tasks such as stop word filtering and tokenization, and docx2txt and pdfminer.six for extracting text from MS Word and PDF files.

We assume that you already have Python 3 and pip3 installed on your system, possibly inside a virtualenv. We will not go into the details of installing them. We also assume that you are running a POSIX-based system such as Linux (Debian-based) or macOS.

Convert CV to plain text

Extracting text from PDF files

Let’s start by extracting text from PDF files with pdfminer. You can install it using the pip3 utility (the Python package installer) or compile it from source (not recommended). Using pip is as simple as running pip3 install pdfminer.six at the command prompt.

With pdfminer, you can easily extract text from PDF files using the following code.
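A minimal sketch of this step, assuming pdfminer.six is installed and using its high-level extract_text helper. The function name extract_text_from_pdf and the explicit file check are our own choices for illustration.

```python
from pathlib import Path


def extract_text_from_pdf(pdf_path):
    """Return the text content of the PDF at pdf_path as a single string."""
    # Fail early with a clear error if the path does not point to a file.
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(pdf_path)

    # Imported here so that this module still loads on a machine where
    # pdfminer.six has not been installed yet.
    from pdfminer.high_level import extract_text

    return extract_text(str(pdf_path))
```

Calling extract_text_from_pdf with the path to a resume returns its text as one string, which you can print or pass on to the field extractors later in this article.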

Pretty simple, right? PDF files are very popular for resumes, but some people prefer the docx and doc formats. Let’s continue by extracting text from these formats as well.

Extracting text from docx files

To extract text from docx files, the procedure is quite similar to what we did for PDF files. Let’s install the required dependency with pip (pip3 install docx2txt) and then write the code that does the actual work.

And the code is as follows:
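A minimal sketch, assuming docx2txt is installed and using its process function. The function name extract_text_from_docx and the tab replacement are illustrative choices.

```python
from pathlib import Path


def extract_text_from_docx(docx_path):
    """Return the text of a .docx file, or None if the document is empty."""
    # Fail early with a clear error if the path does not point to a file.
    if not Path(docx_path).is_file():
        raise FileNotFoundError(docx_path)

    # Imported here so that this module still loads without docx2txt installed.
    import docx2txt

    text = docx2txt.process(str(docx_path))
    # Tabs usually come from table cells or tab stops; plain spaces are
    # easier to work with in the later extraction steps.
    return text.replace("\t", " ") if text else None
```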

So simple. However, a problem arises when we try to extract text from old-fashioned doc files. This format is not handled correctly by the docx2txt package, so we will use a trick to extract text from it. Read on.

Extracting text from doc files

To extract text from doc files, we’ll use Pete Warden’s neat and extremely powerful command line tool catdoc.

catdoc reads MS Word files and writes readable text to stdout, just like the Unix cat command. We will install it using the apt tool since we are running on Ubuntu Linux (for example, sudo apt-get install catdoc). Use your favorite package manager instead if you prefer, or build the utility from source.

Once that is done, we can write the code that spawns a subprocess, captures its standard output in a string variable, and returns it, just as we did for the PDF and docx formats.
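A minimal sketch of the subprocess approach, assuming catdoc is on the PATH. The -w flag disables catdoc's line wrapping. The command parameter is our own addition so that a different converter can be swapped in.

```python
import subprocess


def extract_text_from_doc(doc_path, command=("catdoc", "-w")):
    """Run a converter on doc_path and return what it prints to stdout."""
    result = subprocess.run(
        [*command, str(doc_path)],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

If catdoc is not installed, subprocess.run raises FileNotFoundError, so the caller can report a missing dependency rather than an empty resume.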

Now that we have the resume in text format, we can start extracting specific fields from it.

Extracting Fields from Resumes

Extracting Names from Resumes

It may seem simple, but in reality one of the most difficult tasks in resume parsing is extracting the person’s name. There are millions of names around the world, and in a globalized world a resume can arrive from anywhere.

This is where natural language processing comes into play. Let’s start by installing a new library called nltk (Natural Language Toolkit), which is very popular for such tasks. You can install it with pip3 install nltk.

Now it’s time to write code that uses the Named Entity Recognition (NER) functionality of nltk to do so.
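A minimal sketch, assuming nltk is installed and its tokenizer, part-of-speech tagger and named entity chunker data have been fetched with nltk.download(). The resource names differ between nltk versions, so check the error message nltk prints if one is missing. The helper names are illustrative.

```python
def person_names_from_chunks(chunks):
    """Collect PERSON entities from the nodes of an nltk.ne_chunk() tree."""
    names = []
    for chunk in chunks:
        # Plain (word, tag) tuples have no label(); only entity subtrees do.
        if hasattr(chunk, "label") and chunk.label() == "PERSON":
            names.append(" ".join(word for word, _tag in chunk.leaves()))
    return names


def extract_names(resume_text):
    """Return every PERSON entity that nltk finds in resume_text."""
    import nltk  # imported here so the helper above works without nltk

    names = []
    for sentence in nltk.sent_tokenize(resume_text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        names.extend(person_names_from_chunks(nltk.ne_chunk(tagged)))
    return names
```

The first name in the returned list is a reasonable first guess for the candidate, since the name usually sits at the top of a resume.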

In practice, nltk’s person name recognition is far from accurate. Run this code and see whether it works for you. If not, you can try Stanford University’s NER model. See Listendata for detailed instructions.

Extract phone numbers from resumes

Unlike personal names, phone numbers are much easier to work with. In general, a simple regex will suffice for most cases. Try the code below to extract phone numbers from a resume. You can change the regular expression to your liking.
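A minimal sketch of the regex approach. The pattern and the digit-count bounds are assumptions chosen for illustration (E.164 numbers have at most 15 digits), not a definitive phone number grammar.

```python
import re

# Optional "+" or "(", a non-zero digit, then digits and common separators,
# ending in a digit.
PHONE_RE = re.compile(r"[+(]?[1-9][0-9 .\-()]{8,}[0-9]")


def extract_phone_number(resume_text):
    """Return the first phone-like string in resume_text, or None."""
    for match in PHONE_RE.finditer(resume_text):
        candidate = match.group()
        digit_count = len(re.sub(r"\D", "", candidate))
        # Skip matches with too few or too many digits to be a phone number.
        if 7 <= digit_count <= 15:
            return candidate
    return None
```

A pattern this loose can also match other digit runs, such as a year range like 2015 - 2019, so tighten it if your resumes contain many dates.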

Extract email addresses from resumes

Similar to extracting phone numbers, this is pretty easy. Just write a regular expression and extract the email addresses from the resume. The one that appears first is generally the applicant’s actual email address, since people tend to place their contact information in the header of their resume.
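A minimal sketch of this idea. The pattern is a deliberately simple assumption and does not cover every address that the email standards allow.

```python
import re

# Local part, "@", domain labels, a dot, and an alphabetic top-level domain.
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.IGNORECASE)


def extract_emails(resume_text):
    """Return all email-like strings in the order they appear."""
    return EMAIL_RE.findall(resume_text)
```

Because the list keeps document order, the first element is the address closest to the top of the resume.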

Extract Skills from Resume

You’ve done well so far. This is the section where things get trickier. Extracting skills from text is a very challenging task, and to increase accuracy you need a database or an API to verify whether a piece of text is a skill or not.

Take a look at the following code. It first uses the nltk library to filter out the stop words and generates tokens.
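A minimal sketch, assuming nltk is installed along with its stopwords corpus and tokenizer data. SKILLS_DB is a tiny placeholder set; a real parser would load a much larger list. The function names are illustrative.

```python
# Placeholder skills list used only for illustration.
SKILLS_DB = {"machine learning", "data science", "python", "word", "excel"}


def find_skills(tokens, stop_words, skills_db=SKILLS_DB):
    """Match single words, bigrams and trigrams of tokens against skills_db."""
    # Keep alphabetic tokens only and drop stop words.
    words = [t.lower() for t in tokens if t.isalpha() and t.lower() not in stop_words]
    found = set()
    for n in (1, 2, 3):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in skills_db:
                found.add(phrase)
    return found


def extract_skills(resume_text):
    """Tokenize resume_text with nltk and return the skills found in it."""
    import nltk  # imported here so find_skills works without nltk
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))
    return find_skills(nltk.word_tokenize(resume_text), stop_words)
```

Checking bigrams and trigrams as well as single words is what lets multi-word skills such as "machine learning" match.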

It’s not easy to maintain an up-to-date database of skills for each individual industry. You might want to take a look at the Skills API, which offers you a simple and affordable alternative to maintaining your own skills database. The Skills API provides over 70,000 skills, well organized and frequently updated. Just check the code below and see how easy it would be to extract the skills from a resume using the Skills API.

Before proceeding, you must first install a new dependency called requests, also using pip (pip3 install requests).

Below is the source code that uses the Skills API.
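A hedged sketch of what such a lookup could look like. Everything specific to the service here is an assumption: the URL is a placeholder, and the q and count parameters, the apikey header and the JSON list response must be checked against the provider's documentation. The http_get argument is meant to be requests.get.

```python
# Placeholder values, NOT the real endpoint or a real key.
SKILLS_API_URL = "https://example.invalid/skills"
API_KEY = "YOUR_API_KEY"


def skill_exists(skill_name, http_get, api_url=SKILLS_API_URL, api_key=API_KEY):
    """Ask the (assumed) skills endpoint whether skill_name is a known skill."""
    response = http_get(
        api_url,
        params={"q": skill_name, "count": 1},  # assumed query parameters
        headers={"apikey": api_key},           # assumed authentication header
        timeout=10,
    )
    if response.status_code != 200:
        raise RuntimeError(f"Skills lookup failed with HTTP {response.status_code}")
    # Assumed response shape: a JSON list of matching skill names.
    return any(item.lower() == skill_name.lower() for item in response.json())


def extract_skills_via_api(candidate_phrases, http_get):
    """Keep only the candidate phrases that the API recognizes as skills."""
    return {phrase for phrase in candidate_phrases if skill_exists(phrase, http_get)}
```

With requests installed, a call would look like extract_skills_via_api(phrases, requests.get), where phrases holds the words, bigrams and trigrams produced by the tokenization step.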

Extract education and schools from resumes

Once you understand the principles of skills extraction, you will feel more comfortable with extracting education and schools.

Not surprisingly, there are many ways to do this.

First, you can use a database that contains all (or most) school names from around the world.

You can use our school names database to train your own NER model with spaCy or another NLP framework, but we will follow a much simpler route. We search for words like “university” and “college” in the named entities that are tagged as organizations. Believe it or not, it works quite well in most cases. You can extend the list of reserved_words in the code as you wish.

Similar to the extraction of person names, we first filter out the stop words and punctuation marks. Second, we store all named entities of the organization type in a list and check whether they contain any of the reserved words.

Take a look at the following code; it speaks for itself.
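A minimal sketch of this approach, assuming nltk is installed with the same tokenizer, tagger and chunker data used for name extraction. The entries in RESERVED_WORDS and the function names are illustrative.

```python
# Word stems that hint at an educational institution; extend as needed.
RESERVED_WORDS = ("school", "college", "univers", "academy", "faculty",
                  "institute", "polytechnic")


def filter_schools(organizations, reserved_words=RESERVED_WORDS):
    """Keep organization names that contain a reserved word, without repeats."""
    schools = []
    for org in organizations:
        lowered = org.lower()
        if any(word in lowered for word in reserved_words) and org not in schools:
            schools.append(org)
    return schools


def extract_education(resume_text):
    """Return the school-like ORGANIZATION entities found by nltk."""
    import nltk  # imported here so filter_schools works without nltk

    organizations = []
    for sentence in nltk.sent_tokenize(resume_text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for chunk in nltk.ne_chunk(tagged):
            # Only entity subtrees have a label(); plain tokens are tuples.
            if hasattr(chunk, "label") and chunk.label() == "ORGANIZATION":
                organizations.append(" ".join(word for word, _tag in chunk.leaves()))
    return filter_schools(organizations)
```

Matching on stems such as "univers" rather than full words lets one entry cover "University", "Universität" and similar spellings.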

Resume parsing is difficult. There are hundreds of ways to do it. We have described only one simple way, and unfortunately you should not expect miracles from it. It may work for some layouts and fail for others.

If you need a professional solution, take a look at our hosted solution, the Resume Parser API. It is well maintained and supported by its API provider, who is also the maintainer of the Skills API. It comes pre-trained with thousands of different resume layouts and is the cheapest solution on the market compared to others.

A free tier is available, and no credit card is required upon registration. Just check whether it suits your needs or not. Feel free to leave comments below. Every contribution is appreciated.