What is Information Extraction?
Information extraction is the process of extracting information from unstructured textual sources to enable finding entities as well as classifying and storing them in a database. Semantically enhanced information extraction (also known as semantic annotation) couples those entities with their semantic descriptions and connections from a knowledge graph. By adding metadata to the extracted concepts, this technology solves many challenges in enterprise content management and knowledge discovery.
For example, suppose we’re going through a company’s financial information spread across a few documents. Usually, we search for the required information when the data is digital, or check it manually otherwise. But with information extraction NLP algorithms, we can automate the extraction of all required information, such as tables, company growth metrics, and other financial details, from various kinds of documents (PDFs, Word files, images, etc.).
How Does Information Extraction Work?
There are many subtleties and complex techniques involved in the process of information extraction, but a good starting point for a beginner is to understand the following building blocks:
Tokenization
Computers don’t understand natural language the way we speak it. Hence, we break the language down, essentially its words and sentences, into tokens and then load those into a program. The process of breaking language down into tokens is called tokenization.
For example, consider a simple sentence: “NLP information extraction is fun”. This could be tokenized into:
- One-word tokens (unigrams): NLP, information, extraction, is, fun
- Two-word phrases (bigram tokens): NLP information, information extraction, extraction is, is fun
- Three-word phrases (trigram tokens): NLP information extraction, information extraction is, extraction is fun
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Print each word-level token on its own line
for token in doc:
    print(token.text)
Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
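To produce the bigram and trigram tokens shown in the list above, we don’t need anything beyond plain Python; a minimal sliding-window sketch:

def ngrams(tokens, n):
    """Return all n-grams (joined as strings) from a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["NLP", "information", "extraction", "is", "fun"]
print(ngrams(tokens, 2))  # ['NLP information', 'information extraction', 'extraction is', 'is fun']
print(ngrams(tokens, 3))  # ['NLP information extraction', 'information extraction is', 'extraction is fun']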
Parts of Speech Tagging
Tagging parts of speech is crucial for information extraction from text, since it helps us understand the context of the text data. We usually refer to text from documents as "unstructured data", data with no defined structure or pattern. Hence, POS tagging gives us the context of words or tokens that we need in order to categorise them in specific ways.
In parts of speech tagging, all the tokens in the text data get categorised into different word categories, such as nouns, verbs, adjectives, prepositions, determiners, etc. This additional information connected to the words enables further processing and analysis, such as sentiment analysis, lemmatization, or any report where we look more closely at a specific class of words.
Here’s a simple Python code snippet using spaCy that returns the parts of speech of a given sentence.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Print each token together with its coarse-grained POS tag
for token in doc:
    print(token.text, token.pos_)
Apple PROPN
is AUX
looking VERB
at ADP
buying VERB
U.K. PROPN
startup NOUN
for ADP
$ SYM
1 NUM
billion NUM
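Since POS tags let us zoom in on a specific class of words, here’s a small follow-up sketch that keeps only the nouns and proper nouns from the same sentence:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Keep only nouns and proper nouns, e.g. as candidate terms for extraction
nouns = [token.text for token in doc if token.pos_ in ("NOUN", "PROPN")]
print(nouns)  # ['Apple', 'U.K.', 'startup']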
Dependency Graphs
Dependency graphs help us find relationships between the words of a sentence using directed graphs. Each relation carries a dependency type (e.g. subject, object, etc.). Following is a figure representing the dependency graph of a short sentence. The arrow directed from the word faster indicates that faster modifies moving, and the label `advmod` assigned to the arrow describes the exact nature of the dependency.
Dependency Graph Example
Similarly, we can build our own dependency graphs using frameworks like NLTK and spaCy. Below is an example:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")

# Starts a local web server and renders the dependency parse in the browser
displacy.serve(doc, style="dep")
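Beyond visualisation, each spaCy token also exposes its dependency label and its head word directly, so we can read relations such as subject and object straight off the parse. A minimal sketch, reusing the earlier sentence:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Each token points at its syntactic head with a typed dependency label
for token in doc:
    if token.dep_ in ("nsubj", "dobj"):
        print(token.text, "--" + token.dep_ + "-->", token.head.text)
# Apple --nsubj--> looking
# startup --dobj--> buying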
An Example of Information Extraction
#1 Information Collection
Firstly, we’ll need to collect the data from different sources to build an information extraction model. For businesses, documents usually live in emails, cloud drives, scanned copies, desktop software, and many other places. Hence, we’ll have to write different scripts to collect and store the information in one place. This is usually done either by using web APIs or by building RPA (Robotic Process Automation) pipelines.
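As a rough illustration of the API route, the sketch below pulls documents from a web endpoint with the requests library; the URL and document IDs are hypothetical placeholders, not a real service:

import os
import requests  # third-party HTTP client

BASE_URL = "https://example.com/api/documents"  # hypothetical endpoint

def download_document(doc_id, dest_path):
    """Fetch one document over HTTP and store it locally in one place."""
    response = requests.get(f"{BASE_URL}/{doc_id}", timeout=30)
    response.raise_for_status()
    with open(dest_path, "wb") as f:
        f.write(response.content)

os.makedirs("raw_docs", exist_ok=True)
for doc_id in ["invoice-001", "report-2021-q4"]:  # hypothetical IDs
    download_document(doc_id, f"raw_docs/{doc_id}.pdf")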
#2 Process Data
After we collect the data, the next step is to process it. Documents usually come in two types: electronically generated (editable) and non-electronically generated (scanned documents). Electronically generated documents can be sent directly into the preprocessing pipelines, while scanned copies first need OCR to read all the data out of the images before it can be preprocessed.
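A minimal sketch of that routing, assuming pdfplumber for editable PDFs and pytesseract for scanned images (any comparable PDF and OCR libraries would do):

import pdfplumber         # text extraction for electronically generated PDFs
import pytesseract        # OCR wrapper around the Tesseract engine
from PIL import Image

def read_document(path):
    """Return raw text: direct extraction for editable PDFs, OCR for scans."""
    if path.lower().endswith(".pdf"):
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    # Non-electronic document: run OCR on the scanned image first
    return pytesseract.image_to_string(Image.open(path))

text = read_document("raw_docs/invoice-001.pdf")  # hypothetical file from step 1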
#3 Choosing the Right Model
As discussed in the sections above, choosing a suitable model mostly depends on the type of data we’re working with. Today, there are several state-of-the-art models we could rely on. Below are some of the frequently used open-source models and benchmarks:
- Named Entity Recognition on CoNLL 2003 (English)
- Key Information Extraction From Documents: Evaluation And Generator
- Deep Reader: Information extraction from Document images via relation extraction and Natural Language
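For instance, named entity recognition, the first task in the list above, ships out of the box with spaCy’s pretrained pipelines; a quick sketch:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Each recognised entity span carries a label such as ORG, GPE, or MONEY
for ent in doc.ents:
    print(ent.text, ent.label_)
# Apple ORG
# U.K. GPE
# $1 billion MONEY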
#4 Evaluation of the Model
Evaluating the model is crucial before we use it in production. This is usually done by creating a testing dataset and computing some key metrics (see the sketch after this list):
- Accuracy: the ratio of correct predictions to the size of the test data.
- Precision: the ratio of true positives to total predicted positives.
- Recall: the ratio of true positives to total actual positives.
- F1-Score: harmonic mean of precision and recall.
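Here’s a minimal sketch of these metrics with scikit-learn on a hypothetical binary test set (the labels below are made up purely for illustration):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical gold labels vs. model predictions on a small test set
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.75
print("Precision:", precision_score(y_true, y_pred))  # 0.75
print("Recall   :", recall_score(y_true, y_pred))     # 0.75
print("F1-Score :", f1_score(y_true, y_pred))         # 0.75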
#5 Deploying Model in Production
The full potential of NLP models is only realised when they are deployed in production. Today, as the world is increasingly digital, these models are hosted on cloud servers with a suitable backend. In most cases Python is used, as it is the handier language when it comes to text data and machine learning. The model is exposed either as an API or packaged as an SDK (software development kit) for integration with business tools.
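As one hedged illustration of the API route, a spaCy pipeline can be wrapped in a small Flask service like the sketch below (Flask and the /extract endpoint name are assumptions here; FastAPI or any other framework would work the same way):

import spacy
from flask import Flask, jsonify, request

app = Flask(__name__)
nlp = spacy.load("en_core_web_sm")  # load the model once at startup

@app.route("/extract", methods=["POST"])  # hypothetical endpoint name
def extract():
    """Return the named entities found in the posted JSON text."""
    text = request.get_json().get("text", "")
    doc = nlp(text)
    return jsonify([{"text": ent.text, "label": ent.label_} for ent in doc.ents])

if __name__ == "__main__":
    app.run(port=8000)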
A Few Applications of Information Extraction
There are several applications of information extraction, especially at large companies and enterprises. However, we can implement IE tasks whenever we work with sizeable textual sources like emails, datasets, invoices, reports, and many more. Following are some of the applications:
- Invoice Automation: Automate the process of invoice information extraction.
- Healthcare Systems: Manage medical records by identifying patient information and their prescriptions.
- KYC Automation: Automate the KYC process by extracting essential information from customers’ identity documents.
- Financial Investigation: Extract important information from financial documents (tax, growth, quarterly revenue, profit/losses).