Named Entity Recognition (NER), also known as named entity extraction, is a subtask of Natural Language Processing (NLP).
A named entity is a real-world object that has been assigned a category, such as a person, place, company, or product, or a value such as a price or a date. Let’s look at an example.
In the sentence “Jim is willing to pay more than 300 dollars for an old iPhone from Apple.” you can find four entities of different types. Assigning each one a type makes it a named entity. Here they are:
“Jim [Person] is willing to pay more than 300 dollars [Amount] for an old iPhone [Product] from Apple [Company]”
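To make the idea concrete, here is a minimal sketch of how the tagged sentence above could be represented as data. The `Entity` class and the `annotate` helper are illustrative names invented for this example, not part of any particular NER product; real tools return similar information (text, category, position) in their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str   # surface form as it appears in the sentence
    label: str  # category assigned to the entity
    start: int  # character offset where the entity begins
    end: int    # character offset just past the end of the entity


SENTENCE = "Jim is willing to pay more than 300 dollars for an old iPhone from Apple."


def annotate(sentence, labeled_phrases):
    """Locate each (phrase, label) pair in the sentence and record its offsets."""
    entities = []
    for phrase, label in labeled_phrases:
        start = sentence.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in sentence")
        entities.append(Entity(phrase, label, start, start + len(phrase)))
    return entities


# The labels here are supplied by hand; an NER system would predict them.
entities = annotate(
    SENTENCE,
    [("Jim", "Person"), ("300 dollars", "Amount"),
     ("iPhone", "Product"), ("Apple", "Company")],
)

for e in entities:
    print(f"{e.text!r:15} {e.label:8} [{e.start}:{e.end}]")
```

Storing character offsets alongside the category lets downstream software highlight or extract the entity without searching the text again.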
NER is typically applied to unstructured text such as letters, contracts, tweets, comments, posts, emails, and other documents or digital content. It helps identify what a text talks about. Combined with other techniques from NLP such as Sentiment Detection, you can gain even more insight. For example, not only do you get to know that someone talks about an iPhone, but also whether they talk about it positively or negatively.
One approach to NER is based on a grammar. A grammar is a system of rules for a specific language that describes how a proper sentence is built in that language. It defines elements such as nouns, verbs, adjectives, and higher-level elements such as noun phrases (e.g. “a billion dollars”) and prepositional phrases (e.g. “at work”). The rules define what combination and order of words and phrases constitute a correct sentence.
Once a grammar is defined, you can use it to break a text down into its constituents. Grammar-based NER uses this breakdown and treats noun phrases, for example, as good candidates for entities. In the next step, it classifies each candidate into a category, which makes it a named entity.
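The breakdown into constituents can be sketched with a toy example. The code below assumes the part-of-speech tags are already known (they are assigned by hand here, using Penn Treebank style tag names) and applies a single, deliberately simplistic noun phrase rule. A real grammar has far more rules, and the names `TAGGED`, `NP_RULE`, and `noun_phrases` are made up for this illustration.

```python
import re

# Hand-assigned part-of-speech tags; a real system would get these from a tagger.
TAGGED = [
    ("Jim", "NNP"), ("is", "VBZ"), ("willing", "JJ"), ("to", "TO"),
    ("pay", "VB"), ("more", "JJR"), ("than", "IN"), ("300", "CD"),
    ("dollars", "NNS"), ("for", "IN"), ("an", "DT"), ("old", "JJ"),
    ("iPhone", "NNP"), ("from", "IN"), ("Apple", "NNP"),
]

# Toy rule: an optional determiner, any adjectives or numbers, then one or more nouns.
NP_RULE = re.compile(r"(?:<DT>)?(?:<JJ>|<CD>)*(?:<NN[SP]?>)+")


def noun_phrases(tagged):
    """Return the word sequences whose tags match the noun phrase rule."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    phrases = []
    for match in NP_RULE.finditer(tag_string):
        # Map character positions in the tag string back to token indices.
        first = tag_string[:match.start()].count("<")
        last = first + match.group().count("<")
        phrases.append(" ".join(word for word, _ in tagged[first:last]))
    return phrases


candidates = noun_phrases(TAGGED)
print(candidates)
```

Each candidate found this way would still need to be classified (Person, Amount, Product, Company) before it counts as a named entity.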
Pros:
This approach typically yields higher accuracy than other approaches.
Cons:
Except for the part that classifies entities into categories such as Person and Place, the approach doesn’t lend itself to automatic learning, so it doesn’t improve on its own.
You need a grammar for every language. Each language has different rules so each has a different grammar.
Social media makes it difficult to apply a grammar because let’s be honest, no one writes 100% grammatically correct sentences in a Facebook comment.
You need computational linguists, who are rare and expensive. These are the experts who create grammars and maintain them in a form that software can use. You probably need one such expert per language unless they are fluent in several languages.
The machine learning approach does not rely on a grammar. Instead, it learns from labeled examples. A labeled example is a sentence in which all the named entities have been correctly tagged by hand. Human data labelers need to review millions of sentences, mark the entities, and tag each one with a category. This massive data set is then used to train a neural network or another machine learning algorithm. A popular one for NER is Conditional Random Fields.
The machine learning algorithm determines from all these samples why a certain entity is of a certain type, and what in a sentence is an entity and what isn’t. The more samples provided, the better the accuracy typically gets.
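As an illustration of what a labeled example can look like, the sketch below converts hand-marked entity spans into per-token tags using the widely used BIO scheme (B = beginning of an entity, I = inside an entity, O = outside any entity). The article does not say which labeling format any given vendor uses, so treat the scheme, the whitespace tokenization, and the `to_bio` helper as assumptions made for this example.

```python
def to_bio(tokens, spans):
    """Turn (first_token, end_token_exclusive, label) spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for first, end, label in spans:
        tags[first] = f"B-{label}"
        for i in range(first + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Simplistic whitespace tokenization, good enough for this sentence.
tokens = "Jim is willing to pay more than 300 dollars for an old iPhone from Apple".split()

# Spans marked by a human labeler, expressed as token index ranges.
spans = [(0, 1, "Person"), (7, 9, "Amount"), (12, 13, "Product"), (14, 15, "Company")]

tags = to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token:10} {tag}")
```

A sequence model such as a Conditional Random Field is then trained on many such token and tag pairs and learns to predict the tag column for unseen sentences.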
Pros:
Linguists are not necessarily needed.
Learning from examples has the advantage that the software can continuously learn from watching users review and correct the results.
Cons:
The approach requires a massive set of data. To get started, free public data sets are available, but to play in the major league of NLP software, the vendors need their own sample sets which range in the millions.
While a grammar for one language can be transferred and adjusted for a related language, a sample set for one language cannot be used for other languages. So you would need another million samples for the next language.
OK, now you understand what NER is and how the manufacturers that sell it make it work. How does that help you? Let’s look at a few examples.
Let’s say you need to understand who the parties are in a contract. Because you need to review thousands of multi-page contracts you cannot afford to put legal personnel at the task. Contract parties are either Persons or Companies typically. And the text is unstructured. So unless you have every possible person’s name and all company names in a huge database, you cannot search for these entities. That’s where NER comes in. NER can detect all the persons and companies, and many NLP software products can also detect the role of each. One is the payor, the other the payee, or one is the lessor and the other the lessee.
Or you have an RPA platform and your robot pulls news about acquisitions from various internet sources. It floods you with hundreds of news articles every day, and your goal is to find the price paid for each acquired company. You could define regular expressions to find dollar amounts mentioned in the text. But oftentimes, the press article doesn’t use plain numbers. It uses phrases such as “several million dollars”, “a high 7 figure amount” or “at least 4 million Euros”. Try that with a regular expression! This is where NER comes to the rescue, because it can find such phrases.
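The limitation of regular expressions is easy to demonstrate. The pattern below is one plausible way to catch explicit amounts; it is an illustrative sketch rather than a recommended production pattern, and the first snippet is an invented sentence added for contrast with the three phrases quoted above.

```python
import re

# Matches explicit amounts such as "$4,000,000" or "300 dollars".
AMOUNT = re.compile(
    r"\$\s?\d[\d,]*(?:\.\d+)?"                       # "$" followed by a number
    r"|\b\d[\d,]*(?:\.\d+)?\s(?:dollars|euros)\b",   # number followed by a currency word
    re.IGNORECASE,
)

snippets = [
    "The company was acquired for $4,000,000.",
    "The buyer paid several million dollars.",
    "Sources mention a high 7 figure amount.",
    "The deal is worth at least 4 million Euros.",
]

for s in snippets:
    print(AMOUNT.findall(s) or "no match", "<-", s)
```

Only the first snippet matches. You could keep extending the pattern with words like “million” and “several”, but every new phrasing needs another rule, which is the maintenance burden that NER is meant to remove.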
The last example: Your company launches a product and calls it “The Bean”. You want to know whether people talk about it, where they talk about it, and how the product trends on social media. Just searching for “bean” will get you thousands of false hits, such as news about the soybean market. NER is able to distinguish between products and other meanings of words and phrases (at least with some customization and tweaking of the settings).
NER is another tool in your toolbox. When your team works with automation software such as an RPA platform, a BPM platform, or a capture product, they will want to use it for as many business needs as possible. Some of these data extraction needs can be solved with classic regular expressions, some require machine learning tools that let you build your own models, and some will require NER. It is good to have that tool in the box when you decide which automation software to buy. If the software includes NER, it enables you to solve business problems that are hard to solve any other way.