Natural language processing: the missing tool



Suppose you want to create a web application. The modern world offers countless software packages designed to make your life easier. You can use a comprehensive framework, or combine a few libraries that solve common tasks such as templating, database management, interactivity, and the like. These libraries provide a consistent interface for handling both the common cases and the exceptional ones that you might not be able to cope with right away.

But amid this abundance of tools there is a significant gap: a library for working with natural language.



"But there are plenty of those!" you might object. "NLTK or LingPipe." Of course, but do you use them? "Well, my project doesn't really need natural language processing."

But in fact you are processing natural language without even realizing it. An action as basic as string concatenation is just a special case of natural language generation, one of the fundamental areas of NLP [1]. And if you need more complex operations, such as forming the plurals of nouns, capitalizing sentences, or changing verb forms, you cannot do without applied linguistics [2]. Yet when you suddenly need the plural form of a noun, you are more likely to cobble together a few regexps from the Internet than to go looking for a suitable NLP library. This is partly the fault of the NLP field itself.

It is worth considering which solved NLP tasks could be useful in your application: keyword generation, conversion to canonical form, language identification, full-text search, autocompletion, topic classification and clustering, automatic summarization, handwriting analysis, or perhaps something else. Few applications need all of these at once, but many would benefit from adding a couple of them. Picture a blog that automatically generates tags, offers full-text search, automatically summarizes entries for its feed, and works out the chronological order of entries from certain features. But very few people implement such features, because the task is far from trivial. Modern solutions to these problems are based on models whose construction requires a huge corpus of linguistic data. In most cases the game does not seem worth the candle, because it is easier to arm yourself with a couple dozen heuristics.
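To make the "couple dozen heuristics" approach concrete, here is a minimal Python sketch of heuristic tag generation for a blog post. It is only an illustration: the function name suggest_tags and the tiny stop-word list are assumptions made for this example, not part of any existing library, and a model-based approach would look quite different.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be far longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "with", "on", "it"}


def suggest_tags(text, limit=3):
    """Return the most frequent words that are not stop words as candidate tags."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]


post = "The parser reads the grammar. The grammar is small, and the parser is fast; the grammar wins."
print(suggest_tags(post, limit=2))
```

A heuristic like this knows nothing about word forms or synonyms ("parser" and "parsers" are counted separately), which is exactly the kind of gap a shared linguistic resource could close.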

Both of these examples point to a problem with the practical application of NLP: existing software assumes that the user wants to build a system devoted entirely to NLP, when all they need is to bolt on a few features. I don't want to have to get a doctorate in applied linguistics just to quickly obtain the plural forms of nouns, or the result of some other well-studied NLP task. We should assume that the user is more likely to use a built-in linguistic model than to train one. Although this introduces some limitations, in principle it still provides more functionality than heuristics do. More importantly, a developer who uses the model and wants to improve it can simply train it on texts specific to the application [3].

My appeal: I would like to see all the functionality of applied NLP, currently in its infancy, collected in one place where it can share linguistic resources and offer a simple interface that will attract developers. I want to see sophisticated yet practically applicable NLP technology made accessible not only to linguists. Finally, I want all of this built on the fundamental principles of NLP, leaving room to improve the initial models and algorithms. Engineers in the NLP field are very careful not to raise false expectations (in contrast to the loud hopes of the 80s). In some areas we are still powerless, and that is all right.

[1] As a field of knowledge, NLP of course does not regard string concatenation as a method. Instead, it studies the generation of text from a functional description of the desired result. One example is pronominal expressions.

[2] This capability (rules for deriving plural forms) is collected in the language/ folder of MediaWiki. It is one of the truly international open-source projects and an amazing source of information about linguistic oddities in foreign languages.

[3] As an example, consider how text generation can help with the localization of applications. Suppose you want to notify the user: "You have three new messages". The simplest solution would be: printf("New messages: %d", numMessages). By taking this shortcut we spare ourselves the need to generate the right numeral and the agreeing form of the word "message".

If you still want to display the notification in a more natural form, the next step is to add a couple of functions: one that converts a number into a numeral and one that generates the right form of the word "message". The end result is something like: printf("You have %s new %s", toNumeral(numMessages), pluralize("message", numMessages)). Since most applications are originally written in English, whose morphology is poor, a simple home-grown solution is enough, and such problems mostly surface during localization.
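As a rough illustration of such a home-grown solution, the sketch below implements the two helpers in Python. The names to_numeral, pluralize and notification are hypothetical stand-ins for the toNumeral and pluralize calls in the text, the numeral range is limited to 0..99, and the plural rules are a handful of English-only heuristics rather than a real morphological model.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# A few irregular nouns; any real list would be much longer.
IRREGULAR = {"man": "men", "child": "children", "mouse": "mice"}


def to_numeral(n):
    """Spell out 0..99 in English; fall back to digits for anything else."""
    if not 0 <= n < 100:
        return str(n)
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def pluralize(noun, count):
    """Pick the singular or plural form of an English noun using simple rules."""
    if count == 1:
        return noun
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    if re.search(r"(s|x|z|ch|sh)$", noun):
        return noun + "es"
    if re.search(r"[^aeiou]y$", noun):
        return noun[:-1] + "ies"
    return noun + "s"


def notification(num_messages):
    return "You have %s new %s" % (to_numeral(num_messages),
                                   pluralize("message", num_messages))


print(notification(3))  # You have three new messages
```

These rules already miss nouns such as "sheep", and they say nothing about languages where the noun form depends on more than the singular/plural distinction, which is where localization starts to hurt.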

However, there is a representation that is invariant for this task. Consider the grammatical dependencies that NLP tools could extract from our sentence:

subj(message-4, you-1)
num(message-4, three-2)
amod(message-4, new-3)
root(ROOT-0, message-4)

Let's ask the question: "Using this data, is it possible to automatically generate a corresponding message in another language that conveys the same information and has a grammatically correct structure?" This is a fundamental problem of natural language generation. (This question is not just machine translation, though, because the message produced may vary depending on our functional goal, which is quite hard to extract directly from the text.) It would undoubtedly be great to have a magic black box that gives us grammatically correct text on demand, but even the text generation tools that exist today can significantly ease the work of translators. In my opinion, this subject deserves careful study.
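To show what "using this data" might mean in practice, here is a small Python sketch that reads the dependency notation listed above into a structure a generator could work from. The names Token, Dependency, parse_dependency and modifiers are assumptions introduced for this example; the sketch only parses and queries the relations, it does not attempt generation in another language.

```python
import re
from collections import namedtuple

Token = namedtuple("Token", "word index")
Dependency = namedtuple("Dependency", "relation head dependent")

# Matches lines of the form relation(head-N, dependent-M).
PATTERN = re.compile(r"(\w+)\((.+)-(\d+), (.+)-(\d+)\)")


def parse_dependency(line):
    """Turn one 'relation(head-N, dependent-M)' line into a Dependency."""
    match = PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a dependency: {line!r}")
    relation, head, head_index, dependent, dependent_index = match.groups()
    return Dependency(relation,
                      Token(head, int(head_index)),
                      Token(dependent, int(dependent_index)))


def modifiers(dependencies, head_word):
    """Map each relation to the word that depends on head_word through it."""
    return {d.relation: d.dependent.word
            for d in dependencies if d.head.word == head_word}


LINES = [
    "subj(message-4, you-1)",
    "num(message-4, three-2)",
    "amod(message-4, new-3)",
    "root(ROOT-0, message-4)",
]

deps = [parse_dependency(line) for line in LINES]
root = next(d.dependent for d in deps if d.relation == "root")
print(root.word, modifiers(deps, root.word))
```

Nothing in this structure is tied to English word order, so a language-specific generator could, in principle, choose its own numeral, adjective agreement and noun form from the same relations.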


Article based on information from habrahabr.ru
