Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interaction between computers and (natural) human languages, in particular with how to program computers to process and analyze large amounts of natural language data.
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
An online automated assistant providing customer service on a web page is a typical application in which natural language processing plays a central role.
History of natural language processing
The history of natural language processing generally starts in the 1950s, although some earlier work can be found. In 1950, Alan Turing published an article entitled “Computing Machinery and Intelligence,” which proposed what is now called the Turing test as a criterion of intelligence.
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three to five years machine translation would be a solved problem. However, real progress was much slower, and funding for machine translation fell sharply after the ALPAC report in 1966, which found that ten years of research had failed to meet expectations. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.
Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in restricted “block worlds” with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes produced startlingly human-like interactions. When the “patient” exceeded its very small knowledge base, ELIZA might give a generic response, for example answering “My head hurts” with “Why do you say your head hurts?”
During the 1970s, many programmers began to write “conceptual ontologies,” which structured real-world information into computer-understandable data structures. Examples include MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981). During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky.
Up until the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. This was due both to the steady increase in computational power (see Moore’s law) and to the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g., transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing. Some of the earliest machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to the existing hand-written rules. However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and research increasingly focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.
Many of the notable early successes in machine translation were due especially to work at IBM Research, which developed successively more sophisticated statistical models. These systems were able to take advantage of existing multilingual textual corpora produced by the Parliament of Canada and the European Union as a result of requirements to translate governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by those systems, which was (and often still is) a major limitation on their success. As a result, a great deal of research has gone into methods of learning effectively from limited amounts of data.
Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms can learn from data that has not been hand-annotated with the desired answers, or from a combination of annotated and non-annotated data. This is generally much more difficult than supervised learning, and for a given amount of input data it typically produces less accurate results. However, an enormous amount of non-annotated data is available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results, provided the algorithm used has a low enough time complexity to be practical.
In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques can achieve state-of-the-art results in many natural language tasks, for example in language modeling, parsing, and many others. Popular techniques include the use of word embeddings to capture the semantic properties of words, and an increase in end-to-end learning of a higher-level task (for example, question answering) instead of relying on a pipeline of separate intermediate tasks (such as part-of-speech tagging and dependency parsing). In some areas, this shift has entailed such substantial changes in how NLP systems are designed that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing.
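To make the notion of word embeddings mentioned above more concrete, here is a minimal illustrative sketch in Python, using tiny hand-picked vectors (not vectors learned from a real corpus), showing how semantic relatedness between words can be measured as the cosine similarity of their embedding vectors.

```python
import numpy as np

# Toy 4-dimensional word vectors, picked by hand purely for illustration;
# real embeddings are learned from large corpora and have hundreds of dimensions.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.7, 0.9, 0.2]),
    "apple": np.array([0.1, 0.1, 0.2, 0.9]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # relatively high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # relatively low
```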
Rule-based NLP vs. statistical NLP
In the early days, many language processing systems were designed by hand-coding a set of rules, such as by writing grammars or devising heuristic rules for stemming. However, such rules are rarely robust to natural language variation.
Since the famous “statistical revolution” of the late 1980s and mid-1990s, much research in natural language processing has relied on machine learning.
The machine learning paradigm instead calls for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples (a corpus is a set of documents, possibly with human or computer annotations).
Many different classes of machine learning algorithms have been applied to natural language processing tasks. These algorithms take as input a large set of “features” generated from the input data. Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the hand-written rule systems that were then common. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature. Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, which produces more reliable results when such a model is included as a component of a larger system.
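As a small illustration of such soft, probabilistic decisions, the sketch below trains a tiny text classifier with scikit-learn on an invented toy dataset (assuming scikit-learn is installed); rather than producing a hard if-then rule, the model attaches real-valued weights to word features and outputs a probability for each class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# A tiny invented training set: sentences labeled "sports" or "politics".
texts = [
    "the team won the match",
    "a great goal in the final",
    "parliament passed the new law",
    "the minister gave a speech",
]
labels = ["sports", "sports", "politics", "politics"]

# Bag-of-words features (word counts) with real-valued weights learned by
# a logistic regression classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression()
clf.fit(X, labels)

# Instead of a single hard answer, the model reports a probability per class.
test = vectorizer.transform(["the team debated the new law"])
print(clf.classes_)
print(clf.predict_proba(test))
```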
Systems based on machine learning algorithms have several advantages over hand-produced rules:
- The learning procedures used during machine learning automatically focus on the most common cases, whereas when writing rules by hand it is often not at all obvious where the effort should be directed.
- Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to unfamiliar input (for example, input containing words or structures that have not been seen before). Generally, handling such input gracefully with hand-written rules, or building systems of hand-written rules that make soft decisions at all, is extremely difficult, error-prone, and time-consuming.
- Systems based on automatically learning the rules can be made more accurate simply by supplying more input data. By contrast, systems based on hand-written rules can only be made more accurate by increasing the complexity of the rules, which is a much harder task. In particular, there is a limit to the complexity of systems based on hand-crafted rules, beyond which they become more and more unmanageable. Creating more data to feed machine learning systems, however, simply requires a corresponding increase in the number of hours worked, generally without significant increases in the complexity of the annotation process.
Major tasks and evaluations
The following is a list of some of the most commonly researched tasks in natural language processing. Note that some of these tasks have direct real-world applications, while others more commonly serve as subtasks that help in solving larger problems.
Although natural language processing tasks are closely intertwined, they are frequently subdivided into categories for convenience. A coarse division is given below.
Syntax
Grammar induction
Produce a formal grammar that describes the syntax of a language.
Lemmatization
The task of removing inflectional endings only and returning the base dictionary form of a word, also known as the lemma.
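For illustration, here is a minimal sketch using NLTK’s WordNet-based lemmatizer (assuming NLTK is installed; the names of the required data resources can vary between NLTK versions):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of WordNet data

lemmatizer = WordNetLemmatizer()

# Inflectional endings are removed and the dictionary form (lemma) is returned.
print(lemmatizer.lemmatize("geese"))             # noun by default -> "goose"
print(lemmatizer.lemmatize("running", pos="v"))  # verb -> "run"
print(lemmatizer.lemmatize("better", pos="a"))   # adjective -> "good"
```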
Morphological segmentation
Separate words into individual morphemes and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e., the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, so it is often possible to skip this task entirely and simply model all possible forms of a word (e.g., “open, opens, opened, opening”) as separate words. In languages such as Turkish or Manipuri, a highly agglutinative Indian language, however, such an approach is not possible, as each dictionary entry may have thousands of word forms.
Part-of-speech tagging
Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, “book” can be a noun (“the book on the table”) or a verb (“to book a flight”); “set” can be a noun, verb, or adjective; and “out” can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to this ambiguity. Chinese is also prone to it because it is a tonal language during verbalization, and such inflection is not readily conveyed by the entities used in the orthography to convey the intended meaning.
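A minimal part-of-speech tagging sketch using NLTK’s default English tagger (assuming NLTK and its tagger data are installed; exact tags and resource names may vary between model and NLTK versions) shows how the same surface form “book” receives different tags in different contexts:

```python
import nltk

# One-time download of the default English tagger model
# (resource names can differ between NLTK versions).
nltk.download("averaged_perceptron_tagger", quiet=True)

# The surface form "book" typically receives a verb tag in the first
# sentence and a noun tag in the second, though exact tags depend on the model.
print(nltk.pos_tag(["book", "a", "flight", "to", "Ottawa"]))
print(nltk.pos_tag(["the", "book", "is", "on", "the", "table"]))
```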
Parsing
Determine the parse tree (grammatical analysis) of a given sentence. The grammar of natural languages is ambiguous, and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human). There are two primary types of parsing: dependency parsing and constituency parsing. Dependency parsing focuses on the relationships between words in a sentence (marking things like primary objects and predicates), whereas constituency parsing focuses on building out the parse tree using a probabilistic context-free grammar (PCFG). (See also stochastic grammar.)
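To illustrate how even a short sentence admits multiple parses, the following sketch uses a deliberately tiny hand-written grammar (invented for illustration) together with NLTK’s chart parser; the prepositional phrase “with the telescope” can attach either to the verb phrase or to the noun phrase, giving two distinct trees.

```python
import nltk

# A deliberately tiny hand-written grammar (invented for illustration) in which
# "I saw the man with the telescope" has two valid parse trees.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Pronoun | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Pronoun -> 'I'
Det -> 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "I saw the man with the telescope".split()
for tree in parser.parse(sentence):
    print(tree)  # prints one tree per reading; here, two distinct trees
```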
Sentence breaking (also known as sentence boundary disambiguation)
Given a chunk of text, find the sentence boundaries. Sentence boundaries are usually marked by periods or other punctuation marks, but these same characters can serve other purposes as well (for example, marking abbreviations).
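As a small illustration (assuming NLTK and its pretrained Punkt model are available; resource names may differ across NLTK versions), the sketch below shows a trained sentence splitter handling periods that mark abbreviations rather than sentence ends:

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # pretrained Punkt sentence-splitting model

text = "Dr. Smith arrived at 3 p.m. yesterday. He left early."
# A naive split on "." would also break after "Dr." and "p.m."; the trained
# model treats those periods as abbreviation markers, not sentence boundaries.
print(sent_tokenize(text))
```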
Stemming
The process of reducing inflected (or sometimes derived) words to their root form (for example, “close” is the root of “closed,” “closing,” “close,” and “closer”).
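A minimal stemming sketch using NLTK’s Porter stemmer (assuming NLTK is installed); note that, unlike lemmatization, the resulting stems are not always dictionary words:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The Porter stemmer strips suffixes using heuristic rules; the resulting stem
# is not guaranteed to be a dictionary word.
for word in ["closed", "closing", "close", "closer"]:
    print(word, "->", stemmer.stem(word))
```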
Word segmentation
Separate a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages such as Chinese, Japanese, and Thai do not mark word boundaries in this way, and in those languages text segmentation is a significant task requiring knowledge of the vocabulary and morphology of the language. This process is also sometimes used in cases like bag-of-words (BOW) creation in data mining.
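The following is a minimal sketch of dictionary-based greedy “maximum matching” segmentation, a classic baseline for scripts written without spaces; the tiny dictionary and the unspaced input are invented for illustration, and real systems use much richer lexical and statistical knowledge.

```python
# Dictionary-based greedy "maximum matching" segmentation: at each position,
# take the longest word found in the dictionary. The tiny dictionary and the
# unspaced input below are invented for illustration.
DICTIONARY = {"the", "table", "is", "in", "there", "here", "therein"}

def max_match(text, dictionary, max_word_len=7):
    """Greedily take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_word_len), i, -1):
            candidate = text[i:j]
            # Fall back to a single character if no dictionary word matches.
            if candidate in dictionary or j == i + 1:
                tokens.append(candidate)
                i = j
                break
    return tokens

print(max_match("thetableisthere", DICTIONARY))
# -> ['the', 'table', 'is', 'there']; greedy longest-match can still
# mis-segment when a longer dictionary word overlaps the intended boundary.
```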
Terminology extraction
The purpose of terminology extraction is to automatically extract relevant terms from a given corpus.
Semantics
Lexical semantics
What is the computational meaning of each word in the context of the text?
Distributional semantics
How can we derive semantic representations from data?
Machine translation
Automatically translate text from one human language to another. This is one of the most difficult problems, and it belongs to a class of problems colloquially termed “AI-complete,” in the sense that solving it properly requires all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.).
Named entity recognition (NER)
Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g., person, location, organization). Note that, although capitalization can aid in recognizing named entities in languages such as English, this information cannot help in determining the type of entity, and it is often inaccurate or insufficient. For example, the first letter of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (such as Chinese or Arabic) have no capitalization at all, and even languages with capitalization may not use it consistently to distinguish names. For example, German capitalizes all nouns, regardless of whether they are names, and French and Spanish do not capitalize names that serve as adjectives.
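A minimal named entity recognition sketch, assuming spaCy and its small English model en_core_web_sm are installed (the example sentence is invented):

```python
# Assumes spaCy is installed and its small English model has been downloaded
# with:  python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Jim bought 300 shares of Acme Corp. in 2006 in Toronto.")

for ent in doc.ents:
    # ent.label_ is the predicted entity type (e.g. PERSON, ORG, GPE, DATE).
    print(ent.text, ent.label_)
```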
Natural language generation
Convert information from computer databases or semantic intents into readable human language.
Natural language understanding
Convert chunks of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate. Natural language understanding involves identifying the intended meaning among the multiple possible meanings that can be derived from a natural language expression, which usually takes the form of organized notations of natural language concepts. Introducing and constructing an ontology and language metamodel is an efficient though empirical solution. An explicit formalization of natural language semantics, without confusion with implicit assumptions such as the closed-world assumption (CWA) versus the open-world assumption, or subjective yes/no versus objective true/false, is expected to form the basis of semantic formalization.
Optical character recognition (OCR)
Given an image representing printed text, determine the corresponding text.
Question answering
Given a question in human language, determine its answer. Typical questions have a specific correct answer (such as “What is the capital of Canada?”), but sometimes open-ended questions are also considered (such as “What is the meaning of life?”). Recent work has addressed even more complex questions.
Recognizing textual entailment
Given two text fragments, determine whether one being true entails the other, entails the other’s negation, or allows the other to be either true or false.
Relationship extraction
Given a chunk of text, identify the relationships among named entities (for example, who is married to whom).
Opinion mining (see also multimodal sentiment analysis)
Extract subjective information, usually from a set of documents, often using online reviews to determine the “polarity” about specific objects. This is especially useful for identifying trends of public opinion on social media, for example for marketing purposes.
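As a small illustration (assuming NLTK and its VADER lexicon are available), the sketch below scores the polarity of two invented example sentences:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # lexicon used by the VADER analyser

sia = SentimentIntensityAnalyzer()
# The "compound" value summarises polarity from -1 (negative) to +1 (positive).
print(sia.polarity_scores("The new phone is fantastic, I love it!"))
print(sia.polarity_scores("Terrible battery life, very disappointed."))
```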
Topic segmentation and recognition
Given a chunk of text, separate it into segments, each of which is devoted to a topic, and identify the topic of each segment.
Word sense disambiguation
Many words have more than one meaning; we must select the meaning that makes the most sense in context. For this problem, we are typically given a list of words and their associated senses, for example from a dictionary or from an online resource such as WordNet.
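For illustration, here is a minimal sketch using the classic Lesk algorithm as implemented in NLTK (assuming NLTK and its WordNet data are installed); Lesk picks the WordNet sense whose dictionary gloss overlaps most with the surrounding context, and is a simple baseline rather than a state-of-the-art method.

```python
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)  # WordNet provides the sense inventory

context = "I went to the bank to deposit my money".split()
sense = lesk(context, "bank")  # picks the sense whose gloss overlaps the context most
if sense is not None:
    print(sense.name(), "-", sense.definition())
```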
Discourse
Automatic summarization
Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as research papers or articles in the financial section of a newspaper.
Coreference resolution
Given a sentence or larger chunk of text, determine which words (“mentions”) refer to the same objects (“entities”). Anaphora resolution is a specific example of this task, and is concerned with matching up pronouns with the nouns or names to which they refer. The more general task of coreference resolution also includes identifying so-called “bridging relationships” involving referring expressions. For example, in a sentence such as “He entered John’s house through the front door,” “the front door” is a referring expression, and the bridging relationship to be identified is the fact that the door being referred to is the front door of John’s house (rather than of some other structure that might also be referred to).
Discourse analysis
This rubric includes several related tasks. One task is to identify the discourse structure of a connected text, that is, the nature of the discourse relationships between sentences (e.g., elaboration, explanation, contrast). Another possible task is recognizing and classifying the speech acts in a chunk of text (e.g., yes-no question, content question, statement, assertion, etc.).
Speech
Speech recognition
Given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of text to speech and is one of the extremely difficult problems colloquially termed “AI-complete” (see above). In natural speech there are hardly any pauses between successive words, and thus speech segmentation is a necessary subtask of speech recognition (see below). Note also that in most spoken languages, the sounds representing successive letters blend into each other in a process called coarticulation, so the conversion of the analog (continuous) signal to discrete characters can be a very difficult process. Also, given that words in the same language are spoken by people with different accents, speech recognition software must be able to recognize a wide variety of inputs as being identical to each other in terms of their textual equivalent.
Speech segmentation
Given a sound clip of a person or group of people speaking, separate it into words. This is a subtask of speech recognition and is typically grouped with it.
Text to speech
Given a text, transform its units and produce a spoken representation. Text-to-speech can be used to aid people with impaired vision.
Dialogue
In 2018, the first work in this field generated by an artificial intelligence, 1 the Road, was published and marketed as a novel. The novel contains sixty million words.