Natural Language Processing
History
Main article: History of natural language processing
The history of NLP generally starts in the 1950s, although work can be found from earlier
periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence"
which proposed what is now called the Turing test as a criterion of intelligence.
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty
Russian sentences into English. The authors claimed that within three or five years, machine
translation would be a solved problem.[2] However, real progress was much slower, and after the
ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations,
funding for machine translation was dramatically reduced. Little further research in
machine translation was conducted until the late 1980s, when the first statistical machine
translation systems were developed.
Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural
language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA,
a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and
1966. Using almost no information about human thought or emotion, ELIZA sometimes
provided a startlingly human-like interaction. When the "patient" exceeded the very small
knowledge base, ELIZA might provide a generic response, for example, responding to "My head
hurts" with "Why do you say your head hurts?".
During the 1970s many programmers began to write 'conceptual ontologies', which structured
real-world information into computer-understandable data. Examples are MARGIE (Schank,
1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM
(Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981). During this time,
many chatterbots were written including PARRY, Racter, and Jabberwacky.
Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting
in the late 1980s, however, there was a revolution in NLP with the introduction of machine
learning algorithms for language processing. This was due to both the steady increase in
computational power resulting from Moore's Law and the gradual lessening of the dominance of
Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical
underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning
approach to language processing.[3] Some of the earliest-used machine learning algorithms, such
as decision trees, produced systems of hard if-then rules similar to existing hand-written rules.
However, part-of-speech tagging introduced the use of hidden Markov models to NLP, and
increasingly, research has focused on statistical models, which make soft, probabilistic decisions
based on attaching real-valued weights to the features making up the input data. The cache
language models upon which many speech recognition systems now rely are examples of such
statistical models. Such models are generally more robust when given unfamiliar input,
especially input that contains errors (as is very common for real-world data), and produce more
reliable results when integrated into a larger system comprising multiple subtasks.
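As an illustration of the kind of statistical model described above, the following sketch tags a short sentence with a toy hidden Markov model decoded by the Viterbi algorithm. The tag set, transition probabilities, and emission probabilities are all invented for illustration; a real tagger would estimate them from an annotated corpus.

```python
# Minimal Viterbi decoder for a toy HMM part-of-speech tagger.
# All probabilities below are invented for illustration only.
import math

tags = ["DET", "NOUN", "VERB"]

# P(tag | previous tag), with "<s>" as the start state.
trans = {
    "<s>":  {"DET": 0.6,  "NOUN": 0.3,  "VERB": 0.1},
    "DET":  {"DET": 0.05, "NOUN": 0.85, "VERB": 0.10},
    "NOUN": {"DET": 0.10, "NOUN": 0.30, "VERB": 0.60},
    "VERB": {"DET": 0.50, "NOUN": 0.40, "VERB": 0.10},
}

# P(word | tag); unseen words get a small smoothing probability.
emit = {
    "DET":  {"the": 0.7, "a": 0.3},
    "NOUN": {"dog": 0.4, "walk": 0.2, "park": 0.4},
    "VERB": {"walks": 0.5, "walk": 0.3, "barks": 0.2},
}
SMOOTH = 1e-6

def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    # best[i][t] = (log-probability of the best path ending in tag t at position i, backpointer)
    best = [{} for _ in words]
    for t in tags:
        best[0][t] = (math.log(trans["<s>"][t]) + math.log(emit[t].get(words[0], SMOOTH)), None)
    for i in range(1, len(words)):
        for t in tags:
            e = emit[t].get(words[i], SMOOTH)
            score, prev = max(
                (best[i - 1][p][0] + math.log(trans[p][t]) + math.log(e), p)
                for p in tags
            )
            best[i][t] = (score, prev)
    # Trace back from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

print(viterbi("the dog walks".split()))   # -> ['DET', 'NOUN', 'VERB']
```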
Many of the notable early successes occurred in the field of machine translation, due especially to
work at IBM Research, where successively more complicated statistical models were developed.
These systems were able to take advantage of existing multilingual textual corpora that had been
produced by the Parliament of Canada and the European Union as a result of laws calling for the
translation of all governmental proceedings into all official languages of the corresponding
systems of government. However, most other systems depended on corpora specifically
developed for the tasks implemented by these systems, which was (and often continues to be) a
major limitation in the success of these systems. As a result, a great deal of research has gone
into methods of more effectively learning from limited amounts of data.
Recent research has increasingly focused on unsupervised and semi-supervised learning
algorithms. Such algorithms are able to learn from data that has not been hand-annotated with the
desired answers, or from a combination of annotated and non-annotated data. Generally, this task
is much more difficult than supervised learning, and typically produces less accurate results for a
given amount of input data. However, there is an enormous amount of non-annotated data
available (including, among other things, the entire content of the World Wide Web), which can
often make up for the inferior results.
Modern NLP algorithms are based on machine learning, especially statistical machine learning.
The paradigm of machine learning is different from that of most prior attempts at language
processing. Prior implementations of language-processing tasks typically involved the direct
hand coding of large sets of rules. The machine-learning paradigm calls instead for using general
learning algorithms (often, although not always, grounded in statistical inference) to
automatically learn such rules through the analysis of large corpora of typical real-world
examples. A corpus (plural, "corpora") is a set of documents (or sometimes, individual
sentences) that have been hand-annotated with the correct values to be learned.
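To make the idea of learning from a hand-annotated corpus concrete, the sketch below uses a tiny invented corpus of (word, tag) pairs and derives a simple "rule" (the most frequent tag for each word) purely by counting. The sentences, words, and tags are hypothetical; they stand in for the large real-world corpora described above.

```python
# A tiny hand-annotated corpus: each sentence is a list of (word, tag) pairs.
# The sentences and tags here are invented purely for illustration.
from collections import Counter, defaultdict

corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("walks", "VERB")],
    [("the", "DET"), ("walk", "NOUN"), ("ends", "VERB")],
    [("dogs", "NOUN"), ("walk", "VERB")],
]

# Count how often each word is annotated with each tag.
tag_counts = defaultdict(Counter)
for sentence in corpus:
    for word, tag in sentence:
        tag_counts[word][tag] += 1

# "Learned rule": tag each known word with its most frequent tag in the corpus.
most_frequent_tag = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

print(most_frequent_tag["dog"])    # 'NOUN' -- a rule recovered from counts, not hand-coded
print(tag_counts["walk"])          # 'walk' is ambiguous here (seen as both NOUN and VERB)
```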
Many different classes of machine learning algorithms have been applied to NLP tasks. These
algorithms take as input a large set of "features" that are generated from the input data. Some of
the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar
to the systems of hand-written rules that were then common. Increasingly, however, research has
focused on statistical models, which make soft, probabilistic decisions based on attaching
real-valued weights to each input feature. Such models have the advantage that they can express
the relative certainty of many different possible answers rather than only one, producing more
reliable results when such a model is included as a component of a larger system.
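As a simplified illustration of attaching real-valued weights to input features and making soft, probabilistic decisions, the sketch below scores candidate part-of-speech tags for a word as weighted sums of binary features and normalizes the scores into a probability distribution with a softmax. The feature names and weights are invented, hand-set numbers; in a real system they would be learned from an annotated corpus (for example by a maximum-entropy / logistic-regression model).

```python
# Soft, probabilistic decision over candidate tags for one word, using
# hand-set, real-valued feature weights (purely illustrative).
import math

def features(word, prev_tag):
    """Binary features describing the word in context."""
    return {
        "ends_in_s": word.endswith("s"),
        "prev_is_DET": prev_tag == "DET",
        "is_capitalized": word[:1].isupper(),
    }

# One real-valued weight per (feature, candidate tag) pair -- invented numbers.
weights = {
    "NOUN": {"ends_in_s": 0.5,  "prev_is_DET": 2.0,  "is_capitalized": 1.0},
    "VERB": {"ends_in_s": 1.5,  "prev_is_DET": -1.0, "is_capitalized": -0.5},
    "ADJ":  {"ends_in_s": -0.5, "prev_is_DET": 0.5,  "is_capitalized": 0.0},
}

def predict(word, prev_tag):
    """Return a probability for every candidate tag, not just a single answer."""
    feats = features(word, prev_tag)
    scores = {
        tag: sum(w for name, w in tag_weights.items() if feats[name])
        for tag, tag_weights in weights.items()
    }
    z = sum(math.exp(s) for s in scores.values())          # softmax normalizer
    return {tag: math.exp(s) / z for tag, s in scores.items()}

print(predict("walks", prev_tag="DET"))
# roughly {'NOUN': 0.82, 'VERB': 0.11, 'ADJ': 0.07} -- the relative certainty is
# preserved, so a larger system can weigh the alternatives instead of being
# handed a single hard choice.
```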
Systems based on machine-learning algorithms have many advantages over hand-produced rules:
The learning procedures used during machine learning automatically focus on the most
common cases, whereas when writing rules by hand it is often not obvious at all where the
effort should be directed.
Automatic learning procedures can make use of statistical inference algorithms to produce
models that are robust to unfamiliar input (e.g. containing words or structures that have not
been seen before) and to erroneous input (e.g. with misspelled words or words accidentally
omitted). Generally, handling such input gracefully with hand-written rules (or, more generally,
creating systems of hand-written rules that make soft decisions) is extremely difficult,
error-prone and time-consuming.
Systems based on automatically learning the rules can be made more accurate simply by
supplying more input data. However, systems based on hand-written rules can only be made
more accurate by increasing the complexity of the rules, which is a much more difficult task. In
particular, there is a limit to the complexity of systems based on hand-crafted rules, beyond
which the systems become more and more unmanageable. By contrast, creating more data as
input to machine-learning systems simply requires a corresponding increase in the number of
man-hours worked, generally without significant increases in the complexity of the annotation
process.
The subfield of NLP devoted to learning approaches is known as Natural Language Learning
(NLL), and its conference CoNLL and peak body SIGNLL are sponsored by ACL, recognizing
their links with Computational Linguistics and Language Acquisition. When the aim of
computational language learning research is to understand more about human language
acquisition, or psycholinguistics, NLL overlaps with the related field of Computational
Psycholinguistics.