Natural Language Processing (NLP)
Natural Language Processing (NLP) involves the use of information technology in the processing of natural human language to achieve practical ends.
NLP combines several overlapping academic disciplines (computer science, linguistics, mathematics, machine learning, data mining and artificial intelligence) with a broad range of practical language applications.
NLP encompasses both written and spoken language, although speech processing is often treated as a separate field.
While there is no shortage of theory in NLP (finite state automata, probability theory, information theory, grammar and linguistics, etc.), the field in general has an "applied" character and is primarily concerned with providing practical solutions to real problems.
NLP Applications
Practical tasks to which NLP is currently applied include:
- Information extraction (IE): extraction of structured information from unstructured machine-readable documents. For example, the extraction of a structure such as MergerBetween(company1,company2,date) from an online news sentence such as "Yesterday, Sydney-based Foo Inc. announced their acquisition of Bar Corp."
- Information retrieval (IR): the storing, searching and retrieval of information. IR is a broader field within computer science (closer to databases), but it increasingly uses NLP techniques (for example, stemming). Web search engines are specialised IR applications.
- Query expansion: reformulation of queries to improve information retrieval performance.
- Machine translation: translation from one human language to another.
- Text summarisation: creation of short summaries of longer texts, e.g. abstracts.
- Topic identification: identification of the topic or topics addressed by a text.
- Text simplification: creation of text versions that are easier to read/process than the originals. For example, the reduction of complex clauses to a series of grammatically simpler independent clauses, with no loss of meaning. It can serve as a pre-processing step for further NLP tasks such as information extraction.
- Text classification: classification of texts by topic and other criteria.
- Author identification: identification of the author of a text.
- Natural language generation (NLG): generation of natural language texts from structured data. For example, the generation of weather reports or medical reports from structured numeric data, or the generation of spoken texts from numeric/visual data such as graphs for the visually impaired.
- Natural language understanding: the understanding (of the meaning) of spoken or written texts by computers. The opposite of NLG.
- Named entity recognition (NER): identification of which tokens in a stream of text map to proper names, such as people or places. Unlike English, not all languages use capitalization to distinguish named entities.
- Optical character recognition (OCR): conversion of graphical data to machine-readable text.
- Question answering: production of human language answers to human language questions. For example, answering the question "What is the capital of Australia?" with "Canberra".
- Speech recognition: conversion of spoken text to written text. The opposite of text-to-speech.
- Text-to-speech: conversion of written texts to spoken texts.
- Text-proofing: identification (and possibly correction) of grammar and spelling errors, or stylistic improvements.
- Foreign language reading and writing aids: assistance with pronunciation and intonation, and lexicon/grammar.
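
To make the information extraction item above more concrete, here is a minimal sketch of how the MergerBetween(company1,company2,date) structure might be pulled out of the example news sentence. It is only an illustration under strong assumptions: the regular expression ACQUISITION and the function extract_merger are hypothetical names invented here, the pattern covers just this one sentence shape, and the date is kept as the relative word found in the text (resolving "Yesterday" to a calendar date would need the publication date). Practical IE systems generally rely on more robust techniques than a single hand-written pattern.

```python
import re
from typing import NamedTuple, Optional


class MergerBetween(NamedTuple):
    """Structured record mirroring MergerBetween(company1,company2,date)."""
    company1: str
    company2: str
    date: str


# Hypothetical pattern (an assumption for this sketch), matching sentences like
# "<date word>, [<place>-based] <Company> announced their acquisition of <Company>"
# where a company name is a run of capitalised words, optionally ending in "."
ACQUISITION = re.compile(
    r"^(?P<date>\w+), (?:\w+[- ]based )?"
    r"(?P<buyer>[A-Z]\w*(?: [A-Z]\w*\.?)*) announced (?:its|their) acquisition of "
    r"(?P<target>[A-Z]\w*(?: [A-Z]\w*\.?)*)"
)


def extract_merger(sentence: str) -> Optional[MergerBetween]:
    """Return a MergerBetween record, or None if the sentence does not fit the pattern."""
    match = ACQUISITION.search(sentence)
    if match is None:
        return None
    return MergerBetween(match["buyer"], match["target"], match["date"])


if __name__ == "__main__":
    sentence = (
        "Yesterday, Sydney-based Foo Inc. announced their acquisition of Bar Corp."
    )
    print(extract_merger(sentence))
```

The sketch returns None for any sentence that does not fit the pattern, which shows why hand-written rules alone tend to give low coverage: a rewording such as "Bar Corp. was acquired by Foo Inc." would need its own pattern.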