Vector space retrieval – Wikipedia

Vector space retrieval (also: the vector space model, VSM) is an information retrieval method in which documents and queries are represented as points in a high-dimensional metric vector space. Documents are scored by the mathematical distance between the query vector and the document vector. The vector space model was first implemented in the SMART system [1], which was developed under the direction of Gerard Salton at Cornell University.

The model underlying this form of information retrieval can be pictured as follows: every word is assigned its own dimension. To determine the point of a document (or of a query) in this vector space, a very simple variant of the vector space model just counts how often each word occurs in the document. The point of the document in the vector space (the document vector) then consists of these word frequencies. For example, a document consisting of the single sentence “The explosion destroys the vegetation” can be described by the vector (0, …, 2, …, 1, …, 1, …, 1, …): the word “the” occurs twice; “explosion”, “destroys” and “vegetation” occur once each; all other words do not occur (0 times).
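The counting variant described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `tokenize` and `count_vector` and the five-word vocabulary are assumptions made for the example, and a real vocabulary would have one dimension per word in the whole collection.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def count_vector(text, vocabulary):
    """Return the word counts of `text`, in the fixed order given by `vocabulary`."""
    counts = Counter(tokenize(text))
    # Counter returns 0 for words that do not occur in the text.
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary, chosen only for this example.
vocabulary = ["destroys", "explosion", "house", "the", "vegetation"]

vector = count_vector("The explosion destroys the vegetation", vocabulary)
print(vector)  # [1, 1, 0, 2, 1]
```

The dimension for “house” stays at 0 because the word does not occur in the sentence, which corresponds to the zeros in the vector above.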

Search queries can be encoded in the same way. A hypothetical query “Does the explosion destroy the vegetation?” corresponds to the same (query) vector (0, …, 2, …, 1, …, 1, …, 1, …), provided the auxiliary “does” is ignored and inflected forms such as “destroy” and “destroys” are mapped to one term, because the representation disregards word order. The problem of finding the documents that best match a query can therefore be solved in the vector space model by looking for those documents whose vector is “similar” to the query vector. A simple approach is, for example, to look for document vectors that are parallel to the query vector or deviate from it by only a small angle.
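One common way to measure how far two vectors deviate in angle is the cosine of the angle between them, cos θ = (q · d) / (|q| |d|), which is 1 for parallel vectors and 0 for orthogonal ones. The following sketch assumes dense count vectors over a shared hypothetical vocabulary; the function name `cosine_similarity` and the example vectors are illustrative, and the text above does not prescribe this particular measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an all-zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

query = [1, 1, 0, 2, 1]
doc_parallel = [2, 2, 0, 4, 2]   # same direction as the query, twice as long
doc_unrelated = [0, 0, 3, 0, 0]  # shares no words with the query

print(cosine_similarity(query, doc_parallel))   # close to 1.0
print(cosine_similarity(query, doc_unrelated))  # 0.0
```

Because only the direction matters, a long document that repeats the query's words in the same proportions scores as high as a short one.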

In practice, vector space models are considerably more complex and take into account, for example, how frequent different words are. Words such as the German “die” (“the”) or “ist” (“is”) occur in almost every German-language document and therefore carry little information, whereas words such as “deoxyribonucleic acid” are rarer and therefore potentially better suited to distinguishing a document's content from that of others.
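A widely used way to express this idea is the inverse document frequency, idf(t) = log(N / df(t)), where N is the number of documents and df(t) the number of documents containing the term t. The text above does not name a specific weighting scheme, so the sketch below is only one possible variant; the function name, the toy documents and the choice to return 0 for unseen terms are assumptions.

```python
import math

def inverse_document_frequency(term, documents):
    """idf = log(N / df) for a collection given as a list of token lists."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        # Assumption: a term that occurs nowhere gets weight 0.
        return 0.0
    return math.log(n_docs / df)

# Toy collection, tokenized by hand for the example.
documents = [
    ["the", "explosion", "destroys", "the", "vegetation"],
    ["the", "vegetation", "recovers"],
    ["the", "acid", "is", "deoxyribonucleic", "acid"],
]

for term in ("the", "vegetation", "deoxyribonucleic"):
    print(term, inverse_document_frequency(term, documents))
```

A word that occurs in every document, like “the” here, receives weight 0, while a word found in a single document receives the highest weight.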

Vector space retrieval requires some preparatory work. The first step is to set up a document vector space and to index the documents, which maps every document of the collection to exactly one point (its document vector) in that space. A variety of feature weighting models exist for this purpose, all of which build on the frequency of features such as terms, lemmas or n-grams in the individual documents and in the entire collection.
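As an illustration of indexing with features other than whole words, the following sketch maps each document to a sparse vector of character trigram frequencies. Whether the n-grams meant above are character-based or word-based is not stated, so the character variant is an assumption, as are the names `char_ngrams` and `index_document` and the two-document collection.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def index_document(text, n=3):
    """Map a document to a sparse feature vector: n-gram -> frequency."""
    return Counter(char_ngrams(text, n))

# Hypothetical collection: document id -> text.
collection = {"d1": "vegetation", "d2": "explosion"}

# Indexing assigns exactly one vector to every document.
index = {doc_id: index_document(text) for doc_id, text in collection.items()}

print(sorted(index["d1"]))
# ['ati', 'ege', 'eta', 'get', 'ion', 'tat', 'tio', 'veg']
```

Storing only the features that occur keeps the vectors small even though the underlying space has one dimension per possible feature.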

Retrieval in the vector space model begins with query indexing, in which the query is mapped to a vector in the vector space. The retrieval function then determines the subset of document vectors that have a certain similarity to the query vector, and the ranking function maps this subset to an ordered list of document vectors. The list of documents corresponding to this list of document vectors is presented to the user who issued the query.
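The two stages described above, a retrieval function that selects sufficiently similar documents and a ranking function that orders them, might look as follows. This is a sketch under stated assumptions: cosine similarity as the similarity measure, a fixed threshold of 0.1, sparse vectors stored as dictionaries, and the names `cosine`, `retrieve` and `rank` are all choices made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {feature: weight} dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_vec, doc_vecs, threshold=0.1):
    """Retrieval function: keep documents whose similarity reaches the threshold."""
    scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return {doc_id: s for doc_id, s in scores.items() if s >= threshold}

def rank(scored):
    """Ranking function: order the retrieved documents by descending similarity."""
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical, already indexed collection and query.
doc_vecs = {
    "d1": {"explosion": 1, "destroys": 1, "vegetation": 1},
    "d2": {"vegetation": 2, "recovers": 1},
    "d3": {"acid": 2},
}
query_vec = {"explosion": 1, "vegetation": 1}

results = rank(retrieve(query_vec, doc_vecs))
print(results)  # ['d1', 'd2']
```

Document d3 shares no feature with the query, so the retrieval function drops it before ranking; d1 precedes d2 because it matches both query terms.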

  • Baeza-Yates, Ricardo; Ribeiro-Neto, Berthier: Modern Information Retrieval. ACM Press, New York, 1999, ISBN 0-201-39829-X.
  • Ferber, Reginald: Information Retrieval: Search Models and Data Mining Procedures for Text Collections and the Web. Heidelberg, 2003, ISBN 3-89864-213-5.
  • Grossman, D. A.; Frieder, O.: Information Retrieval. Springer, Netherlands, 2nd edition, 2004, ISBN 1-4020-3004-5.
  • Kowalski, Gerald; Maybury, M. T.: Information Storage and Retrieval Systems. Kluwer, Boston, 2000.
  • Panyr, Jiří: Automatic Classification and Information Retrieval. Tübingen, 1986.
  • Panyr, Jiří: Vector Space Model and Cluster Analysis in Information Retrieval Systems. In: Nachrichten für Dokumentation 38, pp. 13–20, 1987.
  • Salton, Gerard; McGill, M. J.: Information Retrieval. McGraw-Hill, 1987.
  1. The European Technology Platform on Smart Systems Integration (EPoSS)
  2. Software Framework for Topic Modelling with Large Corpora. In: Gensim. Accessed on February 3, 2019 (English).
  3. A Beginner’s Guide to Word2Vec and Neural Word Embeddings. SKYMIND.AI, accessed on February 3, 2019 (English).