460-4074/01 – Methods of Analysis of Textual Data (MATD)

Guarantor department: Department of Computer Science
Credits: 4
Subject guarantor: doc. Mgr. Jiří Dvorský, Ph.D.
Subject version guarantor: doc. Mgr. Jiří Dvorský, Ph.D.
Study level: undergraduate or graduate
Requirement: Optional
Year: 2
Semester: winter
Study language: Czech
Year of introduction: 2015/2016
Year of cancellation:
Intended for the faculties: FEI, USP
Intended for study types: Follow-up Master
Instruction secured by
Login   Name                            Tutor  Teacher giving lectures
DVO26   doc. Mgr. Jiří Dvorský, Ph.D.
VAS218  Ing. Michal Vašinek, Ph.D.
Extent of instruction for forms of study
Form of study  Way of completion  Extent
Full-time      Graded credit      2+2
Part-time      Graded credit      18+0

Subject aims expressed by acquired skills and competences

The aim of the course is to introduce students to basic and advanced techniques for the analysis of textual data. After finishing the course, the student will be able to: describe and understand different methods of textual data analysis, implement these methods or use existing libraries, and incorporate these methods into their own analyses of specific data.

Teaching methods

Lectures
Tutorials

Summary

The course deals with the basic principles of analysis of text documents. Text documents are treated as a typical representative of weakly structured data. The individual areas of text data processing, covering both documents and web pages, are presented. The course includes algorithms for pattern matching in text, the design of index systems for text data, and work with the natural languages in which texts are written. Various approaches to searching in text data, including methods of latent semantic analysis, are also described. The course closes with a focus on web search.
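As a rough illustration of what an index system for text data can look like, here is a minimal sketch of an inverted index with a Boolean AND query. The function names `build_index` and `boolean_and`, the whitespace tokenizer, and the sample documents are assumptions made for this example and are not taken from the course materials.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase token to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Naive whitespace tokenization; real systems add lexical analysis and stemming.
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def boolean_and(index, terms):
    """Return the ids of documents that contain every query term."""
    result = None
    for term in terms:
        ids = set(index.get(term.lower(), []))
        result = ids if result is None else result & ids
    return sorted(result or [])

# Hypothetical sample collection.
docs = [
    "web search engines rank pages",
    "suffix trees index text",
    "text search on the web",
]
index = build_index(docs)
print(boolean_and(index, ["web", "search"]))  # [0, 2]
```

Storing postings as sorted lists keeps the structure close to what index compression schemes usually operate on, although this sketch intersects them via sets for brevity.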

Compulsory literature:

1. Manning, C. D.; Raghavan, P. & Schütze, H.: Introduction to Information Retrieval, Cambridge University Press, 2008
2. Witten, I. H.; Moffat, A. & Bell, T. C.: Managing Gigabytes (2nd ed.): Compressing and Indexing Documents and Images, Morgan Kaufmann Publishers Inc., 1999, ISBN 1-55860-570-3
3. Baeza-Yates, R. A. & Ribeiro-Neto, B.: Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., 1999, ISBN 020139829X
4. Feldman, R. & Sanger, J.: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006, ISBN 978-0521836579
5. Berry, M. W. & Kogan, J.: Text Mining: Applications and Theory, Wiley, 2010, ISBN 978-0470749821
6. Weiss, S. M.; Indurkhya, N. & Zhang, T.: Fundamentals of Predictive Text Mining, Springer, 2010, ISBN 978-1849962254
7. Langville, A. N. & Meyer, C. D.: Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, 2006
8. Korfhage, R. R.: Information Storage and Retrieval, John Wiley & Sons, 1997

Recommended literature:

1. Witten, I. H.; Gori, M. & Numerico, T. Web Dragons: Inside the Myths of Search Engine Technology, Morgan Kaufmann, 2006

Continuous assessment during the semester

Conditions for granting credit: implementation and presentation of a project, programming solutions to simple exercises, and attendance at exercises.

E-learning

Other requirements

Knowledge of programming and mathematics at bachelor's degree level.

Prerequisites

Subject has no prerequisites.

Co-requisites

Subject has no co-requisites.

Subject syllabus:

A brief outline of the lecture topics:
1. Introduction to information systems. The history and evolution of text retrieval. Differences between database systems and information retrieval (IR) systems. The general model of an information retrieval system.
2. Pattern matching. Single-pattern matching. Aho-Corasick algorithm. Regular expressions, finite automata. Algorithms for approximate pattern matching.
3. Suffix trees. DAWG. Patricia and similar data structures.
4. Primary processing of texts. Lexical analysis. Stemming. Lemmatization. Stop words.
5. Construction of index systems. Zipf's law and the estimated size of the index system. Indexing based on classification. Positional index systems. Methods for weighting terms. TF-IDF term weighting. Methods for compressing index systems. Methods for encoding natural numbers.
6. Query languages. Document relevance. The degree of similarity between a document and a query. Relevance vs. similarity. Query structure and evaluation. Boolean DIS. IR system evaluation (precision, recall, F-measure).
7. Signature methods. Chained and layered signature coding. Efficient evaluation of queries.
8. Latent semantics. Methods for dimension reduction. Methods based on matrix decomposition. Random projection. Vector DIS. Construction and evaluation of the query vector. Other types of DIS (extended Boolean). Indexing, query structure, query evaluation.
9. Web search. Analysis of hypertext documents, structural methods. PageRank and HITS. Metasearch and cooperative search. Application of computational intelligence and soft computing in text search.
10. Methods for automatic summarization: abstraction and extraction. Topic detection and evolution. Sentiment analysis, classification and clustering of documents.
11. Parallel and distributed search. Decentralized and P2P search.
12. Semantic and contextual search. Neural information retrieval.
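To make the TF-IDF weighting and query-vector evaluation named in the outline more concrete, the following is a minimal sketch of ranking documents by cosine similarity to a query. It assumes the common variant with raw term frequency and idf = log(N / df); the function names, the whitespace tokenizer, and the sample documents are hypothetical and chosen only for illustration, and the course may use different weighting schemes.

```python
import math
from collections import Counter

def idf_table(docs_tokens):
    """Inverse document frequency log(N / df) for every term in the collection."""
    n = len(docs_tokens)
    df = Counter(t for tokens in docs_tokens for t in set(tokens))
    return {t: math.log(n / df[t]) for t in df}

def tf_idf(tokens, idf):
    """Sparse TF-IDF vector; terms unknown to the collection are ignored."""
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, doc_id) pairs sorted from most to least similar."""
    docs_tokens = [d.lower().split() for d in docs]
    idf = idf_table(docs_tokens)
    q = tf_idf(query.lower().split(), idf)
    scores = [(cosine(q, tf_idf(tokens, idf)), i)
              for i, tokens in enumerate(docs_tokens)]
    return sorted(scores, reverse=True)

# Hypothetical sample collection.
docs = [
    "pattern matching in text",
    "pattern matching with finite automata",
    "web search and page ranking",
]
ranking = rank("web search", docs)
print(ranking[0][1])  # id of the best-matching document
```

Unlike a Boolean query, this yields a graded ordering, which is the distinction the outline draws between relevance and similarity.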

Conditions for subject completion

Full-time form (validity from: 2015/2016 Winter semester)
Task name      Type of task   Max. number of points (act. for subtasks)  Min. number of points  Max. number of attempts
Graded credit  Graded credit  100                                        51                     3
Mandatory attendance: Participation in the exercises is compulsory and is monitored. The scope of compulsory participation will be communicated to students by the course supervisor at the beginning of the semester.

Conditions for subject completion and attendance at the exercises within ISP: Course completion requirements - Completion of all mandatory tasks within individually agreed deadlines. Attendance at exercises - The level of attendance at exercises is agreed by the student with the course supervisor at the beginning of the semester.

Occurrence in study plans

Academic year  Programme  Branch/spec.  Spec.  Focus  Form  Study language  Tut. centre  Year  W  S  Type of duty
2024/2025 (N0613A140034) Computer Science DS P Czech Ostrava 1 Choice-compulsory type A study plan
2024/2025 (N0613A140034) Computer Science AZD P Czech Ostrava 1 Choice-compulsory type A study plan
2024/2025 (N0613A140034) Computer Science DS K Czech Ostrava 1 Choice-compulsory type A study plan
2024/2025 (N0613A140034) Computer Science AZD K Czech Ostrava 1 Choice-compulsory type A study plan
2024/2025 (N0541A170007) Computational and Applied Mathematics (S01) Applied Mathematics K Czech Ostrava Optional study plan
2024/2025 (N0541A170007) Computational and Applied Mathematics (S01) Applied Mathematics P Czech Ostrava Optional study plan
2024/2025 (N0541A170007) Computational and Applied Mathematics (S02) Computational Methods and HPC P Czech Ostrava Optional study plan
2024/2025 (N0541A170007) Computational and Applied Mathematics (S02) Computational Methods and HPC K Czech Ostrava Optional study plan
2023/2024 (N0613A140034) Computer Science DS K Czech Ostrava 1 Choice-compulsory type A study plan
2023/2024 (N0613A140034) Computer Science AZD K Czech Ostrava 1 Choice-compulsory type A study plan
2023/2024 (N0613A140034) Computer Science DS P Czech Ostrava 1 Choice-compulsory type A study plan
2023/2024 (N0613A140034) Computer Science AZD P Czech Ostrava 1 Choice-compulsory type A study plan
2023/2024 (N0541A170007) Computational and Applied Mathematics (S01) Applied Mathematics P Czech Ostrava Optional study plan
2023/2024 (N0541A170007) Computational and Applied Mathematics (S01) Applied Mathematics K Czech Ostrava Optional study plan
2023/2024 (N0541A170007) Computational and Applied Mathematics (S02) Computational Methods and HPC P Czech Ostrava Optional study plan
2023/2024 (N0541A170007) Computational and Applied Mathematics (S02) Computational Methods and HPC K Czech Ostrava Optional study plan
2023/2024 (N2647) Information and Communication Technology (2612T059) Mobile Technology P Czech Ostrava 2 Optional study plan
2023/2024 (N2647) Information and Communication Technology (2612T059) Mobile Technology K Czech Ostrava 2 Optional study plan
2023/2024 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology P Czech Ostrava 2 Choice-compulsory study plan
2023/2024 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology K Czech Ostrava 2 Choice-compulsory study plan
2022/2023 (N0613A140034) Computer Science DS K Czech Ostrava 1 Choice-compulsory type A study plan
2022/2023 (N0613A140034) Computer Science AZD K Czech Ostrava 1 Choice-compulsory type A study plan
2022/2023 (N0613A140034) Computer Science DS P Czech Ostrava 1 Choice-compulsory type A study plan
2022/2023 (N0613A140034) Computer Science AZD P Czech Ostrava 1 Choice-compulsory type A study plan
2022/2023 (N0541A170007) Computational and Applied Mathematics (S01) Applied Mathematics K Czech Ostrava Optional study plan
2022/2023 (N0541A170007) Computational and Applied Mathematics (S01) Applied Mathematics P Czech Ostrava Optional study plan
2022/2023 (N0541A170007) Computational and Applied Mathematics (S02) Computational Methods and HPC K Czech Ostrava Optional study plan
2022/2023 (N0541A170007) Computational and Applied Mathematics (S02) Computational Methods and HPC P Czech Ostrava Optional study plan
2022/2023 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology P Czech Ostrava 2 Choice-compulsory study plan
2022/2023 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology K Czech Ostrava 2 Choice-compulsory study plan
2022/2023 (N2647) Information and Communication Technology (2612T059) Mobile Technology P Czech Ostrava 2 Optional study plan
2022/2023 (N2647) Information and Communication Technology (2612T059) Mobile Technology K Czech Ostrava 2 Optional study plan
2021/2022 (N0541A170007) Computational and Applied Mathematics (S01) Applied Mathematics P Czech Ostrava Optional study plan
2021/2022 (N0541A170007) Computational and Applied Mathematics (S02) Computational Methods and HPC K Czech Ostrava Optional study plan
2021/2022 (N0541A170007) Computational and Applied Mathematics (S02) Computational Methods and HPC P Czech Ostrava Optional study plan
2021/2022 (N0541A170007) Computational and Applied Mathematics (S01) Applied Mathematics K Czech Ostrava Optional study plan
2021/2022 (N2647) Information and Communication Technology (2612T059) Mobile Technology P Czech Ostrava 2 Optional study plan
2021/2022 (N2647) Information and Communication Technology (2612T059) Mobile Technology K Czech Ostrava 2 Optional study plan
2021/2022 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology P Czech Ostrava 2 Choice-compulsory study plan
2021/2022 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology K Czech Ostrava 2 Choice-compulsory study plan
2020/2021 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology P Czech Ostrava 2 Choice-compulsory study plan
2020/2021 (N2647) Information and Communication Technology (2612T059) Mobile Technology P Czech Ostrava 2 Optional study plan
2020/2021 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology K Czech Ostrava 2 Choice-compulsory study plan
2020/2021 (N2647) Information and Communication Technology (2612T059) Mobile Technology K Czech Ostrava 2 Optional study plan
2020/2021 (N0541A170007) Computational and Applied Mathematics (S01) Applied Mathematics K Czech Ostrava Optional study plan
2020/2021 (N0541A170007) Computational and Applied Mathematics (S02) Computational Methods and HPC P Czech Ostrava Optional study plan
2020/2021 (N0541A170007) Computational and Applied Mathematics (S01) Applied Mathematics P Czech Ostrava Optional study plan
2020/2021 (N0541A170007) Computational and Applied Mathematics (S02) Computational Methods and HPC K Czech Ostrava Optional study plan
2019/2020 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology P Czech Ostrava 2 Choice-compulsory study plan
2019/2020 (N2647) Information and Communication Technology (2612T059) Mobile Technology P Czech Ostrava 2 Optional study plan
2019/2020 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology K Czech Ostrava 2 Choice-compulsory study plan
2019/2020 (N2647) Information and Communication Technology (2612T059) Mobile Technology K Czech Ostrava 2 Optional study plan
2019/2020 (N0541A170007) Computational and Applied Mathematics (S01) Applied Mathematics P Czech Ostrava Optional study plan
2019/2020 (N0541A170007) Computational and Applied Mathematics (S02) Computational Methods and HPC P Czech Ostrava Optional study plan
2019/2020 (N0541A170007) Computational and Applied Mathematics (S01) Applied Mathematics K Czech Ostrava Optional study plan
2019/2020 (N0541A170007) Computational and Applied Mathematics (S02) Computational Methods and HPC K Czech Ostrava Optional study plan
2018/2019 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology P Czech Ostrava 2 Choice-compulsory study plan
2018/2019 (N2647) Information and Communication Technology (2612T059) Mobile Technology P Czech Ostrava 2 Optional study plan
2018/2019 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology K Czech Ostrava 2 Choice-compulsory study plan
2018/2019 (N2647) Information and Communication Technology (2612T059) Mobile Technology K Czech Ostrava 2 Optional study plan
2018/2019 (N2658) Computational Sciences (2612T078) Computational Sciences P Czech Ostrava 2 Choice-compulsory study plan
2017/2018 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology P Czech Ostrava 2 Choice-compulsory study plan
2017/2018 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology K Czech Ostrava 2 Choice-compulsory study plan
2017/2018 (N2647) Information and Communication Technology (2612T059) Mobile Technology P Czech Ostrava 2 Optional study plan
2017/2018 (N2647) Information and Communication Technology (2612T059) Mobile Technology K Czech Ostrava 2 Optional study plan
2017/2018 (N2658) Computational Sciences (2612T078) Computational Sciences P Czech Ostrava 2 Choice-compulsory study plan
2016/2017 (N2658) Computational Sciences (2612T078) Computational Sciences P Czech Ostrava 2 Choice-compulsory study plan
2016/2017 (N2647) Information and Communication Technology (2612T059) Mobile Technology P Czech Ostrava 2 Optional study plan
2016/2017 (N2647) Information and Communication Technology (2612T059) Mobile Technology K Czech Ostrava 2 Optional study plan
2016/2017 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology P Czech Ostrava 2 Choice-compulsory study plan
2016/2017 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology K Czech Ostrava 2 Choice-compulsory study plan
2015/2016 (N2647) Information and Communication Technology (2612T059) Mobile Technology P Czech Ostrava 2 Optional study plan
2015/2016 (N2647) Information and Communication Technology (2612T059) Mobile Technology K Czech Ostrava 2 Optional study plan
2015/2016 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology P Czech Ostrava 2 Choice-compulsory study plan
2015/2016 (N2647) Information and Communication Technology (2612T025) Computer Science and Technology K Czech Ostrava 2 Choice-compulsory study plan

Occurrence in special blocks

Block name  Academic year  Form of study  Study language  Year  W  S  Type of block  Block owner

Assessment of instruction



2022/2023 Summer
2021/2022 Winter
2020/2021 Winter
2019/2020 Winter
2018/2019 Winter
2017/2018 Winter
2016/2017 Winter