Data Science

NATURAL LANGUAGE PROCESSING AND TEXT MINING

Learning objectives

General Objectives
1. Knowledge of the main application scenarios in analyzing collections of textual data using NLP techniques.
2. Knowledge and understanding of the main methodological and analytical challenges.
3. Knowledge of the main data analysis and machine learning techniques for natural language and the primary tools available to implement them.
4. Understanding of the theoretical foundations underlying advanced techniques for textual data analysis and natural language learning.
5. Ability to translate acquired notions into programs that solve specific problems.
6. Knowledge of the main evaluation techniques and their application to practical scenarios.

Specific Objectives

Abilities
- Identify the most suitable text-mining and/or NLP techniques to address a given problem.
- Implement the proposed solution by selecting the most appropriate tools.
- Design and conduct experiments to evaluate proposed solutions under realistic conditions.

Knowledge and Understanding
- Knowledge of the main application scenarios.
- Knowledge of the main analysis techniques.
- Understanding of the theoretical and methodological assumptions underlying the main techniques.
- Knowledge and understanding of the main evaluation techniques and corresponding performance indices.

Applying Knowledge and Understanding
- Translate application requirements into concrete data-analysis problems.
- Identify the most suitable techniques and tools to address those problems.
- Qualitatively estimate the scalability of the proposed solutions in advance.

Critical and Judgment Skills
- Evaluate experimentally the effectiveness, efficiency, and scalability of proposed solutions.

Communication Skills
- Effectively describe the requirements of a problem and communicate the chosen solutions and their rationale to others.

Learning Ability
- Develop independent-study skills on course-related topics and critically consult advanced manuals and scientific literature to tackle new scenarios or apply alternative techniques.

Channel 1

Instructor: FABRIZIO SILVESTRI

Syllabus - Attendance - Exams

Syllabus

Part 1 - Ranking and similarity search
1. Problems of interest. Document ranking. Link analysis: a review of PageRank as a query-independent ranking system. Context-dependent link analysis: Topic-sensitive and Personalized PageRank. Hubs and authorities: the HITS algorithm.
2. Similarity search in high-dimensional data collections: Top-k and (approximate) nearest-neighbour search. Similarity between sets and Jaccard distance. Minwise independent permutations. The ideal case and its analysis. Signatures and similarity estimation. Implementation with universal families of hash functions. Improving accuracy and the banding technique. Estimating the fraction of false positives and false negatives.
3. General properties of the banding technique. Locality Sensitive Hashing for other distances: Hamming distance and angular distance.
4. Other techniques for efficient similarity search. Clustering of points and vector quantization: properties and limits. Product quantization: definition, properties, and implementation. Efficiency of the technique based on product quantization.
5. Graph-based methods. Navigable small-world networks. Kleinberg's model of a navigable network on two-dimensional lattices. Navigable small-world networks on sets of points in Euclidean space. Search and network-construction algorithms.

Part 2 - Dimensionality reduction and clustering
1. A reference application: recommender systems and collaborative filtering.
2. A review of the main aspects of SVD and PCA: variance, approximation of a matrix with respect to the Frobenius norm. Using the SVD for collaborative filtering. Truncated SVD and spectral embeddings.
3. Latent factors. Other matrix decompositions. Decompositions into non-negative factors. Algorithms for computing non-negative factors: the iterative algorithm of Lee and Seung.

Part 3 - Neural networks and natural language
1. Neural networks and large language models
2. Neural Information Retrieval
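As an illustration of the minwise-hashing topics in Part 1, the following is a minimal sketch, not course material. It assumes hash functions of the form h(x) = (a*x + b) mod p, which only approximate minwise independent permutations, and it maps tokens to integers with `zlib.crc32`. All function names, the prime, and the example sets are hypothetical choices made for this sketch.

```python
import random
import zlib

# Assumption: a Mersenne prime larger than any 32-bit element id produced by crc32.
PRIME = (1 << 61) - 1


def make_hash_family(num_hashes, seed=0):
    """Draw (a, b) pairs, each defining h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]


def minhash_signature(items, family):
    """Signature of a non-empty set of strings: the minimum hash value per function."""
    ids = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * x + b) % PRIME for x in ids) for a, b in family]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


def candidate_probability(s, rows, bands):
    """Probability that two sets with similarity s agree on all rows of at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


if __name__ == "__main__":
    doc_a = {f"w{i}" for i in range(0, 60)}
    doc_b = {f"w{i}" for i in range(30, 90)}
    family = make_hash_family(512, seed=42)
    sig_a = minhash_signature(doc_a, family)
    sig_b = minhash_signature(doc_b, family)
    print("exact:", jaccard(doc_a, doc_b))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
    print("P(candidate) with 20 bands of 5 rows:", candidate_probability(1 / 3, 5, 20))
```

The estimate fluctuates around the exact value, with a spread that shrinks as the number of hash functions grows. `candidate_probability` gives the S-shaped curve used to reason about false positives and false negatives in the banding technique.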

Prerequisites

- Basics of linear algebra
- Calculus and the study of functions; basic knowledge of probability and statistics
- Programming, fundamental algorithms and data structures

Reference texts

- Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
- J. Leskovec, A. Rajaraman, and J. Ullman. Mining of Massive Datasets. Cambridge University Press.
- Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd ed. draft).
- Notes and scientific papers on the topics covered, suggested by the instructor.

Attendance

No mandatory attendance

Exam format

Students are assessed on the basis of:
- Projects assigned during the course and to be handed in while the course is running, valid for the academic year
- A written exam on the entire course syllabus or on those parts corresponding to projects not handed in
- An oral exam
At the student's request, it is always possible to be assessed through a written exam and an oral exam on the entire course syllabus.

Delivery mode

Lectures and exercises held in the classroom

Instructor: LUCA BECCHETTI

Syllabus - Attendance - Exams

Syllabus

Section I – Ranking and similarity search
1. Problems of interest. Document ranking. Link analysis: review of PageRank as a query-independent ranking algorithm. Context-dependent link analysis: Topic-sensitive and Personalized PageRank. Hubs and authorities: the HITS algorithm.
2. Similarity search in high dimensions: Top-k and approximate near(est)-neighbour search. Similarity between sets and Jaccard similarity/distance. Minwise independent permutations. Ideal case and its analysis. Minwise signatures and Jaccard similarity estimation. Implementation with universal (pairwise independent) hash families. Improving accuracy: the banding technique. Estimation of false positive and false negative rates.
3. General properties of the banding technique. Locality Sensitive Hashing for other distance measures: Hamming and cosine distances.
4. Other techniques for efficient similarity search. Clustering and vector quantization: properties and limits. Product quantization: definition, properties, implementation. Efficiency of product quantization.
5. Graph-based methods. Navigable small-world networks. Kleinberg's navigable small-world network for 2-dimensional lattices. Navigable small-world networks on Euclidean point sets. Search and network construction algorithms.

Section II – Dimensionality reduction and clustering
1. A motivating application: recommender systems and collaborative filtering.
2. A review of the Singular Value Decomposition: properties of the SVD (and PCA). Explained variance, best low-rank approximation in Frobenius norm. Using the SVD for classification and recommendation. Spectral embeddings using truncated SVD.

Section III – Deep learning and Natural Language Processing
1. Neural networks and Large Language Models
2. Neural Information Retrieval
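To make the Section II topics on the best low-rank approximation and spectral embeddings concrete, here is a minimal sketch using NumPy. It is not course material. The function names and the toy ratings matrix are hypothetical, and the row embedding is taken as U_k scaled by the top singular values, which is one common convention among several.

```python
import numpy as np


def truncated_svd(matrix, k):
    """Return the top-k singular triplets (U_k, s_k, Vt_k) of a dense matrix."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]


def low_rank_approximation(matrix, k):
    """Rank-k approximation U_k diag(s_k) Vt_k, optimal in Frobenius norm."""
    u_k, s_k, vt_k = truncated_svd(matrix, k)
    return (u_k * s_k) @ vt_k


def spectral_embedding(matrix, k):
    """k-dimensional row embeddings, taken here as U_k scaled by the singular values."""
    u_k, s_k, _ = truncated_svd(matrix, k)
    return u_k * s_k


if __name__ == "__main__":
    # Hypothetical user-by-item ratings matrix (rows: users, columns: items).
    ratings = np.array(
        [
            [5.0, 4.0, 0.0, 1.0],
            [4.0, 5.0, 1.0, 0.0],
            [1.0, 0.0, 5.0, 4.0],
            [0.0, 1.0, 4.0, 5.0],
        ]
    )
    k = 2
    approx = low_rank_approximation(ratings, k)
    singular_values = np.linalg.svd(ratings, compute_uv=False)
    # The Frobenius error equals the root of the sum of squared discarded singular values.
    print("error:", np.linalg.norm(ratings - approx, "fro"))
    print("tail:", np.sqrt(np.sum(singular_values[k:] ** 2)))
    print("user embeddings:\n", spectral_embedding(ratings, k))
```

The two printed error values should agree up to floating-point rounding, which is the Frobenius-norm optimality property of the truncated SVD.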

Prerequisites

- Linear algebra
- Calculus and basic knowledge of probability theory and statistics
- Programming, fundamental algorithms and data structures

Reference texts

Attendance

No mandatory attendance

Exam format

Evaluation is based on:
- Homework assigned during the course and valid for the current academic year
- A written exam on the entire syllabus, or on the parts of it corresponding to homework that was not submitted
- An oral exam
It is always possible to take the written exam plus an oral exam.

Bibliography

Delivery mode

Lectures and exercises solved in the classroom

Course code: 10621173
Academic year: 2025/2026
Degree programme: Data Science
Curriculum: Single curriculum
Year: 1st year
Semester: 2nd semester
SSD: ING-INF/05
CFU: 6
