Generative AI

A rigorous deep dive into the architectures, mathematics, and code defining the future of Artificial Intelligence.

Join the Google Classroom

Enroll to access course materials, assignments, and announcements.

Class code: 2pxf2bro
Prof. Fabrizio Silvestri
Course Instructor

Ali Ghasemi
Teaching Assistant

Course Syllabus

Part I: Foundations (4 Lectures)

  • Probability Theory & Linear Algebra
  • Optimization & Information Theory
  • Deep Learning Architectures & Attention
  • CLIP, Contrastive Learning & Autoencoders
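
To give a taste of Lecture 03's core operation, here is a minimal sketch of single-head scaled dot-product attention in the spirit of Vaswani et al. (2017). The function name and tensor shapes are illustrative simplifications, not the course's actual lab code.

```python
# Minimal single-head scaled dot-product attention (illustrative sketch).
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, d_k) tensors; returns (batch, seq_len, d_k)."""
    # Query-key similarities, scaled by sqrt(d_k) to keep softmax gradients stable.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # Disallowed positions (e.g. future tokens in a decoder) get -inf before softmax.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # one attention distribution per query
    return weights @ v                       # weighted sum of value vectors

# Toy self-attention: 2 sequences of 5 tokens with 16-dim embeddings.
x = torch.randn(2, 5, 16)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([2, 5, 16])
```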
Part II: Generative AI for Images (4 Lectures)

  • VQ-VAE & Generative Adversarial Networks
  • Normalizing Flows (Continuous Flows, Neural ODEs)
  • Diffusion Models (DDPM, Score-Based)
  • Diffusion Architectures (LDM, DiT, Adapters)

💻 Hands-on Labs

  • Lab 1: VAE & GAN Implementation
  • Lab 2: Diffusion Models: DDPM Training & Sampling
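
As a preview of Lab 2, the sketch below shows the simplified DDPM training loss of Ho et al. (2020): draw a random timestep, noise the clean input in closed form, and regress the injected noise. `eps_model` is a hypothetical stand-in for the U-Net trained in the lab.

```python
# Sketch of the simplified DDPM objective; `eps_model` is a placeholder network.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product: alpha_bar_t

def ddpm_loss(eps_model, x0):
    """L_simple = E[ || eps - eps_theta(x_t, t) ||^2 ]."""
    b = x0.size(0)
    t = torch.randint(0, T, (b,))                        # one random timestep per sample
    ab = alphas_bar[t].view(b, *([1] * (x0.dim() - 1)))  # broadcast to x0's shape
    eps = torch.randn_like(x0)                           # the noise the model must predict
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps       # closed-form forward noising
    return torch.mean((eps - eps_model(x_t, t)) ** 2)
```

Sampling then runs the learned reverse chain from pure Gaussian noise; the lab covers both directions.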
Part III: Generative AI for Text (5 Lectures)

  • NLP Foundations (Tokenization, Embeddings)
  • LLM Architecture (GPT, LLaMA)
  • LLM Architecture & Scaling Laws
  • Alignment (RLHF, DPO, ORPO, LoRA)
  • Retrieval Augmented Generation (RAG)

💻 Hands-on Labs

  • Lab 3: NanoGPT: Building a Micro-GPT
  • Lab 4: LLM Applications: RAG & Agents
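
A toy sketch of Lab 4's retrieve step: embed the query and the passages, score by cosine similarity, and hand the top-k passages to the LLM as context. The `embed` function below is a deliberately crude hashed bag-of-words stand-in so the sketch runs on its own; a real pipeline uses a trained bi-encoder (e.g. DPR) with an ANN index (e.g. HNSW) instead of a brute-force scan.

```python
# Toy dense retrieval for a RAG pipeline; `embed` is a crude stand-in encoder.
import numpy as np

def embed(texts):
    """Hashed bag-of-words vectors, L2-normalized; purely illustrative."""
    vecs = np.zeros((len(texts), 256))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            vecs[i, hash(tok) % 256] += 1.0
    return vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9)

def retrieve(query, passages, k=2):
    scores = embed(passages) @ embed([query])[0]  # cosine similarity of unit vectors
    top = np.argsort(-scores)[:k]                 # indices of the k best passages
    return [passages[i] for i in top]             # contexts to prepend to the prompt

docs = ["Diffusion models denoise step by step.",
        "RAG grounds generation in retrieved passages.",
        "LoRA adds low-rank adapters to frozen weights."]
print(retrieve("how does retrieval augmented generation work?", docs))
```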
Part IV: Frontiers & Advanced Topics (3 Lectures)

  • Agentic AI (ReAct, Reflexion, Tree of Thoughts, Tool Use)
  • JEPA (I-JEPA, V-JEPA 2, LLM-JEPA)
  • Multimodal LLMs (CLIP, LLaVA, GPT-4o, Gemini)

💻 Hands-on Labs

  • Lab 4: LLM Applications: RAG & Agents
  • Lab 5: Multimodal: VLM Applications & Fine-tuning
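
Since CLIP threads through both Lecture 04 and the multimodal lectures, here is a minimal sketch of its symmetric contrastive (InfoNCE) objective over a batch of paired embeddings. The embeddings here are random placeholders; in CLIP they come from the image and text encoders, and the temperature is learned rather than fixed.

```python
# Symmetric contrastive loss in the style of CLIP (illustrative sketch).
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Matched (image_i, text_i) pairs are positives; all other batch pairs are negatives."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature  # (B, B) cosine-similarity logits
    targets = torch.arange(img.size(0))   # the diagonal holds the correct pairings
    # Cross-entropy in both directions: image -> text and text -> image.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

print(clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```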

Tentative Schedule

Date   | Day | Type     | Topic                                                  | Content
-------|-----|----------|--------------------------------------------------------|--------
Part I: Foundations (Ch 1–7)
Feb 25 | Wed | Lecture  | 01: Foundations I                                      | Probability Theory, Linear Algebra.
Feb 27 | Fri | Lecture  | 02: Foundations II                                     | Optimization, Information Theory.
Mar 04 | Wed | Lecture  | 03: Deep Learning & Attention                          | DL Architectures, CNNs, Transformers.
Mar 06 | Fri | Lecture  | 04: CLIP & Autoencoders                                | Contrastive Learning, VAEs, ELBO.
Part II: Generative AI for Images (Ch 8–13)
Mar 11 | Wed | Lecture  | 05: VQ-VAE & GANs                                      | Vector Quantized Models, Adversarial Training.
Mar 13 | Fri | Lab      | 01: VAE & GAN Lab                                      | VAE implementation, GAN training.
Mar 18 | Wed | Lecture  | 06: Normalizing Flows                                  | Invertible Networks, Continuous Flows, Neural ODEs.
Mar 20 | Fri | Lecture  | 07: Diffusion Models                                   | DDPM, Score-Based Models, U-Net.
Mar 25 | Wed | Lab      | 02: Diffusion Lab                                      | DDPM & Latent Diffusion Training.
Mar 27 | Fri | Exercise | 08: Comprehensive Vision AI Review                     | Exercises on Foundations, VAEs, GANs, Diffusion.
Part III: Generative AI for Text (Ch 14–19)
Apr 01 | Wed | Lecture  | 09: NLP Foundations                                    | Tokenization, Embeddings, RNNs.
Apr 03 | Fri | Holiday  | Easter Break                                           | No class.
Apr 08 | Wed | Lecture  | 10: LLM Architecture                                   | GPT, LLaMA, Inference Optimization.
Apr 10 | Fri | Lecture  | 10: LLM Architecture (cont.)                           | Scaling Laws, KV Cache, GQA, Inference.
Apr 15 | Wed | Lecture  | 11: Alignment                                          | RLHF, DPO, ORPO, PEFT.
Apr 17 | Fri | Lab      | 03: NanoGPT Lab                                        | Building a Micro-GPT from Scratch.
Apr 22 | Wed | Lecture  | 11/12: Alignment (Part II) + RAG & Agentic AI (Part I) | Close RLHF/DPO/ORPO/LoRA; open RAG (Lewis 2020, DPR, pipeline).
Apr 24 | Fri | Lecture  | 12: RAG & Agentic AI (Part II)                         | Review of NLP, LLMs, Alignment, RAG.
Apr 29 | Wed | Lab      | 04: LLM Applications Lab                               | RAG & Agent Implementation.
Part IV: Frontiers & Advanced Topics (Ch 20–24)
May 01 | Fri | Holiday  | Labor Day                                              | No class.
May 06 | Wed | Lecture  | 15: JEPA                                               | LeCun's non-generative bet. EBMs, collapse & anti-collapse (BYOL, DINO, VICReg), I-JEPA, V-JEPA 2, LLM-JEPA, theory & outlook.
May 08 | Fri | Exercise | 16: Deep Generative Modeling - Theory & Practice       | VAE, GAN, Diffusion, Transformer Exercises.
May 13 | Wed | Lecture  | 17: Multimodal LLMs                                    | CLIP recap, Flamingo, LLaVA, GPT-4o, Gemini, Chameleon, audio/video/3D, evaluation.
May 15 | Fri | Lab      | 05: Multimodal Lab                                     | VLM Applications & Fine-tuning.
May 20 | Wed | Exercise | 18: Exercises I                                        | Foundations, Images & Diffusion Review.
May 22 | Fri | Lecture  | Invited Lecture (PhD Students)                         | Guest research lecture by PhD students.
May 27 | Wed | Exercise | 19: Final Comprehensive Exercise                       | Exam-style review across all course topics.

Important Papers

Canonical references, in IEEE style, for every paper introduced in the lecture decks or notes. Peer-reviewed venues are preferred over preprints; where a paper appeared both in a peer-reviewed venue and on arXiv, the conference or journal is cited and the arXiv ID is given in parentheses.

Part I — Foundations

  1. C. E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
  2. H. Robbins and S. Monro, "A Stochastic Approximation Method," The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400–407, 1951.
  3. F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, vol. 65, no. 6, pp. 386–408, 1958.
  4. K. Hornik, M. Stinchcombe, and H. White, "Multilayer Feedforward Networks Are Universal Approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
  5. D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," in Proc. ICLR, 2015.
  6. I. Loshchilov and F. Hutter, "Decoupled Weight Decay Regularization," in Proc. ICLR, 2019.
  7. A. Vaswani et al., "Attention Is All You Need," in Advances in NeurIPS, 2017.
  8. D. Bahdanau, K. Cho, and Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," in Proc. ICLR, 2015.
  9. J. Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding," Neurocomputing, vol. 568, Art. no. 127063, 2024 (arXiv:2104.09864).
  10. T. Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," in Advances in NeurIPS, 2022.
  11. K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. CVPR, 2016, pp. 770–778.
  12. O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," in Proc. MICCAI, 2015, pp. 234–241.
  13. S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," in Proc. ICML, 2015, pp. 448–456.
  14. J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer Normalization," arXiv:1607.06450, 2016.

Part II — Generative AI for Images

  1. D. P. Kingma and M. Welling, "Auto-Encoding Variational Bayes," in Proc. ICLR, 2014.
  2. I. Goodfellow et al., "Generative Adversarial Nets," in Advances in NeurIPS, 2014.
  3. I. Higgins et al., "β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework," in Proc. ICLR, 2017.
  4. A. van den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural Discrete Representation Learning," in Advances in NeurIPS, 2017.
  5. L. Dinh, J. Sohl-Dickstein, and S. Bengio, "Density Estimation Using Real NVP," in Proc. ICLR, 2017.
  6. D. P. Kingma and P. Dhariwal, "Glow: Generative Flow with Invertible 1×1 Convolutions," in Advances in NeurIPS, 2018.
  7. D. Rezende and S. Mohamed, "Variational Inference with Normalizing Flows," in Proc. ICML, 2015, pp. 1530–1538.
  8. J. Sohl-Dickstein et al., "Deep Unsupervised Learning Using Nonequilibrium Thermodynamics," in Proc. ICML, 2015, pp. 2256–2265.
  9. J. Ho, A. Jain, and P. Abbeel, "Denoising Diffusion Probabilistic Models," in Advances in NeurIPS, 2020.
  10. Y. Song and S. Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution," in Advances in NeurIPS, 2019.
  11. A. Nichol and P. Dhariwal, "Improved Denoising Diffusion Probabilistic Models," in Proc. ICML, 2021, pp. 8162–8171.
  12. P. Dhariwal and A. Nichol, "Diffusion Models Beat GANs on Image Synthesis," in Advances in NeurIPS, 2021.
  13. J. Ho and T. Salimans, "Classifier-Free Diffusion Guidance," in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021 (arXiv:2207.12598).
  14. W. Peebles and S. Xie, "Scalable Diffusion Models with Transformers," in Proc. ICCV, 2023, pp. 4195–4205.
  15. K. He et al., "Masked Autoencoders Are Scalable Vision Learners," in Proc. CVPR, 2022, pp. 16000–16009.
  16. M. Heusel et al., "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (FID)," in Advances in NeurIPS, 2017.

Part III — Generative AI for Text

  1. T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," in Workshop at ICLR, 2013.
  2. S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  3. J. Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in Proc. NAACL, 2019, pp. 4171–4186.
  4. A. Radford et al., "Improving Language Understanding by Generative Pre-Training," OpenAI Tech. Rep., 2018.
  5. T. Brown et al., "Language Models Are Few-Shot Learners (GPT-3)," in Advances in NeurIPS, 2020.
  6. H. Touvron et al., "LLaMA: Open and Efficient Foundation Language Models," arXiv:2302.13971, 2023.
  7. H. Touvron et al., "LLaMA 2: Open Foundation and Fine-Tuned Chat Models," arXiv:2307.09288, 2023.
  8. J. Kaplan et al., "Scaling Laws for Neural Language Models," arXiv:2001.08361, 2020.
  9. J. Hoffmann et al., "Training Compute-Optimal Large Language Models (Chinchilla)," in Advances in NeurIPS, 2022.
  10. N. Shazeer, "Fast Transformer Decoding: One Write-Head Is All You Need (MQA)," arXiv:1911.02150, 2019.
  11. J. Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints," in Proc. EMNLP, 2023 (arXiv:2305.13245).
  12. L. Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback (InstructGPT)," in Advances in NeurIPS, 2022.
  13. J. Schulman et al., "Proximal Policy Optimization Algorithms," arXiv:1707.06347, 2017.
  14. J. Schulman et al., "Trust Region Policy Optimization," in Proc. ICML, 2015, pp. 1889–1897.
  15. R. Rafailov et al., "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model," in Advances in NeurIPS, 2023.
  16. J. Hong, N. Lee, and J. Thorne, "ORPO: Monolithic Preference Optimization without Reference Model," in Proc. EMNLP, 2024.
  17. Z. Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO)," arXiv:2402.03300, 2024.
  18. E. J. Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models," in Proc. ICLR, 2022.
  19. T. Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs," in Advances in NeurIPS, 2023.
  20. P. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," in Advances in NeurIPS, 2020.
  21. K. Guu et al., "REALM: Retrieval-Augmented Language Model Pre-training," in Proc. ICML, 2020.
  22. V. Karpukhin et al., "Dense Passage Retrieval for Open-Domain Question Answering," in Proc. EMNLP, 2020, pp. 6769–6781.
  23. O. Khattab and M. Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT," in Proc. SIGIR, 2020, pp. 39–48.
  24. K. Santhanam et al., "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction," in Proc. NAACL, 2022.
  25. T. Formal, B. Piwowarski, and S. Clinchant, "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking," in Proc. SIGIR, 2021, pp. 2288–2292.
  26. G. Izacard and É. Grave, "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (FiD)," in Proc. EACL, 2021, pp. 874–880.
  27. A. Asai et al., "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection," in Proc. ICLR, 2024 (arXiv:2310.11511).
  28. D. Edge et al., "From Local to Global: A Graph RAG Approach to Query-Focused Summarization," arXiv:2404.16130, 2024.
  29. Y. A. Malkov and D. A. Yashunin, "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824–836, 2020.
  30. N. Thakur et al., "BEIR: A Heterogeneous Benchmark for Zero-Shot Evaluation of Information Retrieval Models," in Advances in NeurIPS Datasets and Benchmarks, 2021.
  31. N. Muennighoff et al., "MTEB: Massive Text Embedding Benchmark," in Proc. EACL, 2023, pp. 2014–2037.
  32. N. Kandpal et al., "Large Language Models Struggle to Learn Long-Tail Knowledge," in Proc. ICML, 2023, pp. 15696–15707.
  33. N. F. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," Transactions of the ACL, vol. 12, pp. 157–173, 2024.
  34. F. Cuconasu et al., "The Power of Noise: Redefining Retrieval for RAG Systems," in Proc. SIGIR, 2024, pp. 719–729.
  35. G. Trappolini, F. Cuconasu, S. Filice, Y. Maarek, and F. Silvestri, "Redefining Retrieval Evaluation in the Era of LLMs," in Proc. EACL, 2026, pp. 8359–8375.
  36. S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, "RAGAS: Automated Evaluation of Retrieval Augmented Generation," in Proc. EACL: System Demonstrations, 2024, pp. 150–158.
  37. S. Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models," in Proc. ICLR, 2023 (arXiv:2210.03629).
  38. N. Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning," in Advances in NeurIPS, 2023 (arXiv:2303.11366).
  39. S. Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models," in Advances in NeurIPS, 2023 (arXiv:2305.10601).
  40. T. Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools," in Advances in NeurIPS, 2023 (arXiv:2302.04761).
  41. G. Wang et al., "Voyager: An Open-Ended Embodied Agent with Large Language Models," Transactions on Machine Learning Research (TMLR), 2024.
  42. C. E. Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?," in Proc. ICLR, 2024 (arXiv:2310.06770).
  43. S. Zhou et al., "WebArena: A Realistic Web Environment for Building Autonomous Agents," in Proc. ICLR, 2024.
  44. K. Greshake et al., "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection," in Proc. ACM Workshop on AI and Security (AISec), 2023, pp. 79–90.

Part IV — Frontiers (JEPA, Multimodal)

  1. Y. LeCun, "A Path Towards Autonomous Machine Intelligence," OpenReview preprint, Version 0.9.2, 2022.
  2. M. Assran et al., "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA)," in Proc. CVPR, 2023, pp. 15619–15629.
  3. A. Bardes, J. Ponce, and Y. LeCun, "VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning," in Proc. ICLR, 2022.
  4. J. Zbontar et al., "Barlow Twins: Self-Supervised Learning via Redundancy Reduction," in Proc. ICML, 2021, pp. 12310–12320.
  5. J.-B. Grill et al., "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (BYOL)," in Advances in NeurIPS, 2020.
  6. M. Caron et al., "Emerging Properties in Self-Supervised Vision Transformers (DINO)," in Proc. ICCV, 2021, pp. 9650–9660.
  7. T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)," in Proc. ICML, 2020, pp. 1597–1607.
  8. Y. Tian, X. Chen, and S. Ganguli, "Understanding Self-Supervised Learning Dynamics Without Contrastive Pairs," in Proc. ICML, 2021, pp. 10268–10278.
  9. R. Shwartz-Ziv, R. Balestriero, K. Kawaguchi, T. G. J. Rudner, and Y. LeCun, "An Information Theory Perspective on Variance-Invariance-Covariance Regularization," in Advances in NeurIPS, 2023 (arXiv:2303.00633).
  10. E. Littwin, O. Saremi, M. Advani, C. Huang, P. Nakkiran, J. Susskind, and V. Thilak, "How JEPA Avoids Noisy Features: The Implicit Bias of Deep Linear Self-Distillation Networks," in Advances in NeurIPS, 2024 (arXiv:2407.03475).
  11. M. Assran et al., "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning," arXiv:2506.09985, 2025.
  12. H. Huang, Y. LeCun, and R. Balestriero, "LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures," arXiv:2509.14252, 2025.
  13. A. Radford et al., "Learning Transferable Visual Models from Natural Language Supervision (CLIP)," in Proc. ICML, 2021, pp. 8748–8763.
  14. M. Tsimpoukelli et al., "Multimodal Few-Shot Learning with Frozen Language Models," in Advances in NeurIPS, 2021.
  15. J.-B. Alayrac et al., "Flamingo: A Visual Language Model for Few-Shot Learning," in Advances in NeurIPS, 2022.
  16. J. Li, D. Li, S. Savarese, and S. Hoi, "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models," in Proc. ICML, 2023, pp. 19730–19742.
  17. H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual Instruction Tuning (LLaVA)," in Advances in NeurIPS, 2023.
  18. H. Liu et al., "Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)," in Proc. CVPR, 2024, pp. 26296–26306 (arXiv:2310.03744).
  19. Chameleon Team (Meta), "Chameleon: Mixed-Modal Early-Fusion Foundation Models," arXiv:2405.09818, 2024.
  20. A. Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control," in Proc. CoRL, 2023, pp. 2165–2183.
  21. D. Driess et al., "PaLM-E: An Embodied Multimodal Language Model," in Proc. ICML, 2023, pp. 8469–8488.
  22. A. Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision (Whisper)," in Proc. ICML, 2023, pp. 28492–28518.
  23. Z. Borsos et al., "AudioLM: A Language Modeling Approach to Audio Generation," IEEE/ACM Trans. Audio, Speech and Language Processing, vol. 31, pp. 2523–2533, 2023.
  24. X. Yue et al., "MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI," in Proc. CVPR, 2024, pp. 9556–9567.
  25. Y. Li et al., "Evaluating Object Hallucination in Large Vision-Language Models (POPE)," in Proc. EMNLP, 2023, pp. 292–305.