SYNASC 2025

27th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing

September 22-25, Timișoara, România

Tutorials

Synthetic Data Generation with LLMs

Andreea Dutulescu, Ștefan Rușeți, Mihai Dascalu

National University of Science and Technology Politehnica Bucharest
 
Description: This tutorial provides a technical overview of synthetic data generation using Large Language Models (LLMs), focusing on core methodologies and their integration. Synthetic data has become an essential tool for addressing key limitations in the availability, cost, and distributional coverage of manually annotated datasets. It enables scalable experimentation, facilitates data augmentation in low-resource settings, and supports iterative model refinement. The session begins with a discussion of generation methods and filtering strategies designed to enforce quality constraints. It then examines three practical use cases: alignment tuning, where synthetic datasets are used to steer model behavior; inference-time augmentation, where generated exemplars support few-shot generalization or contextual adaptation; and self-improvement workflows, where models contribute to their own iterative training through synthetic supervision.
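
As a rough illustration of the generate-then-filter pattern mentioned above, the sketch below draws one sample per prompt and keeps only those that pass two simple quality checks: a word-count range and near-duplicate removal. It is a minimal sketch under stated assumptions, not the presenters' method: `generate_and_filter`, `stub_generate`, and the thresholds are hypothetical names and values, and the stub stands in for a real LLM call.

```python
from typing import Callable, Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def generate_and_filter(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    min_words: int = 3,   # assumed threshold, chosen only for illustration
    max_words: int = 50,  # assumed threshold, chosen only for illustration
) -> List[str]:
    """Generate one sample per prompt, then drop samples that are too
    short, too long, or duplicates of an earlier kept sample."""
    kept: List[str] = []
    seen = set()
    for prompt in prompts:
        sample = generate(prompt).strip()
        n_words = len(sample.split())
        if not (min_words <= n_words <= max_words):
            continue  # length constraint violated
        key = normalize(sample)
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        kept.append(sample)
    return kept


def stub_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns canned text."""
    canned = {
        "q1": "What is the capital of France?",
        "q2": "what is the  capital of FRANCE?",  # duplicate up to case/spacing
        "q3": "Why?",                              # too short
    }
    return canned.get(prompt, "")


samples = generate_and_filter(["q1", "q2", "q3", "q4"], stub_generate)
print(samples)
```

In practice the stub would be replaced by a model call, and the checks would typically be extended with task-specific validators or model-based scoring.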
 
Short bios:
  • Andreea Dutulescu is a PhD student at the National University of Science and Technology Politehnica Bucharest. Her research interests include natural language processing (NLP) in education, synthetic data generation, and related areas in applied machine learning. Andreea has gained experience in both academic and industrial settings. She has completed multiple internships at Google, where she worked on practical ML tasks and contributed to real-world applications. In her academic work, she has participated in various NLP research projects.
  • Ștefan Rușeți is an associate professor in the Department of Computer Science. His research activity spans multiple areas of Natural Language Processing (NLP), with most of his work focused on applying NLP techniques to develop tools for educational scenarios. He has experience in both national and international projects and has published over 80 papers, including 13 articles at top-tier conferences (EMNLP, COLING, ECIR, AIED) and 6 articles in Q1 journals (Computers & Education, Computers in Human Behavior, International Journal of Artificial Intelligence in Education).
  • Mihai Dascalu is a professor of Computer Science at the National University of Science and Technology Politehnica Bucharest. He teaches object-oriented programming, algorithm design, and machine learning. He holds a dual PhD in Computer Science and Educational Sciences and has authored over 300 research papers, including many in top-tier conferences and Q1 journals. Mihai has participated extensively in national and international projects, received prestigious awards such as a Fulbright scholarship, and is a corresponding member of the Academy of Romanian Scientists.