Papernotes: TabLLM: Few-shot Classification of Tabular Data with Large Language Models

Why is few-shot learning needed?

Example: Risk stratification in healthcare, where there are thousands of diseases, each with very few patients. Hence the need for methods that exploit prior knowledge.

Tabular data lacks locality, contains mixed data types, and the number of columns is usually small compared to the number of features in text or image data (Borisov et al., 2022a).
  • TabLLM is a general framework to leverage LLMs for few-shot classification of tabular data.
  • The paper evaluates nine different serializations and the T0 language model at different sizes (Sanh et al., 2022).
  • Uses the parameter-efficient fine-tuning method T-Few (Liu et al., 2022) to update the LLM’s parameters on the few labeled examples (a sketch of T-Few’s (IA)³ rescaling follows this list).
  • Tabular data tasks so far:
    • prediction of masked cells, the identification or correction of corrupted cells, and contrastive losses over augmentations
    • LLMs for tabular data tasks
      • Yin et al. (2020) for semantic parsing of natural language queries over tabular data
      • Li et al. (2020) for entity matching
      • Harari and Katz (2022) study data enrichment by linking each table row with additional unstructured text (e.g., from Wikipedia), from which they generate additional features using a language model.
      • Bertsimas et al. (2022) – generate feature embeddings
      • All of the above use BERT-like models.
      • Narayan et al. (2022) for data management tasks.
      • Borisov et al. (2022b) – an LLM-agnostic method to generate tabular data
  • Serialization
  • Simplest approach: a list serialization of column names and values (see the serialization sketch after this list).
  • Yin et al. (2020) also include the column data type in the serialized string.
  • LLM input = (serialize(F, x), p), i.e., the serialized feature names F and values x of a row, followed by a task-specific prompt p.
  • More complex serializations can involve (i) incorporating another LLM and (ii) employing feature selection as a substep.
  • Prediction: LLM(serialize(F, x), p) ∈ V, i.e., the model’s answer is a token in the answer vocabulary V, which a verbalizer maps back to a class label.
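
To make the serialization and prompting pipeline concrete, here is a minimal sketch. The “The <column> is <value>.” template, the example row, the prompt wording, and the llm_score callable are illustrative assumptions, not the paper’s exact templates:

```python
# Sketch of TabLLM-style serialization and prompt-based classification.
# Feature names, the example row, and the verbalizer words are assumed
# for illustration.

def serialize(features: dict) -> str:
    """Turn one table row into natural-language sentences,
    one 'The <column> is <value>.' sentence per column."""
    return " ".join(f"The {name} is {value}." for name, value in features.items())

def build_input(features: dict, prompt: str) -> str:
    """LLM input = (serialize(F, x), p): the serialized row followed by
    the task-specific prompt."""
    return f"{serialize(features)} {prompt}"

row = {"age": 42, "occupation": "teacher", "hours per week": 50}
prompt = "Does this person earn more than 50000 dollars? Yes or no?"
llm_input = build_input(row, prompt)
# -> "The age is 42. The occupation is teacher. The hours per week is 50.
#     Does this person earn more than 50000 dollars? Yes or no?"

# A verbalizer maps answer words back to class labels; the prediction is
# the answer word the LLM scores highest.
verbalizer = {"Yes": 1, "No": 0}

def classify(llm_score, llm_input: str) -> int:
    # llm_score(text, answer) is a hypothetical callable returning the
    # LLM's log-likelihood of `answer` given `text`.
    best = max(verbalizer, key=lambda word: llm_score(llm_input, word))
    return verbalizer[best]
```

Serializing into natural language rather than raw key-value pairs is what lets the LLM bring its prior knowledge about the feature names to bear, which is the whole point in the few-shot regime.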
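
For the fine-tuning step, here is a minimal sketch of the (IA)³ idea behind T-Few: freeze the pretrained weights and learn only element-wise rescaling vectors on the attention keys and values (T-Few also rescales the feed-forward activations). The module structure and dimensions are illustrative, not T0’s actual architecture:

```python
import torch
import torch.nn as nn

class IA3Attention(nn.Module):
    """Single-head self-attention with (IA)^3-style rescaling vectors."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        # Learned rescaling vectors, initialized to ones so that training
        # starts from the frozen model's original behavior.
        self.l_k = nn.Parameter(torch.ones(d_model))
        self.l_v = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.q(x)
        k = self.k(x) * self.l_k   # element-wise rescaling of keys
        v = self.v(x) * self.l_v   # element-wise rescaling of values
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
        return attn @ v

def ia3_trainable_parameters(model: nn.Module):
    """Freeze everything except the rescaling vectors, so only a tiny
    fraction of the parameters is updated on the few labeled examples."""
    params = []
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith("l_") or ".l_" in name
        if p.requires_grad:
            params.append(p)
    return params
```

With only the rescaling vectors trainable, the optimizer touches a few thousand parameters per layer instead of millions, which is what makes fine-tuning feasible with only a handful of labeled rows.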
