Language modeling and n-gram models, Regular expressions for tokenization, Sentence segmentation and normalization, Stemming, lemmatization, and part-of-speech tagging, Stopword removal and text normalization, Bag-of-words and TF-IDF representations, Character-level and subword tokenization (BPE, WordPiece), Unicode handling and multilingual text processing.
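As a minimal sketch of the n-gram modeling topic above: a bigram language model with add-one (Laplace) smoothing, built only from token counts. The corpus, function names, and the `<s>`/`</s>` sentence markers are illustrative choices, not from the source.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, w1, w2, vocab_size):
    """Add-one smoothed P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram_lm(corpus)
vocab_size = len(uni)  # includes <s> and </s>
p = bigram_prob(uni, bi, "the", "cat", vocab_size)  # (1 + 1) / (2 + 6) = 0.25
```

Smoothing keeps unseen bigrams from getting zero probability, which is why it appears before neural language models in most treatments of the topic.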
Word2Vec (skip-gram, CBOW), GloVe and fastText embeddings, Contextual embeddings (ELMo, Flair), RNN/LSTM/GRU for sequence modeling, Bidirectional encoders, Sequence labeling tasks (NER, POS tagging, chunking), CRF layer for structured prediction, Attention mechanisms (self-attention, multi-head attention).
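For the Word2Vec skip-gram entry, a sketch of how (center, context) training pairs are extracted with a sliding window; the tokens and window size here are invented for illustration, and real training would then learn embeddings from these pairs.

```python
def skipgram_pairs(tokens, window=2):
    """Emit (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "quick", "brown", "fox"], window=1)
# 6 pairs, e.g. ("quick", "the") and ("quick", "brown")
```

CBOW inverts this: it predicts the center word from the averaged context, using the same windowing.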
Transformer model architecture (encoder-decoder, positional encoding), Self-attention and scaled dot-product attention, Multi-head attention and layer normalization, BERT pretraining objectives (MLM, NSP), RoBERTa, DistilBERT, and ALBERT variants, Fine-tuning strategies and domain adaptation, Sentence-BERT for semantic similarity.
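The scaled dot-product attention named above can be written in a few lines of NumPy; the shapes and random inputs are illustrative, and multi-head attention repeats this over several projected subspaces.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 query positions, d_k = 4
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
```

The `sqrt(d_k)` scaling keeps the dot products from saturating the softmax as the key dimension grows.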
GPT architecture evolution (GPT-1 to GPT-4), Decoder-only transformers, In-context learning and few-shot prompting, Chain-of-thought reasoning, Retrieval-Augmented Generation (RAG) architecture, Knowledge graphs and dense retrieval, Prompt engineering techniques (zero-shot, few-shot, instruction tuning), Hallucination mitigation strategies.
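To make the RAG pattern above concrete, here is a toy sketch assuming word-overlap scoring as a stand-in for dense retrieval; the documents, query, and prompt template are all hypothetical, and a real system would use an embedding index and an LLM to complete the prompt.

```python
def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query (stand-in for dense retrieval)."""
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(query, documents):
    """Prepend retrieved context to the question: the core RAG pattern."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "The Transformer was introduced in 2017.",
    "BLEU is a metric for machine translation.",
]
prompt = build_rag_prompt("When was the Transformer introduced?", docs)
```

Grounding the generation in retrieved text is also one of the hallucination-mitigation strategies the unit lists: the model answers from supplied evidence rather than parametric memory alone.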
Text generation evaluation (BLEU, ROUGE, BERTScore, human evaluation), Question answering systems (extractive, generative), Conversational AI (dialogue state tracking, response generation), Multilingual NLP (mBERT, XLM-R), Model deployment (TGI, vLLM), Ethical considerations (bias detection, toxicity classification, fairness evaluation), RAG evaluation frameworks.
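As a sketch of the overlap-based metrics listed above, a simplified ROUGE-1 recall: the fraction of reference unigrams that appear in the candidate (with clipped counts). The example strings are invented; production evaluation would use an established implementation rather than this toy.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Clipped unigram overlap divided by reference length."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # min of counts per token
    return overlap / sum(ref.values())

score = rouge1_recall("the cat sat on the mat",
                      "the cat lay on the mat")  # 5 of 6 reference tokens matched
```

BLEU flips the direction (n-gram precision with a brevity penalty), while BERTScore replaces exact matches with embedding similarity; the unit covers all three.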