References for AI and NLP
Dec 24, 2022
Over the past few years I have been collecting useful articles on AI and NLP. The collection should become more useful if I keep building it in public.
Looking forward to your comments and suggestions to expand the collection.
AI system design:
- Project structure https://towardsdatascience.com/practices-to-enhance-your-ml-code-part-1-e77d00db826d
- ML book: https://book.mlcompendium.com/data-science-tools
- AI & Data Foundation landscape https://landscape.lfai.foundation/
- Comprehensive System design series: https://medium.com/coders-mojo/complete-system-design-series-part-1-45bf9c8654bc
- System design of some big tech companies: https://www.theinsaneapp.com/2021/03/system-design-and-recommendation-algorithms.html
- https://medium.com/analytics-vidhya/mlops-end-to-end-machine-learning-pipeline-cicd-1a7907698a8e
- https://towardsdatascience.com/the-only-3-ml-tools-you-need-1aa750778d33
- Software engineering for artificial intelligence and machine learning software: A systematic literature review https://arxiv.org/ftp/arxiv/papers/2011/2011.03751.pdf
NLP:
- https://nlpforhackers.io/
- http://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf
- https://www.analyticsvidhya.com/blog/2020/01/learning-path-nlp-2020/
- https://www.facebook.com/IIITH/videos/431575504254179/?v=431575504254179
- https://medium.com/huggingface/the-best-and-most-current-of-modern-natural-language-processing-5055f409a1d1
- https://www.analyticsvidhya.com/datahack-summit-2019/workshop/natural-language-processing-for-beginners/
- https://medium.com/@gon.esbuyo/get-started-with-nlp-part-i-d67ca26cc828
- PyTorch from scratch: https://www.analyticsvidhya.com/blog/2019/09/introduction-to-pytorch-from-scratch/?utm_source=blog&utm_medium=key-takeaways-iclr-2020
- https://www.nlphistory.com/
- Text Preprocessing:
(i). https://www.freecodecamp.org/news/all-you-need-to-know-about-text-preprocessing-for-nlp-and-machine-learning-bc1c5765ff67/
(ii). https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/
(iii). https://towardsdatascience.com/nlp-text-preprocessing-a-practical-guide-and-template-d80874676e79
(iv). https://towardsdatascience.com/natural-language-processing-nlp-for-machine-learning-d44498845d5b
- Tokenization: https://blog.floydhub.com/tokenization-nlp/
- Data augmentation NLP: https://medium.com/neptune-ai/data-augmentation-in-nlp-best-practices-from-a-kaggle-master-289489cde2f9
- Named entity linking: https://medium.com/analytics-vidhya/entity-linking-a-primary-nlp-task-for-information-extraction-22f9d4b90aa8
- Information Extraction: https://wandb.ai/tulasi1729/information_extraction/reports/Information-Extraction-From-Documents---Vmlldzo0MDc3MDQ?galleryTag=NLP
- Named Entity Recognition:
NER with BERT https://towardsdatascience.com/unsupervised-ner-using-bert-2d7af5f90b8a
- Knowledge Graph:
(i). https://web.stanford.edu/class/cs520/
(ii). https://kgtutorial.github.io/
(iii). https://www.kaggle.com/pavansanagapati/knowledge-graph-nlp-tutorial-bert-spacy-nltk
(iv). https://towardsdatascience.com/the-building-a-large-scale-accurate-and-fresh-knowledge-graph-71ebd912210e
- Graph neural networks (GNN): https://arxiv.org/abs/2106.06090
https://distill.pub/2021/gnn-intro/
- spaCy: NLP pipeline https://www.analyticsvidhya.com/blog/2020/03/spacy-tutorial-learn-natural-language-processing/
- Transformers and BERT:
(i). https://heartbeat.fritz.ai/research-guide-for-transformers-3ff751493222
(ii). https://wandb.ai/wandb/gallery/reports/SimpleTransformers-Transformers-Made-Easy--VmlldzoyNDQzNTg?galleryTag=NLP
(iii). https://wandb.ai/carlolepelaars/transformer_deep_dive/reports/Transformer-Deep-Dive--VmlldzozODQ4NDQ
(iv). http://jalammar.github.io/illustrated-bert/
(v). http://nlp.seas.harvard.edu/2018/04/03/attention.html
(vi). Where BERT is not good: very good paper https://arxiv.org/pdf/1907.13528v1.pdf https://towardsdatascience.com/bert-is-not-good-at-7b1ca64818c5
(vii). https://yashuseth.blog/2019/06/12/bert-explained-faqs-understand-bert-working/
(viii). https://towardsdatascience.com/beyond-bert-6f51a8bc5ce1
- Training transformer:
(i). https://huggingface.co/blog/how-to-train
(ii). Finetuning BERT with fastai https://mlexplained.com/2019/05/13/a-tutorial-to-fine-tuning-bert-with-fast-ai/
(iii). Training BERT in 76 mins: https://arxiv.org/abs/1904.00962
- RNN equivalent with BERT results trained on single GPU https://arxiv.org/pdf/1911.11423.pdf
- Word embeddings with BERT https://towardsdatascience.com/word-embedding-using-bert-in-python-dd5a86c00342
- Fast BERT inference on CPU https://robloxtechblog.com/how-we-scaled-bert-to-serve-1-billion-daily-requests-on-cpus-d99be090db26
- Inference: Fastformers, up to 233x speedup on CPU https://medium.com/ai-in-plain-english/fastformers-233x-faster-transformers-inference-on-cpu-4c0b7a720e1
- Speed Transformer Inference Huggingface: https://medium.com/microsoftazure/accelerate-your-nlp-pipelines-using-hugging-face-transformers-and-onnx-runtime-2443578f4333
- Sentence similarity with BERT: https://towardsdatascience.com/cutting-edge-semantic-search-and-sentence-similarity-53380328c655
- Text generation: a very good paper on generating text conditioned on context, CTRL: A Conditional Transformer Language Model for Controllable Generation https://arxiv.org/pdf/1909.05858.pdf
https://huggingface.co/blog/how-to-generate
https://medium.com/@madottoandrea/language-model-as-few-shot-learner-for-task-oriented-dialogue-systems-db4765796744
- Models for text generation:
https://huggingface.co/EleutherAI/gpt-j-6B
https://huggingface.co/t5-base
https://huggingface.co/facebook/opt-66b
https://huggingface.co/bigscience/bloomz-3b
https://www.forefront.ai
https://nlpcloud.com
https://www.goose.ai
https://cohere.ai/generate
- Remove duplication during text generation: https://arxiv.org/pdf/1908.04319.pdf https://arxiv.org/pdf/1904.09751.pdf
- Assessing The Factual Accuracy of Text Generation https://research.google/pubs/pub48082/
- Evaluation of text generation survey: https://arxiv.org/abs/2006.14799
- BLEURT: Learning Robust Metrics for Text Generation https://arxiv.org/abs/2004.04696 https://www.youtube.com/watch?v=rl4nUngiR2k
- GEM metric for NLG evaluation: https://arxiv.org/abs/2102.01672
- Allen NLP GENIE metrics: https://genie.apps.allenai.org/
- Hugging Face NLG evaluation metrics: https://huggingface.co/metrics
- CheckList https://arxiv.org/pdf/2005.04118.pdf https://towardsdatascience.com/beyond-accuracy-behavioral-testing-of-nlp-models-with-checklist-48544db3fef1
- Chatbots: https://arxiv.org/abs/2001.09977 https://arxiv.org/abs/2004.13637
- Good reference for architecture of complex generation models (Text to video generation) http://research.baidu.com/Blog/index-view?id=134
- GPT: Beginner's guide to retraining GPT https://www.youtube.com/watch?v=4iK-IuvatxI&feature=youtu.be
- GPT2: https://www.gwern.net/GPT-2 Good poetry generation example https://medium.com/@isamu.website Understanding GPT2 source code http://jalammar.github.io/illustrated-gpt2/
- Train GPT2 https://github.com/Xirider/finetune-gpt2xl
- GPT3: https://joeddav.github.io/blog/2020/05/29/ZSL.html
https://medium.com/analytics-vidhya/openai-gpt-3-language-models-are-few-shot-learners-82531b3d3122
https://www.eleuther.ai/
- Smaller GPT3: PET https://arxiv.org/abs/2009.07118
- GPT3 J opensource https://studio.ai21.com/
- Faster and lighter transformers https://arxiv.org/abs/2103.14636
- Long memory language models: Transformer XL, Reformer, Longformer, Linformer, Big Bird, Pegasus
https://github.com/allenai/naacl2021-longdoc-tutorial
- Comparative analysis of long memory transformers https://arxiv.org/abs/2011.04006
- NLP data augmentation survey https://amitness.com/2020/05/data-augmentation-for-nlp/
- Contextual word embeddings:
https://github.com/facebookresearch/wsd-biencoders
https://github.com/begab/sparsity_makes_sense
https://github.com/HSLCY/GlossBERT
https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_4_ELMO_BERT_FLAIR_EMBEDDING.md
https://github.com/timoschick/bertram (very interesting: you can give multiple contexts and get a word embedding)
- Word sense disambiguation:
https://github.com/SapienzaNLP/ewiser
https://github.com/getalp/disambiguate
- Target Sense Verification (TSV) https://medium.com/analytics-vidhya/automatic-extraction-of-hypernym-relations-from-text-using-ml-4b04eb33097f
- Lesk:
https://medium.com/analytics-vidhya/comparative-word-sense-disambiguation-1c3f0f4be1fa
http://www.nltk.org/howto/wsd.html
https://medium.com/analytics-vidhya/automatic-extraction-of-hypernym-relations-from-text-using-ml-4b04eb33097f
https://annabreit.medium.com/is-word-sense-disambiguation-outdated-ef05a139576
- WordNet: https://globalwordnet.github.io/gwadoc/
- Vision and NLP:
- ViLBERT (combining images and text) https://papers.nips.cc/paper/8297-vilbert-pretraining-task-agnostic-visiolinguistic-representations-for-vision-and-language-tasks
- https://www.microsoft.com/en-us/research/blog/expanding-scene-and-language-understanding-with-large-scale-pre-training-and-a-unified-architecture/
- Simple N-gram language model https://nlpforhackers.io/language-models/
- Abstractive summarization: https://www.analyticsvidhya.com/blog/2019/06/comprehensive-guide-text-summarization-using-deep-learning-python/
https://github.com/mathsyouth/awesome-text-summarization#abstractive-text-summarization
http://nlpprogress.com/english/summarization.html
https://github.com/allenai/scitldr (TLDR: Extreme Summarization of Scientific Documents)
- Advanced keyword extraction (can be used for fig component generation or mirrored claim identification): https://towardsdatascience.com/data-scientists-guide-to-summarization-fc0db952e363
- Tabular data: https://arxiv.org/pdf/2110.01889.pdf
- Datasets:
Squad https://rajpurkar.github.io/SQuAD-explorer/
Google PAWS (Paraphrase Adversaries from Word Scrambling)
Patent dataset https://www.ir-facility.org/prototypes/marec
Legal documents https://zenodo.org/record/4595826
Invoices https://www.cs.cmu.edu/~aharley/rvl-cdip/
NLP datasets https://opendatascience.com/24-open-nlp-datasets
- An alternative ensembling approach called stochastic weight averaging: https://towardsdatascience.com/stochastic-weight-averaging-a-new-way-to-get-state-of-the-art-results-in-deep-learning-c639ccf36a
- Python packages for PDF https://medium.com/@umerfarooq_26378/python-for-pdf-ef0fac2808b0
PDF figures: https://github.com/allenai/pdffigures
https://github.com/Coleridge-Initiative/rclc
AWS PDF read https://towardsdatascience.com/guide-on-aws-textract-set-up-8cddb8a3b3a3
Shape detection https://www.pyimagesearch.com/2016/02/08/opencv-shape-detection/
- Create Visio programmatically http://www.visguy.com/2006/09/13/create-visio-flowcharts-programmatically/
- Extract data from MS word using Python: https://towardsdatascience.com/how-to-extract-data-from-ms-word-documents-using-python-ed3fbb48c122
- Outlier detection:
https://scikit-learn.org/stable/modules/outlier_detection.html
Median absolute deviation https://stackoverflow.com/questions/22354094/pythonic-way-of-detecting-outliers-in-one-dimensional-observation-data
https://github.com/yzhao062/pyod
DBSCAN
HDBSCAN https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
- Text annotation tool: https://github.com/CogComp/talen
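As a quick illustration of the one-dimensional outlier detection idea listed above, here is a minimal sketch of a median-absolute-deviation check using only the Python standard library. The function name `mad_outliers`, the sample `readings`, and the 3.5 cutoff are my own assumptions for illustration (3.5 is a commonly quoted rule of thumb for the modified z-score); this is not taken from any of the linked articles.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: median of the absolute distances from the median.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the values equal the median, so the score is undefined.
        return []
    # 0.6745 rescales the MAD so that scores are comparable to ordinary
    # z-scores when the data is roughly normally distributed.
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]


# Hypothetical sample data: five similar readings and one obvious outlier.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0]
print(mad_outliers(readings))  # prints [25.0]
```

Because it relies on the median rather than the mean, a single extreme value barely moves the baseline, which is why this check tends to work on small samples. For multi-dimensional data the scikit-learn and PyOD links above are a better starting point.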
MLOps:
- https://app.experiencewelcome.com/events/vgujOo/stages/XKfD09
- https://stanford-cs329s.github.io/syllabus.html
- https://arxiv.org/ftp/arxiv/papers/2205/2205.02302.pdf
Data Labeling:
- NLP: Figure Eight, Scale AI, Heartex
- www.v7labs.com