References for AI and NLP

Uday Pulleti
Dec 24, 2022

I have been collecting useful articles on AI and NLP for the past few years. The collection should be more useful if I keep building it in public.

I look forward to your comments and suggestions for expanding the collection.

AI system design:

  1. Project structure https://towardsdatascience.com/practices-to-enhance-your-ml-code-part-1-e77d00db826d
  2. ML book: https://book.mlcompendium.com/data-science-tools
  3. AI & Data Foundation landscape https://landscape.lfai.foundation/
  4. Comprehensive System design series: https://medium.com/coders-mojo/complete-system-design-series-part-1-45bf9c8654bc
  5. System design of some big tech companies: https://www.theinsaneapp.com/2021/03/system-design-and-recommendation-algorithms.html
  6. https://medium.com/analytics-vidhya/mlops-end-to-end-machine-learning-pipeline-cicd-1a7907698a8e
  7. https://towardsdatascience.com/the-only-3-ml-tools-you-need-1aa750778d33
  8. Software engineering for artificial intelligence and machine learning software: A systematic literature review https://arxiv.org/ftp/arxiv/papers/2011/2011.03751.pdf

NLP:

  1. https://nlpforhackers.io/
  2. http://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf
  3. https://www.analyticsvidhya.com/blog/2020/01/learning-path-nlp-2020/
  4. https://www.facebook.com/IIITH/videos/431575504254179/?v=431575504254179
  5. https://medium.com/huggingface/the-best-and-most-current-of-modern-natural-language-processing-5055f409a1d1
  6. https://www.analyticsvidhya.com/datahack-summit-2019/workshop/natural-language-processing-for-beginners/
  7. https://medium.com/@gon.esbuyo/get-started-with-nlp-part-i-d67ca26cc828
  8. PyTorch from scratch: https://www.analyticsvidhya.com/blog/2019/09/introduction-to-pytorch-from-scratch/?utm_source=blog&utm_medium=key-takeaways-iclr-2020
  9. https://www.nlphistory.com/
  10. Text Preprocessing:
    (i). https://www.freecodecamp.org/news/all-you-need-to-know-about-text-preprocessing-for-nlp-and-machine-learning-bc1c5765ff67/
    (ii). https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/
    (iii). https://towardsdatascience.com/nlp-text-preprocessing-a-practical-guide-and-template-d80874676e79
    (iv). https://towardsdatascience.com/natural-language-processing-nlp-for-machine-learning-d44498845d5b
  11. Tokenization: https://blog.floydhub.com/tokenization-nlp/
  12. Data augmentation NLP: https://medium.com/neptune-ai/data-augmentation-in-nlp-best-practices-from-a-kaggle-master-289489cde2f9
  13. Named entity linking: https://medium.com/analytics-vidhya/entity-linking-a-primary-nlp-task-for-information-extraction-22f9d4b90aa8
  14. Information Extraction: https://wandb.ai/tulasi1729/information_extraction/reports/Information-Extraction-From-Documents---Vmlldzo0MDc3MDQ?galleryTag=NLP
  15. Named Entity Recognition:
    NER with BERT https://towardsdatascience.com/unsupervised-ner-using-bert-2d7af5f90b8a
  16. Knowledge Graph:
    (i). https://web.stanford.edu/class/cs520/
    (ii). https://kgtutorial.github.io/
    (iii). https://www.kaggle.com/pavansanagapati/knowledge-graph-nlp-tutorial-bert-spacy-nltk
    (iv). https://towardsdatascience.com/the-building-a-large-scale-accurate-and-fresh-knowledge-graph-71ebd912210e
  17. GNN Graph neural networks: https://arxiv.org/abs/2106.06090
    https://distill.pub/2021/gnn-intro/
  18. spaCy: NLP pipeline https://www.analyticsvidhya.com/blog/2020/03/spacy-tutorial-learn-natural-language-processing/
  19. Transformers and BERT:
    (i). https://heartbeat.fritz.ai/research-guide-for-transformers-3ff751493222
    (ii). https://wandb.ai/wandb/gallery/reports/SimpleTransformers-Transformers-Made-Easy--VmlldzoyNDQzNTg?galleryTag=NLP
    (iii). https://wandb.ai/carlolepelaars/transformer_deep_dive/reports/Transformer-Deep-Dive--VmlldzozODQ4NDQ
    (iv). http://jalammar.github.io/illustrated-bert/
    (v). http://nlp.seas.harvard.edu/2018/04/03/attention.html
    (vi). Where BERT is not good: very good paper https://arxiv.org/pdf/1907.13528v1.pdf https://towardsdatascience.com/bert-is-not-good-at-7b1ca64818c5
    (vii). https://yashuseth.blog/2019/06/12/bert-explained-faqs-understand-bert-working/
    (viii). https://towardsdatascience.com/beyond-bert-6f51a8bc5ce1
  20. Training transformer:
    (i). https://huggingface.co/blog/how-to-train
    (ii). Finetuning BERT with fastai https://mlexplained.com/2019/05/13/a-tutorial-to-fine-tuning-bert-with-fast-ai/
    (iii). Training BERT in 76 mins: https://arxiv.org/abs/1904.00962
  21. RNN with BERT-level results, trained on a single GPU https://arxiv.org/pdf/1911.11423.pdf
  22. Word embeddings with BERT https://towardsdatascience.com/word-embedding-using-bert-in-python-dd5a86c00342
  23. Fast BERT inference on CPU https://robloxtechblog.com/how-we-scaled-bert-to-serve-1-billion-daily-requests-on-cpus-d99be090db26
  24. Inference: Fastformers, 233x speedup https://medium.com/ai-in-plain-english/fastformers-233x-faster-transformers-inference-on-cpu-4c0b7a720e1
  25. Speed Transformer Inference Huggingface: https://medium.com/microsoftazure/accelerate-your-nlp-pipelines-using-hugging-face-transformers-and-onnx-runtime-2443578f4333
  26. Sentence similarity with BERT: https://towardsdatascience.com/cutting-edge-semantic-search-and-sentence-similarity-53380328c655
  27. Text generation: A very good paper on generating text conditioned on context. CTRL: A Conditional Transformer Language Model for Controllable Generation https://arxiv.org/pdf/1909.05858.pdf
    https://huggingface.co/blog/how-to-generate
    https://medium.com/@madottoandrea/language-model-as-few-shot-learner-for-task-oriented-dialogue-systems-db4765796744
  28. Models for text generation:
    https://huggingface.co/EleutherAI/gpt-j-6B
    https://huggingface.co/t5-base
    https://huggingface.co/facebook/opt-66b
    https://huggingface.co/bigscience/bloomz-3b
    https://www.forefront.ai
    https://nlpcloud.com
    https://www.goose.ai
    https://cohere.ai/generate
  29. Remove duplication during text generation: https://arxiv.org/pdf/1908.04319.pdf https://arxiv.org/pdf/1904.09751.pdf
  30. Assessing The Factual Accuracy of Text Generation https://research.google/pubs/pub48082/
  31. Evaluation of text generation survey: https://arxiv.org/abs/2006.14799
  32. BLEURT: Learning Robust Metrics for Text Generation https://arxiv.org/abs/2004.04696 https://www.youtube.com/watch?v=rl4nUngiR2k
  33. GEM metric for NLG evaluation: https://arxiv.org/abs/2102.01672
  34. Allen NLP GENIE metrics: https://genie.apps.allenai.org/
  35. Hugging Face NLG evaluation metrics: https://huggingface.co/metrics
  36. CheckList https://arxiv.org/pdf/2005.04118.pdf https://towardsdatascience.com/beyond-accuracy-behavioral-testing-of-nlp-models-with-checklist-48544db3fef1
  37. Chatbots: https://arxiv.org/abs/2001.09977 https://arxiv.org/abs/2004.13637
  38. Good reference for architecture of complex generation models (Text to video generation) http://research.baidu.com/Blog/index-view?id=134
  39. GPT: Beginner's guide to retraining GPT https://www.youtube.com/watch?v=4iK-IuvatxI&feature=youtu.be
  40. GPT2: https://www.gwern.net/GPT-2 Good poetry generation example https://medium.com/@isamu.website Understanding GPT2 source code http://jalammar.github.io/illustrated-gpt2/
  41. Train GPT2 https://github.com/Xirider/finetune-gpt2xl
  42. GPT3: https://joeddav.github.io/blog/2020/05/29/ZSL.html
    https://medium.com/analytics-vidhya/openai-gpt-3-language-models-are-few-shot-learners-82531b3d3122
    https://www.eleuther.ai/
  43. Smaller GPT3: PET https://arxiv.org/abs/2009.07118
  44. GPT3 J opensource https://studio.ai21.com/
  45. Faster and lighter transformers https://arxiv.org/abs/2103.14636
  46. Long memory language models: Transformer XL, Reformer, Longformer, Linformer, Big Bird, Pegasus
    https://github.com/allenai/naacl2021-longdoc-tutorial
  47. Comparative analysis of long memory transformers https://arxiv.org/abs/2011.04006
  48. NLP data augmentation survey https://amitness.com/2020/05/data-augmentation-for-nlp/
  49. Contextual word embeddings:
    https://github.com/facebookresearch/wsd-biencoders
    https://github.com/begab/sparsity_makes_sense
    https://github.com/HSLCY/GlossBERT
    https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_4_ELMO_BERT_FLAIR_EMBEDDING.md
    https://github.com/timoschick/bertram (very interesting, you can give multiple contexts and get word embedding)
  50. Word sense disambiguation:
    https://github.com/SapienzaNLP/ewiser
    https://github.com/getalp/disambiguate
  51. Target Sense Verification (TSV) https://medium.com/analytics-vidhya/automatic-extraction-of-hypernym-relations-from-text-using-ml-4b04eb33097f
  52. Lesk:
    https://medium.com/analytics-vidhya/comparative-word-sense-disambiguation-1c3f0f4be1fa
    http://www.nltk.org/howto/wsd.html
    https://medium.com/analytics-vidhya/automatic-extraction-of-hypernym-relations-from-text-using-ml-4b04eb33097f
    https://annabreit.medium.com/is-word-sense-disambiguation-outdated-ef05a139576
  53. WordNet: https://globalwordnet.github.io/gwadoc/
  54. Vision and NLP:
  55. ViLBERT (Combining images and text) https://papers.nips.cc/paper/8297-vilbert-pretraining-task-agnostic-visiolinguistic-representations-for-vision-and-language-tasks
  56. https://www.microsoft.com/en-us/research/blog/expanding-scene-and-language-understanding-with-large-scale-pre-training-and-a-unified-architecture/
  57. Simple N-gram language model https://nlpforhackers.io/language-models/
  58. Abstractive summarization: https://www.analyticsvidhya.com/blog/2019/06/comprehensive-guide-text-summarization-using-deep-learning-python/
    https://github.com/mathsyouth/awesome-text-summarization#abstractive-text-summarization
    http://nlpprogress.com/english/summarization.html
    https://github.com/allenai/scitldr (TLDR: Extreme Summarization of Scientific Documents)
  59. Advanced key-word extraction: (can be used for fig component generation or mirrored claim identification) https://towardsdatascience.com/data-scientists-guide-to-summarization-fc0db952e363
  60. Tabular data: https://arxiv.org/pdf/2110.01889.pdf
  61. Datasets:
    Squad https://rajpurkar.github.io/SQuAD-explorer/
    Google PAWS (Paraphrase Adversaries from Word Scrambling)
    Patent dataset https://www.ir-facility.org/prototypes/marec
    Legal documents https://zenodo.org/record/4595826
    Invoices https://www.cs.cmu.edu/~aharley/rvl-cdip/
    NLP datasets https://opendatascience.com/24-open-nlp-datasets
  62. An alternative ensembling approach called Stochastic weight averaging: https://towardsdatascience.com/stochastic-weight-averaging-a-new-way-to-get-state-of-the-art-results-in-deep-learning-c639ccf36a
  63. Python packages for PDF https://medium.com/@umerfarooq_26378/python-for-pdf-ef0fac2808b0?
    PDF figures: https://github.com/allenai/pdffigures
    https://github.com/Coleridge-Initiative/rclc
    AWS PDF read https://towardsdatascience.com/guide-on-aws-textract-set-up-8cddb8a3b3a3
    Shape detection https://www.pyimagesearch.com/2016/02/08/opencv-shape-detection/
  64. Create Visio programmatically http://www.visguy.com/2006/09/13/create-visio-flowcharts-programmatically/
  65. Extract data from MS word using Python: https://towardsdatascience.com/how-to-extract-data-from-ms-word-documents-using-python-ed3fbb48c122
  66. Outlier detection:
    https://scikit-learn.org/stable/modules/outlier_detection.html
    Median absolute deviation https://stackoverflow.com/questions/22354094/pythonic-way-of-detecting-outliers-in-one-dimensional-observation-data
    https://github.com/yzhao062/pyod
    DBSCAN
    HDBSCAN https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
  67. Text annotation tool: https://github.com/CogComp/talen

MLOps:

  1. https://app.experiencewelcome.com/events/vgujOo/stages/XKfD09
  2. https://stanford-cs329s.github.io/syllabus.html
  3. https://arxiv.org/ftp/arxiv/papers/2205/2205.02302.pdf

Data Labeling:

  1. NLP: Figure 8, Scale.AI, Heartex
  2. www.v7labs.com
