georesearch/ccq/ccq_explore.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "80e0cc42",
   "metadata": {},
   "source": [
    "# 0 : Cosa c'è in questo notebook\n",
    "Scaricamanto ed utilizzo modelli di Natural Language Processing da huggingface.com per fare un po di esperimenti ed esempi con dati testuali per il progetto __geografia__.\n",
    "\n",
    "In particolare si esplorano un po di metodi e strumenti per la comprensione e annotazione dei testi, sfruttando modelli già allenati su task che ci posssono interessare come ad esempio :\n",
    "- __Mask-Filling__ (con huggingface transformer): predice un token mancante in una frase\n",
    "- __Zero-Shot classification__ (con huggingface transformer): classifica una frase rispetto a delle classi variabili (utile per topic analysis, sentiment retrival e menate generiche)\n",
    "- __Pos-Tagging__ (con la libreria spacy) : analisi grammaticale e sintattica del testo\n",
    "\n",
    "#### 0.1 Transformer Models da hugging face\n",
    "I modelli pre-allenati sono stati sottoposti ad estensivi training con tantissimi dati, in genere macinati per giorni. i task non-supervisionati tipici utilizzati per l'allenamento sono in genere:\n",
    "- next sentence prediction ( prendi un paragrafo, lo splitti sulla base del carattere '.' ed usi la prima frase come dato e la seconda come label da predirre)\n",
    "- fill-mask: predirre la parola mancante di una frase ( nel training si prendono frasi complete e si maschera un termine a random per frase)\n",
    "\n",
    "Accanto a questi task usati per allenare la struttura della rete neurale (aka \"il grosso\" dei nodi profondi), i modelli di huggingface possono essere già \"fine-tunati\" rispetto ad ulteriori downstream task (tipo sentiment analysis o zero shot) e quindi utilizzabili direttamente. per alcune coppie di modelli-task invece è necessario un ulteriore sessione di fine-tuning con un dataset rilevante per il task d'interesse.\n",
    "Qua c'è una lista delle tipologie di task piu comuni:\n",
    "https://huggingface.co/tasks\n",
    " \n",
    "Infine un altra differenza sostanziale che ci interessa è se il modello sia:\n",
    "- monolingua\n",
    "- multilingua\n",
    "\n",
    "i modelli considerati per questo girovagare sono:\n",
    "- il modello __dbmdz/bert-base-italian-cased__ è piu leggero e solo in italiano, ma di task ready-to-go disponibili c'è solo fill-mask e pos-tag\n",
    "- il modello __facebook/bart-large-mnli__ è molto pesante, ma è multilingua ed ha implementato lo zero-shot-classification\n",
    "- il modello __bert-base-cased__ solo inglese ma con praticamente tutti i downstream task già pronti all'utilizzo\n",
    "\n",
    "#### 0.2 Spacy Library\n",
    "https://spacy.io/\n",
    "libreria python piu semplice ed efficiente rispetto ai modelli transformer di hugging face, ma meno versatile per quantità di task possibili. ottima per pulire testi e tokenization (preprocessing), molto buono il pos-tagging\n",
    "\n"
   ]
  },
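  {
   "cell_type": "markdown",
   "id": "e1a00001",
   "metadata": {},
   "source": [
    "A minimal sketch of how a fill-mask training example could be built from a plain sentence, following the description above. It is only an illustration: the helper `make_masked_example` and the whitespace tokenization are assumptions made here, while real models use subword tokenizers and their own masking schedule."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1a00002",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "def make_masked_example(sentence, rng, mask_token='[MASK]'):\n",
    "    \"\"\"Hypothetical helper: hide one random word and return (masked_sentence, label).\"\"\"\n",
    "    words = sentence.split()  # naive whitespace tokenization, for illustration only\n",
    "    idx = rng.randrange(len(words))  # position of the word to hide\n",
    "    label = words[idx]  # the hidden word is what the model must predict\n",
    "    words[idx] = mask_token\n",
    "    return ' '.join(words), label\n",
    "\n",
    "rng = random.Random(0)  # fixed seed so the example is reproducible\n",
    "masked, label = make_masked_example('Umberto Eco è stato un grande scrittore', rng)\n",
    "print(masked, '->', label)"
   ]
  },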
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "756d5d05",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/agropunx/anaconda3/envs/geografia/lib/python3.8/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n",
      "2022-07-06 20:22:23.485088: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n",
      "2022-07-06 20:22:23.485116: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.\n"
     ]
    }
   ],
   "source": [
    "from transformers import AutoModel, AutoTokenizer \n",
    "from transformers import pipeline\n",
    "multilingual_model_name = \"facebook/bart-large-mnli\" # 1.5 GB\n",
    "italian_model_name = \"dbmdz/bert-base-italian-cased\" # 422 MB\n",
    "alltask_model_name = \"bert-base-cased\" \n",
    "\n",
    "## Download model and configuration from huggingface.co and cache. \n",
    "#muli_tokenizer = AutoTokenizer.from_pretrained(multilingual_model_name)\n",
    "#muli_model = AutoModel.from_pretrained(multilingual_model_name)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f1657a9",
   "metadata": {},
   "source": [
    "## 1 : Fill-Mask task example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "2ace43c4",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of the model checkpoint at dbmdz/bert-base-italian-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']\n",
      "- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
      "- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'score': 0.9144049286842346,\n",
       "  'token': 482,\n",
       "  'token_str': 'stato',\n",
       "  'sequence': 'Umberto Eco è stato un grande scrittore'},\n",
       " {'score': 0.025699790567159653,\n",
       "  'token': 5801,\n",
       "  'token_str': 'diventato',\n",
       "  'sequence': 'Umberto Eco è diventato un grande scrittore'},\n",
       " {'score': 0.022715440019965172,\n",
       "  'token': 409,\n",
       "  'token_str': 'anche',\n",
       "  'sequence': 'Umberto Eco è anche un grande scrittore'},\n",
       " {'score': 0.006274252198636532,\n",
       "  'token': 1402,\n",
       "  'token_str': 'oggi',\n",
       "  'sequence': 'Umberto Eco è oggi un grande scrittore'},\n",
       " {'score': 0.004773850552737713,\n",
       "  'token': 14743,\n",
       "  'token_str': 'divenuto',\n",
       "  'sequence': 'Umberto Eco è divenuto un grande scrittore'}]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fill = pipeline('fill-mask',model=italian_model_name)\n",
    "masked_sentence = 'Umberto Eco è [MASK] un grande scrittore'\n",
    "fill(masked_sentence)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "df12c632",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'score': 0.1162119209766388,\n",
       "  'token': 7607,\n",
       "  'token_str': 'troviamo',\n",
       "  'sequence': 'a che ora ci troviamo?'},\n",
       " {'score': 0.06763438880443573,\n",
       "  'token': 17567,\n",
       "  'token_str': 'aspettiamo',\n",
       "  'sequence': 'a che ora ci aspettiamo?'},\n",
       " {'score': 0.06530702114105225,\n",
       "  'token': 1303,\n",
       "  'token_str': 'sarà',\n",
       "  'sequence': 'a che ora ci sarà?'},\n",
       " {'score': 0.05703262239694595,\n",
       "  'token': 4238,\n",
       "  'token_str': 'vediamo',\n",
       "  'sequence': 'a che ora ci vediamo?'},\n",
       " {'score': 0.05302278324961662,\n",
       "  'token': 17387,\n",
       "  'token_str': 'vedete',\n",
       "  'sequence': 'a che ora ci vedete?'}]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "masked_sentence = 'a che ora ci [MASK]?'\n",
    "fill(masked_sentence)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "6d87ab9c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'score': 0.36587169766426086,\n",
       "  'token': 510,\n",
       "  'token_str': 'cosa',\n",
       "  'sequence': 'a che cosa ci troviamo?'},\n",
       " {'score': 0.046960778534412384,\n",
       "  'token': 739,\n",
       "  'token_str': 'modo',\n",
       "  'sequence': 'a che modo ci troviamo?'},\n",
       " {'score': 0.04375715181231499,\n",
       "  'token': 212,\n",
       "  'token_str': 'non',\n",
       "  'sequence': 'a che non ci troviamo?'},\n",
       " {'score': 0.023330960422754288,\n",
       "  'token': 1302,\n",
       "  'token_str': 'livello',\n",
       "  'sequence': 'a che livello ci troviamo?'},\n",
       " {'score': 0.02314317226409912,\n",
       "  'token': 1711,\n",
       "  'token_str': 'condizioni',\n",
       "  'sequence': 'a che condizioni ci troviamo?'}]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "masked_sentence = 'a che [MASK] ci troviamo?'\n",
    "fill(masked_sentence)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ffa2e70f",
   "metadata": {},
   "source": [
    "## 2 : Zero-Shot classification task example\n",
    "\n",
    "uso modello multilingua ( unico disponibile su huggingface in grado di performare l'inferenza zero-shot in italiano)\n",
    " \n",
    "link generico in italiano sullo __zero-shot__ learning e inference: https://zephyrnet.com/it/Zero-shot-learning-puoi-classificare-un-oggetto-senza-vederlo-prima/"
   ]
  },
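  {
   "cell_type": "markdown",
   "id": "e1a00003",
   "metadata": {},
   "source": [
    "Zero-shot classification with an NLI model works by turning every candidate label into a hypothesis and asking the model whether the input sentence entails it. The cell below is a minimal sketch of the pair construction only, with no model involved; the variable names are illustrative and the example values are the ones used later in this section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1a00004",
   "metadata": {},
   "outputs": [],
   "source": [
    "sequence = 'dove vai stasera?'\n",
    "hypothesis_template = 'questa frase è una {}'\n",
    "candidate_labels = ['affermazione', 'domanda']\n",
    "\n",
    "# One (premise, hypothesis) pair per candidate label; the NLI model scores each pair\n",
    "pairs = [(sequence, hypothesis_template.format(label)) for label in candidate_labels]\n",
    "for premise, hypothesis in pairs:\n",
    "    print(premise, '|', hypothesis)"
   ]
  },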
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ba49a3f4",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of the model checkpoint at dbmdz/bert-base-italian-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']\n",
      "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
      "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
      "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-italian-cased and are newly initialized: ['classifier.weight', 'classifier.bias']\n",
      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n",
      "Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.\n"
     ]
    }
   ],
   "source": [
    "zs = pipeline(\"zero-shot-classification\",model=italian_model_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f11053eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "zs('che merda di giornata', candidate_labels=['positivo','negativo'],hypothesis_template='questa frase ha un tono {}',) #multiclass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c151ae8",
   "metadata": {},
   "outputs": [],
   "source": [
    "zs(sequences='dove vai stasera?',hypothesis_template='questa frase è una {}',candidate_labels=['affermazione','domanda'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "87c76244",
   "metadata": {},
   "outputs": [],
   "source": [
    "zs(sequences='where are you going?',hypothesis_template='this phrase is a {}',candidate_labels=['question','affirmation'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a29d6a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "zs(sequences='Voglio uscire con voi stasera, ma devo pulire casa prima',candidate_labels=['dubbio','invito','carro','positivo','negativo'], multiclass=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "fcf8ff42",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'sequence': 'Sono felice di uscire con voi stasera',\n",
       " 'labels': ['relax',\n",
       "  'pace',\n",
       "  'indifferenza',\n",
       "  'interesse',\n",
       "  'fastidio',\n",
       "  'dubbio',\n",
       "  'noia',\n",
       "  'rabbia',\n",
       "  'sconforto',\n",
       "  'calma',\n",
       "  'tristezza',\n",
       "  'eccitazione',\n",
       "  'gioia',\n",
       "  'felicità'],\n",
       " 'scores': [0.07741541415452957,\n",
       "  0.07665539532899857,\n",
       "  0.0763017013669014,\n",
       "  0.07597608864307404,\n",
       "  0.07575443387031555,\n",
       "  0.07570748031139374,\n",
       "  0.0756787359714508,\n",
       "  0.07537700980901718,\n",
       "  0.07522716373205185,\n",
       "  0.07521089166402817,\n",
       "  0.07490663230419159,\n",
       "  0.07221430540084839,\n",
       "  0.07031130790710449,\n",
       "  0.02326352894306183]}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#https://www.google.it/search?q=thayer+mood+plane+italia&biw=1528&bih=788&tbm=isch&source=iu&ictx=1&vet=1&fir=idr_CWra3dTA3M%252C3vr2PPAMir9TdM%252C_%253BEhLHcfw0fkhWHM%252CEh6xJHBGzk4cMM%252C_%253BaP-KPSMboAJg2M%252CnleMstLur1huUM%252C_%253BxHGASYvcnwBR5M%252CrZKJnx3na11TtM%252C_%253Bl2JjpwGxrk9aOM%252CEh6xJHBGzk4cMM%252C_%253BGC1du1ZkYglw-M%252Co4brRqZQk5p8CM%252C_%253BO8E8kjpRcFlMWM%252CnleMstLur1huUM%252C_%253BoYYMLq4zR-JFeM%252Co4brRqZQk5p8CM%252C_%253B1CIJmqw2RNg0VM%252C5o_XDTGFr1iuXM%252C_%253BaDO14J7XISRjIM%252CEh6xJHBGzk4cMM%252C_&usg=AI4_-kSIgRNOu3F0-x0YYMbNCEIyLXfEzg&sa=X&ved=2ahUKEwiyqNDN_cD3AhVc7rsIHWi3CFYQ9QF6BAghEAE#imgrc=EhLHcfw0fkhWHM\n",
    "moods=['fastidio','rabbia','sconforto','tristezza','noia','pace','relax','calma','felicità','gioia','indifferenza','interesse','dubbio','eccitazione']\n",
    "zs(sequences='Sono felice di uscire con voi stasera',candidate_labels=moods)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "8ec3e16f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'sequence': 'ti ammazzo',\n",
       " 'labels': ['noia',\n",
       "  'eccitazione',\n",
       "  'pace',\n",
       "  'dubbio',\n",
       "  'gioia',\n",
       "  'rabbia',\n",
       "  'indifferenza',\n",
       "  'tristezza',\n",
       "  'sconforto',\n",
       "  'felicità',\n",
       "  'fastidio',\n",
       "  'calma',\n",
       "  'relax',\n",
       "  'interesse'],\n",
       " 'scores': [0.07284864038228989,\n",
       "  0.07268563657999039,\n",
       "  0.07168520241975784,\n",
       "  0.07163083553314209,\n",
       "  0.07157977670431137,\n",
       "  0.07146152853965759,\n",
       "  0.07145626842975616,\n",
       "  0.07143832743167877,\n",
       "  0.07120247930288315,\n",
       "  0.07110902667045593,\n",
       "  0.07106047123670578,\n",
       "  0.07105831056833267,\n",
       "  0.07071790099143982,\n",
       "  0.07006552070379257]}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sentenza = 'ti ammazzo'\n",
    "zs(sequences=sentenza,candidate_labels=moods)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "6a806191",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'sequence': 'non capisco perchè te la prendi con me, ti ho già regalato tutte le mi capre, non posso darti anche il bue',\n",
       " 'labels': ['eccitazione',\n",
       "  'indifferenza',\n",
       "  'gioia',\n",
       "  'relax',\n",
       "  'felicità',\n",
       "  'dubbio',\n",
       "  'tristezza',\n",
       "  'fastidio',\n",
       "  'pace',\n",
       "  'rabbia',\n",
       "  'noia',\n",
       "  'interesse',\n",
       "  'sconforto',\n",
       "  'calma'],\n",
       " 'scores': [0.07354768365621567,\n",
       "  0.07300304621458054,\n",
       "  0.07294241338968277,\n",
       "  0.07219138741493225,\n",
       "  0.07172053307294846,\n",
       "  0.07144749164581299,\n",
       "  0.07143128663301468,\n",
       "  0.07104679197072983,\n",
       "  0.07093875855207443,\n",
       "  0.07085791230201721,\n",
       "  0.0706729143857956,\n",
       "  0.07063259184360504,\n",
       "  0.07041624188423157,\n",
       "  0.06915094703435898]}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sentenza = 'non capisco perchè te la prendi con me, ti ho già regalato tutte le mi capre, non posso darti anche il bue'\n",
    "zs(sequences=sentenza,candidate_labels=moods, multiclass=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "0d2d0f93",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'sequence': 'le tue orecchie sono bruttissime, non uscirò mai con te',\n",
       " 'labels': ['logica', 'attualità', 'cotillons', 'estetica'],\n",
       " 'scores': [0.2509896755218506,\n",
       "  0.25027403235435486,\n",
       "  0.24949680268764496,\n",
       "  0.2492394745349884]}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "zs(sequences='le tue orecchie sono bruttissime, non uscirò mai con te',candidate_labels=['estetica','logica','attualità','cotillons'], multiclass=True)"
   ]
  },
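  {
   "cell_type": "markdown",
   "id": "e1a00005",
   "metadata": {},
   "source": [
    "The scores recorded above are nearly uniform (about 1/14 ≈ 0.071 with the 14 mood labels), which is what a softmax over almost identical logits produces. The cell below is a small sketch of that effect; the logits are made up for illustration and do not come from any model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1a00006",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "def softmax(logits):\n",
    "    \"\"\"Turn raw scores into probabilities that sum to 1.\"\"\"\n",
    "    top = max(logits)\n",
    "    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability\n",
    "    total = sum(exps)\n",
    "    return [e / total for e in exps]\n",
    "\n",
    "# Made-up logits: an untrained head gives nearly the same score to every label\n",
    "flat = softmax([0.01 * i for i in range(14)])\n",
    "# Made-up logits: a trained head would give a clearly higher score to one label\n",
    "peaked = softmax([5.0] + [0.0] * 13)\n",
    "print(round(max(flat), 3), round(max(peaked), 3))"
   ]
  },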
  {
   "cell_type": "markdown",
   "id": "64c9b9af",
   "metadata": {},
   "source": [
    "## 3 : POS-TAG task example\n",
    "\n",
    "Positional tagging, task già noto alla maggior parte delle persone aventi la terza elementare come __analisi grammaticale__\n",
    "Nell'analisi grammaticale italiana ogni parola può appartenere ad una delle nove categorie lessicali dell'italiano, cinque variabili: articolo, aggettivo, sostantivo o nome, pronome, verbo, e quattro invariabili: avverbio, preposizione, congiunzione, interiezione o esclamazione. "
   ]
  },
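  {
   "cell_type": "markdown",
   "id": "e1a00007",
   "metadata": {},
   "source": [
    "spaCy reports Universal POS tags (`PROPN`, `AUX`, `ADP`, ...) rather than the nine traditional Italian categories listed above. The cell below is a rough sketch of a mapping between the two. The dictionary `UPOS_TO_ITALIAN` is an assumption made for illustration, not part of spaCy, and the correspondence is only approximate (for example, `DET` also covers possessives, which school grammar treats as adjectives)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1a00008",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical, approximate mapping from Universal POS tags to the nine Italian categories\n",
    "UPOS_TO_ITALIAN = {\n",
    "    'DET': 'articolo', 'ADJ': 'aggettivo', 'NOUN': 'nome', 'PROPN': 'nome',\n",
    "    'PRON': 'pronome', 'VERB': 'verbo', 'AUX': 'verbo', 'ADV': 'avverbio',\n",
    "    'ADP': 'preposizione', 'CCONJ': 'congiunzione', 'SCONJ': 'congiunzione',\n",
    "    'INTJ': 'interiezione',\n",
    "}\n",
    "\n",
    "def italian_category(upos):\n",
    "    \"\"\"Return the traditional category for a tag, or 'altro' for tags without one (PUNCT, NUM, ...).\"\"\"\n",
    "    return UPOS_TO_ITALIAN.get(upos, 'altro')\n",
    "\n",
    "# Word/tag pairs copied from the spaCy output recorded in this notebook for 'Londra è una grande città'\n",
    "for word, upos in [('Londra', 'PROPN'), ('è', 'AUX'), ('una', 'DET'), ('grande', 'ADJ'), ('città', 'NOUN')]:\n",
    "    print(f'{word:<10}{upos:<8}{italian_category(upos)}')"
   ]
  },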
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "118e6275",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "9b74db59",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "testo originale --->  San Francisco prevede di bandire i robot di consegna porta a porta\n",
      "\n",
      "*** Pos-Tagging\n",
      "WORD        POS-TAG     SYNTACTIC DEP\n",
      "-------------------------------------\n",
      "San         PROPN       nsubj\n",
      "Francisco   PROPN       flat:name\n",
      "prevede     VERB        ROOT\n",
      "di          ADP         mark\n",
      "bandire     VERB        xcomp\n",
      "i           DET         det\n",
      "robot       NOUN        obj\n",
      "di          ADP         case\n",
      "consegna    NOUN        nmod\n",
      "porta       VERB        advcl\n",
      "a           ADP         case\n",
      "porta       NOUN        obl\n",
      "\n",
      "*** Analisi Entità specifiche nel testo:\n",
      "ENTITY NAME           LABEL\n",
      "-------------------------------------\n",
      "San Francisco         LOC\n"
     ]
    }
   ],
   "source": [
    "import spacy\n",
    "from spacy.lang.it.examples import sentences \n",
    "\n",
    "try:\n",
    "    nlp = spacy.load(\"it_core_news_lg\")\n",
    "except:\n",
    "    !python -m spacy download it_core_news_lg\n",
    "    nlp = spacy.load(\"it_core_news_lg\")\n",
    "doc = nlp(sentences[2])\n",
    "print('testo originale ---> ',doc.text)\n",
    "print('\\n*** Pos-Tagging')\n",
    "print('WORD', ' '*(10-len('word')), 'POS-TAG',' '*(10-len('POS-TAG')), 'SYNTACTIC DEP')\n",
    "print('-------------------------------------')\n",
    "for token in doc:\n",
    "    print(token.text,' '*(10-len(token.text)), token.pos_,' '*(10-len(token.pos_)), token.dep_ )\n",
    "\n",
    "print('\\n*** Analisi Entità specifiche nel testo:')    \n",
    "print('ENTITY NAME',' '*(20-len('ENTITY NAME')), 'LABEL')\n",
    "print('-------------------------------------')\n",
    "for ent in doc.ents:\n",
    "    print(ent.text ,' '*(20-len(ent.text)), ent.label_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "92712a0d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "testo originale --->  Londra è una grande città del Regno Unito.\n",
      "\n",
      "*** Pos-Tagging\n",
      "WORD        POS-TAG     SYNTACTIC DEP\n",
      "-------------------------------------\n",
      "Londra      PROPN       nsubj\n",
      "è           AUX         cop\n",
      "una         DET         det\n",
      "grande      ADJ         amod\n",
      "città       NOUN        ROOT\n",
      "del         ADP         case\n",
      "Regno       PROPN       nmod\n",
      "Unito       PROPN       flat:name\n",
      ".           PUNCT       punct\n",
      "\n",
      "*** Analisi Entità specifiche nel testo:\n",
      "ENTITY NAME           LABEL\n",
      "-------------------------------------\n",
      "Londra                LOC\n",
      "Regno Unito           LOC\n"
     ]
    }
   ],
   "source": [
    "doc = nlp(sentences[3])\n",
    "print('testo originale ---> ',doc.text)\n",
    "print('\\n*** Pos-Tagging')\n",
    "print('WORD', ' '*(10-len('word')), 'POS-TAG',' '*(10-len('POS-TAG')), 'SYNTACTIC DEP')\n",
    "print('-------------------------------------')\n",
    "for token in doc:\n",
    "    print(token.text,' '*(10-len(token.text)), token.pos_,' '*(10-len(token.pos_)), token.dep_ )\n",
    "\n",
    "print('\\n*** Analisi Entità specifiche nel testo:')    \n",
    "print('ENTITY NAME',' '*(20-len('ENTITY NAME')), 'LABEL')\n",
    "print('-------------------------------------')\n",
    "for ent in doc.ents:\n",
    "    print(ent.text ,' '*(20-len(ent.text)), ent.label_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "7ab8299b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "testo originale --->  non capisco perchè te la prendi con Giacomo, ti ha già regalato tutte le sue capre, non può darti anche il bue\n",
      "\n",
      "*** Pos-Tagging\n",
      "WORD        POS-TAG     SYNTACTIC DEP\n",
      "-------------------------------------\n",
      "non         ADV         advmod\n",
      "capisco     VERB        ROOT\n",
      "perchè      SCONJ       mark\n",
      "te          PRON        expl\n",
      "la          PRON        obj\n",
      "prendi      VERB        ccomp\n",
      "con         ADP         case\n",
      "Giacomo     PROPN       obl\n",
      ",           PUNCT       punct\n",
      "ti          PRON        iobj\n",
      "ha          AUX         aux\n",
      "già         ADV         advmod\n",
      "regalato    VERB        conj\n",
      "tutte       DET         det:predet\n",
      "le          DET         det\n",
      "sue         DET         det:poss\n",
      "capre       NOUN        obj\n",
      ",           PUNCT       punct\n",
      "non         ADV         advmod\n",
      "può         AUX         aux\n",
      "darti       VERB        conj\n",
      "anche       ADV         advmod\n",
      "il          DET         det\n",
      "bue         NOUN        obj\n",
      "\n",
      "*** Analisi Entità specifiche nel testo:\n",
      "ENTITY NAME           LABEL\n",
      "-------------------------------------\n",
      "Giacomo               PER\n"
     ]
    }
   ],
   "source": [
    "doc = nlp(sentenza.replace('me','Giacomo').replace('mi','sue').replace('ho','ha').replace('posso','può'))\n",
    "print('testo originale ---> ',doc.text)\n",
    "print('\\n*** Pos-Tagging')\n",
    "print('WORD', ' '*(10-len('word')), 'POS-TAG',' '*(10-len('POS-TAG')), 'SYNTACTIC DEP')\n",
    "print('-------------------------------------')\n",
    "for token in doc:\n",
    "    print(token.text,' '*(10-len(token.text)), token.pos_,' '*(10-len(token.pos_)), token.dep_ )\n",
    "\n",
    "print('\\n*** Analisi Entità specifiche nel testo:')    \n",
    "print('ENTITY NAME',' '*(20-len('ENTITY NAME')), 'LABEL')\n",
    "print('-------------------------------------')\n",
    "for ent in doc.ents:\n",
    "    print(ent.text ,' '*(20-len(ent.text)), ent.label_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f5fafff",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "geografia",
   "language": "python",
   "name": "geografia"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}