{
"cells": [
{
"cell_type": "markdown",
"id": "538a6c6a",
"metadata": {},
"source": [
"# ChatCalledQuest demo\n",
"\n",
"The `ccq` module is designed as Geografia's backend engine. It translates written Italian into written LIS-glossed text (comprensione) and vice versa (produzione). LIS stands for Lingua dei Segni Italiana, the Italian Sign Language.\n",
"\n",
"Sign languages have no exact written counterpart, because signing combines gestures and facial expressions simultaneously rather than linearly. For this reason a gloss-level intermediate representation of the signed message (essentially a simplified version of the spoken language) is adopted."
]
},
{
"cell_type": "markdown",
"id": "4db96029",
"metadata": {},
"source": [
"### LIS <--> ITA scheme, courtesy of our beloved Ilenia"
]
},
{
"attachments": {
"schema_lis_ilenia-2.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAqQAAAHLCAIAAACRbh3sAAAAA3NCSVQICAjb4U/gAAAgAElEQVR4nOydZ2Adx3mu35nZenrHQQdJsHdSFFUpUb2LKpYs27EdW47juFz7Jrl2yk2c3NwUx75xSWzLkotkyVbvXaIskRJVKIq9ACAJECA6Ds45OH3LzP0BgARAUJITWzaReX4BZ7/dKbs778w338wSIQQkEolEIpHMXOjvOgMSiUQikUh+u0ixl0gkEolkhiPFXiKRSCSSGY4Ue4lEIpFIZjhS7CUSiUQimeFIsZdIJBKJZIYjxV4ikUgkkhmOFHuJRCKRSGY4UuwlEolEIpnhSLGXSCQSiWSGI8VeIpFIJJIZjhR7iUQikUhmOFLsJRKJRCKZ4Uixl0gkEolkhiPFXiKRSCSSGY4Ue4lEIpFIZjhS7CUSiUQimeFIsZdIJBKJZIYjxV4ikUgkkhmOFHuJRCKRSGY4UuwlEolEIpnhSLGXSCQSiWSGI8VeIpFIJJIZjhR7iUQikUhmOFLsJRKJRCKZ4Uixl0gkEolkhiPFXiKRSCSSGY4Ue4lEIpFIZjhS7CUSiUQimeFIsZdIJBKJZIYjxV4ikUgkkhmOFHuJRCKRSGY4UuwlEolEIpnhSLGXSCQSiWSGI8VeIpFIJJIZjvK7zoBE8t+eQsvjP71/07ZdxdWf+583nV2dfm7Cf80J/QPKBeeCEBBCPqD0JBLJB4gc2Uskv1ucnXd8/eHcypu/8OHV0aBa2n7H3z2SWzH2n80/oExUXv3Wxy655W+f2t1jfUApSiSSDxI5spdIfre0vPZahzg97GtY+/FlRGm5bUsHXxP2NZzx8WVEUdhvMWF++IWX0kuXzE0mA9Bnn37ZhuraufGA+ltMUSKR/K6QYi+RfGC4pcGjHd1DZS3eMKs6bKoUAAaHUrbLCaGKqgKpoZTtuuP/AYAoD3d29Fu+RF3V+Cmj2JkjfXY8FvKYKgC4hf7O7pwer4kHPao9fORwT8lX05AMeVQKwC0NHT3SnSqpsfqmZMSjUeRe/sV/POL/02RtMhkAqs+65Y84URT2bl58tzTU3dk9VFSj9U3VYY/GAJHv6cjq8UjIZ/42OyYSieS/hhR7ieSD4pV/+fxdQ7E59f7Wp5+3LvyLv/3UhfGep35+z4utnT0j//C1dxacfX5t7o0XW4/0ZP/ha9sXXvYnX7xhDd3+na/fvts/Zxba3joy5yN/9fnVlY0/+edv7/afVVvp2LNDu+Gfr1XeeuTu18WS2X63MNiyo40vuujcBpIZ6DtyYM9R31Vf//5Xb1jd4N/8rf/x875gQ33g0LMvVs7707++sebg/T988K13yP/p2nrBp76ylrz96A//fd+C//0vX15dfva7X//eM9l5V336S1+5ca336Mt3//u3X3Av+sM/+4T3yb99uMdTVx9qf35j+Zwv/+WnL1sSfu07X/iLXzrX//s/3Xre4pic7pdIfl+RYi+RfFBE19z8pUVLmuKxVr7zs6+93nnVqjlLr/7j699+aNvhS/78a3945sI4EY3Dr+w6dNGff+1TZy1KavmNf/Nvz6tX/M0ff/iM2d13/OHnHnlk7fLPnbm46ofbnHm3/MXf1Le9NbR2daj4xkM71PWf+9J1K+srj//FLd9K1f3ZV/9ybXNo/+1/9NnHt+4bOH9hgz962k2fnzu/sSpxkOz53KY3Oq743CUXrLzjAXfxV7722QtW1DME25/86QGn4vDk2k988oqXX7vTTjQnPCEjHJztMVdec+3FZzTHM6fd+Cdz5jYmqw4pBz7/ypvtvactrJq14qx1eX1FMuKVSi+R/B4jxV4i+aBYtP5CUFHO57iAUygWXYczVVcVSpiqG4aha4DGKGWaYRiGxnBg+/ZuY27MqZSGhrw+v3uk81CxPIdRYgRjsWBo4YarBKV9jBJKFU03Pd5ofU2ID1iUqR6Pt7EmbrjFguO4wILz1gNuuZ
AXnDjFUsHlVFEoZYqmG7rKAIXR0Th8yozl11259qFvbXml48rF80Ldz+9PrP5QbWNSJcl15wO8XMhzDqdYKjqOQ5su/dL/vQhMVWVTIpH8PiPfUInkg6G0/6F//sYDrSTZ1KC1Hh4u1on3OqMwNDRi9+18+oG7dkdNOP6zLli2NBnQWgFCCCGgigJg4oj62MI5cfwfAZRbHv3Gv96/3403NuqHDqUKcQi8S+q04cpr1t39tS3P7b22ue/5rtpVZybqQrTc8sg3v3X/Xjva2GAcbhsqhAQAUEXTZSsikfzeI19TieQDIf+ru37wEtnwT39682lz+u9oefGp945nM0xTJcnVN3/mC+uXVDEAoIqm9mz6tZN++Z7bXnIv+ZuvfPSseYM/PfTyE+/12pPQ+Rsumfvcwy9svMvpa/7kZclaD8k//8vbX7LX/+X/+Ng5C1J3Htn8uAzbl0hOIeQ6e4nkA8Eq5AsO0XTTNAxNoe9n7xq2ZNH8UO+Bt/oLBWoYhjF64n8q6aJDNMMwDUNT2FjSjFJiWSXuuieeQfRVG65cJXZ8/750zdxEbbVOYBfzRftY/tlYPkTXS7f9wz/f+crBwfKvny+JRPKBwb7+9a//rvMgkfw3wPCMvPPEfY+/+Mbmlza1tu7bdjCNxPqVeuvjj9z32JaW4Vyee+rpwScfe/Tx1w4Mj+S5p7GpqnZxs69r4113PPyr119/9aXnXjiIWaT9pacfe+ado2UnULu4KRFIvf3gfQ88ve1oWa+eHxp49dknHtt0IMNDTU38wAtPPf701s5KuPnMlaGulx978sXXN23c3NK6d9vBYR47b9VidviVR194ZcfOlpLIvL1543O/2tlbdoJ1C5viQY9Kgw2i7ZlDNVd+9NJzFsQNAsMc2fnsg0+8sGXzS5sPtOzZdjDFo+tW1Xc+8W/fvXOXd91ZS5sSMkZPIvm9hQjxnjOHEonkvw63c71t+zsLnurG+qiHgjDNMBTh2JWKxamiKJqmwJ3wn8YYEXa5kOnr7OjJknBtY11VxKO4VqXiCKZomq4yKrhVKVccwVRNVym3rYrNiaJqGoNrWxVHMFXXVRT7Dx7ozBtVDfUxDwNlmmGoIte970BX2Vc3tzHhVYVlORi/KiGAKHXtP4LqmkQooBMA3M73HTrQmdOrGuqjHkZGL+IOte0d1OsaamJ+TWq9RPJ7ixR7ieQDQ3CXC0Lp+3LiHz9LcM4Ffu3z3jvp98iP4HwsFPBd7AXngkyykkgkv39IsZdIJBKJZIYjA/QkEolEIpnhSLGXSCQSiWSGI8VeIpFIJJIZjhR7iUQikUhmOKfCDnq9L/Xe+apVJjTeVPPJG5nXHD/AKzteLAxEveuWaYZKYLsDg47FBQAQFk4qHkYIUE5bw0XBhSAKoQ5cnNzAUCOKmy4IPjVqkQaTqs91B1Pu2PVBqKFEw0xj47uVVpyBYddyx04khOgBNeylbKw7JQopK1sWXAAgZlgLecjox0TdvNW6r3h0yBWGWr/A01ylqFO2VuN8eMAuORPzNB74zJjf4IWS4ByYtP8pDSZVL3dTKbfiCjAlEmemMjleuuT0Z1zbFcTQw4qTLXKXC4BQTYlGmKZgcnC1KPSV9rdWhvLQI9q8RWbST0fzXzqUffLlfFuXU+b6xX8SW5NUjCn3j2d6Rjp6yyMO9YTM+lpvzEuPfUfVLZc7O/I9aZtrWrLe3xhVtSnfWHXs3qO5rgGryInu0RJJX21UMSbaTDXw1kZVgxG7WMkUT9wwhnojmlfh+bRdsk+ITSUAZV7mlh24x45NMDIDus+gjADcGerOHemr5DkNxH1z6ky/dvJodFGCewg8BaGCzQKLg0x57xy4R+D2QrggHtBa0AToZBvRB6cTvAhooHGwOlATsMEzEM7UFIkP1ITIglvTZYiCMAgH026ZS4IgJggBXPCjcLohHNAEWCOoOY39Mdy9oDUg4WM5hvMOOIUyF9Q3XbUU4LaDpyAoSBCsHjQIQi
cbdIAPQVCQAFjDVAO3FUiAhsaeVTEMuwWogloHokGUIHIQo0+ABhoAUQFApMHLY/UAMV4J0xkQL4gX5CRbHfIj4Dpo
}
},
"cell_type": "markdown",
"id": "f456d24f",
"metadata": {},
"source": [
"![schema_lis_ilenia-2.png](attachment:schema_lis_ilenia-2.png)"
]
},
{
"cell_type": "markdown",
"id": "333fdcf3",
"metadata": {},
"source": [
"## Comprensione : spoken2gloss examples\n",
"\n",
"The input is a plain string."
]
},
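{
"cell_type": "code",
"execution_count": null,
"id": "f6c3cbe9",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"# Disable tokenizer parallelism before any model is loaded (avoids fork-related warnings)\n",
"os.environ['TOKENIZERS_PARALLELISM'] = 'false'\n",
"import spacy\n",
"import spacy_transformers  # registers the transformer components with spaCy\n",
"import ccq, config\n",
"import importlib  # used to reload ccq after editing it"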
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e73136b0",
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"importlib.reload(ccq)\n",
"engine = ccq.CCQ()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dbe3f699",
"metadata": {},
"outputs": [],
"source": [
"affirmative_spoken = 'Luca va in Spagna la prossima estate'\n",
"engine.translate(affirmative_spoken, direction='spoken2gloss')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "487719bf",
"metadata": {},
"outputs": [],
"source": [
"affirmative_spoken2 = 'Stasera voglio bere una birra'\n",
"engine.translate(affirmative_spoken2, direction='spoken2gloss')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "72eaf720",
"metadata": {},
"outputs": [],
"source": [
"negative_spoken = 'Luca non va in Spagna la prossima estate'\n",
"engine.translate(negative_spoken, direction='spoken2gloss')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "319a6382",
"metadata": {},
"outputs": [],
"source": [
"closed_interrogative_spoken = 'Luca va in Spagna la prossima estate?'\n",
"engine.translate(closed_interrogative_spoken, direction='spoken2gloss')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ffeafe91",
"metadata": {},
"outputs": [],
"source": [
"open_interrogative_spoken = 'Dove andrà Luca la prossima estate?'\n",
"engine.translate(open_interrogative_spoken, direction='spoken2gloss')"
]
},
{
"cell_type": "markdown",
"id": "8912bca9",
"metadata": {},
"source": [
"###### spoken2gloss failure tests"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe87dca4",
"metadata": {},
"outputs": [],
"source": [
"fail_spoken = open_interrogative_spoken*3\n",
"engine.translate(fail_spoken, direction='spoken2gloss')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e979fc55",
"metadata": {},
"outputs": [],
"source": [
"fail_spoken = 'ciao Lucia, sei strana con questa punteggiatura.'\n",
"engine.translate(fail_spoken, direction='spoken2gloss')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab24aec6",
"metadata": {},
"outputs": [],
"source": [
"fail_spoken = 'Luca, Antonio, Paolo e Marco vanno al mare'\n",
"engine.translate(fail_spoken, direction='spoken2gloss')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0f438fc8",
"metadata": {},
"outputs": [],
"source": [
"fail_spoken = 'Luca, Paolo e Marco vanno al mare'\n",
"engine.translate(fail_spoken, direction='spoken2gloss')"
]
},
{
"cell_type": "markdown",
"id": "fc0e9344",
"metadata": {},
"source": [
"## Produzione : gloss2spoken examples\n",
"\n",
"The input is assumed to be a list of `(token, attribute)` tuples: each token is a word (tokens are split on whitespace and punctuation), and the attribute may be an empty string.\n",
"\n",
"The currently supported attributes, required whenever the corresponding token is present, are:\n",
"\n",
"- subject\n",
"- verb\n",
"- time"
]
},
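{
"cell_type": "markdown",
"id": "7b2e41a0",
"metadata": {},
"source": [
"A minimal sketch of the input format described above, in case it helps to check a gloss list before translating it. `check_gloss` and `ALLOWED_ATTRIBUTES` are hypothetical names that are not part of `ccq`, and treating the empty string as a missing attribute is an assumption based on the examples in this notebook.\n",
"\n",
"```python\n",
"# Hypothetical helper (not part of ccq): checks the shape of a gloss2spoken input.\n",
"ALLOWED_ATTRIBUTES = {'subject', 'verb', 'time', ''}  # '' means no attribute (assumption)\n",
"\n",
"def check_gloss(gloss):\n",
"    # The input must be a list of (token, attribute) pairs of strings.\n",
"    if not isinstance(gloss, list):\n",
"        return False\n",
"    for item in gloss:\n",
"        if not (isinstance(item, tuple) and len(item) == 2):\n",
"            return False\n",
"        token, attribute = item\n",
"        if not isinstance(token, str) or not token:\n",
"            return False\n",
"        if not isinstance(attribute, str) or attribute not in ALLOWED_ATTRIBUTES:\n",
"            return False\n",
"    return True\n",
"\n",
"check_gloss([('luca', 'subject'), ('spagna', ''), ('andare', 'verb')])  # True\n",
"check_gloss([('luca', 'agent')])  # False: 'agent' is not a supported attribute\n",
"```"
]
},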
{
"cell_type": "code",
"execution_count": null,
"id": "3319d71f",
"metadata": {},
"outputs": [],
"source": [
"affirmative_gloss = [('prossimo','time'),('estate','time'),('luca','subject'),('spagna',''),('andare','verb')]\n",
"\n",
"engine.translate(affirmative_gloss, direction='gloss2spoken')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "94554f78",
"metadata": {},
"outputs": [],
"source": [
"engine.translate(closed_interrogative_spoken, direction='spoken2gloss')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0914f889",
"metadata": {},
"outputs": [],
"source": [
"negative_gloss = [('prossimo','time'),('estate','time'),('luca','subject'),('spagna',''),('andare','verb'),('no','')]\n",
"\n",
"engine.translate(negative_gloss, direction='gloss2spoken')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f1cf57a3",
"metadata": {},
"outputs": [],
"source": [
"open_interrogative_gloss = [('prossimo','time'),('estate','time'),('luca','subject'),('andare','verb'),('dove',''),('?','')]\n",
"\n",
"engine.translate(open_interrogative_gloss, direction='gloss2spoken')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "65abdc79",
"metadata": {},
"outputs": [],
"source": [
"closed_interrogative_gloss = [('prossimo','time'),('estate','time'),('spagna',''),('andare','verb'),('luca','subject'),('?','')]\n",
"\n",
"engine.translate(closed_interrogative_gloss, direction='gloss2spoken')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "39f5b780",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "bdbdf7db",
"metadata": {},
"source": [
"## Zero-shot learning"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dcb132c0",
"metadata": {},
"outputs": [],
"source": [
"from transformers import pipeline\n",
"classifier = pipeline(\"zero-shot-classification\",model=\"Jiva/xlm-roberta-large-it-mnli\", use_fast=True, multi_label=True) "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2edf1ee5",
"metadata": {},
"outputs": [],
"source": [
"# we will classify the following Wikipedia entry about Sardinia\n",
"sequence_to_classify = \"La Sardegna è una regione italiana a statuto speciale di 1 592 730 abitanti con capoluogo Cagliari, la cui denominazione bilingue utilizzata nella comunicazione ufficiale è Regione Autonoma della Sardegna / Regione Autònoma de Sardigna.\"\n",
"# we can specify candidate labels in Italian:\n",
"candidate_labels = [\"geografia\", \"politica\", \"macchine\", \"cibo\", \"moda\"]\n",
"classifier(sequence_to_classify, candidate_labels)"
]
},
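{
"cell_type": "markdown",
"id": "c4d9035e",
"metadata": {},
"source": [
"The zero-shot pipeline returns a dictionary with the keys `sequence`, `labels` and `scores`, with the labels sorted by decreasing score. A minimal sketch of how the scores might be thresholded, assuming a cut-off of 0.5: the `result` values below are invented for illustration (they are not real model output) and `labels_above` is a hypothetical helper.\n",
"\n",
"```python\n",
"# Mock result with the same keys the zero-shot pipeline returns;\n",
"# the scores are invented for illustration only.\n",
"result = {\n",
"    'sequence': 'La Sardegna è una regione italiana',\n",
"    'labels': ['geografia', 'politica', 'cibo', 'moda', 'macchine'],\n",
"    'scores': [0.9, 0.6, 0.1, 0.05, 0.02],\n",
"}\n",
"\n",
"def labels_above(result, threshold=0.5):\n",
"    # Keep the (label, score) pairs whose score reaches the threshold.\n",
"    return [(label, score)\n",
"            for label, score in zip(result['labels'], result['scores'])\n",
"            if score >= threshold]\n",
"\n",
"labels_above(result)  # [('geografia', 0.9), ('politica', 0.6)]\n",
"```"
]
},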
{
"cell_type": "markdown",
"id": "12a75a53",
"metadata": {},
"source": [
"# todos\n",
"\n",
"- extend & test failure management for gloss2spoken translation\n",
"\n",
"- polish gloss2spoken output: add articles, conjunctions, verb conjugation\n",
"\n",
"- the backend will be linked to some kind of db that stores images indexed by glosses. The db will most probably be small and semantically generic (an image could have more than one index, and an index could have more than one image), so there will also be a need for a word-vector similarity engine (for synonyms) and a zero-shot transformer (to exploit the contextual meaning of a sentence when representing each word token)\n",
"\n",
"- cleaner code\n",
"\n",
"\n",
"# useful links\n",
"\n",
"- __nlp cheat sheet__\n",
"\n",
"https://github.com/janlukasschroeder/nlp-cheat-sheet-python\n",
"\n",
"- __spacy__\n",
" \n",
" https://spacy.io/models & https://spacy.io/models/it\n",
"\n",
"\n",
"- __transformers__ \n",
" - __italian zero-shot__ : https://huggingface.co/Jiva/xlm-roberta-large-it-mnli\n",
"\n",
" - __italian fill mask__ : https://huggingface.co/Musixmatch/umberto-wikipedia-uncased-v1\n",
"\n",
"- __stanford transformers--> stanza__\n",
"\n",
" https://github.com/stanfordnlp/stanza\n",
"\n",
"- __transformer + spacy : italian NER__\n",
" \n",
" https://huggingface.co/bullmount/it_nerIta_trf\n",
"\n",
" pip install https://huggingface.co/bullmount/it_nerIta_trf/resolve/main/it_nerIta_trf-any-py3-none-any.whl\n"
]
},
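{
"cell_type": "markdown",
"id": "e58f17b2",
"metadata": {},
"source": [
"One possible starting point for the word-vector similarity engine listed in the todos: cosine similarity between a query word and the glosses that index the images. Everything here is an assumption made for illustration, including the toy vectors, `VECTORS` and `nearest_indexed_gloss`; a real engine would load its vectors from a trained model.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"# Toy 3-dimensional vectors, invented for illustration only.\n",
"VECTORS = {\n",
"    'mare': [0.9, 0.1, 0.0],\n",
"    'spiaggia': [0.8, 0.2, 0.1],\n",
"    'birra': [0.0, 0.9, 0.3],\n",
"}\n",
"\n",
"def cosine(u, v):\n",
"    # Cosine similarity: dot product divided by the product of the norms.\n",
"    dot = sum(a * b for a, b in zip(u, v))\n",
"    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))\n",
"    return dot / norm\n",
"\n",
"def nearest_indexed_gloss(word, indexed):\n",
"    # Return the indexed gloss whose vector is closest to the vector of word.\n",
"    return max(indexed, key=lambda gloss: cosine(VECTORS[word], VECTORS[gloss]))\n",
"\n",
"nearest_indexed_gloss('spiaggia', ['mare', 'birra'])  # 'mare'\n",
"```"
]
},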
{
"cell_type": "code",
"execution_count": null,
"id": "6ef7c791",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "geografia",
"language": "python",
"name": "geografia"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}