Data transformations¶
The data you have is often not in the format required by the library. Data transformations provide a way to convert raw data into the required standard tsv format.
Transform functions¶
Transform functions are used for performing these transformations. Each function takes raw data in a certain format, performs the defined transformation steps and creates the respective tsv file.
Sample transform functions¶
utils.tranform_functions.snips_intent_ner_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)¶

This function transforms the data present in snips_data/. Raw data is in BIO tagged format with the sentence intent specified at the end of each sentence. The transformation function converts each raw data file into two separate tsv files, one for the intent classification task and another for the NER task. Following transformed files are written at wrtDir:

- NER transformed tsv file.
- NER label map joblib file.
- Intent transformed tsv file.
- Intent label map joblib file.

For using this transform function, set transform_func: snips_intent_ner_to_tsv in the transform file.

Parameters:

- dataDir (str) – Path to the directory where the raw data files to be read are present.
- readFile (str) – The file which is currently being read and transformed by the function.
- wrtDir (str) – Path to the directory where the transformed tsv files are saved.
- transParamDict (dict, defaults to None) – Dictionary of function-specific parameters. Not required for this transformation function.
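The split this function performs can be illustrated with a standalone sketch (not the library's code): one BIO-tagged SNIPS-style block becomes an intent classification row and an NER row. The exact raw layout and tsv column order here are assumptions for illustration.

```python
# Assumed raw format: one "token TAG" pair per line, intent on the last line.
raw_block = """add O
kids O
bop B-playlist
to O
my O
playlist O
AddToPlaylist"""

lines = raw_block.strip().split("\n")
intent = lines[-1].strip()            # intent is specified at the end
tokens, tags = [], []
for line in lines[:-1]:
    token, tag = line.split()
    tokens.append(token)
    tags.append(tag)

# Assumed standard tsv shape: uid <tab> label(s) <tab> sentence
intent_row = "0\t{}\t{}".format(intent, " ".join(tokens))
ner_row = "0\t{}\t{}".format(" ".join(tags), " ".join(tokens))
print(intent_row)
print(ner_row)
```

One raw file thus yields two task-specific tsv files, which is why the function writes both intent and NER outputs (plus their label maps) to wrtDir.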
utils.tranform_functions.snli_entailment_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)¶

This function transforms the SNLI entailment data available at SNLI for the sentence pair entailment task. Contradiction and neutral labels are mapped to 0, representing a non-entailment scenario. Only the entailment label is mapped to 1, representing an entailment scenario. Following transformed file is written at wrtDir:

- Sentence pair transformed tsv file for entailment task.

For using this transform function, set transform_func: snli_entailment_to_tsv in the transform file.

Parameters:

- dataDir (str) – Path to the directory where the raw data files to be read are present.
- readFile (str) – The file which is currently being read and transformed by the function.
- wrtDir (str) – Path to the directory where the transformed tsv files are saved.
- transParamDict (dict, defaults to None) – Dictionary of function-specific parameters. Not required for this transformation function.
utils.tranform_functions.create_fragment_detection_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)¶

This function transforms data for the fragment detection task (detecting whether a sentence is an incomplete fragment or not). It takes data in single sentence classification format and creates fragment samples from the sentences. In the transformed file, labels 1 and 0 represent fragment and non-fragment sentences respectively. Following transformed file is written at wrtDir:

- Fragment transformed tsv file containing fragment/non-fragment sentences and labels.

For using this transform function, set transform_func: create_fragment_detection_tsv in the transform file.

Parameters:

- dataDir (str) – Path to the directory where the raw data files to be read are present.
- readFile (str) – The file which is currently being read and transformed by the function.
- wrtDir (str) – Path to the directory where the transformed tsv files are saved.
- transParamDict (dict) – Dictionary requiring the following parameters as key-value:
  - data_frac (defaults to 0.2) : Fraction of data to consider for making fragments.
  - seq_len_right (defaults to 3) : Right window length for making n-grams.
  - seq_len_left (defaults to 2) : Left window length for making n-grams.
  - sep (defaults to " ") : Column separator for the input file.
  - query_col (defaults to 2) : Column number containing sentences. Counting starts from 0.
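How a fragment can be cut from a full sentence using the window parameters above can be sketched as follows. This is an assumption about the cutting scheme, not the library's exact algorithm: keep up to seq_len_left tokens before a random pivot and seq_len_right tokens from the pivot onwards.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

def make_fragment(sentence, seq_len_left=2, seq_len_right=3):
    # Cut a short token window around a random pivot position.
    tokens = sentence.split()
    pivot = random.randrange(len(tokens))
    start = max(0, pivot - seq_len_left)
    return " ".join(tokens[start:pivot + seq_len_right])

full = "play the latest album by the rolling stones"
frag = make_fragment(full)   # would be labeled 1 (fragment) in the output tsv
print(frag)
```

The original sentence itself would be kept as a non-fragment sample with label 0.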
utils.tranform_functions.msmarco_answerability_detection_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)¶

This function transforms the MSMARCO triples data available at triples. The data contains triplets where the first entry is the query, the second is a context passage from which the query can be answered (positive passage), while the third entry is a context passage from which the query cannot be answered (negative passage). Data is transformed into sentence pair classification format, with the query-positive context pair labeled as 1 (answerable) and the query-negative context pair labeled as 0 (non-answerable). Following transformed files are written at wrtDir:

- Sentence pair transformed downsampled file.
- Sentence pair transformed train tsv file for answerability task.
- Sentence pair transformed dev tsv file for answerability task.
- Sentence pair transformed test tsv file for answerability task.

For using this transform function, set transform_func: msmarco_answerability_detection_to_tsv in the transform file.

Parameters:

- dataDir (str) – Path to the directory where the raw data files to be read are present.
- readFile (str) – The file which is currently being read and transformed by the function.
- wrtDir (str) – Path to the directory where the transformed tsv files are saved.
- transParamDict (dict, defaults to None) – Dictionary of function-specific parameters:
  - data_frac (defaults to 0.01) : Fraction of data to keep in downsampling, as the original data size is too large.
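The triple-to-pairs expansion can be sketched in a few lines. The uid/label/column layout below is an assumption about the standard sentence pair tsv shape, and the triple is made-up toy data:

```python
# One MSMARCO-style triple becomes two sentence-pair rows.
query = "what is the capital of france"
pos_passage = "Paris is the capital and most populous city of France."
neg_passage = "The Loire is the longest river located within France."

rows = [
    "0\t1\t{}\t{}".format(query, pos_passage),  # query + positive passage -> 1
    "1\t0\t{}\t{}".format(query, neg_passage),  # query + negative passage -> 0
]
print("\n".join(rows))
```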
utils.tranform_functions.msmarco_query_type_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)¶

This function transforms the MSMARCO QnA data available at MSMARCO_QnA for the query-type detection task (given a query sentence, detect what type of answer is expected). Queries are divided into 5 query types - NUMERIC, LOCATION, ENTITY, DESCRIPTION, PERSON. The function transforms the json data to standard single sentence classification type tsv data. Following transformed files are written at wrtDir:

- Query type transformed tsv data file.
- Query type label map joblib file.

For using this transform function, set transform_func: msmarco_query_type_to_tsv in the transform file.

Parameters:

- dataDir (str) – Path to the directory where the raw data files to be read are present.
- readFile (str) – The file which is currently being read and transformed by the function.
- wrtDir (str) – Path to the directory where the transformed tsv files are saved.
- transParamDict (dict, defaults to None) – Dictionary requiring the following parameters as key-value:
  - data_frac (defaults to 0.05) : Fraction of data to consider for downsampling.
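The json-to-tsv step can be sketched on a tiny in-memory dict. The field names below mirror the public MSMARCO QnA dumps (column-oriented json keyed by uid), but are assumptions as far as this function's internals are concerned:

```python
# Toy stand-in for the parsed MSMARCO QnA json (what json.load would return).
data = {
    "query": {"0": "how many calories in an apple",
              "1": "where is mount fuji"},
    "query_type": {"0": "NUMERIC", "1": "LOCATION"},
}

# Single sentence classification tsv rows: uid <tab> label <tab> query.
rows = ["{}\t{}\t{}".format(uid, data["query_type"][uid], data["query"][uid])
        for uid in sorted(data["query"])]
print("\n".join(rows))
```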
utils.tranform_functions.bio_ner_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)¶

This function takes BIO style data and transforms it into the tsv format required for NER. Following transformed files are written at wrtDir:

- NER transformed tsv file.
- NER label map joblib file.

For using this transform function, set transform_func: bio_ner_to_tsv in the transform file.

Parameters:

- dataDir (str) – Path to the directory where the raw data files to be read are present.
- readFile (str) – The file which is currently being read and transformed by the function.
- wrtDir (str) – Path to the directory where the transformed tsv files are saved.
- transParamDict (dict, defaults to None) – Dictionary requiring the following parameters as key-value:
  - save_prefix (defaults to 'bio_ner') : Save file name prefix.
  - col_sep (defaults to " ") : Separator for columns.
  - tag_col (defaults to 1) : Column number where the NER label tag is present for each row. Counting starts from 0.
  - sen_sep (defaults to " ") : End of sentence separator.
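How the parameters above interact can be sketched on a made-up toy input (this is illustrative grouping logic, not the library's implementation): col_sep separates columns, tag_col selects the NER tag column, and here an empty line is assumed as the sentence separator.

```python
# Toy BIO input: tab-separated columns, blank line between sentences.
raw = "John\tB-PER\nlives\tO\nin\tO\nParis\tB-LOC\n\nBye\tO\n"
col_sep, tag_col = "\t", 1

sentences, tokens, tags = [], [], []
for line in raw.split("\n"):
    if not line.strip():              # sentence boundary
        if tokens:
            sentences.append((" ".join(tokens), " ".join(tags)))
            tokens, tags = [], []
        continue
    cols = line.split(col_sep)
    tokens.append(cols[0])
    tags.append(cols[tag_col])

# Each (sentence, tag sequence) pair becomes one row of the NER tsv file.
print(sentences)
```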
utils.tranform_functions.coNLL_ner_pos_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)¶

This function transforms the data present in coNLL_data/. Raw data is in BIO tagged format with the POS and NER tags separated by space. The transformation function converts each raw data file into two separate tsv files, one for the POS tagging task and another for the NER task. Following transformed files are written at wrtDir:

- NER transformed tsv file.
- NER label map joblib file.
- POS transformed tsv file.
- POS label map joblib file.

For using this transform function, set transform_func: coNLL_ner_pos_to_tsv in the transform file.

Parameters:

- dataDir (str) – Path to the directory where the raw data files to be read are present.
- readFile (str) – The file which is currently being read and transformed by the function.
- wrtDir (str) – Path to the directory where the transformed tsv files are saved.
- transParamDict (dict, defaults to None) – Dictionary of function-specific parameters. Not required for this transformation function.
utils.tranform_functions.qqp_query_similarity_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)¶

This function transforms the QQP (Quora Question Pairs) query similarity data available at QQP. If the second query in a query-pair is similar to the first query, the pair is labeled 1; if not, it is labeled 0. Following transformed files are written at wrtDir:

- Sentence pair transformed train tsv file for query similarity task.
- Sentence pair transformed dev tsv file for query similarity task.
- Sentence pair transformed test tsv file for query similarity task.

For using this transform function, set transform_func: qqp_query_similarity_to_tsv in the transform file.

Parameters:

- dataDir (str) – Path to the directory where the raw data files to be read are present.
- readFile (str) – The file which is currently being read and transformed by the function.
- wrtDir (str) – Path to the directory where the transformed tsv files are saved.
- transParamDict (dict, defaults to None) – Dictionary of function-specific parameters:
  - train_frac (defaults to 0.8) : Fraction of data to consider for training. The remaining data will be divided into dev and test.
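A train_frac split can be sketched as below; dividing the remainder equally between dev and test is an assumption about the library's behavior:

```python
# 100 toy rows split with train_frac = 0.8.
rows = ["row_{}".format(i) for i in range(100)]
train_frac = 0.8

n_train = int(len(rows) * train_frac)
train = rows[:n_train]                       # first 80% for training
rest = rows[n_train:]
dev = rest[:len(rest) // 2]                  # half of the remainder
test = rest[len(rest) // 2:]                 # the other half
print(len(train), len(dev), len(test))
```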
utils.tranform_functions.query_correctness_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)¶

This function transforms data for the query correctness task. Following transformed file is written at wrtDir:

- Query correctness transformed tsv file.

For using this transform function, set transform_func: query_correctness_to_tsv in the transform file.

Parameters:

- dataDir (str) – Path to the directory where the raw data files to be read are present.
- readFile (str) – The file which is currently being read and transformed by the function.
- wrtDir (str) – Path to the directory where the transformed tsv files are saved.
- transParamDict (dict, defaults to None) – Dictionary of function-specific parameters. Not required for this transformation function.
utils.tranform_functions.imdb_sentiment_data_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)¶

This function transforms the IMDb movie review data available at IMDb after accepting the terms. The data contains a total of 50k samples labeled as positive or negative. The reviews contain some html tags which are cleaned by this function. Following transformed files are written at wrtDir:

- IMDb train transformed tsv file for sentiment analysis task.
- IMDb test transformed tsv file for sentiment analysis task.

For using this transform function, set transform_func: imdb_sentiment_data_to_tsv in the transform file.

Parameters:

- dataDir (str) – Path to the directory where the raw data files to be read are present.
- readFile (str) – The file which is currently being read and transformed by the function.
- wrtDir (str) – Path to the directory where the transformed tsv files are saved.
- transParamDict (dict, defaults to None) – Dictionary of function-specific parameters:
  - train_frac (defaults to 0.05) : Fraction of data to consider for the train/test split.
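The kind of html cleanup applied to reviews can be sketched with a simple regex; the exact cleaning steps are the library's, so this is an assumption for illustration:

```python
import re

# IMDb reviews often contain markup such as <br /> line breaks.
review = "A fine film.<br /><br />Great acting and a <b>strong</b> script."

clean = re.sub(r"<[^>]+>", " ", review)   # drop html tags
clean = " ".join(clean.split())           # collapse the leftover whitespace
print(clean)
```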
Your own transform function¶
In case you need to convert some custom format data into the standard tsv format, you can do that by writing your own transform function. Keep the following points in mind while writing your function:

- The function must take the same standard input arguments as the sample transform functions. Any extra function-specific parameter can be added through the transParamDict argument.
- Add the function in the utils/tranform_functions.py file.
- Add a name map for the function in the utils/data_utils.py file under the TRANSFORM_FUNCS map. This step is required for the transform file to recognize your function.
- You can then use your function in the transform file.
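The steps above can be sketched with a minimal custom transform function. The raw format it reads ("label,sentence" per line) and the function name are made up for illustration; the signature is the standard one described above.

```python
import os

def my_csv_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False):
    # Optional function-specific parameter picked up from transParamDict.
    sep = (transParamDict or {}).get("sep", ",")
    out_path = os.path.join(wrtDir, "tsv_" + readFile)
    with open(os.path.join(dataDir, readFile)) as fin, \
         open(out_path, "w") as fout:
        for uid, line in enumerate(fin):
            label, sentence = line.rstrip("\n").split(sep, 1)
            # Standard single sentence classification tsv: uid, label, text.
            fout.write("{}\t{}\t{}\n".format(uid, label, sentence))
```

After adding it to utils/tranform_functions.py and registering it in the TRANSFORM_FUNCS map, you would refer to it by name (my_csv_to_tsv) in the transform file.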
Transform File¶
You can easily use the sample transform functions or your own transform function by defining a YAML format transform_file. Say you want to perform these transformations - sample_transform1, sample_transform2, …, sample_transform5. Following is an example of the transform file:
sample_transform1:
  transform_func: snips_intent_ner_to_tsv
  read_file_names:
    - snips_train.txt
    - snips_dev.txt
    - snips_test.txt
  read_dir: snips_data
  save_dir: demo_transform

sample_transform2:
  transform_func: snli_entailment_to_tsv
  read_file_names:
    - snli_train.jsonl
    - snli_dev.jsonl
    - snli_test.jsonl
  read_dir: snli_data
  save_dir: demo_transform

sample_transform3:
  transform_func: bio_ner_to_tsv
  transform_params:
    save_prefix: sample
    tag_col: 1
    col_sep: " "
    sen_sep: "\n"
  read_file_names:
    - coNLL_train.txt
    - coNLL_testa.txt
    - coNLL_testb.txt
  read_dir: coNLL_data
  save_dir: demo_transform

sample_transform4:
  transform_func: create_fragment_detection_tsv
  transform_params:
    data_frac: 0.2
    seq_len_right: 3
    seq_len_left: 2
    sep: "\t"
    query_col: 2
  read_file_names:
    - int_snips_train.tsv
    - int_snips_dev.tsv
    - int_snips_test.tsv
  read_dir: data
  save_dir: demo_transform

sample_transform5:
  transform_func: msmarco_query_type_to_tsv
  transform_params:
    data_frac: 0.2
  read_file_names:
    - train_v2.1.json
    - dev_v2.1.json
    - eval_v2.1_public.json
  read_dir: msmarco_qna_data
  save_dir: demo_transform
NOTE: The transform names (sample_transform1, sample_transform2, …) are unique identifiers for each transform, hence they must always be distinct.
Transform file parameters¶
Detailed description of the parameters available in the transform file:

- transform_func (required) : Name of the transform function to use.
- transform_params (optional) : Dictionary of function-specific parameters which will go in the transParamDict parameter of the function.
- read_file_names (required) : List of raw data files for transformations. The first file will be considered the train file and will be used to create the label map file when required.
- read_dir (required) : Directory containing the input files.
- save_dir (required) : Directory to save the transformed tsv/label map files.
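The required and optional keys described above can be checked with a small validation sketch. It operates on a plain dict, which is what yaml.safe_load would return for one entry of the transform file; the helper name is made up for illustration.

```python
REQUIRED = {"transform_func", "read_file_names", "read_dir", "save_dir"}
OPTIONAL = {"transform_params"}

def validate_entry(name, entry):
    # Flag both missing required keys and unrecognized keys.
    missing = REQUIRED - entry.keys()
    unknown = entry.keys() - REQUIRED - OPTIONAL
    if missing or unknown:
        raise ValueError("{}: missing {}, unknown {}".format(name, missing, unknown))
    return True

validate_entry("sample_transform1", {
    "transform_func": "snips_intent_ner_to_tsv",
    "read_file_names": ["snips_train.txt", "snips_dev.txt", "snips_test.txt"],
    "read_dir": "snips_data",
    "save_dir": "demo_transform",
})
```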
Running data transformations¶
Once you have created the transform file with all the transform operations, you can run the data transformations with the following terminal command.
$ python data_transformations.py \
--transform_file 'transform_file.yml'