Data transformations

It is very likely that the data you have is not in the format required by the library. Data transformations provide a way to convert raw data into the required standard tsv format.

Transform functions

Transform functions are the functions used to perform transformations. Each function takes raw data in a certain format, performs the defined transformation steps, and creates the corresponding tsv file(s).

Sample transform functions

utils.tranform_functions.snips_intent_ner_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)[source]

This function transforms the data present in snips_data/. Raw data is in BIO tagged format with the sentence intent specified at the end of each sentence. The transformation function converts each raw data file into two separate tsv files, one for the intent classification task and another for the NER task. Following transformed files are written at wrtDir

  • NER transformed tsv file.
  • NER label map joblib file.
  • intent transformed tsv file.
  • intent label map joblib file.

For using this transform function, set transform_func : snips_intent_ner_to_tsv in transform file.

Parameters:
  • dataDir (str) – Path to the directory where the raw data files to be read are present.
  • readFile (str) – The file which is currently being read and transformed by the function.
  • wrtDir (str) – Path to the directory where the transformed tsv files are to be saved.
  • transParamDict (dict, defaults to None) – Dictionary of function specific parameters. Not required for this transformation function.
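As an illustrative sketch (not the library's actual implementation), one sentence in the raw snips format described above can be split into an NER row and an intent row roughly like this; the exact tsv column layout used by the library is an assumption here:

```python
# Sketch only: split one BIO-tagged sentence (token/tag pairs with the
# intent label as the final element) into an NER row and an intent row.
def split_snips_sentence(lines):
    *token_tags, intent = lines
    tokens = [tt.split()[0] for tt in token_tags]
    tags = [tt.split()[1] for tt in token_tags]
    ner_row = " ".join(tags) + "\t" + " ".join(tokens)       # assumed layout: tags<TAB>tokens
    intent_row = intent + "\t" + " ".join(tokens)            # assumed layout: label<TAB>sentence
    return ner_row, intent_row

sample = ["play O", "madonna B-artist", "PlayMusic"]
ner_row, intent_row = split_snips_sentence(sample)
# ner_row    -> "O B-artist\tplay madonna"
# intent_row -> "PlayMusic\tplay madonna"
```

This shows why one raw file yields two tsv files: each sentence contributes one row to the NER file and one row to the intent file.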
utils.tranform_functions.snli_entailment_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)[source]

This function transforms the SNLI entailment data available at SNLI for the sentence pair entailment task. Contradiction and neutral labels are mapped to 0, representing a non-entailment scenario. Only the entailment label is mapped to 1, representing an entailment scenario. Following transformed files are written at wrtDir

  • Sentence pair transformed tsv file for entailment task

For using this transform function, set transform_func : snli_entailment_to_tsv in transform file.

Parameters:
  • dataDir (str) – Path to the directory where the raw data files to be read are present.
  • readFile (str) – The file which is currently being read and transformed by the function.
  • wrtDir (str) – Path to the directory where the transformed tsv files are to be saved.
  • transParamDict (dict, defaults to None) – Dictionary of function specific parameters. Not required for this transformation function.
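The label mapping described above can be sketched as follows. This is not the library's code; the `gold_label`, `sentence1`, and `sentence2` fields are the standard SNLI jsonl fields, and the returned tuple layout is an assumption:

```python
import json

# Sketch only: map an SNLI jsonl record to a binary entailment label,
# 1 = entailment, 0 = contradiction or neutral.
def snli_line_to_row(jsonl_line):
    record = json.loads(jsonl_line)
    label = 1 if record["gold_label"] == "entailment" else 0
    return label, record["sentence1"], record["sentence2"]

line = '{"gold_label": "neutral", "sentence1": "A man eats.", "sentence2": "A man eats pizza."}'
label, s1, s2 = snli_line_to_row(line)
# label -> 0 (neutral is treated as non-entailment)
```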
utils.tranform_functions.create_fragment_detection_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)[source]

This function transforms data for the fragment detection task (detecting whether a sentence is an incomplete/fragment sentence or not). It takes data in single sentence classification format and creates fragment samples from the sentences. In the transformed file, labels 1 and 0 represent fragment and non-fragment sentences respectively. Following transformed files are written at wrtDir

  • Fragment transformed tsv file containing fragment/non-fragment sentences and labels

For using this transform function, set transform_func : create_fragment_detection_tsv in transform file.

Parameters:
  • dataDir (str) – Path to the directory where the raw data files to be read are present.
  • readFile (str) – The file which is currently being read and transformed by the function.
  • wrtDir (str) – Path to the directory where the transformed tsv files are to be saved.
  • transParamDict (dict, defaults to None) –

    Dictionary requiring the following parameters as key-value

    • data_frac (defaults to 0.2) : Fraction of data to consider for making fragments.
    • seq_len_right (defaults to 3) : Right window length for making n-grams.
    • seq_len_left (defaults to 2) : Left window length for making n-grams.
    • sep (defaults to " ") : Column separator for the input file.
    • query_col (defaults to 2) : Column number containing sentences. Counting starts from 0.
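A minimal sketch of fragment creation under the window parameters above. This is not the library's implementation, and the exact window semantics (how the pivot word is chosen and whether both windows apply around it) are assumptions:

```python
import random

# Sketch only: cut a contiguous n-gram window out of a sentence to
# produce a "fragment" sample, using left/right window lengths.
def make_fragment(sentence, seq_len_left=2, seq_len_right=3, rng=None):
    rng = rng or random.Random(7)          # seeded for reproducibility
    words = sentence.split()
    pivot = rng.randrange(len(words))      # assumed: pivot word chosen at random
    start = max(0, pivot - seq_len_left)
    end = min(len(words), pivot + seq_len_right)
    return " ".join(words[start:end])

sentence = "book a table for two at the italian place"
fragment = make_fragment(sentence)         # a contiguous sub-span, at most 5 words here
```

The fragment gets label 1 in the transformed file, while the original full sentence gets label 0.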
utils.tranform_functions.msmarco_answerability_detection_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)[source]

This function transforms the MSMARCO triples data available at triples.

The data contains triplets where the first entry is the query, the second is a context passage from which the query can be answered (positive passage), while the third is a context passage from which the query cannot be answered (negative passage). The data is transformed into sentence pair classification format, with the query-positive context pair labeled 1 (answerable) and the query-negative context pair labeled 0 (non-answerable).

Following transformed files are written at wrtDir

  • Sentence pair transformed downsampled file.
  • Sentence pair transformed train tsv file for answerability task
  • Sentence pair transformed dev tsv file for answerability task
  • Sentence pair transformed test tsv file for answerability task

For using this transform function, set transform_func : msmarco_answerability_detection_to_tsv in transform file.

Parameters:
  • dataDir (str) – Path to the directory where the raw data files to be read are present.
  • readFile (str) – The file which is currently being read and transformed by the function.
  • wrtDir (str) – Path to the directory where the transformed tsv files are to be saved.
  • transParamDict (dict, defaults to None) –

Dictionary requiring the following parameters as key-value

    • data_frac (defaults to 0.01) : Fraction of data to keep in downsampling as the original data size is too large.
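The triple-to-pairs expansion described above can be sketched like this (not the library's code; the tab-separated triple layout matches the description, but the output tuple shape is an assumption):

```python
# Sketch only: expand one MSMARCO triple (query, positive passage,
# negative passage) into two labeled sentence pairs, 1 = answerable.
def triple_to_pairs(tsv_line):
    query, pos_passage, neg_passage = tsv_line.rstrip("\n").split("\t")
    return [(1, query, pos_passage), (0, query, neg_passage)]

pairs = triple_to_pairs("what is python\tPython is a language.\tThe sky is blue.\n")
# pairs[0] -> (1, "what is python", "Python is a language.")
# pairs[1] -> (0, "what is python", "The sky is blue.")
```

Each raw triple therefore doubles into one answerable and one non-answerable pair before downsampling.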
utils.tranform_functions.msmarco_query_type_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)[source]

This function transforms the MSMARCO QnA data available at MSMARCO_QnA for query-type detection task (given a query sentence, detect what type of answer is expected). Queries are divided into 5 query types - NUMERIC, LOCATION, ENTITY, DESCRIPTION, PERSON. The function transforms the json data to standard single sentence classification type tsv data. Following transformed files are written at wrtDir

  • Query type transformed tsv data file.
  • Query type label map joblib file.

For using this transform function, set transform_func : msmarco_query_type_to_tsv in transform file.

Parameters:
  • dataDir (str) – Path to the directory where the raw data files to be read are present.
  • readFile (str) – The file which is currently being read and transformed by the function.
  • wrtDir (str) – Path to the directory where the transformed tsv files are to be saved.
  • transParamDict (dict, defaults to None) –

    Dictionary requiring the following parameters as key-value

    • data_frac (defaults to 0.05) : Fraction of data to consider for downsampling.
utils.tranform_functions.bio_ner_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)[source]

This function transforms BIO-style data into the tsv format required for NER. Following transformed files are written at wrtDir,

  • NER transformed tsv file.
  • NER label map joblib file.

For using this transform function, set transform_func : bio_ner_to_tsv in transform file.

Parameters:
  • dataDir (str) – Path to the directory where the raw data files to be read are present.
  • readFile (str) – The file which is currently being read and transformed by the function.
  • wrtDir (str) – Path to the directory where the transformed tsv files are to be saved.
  • transParamDict (dict, defaults to None) –

    Dictionary requiring the following parameters as key-value

    • save_prefix (defaults to 'bio_ner') : Save file name prefix.
    • col_sep (defaults to " ") : Separator for columns.
    • tag_col (defaults to 1) : Column number where the NER label tag is present for each row. Counting starts from 0.
    • sen_sep (defaults to " ") : End of sentence separator.
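Parsing BIO-style input with these parameters can be sketched as follows. This is an illustrative sketch, not the library's code: it assumes two-column rows and an empty line as the sentence separator:

```python
# Sketch only: group BIO-tagged rows into sentences. Each row is
# token<col_sep>...<tag at tag_col>; an empty line ends a sentence.
def parse_bio(lines, col_sep=" ", tag_col=1):
    sentences, tokens, tags = [], [], []
    for raw in lines:
        line = raw.strip()
        if not line:                       # sentence separator reached
            if tokens:
                sentences.append((tags, tokens))
                tokens, tags = [], []
            continue
        cols = line.split(col_sep)
        tokens.append(cols[0])
        tags.append(cols[tag_col])
    if tokens:                             # flush the final sentence
        sentences.append((tags, tokens))
    return sentences

sents = parse_bio(["EU B-ORG", "rejects O", "", "Peter B-PER"])
# sents -> [(["B-ORG", "O"], ["EU", "rejects"]), (["B-PER"], ["Peter"])]
```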
utils.tranform_functions.coNLL_ner_pos_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)[source]

This function transforms the data present in coNLL_data/. Raw data is in BIO tagged format with the POS and NER tags separated by space. The transformation function converts each raw data file into two separate tsv files, one for the POS tagging task and another for the NER task. Following transformed files are written at wrtDir

  • NER transformed tsv file.
  • NER label map joblib file.
  • POS transformed tsv file.
  • POS label map joblib file.

For using this transform function, set transform_func : coNLL_ner_pos_to_tsv in transform file.

Parameters:
  • dataDir (str) – Path to the directory where the raw data files to be read are present.
  • readFile (str) – The file which is currently being read and transformed by the function.
  • wrtDir (str) – Path to the directory where the transformed tsv files are to be saved.
  • transParamDict (dict, defaults to None) – Dictionary of function specific parameters. Not required for this transformation function.
utils.tranform_functions.qqp_query_similarity_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)[source]

This function transforms the QQP (Quora Question Pairs) query similarity data available at QQP.

If the second query in a query pair is similar to the first, the pair is labeled 1; if not, it is labeled 0. Following transformed files are written at wrtDir

  • Sentence pair transformed train tsv file for query similarity task
  • Sentence pair transformed dev tsv file for query similarity task
  • Sentence pair transformed test tsv file for query similarity task

For using this transform function, set transform_func : qqp_query_similarity_to_tsv in transform file.

Parameters:
  • dataDir (str) – Path to the directory where the raw data files to be read are present.
  • readFile (str) – The file which is currently being read and transformed by the function.
  • wrtDir (str) – Path to the directory where the transformed tsv files are to be saved.
  • transParamDict (dict, defaults to None) –

Dictionary requiring the following parameters as key-value

    • train_frac (defaults to 0.8) : Fraction of data to consider for training. Remaining will be divided into dev and test.
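The train/dev/test split implied by train_frac can be sketched like this; it is an assumption that the remainder is divided equally between dev and test:

```python
# Sketch only: keep train_frac of the rows for training and split the
# remainder evenly between dev and test.
def split_rows(rows, train_frac=0.8):
    n_train = int(len(rows) * train_frac)
    n_dev = (len(rows) - n_train) // 2
    train = rows[:n_train]
    dev = rows[n_train:n_train + n_dev]
    test = rows[n_train + n_dev:]
    return train, dev, test

train, dev, test = split_rows(list(range(100)))
# lengths -> 80, 10, 10
```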
utils.tranform_functions.query_correctness_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)[source]

This function transforms data for the query correctness task. Following transformed files are written at wrtDir

  • Query correctness transformed tsv file

For using this transform function, set transform_func : query_correctness_to_tsv in transform file.

Parameters:
  • dataDir (str) – Path to the directory where the raw data files to be read are present.
  • readFile (str) – The file which is currently being read and transformed by the function.
  • wrtDir (str) – Path to the directory where the transformed tsv files are to be saved.
  • transParamDict (dict, defaults to None) – Dictionary of function specific parameters. Not required for this transformation function.
utils.tranform_functions.imdb_sentiment_data_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False)[source]

This function transforms the IMDb movie review data available at IMDb after accepting the terms.

The data has a total of 50k samples labeled as positive or negative. The reviews contain some HTML tags, which are cleaned by this function. Following transformed files are written at wrtDir

  • IMDb train transformed tsv file for sentiment analysis task
  • IMDb test transformed tsv file for sentiment analysis task

For using this transform function, set transform_func : imdb_sentiment_data_to_tsv in transform file.

Parameters:
  • dataDir (str) – Path to the directory where the raw data files to be read are present.
  • readFile (str) – The file which is currently being read and transformed by the function.
  • wrtDir (str) – Path to the directory where the transformed tsv files are to be saved.
  • transParamDict (dict, defaults to None) –

Dictionary requiring the following parameters as key-value

    • train_frac (defaults to 0.05) : Fraction of data to consider for train/test split.
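The HTML-tag cleanup mentioned above can be sketched with a simple regex; this is an illustrative sketch, not the library's exact cleaning logic:

```python
import re

# Sketch only: strip HTML tags (e.g. the <br /> tags common in IMDb
# reviews) and collapse the resulting extra whitespace.
def clean_review(text):
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return " ".join(no_tags.split())

cleaned = clean_review("Great movie!<br /><br />Would watch again.")
# cleaned -> "Great movie! Would watch again."
```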

Your own transform function

In case you need to convert some custom-format data into the standard tsv format, you can do that by writing your own transform function. Keep the following points in mind while writing your function:

  • The function must take the standard input arguments like the sample transform functions. Any extra function-specific parameter can be added to the transParamDict argument.
  • You should add the function in utils/tranform_functions.py file.
  • You should add a name map for the function in utils/data_utils.py file under TRANSFORM_FUNCS map. This step is required for transform file to recognize your function.
  • You should be able to use your function in the transform file.
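A skeleton for such a function, following the signature shared by the sample functions above. The body is a hypothetical example (my_custom_to_tsv, the input label<sep>sentence layout, and the id<TAB>label<TAB>sentence output layout are all illustrative assumptions, not the library's conventions):

```python
import os

# Hypothetical custom transform: converts label<sep>sentence rows into
# an id<TAB>label<TAB>sentence tsv file written under wrtDir.
def my_custom_to_tsv(dataDir, readFile, wrtDir, transParamDict, isTrainFile=False):
    sep = transParamDict.get("sep", ",") if transParamDict else ","
    out_name = "my_custom_" + readFile.rsplit(".", 1)[0] + ".tsv"
    with open(os.path.join(dataDir, readFile)) as fin, \
         open(os.path.join(wrtDir, out_name), "w") as fout:
        for i, line in enumerate(fin):
            label, sentence = line.rstrip("\n").split(sep, 1)
            fout.write("{}\t{}\t{}\n".format(i, label, sentence))
```

After adding the function to utils/tranform_functions.py and mapping its name in TRANSFORM_FUNCS, it can be referenced from a transform file like any sample function.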

Transform File

You can easily use the sample transformation functions or your own transformation function, by defining a YAML format transform_file. Say you want to perform these transformations - sample_transform1, sample_transform2, …, sample_transform5. Following is an example for the transform file,

sample_transform1:
  transform_func: snips_intent_ner_to_tsv
  read_file_names:
    - snips_train.txt
    - snips_dev.txt
    - snips_test.txt
  read_dir: snips_data
  save_dir: demo_transform


sample_transform2:
  transform_func: snli_entailment_to_tsv
  read_file_names:
    - snli_train.jsonl
    - snli_dev.jsonl
    - snli_test.jsonl
  read_dir : snli_data
  save_dir: demo_transform

sample_transform3:
  transform_func: bio_ner_to_tsv
  transform_params:
    save_prefix : sample
    tag_col : 1
    col_sep : " "
    sen_sep : "\n"
  read_file_names:
    - coNLL_train.txt
    - coNLL_testa.txt
    - coNLL_testb.txt

  read_dir: coNLL_data
  save_dir: demo_transform

sample_transform4:
  transform_func: create_fragment_detection_tsv
  transform_params:
    data_frac : 0.2
    seq_len_right : 3
    seq_len_left : 2
    sep : "\t"
    query_col : 2
  read_file_names:
    - int_snips_train.tsv
    - int_snips_dev.tsv
    - int_snips_test.tsv

  read_dir: data
  save_dir: demo_transform

sample_transform5:
  transform_func: msmarco_query_type_to_tsv
  transform_params:
    data_frac : 0.2
  read_file_names:
    - train_v2.1.json
    - dev_v2.1.json
    - eval_v2.1_public.json

  read_dir: msmarco_qna_data
  save_dir: demo_transform

NOTE: The transform names (sample_transform1, sample_transform2, …) are unique identifiers for the transforms; hence, the transform names must always be distinct.

Transform file parameters

Detailed description of the parameters available in the transform file.

  • transform_func (required) : Name of the transform function to use.
  • transform_params (optional) : Dictionary of function specific parameters which will go in transParamDict parameter of function.
  • read_file_names (required) : List of raw data files for transformations. The first file will be considered as train file and will be used to create label map file when required.
  • read_dir (required) : Directory containing the input files.
  • save_dir (required) : Directory to save the transformed tsv/label map files.

Running data transformations

Once you have made the transform file with all the transform operations, you can run data transformations with the following terminal command.

$ python data_transformations.py \
      --transform_file 'transform_file.yml'