Oasist Blog

Deliver posts regarding linguistics, engineering and life at my will.

Automation of Botpress Accuracy Inspection Vol.1 - CSV → JSON Converter -

[Image: Botpress]

1. Background

I am working on inspecting the response accuracy of Botpress.
Part of this work was tedious manual labour, so I automated the following tasks.

  • Generate a JSON file for Q&A import: covered in this article
  • Generate a matrix chart of confidence via the API with test data: covered in the next article

Please refer to the source code and README in my GitHub repository for how to build Botpress in your local environment (→ 6. Source Code).
In this article, I explain how to generate the JSON file for Q&A import into Botpress.

2. Deliverables

The CSV of Q&A learning data must be converted to JSON and exported as a file that Botpress can import.
Here is the structure of the JSON.

{
    "qnas": [
        {
            "id": "{Serial_Num}",
            "data": {
                "action": "text",
                "contexts": [
                    "{context}"
                ],
                "enabled": true,
                "answers": {
                    "ja": [
                        "{Answers}"
                    ]
                },
                "questions": {
                    "ja": [
                        "{Questions1}",
                        "{Questions2}",
                        "{Questions3}",
                        "{Questions4}",
                        ...
                    ]
                },
                "redirectFlow": "",
                "redirectNode": ""
            }
        }
    ]
}

At least 3 Questions are required per Q&A to guarantee response confidence.
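
The three-question rule can be checked mechanically before import. The following is a minimal sketch, assuming the JSON structure shown above; `find_thin_qnas` and `MIN_QUESTIONS` are hypothetical names of mine, not part of Botpress.

```python
MIN_QUESTIONS = 3  # threshold taken from the rule above


def find_thin_qnas(obj, lang="ja", minimum=MIN_QUESTIONS):
    """Return the ids of Q&As that have fewer than `minimum` questions."""
    return [
        qna["id"]
        for qna in obj["qnas"]
        if len(qna["data"]["questions"].get(lang, [])) < minimum
    ]


if __name__ == "__main__":
    sample = {
        "qnas": [
            {"id": "QA001", "data": {"questions": {"ja": ["q1", "q2", "q3"]}}},
            {"id": "QA002", "data": {"questions": {"ja": ["q1"]}}},
        ]
    }
    print(find_thin_qnas(sample))  # prints ['QA002']
```

An empty result means every Q&A meets the minimum.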

3. Implementation

3-1. Read CSV

The CSV to import has the following format.

  • Place the headers Serial_Nums, Questions and Answers
  • Remove HTML tags and quotes to avoid errors raised by Python's csv library
  • Repeat Serial_Nums and Answers on every row of the same Q&A (Serial_Nums & Answers : Questions = 1 : N)
  • For details, check the Sample

This CSV is read line by line as an Array or a List.

  • In Python, I accessed each column by index, like qna[0].
# Sample
import csv

with open(csv_path) as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for qna in reader:
        print(qna[0])  # Serial_Nums
        print(qna[1])  # Questions
        print(qna[2])  # Answers
  • In Ruby, I accessed each column by header name, using the headers option.
# Sample
require 'csv'

CSV.foreach(csv_path, headers: true) do |qna|
  p qna['Serial_Nums']
  p qna['Questions']
  p qna['Answers']
end
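
Python can also address columns by header name, which mirrors the Ruby `headers: true` version. The following is a minimal sketch using `csv.DictReader` from the standard library; the function name `read_qna_rows` is hypothetical, and the `encoding` argument is my assumption about the file.

```python
import csv


def read_qna_rows(csv_path):
    """Return each CSV row as a dict keyed by the header names."""
    # newline="" is what the csv module documentation recommends for open().
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Each row can then be read as `row["Serial_Nums"]`, `row["Questions"]` and `row["Answers"]` instead of by position.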

3-2. Generate Q&A Array or List

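First, implement a function (Python) or method (Ruby) that initialises a Dict or Hash in the Botpress format.
In each pair of listings in this section, the Python version comes first and the Ruby version follows.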

def gen_dict_template():
    return {
      "id": "",
      "data": {
        "action": "text",
        "contexts": [
          "hoge"
        ],
        "enabled": True,
        "answers": {
          # Placeholder so that answers["ja"][-1] never hits an empty list
          "ja": ["hoge"]
        },
        "questions": {
          "ja": []
        },
        "redirectFlow": "",
        "redirectNode": ""
      }
    }

def gen_hash_template
  {
    id: '',
    data: {
      action: 'text',
      contexts: [
        'hoge'
      ],
      enabled: true,
      answers: {
        ja: []
      },
      questions: {
        ja: []
      },
      'redirectFlow': '',
      'redirectNode': ''
    }
  }
end

Second, prepare an empty Array or List to hold the Q&A Dicts or Hashes.
Third, assign Serial_Nums, Questions and Answers to the corresponding keys of the Q&A Dict or Hash.
Answers are repeated across rows in the CSV and are stored as an Array or List, so the conditional branches as follows.

  • The previous Answers element is the same as the one just loaded
    • The Questions element is added to Questions in the current Q&A Dict or Hash
  • The previous Answers element is NOT the same as the one just loaded
    • A new Q&A Dict or Hash is created, Serial_Nums is assigned to its id, and Questions and Answers are added to the corresponding keys

This process runs line by line: duplicate Questions are removed, the Q&A Dict or Hash is added to the Array or List, and the Array or List is returned with duplicate entries removed.

import csv
# uniq_list is a helper defined in file_handler in the repository.

def gen_learning_data_list(csv_path):
    learning_data = []
    dict_template = gen_dict_template()
    with open(csv_path) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for learning_datum in reader:
            if dict_template["data"]["answers"]["ja"][-1] == learning_datum[2]:
                # Same answer as the previous row: only add the question.
                dict_template["data"]["questions"]["ja"].append(learning_datum[1])
            else:
                # New answer: start a new Q&A entry.
                dict_template = gen_dict_template()
                dict_template["id"] = learning_datum[0]
                dict_template["data"]["questions"]["ja"].append(learning_datum[1])
                dict_template["data"]["answers"]["ja"].remove("hoge")
                dict_template["data"]["answers"]["ja"].append(learning_datum[2])
            dict_template["data"]["questions"]["ja"] = uniq_list(dict_template["data"]["questions"]["ja"])
            learning_data.append(dict_template)
    # The same entry is appended once per row, so drop the duplicates.
    return uniq_list(learning_data)

def gen_learning_data_arr(csv_path)
  learning_data = []
  hash_template = gen_hash_template
  CSV.foreach(csv_path, headers: true) do |learning_datum|
    if hash_template[:data][:answers][:ja].last == learning_datum['Answers']
      hash_template[:data][:questions][:ja] << learning_datum['Questions']
    else
      hash_template = gen_hash_template
      hash_template[:id] = learning_datum['Serial_Nums']
      hash_template[:data][:questions][:ja] << learning_datum['Questions']
      hash_template[:data][:answers][:ja] << learning_datum['Answers']
    end
    hash_template[:data][:questions][:ja].uniq!
    learning_data << hash_template
  end
  learning_data.uniq
end
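
`uniq_list` in the Python version comes from the `file_handler` module in the repository and is not shown in this article. Because the Q&A entries are dicts, which are unhashable, `set()` cannot be used to remove the duplicates. The following is a minimal sketch of how such a helper could look, assuming it keeps the first occurrence of each element in order; the actual implementation may differ.

```python
def uniq_list(items):
    """Return a new list without duplicates, keeping first-seen order.

    Elements are compared with ==, so this also works for unhashable
    elements such as dicts (at the cost of O(n^2) comparisons).
    """
    unique = []
    for item in items:
        if item not in unique:
            unique.append(item)
    return unique


if __name__ == "__main__":
    qna = {"id": "QA001"}
    print(uniq_list([qna, qna, {"id": "QA002"}]))
    # prints [{'id': 'QA001'}, {'id': 'QA002'}]
```

Ruby needs no such helper because `Array#uniq` already compares Hashes by content.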

3-3. Convert Dict and Hash to JSON

Assign the Q&A Array or List as the value of the qnas key in a Dict or Hash, and store it in a variable.

def csv_to_dict(self, csv_path):
    self.obj = { "qnas": gen_learning_data_list(csv_path) }
    return self.obj["qnas"]

def csv_to_hash
  @obj = { qnas: gen_learning_data_arr(csv_path) }
end

Convert the Dict or Hash to JSON and export it as a JSON file at the given path.

def dict_to_json(self, json_path):
    write_json(json_path, self.obj)

def hash_to_json
  write_json(json_path, obj)
end
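
`write_json` is another helper from the repository that is not shown here. The following is a minimal sketch in Python, assuming it simply serialises the object with the standard `json` module; `ensure_ascii=False` and `indent=4` are my own choices to keep the Japanese text readable in the file, not confirmed settings.

```python
import json


def write_json(json_path, obj):
    """Serialise obj to json_path as UTF-8 JSON."""
    with open(json_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False writes Japanese characters as-is
        # instead of \uXXXX escapes.
        json.dump(obj, f, ensure_ascii=False, indent=4)
```

Without `ensure_ascii=False` the file would still be valid JSON, but every Japanese character would be escaped, which makes the exported learning data hard to review by eye.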

4. Unit Test

  • TestJsonConverter#setUp creates an instance, executes JsonConverter#csv_to_dict to get the generated Q&A data and takes its first element.
  • TestJsonConverter#test_number_of_learning_data checks that the number of unique Serial_Nums in the CSV learning data equals the number of generated Q&As.
  • TestJsonConverter#test_qa_num checks that the QA number matches the id of the first Q&A.
  • TestJsonConverter#test_questions checks that the questions match those of the first Q&A.
  • TestJsonConverter#test_answers checks that the answer matches that of the first Q&A.
  • TestJsonConverter#test_json_presence checks that the JSON file is exported to the right path.

  • Python

import unittest
import json
import csv
import os.path
from os import path
import sys
sys.path.append("../lib")
sys.path.append("../lib/concerns")
from json_converter import JsonConverter
from file_handler import csv_to_list_with_index, uniq_list

class TestJsonConverter(unittest.TestCase):
    def setUp(self):
        self.json_converter = JsonConverter()
        self.csv_path       = "../csv/learning_data.csv"
        self.learning_data  = self.json_converter.csv_to_dict(self.csv_path)
        self.first_qna      = self.learning_data[0]

    def test_number_of_learning_data(self):
        self.assertEqual(len(uniq_list(csv_to_list_with_index(self.csv_path, 0))), len(self.learning_data))

    def test_qa_num(self):
        self.assertEqual("QA001", self.first_qna["id"])

    def test_questions(self):
        self.assertEqual("GitHubとは何ですか", self.first_qna["data"]["questions"]["ja"][0])
        self.assertEqual("GitHubとはどんなシステムか", self.first_qna["data"]["questions"]["ja"][1])
        self.assertEqual("GitHubって何", self.first_qna["data"]["questions"]["ja"][2])
        self.assertEqual("GitHubってなに", self.first_qna["data"]["questions"]["ja"][3])

    def test_answers(self):
        self.assertEqual("ソフトウェア開発のプラットフォームであり、ソースコードをホスティングする。コードのバージョン管理システムにはGitを使用します。", self.first_qna["data"]["answers"]["ja"][0])

    def test_json_presence(self):
        json_path = "../json/learning_data.json"
        self.json_converter.dict_to_json(json_path)
        self.assertEqual(True, path.exists(json_path))

if __name__ == "__main__":
    unittest.main()

  • Ruby

require 'minitest/autorun'
require 'csv'
require_relative '../lib/json_converter'

class JsonConverterTest < Minitest::Test
  def setup
    csv_path        = '../csv/learning_data.csv'
    json_path       = '../json/learning_data.json'
    @json_converter = JsonConverter.new(csv_path, json_path)
    @json_converter.csv_to_hash
    @first_qna = @json_converter.obj.dig(:qnas).first
  end

  def test_number_of_learning_data
    # uniq (not uniq!) is used because uniq! returns nil when there are no duplicates
    assert_equal CSV.read(@json_converter.csv_path, headers: true)['Serial_Nums'].uniq.size, @json_converter.obj.dig(:qnas).size
  end

  def test_qa_num
    assert_equal 'QA001', @first_qna.dig(:id)
  end

  def test_questions
    assert_equal 'GitHubとは何ですか', @first_qna.dig(:data).dig(:questions).dig(:ja)[0]
    assert_equal 'GitHubとはどんなシステムか', @first_qna.dig(:data).dig(:questions).dig(:ja)[1]
    assert_equal 'GitHubって何', @first_qna.dig(:data).dig(:questions).dig(:ja)[2]
    assert_equal 'GitHubってなに', @first_qna.dig(:data).dig(:questions).dig(:ja)[3]
  end

  def test_answers
    assert_equal 'ソフトウェア開発のプラットフォームであり、ソースコードをホスティングする。コードのバージョン管理システムにはGitを使用します。', @first_qna.dig(:data).dig(:answers).dig(:ja)[0]
  end

  def test_json_presence
    @json_converter.hash_to_json
    assert_equal true, File.exist?(@json_converter.json_path)
  end
end
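
The Python test relies on `csv_to_list_with_index` from `file_handler`, which is not shown in this article. Judging only from how the test calls it, it appears to return one column of the CSV as a list. The following is a minimal sketch under that assumption; the header handling and the `encoding` argument are my guesses, and the real helper may differ.

```python
import csv


def csv_to_list_with_index(csv_path, index):
    """Return the values of one column (by position), header excluded."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if there is one
        # Blank lines come back as empty lists, so skip them.
        return [row[index] for row in reader if row]
```

With index 0 this yields every Serial_Nums value, repeated once per question, which is why the test removes duplicates before comparing the count.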

5. Conclusion

Botpress uses N-grams instead of morphological analysis for feature extraction and needs tuning, so many of the tools have to be written in Python.
That is why I converted my Ruby scripts to Python.

I wrote Python for the first time in months, so I had somewhat forgotten how.
It does not offer method or function calls as intuitive as Ruby's, and importing libraries, even basic ones, was laborious.
What was more challenging was that Python raises exceptions for some method or function calls where Ruby returns nil.
For example, Ruby returns nil when you access a non-existent index, whereas Python raises IndexError, which stops the procedure.
That is why I had to give Answers a different initial value in the Python Q&A Dict than in the Ruby Hash, to keep the procedure from halting.
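
The difference is easy to reproduce. In Ruby, `[].last` returns nil, so comparing it with the first answer simply evaluates to false. In Python, the equivalent lookup on an empty list raises. The following is a minimal sketch; the helper `last_or_none` is a hypothetical alternative to the placeholder value, not what the scripts above use.

```python
def last_or_none(items):
    """Return the last element, or None for an empty list (like Ruby's Array#last)."""
    return items[-1] if items else None


if __name__ == "__main__":
    answers = []
    try:
        answers[-1]
    except IndexError as error:
        print(f"IndexError: {error}")  # prints "IndexError: list index out of range"
    print(last_or_none(answers))  # prints None
```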

Please do not get me wrong.
Python is wonderful: it is rich in libraries that Ruby does not have, which surprised me when I built an NLP application.
For deep learning or Natural Language Processing, Python will be the only choice.

As homework for myself, I need to write more and more Python, and I will.

6. Source Code