Dirty Data Parser

Octaria developed a custom Python program for parsing dirty data. Its input is an Excel file that contains dirty OCR data.

The program lets you specify which columns to parse and the regular expressions used to parse them.

It includes automated test generation, driven by a JSON-like input format, for both positive and negative test cases.

 

Input File

The input file contains dirty data imported from OCR (optical character recognition).

Octaria - Dirty Data Parser

Parsing

Columns are defined as either parseable or static. Each known regex is applied to its parseable column, and each capture group it matches is written out to its own column.

Here is an example of one regex rule:

KNOWN_REGEX = [
  #### Regex 1 ####
  # "Account <name> phone number <number>"
  # "Account <name> Phone number <number>"
  # Raw string (r"...") so that \d is passed to the regex engine unchanged.
  ("OCR_Notes", r"^[ ]*[Aa]ccount *([a-zA-Z0-9]*) *[Pp]hone *[Nn]umber *(\d*) *$",
    "Account",
    "Phone_Number"),
]
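
To make the rule format concrete, here is a minimal sketch of how one such rule could be applied to a single cell. The function name parse_cell and the reading of the tuple as (source column, pattern, one output column per capture group) are assumptions inferred from the example above, not the program's actual code.

```python
import re

# Assumed rule layout: (source column, pattern, output column for group 1, group 2, ...)
KNOWN_REGEX = [
    ("OCR_Notes",
     r"^[ ]*[Aa]ccount *([a-zA-Z0-9]*) *[Pp]hone *[Nn]umber *(\d*) *$",
     "Account",
     "Phone_Number"),
]


def parse_cell(column, text, rules=KNOWN_REGEX):
    """Return {output column: captured text} for the first matching rule, else {}."""
    for source_column, pattern, *output_columns in rules:
        if source_column != column:
            continue  # this rule targets a different column
        match = re.match(pattern, text)
        if match:
            # Pair each capture group with its output column name.
            return dict(zip(output_columns, match.groups()))
    return {}


parsed = parse_cell("OCR_Notes", " Account CompanyName1 Phone number 1234565555 ")
unparsed = parse_cell("OCR_Notes", "Account CompanyName1")
```

In this sketch a cell that matches no rule yields an empty dict, so the caller can decide whether to leave the output columns blank or flag the row for review.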

Automated Test Case Generation

Here is an example of test cases for the rule above. The JSON-like format automatically generates one positive or negative test case for every entry.

The following code therefore generates 8 automated test cases (4 positive and 4 negative) for the first regex rule:

TEST_CASES = [
  # TESTS 1.
  {
    "pos" : [
      "account CompanyName1 phone number 1234565555",
      " account CompanyName1 phone number 1234565555 ",
      "Account CompanyName1 phone number 1234565555",
      " Account CompanyName1 phone number 1234565555"
    ],
    "neg" : [
      " Account CompanyName1",
      " account CompanyName1",
      "Account CompanyName1",
      "account CompanyName1"
    ]
  },
]

Output File

The output file contains the parsed entries in separate columns, alongside the original data.
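
As a rough illustration of that output shape, the sketch below copies a static column through unchanged and appends the parsed columns. The ID column, the sample rows, and the use of CSV text in place of an Excel file are all assumptions made to keep the example self-contained.

```python
import csv
import io
import re

PATTERN = re.compile(
    r"^[ ]*[Aa]ccount *([a-zA-Z0-9]*) *[Pp]hone *[Nn]umber *(\d*) *$"
)


def parse_row(row):
    """Keep every input column and add Account / Phone_Number from OCR_Notes."""
    out = dict(row)
    match = PATTERN.match(row["OCR_Notes"])
    # Rows that match no rule get empty parsed columns.
    account, phone = match.groups() if match else ("", "")
    out["Account"] = account
    out["Phone_Number"] = phone
    return out


# Hypothetical input rows; "ID" stands in for a static column.
rows = [
    {"ID": "1", "OCR_Notes": "Account CompanyName1 phone number 1234565555"},
    {"ID": "2", "OCR_Notes": "unreadable scan"},
]
parsed_rows = [parse_row(row) for row in rows]

# Written to an in-memory CSV here; the real program produces an output file.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["ID", "OCR_Notes", "Account", "Phone_Number"]
)
writer.writeheader()
writer.writerows(parsed_rows)
output_text = buffer.getvalue()
```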
