The Python CSV module can’t distinguish between a NULL value and an empty string.

Let’s say we have a CSV file with 3 columns: a, b and c. The first column contains a non-empty string, the second contains a NULL and the third contains an empty string:

a,b,c
some string,,""

Let’s write a simple Python script to read the CSV file:

from csv import DictReader

# Read the whole file into a list of dicts, one per row
with open('csv_file', 'r', newline='') as f:
    row_list = list(DictReader(f))
print(row_list)

This is what row_list looks like:

[{'a': 'some string', 'b': '', 'c': ''}]

Well, fuck… Notice that b and c are now the same. As far as Python is concerned, these two columns are identical, even though they clearly differ in the CSV file.
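This isn’t specific to DictReader; the lower-level csv.reader collapses the two fields in exactly the same way:

from csv import reader

with open('csv_file', 'r', newline='') as f:
    for row in reader(f):
        print(row)

# ['a', 'b', 'c']
# ['some string', '', '']  <- unquoted empty and quoted "" both come back as ''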

If we write row_list to another CSV file:

from csv import DictReader, DictWriter

with open('csv_file', 'r', newline='') as f:
    row_list = list(DictReader(f))

# Write the rows back out unchanged
with open('csv_file_processed', 'w', newline='') as f:
    writer = DictWriter(f, fieldnames=['a', 'b', 'c'])
    writer.writeheader()
    writer.writerows(row_list)

The resulting CSV file looks like this:

a,b,c
some string,,

Both the b and c columns are now NULLs, even though c started out as an empty string.
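No writer setting can bring the distinction back, because it was already lost at read time. For example, forcing quotes on output (a quick check reusing row_list from the script above; the output filename is made up) just produces the opposite corruption:

from csv import DictWriter, QUOTE_ALL

# Quote every field on output; both b and c now come out as "",
# because the read step already collapsed NULL and '' into the same value.
with open('csv_file_quoted', 'w', newline='') as f:
    writer = DictWriter(f, fieldnames=['a', 'b', 'c'], quoting=QUOTE_ALL)
    writer.writeheader()
    writer.writerows(row_list)

Now both columns look like empty strings instead of both looking like NULLs.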

This issue is currently being discussed here.

My personal experience with this issue

I worked on a service that reads CSV files, processes the data, writes the processed data to other CSV files and copies these processed files into Postgres.

The Postgres model had NOT NULL defined for column c; it expected a string, empty or not. After the files were processed, Python tried to insert a NULL into a NOT NULL column.

Postgres doesn’t like that.

Copy exception for table: null value in column "c" violates not-null constraint

This specific Postgres issue is also mentioned in the thread above.

Solutions

The CSV parsing functionality in CPython is implemented in C, so unfortunately I couldn’t just override a method to make it parse the CSV the way I wanted to. I had to come up with something else.

Update the Postgres model

Remove the NOT NULL option from the Postgres model.

While this solution works, it wasn’t feasible for me: we would have had to change many models, and then update the transformations and queries for those models.

FORCE NOT NULL

A quick and dirty fix is to update the COPY statement with the FORCE NOT NULL option for the columns that need it. This forcefully inserts an empty string instead of a NULL for the specified columns.

COPY table (a, b, c) FROM stdin WITH (FORMAT csv, HEADER, DELIMITER ',', FREEZE, FORCE_NOT_NULL (c));
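If you issue the COPY from Python, this is roughly how it looks with psycopg2; the connection string and the table name my_table are placeholders (FREEZE is left out here since it only works when the table was created or truncated in the same transaction):

import psycopg2

conn = psycopg2.connect('dbname=mydb')
with conn, conn.cursor() as cur, open('csv_file_processed') as f:
    # Stream the processed file into Postgres, forcing c back to ''
    cur.copy_expert(
        'COPY my_table (a, b, c) FROM STDIN '
        "WITH (FORMAT csv, HEADER, DELIMITER ',', FORCE_NOT_NULL (c))",
        f,
    )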

This approach comes with a couple of disadvantages:

  1. You need to manually specify which columns to FORCE NOT NULL on. It’s not dynamic.

  2. This is not a solution to the parsing problem, it’s a workaround. You’re still changing the input data that you receive, and then trying to change it back using hacky techniques.

Use an intermediary Postgres

Instead of parsing the data with Python, you can copy the input file into an intermediary Postgres table, process the data there, then export the table as a CSV. Unlike Python, Postgres does distinguish between NULL and empty string in CSV files, so the integrity of the data is maintained.
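A sketch of that flow with psycopg2, assuming a staging table with the same three columns (the connection string and table name are placeholders):

import psycopg2

conn = psycopg2.connect('dbname=mydb')
with conn, conn.cursor() as cur:
    # Load the raw file; Postgres keeps ,, (NULL) and ,"", ('') apart
    with open('csv_file') as f:
        cur.copy_expert(
            'COPY staging (a, b, c) FROM STDIN WITH (FORMAT csv, HEADER)', f)

    # ... process the data with SQL against the staging table ...

    # Export it again; on output Postgres quotes empty strings, so they
    # remain distinguishable from NULLs in the resulting file
    with open('csv_file_processed', 'w') as f:
        cur.copy_expert(
            'COPY staging TO STDOUT WITH (FORMAT csv, HEADER)', f)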

Fix the Python CSV module

In my opinion, the Python CSV module should be able to distinguish between NULL and an empty string, because they are different types. To me, this would be the optimal solution.
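Until that happens, the distinction can be recovered by hand. Here’s a rough sketch of a workaround, not anything the csv module does itself: each line is parsed twice, once normally and once with QUOTE_NONE so the raw quote characters survive, and a field that is empty in both passes was never quoted, so it is treated as a NULL. It assumes one record per physical line and no delimiters inside quoted fields:

import csv

def read_rows_with_nulls(path):
    with open(path, 'r', newline='') as f:
        for line in f:
            parsed = next(csv.reader([line]))
            raw = next(csv.reader([line], quoting=csv.QUOTE_NONE))
            # Empty in both passes -> the field was never quoted -> NULL
            yield [None if value == '' and raw_value == '' else value
                   for value, raw_value in zip(parsed, raw)]

print(list(read_rows_with_nulls('csv_file')))
# [['a', 'b', 'c'], ['some string', None, '']]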