csv2db – the little CSV command line loader you always wanted

It was over the last holiday season that I wanted to take some time and play around a bit with data. The outcome was rather different than I had initially expected. Rather than working with the data, I spent a lot of time trying to get some CSV data loaded into a database. I knew that there are great tools such as SQL Developer out there that make loading CSV data trivial, but I wanted something that I could execute via the command line, not requiring a GUI. The simple reason was that I wanted to script the loading part, rather than having to execute some manual, potentially error-prone steps whenever loading some data again. Too late, actually just last week, I found out that SQLcl already had what I was after: a load command that does exactly what I wanted:

SQL> help load
LOAD
-----

Loads a comma separated value (csv) file into a table.
The first row of the file must be a header row. The columns in the header row must match the columns defined on the table.

The columns must be delimited by a comma and may optionally be enclosed in double quotes.
Lines can be terminated with standard line terminators for windows, unix or mac.
File must be encoded UTF8.

The load is processed with 50 rows per batch.
If AUTOCOMMIT is set in SQLCL, a commit is done every 10 batches.
The load is terminated if more than 50 errors are found.

LOAD [schema.]table_name[@db_link] file_name

Unfortunately, back then Google didn’t offer me this insight when I looked for “load csv data into oracle”, so I was stuck with the methods I already knew and found too cumbersome to deal with. After all, all I wanted was a simple tool that loaded data from CSV files into a table, matching the column names with the header of the file:

cat people.csv

"first_name","last_name","city"
"Gerald","Venzl","San Francisco"
"Andreas","Pumberger","New York City"

SELECT * FROM PEOPLE

FIRST_NAME  LAST_NAME  CITY
----------  ---------  -------------
Gerald      Venzl      San Francisco
Andreas     Pumberger  New York City

I knew that what I was after wasn’t difficult at all and probably just a few lines of code, and so, after hours and hours of trying to load that data, I wrote a little Python program to do exactly what I needed. Writing the program took me just about two hours and around 300 lines of code. After that I was able to load the data into my database in an easy and scriptable way. Thinking of how much time I had spent looking for such a tool and trying different approaches without the desired outcome, I thought that I might as well just help the world and provide this script to others as well. Of course, it is not difficult to write such a little program, but after all, if I can save somebody out there some time, why not? So I polished the code I had written, wrote some documentation around it, and named the program csv2db. It is now available at https://github.com/csv2db/csv2db and can load data into Oracle, MySQL, PostgreSQL, and Db2 databases.

In the meantime, I have added some nice little features to it, such as:

  • Load many CSV files inside a directory at once
  • Load CSV files based on wildcards and pathname patterns, e.g. ~/data/bike*2018*csv
  • Specify the batch size
  • Generate the SQL CREATE TABLE syntax with all the header columns from one or more CSV files
  • Load compressed CSV files in .zip and .gz format

The last feature is nice since most CSV files come in compressed form and uncompressing them first adds no benefit but just fills up the disk unnecessarily.
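Reading compressed CSV files transparently takes only a few lines of Python. The following is a minimal sketch of the idea, not csv2db’s actual implementation, and it assumes a .zip archive contains a single CSV file:

```python
import csv
import gzip
import io
import zipfile

def open_csv(path):
    """Return a text-mode file object for a plain, .gz, or .zip CSV file."""
    if path.endswith(".gz"):
        # gzip.open in "rt" mode decompresses on the fly while reading.
        return gzip.open(path, mode="rt", newline="")
    if path.endswith(".zip"):
        archive = zipfile.ZipFile(path)
        # Assumption: the archive holds exactly one CSV file.
        inner = archive.namelist()[0]
        return io.TextIOWrapper(archive.open(inner), newline="")
    return open(path, newline="")
```

A `csv.reader` can then consume the returned file object without caring whether the data on disk was compressed.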

The CREATE TABLE feature came out of an immediate need of my own. I had many CSV files (50+), and not all of them had the same structure (because their structure changed over time). I really didn’t want to uncompress them one by one, copy and paste all the headers into a text file, and then figure out which columns I needed in my table. I just wanted to get the data into the database and then cleanse and process it further with SQL. So what I was after was something that just gave me a CREATE TABLE statement with all the columns from the CSV files, their data type being just string, with no constraints or anything else. In other words, just let me load the data into a staging table and take it from there. While this isn’t required anymore once the CSV column structure is known, it makes getting started with new, unknown CSV data just so easy.

Loading data from a CSV file into a database with csv2db is a simple one-line command:

$ ./csv2db load --file 201811-citibike-tripdata.csv.gz \
--table citibikes --dbtype oracle --host dbserver1 \
--user csv_data --password csv_data --dbname ORCLPDB1

Loading file 201811-citibike-tripdata.csv.gz
Done

This command will load all data from the file 201811-citibike-tripdata.csv.gz into the table CITIBIKES. The rest of the parameters are database connection details. --dbtype tells csv2db which database driver to use; --host, --dbname, --user, and --password are the connection details for the database and the database user. By default, the default port for each database is used, but this can be overridden with the --port parameter, if required. The data type conversion is left to the implicit data type conversion of the database. That means that all inputs are sent as strings and it’s up to the database to convert them into the respective data types of the individual columns.
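The send-everything-as-strings approach maps naturally onto the Python DB-API’s executemany. Here is a hedged sketch of batched loading (sqlite3 stands in for the real drivers, and the function name and `?` placeholder style are my own, not csv2db’s):

```python
import itertools
import sqlite3

def load_in_batches(conn, table, columns, rows, batch_size=100):
    """Insert rows (sequences of strings) in batches via executemany.

    All values are passed as strings; the database's implicit type
    conversion turns them into the column data types.
    """
    placeholders = ", ".join(["?"] * len(columns))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    cur = conn.cursor()
    it = iter(rows)
    while True:
        # Pull the next batch_size rows; an empty batch means we are done.
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            break
        cur.executemany(sql, batch)
    conn.commit()
```

Real drivers differ in placeholder style (Oracle’s cx_Oracle uses `:1`, PostgreSQL’s psycopg2 uses `%s`), which is one of the details a tool like csv2db hides from you.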

If you get a new set of CSV files and want to create a staging table for them, you can do so again with a simple one-line command:

$ ./csv2db generate -f *-citibike-tripdata.csv.gz \
--table STAGING --column-type VARCHAR2(1000)
CREATE TABLE STAGING
(
END_STATION_LATITUDE VARCHAR2(1000),
GENDER VARCHAR2(1000),
START_STATION_LATITUDE VARCHAR2(1000),
END_STATION_NAME VARCHAR2(1000),
BIRTH_YEAR VARCHAR2(1000),
START_STATION_NAME VARCHAR2(1000),
STOPTIME VARCHAR2(1000),
END_STATION_ID VARCHAR2(1000),
STARTTIME VARCHAR2(1000),
START_STATION_LONGITUDE VARCHAR2(1000),
START_STATION_ID VARCHAR2(1000),
BIKEID VARCHAR2(1000),
USERTYPE VARCHAR2(1000),
TRIPDURATION VARCHAR2(1000),
END_STATION_LONGITUDE VARCHAR2(1000)
);

This command scans all files matching the pattern *-citibike-tripdata.csv.gz, extracts their headers, i.e. the first line of each file, and produces a CREATE TABLE statement with all the columns found in the CSV headers. Spaces in the header columns are automatically replaced by _. The --table parameter allows you to specify a table name; when omitted, you get CREATE TABLE <TABLE NAME> instead and can fill in the table name afterwards. The --column-type parameter allows you to specify the desired data type for all columns. Note that the goal for csv2db here is not to parse the entire CSV file and guess the data types for each column. Instead, it just gives you the syntax to create a staging table that will hold your data. It is up to you to figure out the data types you want to use, whether you want to load data into such a staging table and then do an INSERT AS SELECT * afterwards, etc. This command is just here to make it easy to get started with your CSV data.
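The whole generate step boils down to collecting the union of all header columns across the matching files. The sketch below illustrates the idea under simplifying assumptions (plain, uncompressed files only; the function name is mine, not csv2db’s), normalizing names the way the example output above shows: spaces become underscores and names are upper-cased.

```python
import csv
import glob

def generate_create_table(pattern, table="<TABLE NAME>",
                          column_type="VARCHAR2(1000)"):
    """Collect the header columns of all CSV files matching pattern and
    emit a CREATE TABLE statement; duplicates across files appear once."""
    columns = {}  # insertion-ordered set of normalized column names
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            header = next(csv.reader(f))  # only the first line is read
        for name in header:
            columns.setdefault(name.strip().replace(" ", "_").upper(), None)
    body = ",\n".join(f"  {col} {column_type}" for col in columns)
    return f"CREATE TABLE {table}\n(\n{body}\n);"
```

Because only the first line of each file is read, this stays cheap even for very large CSV files.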

You can find the entire set of parameters that csv2db provides in the ReadMe file as well as the User’s Guide. I hope that csv2db will help some folks out there get their CSV data into their database easily. Bug reports, enhancement requests, ideas, comments, etc. are all welcome; preferably, submit them as issues on GitHub.

And, of course, please feel free to spread the word!

Database Requirements for Modern Development

A while ago I wrote a paper about the key capabilities databases should provide for today’s software development world. The title of the paper is “Database Requirements for Modern Development”. It was originally published on the Oracle Technology Network and I have just republished it here on my website. You can find it at: https://geraldonit.com/articles/database-requirements-for-modern-development/