It was over the last holiday season that I wanted to take some time and play around a bit with data. The outcome was rather different than I had initially expected: rather than working with the data, I spent a lot of time trying to get some CSV data loaded into a database. I knew that there are great tools such as SQL Developer out there that make loading CSV data trivial, but I wanted something that I could execute via the command line, without requiring a GUI. The simple reason was that I wanted to script the loading part, rather than having to execute some manual, potentially error-prone steps every time I loaded some data. Too late, actually just last week, I found out that SQLcl already had what I was after: a load command that does exactly what I wanted:
SQL> help load
LOAD
-----
Loads a comma separated value (csv) file into a table.
The first row of the file must be a header row.
The columns in the header row must match the columns defined on the table.
The columns must be delimited by a comma and may optionally be enclosed in double quotes.
Lines can be terminated with standard line terminators for windows, unix or mac.
File must be encoded UTF8.

The load is processed with 50 rows per batch.
If AUTOCOMMIT is set in SQLCL, a commit is done every 10 batches.
The load is terminated if more than 50 errors are found.

LOAD [schema.]table_name[@db_link] file_name
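Going by the syntax line at the bottom of that help output, loading a file into an existing table should be as simple as the following, a sketch based purely on the help text, which I haven't exercised beyond that:

SQL> LOAD people people.csv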
Unfortunately, back then Google didn't offer me this insight when I was looking for "load csv data into oracle", and so I was stuck with the methods I already knew and found too cumbersome to deal with. After all, all I wanted was a simple tool that loaded data from CSV files into a table, matching the column names with the header of the file:
cat people.csv
"first_name","last_name","city"
"Gerald","Venzl","San Francisco"
"Andreas","Pumberger","New York City"

SELECT * FROM PEOPLE

FIRST_NAME LAST_NAME CITY
---------- --------- -------------
Gerald     Venzl     San Francisco
Andreas    Pumberger New York City
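For the example above, the PEOPLE table would only need columns whose names match the CSV header, along these lines (Oracle syntax; the column sizes are arbitrary choices of mine):

CREATE TABLE people
(
  first_name VARCHAR2(100),
  last_name  VARCHAR2(100),
  city       VARCHAR2(100)
);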
I knew that what I was after wasn't difficult at all, probably just a few lines of code, and so, after hours and hours of trying to load that data, I wrote a little Python program to do exactly what I needed. Writing the program took me just about two hours and around 300 lines of code. After that I was able to load the data into my database in an easy and scriptable way. Thinking of how much time I had spent looking for this and trying different means to no desired outcome, I thought that I might as well help the world and provide this script to others. Of course, it's not difficult to write such a little program, but if I can save somebody out there some time, why not? So I polished the code I had written, wrote some documentation around it, and named the program csv2db. It is now available at https://github.com/csv2db/csv2db and can load data into an Oracle, MySQL, PostgreSQL or Db2 database.
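For the curious, the core of such a loader is conceptually quite small. The sketch below is not the actual csv2db code, just an illustration of the general idea: read the header, build an INSERT statement from it, and send the rows in batches. It uses Python's csv module and, as a stand-in for the real database drivers, the built-in sqlite3 module; the placeholder style and connection setup are driver-specific in practice.

import csv
import sqlite3  # stand-in here; csv2db uses the respective database drivers

def load_csv(conn, table, csv_file, batch_size=10000):
    """Read the CSV header, build an INSERT statement and load rows in batches."""
    with open(csv_file, newline="") as f:
        reader = csv.reader(f)
        # Derive column names from the header row, replacing spaces with _
        header = [col.strip().replace(" ", "_") for col in next(reader)]
        placeholders = ", ".join("?" for _ in header)  # "?" is SQLite style
        stmt = f"INSERT INTO {table} ({', '.join(header)}) VALUES ({placeholders})"
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:
                conn.executemany(stmt, batch)
                batch.clear()
        if batch:
            conn.executemany(stmt, batch)
    conn.commit()

# Usage with an in-memory SQLite database as a stand-in:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT, city TEXT)")
load_csv(conn, "people", "people.csv")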
In the meantime, I have added some nice little features to it, such as:
- Load many CSV files inside a directory at once
- Load CSV files based on wildcards and pathname patterns, e.g. ~/data/bike*2018*csv
- Being able to specify the batch size
- Generate a SQL CREATE TABLE statement with all the header columns from one or more CSV files
- Being able to load compressed CSV files in .zip and .gz format
The last feature is nice since most CSV files come in compressed form, and uncompressing them first adds no benefit but just fills up the disk unnecessarily. You can instead load them directly, as in the sketch below.
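To give an idea, loading a whole year's worth of compressed files via a wildcard pattern uses the very same load command that is shown further below. I quote the pattern here so that csv2db, not the shell, expands it; host, user and database name are of course just placeholders:

$ ./csv2db load --file "bike*2018*.csv.gz" \
     --table citibikes --dbtype oracle --host dbserver1 \
     --user csv_data --password csv_data --dbname ORCLPDB1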
The CREATE TABLE feature came out of an immediate need of my own. I had many CSV files (50+) and not all of them had the same structure (because their structure changed over time). I really didn't want to uncompress them one by one, copy and paste all the headers into a text file, and then figure out which columns I needed in my table. I just wanted to get the data into the database and then cleanse and process it further with SQL. So what I was after was something that just gave me a CREATE TABLE statement with all the columns from the CSV files, their data type being just string, with no constraints or anything else. In other words, just let me load the data into a staging table and take it from there. While this isn't required anymore once the CSV column structure is known, it makes getting started with new, unknown CSV data just so easy.
Loading data from a CSV file into a database with csv2db is a simple one-line command:
$ ./csv2db load --file 201811-citibike-tripdata.csv.gz \
     --table citibikes --dbtype oracle --host dbserver1 \
     --user csv_data --password csv_data --dbname ORCLPDB1

Loading file 201811-citibike-tripdata.csv.gz
Done
This command will load all data from the file 201811-citibike-tripdata.csv.gz into the table CITIBIKES. The rest of the parameters are database connection details: --dbtype tells csv2db which database driver to use, while --host, --dbname, --user, and --password are the connection details for the database and the database user. By default, the default port for each database will be used, but this can be overridden with the --port parameter, if required. The data type conversion is left to the implicit data type conversion of the database. That means that all inputs are sent as strings and it's up to the database to convert them into the respective data types of the individual columns.
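To illustrate what that implicit conversion means: assuming TRIPDURATION is defined as a NUMBER in the CITIBIKES table (the actual column types are yours to choose), the load behaves like a plain insert of string literals:

-- '389' is implicitly converted to a NUMBER by the database.
-- Date or timestamp strings, on the other hand, only convert implicitly
-- if they match the session's NLS format settings.
INSERT INTO citibikes (tripduration) VALUES ('389');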
If you get a new set of CSV files and want to create a staging table for them, you can do so again with a simple one-line command:
$ ./csv2db generate -f *-citibike-tripdata.csv.gz \
     --table STAGING --column-type VARCHAR2(1000)

CREATE TABLE STAGING
(
  END_STATION_LATITUDE VARCHAR2(1000),
  GENDER VARCHAR2(1000),
  START_STATION_LATITUDE VARCHAR2(1000),
  END_STATION_NAME VARCHAR2(1000),
  BIRTH_YEAR VARCHAR2(1000),
  START_STATION_NAME VARCHAR2(1000),
  STOPTIME VARCHAR2(1000),
  END_STATION_ID VARCHAR2(1000),
  STARTTIME VARCHAR2(1000),
  START_STATION_LONGITUDE VARCHAR2(1000),
  START_STATION_ID VARCHAR2(1000),
  BIKEID VARCHAR2(1000),
  USERTYPE VARCHAR2(1000),
  TRIPDURATION VARCHAR2(1000),
  END_STATION_LONGITUDE VARCHAR2(1000)
);
This command scans all files matching the pattern *-citibike-tripdata.csv.gz, extracts their headers, i.e. the first line of each file, and produces a CREATE TABLE statement with all the columns found in the CSV headers. Spaces in the header columns are automatically replaced by _. The --table parameter allows you to specify a table name; when omitted, you will get a CREATE TABLE <TABLE NAME> instead and can fill in the table name afterwards. The --column-type parameter allows you to specify the desired data type for all columns. Note that the goal for csv2db here is not to parse the entire CSV file and guess the data types for each column. Instead, it just gives you the syntax to create a staging table that will hold your data. It is up to you to figure out the data types you want to use, or whether you want to load the data into such a staging table and then do an INSERT AS SELECT * afterwards, etc. This command is just here to make it easy to get started with your CSV data.
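As an example, once you have settled on real data types, moving the data out of the staging table could look roughly like this. The typed target columns, the chosen types, and the timestamp format mask are all assumptions you would adapt to your data:

INSERT INTO citibikes (bikeid, tripduration, starttime, usertype)
  SELECT TO_NUMBER(bikeid),
         TO_NUMBER(tripduration),
         TO_TIMESTAMP(starttime, 'YYYY-MM-DD HH24:MI:SS.FF'),
         usertype
    FROM staging;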
You can find the entire set of parameters that csv2db provides in the ReadMe file as well as in the User's Guide. I hope that csv2db will help some folks out there get their CSV data into their databases easily. Bug reports, enhancement requests, ideas, comments, etc. are all welcome; preferably, just submit them as issues on GitHub.
And, of course, please feel free to spread the word!