csv2db – the little CSV command line loader you always wanted

It was over the last holiday season that I wanted to take some time and play around a bit with data. The outcome was rather different than I had initially expected. Rather than working with the data, I spent a lot of time trying to get some CSV data loaded into a database. I knew that there are great tools such as SQL Developer out there that make loading CSV data trivial, but I wanted something that I could execute via the command line, without requiring a GUI. The simple reason was that I wanted to script the loading part, rather than having to execute some manual, potentially error-prone steps whenever loading some data again. I found out too late, actually just last week, that SQLcl already has what I was after: a load command that does exactly what I wanted:

SQL> help load
LOAD
-----

Loads a comma separated value (csv) file into a table.
The first row of the file must be a header row. The columns in the header row must match the columns defined on the table.

The columns must be delimited by a comma and may optionally be enclosed in double quotes.
Lines can be terminated with standard line terminators for windows, unix or mac.
File must be encoded UTF8.

The load is processed with 50 rows per batch.
If AUTOCOMMIT is set in SQLCL, a commit is done every 10 batches.
The load is terminated if more than 50 errors are found.

LOAD [schema.]table_name[@db_link] file_name
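
Based on that syntax, loading a CSV file into an existing table from SQLcl would be as simple as the following sketch (the table and file names are just examples, and the table's columns have to match the file's header row):

SQL> load people people.csv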

Unfortunately, back then Google didn’t offer me this insight when I was looking for “load csv data into oracle”, so I was stuck with the methods I already knew and found too cumbersome to deal with. After all, all I wanted was a simple tool that loaded data from CSV files into a table and matched the column names with the header of the file:

cat people.csv

"first_name","last_name","city"
"Gerald","Venzl","San Francisco"
"Andreas","Pumberger","New York City"

SELECT * FROM PEOPLE

FIRST_NAME  LAST_NAME  CITY
----------  ---------  -------------
Gerald      Venzl      San Francisco
Andreas     Pumberger  New York City

I knew that what I was after wasn’t difficult at all and probably just a few lines of code, and so, after hours and hours of trying to load that data, I wrote a little Python program to do exactly what I needed. Writing the program took me just about two hours and around 300 lines of code. After that I was able to load the data into my database in an easy and scriptable way. Thinking of how much time I had spent looking for this and trying different approaches without the desired outcome, I thought that I might as well help the world and provide this script to others. Of course, it is not difficult to write such a little program, but if I can save somebody out there some time, why not? So I polished the code I had written, wrote some documentation around it and named the program csv2db. It is now available at https://github.com/csv2db/csv2db and can load data into an Oracle, MySQL, PostgreSQL or Db2 database.

In the meantime, I have added some nice little features to it, such as:

  • Load many CSV files inside a directory at once
  • Load CSV files based on wildcards and pathname patterns, e.g. ~/data/bike*2018*csv
  • Being able to specify the batch size
  • Generate SQL CREATE TABLE statements with all the header columns from one or more CSV files
  • Being able to load compressed CSV files in .zip and .gz format

The last feature is nice because most CSV files come in compressed form, and uncompressing them first adds no benefit; it just fills up the disk unnecessarily.
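
Loading a whole batch of files in one go then simply means pointing csv2db at a pathname pattern, for example (a sketch, reusing the pattern from the list above and the connection details from the load example further down; the table name bikes is just a placeholder):

$ ./csv2db load --file ~/data/bike*2018*csv \
--table bikes --dbtype oracle --host dbserver1 \
--user csv_data --password csv_data --dbname ORCLPDB1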

The CREATE TABLE feature came out of an immediate need of my own. I had many CSV files (50+) and not all of them had the same structure (because their structure changed over time). I really didn’t want to uncompress them one by one, copy/paste all the headers into a text file and then figure out what columns I needed in my table. I just wanted to get the data into the database and then cleanse and process it further with SQL. So what I was after was something that just gave me a CREATE TABLE statement with all the columns from the CSV files, their data type being just string, with no constraints or anything else. In other words, just let me load the data into a staging table and take it from there. While this isn’t required anymore once the CSV column structure is known, it makes getting started with new, unknown CSV data just so easy.

Loading data from a CSV file into a database with csv2db is a simple one-line command:

$ ./csv2db load --file 201811-citibike-tripdata.csv.gz \
--table citibikes --dbtype oracle --host dbserver1 \
--user csv_data --password csv_data --dbname ORCLPDB1

Loading file 201811-citibike-tripdata.csv.gz
Done

This command will load all data from the file 201811-citibike-tripdata.csv.gz into the table CITIBIKES. The rest of the parameters are database connection details. --dbtype tells csv2db which database driver to use; --host, --dbname, --user, and --password are the connection details for the database and the database user. By default, the default port for each database will be used, but this can be overridden with the --port parameter, if required. The data type conversion is left to the implicit data type conversion of the database. That means that all inputs are sent as strings and it’s up to the database to convert them into the respective data types of the individual columns.
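
If your database doesn’t listen on its default port, the same command can simply be pointed at a different one via --port, for example (a sketch; 1522 is just a placeholder value):

$ ./csv2db load --file 201811-citibike-tripdata.csv.gz \
--table citibikes --dbtype oracle --host dbserver1 --port 1522 \
--user csv_data --password csv_data --dbname ORCLPDB1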

If you get a new set of CSV files and want to create a staging table for them, you can do so again with a simple one-line command:

$ ./csv2db generate -f *-citibike-tripdata.csv.gz \
--table STAGING --column-type VARCHAR2(1000)
CREATE TABLE STAGING
(
END_STATION_LATITUDE VARCHAR2(1000),
GENDER VARCHAR2(1000),
START_STATION_LATITUDE VARCHAR2(1000),
END_STATION_NAME VARCHAR2(1000),
BIRTH_YEAR VARCHAR2(1000),
START_STATION_NAME VARCHAR2(1000),
STOPTIME VARCHAR2(1000),
END_STATION_ID VARCHAR2(1000),
STARTTIME VARCHAR2(1000),
START_STATION_LONGITUDE VARCHAR2(1000),
START_STATION_ID VARCHAR2(1000),
BIKEID VARCHAR2(1000),
USERTYPE VARCHAR2(1000),
TRIPDURATION VARCHAR2(1000),
END_STATION_LONGITUDE VARCHAR2(1000)
);

This command scans all files matching the pattern *-citibike-tripdata.csv.gz, extracts their headers, i.e. the first line of each file, and produces a CREATE TABLE statement with all the columns found in the CSV headers. Spaces in the header columns are automatically replaced by _. The --table parameter allows you to specify a table name; when omitted, you will get a CREATE TABLE <TABLE NAME> instead and can fill in the table name afterwards. The --column-type parameter allows you to specify the desired data type for all columns. Note that the goal for csv2db here is not to parse the entire CSV file and guess the data types of each column. Instead, it just gives you the syntax to create a staging table that will hold your data. It is up to you to figure out the data types you want to use, or whether you want to load the data into such a staging table and then do an INSERT AS SELECT * afterwards, etc. This command is just here to make it easy to get started with your CSV data.
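
For instance, running the same command without --table gives you the placeholder variant described above (a sketch; the column list is shortened here to keep it brief):

$ ./csv2db generate -f *-citibike-tripdata.csv.gz --column-type VARCHAR2(1000)
CREATE TABLE <TABLE NAME>
(
TRIPDURATION VARCHAR2(1000),
...
);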

You can find the entire set of parameters that csv2db provides in the ReadMe file as well as the User’s Guide. I hope that csv2db will help some folks out there get their CSV data into their databases easily. Bug reports, enhancement requests, ideas, comments, etc. are all welcome; preferably, just submit them as issues on GitHub.

And, of course, please feel free to spread the word!

How to download the latest GitHub repo release via command line

I just fiddled around a bit to find out how to download the latest GitHub release via the command line. It turns out that GitHub doesn’t provide a universal download URL for release binaries like it does for the release browser page itself. The latest release page can always be reached via https://github.com/ORGANIZATION/REPO/releases/latest, e.g. https://github.com/gvenzl/csv2db/releases/latest. Unfortunately, that is not true for the binaries, which are available under https://github.com/ORGANIZATION/REPO/archive/RELEASE_TAG.zip and https://github.com/ORGANIZATION/REPO/archive/RELEASE_TAG.tar.gz but not under something generic like https://github.com/ORGANIZATION/REPO/archive/latest.zip or https://github.com/ORGANIZATION/REPO/archive/latest.tar.gz.

The issue here is that GitHub only provides the release binaries under their actual release tag, e.g. v1.0.0, but that tag is entirely up to the user to define and changes as more releases are added, of course. It’s therefore hard to guess what the release tag would be and what to download. Nevertheless, thanks to the GitHub developer APIs it’s not that tricky to figure out the release tag of the latest release and instruct a utility like curl to download the binary. I’ve written a little one-liner for the UNIX command line to do exactly that. It took me a bit to get it right, so I thought it might come in handy for you rather than you having to reinvent the wheel. It’s also available as a public Gist but I thought for visibility purposes it makes sense to blog it here, too.

LOCATION=$(curl -s https://api.github.com/repos/YOUR_ORGANIZATION/YOUR_REPO/releases/latest \
| grep "tag_name" \
| awk '{print "https://github.com/YOUR_ORGANIZATION/YOUR_REPO/archive/" substr($2, 2, length($2)-3) ".zip"}') \
; curl -L -o OUTPUT_FILE_NAME $LOCATION

For example:

LOCATION=$(curl -s https://api.github.com/repos/gvenzl/csv2db/releases/latest \
| grep "tag_name" \
| awk '{print "https://github.com/gvenzl/csv2db/archive/" substr($2, 2, length($2)-3) ".zip"}') \
; curl -L -o csv2db.zip $LOCATION

Here is how it goes:

LOCATION=$(...)

stores the output of all the commands within the parentheses in the variable $LOCATION,

curl -s https://api.github.com/repos/gvenzl/csv2db/releases/latest

gets the latest release from your repository, in my case https://github.com/gvenzl/csv2db,

grep "tag_name"

grabs the tag name of the latest release (e.g. v1.0.0),

awk '{print "https://github.com/gvenzl/csv2db/archive/" substr($2, 2, length($2)-3) ".zip"}'

prints https://github.com/gvenzl/csv2db/archive/ followed by the stripped tag and .zip, i.e. https://github.com/gvenzl/csv2db/archive/v1.0.0.zip.
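
If you want to see what that substr() expression does in isolation, you can feed it a sample tag_name line as returned by the API (the tag value is just the example from above):

$ echo '"tag_name": "v1.0.0",' | awk '{print substr($2, 2, length($2)-3)}'
v1.0.0

It simply strips the surrounding double quotes and the trailing comma from the second field.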

Now the $LOCATION shell variable is set to that string. From that point on you can do whatever you like with that variable, in case you have another use for it than downloading the binary. Or, just like below, you can complete the download process:

curl -L -o csv2db.zip $LOCATION

invokes cURL and downloads $LOCATION into csv2db.zip. The -L parameter is important so that cURL follows redirects.

Hope this helps.

How to install Python 3 on Oracle Linux

You can install Python 3 on your Oracle Linux 7 environment with three simple steps:

  1. sudo yum install -y yum-utils
  2. sudo yum-config-manager --enable *EPEL
  3. sudo yum install -y python36

The first step, in case you don’t have it on your system yet, is to install the yum-utils package. This package includes yum-config-manager, which allows you to easily enable and disable yum repositories, amongst other tasks.

[gerald@localhost ~]$ sudo yum install -y yum-utils
Loaded plugins: langpacks, ulninfo
Resolving Dependencies
--> Running transaction check
---> Package yum-utils.noarch 0:1.1.31-46.0.1.el7_5 will be updated
---> Package yum-utils.noarch 0:1.1.31-50.0.1.el7 will be an update
--> Finished Dependency Resolution

Dependencies Resolved

=====================================================================================================================================
Package Arch Version Repository Size
=====================================================================================================================================
Updating:
yum-utils noarch 1.1.31-50.0.1.el7 ol7_latest 121 k

Transaction Summary
=====================================================================================================================================
Upgrade 1 Package

Total download size: 121 k
Downloading packages:
No Presto metadata available for ol7_latest
yum-utils-1.1.31-50.0.1.el7.noarch.rpm | 121 kB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Updating : yum-utils-1.1.31-50.0.1.el7.noarch 1/2
Cleanup : yum-utils-1.1.31-46.0.1.el7_5.noarch 2/2
Verifying : yum-utils-1.1.31-50.0.1.el7.noarch 1/2
Verifying : yum-utils-1.1.31-46.0.1.el7_5.noarch 2/2

Updated:
yum-utils.noarch 0:1.1.31-50.0.1.el7

Complete!
[gerald@localhost ~]$

Python 3 ships as part of the EPEL (Extra Packages for Enterprise Linux) repository, but that repository is not enabled by default. With yum-config-manager, enabling that repository is easy:

[gerald@localhost ~]$ sudo yum-config-manager --enable *EPEL
Loaded plugins: langpacks
===================================================== repo: ol7_developer_EPEL ======================================================
[ol7_developer_EPEL]
async = True
bandwidth = 0
base_persistdir = /var/lib/yum/repos/x86_64/7Server
baseurl = https://yum.oracle.com/repo/OracleLinux/OL7/developer_EPEL/x86_64/
cache = 0
cachedir = /var/cache/yum/x86_64/7Server/ol7_developer_EPEL
check_config_file_age = True
compare_providers_priority = 80
cost = 1000
deltarpm_metadata_percentage = 100
deltarpm_percentage =
enabled = 1
enablegroups = True
exclude =
failovermethod = priority
ftp_disable_epsv = False
gpgcadir = /var/lib/yum/repos/x86_64/7Server/ol7_developer_EPEL/gpgcadir
gpgcakey =
gpgcheck = True
gpgdir = /var/lib/yum/repos/x86_64/7Server/ol7_developer_EPEL/gpgdir
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
hdrdir = /var/cache/yum/x86_64/7Server/ol7_developer_EPEL/headers
http_caching = all
includepkgs =
ip_resolve =
keepalive = True
keepcache = False
mddownloadpolicy = sqlite
mdpolicy = group:small
mediaid =
metadata_expire = 21600
metadata_expire_filter = read-only:present
metalink =
minrate = 0
mirrorlist =
mirrorlist_expire = 86400
name = Oracle Linux 7Server Development Packages (x86_64)
old_base_cache_dir =
password =
persistdir = /var/lib/yum/repos/x86_64/7Server/ol7_developer_EPEL
pkgdir = /var/cache/yum/x86_64/7Server/ol7_developer_EPEL/packages
proxy = False
proxy_dict =
proxy_password =
proxy_username =
repo_gpgcheck = False
retries = 10
skip_if_unavailable = False
ssl_check_cert_permissions = True
sslcacert =
sslclientcert =
sslclientkey =
sslverify = True
throttle = 0
timeout = 30.0
ui_id = ol7_developer_EPEL/x86_64
ui_repoid_vars = releasever,
basearch
username =

[gerald@localhost ~]$

Now you are ready to install Python 3, version 3.6 in this case:

[gerald@localhost ~]$ sudo yum install -y python36
Loaded plugins: langpacks, ulninfo
ol7_UEKR4 | 1.2 kB 00:00:00
ol7_developer_EPEL | 1.2 kB 00:00:00
ol7_latest | 1.4 kB 00:00:00
(1/2): ol7_developer_EPEL/x86_64/updateinfo | 2.4 kB 00:00:00
(2/2): ol7_developer_EPEL/x86_64/primary | 8.0 MB 00:00:05
ol7_developer_EPEL 24145/24145
Resolving Dependencies
--> Running transaction check
---> Package python36.x86_64 0:3.6.6-1.el7 will be installed
--> Processing Dependency: python36-libs(x86-64) = 3.6.6-1.el7 for package: python36-3.6.6-1.el7.x86_64
--> Processing Dependency: libpython3.6m.so.1.0()(64bit) for package: python36-3.6.6-1.el7.x86_64
--> Running transaction check
---> Package python36-libs.x86_64 0:3.6.6-1.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

=====================================================================================================================================
Package Arch Version Repository Size
=====================================================================================================================================
Installing:
python36 x86_64 3.6.6-1.el7 ol7_developer_EPEL 65 k
Installing for dependencies:
python36-libs x86_64 3.6.6-1.el7 ol7_developer_EPEL 8.6 M

Transaction Summary
=====================================================================================================================================
Install 1 Package (+1 Dependent package)

Total download size: 8.6 M
Installed size: 36 M
Downloading packages:
(1/2): python36-3.6.6-1.el7.x86_64.rpm | 65 kB 00:00:00
(2/2): python36-libs-3.6.6-1.el7.x86_64.rpm | 8.6 MB 00:00:08
-------------------------------------------------------------------------------------------------------------------------------------
Total 1.0 MB/s | 8.6 MB 00:00:08
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : python36-3.6.6-1.el7.x86_64 1/2
Installing : python36-libs-3.6.6-1.el7.x86_64 2/2
Verifying : python36-libs-3.6.6-1.el7.x86_64 1/2
Verifying : python36-3.6.6-1.el7.x86_64 2/2

Installed:
python36.x86_64 0:3.6.6-1.el7

Dependency Installed:
python36-libs.x86_64 0:3.6.6-1.el7

Complete!
[gerald@localhost ~]$

You can verify the installation via python36 --version:

[gerald@localhost ~]$ python36 --version
Python 3.6.6
[gerald@localhost ~]$
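
The interpreter itself is available as python36, so one-liners and scripts can be invoked with that name directly, for example (the output below is simply what Python 3.6.6 prints for this one-liner):

[gerald@localhost ~]$ python36 -c 'import sys; print(sys.version_info)'
sys.version_info(major=3, minor=6, micro=6, releaselevel='final', serial=0)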

How to enable auto start/shutdown for Oracle Database 18c Express Edition on Linux

Oracle Database 18c Express Edition can be enabled for automatic startup and shutdown with the Linux operating system. This not only frees you from having to worry about starting and stopping the database, but it also ensures that the database is properly shut down before the machine is powered off. Enabling auto startup/shutdown is done via two simple commands:

  1. systemctl daemon-reload
  2. systemctl enable oracle-xe-18c

These commands have to be executed either as the root user or with root privileges:

[root@localhost ~]# systemctl daemon-reload
[root@localhost ~]# systemctl enable oracle-xe-18c
oracle-xe-18c.service is not a native service, redirecting to /sbin/chkconfig.
Executing /sbin/chkconfig oracle-xe-18c on
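
Once enabled, the same service name can also be used to start and stop the database manually when needed, for example (a sketch; the service name is the same one used for enabling above):

[root@localhost ~]# systemctl start oracle-xe-18c
[root@localhost ~]# systemctl stop oracle-xe-18c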

For more information about starting and stopping Oracle Database XE, have a look at the documentation.

How to install Oracle Database 18c XE on Linux

The new version of the free Oracle Database edition, Oracle Database 18c Express Edition, just got released for Linux 64-bit. Getting started is really simple on Oracle Linux, basically a three-step process of downloading the RPM file, installing it and then configuring the database. On other Red Hat-compatible Linux distributions you will also have to download the Oracle Database Preinstall RPM alongside the XE RPM file. Here is a quick guide on how to set up Oracle Database 18c XE; a short shell sketch of the commands follows the tl;dr lists below.

tl;dr

Oracle Linux

  1. Download the RPM file from Oracle Technology Network
  2. Run “yum -y localinstall oracle-database*18c*”
  3. Run “/etc/init.d/oracle-xe-18c configure”

Other Red Hat-compatible Linux distributions

  1. Download the RPM file from Oracle Technology Network
  2. Download the Oracle Database Preinstall RPM via “curl -o oracle-database-preinstall-18c-1.0-1.el7.x86_64.rpm https://yum.oracle.com/repo/OracleLinux/OL7/latest/x86_64/getPackage/oracle-database-preinstall-18c-1.0-1.el7.x86_64.rpm”
  3. Run “yum -y localinstall oracle-database*18c*”
  4. Run “/etc/init.d/oracle-xe-18c configure”
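
Spelled out as a shell session, the steps for a Red Hat-compatible distribution boil down to something like the following sketch (it assumes the XE RPM has already been downloaded into the current directory; on Oracle Linux the preinstall download can be skipped):

$ curl -o oracle-database-preinstall-18c-1.0-1.el7.x86_64.rpm https://yum.oracle.com/repo/OracleLinux/OL7/latest/x86_64/getPackage/oracle-database-preinstall-18c-1.0-1.el7.x86_64.rpm
$ sudo yum -y localinstall oracle-database*18c*
$ sudo /etc/init.d/oracle-xe-18c configure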
