BigBash It! - Convert SQL queries into bash scripts


Converts your SQL SELECT queries into a self-contained Bash one-liner that can be executed on almost any *nix device for quick analyses or for crunching gigabytes of log files in CSV format. Perfectly suited for big-data tasks on your local machine. The generated scripts perform surprisingly well - thanks to highly optimized Unix tools such as sort and (m)awk - and are often even faster than DB queries for larger data sets. BigBash is open source and you can find more information, for instance details on the supported SQL syntax, on the Github page.

1. Create a table

Every set of CSV files that you want to query with SELECT needs to be defined as a table, just as in a normal SQL database. Create a table using a CREATE TABLE statement that resembles the structure of your CSV file(s). For example, if your CSV file "persons.csv" has 5 columns (id, name, street, city, country), use
CREATE TABLE persons (id, name, street, city, country);
You can also add data types like INT or TEXT as well as the UNIQUE attribute, which can speed up the resulting bash script and ensures correct sorting.
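For instance, a typed variant of the persons table might look like the following sketch (assuming id holds unique integer values and all other columns are plain text):
CREATE TABLE persons (id INT UNIQUE, name TEXT, street TEXT, city TEXT, country TEXT);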

2. Map table to files

After creating a table, use the special MAP command to define the file(s) that should be mapped to it. It is possible to select a range of files using wildcard operators, or even an arbitrary bash command whose output lines are used as table rows. For instance, to map a set of gzip-compressed CSV files to our persons table, write
MAP persons TO 'persons_*.csv.gz' DELIMITER ',' TYPE 'GZ' REMOVEHEADER;
REMOVEHEADER denotes that the first line in every file is a header and should be ignored.
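For a single uncompressed CSV file the mapping can be simpler, for example (a sketch assuming the TYPE clause may be omitted for plain files and that the file has no header line; see the Github page for the exact defaults):
MAP persons TO 'persons.csv' DELIMITER ',';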

3. The select query

You query this table using a standard SQL SELECT statement. To get the persons living in Berlin, sorted by name, enter
SELECT name, street FROM persons WHERE city = 'Berlin' ORDER BY name;

4. Compile to a bash script

If you put the three statements above (CREATE TABLE, MAP, and the SELECT query) into the editor and hit "Create Bash script", you should get a bash script like the one below, which you can run directly on your command line to get the query result. It decompresses the files, strips the header lines, keeps only the rows whose city column equals "Berlin", sorts them by name, and prints the name and street columns:
(trap "kill 0" SIGINT; export LC_ALL=C; find persons_*.csv.gz -print0
| xargs -0 -i sh -c "gzip -dc {} | tail -n +2"|tr $',' $';'|cut -d $';' -f2,3,4
|awk -F ';' '($3 == "Berlin") {print}'|sort -t$';'  -k 1,1|awk -F ';' '{print $1";"$2}')

More

Look at the examples below for more complex queries with joins and groups. You can find more documentation on the Github page.
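For illustration, a grouped query on the persons table defined above could look like the sketch below (this assumes COUNT(*) is among the supported aggregate functions; check the Github page for the full list):
SELECT city, COUNT(*) FROM persons GROUP BY city ORDER BY city;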

SQL Input

--Create your tables using the standard create table command
CREATE TABLE movies (id INT UNIQUE, title TEXT, genres TEXT);
--Map every table to a file or even to an output pipe
MAP movies TO 'movies.dat.gz' DELIMITER '::' TYPE 'GZ' REMOVEHEADER;
--The select query that will be compiled to a bash one-liner.
--You can use joins, groups but no subselects
SELECT title FROM movies ORDER BY title LIMIT 10;

Important: Choose an output delimiter that is not contained in any column of the CSV files; otherwise this can lead to subtle (or not so subtle) errors in the output.

If disabled, aggregations (e.g. SUM over groups) are done in memory. This can lead to out-of-memory problems if the number of groups is very large.
If enabled, aggregations use an additional sorting step, which makes the query slower.


Disclaimer: In no event do we take any responsibility for any damages arising out of the use of the generated queries (including, but not limited to, loss of data and inaccuracies in the results sustained by you or third parties).
Disclaimer 2: This is kind of a hack project and for sure full of bugs. Please open an issue here if you encounter any.