BigBash It! - Convert SQL queries into bash scripts
Converts your SQL SELECT queries into autonomous Bash one-liners that can be executed on almost any
*nix device for quick analyses or to crunch gigabytes of CSV log files. Perfectly suited for big data tasks on your
local machine. The generated one-liners perform surprisingly well - thanks to
highly optimized Unix tools such as sort and (m)awk - and are often even faster than database queries for larger data sets.
BigBash is open source, and you can find more information, for instance details on the supported SQL syntax,
on the GitHub page.
1. Create a table
Every set of CSV files that you want to query with SELECT needs to be defined as a table,
just as in a normal SQL database. Create a table using a CREATE TABLE statement that mirrors the structure of your CSV file(s).
For example, if your CSV file "persons.csv" has 5 columns (id, name, street, city, country), use
CREATE TABLE persons (id, name, street, city, country);
You can add data types like INT or TEXT as well as the UNIQUE attribute (e.g. id INT UNIQUE), which can speed up the
resulting bash script and ensures correct sorting.
2. Map table to files
After creating a table, use the special MAP command to define the file(s) that should be
mapped to it. You can select a range of files using wildcard operators, or use an arbitrary
bash command whose output lines are used as table rows. For instance, to map a set of gzip-compressed CSV
files to our persons table, write
MAP persons TO 'persons_*.csv.gz' DELIMITER ',' TYPE 'GZ' REMOVEHEADER;
REMOVEHEADER denotes that the first line in every file is a header and should be ignored.
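Conceptually, this mapping behaves like a small bash reader that feeds rows into the query. A rough hand-written equivalent (only an illustration, not the code BigBash actually emits) would be:
# Decompress every matching file and drop its header line before handing the rows to the query.
for f in persons_*.csv.gz; do
  gzip -dc "$f" | tail -n +2
done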
3. The select query
Query the table using a standard SQL SELECT statement. For instance, to get the persons living in
Berlin, sorted by name, enter
SELECT name, street FROM persons WHERE city = 'Berlin' ORDER BY name;
4. Compile to a bash script
If you put these three lines into the editor and hit "Create Bash script", you get
a bash script that you can run directly on the command line to obtain the query result.
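The generated script is not reproduced here, but as a rough sketch of the kind of pipeline it boils down to (hand-written, not BigBash's exact output), the Berlin query could be expressed with standard tools as:
# Columns: id,name,street,city,country. Read all mapped files, filter on the city column,
# project name and street, and sort by name.
for f in persons_*.csv.gz; do gzip -dc "$f" | tail -n +2; done \
  | awk -F',' '$4 == "Berlin" { print $2 "," $3 }' \
  | sort -t',' -k1,1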
--This example uses the cab data set from 'http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml'
CREATE TABLE yellowCab (VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,
passenger_count INT,trip_distance REAL,pickup_longitude,pickup_latitude,RatecodeID,
store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount REAL,
extra,mta_tax REAL,tip_amount REAL,tolls_amount REAL,improvement_surcharge,total_amount REAL);
--We map the table to the first million lines of the cab trips from 2015/12, which we download on the fly.
--If you want to run more queries, download the data set first (~1 GB) and use the following map command
--MAP yellowCab TO 'yellow_tripdata_2015-12.csv' DELIMITER ',' REMOVEHEADER;
MAP yellowCab TO
'curl -s -N "https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2015-12.csv" | tail -n+2 | head -n 1000000'
DELIMITER ',' TYPE 'RAW';
--The select query that will be compiled to a bash one-liner: Get the average passenger count grouped by trip_distance
SELECT int(trip_distance + 0.5), SUM(passenger_count)/COUNT(*) FROM yellowCab
GROUP BY int(trip_distance + 0.5)
HAVING count(*) >= 100
ORDER BY int(trip_distance + 0.5);
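For comparison, a hand-rolled (m)awk version of this aggregation (illustrative only, not the script the compiler generates) could look like:
# Bucket trips by rounded distance ($5 = trip_distance, $4 = passenger_count),
# average the passenger count per bucket, keep buckets with at least 100 trips, sort by distance.
curl -s -N "https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2015-12.csv" \
  | tail -n+2 | head -n 1000000 \
  | awk -F',' '{ d = int($5 + 0.5); sum[d] += $4; cnt[d]++ }
      END { for (d in sum) if (cnt[d] >= 100) print d, sum[d] / cnt[d] }' \
  | sort -n -k1,1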
--This example uses the MovieLens ml-100k set from 'http://grouplens.org/datasets/movielens/'
CREATE TABLE ratings (userId INT,itemId INT,rating INT,timestamp);
CREATE TABLE movies (movieId UNIQUE, movieTitle, releaseDate, videoReleaseDate,IMDbURL,
genre_unknown, genre_Action, genre_Adventure, genre_Animation, genre_Children,
genre_Comedy, genre_Crime, genre_Documentary, genre_Drama, genre_Fantasy,
genre_FilmNoir, genre_Horror, genre_Musical, genre_Mystery, genre_Romance,
genre_SciFi, genre_Thriller, genre_War, genre_Western);
--We map the tables to files directly downloaded from GroupLens
MAP ratings TO 'curl -s -N "http://files.grouplens.org/datasets/movielens/ml-100k/u.data"'
DELIMITER '\t' TYPE 'RAW';
MAP movies TO 'curl -s -N "http://files.grouplens.org/datasets/movielens/ml-100k/u.item"'
DELIMITER '|' TYPE 'RAW';
--Calculate the avg. rating and output the joined movie titles
SELECT movieTitle, SUM(rating)/COUNT(*) FROM ratings
HASH JOIN movies ON movies.movieId = ratings.itemId
GROUP BY itemId
HAVING count(*) >= 10
ORDER BY SUM(rating)/COUNT(*) desc
LIMIT 20;
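The HASH JOIN suggests that the movies table is small enough to be held in memory. Under that assumption, a hand-written awk version of the join and aggregation (a sketch, not the generated one-liner) might be:
# Load movieId -> title from the '|'-separated u.item into an array, then aggregate the
# tab-separated ratings per item, keep items with at least 10 ratings and join the titles.
awk 'NR == FNR { split($0, m, "|"); title[m[1]] = m[2]; next }
     { split($0, r, "\t"); sum[r[2]] += r[3]; cnt[r[2]]++ }
     END { for (i in sum) if (cnt[i] >= 10)
             printf "%s\t%.3f\n", title[i], sum[i] / cnt[i] }' \
  <(curl -s -N "http://files.grouplens.org/datasets/movielens/ml-100k/u.item") \
  <(curl -s -N "http://files.grouplens.org/datasets/movielens/ml-100k/u.data") \
  | sort -t$'\t' -k2,2nr | head -n 20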
--This example uses the MovieLens ml-100k set from 'http://grouplens.org/datasets/movielens/'
CREATE TABLE ratings (userId INT,itemId INT,rating INT,timestamp);
CREATE TABLE movies (movieId UNIQUE, movieTitle, releaseDate, videoReleaseDate,IMDbURL,
genre_unknown, genre_Action, genre_Adventure, genre_Animation, genre_Children,
genre_Comedy, genre_Crime, genre_Documentary, genre_Drama, genre_Fantasy,
genre_FilmNoir, genre_Horror, genre_Musical, genre_Mystery, genre_Romance,
genre_SciFi, genre_Thriller, genre_War, genre_Western);
--We map the tables to files directly downloaded from GroupLens
MAP ratings TO 'curl -s -N "http://files.grouplens.org/datasets/movielens/ml-100k/u.data"'
DELIMITER '\t' TYPE 'RAW';
MAP movies TO 'curl -s -N "http://files.grouplens.org/datasets/movielens/ml-100k/u.item"'
DELIMITER '|' TYPE 'RAW';
--The select query that will be compiled to a bash one-liner:
--Self join the ratings to get the cooccurrence count
SELECT movies.movieTitle, movies2.movieTitle, COUNT(*) FROM ratings
HASH JOIN movies ON movies.movieId = ratings.itemId
JOIN ratings AS ratings2 ON ratings.userId = ratings2.userId
HASH JOIN movies AS movies2 ON movies2.movieId = ratings2.itemId
WHERE ratings.itemId > ratings2.itemId
GROUP BY ratings.itemId, ratings2.itemId
ORDER BY COUNT(*) desc
LIMIT 10;
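Conceptually, the self join collects every pair of items rated by the same user. A compact awk sketch of the same computation (quadratic in the number of ratings per user, and not BigBash's actual output) is:
# Remember each user's rated items, emit every ordered pair (itemId > itemId2),
# count the pairs and print the 10 most frequent ones with their joined titles.
awk 'NR == FNR { split($0, m, "|"); title[m[1]] = m[2]; next }
     { split($0, r, "\t"); items[r[1]] = items[r[1]] " " r[2] }
     END {
       for (u in items) {
         n = split(items[u], it, " ")
         for (i = 1; i <= n; i++)
           for (j = 1; j <= n; j++)
             if (it[i] + 0 > it[j] + 0) pair[it[i] "\t" it[j]]++
       }
       for (p in pair) { split(p, q, "\t"); print title[q[1]] "\t" title[q[2]] "\t" pair[p] }
     }' \
  <(curl -s -N "http://files.grouplens.org/datasets/movielens/ml-100k/u.item") \
  <(curl -s -N "http://files.grouplens.org/datasets/movielens/ml-100k/u.data") \
  | sort -t$'\t' -k3,3nr | head -n 10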
--Wikipedia page statistics. IMPORTANT: set the correct data types on the last two columns, otherwise the output will be wrong
CREATE TABLE Wikipages (project_name, page, nr_of_requests INT, page_size INT);
--Map the table to the pagecount file from 2016-06-24, 8am, which is downloaded on the fly (~85 MB)
MAP Wikipages TO
'curl -s -N https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-06/pagecounts-20160624-080000.gz | gzip -dc'
DELIMITER ' ' TYPE 'RAW';
--Select the top-10 pages from the English Wikipedia, excluding special pages (titles starting with 'Special:')
SELECT page, nr_of_requests FROM Wikipages WHERE project_name='en' and index(page, 'Special:') != 1
ORDER BY nr_of_requests DESC LIMIT 10;
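A hand-written equivalent of this query (again only a sketch of the kind of pipeline the compiler produces) is short:
# Columns: project_name page nr_of_requests page_size, separated by spaces. Keep English
# Wikipedia entries whose title does not start with 'Special:' and sort by request count.
curl -s -N "https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-06/pagecounts-20160624-080000.gz" \
  | gzip -dc \
  | awk '$1 == "en" && index($2, "Special:") != 1 { print $2 "\t" $3 }' \
  | sort -t$'\t' -k2,2nr | head -n 10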
--Create your tables using the standard create table command
CREATE TABLE movies (id INT UNIQUE, title TEXT, genres TEXT);
--Map every table to a file or even to an output pipe
MAP movies TO 'movies.dat.gz' DELIMITER '::' TYPE 'GZ' REMOVEHEADER;
--The select query that will be compiled to a bash one-liner.
--You can use joins and groups, but no subselects
SELECT title FROM movies ORDER BY title LIMIT 10;
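By hand, this trivial query would reduce to something like the following (illustrative, not the compiler's output):
# '::'-separated file: drop the header, take the title column, sort it and stop after 10 rows.
gzip -dc movies.dat.gz | tail -n +2 | awk -F'::' '{ print $2 }' | sort | head -n 10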
Disclaimer: In no event do we take any responsibility for any damages arising out of the use of the generated
queries (including but not limited to loss of data and inaccuracies of the results sustained by you or third parties).
Disclaimer 2: This is kind of a hack project and certainly full of bugs. Please open an issue
here if you encounter any.