If you’re like me, a pitiful soul who tried to learn the basics of Hadoop Hive online but only found a horde of vague blog posts, then this article is for you.

For the benefit of the visual learner, this article highlights the essentials of data copying, table creation, and partitioning in Hive, with a few screenshots.

Make no mistake, I write on behalf of learners who have no background in data analytics and almost no coding experience. This is for those brave souls who navigated the treacherous waters of SQL for dummies and survived, only to face the level boss of HQL (Hive’s query language) in the lonely command terminal.

Hive can be intimidating. But it’s really just a fuzzy, loveable, open-source system that processes structured data in Hadoop. Hive facilitates reading, writing, and managing large datasets residing in distributed storage using a SQL-like language. It supports multiple file formats, is highly scalable, and can be extremely fast if you create partitions that serve your data modelling requirements.

To get started, I’ll be using EMR on EC2 as the environment to load the dataset. If you’re unfamiliar with setting up this environment, you’ll need to read the documentation on the AWS website first. Once you’ve set up the cluster, create your unique key pair and log on to the EMR master node via SSH from a bash terminal if you’re using a Mac. After that you’ll see the EMR screen as below.

emr welcome prompt
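If it helps, the SSH command follows the usual EMR pattern; the key file name and master node address below are placeholders for your own values, not the ones from my cluster.

# connect to the EMR master node (key file and hostname are placeholders)
ssh -i ~/my-emr-key.pem hadoop@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com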

Now we’re ready to copy our dataset to HDFS using distcp, or distributed copy, which is used for copying data between clusters.

It expands a list of files and directories into map tasks, and each map task copies a partition of the files specified in the source list. This is especially useful for larger datasets. I’ve used a specific version of the syntax, shown below.

hadoop distcp 's3n://sourcepath' '/user/hive/warehouse/yourdatabase.db'

It’s a good idea at this point to check whether your dataset copied properly. In this case I’ve copied two large csv files, so I’ve used the command hadoop fs -ls /user/hive/warehouse/merrill.db, which tells me that the two files are indeed where I need them to be.

check if the files have been copied correctly
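If you’re not following along with the screenshots, the check itself is just the one command below; each line of output shows the permissions, owner, size, and path of a file in the directory.

# confirm the csv files landed in the database directory on HDFS
hadoop fs -ls /user/hive/warehouse/merrill.db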

Now I need to create the structure of my data. I’ve created a table with the OpenCSVSerde and the schema of the desired source. If you understand the original table structure of your source, this part is easy enough. In my case, the table creation syntax looks like this.

CREATE EXTERNAL TABLE IF NOT EXISTS merrill_A(
  `event_time` int,
  `event_type` string,
  `product_id` string,
  `category_id` string,
  `category_code` string,
  `brand` string,
  `price` int,
  `user_id` int,
  `user_session` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/merrill.db/'
TBLPROPERTIES ("skip.header.line.count"="1");

Do notice that the last line of the syntax skips the header line, which is redundant in the new structured table A. This statement creates a single table over both csv files. And just to make things more complicated, I’ve created two tables: one with no partitions, titled ‘merrill_A’, and a partitioned version named ‘merrill_B’. Please ignore my blatant narcissism and lack of originality in using my name as the table name. This is just an academic project for the purpose of illustration. In a real-life scenario your nomenclature should always reflect functionality specific to your work environment.
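To make sure the table actually picked up the schema and can see the data, a quick sanity check along these lines works. This isn’t part of the original walkthrough, just the standard Hive commands:

-- confirm the table exists and inspect its columns and location
SHOW TABLES;
DESCRIBE FORMATTED merrill_A;
-- confirm Hive can actually read rows from the underlying csv files
SELECT COUNT(*) FROM merrill_A;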

Now, if you’re a rational person who converts oxygen to carbon dioxide, you might be wondering why I’m creating a second table with identical data. The answer, of course, is to compare query times between a partitioned and a non-partitioned table! The second table is created much like the first, but with partitioning and clustering specified.

CREATE EXTERNAL TABLE IF NOT EXISTS merrill_B(
  `event_time` int,
  `product_id` string,
  `category_id` string,
  `category_code` string,
  `brand` string,
  `price` int,
  `user_id` int,
  `user_session` string)
PARTITIONED BY (event_month string, event_type string)
CLUSTERED BY (brand) INTO 10 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
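One practical note before loading data into this table: inserts that fill partition columns dynamically (as I do below) and write into buckets typically need a few session properties switched on first. These are standard Hive settings rather than anything specific to this project:

-- allow partition values to be taken from the query output
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- older Hive versions also need bucketing enforcement enabled explicitly
SET hive.enforce.bucketing = true;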

Most of the queries I’m working with here categorise event_time into the months of October and November. A few queries also require frequent GROUP BY operations on the brand column. Since the cardinality of event_month and event_type is low, I have performed a two-level partitioning on these columns as above. Then I’ve copied the data from the unpartitioned table A to the partitioned table B with the syntax below.

INSERT INTO TABLE merrill_B PARTITION (event_type)
SELECT event_time, product_id, category_id, category_code, brand, price,
  user_id, user_session, event_type
FROM merrill_A;
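One caveat worth flagging: merrill_B declares two partition columns (event_month and event_type), so a strict dynamic-partition insert would need both of them in the PARTITION clause and at the end of the SELECT list. A rough sketch of what that could look like, assuming event_month can be derived from event_time (the from_unixtime/month derivation below is hypothetical and depends on how event_time is actually stored):

-- hypothetical sketch: populate both partition columns dynamically
INSERT INTO TABLE merrill_B PARTITION (event_month, event_type)
SELECT event_time, product_id, category_id, category_code, brand, price,
  user_id, user_session,
  month(from_unixtime(event_time)) AS event_month,  -- assumes event_time is a unix timestamp
  event_type
FROM merrill_A;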

After your tables are loaded, you can query the top 5 rows of each table with a basic SELECT clause and check which table responds faster. In most cases you’ll find that the partitioned table makes short work of the query. I’ve shown snapshots of my sample query below; the last line shows the time taken for the process.

table a execution time
table b execution time
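A comparison along these lines makes the difference visible: run the same filtered query against both tables, and table B only has to read the matching partition while table A scans everything. The filter value here is illustrative, not from my original run:

-- table A: no partitions, so the whole dataset is scanned
SELECT * FROM merrill_A WHERE event_type = 'purchase' LIMIT 5;

-- table B: event_type is a partition column, so only that partition is read
SELECT * FROM merrill_B WHERE event_type = 'purchase' LIMIT 5;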

And that’s that! We’re good to go. If the partitioned table doesn’t perform optimally, we can always redo the structure and quite simply start over. Wouldn’t that be fun?

However, the important things to understand are the frequency of the queries being run, the original cardinality of your data, and your overall comfort with Hive.
