Installing Hadoop/Hive on Ubuntu

If you are a Hive noob like me, this post may be of use to you. I wanted to install Hadoop and Hive on my Ubuntu box (a Dell), so that I can run Hive commands (hive -e "<query>") from Eclipse before I commit my Python scripts.
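For what it's worth, here is a minimal sketch of how a Python script might shell out to hive -e. The helper names (build_hive_command, run_hive_query) are made up for illustration, and it assumes the hive binary is on the PATH of whatever process Eclipse launches.

```python
import subprocess


def build_hive_command(query, hive_bin="hive"):
    """Return the argv list that runs one HiveQL string non-interactively."""
    # hive_bin is an assumption: pass a full path if hive is not on the PATH.
    return [hive_bin, "-e", query]


def run_hive_query(query, hive_bin="hive"):
    """Run the query via `hive -e` and return whatever Hive wrote to stdout."""
    result = subprocess.run(
        build_hive_command(query, hive_bin),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise CalledProcessError if hive exits with non-zero status
    )
    return result.stdout


# Example call (needs a working Hive install, so it is left commented out):
# print(run_hive_query("SHOW TABLES;"))
```

Passing the command as a list avoids having to escape quotes inside the query for the shell.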

Installing Hadoop and Hive is pretty straightforward, unless you are someone who always makes the wrong choices (like me!). I edited the wrong config file and spent considerable time figuring out this simple thing.

Installing Hadoop:

http://dmitrypukhov.pro/install-hadoop-on-ubuntu/

Installing Hive:

http://dmitrypukhov.pro/install-hive-on-ubuntu/

Following the above instructions, I was able to install Hive, apart from one step that failed with the missing Hive execution jar error described here:

http://wenda.baba.io/questions/5129058/missing-hive-execution-jar-usr-local-hadoop-hive-lib-hive-exec-jar.html

As mentioned in the above post, copying the lib directory from hive-0.12.0.tar.gz into the $HIVE_HOME directory (/opt/hive in my case) solved the problem.

*****

By default, Hive stores its metadata in a Derby database. It is apparently good enough, but strangely it writes a derby.log file and a metastore_db directory in whichever directory we start the Hive shell from. There might be ways to fix this, but I decided to get rid of Derby and use MySQL instead.

Configuring Hive with MySQL:

http://java.dzone.com/articles/how-configure-mysql-metastore

If you follow the instructions above, you should be fine. Just make sure you edit the correct hive-site.xml file.

On my box,

hduser@learningbox:~$ locate hive-site.xml
/etc/hive/conf.dist/hive-site.xml
/opt/hive/common/src/test/resources/hive-site.xml
/opt/hive/conf/hive-site.xml
/opt/hive/data/conf/hive-site.xml
/opt/hive/data/conf/tez/hive-site.xml
/opt/hive/hcatalog/conf/proto-hive-site.xml
/opt/hive/hcatalog/src/packages/templates/conf/hive-site.xml.template

As mentioned in the above blog post, editing /opt/hive/conf/hive-site.xml gets the job done. Unless you edit another file, like I did.
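A quick way to avoid my mistake is to check which of those files actually carries the metastore settings. The snippet below is only a rough sketch: the function names are my own, and it assumes the usual hive-site.xml layout of property elements with name and value children. javax.jdo.option.ConnectionURL is the property that holds the metastore JDBC URL.

```python
import xml.etree.ElementTree as ET

METASTORE_URL_KEY = "javax.jdo.option.ConnectionURL"


def properties_from_root(root):
    """Collect <property><name>/<value> pairs from a parsed hive-site.xml tree."""
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def metastore_url(path):
    """Return the JDBC URL configured in the file at `path`, or None if unset."""
    root = ET.parse(path).getroot()
    return properties_from_root(root).get(METASTORE_URL_KEY)


# Example (paths taken from the locate output; left commented out):
# for p in ["/opt/hive/conf/hive-site.xml", "/etc/hive/conf.dist/hive-site.xml"]:
#     print(p, metastore_url(p))
```

If the file you edited reports a jdbc:mysql URL but Hive still creates metastore_db in the current directory, Hive is most likely reading a different hive-site.xml.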

Other Notes:

JSON serde.

http://thornydev.blogspot.in/2013/07/querying-json-records-via-hive.html

https://github.com/rcongiu/Hive-JSON-Serde#start-of-content

Creating a table with the JSON SerDe and inserting into it turned out to be frustrating. The solution that worked was to create the table, move the data to HDFS, and add the partition.

I also faced this issue in Hive 0.13:
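Roughly, the three steps look like the sketch below. It only builds the commands as strings so they can be inspected or passed to hive -e. The table name, columns, partition column and paths in the example are hypothetical, and making the table EXTERNAL is my assumption here rather than a requirement. org.openx.data.jsonserde.JsonSerDe is the class name used by the Hive-JSON-Serde project linked above; its jar has to be added to the Hive session first.

```python
SERDE_CLASS = "org.openx.data.jsonserde.JsonSerDe"


def create_table_sql(table, columns, partition_col, location):
    """Step 1: DDL for a partitioned table whose rows are JSON records."""
    cols = ", ".join("{} {}".format(name, type_) for name, type_ in columns)
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS {t} ({c}) "
        "PARTITIONED BY ({p} STRING) "
        "ROW FORMAT SERDE '{s}' "
        "LOCATION '{l}'"
    ).format(t=table, c=cols, p=partition_col, s=SERDE_CLASS, l=location)


def hdfs_put_command(local_path, hdfs_dir):
    """Step 2: argv list that copies a local JSON file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def add_partition_sql(table, partition_col, value, location):
    """Step 3: tell Hive that the HDFS directory is a partition of the table."""
    return (
        "ALTER TABLE {t} ADD IF NOT EXISTS PARTITION ({p}='{v}') LOCATION '{l}'"
    ).format(t=table, p=partition_col, v=value, l=location)


# Hypothetical example values, for illustration only:
# create_table_sql("events", [("user_id", "STRING"), ("action", "STRING")],
#                  "dt", "/data/events")
# hdfs_put_command("events.json", "/data/events/dt=2014-11-01")
# add_partition_sql("events", "dt", "2014-11-01", "/data/events/dt=2014-11-01")
```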

https://issues.apache.org/jira/browse/HIVE-8538

Papernotes: Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm

Oops, I forgot to save my notes as a draft. I will come back to this later, since it seems to be a very important paper, and one I found hard to grasp.

These are the notes for the following legendary paper by Dawid and Skene.

http://www.cs.mcgill.ca/~jeromew/comp766/samples/Output_aggregation.pdf

****

Someone has implemented Dawid and Skene's example problem (patients), with excellent comments.
https://github.com/dallascard/dawid_skene/blob/master/dawid_skene.py
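For orientation, here is a much smaller pure-Python sketch of the EM idea, as I understand it. It is not taken from the linked code and it simplifies things: each observer gives at most one label per item, every item has at least one label, and a small smoothing constant (my addition) keeps the estimated error rates away from zero. The function name and the input layout are made up for this note.

```python
def dawid_skene(responses, n_classes, n_iter=50, smoothing=1e-6):
    """EM sketch. responses: {item: {observer: label}}, labels in range(n_classes).

    Returns (posteriors, prior, confusion) where posteriors[item][k] is the
    estimated probability that the item's true class is k.
    """
    items = sorted(responses)
    observers = sorted({o for r in responses.values() for o in r})

    # Initialise the posteriors with each item's vote fractions.
    post = {}
    for i in items:
        counts = [0.0] * n_classes
        for label in responses[i].values():
            counts[label] += 1.0
        total = sum(counts)  # assumes every item has at least one label
        post[i] = [c / total for c in counts]

    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per observer,
        # confusion[o][k][l] ~ P(observer o says l | true class is k).
        prior = [sum(post[i][k] for i in items) / len(items)
                 for k in range(n_classes)]
        confusion = {}
        for o in observers:
            m = [[smoothing] * n_classes for _ in range(n_classes)]
            for i in items:
                if o in responses[i]:
                    label = responses[i][o]
                    for k in range(n_classes):
                        m[k][label] += post[i][k]
            confusion[o] = [[v / sum(row) for v in row] for row in m]

        # E-step: recompute the posterior over each item's true class.
        for i in items:
            p = list(prior)
            for o, label in responses[i].items():
                for k in range(n_classes):
                    p[k] *= confusion[o][k][label]
            z = sum(p)
            post[i] = [v / z for v in p]

    return post, prior, confusion
```

The confusion matrices are the "observer error-rates" from the paper's title: the off-diagonal entries say how often an observer reports one class when the true class is another.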

Another implementation is available on PyPI.
https://pypi.python.org/pypi/pyanno/2.0.2

And Dawid-Skene (DS) is so useful that someone tried to offer it as a service! It doesn't seem to be working now, though.
https://github.com/ipeirotis/Get-Another-Label/wiki
