Keep on Learning

The IBM University Relations team runs a number of programs, and I've been quite happy to contribute. In 2019, I contributed to IBM Hackathons and Global Remote Mentoring, and was pleasantly surprised when they even invited me to a year-end event and recognized my contribution.

IBM Hackathon 2019 at IIM Ahmedabad, India

So recently, when I learned about other initiatives like Academic Ambassadors and Keep On Learning, I wanted to explore them too. Pretty much immediately, they approached me to see if I could give guest lectures on Design and Analysis of Algorithms.

Given that I work on graphs these days, I suggested a lecture on the Design and Analysis of Graph Algorithms instead. That was fine with them, and in no time they had reached out to partner colleges.

We were delighted when faculty at RVCE, Bengaluru and NIE, Mysuru invited me to give guest lectures to their students, most of whom are learning from home because of the ongoing COVID-19 pandemic. This was quite in line with the spirit of Keep On Learning, even during these stressful times.

***

I was first introduced to algorithms in the first year of my undergraduate program at the University of Madras, using How to Solve It by Computer by R. G. Dromey. Then we formally learned Design and Analysis of Algorithms from the legendary CLRS text.

When I went to grad school at The University of Arizona, I excitedly took the CS545 Design and Analysis of Algorithms course, which was then offered by Prof. Stephen Koborou. And it was one of my biggest mistakes in grad school!!

I'm kidding. But the course was brutal, and Prof. Koborou wanted us to learn more advanced graph algorithms. He made the first six chapters of CLRS class reading and went straight to the later chapters!!

Later on, even after I graduated, I've kept returning to algorithms courses taught by Prof. Steven Skiena, Prof. Robert Sedgewick, and Prof. Tim Roughgarden. More recently, for my work on Knowledge Graphs, I've been closely following the work of Prof. Jure Leskovec.

For my lectures I borrowed heavily from the slides of Prof. Roughgarden and Prof. Leskovec.

***

At IBM Research, I have been working on Knowledge Graphs for almost 4 years now, though I've spent more time on Information Extraction for Automatic Knowledge Base Construction. In particular, I work on constructing person knowledge graphs that contain PI/SPI (personal data) of people. You can find our work at [1], [2], [3], [4], [5], [6].

So for the lectures, I decided to present on the topic of “Link Prediction in the Real World”, drawing a few examples from our most recent work. Much of the lecture material was from Algorithms Illuminated by Prof. Roughgarden and these slides on Graph Neural Networks by Prof. Leskovec.

Guest Lecture at RVCE, Bengaluru on May 26, 2020
Guest Lecture at NIE, Mysuru on May 29, 2020

Delivering these lectures was a really good experience! The last time I gave a lecture was in January 2018. Preparing for the lectures reminded me of grad school! And I hope my lectures were useful to the students who want to Keep On Learning. You can find the slides from the lectures at the below location.

Papernotes: Ruder - An overview of Multi-Task Learning in Deep Neural Networks.

Following are my notes from reading this paper by Sebastian Ruder.

By sharing representations between related tasks, we can enable our model to generalize better on our original task.

If you find yourself optimizing more than one loss function, you are effectively doing multi-task learning.

“MTL improves generalization by leveraging the domain specific information contained in the training signals of related tasks”.

An inductive bias is provided by auxiliary tasks, which cause the model to prefer hypotheses that explain more than one task.

Hard or soft parameter sharing of hidden layers.

Hard parameter sharing

— sharing the hidden layers between all tasks.

— reduces the risk of overfitting.

Image Source: Sebastian Ruder

Soft parameter sharing

— each task has its own model with its own parameters.

 — The distance between the parameters of the model is then regularized to encourage the parameters to be similar.
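To make hard sharing concrete, here is a minimal PyTorch sketch; the layer sizes, task heads, and unweighted loss sum are illustrative choices of mine, not something from the paper.

import torch
import torch.nn as nn

class HardSharingModel(nn.Module):
    """Hard parameter sharing: one shared trunk, one output head per task."""
    def __init__(self, in_dim=128, hidden=64, n_classes_a=3, n_classes_b=2):
        super().__init__()
        # Hidden layers shared between all tasks
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Task-specific output layers
        self.head_a = nn.Linear(hidden, n_classes_a)
        self.head_b = nn.Linear(hidden, n_classes_b)

    def forward(self, x):
        h = self.shared(x)
        return self.head_a(h), self.head_b(h)

# Training optimizes a (possibly weighted) sum of the per-task losses
model = HardSharingModel()
criterion = nn.CrossEntropyLoss()
x = torch.randn(8, 128)
y_a = torch.randint(0, 3, (8,))
y_b = torch.randint(0, 2, (8,))
out_a, out_b = model(x)
loss = criterion(out_a, y_a) + criterion(out_b, y_b)
loss.backward()

Soft sharing would instead keep a separate trunk per task and add a regularizer (e.g., the L2 distance between the two trunks' parameters) to the combined loss.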

Why does MTL work?

 — increases the sample size for training the model.

 — averages the noise patterns over multiple tasks.

 — focuses attention on features that matter (how?)

 — allows the model to eavesdrop on another task, to better learn features for the original task.

 — representation bias.


My updates:

One of our interns is trying hard parameter sharing for an NLP task that I have been working on since the start of the year. We hope to submit our work to a conference later in the year.

Papernotes: Towards Re-defining Relation Understanding in Financial Domain

This paper is about the authors’ submission to the FEIII 2017 data challenge.

https://ir.nist.gov/feiii/2017-challenge.html

There are two objectives to the challenge:

(a) validate financial relationships and (b) identify interesting and relevant knowledge about financial entities.

The paper is here.

The approach is to classify text as Highly Relevant (or not) to the financial domain, using an ensemble of three classifiers with AdaBoost.

Two domain features are used:

1. Financial Vocabulary features.

2. Financial Relationship Pattern features.
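As a rough sketch of this kind of pipeline (not the authors' setup: the tiny corpus, TF-IDF features, and classifier configuration below are illustrative stand-ins for their financial vocabulary and relationship-pattern features):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus; labels mark text that is "highly relevant" to the financial domain
docs = [
    "The bank reported quarterly earnings and loan growth.",
    "Our subsidiary acquired a controlling stake in the trust.",
    "The weather was pleasant during the annual picnic.",
    "He painted the fence over the weekend.",
]
labels = [1, 1, 0, 0]

# TF-IDF stands in for the hand-crafted domain features used in the paper
clf = make_pipeline(TfidfVectorizer(), AdaBoostClassifier(n_estimators=50))
clf.fit(docs, labels)
print(clf.predict(["The holding company guarantees the notes."]))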

Err, I cannot read this paper any further. I had implemented nearly the same idea last year, and failed to publish it. Damn.

Papernotes: Document Summarization for Answering Non-Factoid Queries

Paper url: http://marksanderson.org/publications/my_papers/TKDE2017a.pdf

  1. Good abandonment – where users leave the search engine without reading any webpage, because the answer is provided in the SERP.
  2. Non-factoid queries are more frequently asked on the web.
  3. Past work – provide passage-level answers to non-factoid queries.
  4. Summarization could be better – because answers might be in different sentences scattered across the underlying document.
  5. Answer-biased summary – extracting a summary from each retrieved document that is expected to contain answers to a non-factoid query.
  6. Designed to hint at the whereabouts of likely answers.
  7. Using Community Question Answering (CQA) content to guide the extraction of answer-biased summaries.
  8. Why bother if CQA is present? (a) better summaries than CQA answers, (b) even imperfect CQA answers can help find summaries, and (c) a learning-to-rank based model helps extract summaries even where CQA answers are not available.
  9. Contributions:
    1. Novel use of CQA content in a summarization algorithm for locating answer-bearing sentences in the document.
    2. Three optimization-based methods and a learning-to-rank based method for answering non-factoid queries.
    3. Analysis of the effect of CQA quality on such methods.

The paper then goes on to propose three optimization-based methods, solved using CPLEX. There is some discussion of the learning-to-rank model to use when CQA answers are not available.
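For intuition only, here is a much-simplified Python sketch of the answer-biased idea: score each document sentence by its similarity to the query and to a CQA answer and keep the top few. This greedy scoring is my stand-in for the paper's CPLEX-based optimization, and all the names below are illustrative.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def answer_biased_summary(doc_sentences, query, cqa_answer, k=2, alpha=0.5):
    # Fit a shared vocabulary over the document, the query, and the CQA answer
    vec = TfidfVectorizer().fit(doc_sentences + [query, cqa_answer])
    S = vec.transform(doc_sentences).toarray()
    q = vec.transform([query]).toarray()[0]
    a = vec.transform([cqa_answer]).toarray()[0]

    def cos(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom else 0.0

    # Blend similarity to the query and to the CQA answer, then keep top-k
    scores = [alpha * cos(s, q) + (1 - alpha) * cos(s, a) for s in S]
    top = sorted(np.argsort(scores)[::-1][:k])
    return [doc_sentences[i] for i in top]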

***

Interestingly, there is no mention of Knowledge Graphs being used, though query expansion could be done using KGs.

***

I have been trying to use FAQs, which are not too different from CQA content. So overall, a very interesting paper.

Some thoughts on Software Engineering

Picture credit: https://twitter.com/noluckmurphy/status/645701882854309890

I wrote down some quick thoughts on Software Engineering, to make some points at work! It took only half an hour to write, so these may not be very well-formed opinions.

1. Separation of people and code!

It's never a good idea to identify a task or module with a person. There may be someone who started working on a module, but once it is checked into GitHub, the code belongs to the team. Everyone should feel free to improve and run any module.

One easy way to make this transition is to stop personalizing code. Instead of saying, “give this JSON to person A”, we can say, “module X should consume this JSON”.

This separation of people and code means no one person is burdened with a single task. The team will not be blocked by one person's availability and workload. Besides, the best way to learn new things is to run and improve others' code.

2. Diversity of skills

Irrespective of one's career goals, diversity of skills is important. From ML to annotations to UI to soft skills (presentations, talks), everyone should try everything. Being stuck with one module and doing the same thing again and again doesn't help.

  • If one wants to be a Researcher, being able to whip up a UI might make one's work get noticed. So don't just learn ML, learn UI too.
  • If one wants to be a senior Engineer, he/she is expected to lead the whole team, not just one module.
  • If one wants to be a Manager, one should know enough about each module, to evaluate options.

3. Engineers are supposed to be lazy.

We should never manually do something that can be programmed. We should never code anything without designing it first. UI, being a tremendous time sink, should never be coded without first drawing it on paper or a blackboard. Computer Science is very much about optimization, and optimizing our own time is paramount.

And using a Unix machine (Linux/Mac) and mastering git/vim/sed/awk saves so much time that, often, the only difference between a newbie and an experienced engineer is a working knowledge of git and vim!

4. But beware of premature and unnecessary optimization.

Premature optimization delays projects. We may not need an API when a Flask app is enough. And some kinds of optimization are never needed: for example, we should never need hot-swap in a research project. One-time tasks need not be programmed.

5. Say No.

Saying “No!” is a very important responsibility. No one can do anything “in parallel”. Engineering and research involve thinking and focus. It's always better to work on the one task at hand, take it to the next checkpoint, and then take up other things. When engineers say “No” to task n+1 because they already have n tasks in their priority queue, it helps the team set the right expectations. Work can be divided more fairly. Deadlines won't be missed.

6. Feature creep kills projects.

Features should be decided early on in the project. Once a deadline is set and work has started, the team should ruthlessly refuse to take up any more features. A well-made working version of a project is always better than a composition of poorly developed features. Refer to the picture above.

Building govpedia.in

Let's begin!

Consuming tweet stream:

We need to consume public streams from Twitter using their Streaming API.

https://dev.twitter.com/streaming/overview

Twitter has provided an HTTP client to listen to the Streaming API.

https://github.com/twitter/hbc

The SampleStreamExample is good enough for a first cut. I just added a properties file instead of giving the credentials on the command line.

https://github.com/twitter/hbc/tree/master/hbc-example/src/main/java/com/twitter/hbc/example

Now we have to parse the JSON streamed by the above code. Or, it turns out, we can index JSON docs directly in elasticsearch.

Indexing tweets in elasticsearch:

Install elasticsearch!

https://intercityup.com/blog/installing-elasticsearch-mac-os-x-10-9-mavericks-development.html

Preliminary tests in elasticsearch.

http://www.elasticsearchtutorial.com/elasticsearch-in-5-minutes.html

Java client for elasticsearch:

There seem to be a number of Java clients available to talk to elasticsearch. Though the native client is highly recommended, on first look I like the Jest client.

https://www.elastic.co/blog/found-java-clients-for-elasticsearch

Jest client: https://github.com/searchbox-io/Jest/tree/master/jest
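As a quick sanity check of the "index JSON directly" idea before wiring up a Java client, one can simply hit the REST API. A rough Python sketch; the tweet fields are made up, and the endpoint naming varies across Elasticsearch versions (older releases use /index/type/id, newer ones /index/_doc/id).

import json
import requests

# Illustrative tweet; a real one arrives from the hbc stream as a JSON string
tweet = {"id_str": "123456789", "text": "hello world", "lang": "en"}

# PUT the document into the "tweets" index, keyed by the tweet id
resp = requests.put(
    "http://localhost:9200/tweets/_doc/" + tweet["id_str"],
    data=json.dumps(tweet),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.json())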

Papernotes: Understanding tourist behavior using large-scale mobile sensing approach: A case study of mobile phone users in Japan

Notes from reading the below paper:

http://www.sciencedirect.com/science/article/pii/S1574119214001321

1. Analyzed GPS location traces of 130,861 mobile phone users in Japan collected for one year.

2. To reduce battery consumption, an accelerometer was used to detect periods of relative stasis during which power-consuming GPS acquisition functions can be suspended.

3. we selected the 130,861 subjects whose GPS locations were observed at least 350 days out of 365 days in 2012 (95%).

4. The first step was to identify stops, where a stop is a collection of recorded GPS locations in close proximity.

If $X_u = \{x_{t_1}, x_{t_2}, \ldots, x_{t_i}, \ldots\}$ denotes the set of GPS locations of user $u$, where $x_{t_i}$ is the location at time $t_i$, then our experimental results suggested that we group $x_{t_i}, x_{t_{i+1}}, x_{t_{i+2}}, \ldots, x_{t_m}$ that are within 196 m and satisfy $t_m - t_i \le 14$ min as a stop.
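A rough Python sketch of that grouping rule, under one reading of it (the haversine helper and the single-pass grouping are mine, not the paper's implementation):

from math import radians, sin, cos, asin, sqrt

def haversine_m(p, q):
    # Great-circle distance in metres between (lat, lon) points
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def detect_stops(points, max_dist_m=196, max_dur_s=14 * 60):
    # points: time-ordered list of (timestamp_seconds, lat, lon) tuples.
    # Group consecutive fixes that stay within 196 m of the first fix and
    # span at most 14 minutes; runs of more than one fix become a stop.
    stops, i = [], 0
    while i < len(points):
        j = i + 1
        while (j < len(points)
               and haversine_m(points[i][1:], points[j][1:]) <= max_dist_m
               and points[j][0] - points[i][0] <= max_dur_s):
            j += 1
        if j - i > 1:
            stops.append(points[i:j])
        i = j
    return stops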

5. The second step was the spatial clustering of stops. The centroid of each cluster was considered a significant place (e.g., home, workplace, other). DBSCAN (Density-based spatial clustering of applications with noise) had the best performance.
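A minimal scikit-learn sketch of that second step; the eps/min_samples values and random stop centroids are placeholders, not the paper's tuned settings.

import numpy as np
from sklearn.cluster import DBSCAN

# Stop centroids (lat, lon), e.g. one per stop detected in the previous step
stop_centroids = np.random.rand(50, 2)

db = DBSCAN(eps=0.05, min_samples=3).fit(stop_centroids)

# One centroid per cluster becomes a candidate "significant place";
# label -1 marks noise points that belong to no cluster
places = [stop_centroids[db.labels_ == c].mean(axis=0)
          for c in set(db.labels_) if c != -1]
print(len(places), "significant places")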

6. For validation, we developed a tool that allowed the tool user to label significant places after observing clusters of stops. With this tool, we annotated our data with the home and workplace locations of 400 subjects, and used this as ground truth in our validation.

7. The last step was to classify significant places as home or workplace. The hand-labeled ground truth was used for this task, and we found that Random Forest had the best performance when compared with k-nearest neighbors and a naïve Bayes classifier (10-fold cross-validation was used), using the following 10 features:

– Cluster ranking: top-ranked clusters can be indicative of home and workplace locations.
– Portion of stops in cluster: to some extent, this suggests the importance of places, because people tend to visit important places such as home and work more frequently than others.
– Hours of stops: the portion of the hours of the day in which clustered stops appeared. For example, if stops were observed from 9 am to 4 pm (throughout the year), this feature would be 8/24.
– Days of stops: the number of days on which clustered stops were observed.
– Inactive hours: for each subject, an inactive period was defined as the hours in which the number of GPS locations is less than the average for at least three consecutive hours. The inactive-hours feature is the portion of clustered stops that fall into the inactive period.
– Day-hour stops: the portion of day hours (9 am–6 pm) in which stops were observed.
– Night-hour stops: the portion of night hours (10 pm–6 am) in which stops were observed.
– Max stop duration: maximum value of stop duration.
– Min stop duration: minimum value of stop duration.
– Avg. stop duration: average value of stop duration.
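A sketch of that classification step with scikit-learn; the random feature matrix and labels below are placeholders for the 400 hand-labeled subjects and the 10 features above.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Columns stand in for the 10 features listed above (cluster ranking,
# portion of stops, hours/days of stops, inactive hours, day/night-hour
# stops, max/min/avg stop duration); values are random placeholders.
rng = np.random.default_rng(0)
X = rng.random((400, 10))
y = rng.integers(0, 2, 400)   # 0 = home, 1 = workplace

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)   # 10-fold CV, as in the paper
print(scores.mean())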

8. To further validate our home location estimation, we compared our results against the census data and observed that the estimated population density based on our home location estimation was comparable (R² = 0.966) with the city population density information obtained from the 2006 Japanese Census.

9. After obtaining a firm estimation of home and workplace locations, we were able to identify trips that were commuting (between home and workplace) as well as non-commuting. A commuting trip is defined as a trip where at least one stop appears at a workplace. A non-commuting trip is defined as a trip where none of the stops appear at a workplace.

10. We defined a touristic stop as a stop that is either within 200 m from a touristic destination location or within the polygon area covered by a touristic destination. We used the touristic destinations information provided by the Ministry of Land, Infrastructure and Transport of Japan (MLIT).

11. We were interested in trip flows—the number of touristic trips made to and from different prefectures in Japan, time spent at destination, modes of transportation used by the tourists, and correlations between personal mobility and touristic travel behavior.

12. We calculated the time spent at a destination simply as the total time from arrival at the first destination of a trip to departure from its last destination.

13. We used a framework for identifying modes of transportation used by mobile phone users based on their GPS locations. Our framework basically looked at the GPS traces along with detected stops, and defined a segment as a series of GPS locations between adjacent stops. These segments were then classified into walking and non-walking segments based on the rate of change in velocity and train line proximity.

14. The non-walking segments are then classified into two modes of transportation, car and train, based on Random Forest classification.

15. The remaining analyses cover personal mobility vs. tourist travel behavior, and similarity in travel behavior.

k-means clustering – useful links

An easy-to-understand/explain toy data set, though one may not want to code this in C# from scratch, as the author does.

https://msdn.microsoft.com/en-us/magazine/jj891054.aspx

some ready to use functions (on random data)

https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/

loading data from file

X = np.loadtxt(filename, delimiter='\t', dtype=np.float64)

plotting using matplotlib

http://matplotlib.org/examples/shapes_and_collections/scatter_demo.html

choosing colors using itertools.

http://stackoverflow.com/a/12236808

another example on visualizing the clusters.

http://www.dummies.com/how-to/content/how-to-visualize-the-clusters-in-a-kmeans-unsuperv.html
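Pulling the links above together, a small end-to-end sketch with scikit-learn and matplotlib; random 2-D data stands in for the tab-delimited file (swap in the np.loadtxt call from above).

import itertools

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Random 2-D points; replace with np.loadtxt(filename, delimiter='\t', dtype=np.float64)
X = np.random.rand(300, 2)

k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Cycle through a few colors, one per cluster, using itertools
colors = itertools.cycle("rgbcmy")
for label, color in zip(range(k), colors):
    pts = X[km.labels_ == label]
    plt.scatter(pts[:, 0], pts[:, 1], c=color, s=10)

# Mark the cluster centers
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], c="k", marker="x", s=80)
plt.show()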

Lists and Maps in Hive GenericUDF

I have not been posting to this blog regularly. So I'm going to post some Hive GenericUDF links that are currently open in my browser.

A few months ago, I started off with GenericUDF without bothering with plain UDFs. Now that I write GenericUDFs with non-primitive types, I'm hoping this is helping.

http://www.baynote.com/2012/11/a-word-from-the-engineers/

https://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSortArray.java

I seem to encounter all kinds of exceptions. Things related to “kryo” are all Greek to me. I'm catching them all with a default Exception handler. Hope to get better at this.

Installing hadoop/hive on ubuntu

If you are a Hive noob like me, this post may be of use to you. I just wanted to install Hadoop/Hive on my Ubuntu (Dell) box, so that I can run Hive (hive -e "") commands from Eclipse before I commit the Python scripts.

Installing Hadoop and Hive seems to be pretty straightforward. Except if you are someone who always makes the wrong choices (like me!). I edited the wrong config file and had to spend considerable time figuring this simple thing out.

Installing Hadoop:

http://dmitrypukhov.pro/install-hadoop-on-ubuntu/

Installing Hive:

http://dmitrypukhov.pro/install-hive-on-ubuntu/

Following the above instructions, I was able to install Hive, except for the following step.

http://wenda.baba.io/questions/5129058/missing-hive-execution-jar-usr-local-hadoop-hive-lib-hive-exec-jar.html

As mentioned in the above post, copying the lib directory from hive-0.12.0.tar.gz to the $HIVE_HOME directory (/opt/hive in my case) solved the problem.

*****

Hive, by default, stores its metadata in a Derby database. That's apparently good enough, but it strangely writes a derby.log file and a metastore_db directory in whichever directory we start the Hive shell from. There might be ways to fix this, but I decided to get rid of Derby and use MySQL instead.

Configuring Hive with MySQL:

http://java.dzone.com/articles/how-configure-mysql-metastore

If you follow the instructions above, you should be fine. Just make sure you edit the correct hive-site.xml file.

On my box,

hduser@learningbox:~$ locate hive-site.xml

/etc/hive/conf.dist/hive-site.xml

/opt/hive/common/src/test/resources/hive-site.xml

/opt/hive/conf/hive-site.xml

/opt/hive/data/conf/hive-site.xml

/opt/hive/data/conf/tez/hive-site.xml

/opt/hive/hcatalog/conf/proto-hive-site.xml

/opt/hive/hcatalog/src/packages/templates/conf/hive-site.xml.template

As mentioned in the above blog post, editing /opt/hive/conf/hive-site.xml gets the job done. Except if you edited another file, like I did.

Other Notes:

JSON serde.

http://thornydev.blogspot.in/2013/07/querying-json-records-via-hive.html

https://github.com/rcongiu/Hive-JSON-Serde#start-of-content

Creating a table with the JSON SerDe and inserting into it turned out to be frustrating. The solution that worked was to create the table, move the data to HDFS, and add the partition.

I also faced this issue in hive 0.13.

https://issues.apache.org/jira/browse/HIVE-8538
