Wednesday, August 9, 2017

Boxing the Integer

It was late one night, and I was solving a problem in Java. My code suffered from one of those edge-case failures that make you doubt reality, and for a moment I wondered whether I had dozed off while debugging and all of this was a dream.

After some println("here") println("here1")... println("here9999"), I finally reached that line.

Basically,
int x = 4444;
List<Integer> list = new ArrayList<Integer>();
list.add(x);
list.add(x);
println(list.get(0) == list.get(1));  // PRINTS FALSE

Yes it does!

So I checked,
Integer x = 4444;
Integer y = 4444;
println(x == y); // again returns false

Okay...
Integer x = 10;
Integer y = 10;
println(x == y); // TRUE

Integer x = 100;
Integer y = 100;
println(x == y); // TRUE

Integer x = 200;
Integer y = 200;
println(x == y); // FALSE (???)

Integer x = 150;
Integer y = 150;
println(x == y); // FALSE(???)


Integer x = 127;
Integer y = 127;
println(x == y); // TRUE


Integer x = 128;
Integer y = 128;
println(x == y); // FALSE (finally, not a dream!)

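My first mistake was thinking in decimal: the boundary sits between 127 and 128, the upper edge of a signed byte. The second, more important one was forgetting that Integer is also a class, so == compares object references rather than values (well, not forgetting exactly).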

Now I knew what to google and found this:
https://stackoverflow.com/questions/1700081/why-does-128-128-return-false-but-127-127-return-true-when-converting-to-integ
which also quotes this
http://docs.oracle.com/javase/specs/jls/se7/html/jls-5.html#jls-5.1.7

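javac compiles autoboxing into a call to Integer.valueOf(), which caches and reuses Integer objects for small values (-128 to 127 by default). The JLS section above guarantees that boxing any int in that range always yields the same reference. Outside it, each boxing may produce a new object, so == can return false. The range is mandated by the language specification, and the cache itself exists to save time and memory on frequently used values.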

Bottom line: compare values, not references. Use list.get(0).intValue() == list.get(1).intValue(), or list.get(0).equals(list.get(1)).
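
Here is a minimal, self-contained sketch of the safe comparisons. The class name BoxingDemo and the helper sameValue are made up for illustration. The reference comparison on 4444 normally prints false, but the result depends on the JVM's cache settings, so no output is promised for that line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class BoxingDemo {
    // Hypothetical helper: compares two boxed values by value, and is null-safe.
    static boolean sameValue(Integer a, Integer b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        int x = 4444;
        List<Integer> list = new ArrayList<>();
        list.add(x); // autoboxed; javac emits Integer.valueOf(x)
        list.add(x);

        // Reference comparison: unreliable outside the cached range.
        System.out.println(list.get(0) == list.get(1));

        // Value comparisons: all three print true.
        System.out.println(list.get(0).equals(list.get(1)));
        System.out.println(list.get(0).intValue() == list.get(1).intValue());
        System.out.println(sameValue(list.get(0), list.get(1)));

        // Inside the guaranteed range the same object is reused: prints true.
        System.out.println(Integer.valueOf(127) == Integer.valueOf(127));
    }
}
```

Objects.equals is handy when either element might be null, where calling intValue() or equals() directly would throw a NullPointerException.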

Monday, September 21, 2015

Yelp Review Categorization - NLP

(This was built entirely during Google DevFest Buffalo 2015, a 24-hour hackathon, from first light to deployment.)
(The language, originally intended for a project description, may seem a little odd)

We tried to understand the sentiments and the various topics in reviews of Yelp businesses. We proposed that this data, a feedback rating across different facets of a business, would greatly help businesses understand customer response.

We provide simple ratings, 0 to 5 stars, for restaurants in four categories: 'Food', 'Service', 'Value for money' and 'Ambience'. The dataset is made publicly available by Yelp as part of their Dataset Challenge (http://www.yelp.com/dataset_challenge).

How did we do it? - Natural Language Processing
We extracted each sentence from each review and assigned it to one of the categories using semantic similarity based on WordNet synsets. We then computed the sentiment polarity of the sentence. This left us with a (Category, Sentiment Polarity) pair for each sentence in a review, which we aggregated across sentences and reviews to give overall category ratings.
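
The aggregation step could look roughly like the sketch below. This is not the hackathon code. It assumes polarity is a number in [-1, 1], that a category's score is the plain average of its sentence polarities, and that the average is rescaled linearly to 0 to 5 stars. The class names and the sample values are invented for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CategoryRatings {
    // One scored sentence: a category label and a polarity assumed to lie in [-1, 1].
    static final class Scored {
        final String category;
        final double polarity;

        Scored(String category, double polarity) {
            this.category = category;
            this.polarity = polarity;
        }
    }

    // Averages polarity per category, then rescales [-1, 1] linearly to [0, 5] stars.
    static Map<String, Double> aggregate(List<Scored> sentences) {
        Map<String, double[]> sums = new LinkedHashMap<>(); // category -> {sum, count}
        for (Scored s : sentences) {
            double[] acc = sums.computeIfAbsent(s.category, k -> new double[2]);
            acc[0] += s.polarity;
            acc[1] += 1;
        }
        Map<String, Double> stars = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : sums.entrySet()) {
            double mean = e.getValue()[0] / e.getValue()[1];
            stars.put(e.getKey(), (mean + 1.0) * 2.5);
        }
        return stars;
    }

    public static void main(String[] args) {
        // Made-up polarities, not values from the Yelp dataset.
        List<Scored> sentences = new ArrayList<>();
        sentences.add(new Scored("Food", 1.0));
        sentences.add(new Scored("Food", 0.0));
        sentences.add(new Scored("Service", -1.0));
        System.out.println(aggregate(sentences)); // {Food=3.75, Service=0.0}
    }
}
```

A plain average treats every sentence equally. Weighting by how confidently a sentence was assigned to its category would be one obvious variation.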

We also tried to be more precise by extracting phrases within a sentence. For instance, the sentence "The pizza was really awesome, but had to wait a lot." talks about two categories, 'Food' and 'Service', with opposing sentiments. We used the Stanford Parser for this extraction, but dropped the idea because of the computation time. Various optimizations were done to make the system efficient.

We had what seemed to be an inexhaustible repo of ideas for increasing accuracy and improving results: using the review rating provided by Yelp, reviewer profile information, different similarity measures, topic modeling (LDA), supervised learning... Due to the time constraints of a hackathon we could not explore everything, but this is an ongoing project with quite a few applications, such as review summarization.

On this advent of my first technical post, I would like to thank my god, family and friends and their neighbours. And only for the sake of it, should also consider to mention, though not in a completely feeble, fleeting and frivolous manner, my team for this project: Himanshu, Ankit and Harishankar Vishwanathan.


Results: http://avinav.science:4000/
Git: https://github.com/avinav/Yelp_Review_Categorization
Some related papers and articles:
WordNet::Similarity - Measuring the Relatedness of Concepts
Wordnet based semantic similarity measurement
Sentence Similarity Based on Semantic Nets and Corpus Statistics