Scripting for Fun and Passion

Statistical Analysis of Rahul versus Arnab

So, Arnab kept asking about Rahul’s opinion on Modi, the 1984 riots and Ashok Chavan, and Rahul fought back with RTI, women’s empowerment and broader system-related questions from his armory. That is how one of India’s recent, probably “once in a lifetime”, face-offs between 2 social media favorites ended. The unstoppable force versus the immovable object. Here I present a data-based analysis of the whole proceeding.

It has been almost a week since Rahul Gandhi’s interview with Times Now journalist Arnab Goswami was published on Youtube. Those of you who haven’t yet seen the latest hot thing in Indian politics should spend some 1.5 hours of your time studying the psyche of the Vice President of our current ruling party.

Since it was published, the video has garnered more than 1.7 million views, and it has gone quite viral in the days following the interview.

Youtube Video Stats. Source:

This has also allowed the politically engaged Indian masses on social media to share their sentiments about Rahul Gandhi and his interview. The comments in my friend groups have been mostly humorous. A majority of them claimed that the interview was all about Rahul reiterating the same points over and over again. “Women Empowerment”, “RTI” and “Rahul Gandhi” were supposedly among the phrases Rahul overused. Social media was abuzz with memes about Rahul Gandhi [Source:], and there was even a website dedicated entirely to generating answers as Rahul would have given them. [Source:]

Being a data scientist and a beginner in text processing, I decided to do a fun weekend project on the interview text, looking for patterns and checking whether they match the claims people are making on social media. Another reason this interview was of particular interest to me is that it brought 2 icons of Indian media together. It was like “an unstoppable force meeting an immovable object” [Source: The Dark Knight, 2008], and I am sure people saw the scales remain balanced till the end.

I looked at the data from 3 perspectives:

  • All Data
  • Rahul’s Text
  • Arnab’s Text

This was important because I wanted to do a frequency-based statistical analysis and see if the claims on social media were correct. So I decided to answer the following research question:

“How accurate are the claims on social media about Rahul versus Arnab, and what insights do they give into the personalities of the 2 people involved?”

With this simple question in mind, I first looked at word frequencies in all 3 datasets. Some of the preliminary results were not quite consistent with the claims; they reflected the social media audience’s inclination to latch on to a few catchphrases from the interview and build a whole viral campaign out of them.
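A word-frequency count of this kind can be sketched in a few lines of Python. This is only an illustration and not what ConText does internally; the stop-word list and the sample sentence are made up for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "i", "you",
              "that", "it", "on", "for", "with", "this", "are", "was", "we", "have"}

def word_frequencies(text, top_n=10):
    """Return the top_n most common words in text, ignoring stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)

# Toy text for illustration only, not a quote from the transcript.
sample = "The system needs change. We want to open the system and empower women through RTI."
print(word_frequencies(sample, top_n=3))
```

Running the same function on the full text, on Rahul's text and on Arnab's text separately gives the 3 views used in this post.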

Looking at all the data together, the most prominent topics during the interview were riots, system, RTI and Gujarat. This is quite understandable, as Arnab kept focusing on issues like the Gujarat riots while Rahul focused on system changes as part of his broader-perspective strategy.

A more statistical result was:

All word statistics

However, what interested me more was the number of entities mentioned. This allowed me to focus on key individuals who came up during the interview, and the results were quite interesting. Leaving out Rahul [Rahul will be discussed in detail when studying Rahul’s text independently ;)] and Congress, the other key entities were Gujarat and Narendra Modi, which is to be expected. The most interesting entity to surface, however, was Ashok Chavan, whose name Arnab used many times to extract answers from Rahul. 1984 and Cambridge were also discussed quite frequently.

All Text Entities

I also constructed a network of entities which occurred together, and it reflected similar patterns. People whose names were linked to the 1984 riots, like Sajjan Kumar, Bhagat and Jagdish Tytler, were also evident from the analysis.

All Entity Network
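One simple way to build such a co-occurrence network is to count how often 2 entities appear in the same sentence. The sketch below assumes a hand-written entity list and naive substring matching; the network in this post came from ConText, whose exact method may differ.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical entity list for illustration; real entities would come from an entity tagger.
ENTITIES = ["Modi", "Gujarat", "Congress", "1984", "Ashok Chavan"]

def cooccurrence_edges(text, entities):
    """Count how often each pair of entities appears in the same sentence."""
    edges = Counter()
    # Naive sentence split on ., ? or ! followed by whitespace.
    for sentence in re.split(r"(?<=[.?!])\s+", text):
        lowered = sentence.lower()
        # Substring matching is crude (no word boundaries), but fine for a sketch.
        present = sorted(e for e in entities if e.lower() in lowered)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges

# Toy text for illustration only, not a quote from the transcript.
sample = ("What about Modi and the Gujarat riots? "
          "Congress acted in 1984. "
          "Modi was chief minister of Gujarat.")
edges = cooccurrence_edges(sample, ENTITIES)
for (a, b), weight in edges.items():
    print(a, b, weight)
```

Each printed line is a weighted edge, which can be written out as a CSV edge list and loaded into a graph tool such as Gephi.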

To find the central theme of the interview, I ran topic modelling on the text and got 5 major clusters of topics which co-occurred frequently. Apart from the central theme being Rahul v/s Modi and the elections of 2014, the other important and more frequent themes were hidden mostly in Rahul’s answers on women’s issues, RTI, the system and the 2 riots of Gujarat and 1984. Ashok Chavan also came up frequently during the interview, regarding him being shielded in the Adarsh scam.

Frequent and important topics during interview

Once finished with the overview of the text corpus as a whole, I decided to dig deeper and look into the individual statements given by Rahul and Arnab. To me these are the interesting datasets, as they hold the answers to the research question I was pursuing.

Looking at Arnab’s dataset, it was quite evident that he continued his style of asking very detailed questions: Arnab spoke around 5071 words, compared to Rahul’s 7460. While Arnab focused on issues like Modi, Chavan and the 1984 riots, Rahul focused more on issues like system, people, RTI and women. However, the internet memes started to make sense when I looked at the entity results for Rahul and Arnab. While Arnab mentioned entities like Rahul, Modi, Gujarat and 1984, Rahul’s top entities included Congress, Gujarat, India and “Rahul Gandhi” [The Rock and Stone Cold Steve Austin would be amazed at the new entry to their club]. In fact, Rahul used “Rahul Gandhi” 11 times in his statements, more than the number of times he used Modi [6] or even Ashok Chavan [3].
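Counting mentions of multi-word entities such as “Rahul Gandhi” needs phrase matching rather than single-word counts. Here is a minimal sketch, assuming plain case-insensitive matching on word boundaries; the sample text is invented and the counts reported in this post came from ConText, not from this function.

```python
import re

def count_mentions(text, phrases):
    """Count case-insensitive whole-phrase occurrences of each phrase in text."""
    counts = {}
    for phrase in phrases:
        # \b keeps "Modi" from matching inside a longer word.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        counts[phrase] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Toy text for illustration only, not a quote from the transcript.
sample = ("Rahul Gandhi wants to empower women. What Rahul Gandhi wants is "
          "to change the system, not to talk about Modi.")
mentions = count_mentions(sample, ["Rahul Gandhi", "Modi", "Ashok Chavan"])
print(mentions)
```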

Rahul's Word Stats

Rahul's Entity Stats

Arnab's Word Frequency

Arnab's Entity Stats

Another interesting thing I found was that BJP and AAP were mentioned far less frequently by both individuals, especially compared to the number of times Modi and Congress were mentioned.
















While Arnab’s questions revolved around topics related to Modi, Congress, Chavan and riots, Rahul’s answers were mostly about RTI, the system and youngsters in elections, with the central topic revolving around women’s issues. The central topics were not the most frequent ones, but the ones most uniformly distributed across the whole conversation.
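The difference between “frequent” and “uniformly distributed” can be made concrete with a simple dispersion score: cut the text into equal chunks and check in what fraction of them a word shows up. This is my own rough sketch of the idea, not the measure the topic model uses; the chunk count and the sample text are arbitrary assumptions.

```python
import re

def dispersion(text, term, n_chunks=4):
    """Fraction of roughly equal-sized chunks of text that contain term at least once."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    size = -(-len(tokens) // n_chunks)  # ceiling division
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    hits = sum(1 for chunk in chunks if term.lower() in chunk)
    return hits / len(chunks)

# Toy text for illustration only: "riots" is more frequent, "women" is more spread out.
sample = "riots riots riots riots women talk system women here and women too"
print(dispersion(sample, "riots"), dispersion(sample, "women"))
```

In this toy text “riots” occurs 4 times but only in the first half, while “women” occurs 3 times across three of the four chunks, so it scores higher on dispersion despite being less frequent. The function only handles single-word terms.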

Arnab's Entity Network

Rahul's Entity Network

In his statements, Rahul tried to connect the Congress party to issues related to RTI and India, along with focusing on its performance in various states. He frequently tried to draw differences between the Gujarat riots and the 1984 riots. This was quite different from the entities Arnab tried to link. Arnab’s focus revolved around Modi and his “Shehzada” comments about Rahul, and Rahul’s performance in UP. Arnab also tried to pit Rahul against the BJP’s PM candidate Modi by bringing up Modi’s candidature for PM of India quite regularly during the interview.

Arnab's Topics

Rahul's Topics


I am thankful to the Times of India website for making the whole interview transcript publicly available online. [Source:] This made the text analysis a far easier project for me. (After all, I was not interested in spending another hour and a half transcribing the whole audio.)

After getting the data, I had to clean it to get it into an analyzable state. I decided to split it into 3 separate datasets:

  • Full text of Interview
  • Only Rahul’s Text
  • Only Arnab’s Text
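The split into per-speaker datasets can be sketched as below. I am assuming here that every turn in the transcript starts with the speaker's name followed by a colon (“Arnab:” or “Rahul:”); the real transcript format may differ, and the sample lines are invented.

```python
from collections import defaultdict

def split_by_speaker(transcript, speakers=("Arnab", "Rahul")):
    """Group transcript text by speaker, assuming each turn starts with 'Name:'."""
    turns = defaultdict(list)
    current = None
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        for speaker in speakers:
            prefix = speaker + ":"
            if line.startswith(prefix):
                current = speaker
                line = line[len(prefix):].strip()
                break
        # Lines without a prefix are treated as a continuation of the current turn.
        if current is not None and line:
            turns[current].append(line)
    return {speaker: " ".join(parts) for speaker, parts in turns.items()}

# Toy transcript for illustration only.
sample = """Arnab: What is your view on this?
Rahul: We need to change the system.
It has to be opened up.
Arnab: But my question was specific."""

datasets = split_by_speaker(sample)
for speaker, text in datasets.items():
    print(speaker, len(text.split()), "words")
```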

I used a tool called ConText for the data analysis (word stats, entity stats, network generation and topic visualizations), along with some Python scripts for parsing the data. I created the visualizations in Gephi, using centrality measures for node sizes [Degree] and label sizes [Betweenness], and modularity classes for node coloring.

Interactive Charts of the images presented above along with full analytics data can be found at:


After doing this basic analysis, I figured out that the topics which were not the most frequent but were uniformly spread through the discussion became the more popular ones on social media. Rahul’s usage of women empowerment, RTI and “Rahul Gandhi” was caught by social media enthusiasts and made viral. However, this also led to many other important topics and issues being buried beneath the viral sharing. Key individuals like Ashok Chavan and Virbhadra, and some of the scams which were mentioned, were not picked up by social media audiences.

Another important observation was how Rahul evaded questions, and how rarely he answered to the point or named the individuals he was supposed to comment on. Even though answering in a positive tone and mentioning the issues one envisions is a perfectly safe and good strategy, I would say that in a personal interview being a bit more specific and elaborate on the questions at hand is more important. As the statistics reflected, he appeared to use the platform more as a means to talk about what he is planning for the future and has done in the past than about the key issues in the current political scenario. Overall, the claims on social media were quite accurate.

Finally, I still feel there are a lot more things which could make this analysis more useful and interesting. Some ideas I have but can’t implement for lack of time [PhD studies ;)] are:

  • Word correlation for each question and its corresponding answer
  • Language models for Rahul Gandhi’s answers and Arnab’s questions [the latter can be done more easily because of the abundant dataset available]
  • Sentiment tracking for each entity and the way in which the answers were given
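As a starting point for the first idea, one crude reading of “word correlation” is the vocabulary overlap between a question and its answer. The sketch below uses Jaccard similarity on word sets; this is just one possible choice, and the question and answer strings are invented.

```python
import re

def jaccard_overlap(question, answer):
    """Share of distinct words that a question and its answer have in common."""
    q = set(re.findall(r"[a-z0-9']+", question.lower()))
    a = set(re.findall(r"[a-z0-9']+", answer.lower()))
    if not q or not a:
        return 0.0
    return len(q & a) / len(q | a)

# Toy question/answer pair for illustration only.
score = jaccard_overlap("riots in 1984", "the 1984 riots")
print(score)
```

A low score across many question/answer pairs would hint that the answers are drifting away from the questions, though stop words should be removed first for the number to mean much.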

Fun Bites

Just today I also happened to see this quite dramatic reconstruction of the whole interview by Cyrus Broacha. I think a language model for both Rahul and Arnab would have greatly improved the video.


This article is my personal analysis and opinion on the issue at hand. I have cited sources from which I have taken the data and the tools I have used. If anyone plans to reproduce this article or my analysis on their site, please give a link back.

If you agree or disagree, have thoughts to add to my analysis, or want to answer broader questions related to light-bulb-changing labor forces and chickens crossing streets, please feel free to use the comment section.

Also, humor and analysis were some of the key elements I thought of while writing this piece, and I am pretty sure I ended up doing the latter relatively more than the former.


2 responses to “Statistical Analysis of Rahul versus Arnab”

  1. Srinivas Chilukuri March 12, 2014 at 9:20 pm

    Hi Shubhanshu,

    Nice post. Will it be possible for you to share the ConText tool you mentioned? I reached out to the UIUC team but haven’t received any response from them.

