Decoding the data: Twitter Analytics

Around this time last year, Hervé shared his dabbling in Google Trends, which retrospectively identified the ‘pharma bro’ as the most significant driver of pharmaceutical pricing searches in 2017. This sparked a lot of internal interest and prompted a pet project to see if we could do some deeper analysis of society’s engagement with pharma pricing by flexing our proverbial machine learning and Natural Language Processing (NLP) muscles. Search trends can only take you so far, so for 2018 we wanted to get our teeth into something with a bit more substance, and Twitter duly obliged.

True to our business line, Chris spent four months mining tweets, from August to November this year, using ‘drug pricing’ as a search term. Then it was time to get analysing. As Inbeeo’s new analytical intern, I took up the challenge of making sense of this data, and while I don’t have Hervé’s years of industry experience, I wanted to use my data science experience to let the data do the talking. Initial processing removed retweets, resulting in a dataset of 24,131 original (unique) tweets, the vast majority of these coming from the United States (no real surprises there). However, our real interest was in the content of the tweets, and for this we needed to crack out the algorithms.
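The post does not show how the retweets were removed, so the following is only a minimal sketch of one way that initial processing step could look. The sample tweets, the `RT @` prefix test and the helper names `normalise` and `original_tweets` are all assumptions made for illustration.

```python
import re

# Hypothetical sample; the real dataset and its fields are not shown in the post.
tweets = [
    "Drug pricing is out of control",
    "RT @someone: Drug pricing is out of control",
    "drug pricing is out of control ",
    "Senate looks at drug pricing transparency",
]


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so near-identical tweets compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def original_tweets(texts):
    """Drop retweets (assumed to start with 'RT @') and duplicates, keeping order."""
    seen = set()
    kept = []
    for text in texts:
        if text.lstrip().startswith("RT @"):
            continue  # classic retweet marker; an assumption about the raw data
        key = normalise(text)
        if key in seen:
            continue  # same text already kept once
        seen.add(key)
        kept.append(text)
    return kept


unique = original_tweets(tweets)
print(len(unique), "original tweets kept")
```

Normalising before comparing is a design choice: it treats tweets that differ only in case or spacing as the same tweet, which may or may not match how the original 24,131 figure was counted.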

I chose two algorithms, K-means and Latent Dirichlet Allocation (LDA), both generally considered good at analysing textual data by grouping similar strings (tweets in this case) into clusters. Running K-means and LDA produced the following plots, where points of the same colour belong to the same cluster:

Left: Clusters generated by K-means. Right: Clusters generated by LDA

Both clustering algorithms produced defined clusters, but K-means did not work as well as LDA. In the K-means plot, a sea of blue containing most of the tweets extends across the plot, whereas LDA produced smaller clusters. On inspection, the large blue cluster contains tweets relating to a recent news story about Nirmal Mulye, the CEO of Nostrum Pharmaceuticals, raising drug prices by 400%. Interestingly (or perhaps unfortunately), K-means hasn’t aggregated all tweets on this story into one cluster; instead, the story is spread across five. LDA has been more successful at aggregation, with only two clusters relating to this story.
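For readers who want to try the same comparison, here is a rough sketch using scikit-learn. The post does not say which library, features or settings were used, so the toy corpus, the TF-IDF and count features, and the choice of two groups are all assumptions. The step that projects tweets onto a 2-D plot is left out.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus standing in for the real tweets.
docs = [
    "ceo raises drug price 400 percent",
    "nostrum ceo defends drug price hike",
    "ceo price hike sparks outrage",
    "senate bill targets drug pricing secrecy",
    "senate votes on pricing transparency bill",
    "transparency bill passes senate",
]

n_groups = 2  # assumption; the post's topic numbers go up to at least 14

# K-means on TF-IDF vectors: each tweet gets exactly one cluster label.
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)
kmeans = KMeans(n_clusters=n_groups, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X_tfidf)

# LDA on raw term counts: each tweet gets a distribution over topics.
# Taking the most probable topic gives a hard label comparable to K-means.
counts = CountVectorizer(stop_words="english")
X_counts = counts.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=n_groups, random_state=0)
doc_topics = lda.fit_transform(X_counts)
lda_labels = doc_topics.argmax(axis=1)

print("K-means labels:", kmeans_labels.tolist())
print("LDA labels:    ", lda_labels.tolist())
```

One difference worth keeping in mind when comparing the two: K-means assigns each tweet to a single cluster by distance, while LDA models each tweet as a mixture of topics, so the hard LDA labels above discard some information.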

Next, I wondered about the timing of these tweets: did they spike when a news story occurred, or were they random opinions? I decided to take the results of the better algorithm, LDA, and plot the tweets of each cluster over time, as shown below.

Graph showing tweet volume over time, indexed by topic. In the infographic on the top right, the size of the box is proportional to the number of tweets per topic
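The counting behind a graph like this can be sketched in a few lines. The (date, topic) pairs below are made-up placeholders rather than the real data, and `daily_volume` and `peak_day` are hypothetical helper names. The idea is simply to count tweets per topic per day and look for the busiest day.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical (posting date, assigned topic) pairs; counts are illustrative only.
assigned = [
    (date(2018, 9, 11), 1),
    (date(2018, 9, 12), 1),
    (date(2018, 9, 12), 1),
    (date(2018, 9, 12), 1),
    (date(2018, 9, 12), 10),
    (date(2018, 10, 10), 5),
    (date(2018, 10, 10), 5),
    (date(2018, 10, 11), 5),
]


def daily_volume(pairs):
    """Count tweets per day for each topic: {topic: Counter({day: n})}."""
    volume = defaultdict(Counter)
    for day, topic in pairs:
        volume[topic][day] += 1
    return volume


def peak_day(volume):
    """Return the busiest (day, count) for each topic."""
    return {topic: counts.most_common(1)[0] for topic, counts in volume.items()}


volume = daily_volume(assigned)
peaks = peak_day(volume)
for topic in sorted(peaks):
    day, n = peaks[topic]
    print(f"Topic {topic}: peak of {n} tweets on {day.isoformat()}")
```

The per-topic counters in `volume` are also what a plotting library would need to draw one line per topic over time.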

The results confirmed my hunch – people tweeted as a reaction to news. I was then able to identify news stories associated with each cluster based on spikes in time. The most impactful stories discovered were:

  • 12th September, Topics 1 and 10: Nirmal Mulye raising drug prices (this is somewhat unsurprising)
  • 25th August, Topic 2: The Senate working on solutions to end secrecy around drug pricing
  • 11th September, Topic 4: Pharmaceutical middlemen rake in millions 
  • 10th October, Topic 5: Trump signs a bill on drug pricing
  • 7th September and 25th August, Topic 6: The pay raise of the Pfizer CEO (interestingly, this story was initially announced in March, but residual spikes were still seen)
  • 8th September, Topic 7: Trump ends foreign price controls 
  • 27th August, Topic 13: A drug price hike in the Netherlands 
  • 10th October, Topic 14: Trump lifts a gag clause on drug pricing

Interestingly, running the LDA algorithm multiple times yielded slightly different outputs. LDA discovered new stories with each run, implying that multiple runs may be required to find all the news stories sparking debate on Twitter. The additional stories included Oklahoma Medicaid testing a new tactic to curb US drug costs and a tweet by AARP Advocates, a group of advocates for health care, social security and older workers, encouraging people to contact their representative about rising prescription costs for seniors. Returning to the K-means analysis, I found a cluster of tweets on a news story about the price of a heroin withdrawal drug increasing. LDA did not find this story, so both algorithms may be needed to identify all the news stories gaining Twitter traction.

Interestingly, and perhaps a comment on human nature, people seemed more inclined to comment on negative stories, such as the Nirmal Mulye story, than on positive ones, such as Trump lifting the gag clause. Are these trends the sign of a broken system (with unaffordable prices and inadequate protection of vulnerable citizens, such as the elderly and heroin addicts), or do people just like to moan? Additionally, most of these tweets are from the US and are heavily influenced by politics, as highlighted by Donald Trump’s appearance in topics relating to lowering prices and increasing transparency, as well as AARP Advocates’ petition to Congress. This also raises the question of the motivations behind these tweets. Are they from people with a genuine concern about drug pricing, or from opportunistic posters using the stories to push a political agenda? Still, although small in proportion compared to negative tweets, there are signs of positive engagement: Donald Trump and the Senate making headway on increasing transparency and lowering drug costs, and Oklahoma Medicaid working on curbing drug costs, both got people talking.

To get an idea of public opinion on a topic, be it a particular drug or pharmaceutical company, NLP is the way to go. Using fairly simple (as far as machine learning goes, anyway) algorithms, I was able to quickly identify news events concerning drug pricing that resonated with the Twitterati and to gauge opinions, and even more is possible! Pharmaceutical companies can analyse Twitter to assess whether their product was a success: for instance, are people reporting side effects? Real-time dashboards can mine data from Twitter and run sentiment analysis to discover how people feel about a company or the market in general. And what about marketing a product? When is the best time of day to post about it? All this and more can be discovered using machine learning. The possibilities are immense!