Determining the Prevalence of Cannabis, Tobacco, and Vaping Device Mentions in Online Communities

Abstract: "The relationship between cannabis, tobacco, and vaping devices is both rapidly changing and poorly understood, with consumers rapidly shifting between use of all three product types. Given this dynamic and evolving landscape, there is an urgent need to monitor and better understand co-use, dual-use, and transition patterns between these products. This study describes work that utilizes social media - in this case, Reddit - in conjunction with automated Natural Language Processing (NLP) methods to better understand cannabis, tobacco, and vaping device product usage patterns ... We collected Reddit data from the period 2013-2018, sourced from eight popular, high-volume Reddit communities (subreddits) related to the three product categories. We then manually annotated (coded) a set of 2,640 Reddit posts and trained a machine learning-based NLP algorithm to automatically identify and disambiguate between cannabis or tobacco mentions (both smoking and vaping) in Reddit posts. This classifier was then applied to all data derived from the eight subreddits, 767,788 posts in total ... The NLP algorithm achieved an overall moderate performance (overall F-score of 0.77). When applied to our large corpus of Reddit posts, we discovered that over 10% of posts in the smoking cessation subreddit r/stopsmoking were classified as referring to vaping nicotine, and that only 2% of posts from the subreddits r/electronic_cigarette and r/vaping were classified as referring to smoking (tobacco) cessation ... This study presents the results of applying an NLP algorithm designed to identify and distinguish between cannabis and tobacco mentions (both smoking and vaping) in Reddit posts, hence contributing to our currently limited understanding of co-use, dual-use, and transition patterns between these products."
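
The two kinds of numbers reported in the abstract, a per-subreddit share of posts assigned to a category and an F-score, can be illustrated with a minimal Python sketch. It is not the study's pipeline: the label names (`tobacco_smoking`, `nicotine_vaping`, `other`), the function names, and the toy predictions are all hypothetical, and the sketch assumes the F-score is the standard F1 (harmonic mean of precision and recall), which the abstract does not specify.

```python
from collections import Counter, defaultdict


def label_prevalence(posts):
    """Return {subreddit: {label: fraction of that subreddit's posts}}.

    `posts` is an iterable of (subreddit, predicted_label) pairs.
    """
    counts = defaultdict(Counter)
    for subreddit, label in posts:
        counts[subreddit][label] += 1
    return {
        sub: {label: n / sum(c.values()) for label, n in c.items()}
        for sub, c in counts.items()
    }


def f_score(tp, fp, fn):
    """F1 from raw counts of true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0  # no correct positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up classifier output (NOT data from the study).
toy_posts = [
    ("r/stopsmoking", "tobacco_smoking"),
    ("r/stopsmoking", "tobacco_smoking"),
    ("r/stopsmoking", "nicotine_vaping"),
    ("r/stopsmoking", "other"),
    ("r/vaping", "nicotine_vaping"),
    ("r/vaping", "nicotine_vaping"),
]
prevalence = label_prevalence(toy_posts)
```

In this toy input, `prevalence["r/stopsmoking"]["nicotine_vaping"]` is the share of r/stopsmoking posts labeled as nicotine vaping, the same kind of quantity as the "over 10%" figure in the abstract, while `f_score` would be computed against the manually annotated posts rather than the full corpus.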