{"id":4568,"date":"2019-12-29T07:44:16","date_gmt":"2019-12-29T06:44:16","guid":{"rendered":"https:\/\/pythonprogramming.altervista.org\/?p=4568"},"modified":"2019-12-29T07:44:16","modified_gmt":"2019-12-29T06:44:16","slug":"collect-data-from-reddit","status":"publish","type":"post","link":"https:\/\/pythonprogramming.altervista.org\/collect-data-from-reddit\/","title":{"rendered":"Collect data from Reddit"},"content":{"rendered":"<p>Reddit is a great source of data that can be accessed in many ways. Here we will use the Pushshift API: <a href=\"https:\/\/github.com\/pushshift\/api\">https:\/\/github.com\/pushshift\/api<\/a>.<\/p>\n<p>I put an example in <a href=\"https:\/\/notebooks.ai\/giovannigatto\/matplol-reddit\/lab\">this notebook<\/a>.<\/p>\n<h2>Import<\/h2>\n<p>First import these modules:<\/p>\n<pre class=\"lang:default decode:true \">import requests\r\nimport pandas\r\nimport textblob\r\nimport plotly.express as px\r\nimport nltk<\/pre>\n<p>Then download the NLTK tokenizer data, set the pandas display option and define a few constants:<\/p>\n<pre class=\"lang:default decode:true \">nltk.download('punkt')\r\npandas.set_option('display.max_colwidth', None) # don't cut my pandas dataframes (None replaces the deprecated -1 in pandas 1.0+)\r\n# define variables\r\n\r\nCOMMENT_COLOR         = \"darkgreen\"\r\nSUBMISSION_COLOR      = \"darkorange\"\r\nTEXT_PREVIEW_SIZE     = 240\r\nTERM_OF_INTEREST      = \"python\"\r\nSUBREDDIT_OF_INTEREST = \"python\"\r\nTIMEFRAME             = \"48h\" # see more options in the pushshift api docs: https:\/\/github.com\/pushshift\/api\r\n# a couple of helper functions<\/pre>\n<h2>Function to get the data you are looking for as JSON<\/h2>\n<pre class=\"lang:default decode:true\">def get_pushshift_data(data_type, **kwargs):\r\n    \"Get data of the given data_type from the API, returned as parsed JSON\"\r\n    base_url = f\"https:\/\/api.pushshift.io\/reddit\/search\/{data_type}\/\"\r\n    payload = kwargs\r\n    request = requests.get(base_url, params=payload)\r\n    return request.json()\r\n\r\n\r\ndata = get_pushshift_data(data_type=\"comment\", q=TERM_OF_INTEREST, 
after=TIMEFRAME, size=1000, aggs=\"subreddit\").get(\"aggs\").get(\"subreddit\")<\/pre>\n<h2>Show the data<\/h2>\n<pre class=\"lang:default decode:true \">df = pandas.DataFrame.from_records(data)[0:10]\r\n\r\nfig = px.bar(df,\r\n       x=\"key\",\r\n       y=\"doc_count\",\r\n       title=f\"Subreddits with most activity - comments with '{TERM_OF_INTEREST}' in the last {TIMEFRAME}\",\r\n       labels={\"doc_count\": \"# comments\",\"key\": \"Subreddits\"},\r\n       color_discrete_sequence=[COMMENT_COLOR],\r\n       height=500,\r\n       width=800)\r\nfig.show()<\/pre>\n<p>The output<\/p>\n<p><a href=\"https:\/\/pythonprogramming.altervista.org\/wp-content\/uploads\/2019\/12\/data.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-4569\" src=\"https:\/\/pythonprogramming.altervista.org\/wp-content\/uploads\/2019\/12\/data.png\" alt=\"\" width=\"800\" height=\"500\" srcset=\"https:\/\/pythonprogramming.altervista.org\/wp-content\/uploads\/2019\/12\/data.png 800w, https:\/\/pythonprogramming.altervista.org\/wp-content\/uploads\/2019\/12\/data-320x200.png 320w, https:\/\/pythonprogramming.altervista.org\/wp-content\/uploads\/2019\/12\/data-768x480.png 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"Data from reddit: get them with Python and Plotly\n<a class=\"moretag\" href=\"https:\/\/pythonprogramming.altervista.org\/collect-data-from-reddit\/\"> 
[...]<\/a>","protected":false},"author":1,"featured_media":4570,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","footnotes":""},"categories":[1],"tags":[489,325,660,659],"class_list":["post-4568","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-examples","tag-data","tag-matplotlib","tag-plotly","tag-reddit"],"avopt_banners_inside_post":true,"avopt_banners_on_page":true,"av_copy_from":"","av_sharing_message":"","av_sharing_allowed":true,"av_sharing_on":{"fb":[],"tw":[]},"av_allow_affiliate_banner":false,"av_allow_affiliate_multi_banner":false,"av_show_affiliation_buy_button":false,"av_post_rating":true,"av_have_post_rating_value":false,"av_is_artificial_intelligence_content":false,"_links":{"self":[{"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/posts\/4568","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/comments?post=4568"}],"version-history":[{"count":1,"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/posts\/4568\/revisions"}],"predecessor-version":[{"id":4571,"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/posts\/4568\/revisions\/4571"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/media\/4570"}],"wp:attachment":[{"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/media?parent=4568"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/categories?post=4568"},{"
taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/tags?post=4568"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}