{"id":5334,"date":"2020-03-04T20:07:07","date_gmt":"2020-03-04T19:07:07","guid":{"rendered":"https:\/\/pythonprogramming.altervista.org\/?p=5334"},"modified":"2020-03-05T09:18:26","modified_gmt":"2020-03-05T08:18:26","slug":"webscrapint-coronavirus-data","status":"publish","type":"post","link":"https:\/\/pythonprogramming.altervista.org\/webscrapint-coronavirus-data\/","title":{"rendered":"Webscraping coronavirus data"},"content":{"rendered":"<p>This simple script <strong>scrapes<\/strong> the latest <strong>data<\/strong> on confirmed <strong>coronavirus<\/strong> cases in <strong>Italy<\/strong> from a website with <strong>Python<\/strong>.<\/p>\n<p>It only takes some basic web scraping, starting with these imports:<\/p>\n<pre class=\"lang:default decode:true \">import requests\r\nimport urllib.request\r\nimport time\r\nfrom bs4 import BeautifulSoup\r\n<\/pre>\n<p>Now that we have the tools, let&#8217;s use them:<\/p>\n<pre class=\"lang:default decode:true \">url = \"https:\/\/lab24.ilsole24ore.com\/coronavirus\/\"\r\n\r\n# access the site\r\nresponse = requests.get(url)\r\nprint(response)<\/pre>\n<p>We made a request to the ilsole24ore site. If the request succeeded, this is the output:<\/p>\n<pre class=\"lang:default decode:true \">&lt;Response [200]&gt;<\/pre>\n<p>We parse the HTML that we got as the response:<\/p>\n<pre class=\"lang:default decode:true \">soup = BeautifulSoup(response.text, \"html.parser\")<\/pre>\n<p>We get all the &lt;h2&gt; tags, because the data is inside an &lt;h2&gt; tag, as you can see below.<\/p>\n<p>To open the panel on the right that you see below, right-click on the page and choose Inspect from the menu, then click the arrow at the top left (click 1), then click the number of confirmed cases (click 2). 
You will see the &lt;h2&gt; tag that contains the number on the right.<\/p>\n<p><a href=\"https:\/\/pythonprogramming.altervista.org\/wp-content\/uploads\/2020\/03\/clicking.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-5335\" src=\"https:\/\/pythonprogramming.altervista.org\/wp-content\/uploads\/2020\/03\/clicking.png\" alt=\"\" width=\"1293\" height=\"541\" srcset=\"https:\/\/pythonprogramming.altervista.org\/wp-content\/uploads\/2020\/03\/clicking.png 1293w, https:\/\/pythonprogramming.altervista.org\/wp-content\/uploads\/2020\/03\/clicking-320x134.png 320w, https:\/\/pythonprogramming.altervista.org\/wp-content\/uploads\/2020\/03\/clicking-960x402.png 960w, https:\/\/pythonprogramming.altervista.org\/wp-content\/uploads\/2020\/03\/clicking-768x321.png 768w\" sizes=\"auto, (max-width: 1293px) 100vw, 1293px\" \/><\/a><\/p>\n<pre class=\"lang:default decode:true \">h2 = soup.findAll(\"h2\")<\/pre>\n<p>The first h2 is the one we need; let&#8217;s see what we have:<\/p>\n<pre class=\"lang:default decode:true\">print(h2)<\/pre>\n<p>This is the output:<\/p>\n<pre class=\"lang:default decode:true \">[&lt;h2 class=\"timer count-number\" data-speed=\"1000\" data-to=\"2706\" id=\"num_1\"&gt;&lt;\/h2&gt;, &lt;h2 class=\"timer count-number\" data-speed=\"1000\" data-to=\"107\" id=\"num_2\"&gt;&lt;\/h2&gt;, &lt;h2 class=\"timer count-number\" data-speed=\"1000\" data-to=\"276\" id=\"num_3\"&gt;&lt;\/h2&gt;, &lt;h2 class=\"timer count-number\" data-speed=\"1000\" data-to=\"3089\" id=\"num_4\"&gt;&lt;\/h2&gt;, &lt;h2 class=\"chartTitle\"&gt;&lt;center&gt;I numeri complessivi&lt;\/center&gt;&lt;\/h2&gt;, &lt;h2 class=\"chartTitle\"&gt;Il trend giorno per giorno&lt;\/h2&gt;, &lt;h2 class=\"chartTitle\"&gt;L\u2019andamento nelle province con pi\u00f9 contagi&lt;\/h2&gt;, &lt;h2 class=\"chartTitle\"&gt;L\u2019andamento delle 5 regioni con pi\u00f9 contagi&lt;\/h2&gt;, &lt;h2 class=\"chartTitle\"&gt;I dati per 
provincia&lt;\/h2&gt;, &lt;h2 class=\"chartTitle\"&gt;I nuovi tamponi giornalieri&lt;\/h2&gt;, &lt;h2 class=\"chartTitle\"&gt;Ricoveri e terapie intensive&lt;\/h2&gt;, &lt;h2 class=\"chartTitle\"&gt;Come crescono i ricoveri&lt;\/h2&gt;, &lt;h2 class=\"chartTitle\"&gt;Il contagio nei Paesi europei&lt;\/h2&gt;, &lt;h2 class=\"chartTitle\"&gt;I primi dieci Paesi al mondo per contagio&lt;\/h2&gt;]<\/pre>\n<p>As you can see, the first one is the one we are looking for, with the number 2706 in its data-to attribute.<\/p>\n<p>I am going to convert the first h2 element (h2[0]) into a string and then print the sixth element (index 5) of the list obtained by splitting the string on double quotes.<\/p>\n<pre class=\"lang:default decode:true \">h2 = str(h2[0])\r\nprint(\"Last data of confirmed case in Italy:\")\r\nprint(h2.split(\"\\\"\")[5])<\/pre>\n<p>Just to be clear, h2.split(&quot;\\&quot;&quot;) returns this:<\/p>\n<pre class=\"lang:default decode:true \">['&lt;h2 class=', 'timer count-number', ' data-speed=', '1000', ' data-to=', '2706', ' id=', 'num_1', '&gt;&lt;\/h2&gt;']<\/pre>\n<p>The sixth element is &#8220;2706&#8221;, which is what I was looking for.<\/p>\n<h2>The whole code<\/h2>\n<pre class=\"lang:default decode:true\">import requests\r\nimport urllib.request\r\nimport time\r\nfrom bs4 import BeautifulSoup\r\n\r\n# the site to scrape\r\n\r\nurl = \"https:\/\/lab24.ilsole24ore.com\/coronavirus\/\"\r\n\r\n# access the site\r\nresponse = requests.get(url)\r\n# now we parse the html with BS\r\nsoup = BeautifulSoup(response.text, \"html.parser\")\r\nh2 = soup.findAll(\"h2\")\r\nh2 = str(h2[0])\r\nprint(\"Last data of confirmed case in Italy:\")\r\nprint(h2.split(\"\\\"\")[5])<\/pre>\n<h2>Let&#8217;s get the result with a shell here<\/h2>\n<script async src=\"https:\/\/cdn.datacamp.com\/dcl-react.js.gz\"><\/script>\r\n\r\n  <div class=\"exercise\">\r\n\r\n\r\n    <div data-datacamp-exercise data-lang=\"python\">\r\n      <code data-type=\"pre-exercise-code\">\r\n        
\r\n\r\n\r\n\r\n\r\n\r\n        \r\n      <\/code>\r\n      <code data-type=\"sample-code\">\r\n# press RUN below to see the data\r\n\r\nimport requests\r\nimport urllib.request\r\nimport time\r\nfrom bs4 import BeautifulSoup\r\n\r\n# the site to scrape\r\n\r\nurl = \"https:\/\/lab24.ilsole24ore.com\/coronavirus\/\"\r\n\r\n# access the site\r\nresponse = requests.get(url)\r\n\r\n\r\n# now we parse the html with BS\r\nsoup = BeautifulSoup(response.text, \"html.parser\")\r\nh2 = soup.findAll(\"h2\")\r\n\r\nh2 = str(h2[0])\r\nprint(\"Last data of confirmed case in Italy:\")\r\nprint(h2.split(\"\\\"\")[5])\r\n\r\n      <\/code>\r\n      <code data-type=\"solution\"><\/code>\r\n      <code data-type=\"sct\"><\/code>\r\n      <div data-type=\"hint\">Just press 'Run'.<\/div>\r\n    <\/div>\r\n  <\/div>\n<h2>The data of the previous days<\/h2>\n<script async src=\"https:\/\/cdn.datacamp.com\/dcl-react.js.gz\"><\/script>\r\n\r\n  <div class=\"exercise\">\r\n\r\n\r\n    <div data-datacamp-exercise data-lang=\"python\">\r\n      <code data-type=\"pre-exercise-code\">\r\n        \r\n\r\n\r\n        \r\n      <\/code>\r\n      <code data-type=\"sample-code\">\r\n# press RUN below to see the data\r\n\r\nimport pandas as pd\r\nfrom datetime import datetime, timedelta\r\nimport matplotlib.pyplot as plt\r\n\r\n\r\n\r\ndays = []\r\ndays2 = []\r\nfor n in range(5, 0, -1):\r\n    day = datetime.today() - timedelta(n)\r\n    days2.append(datetime(2020, day.month, day.day))\r\n    days.append(str(day.month) + \"\/\" + str(day.day))\r\n\r\nmonth = str(datetime.today().month)\r\nc = []\r\ndef check(what, day):\r\n    url = \"https:\/\/raw.githubusercontent.com\/CSSEGISandData\/COVID-19\/master\/csse_covid_19_data\/csse_covid_19_time_series\/time_series_19-covid-{}.csv\".format(what)\r\n\r\n    df = pd.read_csv(url, error_bad_lines=False)\r\n    print(what, end=\" \")\r\n    result = df.loc[df[\"Country\/Region\"]==\"Italy\"][\"{}\/20\".format(day)]\r\n    print(list(result)[0], end=\" - 
\")\r\n    if what == \"Confirmed\":\r\n        c.append(list(result)[0])\r\n\r\n\r\nwhat = \"Confirmed\", \"Recovered\", \"Deaths\" \r\n\r\nfor d in days:\r\n    print(\"{}\".format(d), end=\": \")\r\n    for w in what:\r\n        check(w, d)\r\n    print()\r\n    \r\n#sorted(days, key=lambda d: map(int, d.split('\/')))\r\nax = plt.subplot(111)\r\nax.bar(days2, c)\r\nax.xaxis_date()\r\nplt.xlabel(\"days: \")\r\nplt.ylabel(\"Confirmed \/ positivi\")\r\nplt.show()\r\n\r\n      <\/code>\r\n      <code data-type=\"solution\"><\/code>\r\n      <code data-type=\"sct\"><\/code>\r\n      <div data-type=\"hint\">Just press 'Run'.<\/div>\r\n    <\/div>\r\n  <\/div>\n<p><iframe loading=\"lazy\" title=\"Scraping data about coronavirus from a web page with Python\" width=\"747\" height=\"420\" src=\"https:\/\/www.youtube.com\/embed\/KYgwO1Zwo5M?feature=oembed&amp;enablejsapi=1\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><\/p>\n<h2>More data scraped<\/h2>\n<p>Here we add some other data:<\/p>\n<pre class=\"lang:default decode:true\">&lt;script async src=\"https:\/\/cdn.datacamp.com\/dcl-react.js.gz\"&gt;&lt;\/script&gt;\r\n  &lt;div class=\"exercise\"&gt;\r\n    &lt;div data-datacamp-exercise data-lang=\"python\"&gt;\r\n      &lt;code data-type=\"pre-exercise-code\"&gt;\r\n      &lt;\/code&gt;\r\n      &lt;code data-type=\"sample-code\"&gt;\r\n\r\nimport requests\r\nimport urllib.request\r\nimport time\r\nfrom bs4 import BeautifulSoup\r\n\r\n\r\n# the site to scrape\r\n\r\nurl = \"https:\/\/lab24.ilsole24ore.com\/coronavirus\/\"\r\n\r\n# access the site\r\nresponse = requests.get(url)\r\n# now we parse the html with BS\r\nsoup = BeautifulSoup(response.text, \"html.parser\")\r\nh2 = soup.findAll(\"h2\")\r\nprint(\"Today`s data about coronavirus in Italy:\")\r\nprint(\"Confirmed: \", 
str(h2[0]).split(\"\\\"\")[5])\r\nprint(\"Deaths: \", str(h2[1]).split(\"\\\"\")[5])\r\nprint(\"Recovered: \", str(h2[2]).split(\"\\\"\")[5])\r\n\r\n\r\n\t\r\n\r\n      &lt;\/code&gt;\r\n      &lt;code data-type=\"solution\"&gt;&lt;\/code&gt;\r\n      &lt;code data-type=\"sct\"&gt;&lt;\/code&gt;\r\n      &lt;div data-type=\"hint\"&gt;Just press 'Run'.&lt;\/div&gt;\r\n    &lt;\/div&gt;\r\n  &lt;\/div&gt;<\/pre>\n<h2>Utility to create your interactive shell<\/h2>\n<p>Here is a little utility that makes it faster to create a Python shell like this for your own site.<\/p>\n<p>You need this file, called code_shell.txt, in your working directory:<\/p>\n<pre class=\"lang:default decode:true \">&lt;script async src=\"https:\/\/cdn.datacamp.com\/dcl-react.js.gz\"&gt;&lt;\/script&gt;\r\n  &lt;div class=\"exercise\"&gt;\r\n    &lt;div data-datacamp-exercise data-lang=\"python\"&gt;\r\n      &lt;code data-type=\"pre-exercise-code\"&gt;\r\n      &lt;\/code&gt;\r\n      &lt;code data-type=\"sample-code\"&gt;\r\n\r\n{{code}}\r\n\r\n      &lt;\/code&gt;\r\n      &lt;code data-type=\"solution\"&gt;&lt;\/code&gt;\r\n      &lt;code data-type=\"sct\"&gt;&lt;\/code&gt;\r\n      &lt;div data-type=\"hint\"&gt;Just press 'Run'.&lt;\/div&gt;\r\n    &lt;\/div&gt;\r\n  &lt;\/div&gt;<\/pre>\n<p>Then save your Python script (whatever you want to run on the web page) and run this other script:<\/p>\n<pre class=\"lang:default decode:true \">import os\r\nfrom tkinter import filedialog\r\n\r\n\r\n# ask the user to pick the Python script to embed\r\nfilename = filedialog.askopenfilename(\r\n    initialdir=\".\", filetypes=[(\"Python files\", \".py\")])\r\n\r\n\r\ndef openfile(filename):\r\n    with open(filename) as filepy:\r\n        filepy = filepy.read()\r\n    return filepy\r\n\r\n\r\nfilepy = openfile(filename)\r\nfiletxt = openfile(\"code_shell.txt\")\r\n\r\n# Create file to go into wordpress\r\nfiletxt = filetxt.replace(\"{{code}}\", filepy)\r\n# replace the \".py\" extension (3 characters) with \".html\"\r\nfilename = filename[:-3] + \".html\"\r\nwith open(filename, \"w\") as file:\r\n    
file.write(filetxt)\r\n\r\n# open the generated page (os.startfile is available on Windows only)\r\nos.startfile(filename)\r\n<\/pre>\n<p>When you run this, you will be asked to choose the Python file with the code you want to run in a web page. After you choose it, the generated HTML file opens and shows the code running in a shell in the browser.<\/p>\n<script async src=\"https:\/\/cdn.datacamp.com\/dcl-react.js.gz\"><\/script>\r\n  <div class=\"exercise\">\r\n    <div data-datacamp-exercise data-lang=\"python\">\r\n      <code data-type=\"pre-exercise-code\">\r\n      <\/code>\r\n      <code data-type=\"sample-code\">\r\n\r\nimport requests\r\nimport urllib.request\r\nimport time\r\nfrom bs4 import BeautifulSoup\r\n\r\n\r\n# the site to scrape\r\n\r\nurl = \"https:\/\/lab24.ilsole24ore.com\/coronavirus\/\"\r\n\r\n# access the site\r\nresponse = requests.get(url)\r\n# now we parse the html with BS\r\nsoup = BeautifulSoup(response.text, \"html.parser\")\r\nh2 = soup.findAll(\"h2\")\r\nprint(\"Today`s data about coronavirus in Italy:\")\r\nprint(\"Confirmed: \", str(h2[0]).split(\"\\\"\")[5])\r\nprint(\"Deaths: \", str(h2[1]).split(\"\\\"\")[5])\r\nprint(\"Recovered: \", str(h2[2]).split(\"\\\"\")[5])\r\n\r\n\r\n\t\r\n\r\n      <\/code>\r\n      <code data-type=\"solution\"><\/code>\r\n      <code data-type=\"sct\"><\/code>\r\n      <div data-type=\"hint\">Just press 'Run'.<\/div>\r\n    <\/div>\r\n  <\/div>\n","protected":false},"excerpt":{"rendered":"Scraping the web to find the data about coronavirus in Italy with Python\n<a class=\"moretag\" href=\"https:\/\/pythonprogramming.altervista.org\/webscrapint-coronavirus-data\/\"> 
[...]<\/a>","protected":false},"author":1,"featured_media":5340,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","footnotes":""},"categories":[1,749],"tags":[728,489,750],"class_list":["post-5334","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-examples","category-webscraping","tag-coronavirus","tag-data","tag-webscraping"],"avopt_banners_inside_post":true,"avopt_banners_on_page":true,"av_copy_from":"","av_sharing_message":"","av_sharing_allowed":true,"av_sharing_on":{"fb":[],"tw":[]},"av_allow_affiliate_banner":false,"av_allow_affiliate_multi_banner":false,"av_show_affiliation_buy_button":false,"av_post_rating":true,"av_have_post_rating_value":false,"av_is_artificial_intelligence_content":false,"_links":{"self":[{"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/posts\/5334","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/comments?post=5334"}],"version-history":[{"count":9,"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/posts\/5334\/revisions"}],"predecessor-version":[{"id":5356,"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/posts\/5334\/revisions\/5356"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/media\/5340"}],"wp:attachment":[{"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/media?parent=5334"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/categorie
s?post=5334"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pythonprogramming.altervista.org\/wp-json\/wp\/v2\/tags?post=5334"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}