Does python urllib2 automatically uncompress gzip data fetched from webpage?

Question

I'm using

 data=urllib2.urlopen(url).read()

I want to know:

How can I tell if the data at a URL is gzipped?
Does urllib2 automatically uncompress the data if it is gzipped? Will the data always be a string?

Maybe worth noting that the requests library handles gzip compression automatically (see the FAQ)
– dbr, Commented Aug 3, 2013 at 9:39

Jay Taylor · Accepted Answer · 2016-09-15 16:51:55Z

154

How can I tell if the data at a URL is gzipped?

This checks if the content is gzipped and decompresses it:

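import gzip
import urllib2
from StringIO import StringIO

request = urllib2.Request('http://example.com/')
# Ask the server for a gzip-compressed response.
request.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(request)
data = response.read()
if response.info().get('Content-Encoding') == 'gzip':
    # Wrap the compressed bytes in a file-like object so GzipFile can read them.
    buf = StringIO(data)
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()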

Does urllib2 automatically uncompress the data if it is gzipped? Will the data always be a string?

No. urllib2 does not automatically uncompress the data. It does not set the 'Accept-Encoding' header either, so compression is something you request yourself with: request.add_header('Accept-Encoding', 'gzip, deflate')
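As a rough sketch of what handling both requested encodings could look like, the helper below picks a decoder from the Content-Encoding header. It is written for Python 3, where urllib2 became urllib.request. The names decode_body and fetch are made up for illustration, and the raw-deflate fallback is an assumption about servers that send deflate data without the zlib wrapper.

```python
import gzip
import urllib.request
import zlib


def decode_body(raw, content_encoding):
    """Return the decoded bytes for a response body (hypothetical helper)."""
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        return gzip.decompress(raw)
    if encoding == 'deflate':
        try:
            # Standard case: deflate data wrapped in a zlib header.
            return zlib.decompress(raw)
        except zlib.error:
            # Assumption: some servers send raw deflate with no zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # No (or unknown) Content-Encoding: hand the bytes back untouched.
    return raw


def fetch(url):
    """Request a URL with compression enabled and return the decoded bytes."""
    request = urllib.request.Request(url)
    request.add_header('Accept-Encoding', 'gzip, deflate')
    with urllib.request.urlopen(request) as response:
        return decode_body(response.read(),
                           response.headers.get('Content-Encoding'))
```

Keeping the decoding in its own function means it can be exercised on in-memory bytes, without touching the network.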

edited Sep 15, 2016 at 16:51

Jay Taylor

answered Oct 16, 2010 at 1:21

ars

10 Comments

daniyalzade Over a year ago

bobince has a point, urllib2 would not be sending the appropriate headers, so the response will not be gzipped.

phobie Over a year ago

In Py3k use io.BytesIO instead of StringIO.StringIO!

Sam Over a year ago

Relevant: Why you can't stream urllib into gzip enricozini.org/2011/cazzeggio/python-gzip

jfs Over a year ago

@tommy.carstensen: here's a Python 3 code example

Eyal Over a year ago

@daniyalzade I'm working with a website that gzipped the response even though the request did not specify it.
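Pulling the Python 3 comments together, a port of the answer's approach might look roughly like this. It is a sketch, not the code jfs linked to: urllib2 is replaced by urllib.request and StringIO by io.BytesIO, and the helper names gunzip and fetch_text plus the UTF-8 fallback are assumptions. It also touches the "always a string?" part of the question: in Python 3, read() returns bytes, so the text has to be decoded explicitly.

```python
import gzip
import io
import urllib.request


def gunzip(raw):
    """Decompress gzip bytes through a file-like wrapper, as the answer does."""
    with gzip.GzipFile(fileobj=io.BytesIO(raw)) as f:
        return f.read()


def fetch_text(url, default_charset='utf-8'):
    """Fetch a URL and return its body as str (hypothetical helper)."""
    request = urllib.request.Request(url, headers={'Accept-Encoding': 'gzip'})
    with urllib.request.urlopen(request) as response:
        raw = response.read()  # bytes in Python 3, never str
        if response.headers.get('Content-Encoding') == 'gzip':
            raw = gunzip(raw)
        # Assumption: fall back to UTF-8 when the server names no charset.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset)
```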

bobince · Accepted Answer · 2010-10-16 01:28:21Z

8

If you are talking about a simple .gz file, then no: urllib2 will not decode it, and you will get the unchanged .gz file as output.

If you are talking about automatic HTTP-level compression using Content-Encoding: gzip or deflate, then that has to be deliberately requested by the client using an Accept-Encoding header.

urllib2 doesn't set this header, so the response it gets back will not be compressed. You can safely fetch the resource without having to worry about compression (though since compression isn't supported the request may take longer).
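Since a plain .gz file arrives unchanged, one fallback is to look at the payload itself rather than at the headers: gzip streams start with the two bytes 0x1f 0x8b. The sketch below assumes Python 3, and looks_gzipped and maybe_gunzip are illustrative names, not part of any library.

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def looks_gzipped(raw):
    """Return True if the bytes start with the gzip magic number."""
    return raw[:2] == GZIP_MAGIC


def maybe_gunzip(raw):
    """Decompress only when the payload looks like gzip; otherwise pass through."""
    if looks_gzipped(raw):
        return gzip.decompress(raw)
    return raw
```

The check is only a heuristic: data that happens to begin with those two bytes but is not gzip would make gzip.decompress raise an error.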

answered Oct 16, 2010 at 1:28

bobince

1 Comment

Andres Riofrio Over a year ago

This doesn't seem to be true for all popular servers. Try curl -vI http://en.wikipedia.org/wiki/Spanish_language |& grep '^[<>]'

RuiDC · Accepted Answer · 2013-07-30 09:51:07Z

5

Your question has already been answered, but for a more comprehensive implementation, take a look at Mark Pilgrim's. It covers gzip, deflate, safe URL parsing and much more. It was written for a widely used RSS parser, but it is a useful reference nevertheless.

edited Jul 30, 2013 at 9:51

answered Aug 9, 2011 at 20:05

RuiDC

AXO · Accepted Answer · 2024-05-23 09:15:53Z

0

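I'd suggest using gzip.decompress (available in Python 3.2 and later) instead of gzip.GzipFile, so that there is no need for StringIO. Since gzip.decompress is Python 3 only, the example uses urllib.request, the Python 3 successor to urllib2:

import gzip
import urllib.request

request = urllib.request.Request('http://example.com/')
request.add_header('Accept-Encoding', 'gzip')
response = urllib.request.urlopen(request)
data = response.read()
if response.info().get('Content-Encoding') == 'gzip':
    # gzip.decompress returns the decompressed bytes directly.
    data = gzip.decompress(data)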

answered May 23, 2024 at 9:15

AXO

RobotHumans · Accepted Answer · 2018-09-01 14:23:37Z

It appears urllib3 handles this automatically now.

Reference headers:

HTTPHeaderDict({'ETag': '"112d13e-574c64196bcd9-gzip"', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'X-Frame-Options': 'sameorigin', 'Server': 'Apache', 'Last-Modified': 'Sat, 01 Sep 2018 02:42:16 GMT', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Type': 'text/plain; charset=utf-8', 'Strict-Transport-Security': 'max-age=315360000; includeSubDomains', 'X-UA-Compatible': 'IE=edge', 'Date': 'Sat, 01 Sep 2018 14:20:16 GMT', 'Accept-Ranges': 'bytes', 'Transfer-Encoding': 'chunked'})

Reference code:

import gzip
import io
import urllib3

class EDDBMultiDataFetcher():
    def __init__(self):
        self.files_dict = {
            'Populated Systems':'http://eddb.io/archive/v5/systems_populated.jsonl',
            'Stations':'http://eddb.io/archive/v5/stations.jsonl',
            'Minor factions':'http://eddb.io/archive/v5/factions.jsonl',
            'Commodities':'http://eddb.io/archive/v5/commodities.json'
            }
        self.http = urllib3.PoolManager()
    def fetch_all(self):
        for item, url in self.files_dict.items():
            self.fetch(item, url)

    def fetch(self, item, url, save_file = None):
        print("Fetching: " + item)
        # http.request() returns the response; its .data is already decompressed here.
        response = self.http.request(
            'GET',
            url,
            headers={
                'Accept-encoding': 'gzip, deflate, sdch'
                })
        data = response.data.decode('utf-8')
        print("Fetch complete")
        print(data)
        print(response.headers)
        quit()  # exits the interpreter after the first file


if __name__ == '__main__':
    print("Fetching files from eddb.io")
    fetcher = EDDBMultiDataFetcher()
    fetcher.fetch_all()

urllib3 is a third-party package. It's not an upgrade of urllib2.
