I tried chardet on the non-US files from this dataset: https://www.kaggle.com/datasnaek/youtube-new/downloads/youtube-new.zip/40. It detects windows-1254 instead of the correct utf-8, which is what the Linux file utility reports.
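In case it helps anyone hitting the same thing, here is a rough sketch of a cross-check: try a strict UTF-8 decode first and only fall back to chardet if that fails. The function name `detect_encoding` and the "unknown" fallback are my own choices for illustration, not part of chardet, and this is only a workaround under the assumption that bytes which decode cleanly as UTF-8 really are UTF-8.

```python
def detect_encoding(data: bytes) -> str:
    """Guess the encoding of raw bytes, preferring a strict UTF-8 check.

    Hypothetical helper: a cross-check around chardet, not chardet's own API.
    """
    # Text that decodes under strict UTF-8 rules is very likely UTF-8,
    # so test that before asking a statistical detector.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Fall back to chardet only when the bytes are not valid UTF-8.
    try:
        import chardet
    except ImportError:
        return "unknown"

    # chardet.detect returns a dict whose "encoding" value may be None.
    return chardet.detect(data)["encoding"] or "unknown"
```

To use it on a file, read the bytes with `open(path, "rb").read()` and pass them in. Note that pure ASCII input is also reported as utf-8 here, since ASCII is a subset of UTF-8.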