Technical

From Frustration to Fluency: Demystifying Python Encoding

Connor Kerling 6 min read. Feb 12

Contents

Introduction

Section 1: The Problem

Section 2: The Code

Section 3: Conclusion

About the Author

Introduction

Is your code behaving mysteriously? Are strange characters appearing seemingly out of thin air? More often than not, encoding issues are the hidden culprit. In this blog post, you’ll learn how to identify and avoid encoding issues when reading files in Python. Bonus tip: if you’re working with Azure Blob Storage or Pandas, you’ll find this particularly useful!

Section 1: The Problem

Picture this scenario: you’re trying to read a CSV file into a Pandas data frame using Python; however, you keep receiving a UnicodeDecodeError. The error message states that the UTF-8 codec is unable to decode a byte somewhere in the file.

In your quest for answers, you inspect the file in Notepad++, only for Notepad++ to assure you that the file is UTF-8. You’re left confused, scratching your head, and your data frame is still as empty as ever.

While the Pandas read_csv() function is easy to use, what many users may not realise is that it employs UTF-8 as the default encoding. In my case, this default setting was the bane of my existence: the files I was reading were not UTF-8 and, for this reason, were producing decoding errors. If you’ve ever encountered this situation or something similar, you will know how frustrating and confusing it can be to find a file’s original encoding.
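The failure can be reproduced without Pandas or a real file. The sketch below is an illustration rather than the code from my project: the sample text and the choice of cp1252 (Windows-1252) are assumptions. It encodes a small CSV string as cp1252, tries to decode it as UTF-8, and uses the exception’s `start` attribute to locate the first offending byte.

```python
# Made-up CSV content, encoded the way a Windows-1252 export might be.
raw = "name,city\nZoë,Zürich\n".encode("cp1252")

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # Keep a reference; the name `exc` is cleared when the except block ends.
    error = exc

if error is not None:
    # error.start is the offset of the first byte that is not valid UTF-8.
    bad_byte = raw[error.start:error.start + 1]
    print(f"{error.encoding} failed at byte offset {error.start}: {bad_byte!r}")

# Decoding with the encoding the bytes were written in succeeds.
text = raw.decode("cp1252")
print(text)
```

Knowing the byte offset makes it easier to open the file in a hex viewer and see which character is causing trouble.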

So, that raises the question: how can we determine a file’s encoding in Python? One solution is to harness the power of the Chardet package. Chardet is an easy-to-use, universal encoding detector; the 5.2.0 release listed under Additional Resources requires Python 3.7 or higher. Keep in mind that its result is a best guess and comes with a confidence score. In the code example below, I will demonstrate how to use Chardet to detect the file’s encoding and correctly read CSV data.

Note: In my specific case, I was trying to read in files from an Azure Blob Storage Account and subsequently load the blob data into a pandas data frame. The code example below outlines that process.

Section 2: The Code

Step 1: Setting up our Storage Blob variables

In this section, we set up all our Azure Storage Blob variables, including our Azure Storage Account connection string, container name, and file name.

# Replace placeholders
constr = 'Insert Storage Account Connection String Here'
container_name = 'Insert Container Name Here'
blob_filename = 'Insert File Name Here'

Step 2: Setting up Container Client

Now, we’ll set up a blob client instance using our previously declared variables, going from the service client to the container client and finally to the blob itself. This client will allow us to interact with the blob in our Azure Blob Storage Account.

from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(constr)
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(blob_filename)

Step 3: Setting up a temporary file to write blob data to

In this step, we set up a temp file and write the data from our Azure Blob to this file.

import tempfile

# Setting up temp file; delete=False keeps it on disk after close()
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()

# Writing blob data to the temp file
with open(file=tmp.name, mode='wb') as file:
    download_stream = blob_client.download_blob()
    file.write(download_stream.readall())
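The temp-file round trip can be tried without a storage account. The sketch below is an illustration, not the exact code from my project: a made-up byte string stands in for `download_stream.readall()`, `delete=False` keeps the file on disk so it can be reopened by name, and the file is removed explicitly once its contents have been read.

```python
import os
import tempfile

# Stand-in for download_stream.readall(); the payload is made up.
payload = "id,name\n1,Zoë\n".encode("cp1252")

# delete=False keeps the file on disk after the handle is closed,
# so it can be reopened by name afterwards.
with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

try:
    with open(tmp_path, mode="rb") as file:
        data = file.read()
finally:
    # Cleanup is our responsibility because delete=False was used.
    os.remove(tmp_path)

print(len(data), "bytes read back")
```

Removing the file in a `finally` block means it does not linger in the temp directory if reading fails.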

Step 4: Reading the temporary file to detect and handle encoding issues

Now, we’ll open the temp file we’ve created as a binary file and read in its data. Using the opened file, we then leverage the `chardet` package to detect the encoding of the file. We’ll then use the detected encoding when reading the CSV file with Pandas to ensure it’s correctly decoded.

from io import BytesIO
import pandas as pd
import chardet # Our saving grace

# Reading temp csv file to check encoding and loading into pandas data frame
with open(file=tmp.name, mode='rb') as file:
    data = file.read()

# Using chardet to guess the file's encoding (the result is None if detection fails)
encoding = chardet.detect(data)['encoding']
df = pd.read_csv(BytesIO(data), keep_default_na=False, encoding=encoding)

And there we have it! We’ve configured our Azure Storage Blob variables, retrieved the blob data, and decoded it using the encoding chardet detected. This approach avoids decode errors when working with files whose encodings differ, and it can be applied to many other processes.
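One caveat: `chardet.detect()` returns a dictionary with an `encoding` and a `confidence` value, and the encoding can be `None` when nothing is detected (for example, on empty input). The helper below is a hypothetical addition, not part of chardet; the name `choose_encoding`, the UTF-8 fallback, and the 0.5 threshold are all assumptions made for illustration. It falls back to a default when the guess is missing or weak.

```python
def choose_encoding(result, fallback="utf-8", min_confidence=0.5):
    """Return the detected encoding, or `fallback` if the guess is missing or weak.

    `result` is a dict shaped like the one chardet.detect() returns.
    The 0.5 threshold is an arbitrary illustrative default.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# Hand-written dicts in the shape chardet.detect() returns.
confident = {"encoding": "Windows-1252", "confidence": 0.73, "language": ""}
undetected = {"encoding": None, "confidence": 0.0, "language": None}

print(choose_encoding(confident))   # Windows-1252
print(choose_encoding(undetected))  # utf-8
```

In Step 4, this would slot in as `encoding = choose_encoding(chardet.detect(data))` before the call to `pd.read_csv()`.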

Section 3: Conclusion

Finding the encoding of a file in Python can be a frustrating roadblock. However, by identifying the problem and leveraging the Chardet package, you can confidently detect and handle file encodings, ensuring seamless data processing in Python. So, the next time you catch yourself second-guessing a file’s encoding, remember that the solution is just one import statement away.

For those navigating the complexities of data migration or cloud-based data management, challenges like this underscore the importance of having a strategic partner in data and integration. With the proper guidance, you can focus more on leveraging your data for business insights and less on troubleshooting technical issues.

Additional Resources

Chardet 5.2.0: https://pypi.org/project/chardet/

Chardet documentation: https://chardet.readthedocs.io/en/latest/

Connor Kerling

