This is the fourth post in a series about Python PEPs. If you are a programmer handling large amounts of data, you are probably using a schema to organise it. One good way to do this in Python is to use a class to store the data attributes. But if you have a large number of attributes, you end up writing a lot of boilerplate code to store them in classes and perform simple operations on them. Data classes are one way to reduce that boilerplate while organising data. They also offer some data safety, in a very loose manner, and you get some free goodies from Python. Let us see how.

Why use Data Classes

Let us say you have a set of 1000 users in a database with their usernames and email IDs. You retrieve them and want to organise them for your use case. You can write a regular class.

class User:
	def __init__(self, username: str, email: str):
		self.username = username
		self.email = email

Now if you want to display information about the user, you should add the __repr__ dunder method (a special method whose name is surrounded by double underscores) to your class.

class User:
	def __init__(self, username: str, email: str):
		self.username = username
		self.email = email

	def __repr__(self):
		return f"Username: {self.username} email: {self.email}"

As the number of attributes increases, you have to write more boilerplate code to store the attributes and perform simple operations like printing. Python 3.7 has a cool new feature that allows you to organise the attributes in a class. It automatically writes utility methods to operate on those attributes. You provide the configuration and Python writes the code for you (You heard it right! Code without coding. Free code!). Everything else remains the same: all the functionality that is available for a normal class is also available for the data class.

from dataclasses import dataclass 

@dataclass
class User:
	username: str
	email: str 

>> user1 = User('johndoe', 'johndoe@example.com')

Add the dataclass class decorator to a class (if you do not know what decorators are, that is okay: you can think of one as a name with a funny little @ character in front that does some magic and provides your class with all the methods it needs). Declare your attributes and their types (if you did not know, you can add type hints to variables in Python; they make the code easier to read and help tools validate your types, a topic for another day). This code is much simpler than the normal way of declaring classes. What is cooler? The __init__ and __repr__ methods are automatically added to your class.
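To see what the decorator generates, here is a small sketch that reuses the User class from above. Besides __init__ and __repr__, the decorator also adds an __eq__ method by default, so two instances with the same field values compare as equal. The second instance, user2, is only there for illustration.

```python
from dataclasses import dataclass


@dataclass
class User:
    username: str
    email: str


user1 = User("johndoe", "johndoe@example.com")
user2 = User("johndoe", "johndoe@example.com")

# The generated __repr__ lists the field names and their values
print(user1)  # User(username='johndoe', email='johndoe@example.com')

# The generated __eq__ compares the instances field by field
print(user1 == user2)  # True
```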

What Else Can You Do?

Initialise the Attributes

If you want to provide sensible default values for the attributes, you can add them in the data class.

from dataclasses import dataclass 

@dataclass 
class User:
	username: str = "johndoe"
	email: str = "johndoe@example.com"

>> user1 = User()
>> user1.username
$johndoe

>> user1.email
$johndoe@example.com

You can give defaults to some of the attributes and leave the rest to be passed in when the object is created. However, the attributes without defaults have to precede the ones with defaults, just like parameters in a function signature.


from dataclasses import dataclass 

@dataclass 
class User:
  email: str
  username: str = "johndoe"	

>> user1 = User(email="johndoe@example.com")
>> user1.username
$johndoe

>> user1.email
$johndoe@example.com
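As a rough illustration of the ordering rule, the sketch below declares the fields the wrong way round. The class name BadUser is made up for this example. The decorator rejects the class with a TypeError as soon as it is defined, before any object is created. The exact wording of the error message can differ between Python versions, so it is printed rather than quoted here.

```python
from dataclasses import dataclass

message = None

try:
    @dataclass
    class BadUser:
        username: str = "johndoe"
        email: str  # no default, but it follows a field that has one
except TypeError as error:
    # Raised while the class is being defined, not when it is instantiated
    message = str(error)
    print(message)
```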

Freeze the Data Class

Sometimes you do not want anyone to change the attributes of an object once it has been instantiated. That is, you want to define a read-only structure, which ensures that you do not accidentally change the attributes. You can pass frozen=True to the decorator. If you then try to change an attribute, Python raises a FrozenInstanceError.

from dataclasses import dataclass 

@dataclass(frozen=True)
class User:
	username: str
	email: str 
    
>> user1 = User("johndoe", "johndoe@example.com")
>> user1.username = "John Mary Doe"
$ ..
$ FrozenInstanceError: cannot assign to field 'username'
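If you do need a changed copy of a frozen object, one option is the replace function from the dataclasses module, which builds a new instance and leaves the original untouched. The sketch below also shows how the error can be caught. The new username is just an example value.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class User:
    username: str
    email: str


user1 = User("johndoe", "johndoe@example.com")

try:
    user1.username = "John Mary Doe"
except FrozenInstanceError:
    print("user1 is read-only")

# replace() creates a new User; user1 itself is not modified
user2 = replace(user1, username="johnmarydoe")
print(user2.username)  # johnmarydoe
print(user1.username)  # johndoe
```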

What happens when a field inside the data class is mutable? Say it is a list. If I change the list, would it change the attribute of the data class? Unfortunately, it does. frozen=True only blocks assigning a new value to the field; it does not protect the object the field points to. This is one of those situations where Python does not provide you with a safety net, and you are doomed if you are not careful. Oops!

from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class User:
	username: str
	email: str
	friends: List[str]

>> user1 = User("johndoe", "johndoe@example.com", ["maryjane"])
>> user1_friends = user1.friends
>> user1_friends.append("chloedan")
>> user1.friends
$ ["maryjane", "chloedan"]  # Oh no!
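One possible way around this, sketched here as a suggestion rather than the only fix, is to store the friends in a tuple instead of a list. A tuple has no append method, so the contents of the field cannot be changed in place.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class User:
    username: str
    email: str
    friends: Tuple[str, ...]  # a tuple cannot be modified after creation


user1 = User("johndoe", "johndoe@example.com", ("maryjane",))

try:
    user1.friends.append("chloedan")
except AttributeError:
    print("friends cannot be changed in place")

print(user1.friends)  # ('maryjane',)
```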

Advanced

Here are some more advanced features of data classes. You can safely skip this section if you are new to all the chaos.

More Control Over Your Fields

You can have more control over your fields by using the field function from dataclasses. Say you do not want to pass a field while creating an object of the data class. Then you can specify that using the init parameter of the field function.

from dataclasses import dataclass, field 

@dataclass
class User:
  first_name: str
  last_name: str
  full_name: str = field(init=False)  # do not require this field while instantiating the User class
    
>> user_1 = User("John", "Doe")

# full_name is not set yet, so assign it before using it
>> user_1.full_name = "John Doe"
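A field declared with init=False and no default does not exist on the object until you assign it. Reading it before that raises an AttributeError, and so does printing the object, because the generated __repr__ reads every field. The sketch below uses hasattr to check. The variable was_set is only there for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    first_name: str
    last_name: str
    full_name: str = field(init=False)  # left out of __init__, no default


user1 = User("John", "Doe")

# The attribute does not exist until it is assigned
was_set = hasattr(user1, "full_name")
print(was_set)  # False

user1.full_name = "John Doe"
print(user1.full_name)  # John Doe
```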

Post init

If certain attributes depend on other attributes for their values, they can be calculated in a special method called __post_init__. The generated __init__ calls __post_init__ as its last step, after the other fields have been set.

from dataclasses import dataclass, field

@dataclass
class User:
  first_name: str
  last_name: str
  full_name: str = field(init=False)  # do not require this field while instantiating the User class

  def __post_init__(self):
    self.full_name = self.first_name + " " + self.last_name
  
>> user1 = User("John", "Doe")
>> user1.full_name
$ John Doe
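If you combine __post_init__ with frozen=True, a plain assignment inside __post_init__ raises a FrozenInstanceError, because the object is already read-only at that point. A common workaround, sketched below, is to call object.__setattr__ directly for that one assignment.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    first_name: str
    last_name: str
    full_name: str = field(init=False)

    def __post_init__(self):
        # self.full_name = ... would raise FrozenInstanceError here,
        # so bypass the frozen __setattr__ for this single assignment
        object.__setattr__(self, "full_name", f"{self.first_name} {self.last_name}")


user1 = User("John", "Doe")
print(user1.full_name)  # John Doe
```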

Final note

There are some other functionalities like inheriting data classes and safely comparing objects of the same data class, which we did not discuss here in the interest of brevity.

If you need more functionality on your data, like validating the types and formats of the values, there is a Python package for it called Pydantic. If you are reading a lot of configuration information and the data you read needs validation, use this library. Pydantic can add validation to your data classes. If you are familiar with other languages, data classes are similar to a struct, which also creates a data type and provides similar functionality.
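To see why a validation library can be useful, note that a plain data class does not check the type hints at runtime. The sketch below uses only the standard library, not Pydantic. The class names LooseUser and CheckedUser and the hand-written check in __post_init__ are assumptions made for this example; a real project would more likely rely on a library for this.

```python
from dataclasses import dataclass


@dataclass
class LooseUser:
    username: str
    email: str


# Accepted without complaint: the annotations are not enforced at runtime
loose = LooseUser(42, None)


@dataclass
class CheckedUser:
    username: str
    email: str

    def __post_init__(self):
        # Hypothetical hand-rolled check, run right after __init__
        if not isinstance(self.username, str):
            raise TypeError("username must be a str")
        if not isinstance(self.email, str):
            raise TypeError("email must be a str")


rejected = False
try:
    CheckedUser(42, "johndoe@example.com")
except TypeError as error:
    rejected = True
    print(error)  # username must be a str
```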

Python keeps gaining useful features. I know that this has been a lengthy post, but data classes are one of the exciting features in Python, and they are especially useful in a data processing environment! You can read more in PEP 557, which introduced them.