Introduction
Having spent over twenty years on SAP projects, I have built my expertise on using ABAP for everything from ETL (Extract, Transform, Load) to general coding tasks. This extensive experience has given me a thorough understanding of ABAP's strengths and limitations. When I started working with non-SAP data platforms, I became curious about Python's capabilities in data transformations and whether it offers any significant benefits over ABAP.
In recent years, Python, known for its simplicity and flexibility, has become a popular choice for ETL operations. Its ease of use and extensive library ecosystem, including powerful tools like Pandas for data manipulation and Apache Airflow for workflow automation, make it a go-to choice for many data professionals. Python's ability to handle large datasets and integrate with other technologies like Apache Spark further enhances its appeal in the ETL domain.
My initial dive into Python showed me that it's much more flexible and easier to use than ABAP. Everyday tasks like adding new columns to data tables, changing a column's data type, or combining data from different sources take far less code in Python. The large number of libraries and functions available for almost any task you can think of is another major advantage. What's more, in today's world, developers don't need to remember all these details. Tools like ChatGPT can quickly help find the right function or code example, saving a lot of time and effort.
Python is not just good for traditional structured data. It's also great for working with unstructured and semi-structured data like files, documents, and data from web services (RESTful APIs). Plus, Python's capabilities in artificial intelligence (AI) and machine learning (ML) mean it can handle tasks that go way beyond basic ETL.
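For instance, here is a minimal sketch of pulling semi-structured JSON from a RESTful API into a Pandas DataFrame; the endpoint URL and field structure are hypothetical placeholders:
import requests
import pandas as pd
# Hypothetical endpoint; replace with a real API URL
response = requests.get('https://api.example.com/orders')
response.raise_for_status()
records = response.json() # typically a list of nested JSON objects
# Flatten the nested JSON into a tabular DataFrame
df = pd.json_normalize(records)
print(df.head())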
Starting this journey with Python has been an eye-opener, showing how it could potentially offer more efficient and diverse ways to handle data transformations compared to ABAP.
Example 1. Basic ETL Tasks
With its rich library ecosystem, Python is often used for ETL processes. Here I give a few basic examples of Python code snippets leveraging the Pandas library.
import pandas as pd
# Removing duplicates
df = df.drop_duplicates()
# Handling missing data
df = df.ffill() # Forward fill (replaces fillna(method='ffill'), which is deprecated in newer Pandas versions)
# Adding a new column
df['new_column'] = df['existing_column1'] + df['existing_column2']
# Converting data types
df['column_to_convert'] = df['column_to_convert'].astype('float')
# Drop invalid rows with Null coordinates
df = df.dropna(subset=['longitude', 'latitude'])
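To see these snippets work end to end, here is a self-contained sketch on a toy DataFrame; the data and column names are invented for illustration:
import pandas as pd
import numpy as np
# Toy dataset with a duplicated row, a gap, and a missing coordinate
df = pd.DataFrame({
    'existing_column1': [1, 1, 2, 3],
    'existing_column2': [10.0, 10.0, np.nan, 30.0],
    'latitude': [52.5, 52.5, 48.9, 47.0],
    'longitude': [13.4, 13.4, 2.35, np.nan]})
df = df.drop_duplicates() # the repeated first row goes away
df['existing_column2'] = df['existing_column2'].ffill() # forward-fill the gap
df['new_column'] = df['existing_column1'] + df['existing_column2']
df = df.dropna(subset=['longitude', 'latitude']) # drop the row with a Null coordinate
print(df)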
Example 2. Generate Dates Table Used in Financial Reports and Visualizations
A Dates table is often required for complex Year-to-Date (YTD), Month-to-Date (MTD), Prior Year (PY), and similar calculations and comparisons. Here is sample Python code that quickly generates such a table:
import pandas as pd
# Define the start and end year for your date range
start_year = 2010
end_year = 2025
# Generate the date range
date_range = pd.date_range(start=f'{start_year}-01-01', end=f'{end_year}-12-31', freq='D')
# Create the DataFrame
date_table = pd.DataFrame(date_range, columns=['Date'])
# Extract year, quarter, month, and day
date_table['Year'] = date_table['Date'].dt.year
date_table['Quarter'] = date_table['Date'].dt.quarter
date_table['Month'] = date_table['Date'].dt.month
# Build FiscalPeriod as 'YYYY-MM' (assuming the fiscal year equals the calendar year)
date_table['FiscalPeriod'] = date_table['Date'].dt.strftime('%Y-%m')
date_table['Day'] = date_table['Date'].dt.day
date_table['DayOfWeek'] = date_table['Date'].dt.dayofweek # Monday=0, Sunday=6
date_table['DayName'] = date_table['Date'].dt.day_name() # Name of the day
date_table['MonthName'] = date_table['Date'].dt.month_name() # Name of the month
date_table['Period'] = date_table['MonthName'].str[0:3]+'-'+date_table['Year'].astype(str)
display(date_table) # display() renders the table in Jupyter notebooks; use print(date_table) elsewhere
Here are the first few rows of the result table:
        Date  Year  Quarter  Month FiscalPeriod  Day  DayOfWeek   DayName MonthName    Period
0 2010-01-01  2010        1      1      2010-01    1          4    Friday   January  Jan-2010
1 2010-01-02  2010        1      1      2010-01    2          5  Saturday   January  Jan-2010
2 2010-01-03  2010        1      1      2010-01    3          6    Sunday   January  Jan-2010
This table is used in all date-related calculations, Power BI DAX formulas, and visualizations. And of course, this simple Python code can be easily enhanced with any other date-related attributes and texts used in the organization.
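As one possible enhancement, the sketch below adds YTD and MTD helper flags and a Prior Year date column to the table generated above; the reporting cut-off date is an assumed example value:
# Assumed reporting cut-off date (replace with your own logic)
report_date = pd.Timestamp(2025, 6, 15)
# YTD: same year as the cut-off, on or before it
date_table['IsYTD'] = (date_table['Year'] == report_date.year) & (date_table['Date'] <= report_date)
# MTD: additionally restricted to the cut-off month
date_table['IsMTD'] = date_table['IsYTD'] & (date_table['Month'] == report_date.month)
# Same date one year earlier, useful for Prior Year comparisons
date_table['PYDate'] = date_table['Date'] - pd.DateOffset(years=1)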
Example 3. Joining Two Tables (DataFrames)
Table joins are usually expensive operations that can consume substantial processing capacity on large datasets. Therefore, a multi-layered (medallion) data warehouse architecture may be essential for achieving good performance of user queries and dashboards.
To join two DataFrames in Python with Pandas, you can use the merge function. This function is highly flexible and allows for various types of joins (similar to SQL joins), such as inner, outer, left, and right joins. Here is a basic example to demonstrate how to use it:
import pandas as pd
# Assuming you have two DataFrames: df1 and df2
# Let's say both DataFrames have a common column 'key' on which you want to join
# Example DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
                    'value': [5, 6, 7, 8]})
# Performing an inner join on 'key'
merged_df = pd.merge(df1, df2, on='key', how='inner')
# 'how' can be 'left', 'right', 'outer', 'inner'
print(merged_df)
Here is the output of the Join operation:
  key  value_x  value_y
0   B        2        5
1   D        5        6
2   D        5        7
In this example, merged_df contains the joined data based on the common 'key' column from both df1 and df2 (inner join). Because both DataFrames have a column named 'value', Pandas automatically distinguishes them with the suffixes _x and _y.
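For comparison, a left join on the same DataFrames keeps every row of df1 and fills the unmatched rows ('A' and 'C') with NaN; the optional suffixes parameter lets you control the overlapping column names:
# Left join keeps all rows from df1
left_df = pd.merge(df1, df2, on='key', how='left', suffixes=('_df1', '_df2'))
print(left_df)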
Possible Analytics Architecture with SAP S/4HANA or SAP ECC
When designing a data analytics architecture involving SAP ERP systems like SAP S/4HANA or SAP ECC, along with non-SAP data sources, including unstructured and real-time data, a robust and flexible approach is needed. Below I outline a possible modern analytics architecture that can be suitable for multiple types of data sources.
Integrating Microsoft Fabric into your analytics architecture offers a unified data analytics platform for consolidating various data sources, including SAP ERP, non-SAP databases, and unstructured data. This platform enhances data processing and analytics capabilities, leveraging built-in Python support and PySpark for advanced, large-scale data operations in a distributed environment. The cloud-based nature of Microsoft Fabric (SaaS) ensures scalability, flexibility, and efficient resource management, making it a robust solution for modern enterprise data analytics.
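To give a flavor of PySpark in such an environment, here is a minimal sketch of a bronze-to-silver transformation; the table and column names are hypothetical, and a SparkSession named spark is assumed to be provided (as it is in Microsoft Fabric notebooks):
from pyspark.sql import functions as F
# Hypothetical bronze-layer table holding raw SAP sales data
sales = spark.read.table('bronze_sap_sales')
# Cleanse and aggregate into a silver-layer table
silver = (sales
    .dropDuplicates()
    .filter(F.col('amount').isNotNull())
    .groupBy('company_code', 'fiscal_period')
    .agg(F.sum('amount').alias('total_amount')))
silver.write.mode('overwrite').saveAsTable('silver_sap_sales_summary')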
Final Words
As the business world increasingly leans towards data-driven decision-making, selecting the right data architecture, platforms, and ETL processes becomes crucial. The ETL capabilities of Python, demonstrated in this discussion, showcase its simplicity, power, and ability to manage both SAP and non-SAP datasets effectively. Python's adaptability complements the established strengths of ABAP, integrating seamlessly into various data architectures to bolster data processing and business intelligence.