
Python vs SAP ABAP in ETL: Which is Superior?

By Sergei Peleshuk

Introduction

Having spent over twenty years on SAP projects, my expertise has been deeply rooted in using ABAP for everything from ETL (Extract, Transform, Load) to various coding tasks. This extensive experience has given me a thorough understanding of ABAP's strengths and limitations. When I started working with non-SAP data platforms, I became curious about Python's capabilities in data transformation, and whether it offers any significant benefits over ABAP.


In recent years Python, known for its simplicity and flexibility, has become a popular choice for ETL operations. Its ease of use and extensive library ecosystem, including powerful tools like Pandas for data manipulation and Apache Airflow for workflow automation, make it a go-to choice for many data professionals. Python's ability to handle large datasets and integrate with other technologies like Apache Spark further enhances its appeal in the ETL domain.


My initial dive into Python showed me that it's much more flexible and easier to use compared to ABAP. Simple things like adding new columns to data tables, changing the type of data in a column, or combining data from different sources are much simpler in Python. The large number of libraries and functions available for almost any task you can think of is another major advantage. What's more, in today’s world, developers don’t need to remember all these details. Tools like ChatGPT can quickly help find the right function or code example, saving a lot of time and effort.


Python is not just good for traditional structured data. It's also great for working with unstructured and semi-structured data like files, documents, and data from web services (RESTful APIs). Plus, Python's capabilities in artificial intelligence (AI) and machine learning (ML) mean it can handle tasks that go way beyond basic ETL.
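As a small illustration of the semi-structured case, the sketch below flattens nested JSON of the kind a REST API might return into a Pandas DataFrame. The payload is an inline sample invented for illustration; in practice it would come from an HTTP call (e.g. `requests.get(...).json()`):

```python
import json

import pandas as pd

# Sample payload shaped like a typical REST API response (inline here
# for illustration; a real pipeline would fetch it from an endpoint)
payload = """
{
  "orders": [
    {"id": 1, "customer": {"name": "Acme", "country": "DE"}, "total": 120.5},
    {"id": 2, "customer": {"name": "Globex", "country": "US"}, "total": 75.0}
  ]
}
"""

data = json.loads(payload)

# json_normalize flattens the nested 'customer' object into dotted columns
# such as 'customer.name' and 'customer.country'
df = pd.json_normalize(data["orders"])
print(df)
```

From here the flattened DataFrame can be cleaned and joined like any other table.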


Starting this journey with Python has been an eye-opener, showing how it could potentially offer more efficient and diverse ways to handle data transformations compared to ABAP.


Example 1. Basic ETL Tasks

With its rich library ecosystem, Python is often used for ETL processes. Below are a few basic code snippets leveraging the Pandas library.


import pandas as pd

# Removing duplicates
df = df.drop_duplicates()

# Handling missing data with a forward fill
# (fillna(method='ffill') is deprecated in pandas 2.x; use ffill() instead)
df = df.ffill()

# Adding a new column
df['new_column'] = df['existing_column1'] + df['existing_column2']

# Converting data types
df['column_to_convert'] = df['column_to_convert'].astype('float')

# Dropping invalid rows with null coordinates
to_drop = df[df['longitude'].isnull() | df['latitude'].isnull()].index
df.drop(to_drop, inplace=True)
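To see these snippets end to end, here is a small self-contained run on a toy DataFrame (the column names and values are illustrative, not from any particular dataset):

```python
import numpy as np
import pandas as pd

# Toy dataset with a duplicated row, a missing value, and string-typed numbers
df = pd.DataFrame({
    'existing_column1': [1, 1, 2, 3],
    'existing_column2': [10, 10, np.nan, 30],
    'column_to_convert': ['1.5', '1.5', '2.5', '3.5'],
})

df = df.drop_duplicates()   # removes the repeated first row
df = df.ffill()             # forward-fills the NaN in existing_column2
df['column_to_convert'] = df['column_to_convert'].astype('float')
df['new_column'] = df['existing_column1'] + df['existing_column2']

print(df)
```

After these four steps the frame has three rows, no missing values, a numeric `column_to_convert`, and a computed `new_column`.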


Example 2. Generate Dates Table Used in Financial Reports and Visualizations

A Dates table is often required for Year to Date (YTD), Month to Date (MTD), Prior Year (PY), and similar calculations and comparisons. Here is sample Python code that quickly generates such a table:


import pandas as pd

# Define the start and end year for the date range
start_year = 2010
end_year = 2025

# Generate the date range
date_range = pd.date_range(start=f'{start_year}-01-01', end=f'{end_year}-12-31', freq='D')

# Create the DataFrame
date_table = pd.DataFrame(date_range, columns=['Date'])

# Extract year, quarter, and month
date_table['Year'] = date_table['Date'].dt.year
date_table['Quarter'] = date_table['Date'].dt.quarter
date_table['Month'] = date_table['Date'].dt.month

# Combine year and zero-padded month into a period key (e.g. '2010-01')
date_table['FiscalPeriod'] = date_table['Year'].astype(str) + '-' + date_table['Month'].apply(lambda x: '{:02d}'.format(x))

date_table['Day'] = date_table['Date'].dt.day
date_table['DayOfWeek'] = date_table['Date'].dt.dayofweek     # Monday=0, Sunday=6
date_table['DayName'] = date_table['Date'].dt.day_name()      # Name of the day
date_table['MonthName'] = date_table['Date'].dt.month_name()  # Name of the month
date_table['Period'] = date_table['MonthName'].str[0:3] + '-' + date_table['Year'].astype(str)

display(date_table)  # display() works in notebooks; use print(date_table) in scripts


Here is what the result table looks like:

Dates Table

This table is used in all date-related calculations, Power BI DAX formulas, and visualizations. And of course, this simple Python code can be easily enhanced with any other date-related attributes and texts used in the organization.
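As one such enhancement, the sketch below adds a weekend flag and a fiscal year column. The April fiscal-year start is an assumption for illustration; substitute your organization's actual offset:

```python
import pandas as pd

date_table = pd.DataFrame({'Date': pd.date_range('2023-01-01', '2023-12-31', freq='D')})

# Weekend flag: Saturday=5, Sunday=6 in pandas' dayofweek numbering
date_table['IsWeekend'] = date_table['Date'].dt.dayofweek >= 5

# Fiscal year starting in April (illustrative assumption):
# January-March belong to the previous fiscal year
date_table['FiscalYear'] = date_table['Date'].dt.year.where(
    date_table['Date'].dt.month >= 4,
    date_table['Date'].dt.year - 1,
)
```

With this rule, 2023-01-15 falls in fiscal year 2022, while 2023-05-01 falls in fiscal year 2023.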


Example 3. Joining two tables (DataFrames)

Table joins are usually expensive operations that can consume substantial compute on large datasets. This is one reason a multi-layered (medallion) data warehouse architecture may be essential for achieving good performance on user queries and dashboards.


To join two DataFrames in Python with Pandas, you can use the merge function. This function is highly flexible and allows for various types of joins (similar to SQL joins), such as inner, outer, left, and right joins. Here is a basic example to demonstrate how to use it:


import pandas as pd

# Assume two DataFrames, df1 and df2, sharing a common column 'key'
# on which you want to join

# Example DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
                    'value': [5, 6, 7, 8]})

# Performing an inner join on 'key'
# ('how' can be 'left', 'right', 'outer', or 'inner')
merged_df = pd.merge(df1, df2, on='key', how='inner')
print(merged_df)


Here is the output of the Join operation:

  key  value_x  value_y
0   B        2        5
1   D        5        6
2   D        5        7
In this example, merged_df contains only the rows whose 'key' values appear in both df1 and df2 (an inner join); because both inputs have a 'value' column, Pandas suffixes them as value_x and value_y.
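For completeness, the same data can be joined with how='left' to keep every row of df1; a minimal sketch, also using merge's indicator option to show where each row came from:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'], 'value': [5, 6, 7, 8]})

# Left join keeps every row of df1; keys missing from df2 get NaN in value_y,
# and indicator=True adds a '_merge' column marking each row's origin
left_df = pd.merge(df1, df2, on='key', how='left', indicator=True)
print(left_df)
```

Rows A and C appear with '_merge' equal to 'left_only' and NaN in value_y, while B and D are marked 'both'.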


Possible Analytics Architecture with SAP S/4HANA or SAP ECC

When designing a data analytics architecture involving SAP ERP systems like SAP S/4HANA or SAP ECC, along with non-SAP data sources, including unstructured and real-time data, a robust and flexible approach is needed. Below I outline a possible modern analytics architecture that can be suitable for multiple types of data sources.

Modern Data Architecture

Integrating Microsoft Fabric into your analytics architecture offers a unified data analytics platform for consolidating various data sources, including SAP ERP, non-SAP databases, and unstructured data. This platform enhances data processing and analytics capabilities, leveraging built-in Python support and PySpark for advanced, large-scale data operations in a distributed environment. The cloud-based nature of Microsoft Fabric (SaaS) ensures scalability, flexibility, and efficient resource management, making it a robust solution for modern enterprise data analytics.


Final Words

As the business world increasingly leans towards data-driven decision-making, the selection of the right data architecture, platforms, and ETL processes becomes crucial. The ETL capabilities of Python, highlighted in this discussion, showcase its simplicity, power, and ability to effectively manage both SAP and non-SAP datasets. This highlights Python's adaptability and how well it complements the established strengths of ABAP, seamlessly integrating into various data architectures to bolster data processing and business intelligence.


