CSV Diff Tool

Out of the Box Thinking to Solve Issues In the Box

Project information

  • Category: Data Manipulation
  • Company: Confidential
  • Project date: Confidential
  • Technologies: Python (Pandas, Numpy libraries)

Summary

My earliest project showcasing out-of-the-box thinking. The project was a CSV diff tool that compared two CSV files and outputted the differences in a way that the in-house technology could read where the action needed to be taken. The project was a success, saving 40+ hours of work for my team and was a stepping stone for my future career.

Rationale

During my career in advertisement, I addressed a significant inefficiency in handling .csv datasets from a major client. These datasets detailed ad pricing configurations across a range of applications and ad space types (interstitial, banner, rewarded, native etc.). The configurations were further segmented by attributes such as device types (e.g., smartphones vs. tablets), operating systems (e.g., iOS vs. Android), geographic locations, and audience demographics. These granular segmentations allowed the client to target ads with precision, but they also resulted in .csv files with tens of thousands of rows—sometimes exceeding 100,000 rows per update.

My analysis revealed that despite the size of these updates, the actual changes were relatively small. For example, imagine that only 20,000 rows contained modified pricing or targeting criteria, 7,000 rows were removed, and 5,000 rows represented new entries. However, the existing workflow redundantly reprocessed the entire dataset, causing unnecessary strain on the system and significantly delaying updates for both our team and the client.

To resolve this, I developed a Python-based solution to identify changes between the incoming .csv files and the existing data in our platform. By comparing attributes such as application ID, ad space type, pricing model, device type, and audience segmentation, the script isolated rows that were modified, removed, or newly added. It then generated a new .csv containing only these updates, eliminating the need to process unchanged data.

This optimization reduced processing times by over 75%, allowing for near-instant updates to live campaigns. By streamlining the workflow, the solution freed up system resources for critical tasks while enhancing the client’s ability to adapt to market changes. My efforts not only improved operational efficiency but also strengthened the platform’s scalability. The impact of this solution was recognized by my team manager, who highlighted its value as being equivalent to the introduction of an additional full-time team member.