2 min read, from Machine Learning

TSAuditor: A time-series auditing framework [P]

This happened a few months ago, when I was working on an analysis project involving time-series data. The dataset was large (10 years of data), and I was using a standard profiling tool to check the pipeline. Everything looked fine: the tool reported a 3% missing-data rate for the volume columns.
This was my first time working with time-series data, so I assumed the missing values were noise and didn't think much about it. But the downstream models weren't behaving right. That's when I actually looked at the data and found that the 3% was not noise; it was 6 days' worth of missing data. It didn't stop there: the data also had leakage, and the model hit 99% accuracy. The rolling windows and lag features were wrong too, because the chronological sequence was broken.
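To show why an aggregate missing rate can hide this kind of problem, here is a minimal pandas sketch. It is not tsauditor's code or API; the data is synthetic and `longest_missing_run` is a hypothetical helper name. It builds a daily series where 3% of the values are missing, all in one 6-day block, and reports the longest run of consecutive missing values next to the overall rate.

```python
import numpy as np
import pandas as pd


def longest_missing_run(series: pd.Series) -> int:
    """Return the length of the longest run of consecutive NaN values."""
    is_na = series.isna()
    # Every non-missing value starts a new group id; a NaN inherits the id
    # of the last non-missing value before it, so each run of NaNs lands
    # in a single group.
    run_id = (~is_na).cumsum()
    runs = is_na.groupby(run_id).sum()
    return int(runs.max()) if len(runs) else 0


# Synthetic example (assumption): 200 daily observations of a "volume" column.
index = pd.date_range("2020-01-01", periods=200, freq="D")
volume = pd.Series(np.arange(200, dtype=float), index=index)

# Knock out 6 consecutive days: 6 / 200 = 3% missing overall.
volume.iloc[100:106] = np.nan

missing_rate = volume.isna().mean()
longest_run = longest_missing_run(volume)

print(f"missing rate: {missing_rate:.1%}")
print(f"longest run of missing days: {longest_run}")
```

A profiling summary would only show the first number. The second one is what tells you the missing values are one contiguous hole rather than scattered noise.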

Looking back, this would not have happened if I had done proper EDA. So I built a small validation tool called tsauditor that catches chronological breaks, leakage, and sudden sequential spikes that sit within global boundaries. For each faulty data point, it also adds a description with evidence of why it was flagged and suggests fixes.
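For anyone wondering what a chronological-break check looks like in practice, here is a rough sketch in plain pandas. To be clear, this is not how tsauditor is implemented and none of these names come from its API; `chronological_breaks` and `expected_step` are hypothetical, and the timestamps are made up. It flags rows where the timestamp repeats, goes backwards, or jumps further than the expected sampling step.

```python
import pandas as pd


def chronological_breaks(timestamps, expected_step: pd.Timedelta) -> pd.DataFrame:
    """Flag rows whose timestamp repeats, goes backwards, or skips ahead.

    Each row is compared with the row directly before it, in the order
    the data is stored (no sorting is applied).
    """
    ts = pd.Series(pd.to_datetime(timestamps)).reset_index(drop=True)
    delta = ts.diff()  # first row has no predecessor, so its delta is NaT

    issue = pd.Series("ok", index=ts.index)
    issue[delta < pd.Timedelta(0)] = "out_of_order"
    issue[delta == pd.Timedelta(0)] = "duplicate"
    issue[delta > expected_step] = "gap"

    report = pd.DataFrame({"timestamp": ts, "delta": delta, "issue": issue})
    return report[report["issue"] != "ok"]


# Made-up daily timestamps with one duplicate, one gap, and one backwards jump.
raw = [
    "2024-01-01",
    "2024-01-02",
    "2024-01-02",  # duplicate of the previous row
    "2024-01-09",  # 7-day jump, larger than the expected 1-day step
    "2024-01-05",  # goes backwards
    "2024-01-06",
]

breaks = chronological_breaks(raw, expected_step=pd.Timedelta(days=1))
print(breaks)
```

Any of these breaks silently corrupts rolling windows and lag features, since `shift` and `rolling` work on row position (or on the index as given) and assume the rows are already in clean time order.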

It's open source, lightweight, and on PyPI. I also added an example notebook with a side-by-side comparison of tsauditor and a standard profiling tool.

I wanted to simplify the EDA process and reduce the number of custom scripts for a dataset.

Edit: It can be used without defining a domain.

Link in comments

submitted by /u/severecaseofsarcarsm