TSAuditor: A time-series auditing framework [P]
This happened a few months ago when I was working on an analysis project that dealt with time-series data. The dataset was large (10 years of data). I was using a standard profiling tool to check the pipeline. Everything looked fine because the tool reported 3% missing data rate for volume columns.
I didn't think much about it because I thought it was noise, as this was my first time working with time-series data, but the downstream models weren't acting right. That's when I thought something was off, and I actually looked at the data and found the 3% missing data was not noise; in fact, it was a 6-day worth of missing data. It didn't stop here, though, as the data also had leakage, and the model hit 99% accuracy. The rolling windows and lag features were also messed up, as the chronological sequence was broken.
Looking back, if I had done proper EDA, this would not have happened. But I decided to make a small validation tool called tsauditor that catches chronological breaks, leakage, and sudden sequential spikes present in global boundaries. It also adds a description along with evidence on why the data point is faulty and suggests fixes
It's open source, lightweight, and on PyPI. I also added an example notebook, which has a side-by-side comparison of tsauditor with a standard profiling tool. You can also check out the comparison notebook.
I wanted to simplify the EDA process and reduce the number of custom scripts for a dataset.
Edit: It can be used without defining a domain.
Link in comments
[link] [comments]
Want to read more?
Check out the full article on the original site