Embedded/edge ML folks: what actually eats the most time, getting data or cleaning/labeling it (time-series sensor data, not computer vision/audio)? [D]
I'm trying to understand where people doing sensor-based ML on microcontrollers (IMU, accelerometer, vibration, that kind of time-series data) actually lose the most time.
When you've built something like this, what was the bottleneck:
- Getting enough real world data in the first place?
- Cleaning / labeling / organizing the data you have?
- Actually building and training the model?
- Getting it optimized and deployed on the device?
I am working on a project that aims to eliminate some of these pains, and I wanted some validation on this before I go and add more features. It is essentially Edge Impulse, but hardware-agnostic, GenAI-native, and targeted at time-series data. I am still trying to figure out the best vertical, as there are many to choose from.
I'm weighing a few features and would love a gut check on which would actually save you time:
1. Automatic data quality checks that flag bad/inconsistent data on upload, before you train
2. AI-assisted labeling for long/dynamic recordings
3. Enforcing data standards at collection
4. Reproducible/versioned pipelines
Which would genuinely help, and which is "nice, but I'd never pay for it"? I'm especially curious whether the expensive pain is catching basic data issues or the subtle ones you only notice after the model misbehaves.
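To make "basic data issues" concrete, here is a minimal sketch of the kind of upload-time check meant by the automatic data quality feature. It is purely illustrative: the function name `check_recording`, its parameters, and the default thresholds are hypothetical and not taken from any existing tool. It assumes a single-axis recording with timestamps in seconds and a known sensor full-scale value.

```python
import math
from statistics import median


def check_recording(timestamps, samples, full_scale, flat_run=50, jitter_tol=0.1):
    """Return a list of issue strings for one single-axis sensor recording.

    All names and thresholds here are illustrative assumptions:
    full_scale is the magnitude at which the sensor saturates,
    flat_run is how many identical consecutive samples count as a flatline,
    jitter_tol is the allowed relative deviation from the median interval.
    """
    if len(timestamps) != len(samples):
        return ["timestamp/sample length mismatch"]
    if len(samples) < 2:
        return ["too few samples"]

    issues = []

    # Missing or non-finite readings (None, NaN, inf).
    finite = [x for x in samples if x is not None and math.isfinite(x)]
    bad = len(samples) - len(finite)
    if bad:
        issues.append(f"{bad} non-finite samples")

    # Timing: timestamps must increase, and intervals should stay near the median.
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if any(d <= 0 for d in deltas):
        issues.append("non-monotonic timestamps")
    typical = median(deltas)
    if typical > 0:
        irregular = sum(1 for d in deltas if abs(d - typical) > jitter_tol * typical)
        if irregular:
            issues.append(f"{irregular} irregular sample intervals")

    # Clipping: readings at or beyond the sensor's full-scale range.
    clipped = sum(1 for x in finite if abs(x) >= full_scale)
    if clipped:
        issues.append(f"{clipped} clipped samples")

    # Flatline: longest run of exactly repeated values (stuck sensor or dropped bus).
    longest = run = 1
    for a, b in zip(samples, samples[1:]):
        run = run + 1 if a == b else 1
        longest = max(longest, run)
    if longest >= flat_run:
        issues.append(f"flatline of {longest} samples")

    return issues
```

Checks like these only cover the basic category (dropouts, clipping, timing gaps, stuck values). The subtle category, such as label drift or a sensor mounted in a different orientation, would need something beyond per-recording rules.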