20 min read · from Dataquest

Project Tutorial: Cleaning and Analyzing Used Car Listings from eBay Kleinanzeigen


Real-world data is messy. Anyone who has worked with scraped web data has seen this pattern: prices saved as text, impossible year values, columns with no variation, and numeric fields that repeat in suspiciously limited ways. Cleaning that data is where most of the real analytical work actually happens.

In this project, we'll work through a realistic data cleaning and analysis workflow using a dataset of 50,000 used car listings scraped from eBay Kleinanzeigen, the classifieds section of the German eBay site. We’ll step into the role of a data analyst helping a used car classifieds service understand how brand and mileage affect car prices.

What You'll Learn

By the end of this tutorial, you'll know how to:

  • Load data with encoding issues and understand why they occur
  • Clean column names from camelCase to snake_case
  • Identify and drop useless or redundant columns
  • Strip formatting characters from numeric strings and convert data types
  • Detect and remove price outliers using judgment-based thresholds
  • Filter invalid registration years
  • Group and aggregate data by brand to uncover pricing trends

Before You Start

You'll need Python 3.8+, Jupyter Notebook, pandas, and NumPy. You can install them with pip install pandas numpy if needed.

Familiarity with Python lists, dictionaries, loops, and basic pandas operations will help you follow along. If you want to review first, the Introduction to Pandas and NumPy for Data Analysis course covers the core concepts. Access the full project in the Dataquest app and the solution notebook on GitHub.

The Dataset

The data was originally scraped from eBay Kleinanzeigen and uploaded to Kaggle. We're working with a 50,000-row sample prepared by Dataquest that simulates a less-cleaned version of the original. Each row represents a single used car listing, with columns covering the car's name, price, brand, model, mileage, registration year, fuel type, and various metadata about the listing itself.

Our goal is to clean this data and analyze pricing trends by brand.

Step 1: Loading the Data

import pandas as pd
import numpy as np

autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos.info()
autos.head()

If you try loading the file without specifying an encoding, you'll get a UnicodeDecodeError. This happens because the dataset contains German text with special characters (umlauts and the like) that the default UTF-8 decoder can't interpret. When you see this error, encoding='Latin-1' is often the fix: Latin-1 (ISO-8859-1) covers Western European characters and frequently resolves this kind of error for data scraped from Western European sites.
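If you'd rather not hard-code a single encoding, a small fallback loop is a common pattern. A minimal sketch (the helper name is my own, not part of the tutorial) that tries UTF-8 first and falls back to Latin-1:

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "latin-1", "cp1252")):
    """Try each encoding in turn; return the first DataFrame that decodes."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of the encodings {encodings} could decode {path}")

# Usage: autos = read_csv_with_fallback('autos.csv')
```

This keeps the happy path (UTF-8) fast while handling legacy encodings gracefully; for truly unknown files, the chardet or charset-normalizer libraries can guess the encoding from the raw bytes.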

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
...
 4   price                50000 non-null  object
...
11   odometer             50000 non-null  object
...

A few things stand out from info(). Most columns are object type (strings in pandas). Notably, price and odometer are both objects when they should be numeric — glancing at the actual data values immediately reveals why.

price: $5,000    odometer: 150,000km

Both columns contain non-numeric characters: dollar signs and commas in price, and "km" plus commas in odometer. Those need to be stripped before we can do any math with these columns.

Learning Insight: The info() and head() methods together are your first line of defense in any new dataset. info() tells you data types and null counts. head() shows you the actual values so you can see why those types are what they are. Together they surface many of the cleaning tasks you'll need to do.

Step 2: Renaming Columns

The original column names use camelCase (dateCrawled, vehicleType, yearOfRegistration). Python convention is snake_case, so we'll rename them all at once.

autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
       'last_seen']

This makes the columns easier to work with and aligns with pandas best practices. Note that we've also taken the opportunity to rename a few columns more descriptively: yearOfRegistration becomes registration_year, notRepairedDamage becomes unrepaired_damage, and nrOfPictures becomes num_photos.
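Writing the list out by hand is fine for 20 columns; for wider tables you can convert camelCase to snake_case mechanically with a regex. A sketch (the helper name is my own):

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each uppercase letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# Applied to every column:
# autos.columns = [camel_to_snake(c) for c in autos.columns]
```

Note that a mechanical conversion yields year_of_registration rather than the shorter registration_year chosen above, so the hand-written list remains the right choice when you also want to rename columns along the way.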

Step 3: Dropping Useless Columns

The describe(include='all') method gives us summary statistics for every column, including string columns.

autos.describe(include='all')

(output abridged; showing only the columns discussed below)

              seller offer_type  price   odometer  num_photos  registration_year
count          50000      50000  50000      50000     50000.0       50000.000000
unique             2          2   2357         13         NaN                NaN
top           privat    Angebot     $0  150,000km         NaN                NaN
freq           49999      49999   1421      32424         NaN                NaN
mean             NaN        NaN    NaN        NaN         0.0        2005.073280
std              NaN        NaN    NaN        NaN         0.0         105.712813
min              NaN        NaN    NaN        NaN         0.0        1000.000000
25%              NaN        NaN    NaN        NaN         0.0        1999.000000
50%              NaN        NaN    NaN        NaN         0.0        2003.000000
75%              NaN        NaN    NaN        NaN         0.0        2008.000000
max              NaN        NaN    NaN        NaN         0.0        9999.000000

A few things immediately stand out. The seller column has only two unique values, and 49,999 out of 50,000 are the same one. offer_type is identical for nearly every row. These columns carry essentially no information.

The num_photos column is even more clear-cut:

autos["num_photos"].value_counts()
0    50000
Name: num_photos, dtype: int64

Every single row is zero. The column is useless. We drop all three.

autos = autos.drop(["num_photos", "seller", "offer_type"], axis=1)
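Eyeballing describe() works at this scale; on wider datasets you can flag near-constant columns programmatically. A minimal sketch (the helper name and threshold are my own, not from the tutorial):

```python
import pandas as pd

def near_constant_columns(df, threshold=0.99):
    """Return columns where one value accounts for more than `threshold`
    of the non-null rows: candidates for dropping."""
    flagged = []
    for col in df.columns:
        shares = df[col].value_counts(normalize=True)
        if not shares.empty and shares.iloc[0] > threshold:
            flagged.append(col)
    return flagged
```

On this dataset, such a check would surface seller, offer_type, and num_photos in one pass; always inspect the flagged columns before dropping, since a near-constant column can still be meaningful (e.g., a filter applied during scraping).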

Step 4: Cleaning the Price and Odometer Columns

Now for the core cleaning work. Both price and odometer are stored as strings with formatting characters that prevent numeric conversion.

autos["price"] = (autos["price"]
                  .str.replace("$", "")
                  .str.replace(",", "")
                  .astype(int)
                  )
autos["price"].head()
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

This is method chaining: three operations applied one after another in a single readable block. We strip the dollar sign, strip the comma (a thousands separator here; the scraped values use US-style number formatting even though German notation normally uses the comma as a decimal mark), and convert to integer. The same approach works for odometer.

autos["odometer"] = (autos["odometer"]
                     .str.replace("km", "")
                     .str.replace(",", "")
                     .astype(int)
                     )
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)

We also rename the column to odometer_km so it's clear the unit is kilometers, since we're removing "km" from the values themselves.
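An alternative to the explicit chained replaces is to strip every non-digit character with a regex and let pd.to_numeric handle parsing, which turns anything unparseable into NaN instead of raising. A sketch (the helper is hypothetical, and note it would mangle decimals or negative numbers, so it suits these particular columns only):

```python
import pandas as pd

def strip_to_numeric(series):
    """Remove every non-digit character, then parse; unparseable
    values become NaN rather than raising an error."""
    cleaned = series.str.replace(r"[^\d]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# e.g.: autos["price"] = strip_to_numeric(autos["price"])
```

The errors="coerce" behavior is useful on messier data, where a single stray value like "VB" (German for "negotiable") would otherwise make astype(int) fail on the whole column.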

Step 5: Exploring and Cleaning Price

Before filtering, let's understand what we're working with.

print(autos["price"].unique().shape)
print(autos["price"].describe())
autos["price"].value_counts().sort_index(ascending=False).head(20)
(2357,)
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

The minimum price is $0 and the maximum is $100,000,000. Both are suspicious. Looking at the actual high-end values:

99999999    1
27322222    1
12345678    3

These prices are far beyond what any real vehicle sells for, and values like 12345678 and 99999999 look like placeholder entries rather than genuine prices. Looking at the low end:

0     1421
1      156

There are 1,421 cars listed at $0. While that's a small fraction of the 50,000 rows, $0 doesn't represent a real sale price for our analysis. We'll remove zero-priced rows and anything above $350,000, which is where prices jump from plausible high-end values to a small set of outliers.

autos = autos[autos["price"].between(1, 351000)]
autos["price"].describe()
count     48565.000000
mean       5888.935591
min           1.000000
max      350000.000000

We retain 48,565 rows — a small reduction that removes noise without discarding meaningful data.

Learning Insight: Deciding where to draw the outlier threshold is one of the judgment calls in data cleaning. There's often no single correct answer. We chose $350,000 because that's where prices in the data jump from gradual increases to extreme values. Document your reasoning wherever you make these calls, especially if you're sharing your analysis with a stakeholder.
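One way to find such a threshold is to sort the highest values and look for where the gap opens. A sketch with illustrative numbers (not the real listing data):

```python
import pandas as pd

# Illustrative prices only; the real column is autos["price"] after cleaning.
prices = pd.Series([1200, 3500, 9800, 220000, 345000, 350000,
                    999990, 12345678, 99999999])

# Largest values first: the jump from 350000 to 999990 is where
# plausible listings end and implausible outliers begin.
top = prices.sort_values(ascending=False).head(5)
print(top)

# Filtering at the chosen boundary keeps the plausible range.
kept = prices[prices.between(1, 351000)]
print(len(kept))
```

Quantiles (e.g., prices.quantile([0.95, 0.99])) give a complementary view of how extreme the tail is relative to the bulk of the data.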

Step 6: Exploring Odometer Values

autos["odometer_km"].value_counts()
150000    32424
125000     5170
100000     2169
...
Name: odometer_km, dtype: int64

Interestingly, there are only 13 distinct values, despite this being a continuous numeric variable. This happens because eBay's listing interface likely had a dropdown selector rather than a free text field, so sellers chose from pre-set mileage bands. That's not an error — it just means we have less granularity than we might expect. The 13 distinct values are clean and interpretable.

Step 7: Cleaning Registration Year

autos["registration_year"].describe()
min       1000.000000
max       9999.000000

Year 1000 predates the automobile by about 900 years. Year 9999 is 8,000 years into the future. We need to filter to a realistic range.

(~autos["registration_year"].between(1900, 2016)).sum() / autos.shape[0]
0.038793

Only about 4% of rows fall outside our valid range of 1900 to 2016 (the dataset's collection year). That's small enough to remove without concern.

autos = autos[autos["registration_year"].between(1900, 2016)]
autos["registration_year"].value_counts(normalize=True).head(10)
2000    0.067608
2005    0.062895
1999    0.062060

The distribution looks exactly as expected for a used car market: most vehicles were registered in the past 20 years.

Step 8: Analyzing Price by Brand

With the data cleaned, we can now compare the most common car brands in the listings and see how their average prices differ.

autos["brand"].value_counts(normalize=True)
volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
...

Volkswagen dominates with about 21% of listings. German manufacturers (Volkswagen, BMW, Opel, Mercedes-Benz, Audi) account for roughly half of all listings combined.

Many brands in the tail have less than 1% representation. For a meaningful price analysis, we'll focus on brands with more than 5% of total listings.

brand_counts = autos["brand"].value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .05].index
print(common_brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')

Now we calculate mean price for each of these brands.

brand_mean_prices = {}

for brand in common_brands:
    brand_only = autos[autos["brand"] == brand]
    mean_price = brand_only["price"].mean()
    brand_mean_prices[brand] = int(mean_price)

brand_mean_prices
{'volkswagen': 5402,
 'bmw': 8332,
 'opel': 2975,
 'mercedes_benz': 8628,
 'audi': 9336,
 'ford': 3749}

A clear pattern emerges: Audi, BMW, and Mercedes-Benz cluster in the $8,000–$9,300 range. Ford and Opel sit in the $3,000–$3,700 range. Volkswagen lands in between at $5,400.
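The explicit loop makes each step visible, which is useful for learning; the same aggregation can also be written as a single groupby. A sketch with a small stand-in frame (illustrative values, not the real listings):

```python
import pandas as pd

# Small illustrative frame standing in for the cleaned `autos` data.
autos = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel", "audi"],
    "price": [9000, 7600, 3100, 2900, 9300],
})

common_brands = ["bmw", "opel", "audi"]
mask = autos["brand"].isin(common_brands)

# Mean price per brand, highest first, in one chained expression.
brand_mean_prices = (autos.loc[mask]
                     .groupby("brand")["price"]
                     .mean()
                     .sort_values(ascending=False))
print(brand_mean_prices)
```

The groupby version scales to any number of brands without changing the code, though the dictionary loop has the advantage of making the per-brand filtering explicit.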

Step 9: Comparing Mileage Across Brands

Does a higher average price correspond to higher mileage? Let's find out.

brand_mean_mileage = {}

for brand in common_brands:
    brand_only = autos[autos["brand"] == brand]
    mean_mileage = brand_only["odometer_km"].mean()
    brand_mean_mileage[brand] = int(mean_mileage)

mean_mileage = pd.Series(brand_mean_mileage).sort_values(ascending=False)
mean_prices = pd.Series(brand_mean_prices).sort_values(ascending=False)

brand_info = pd.DataFrame(mean_mileage, columns=['mean_mileage'])
brand_info["mean_price"] = mean_prices
brand_info
               mean_mileage  mean_price
bmw               132572        8332
mercedes_benz     130788        8628
opel              129310        2975
audi              129157        9336
volkswagen        128707        5402
ford              124266        3749

Mileage across brands falls within a fairly narrow range (roughly 124,000 to 133,000 km), while prices vary by more than 3x between the cheapest (Opel at ~$3,000) and the most expensive (Audi at ~$9,300). BMW and Mercedes-Benz have the highest mileage and also command high prices, suggesting buyers pay a premium for these brands regardless of wear.

Learning Insight: When comparing multiple metrics across groups, putting them side by side in a single DataFrame makes patterns much easier to spot than reading two separate outputs. Here, the combination reveals that mileage isn't the primary driver of the price gap between German luxury brands and budget options.
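For reference, the two dictionaries and the manual Series assembly can also be collapsed into one named-aggregation call. A sketch with stand-in data:

```python
import pandas as pd

# Illustrative rows standing in for the cleaned listings.
autos = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford"],
    "price": [9000, 7600, 3800, 3700],
    "odometer_km": [150000, 125000, 125000, 125000],
})

# Named aggregation builds both columns of the comparison at once.
brand_info = (autos.groupby("brand")
              .agg(mean_price=("price", "mean"),
                   mean_mileage=("odometer_km", "mean"))
              .sort_values("mean_price", ascending=False))
print(brand_info)
```

Because both metrics are computed in the same groupby, the rows are guaranteed to align by brand, which removes the need to match up two separately sorted Series.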

Summary of Findings

After cleaning the dataset from 50,000 rows down to approximately 46,700 valid listings, the key takeaways are:

The used car market on eBay Kleinanzeigen is dominated by German brands, with Volkswagen alone accounting for 21% of listings. Among common brands, there's a distinct price gap: Audi, BMW, and Mercedes-Benz average $8,000–$9,300 per vehicle, while Ford and Opel average $3,000–$3,700. Volkswagen sits in between, which may help explain its popularity as a middle-market option.

Average mileage doesn't vary much across brands (all within 10% of each other), which means mileage alone doesn't explain the price differences. Brand prestige and the typical market segment these manufacturers occupy are stronger drivers.

Next Steps

There's a lot more this dataset can support:

Analyze model-level pricing. We looked at brand, but the model column lets you go deeper. Does the price vary significantly between a BMW 3-series and a BMW 7-series? Are certain Volkswagen models clustered at higher or lower price points?

Segment by mileage bands. The odometer already gives us 13 clean buckets. Do average prices drop consistently as mileage increases, or is the relationship brand-dependent?

Explore unrepaired damage. The unrepaired_damage column tells us whether a car has damage that hasn't been fixed. Are damaged cars priced significantly lower? What fraction of each brand's listings have damage?
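As a starting point for the damage question, a grouped mean comparison would look something like this (illustrative rows; in the real data the column holds the German values "ja" and "nein"):

```python
import pandas as pd

# Illustrative rows; the real dataset uses "ja" (yes) / "nein" (no),
# with missing values where the seller left the field blank.
autos = pd.DataFrame({
    "unrepaired_damage": ["nein", "nein", "ja", "ja", None],
    "price": [6000, 5000, 2000, 1800, 4000],
})

# groupby silently drops the missing values, so only "ja"/"nein" appear.
damage_prices = autos.groupby("unrepaired_damage")["price"].mean()
print(damage_prices)
```

Extending this with a second groupby level on brand would show whether the damage discount is uniform across manufacturers or steeper for the luxury brands.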

Clean and use the date columns. We skipped date_crawled, ad_created, and last_seen for this walkthrough. Analyzing how long listings stayed active before being removed could reveal which brands or price ranges sell fastest.

Resources

Share your extended analysis in the community and tag @Anna_strahl. There are several interesting directions to take this project, and seeing what patterns others find in the data is always worthwhile.

Want to read more?

Check out the full article on the original site

