Preparing Data for AI
Why Preprocess Data?
Think of preprocessing like preparing ingredients for cooking:
- Clean the ingredients (cleaning)
- Cut them uniformly (normalization)
- Portion them correctly (tokenization)
Step 1: Cleaning Data
Common Data Sources
We handle different types of content:
- Websites (HTML, CSS, JavaScript)
- Documents (PDFs, Excel, Word)
- Knowledge bases (Confluence, SharePoint)
- Databases (SQL, MongoDB)
What We Remove
Common elements to clean:
- Navigation and ads from websites
- Formatting from documents
- Version history from knowledge bases
- System fields from databases
- Duplicate content
- Empty values
Example: Like keeping the recipe text but removing ads and comments from a cooking website.
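To make the web-cleaning step concrete, here is a minimal sketch using only Python's standard library. The `clean_html` helper and the set of tags it skips are illustrative assumptions, not a reference to any particular tool; a production pipeline would use a proper extraction library.

```python
from html.parser import HTMLParser

# Tags whose contents are boilerplate (navigation, ads, styling)
# rather than real content, so we drop them entirely.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

class TextExtractor(HTMLParser):
    """Collects visible text while skipping boilerplate tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # >0 while inside a boilerplate tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def clean_html(raw: str) -> str:
    parser = TextExtractor()
    parser.feed(raw)
    return " ".join(parser.parts)

html = '<nav>Home | About</nav><p>Mix flour and water.</p><script>ads()</script>'
print(clean_html(html))  # -> "Mix flour and water."
```

The recipe text survives while the navigation bar and the ad script are dropped, exactly like the cooking-website example above.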
Step 2: Normalizing Data
Text Standardization
We make text consistent by:
- Converting everything to the same case (upper or lower)
- Standardizing spaces and punctuation
- Handling special characters
- Converting formats
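As a rough sketch, the standardization rules above fit in a few lines of Python. The exact choices here (case folding, straightening curly quotes, collapsing whitespace) are assumptions; each pipeline picks its own rules.

```python
import re

def standardize(text: str) -> str:
    """Basic text standardization: case, punctuation, whitespace."""
    text = text.casefold()                        # one consistent case
    text = re.sub(r"[\u2018\u2019]", "'", text)   # curly -> straight quotes
    text = re.sub(r"[\u201c\u201d]", '"', text)
    text = re.sub(r"\s+", " ", text)              # collapse whitespace runs
    return text.strip()

print(standardize("  Hello,\tWORLD!  "))  # -> "hello, world!"
```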
Language Handling
We manage:
- Multiple languages
- Special characters
- Different alphabets
- Regional variations
Example: Making sure "café", "cafe", and "CAFE" are treated the same way.
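One common way to achieve this, sketched below, is Unicode normalization plus accent folding. Whether accents should actually be dropped is language-dependent (in French, "côte" and "cote" are different words), so treat this as an assumption, not a universal rule.

```python
import unicodedata

def fold(text: str) -> str:
    """Map 'café', 'cafe', and 'CAFE' to the same canonical form.

    NFKD decomposes accented characters into a base character plus
    combining marks; dropping the marks and case-folding the rest
    gives one canonical spelling.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return no_accents.casefold()

assert fold("café") == fold("cafe") == fold("CAFE")  # all -> "cafe"
```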
Format Standardization
We normalize:
- Dates (01/01/2024 → 2024-01-01)
- Numbers (1,000.00 → 1000.00)
- Units (km → miles)
- Abbreviations (Dr. → Doctor)
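A minimal sketch of these conversions follows, assuming US-style input formats (MM/DD/YYYY dates, comma thousands separators); real data would need locale detection first, and the abbreviation table shown holds only illustrative entries.

```python
from datetime import datetime

def normalize_date(s: str) -> str:
    """01/01/2024 -> 2024-01-01 (assumes MM/DD/YYYY input)."""
    return datetime.strptime(s, "%m/%d/%Y").strftime("%Y-%m-%d")

def normalize_number(s: str) -> str:
    """1,000.00 -> 1000.00 (assumes ',' is a thousands separator)."""
    return s.replace(",", "")

KM_PER_MILE = 1.609344

def km_to_miles(km: float) -> float:
    return km / KM_PER_MILE

ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}  # illustrative entries

print(normalize_date("01/01/2024"))  # -> 2024-01-01
print(normalize_number("1,000.00"))  # -> 1000.00
print(round(km_to_miles(10), 2))     # -> 6.21
```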
Step 3: Tokenization
Breaking Down Text
We split text into pieces (tokens) that AI can understand:
- Words ("hello world" → ["hello", "world"])
- Subwords ("playing" → ["play", "ing"])
- Characters ("hello" → ["h", "e", "l", "l", "o"])
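The three granularities can be illustrated in plain Python. The subword split below is a toy suffix rule for demonstration only; real systems learn subword vocabularies from data (e.g., BPE or WordPiece).

```python
def word_tokens(text: str) -> list[str]:
    return text.split()   # split on whitespace

def char_tokens(text: str) -> list[str]:
    return list(text)     # one token per character

# Toy subword rule: peel off a few known suffixes. Real tokenizers
# learn these splits from a corpus instead of using a fixed list.
SUFFIXES = ("ing", "ed", "s")

def subword_tokens(word: str) -> list[str]:
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return [word[: -len(suf)], suf]
    return [word]

print(word_tokens("hello world"))  # -> ['hello', 'world']
print(subword_tokens("playing"))   # -> ['play', 'ing']
print(char_tokens("hello"))        # -> ['h', 'e', 'l', 'l', 'o']
```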
Special Cases
We handle:
- Technical content and code
- Multiple languages
- Emojis and symbols
- Social media content
Example: Breaking "AI-powered" into ["AI", "-", "powered"] or ["AI", "power", "ed"], depending on the tokenizer.
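For hyphenated terms and symbols like the example above, a regex-based tokenizer is one simple approach. The pattern below is an assumption tuned only for this illustration: it treats runs of word characters as tokens and each remaining non-space character (punctuation, emoji) as its own token.

```python
import re

# Match runs of word characters OR single non-word, non-space
# characters, so "AI-powered" splits into ["AI", "-", "powered"].
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(tokenize("AI-powered"))  # -> ['AI', '-', 'powered']
print(tokenize("Great! 🚀"))   # -> ['Great', '!', '🚀']
```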
Quality Control
Checking Results
We verify:
- Text makes sense
- Format is consistent
- Important info is kept
- No errors introduced
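Some of these checks can be automated. Below is a rough sketch of a few sanity checks one might run on pipeline output; the specific thresholds and warning messages are assumptions, and checks like "text makes sense" still need human or model-based review.

```python
def check_output(original: str, processed: str, tokens: list[str]) -> list[str]:
    """Return a list of quality warnings (empty list = all checks pass)."""
    warnings = []
    if not processed.strip():
        warnings.append("processed text is empty")
    if not tokens:
        warnings.append("no tokens produced")
    # Heuristic: heavy shrinkage may mean real content was stripped.
    if len(processed) < 0.2 * len(original):
        warnings.append("output much shorter than input; content may be lost")
    if any(t == "" for t in tokens):
        warnings.append("empty token found")
    return warnings

print(check_output("Mix flour and water.", "mix flour and water.",
                   ["mix", "flour", "and", "water", "."]))  # -> []
```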
Common Challenges
Key issues include:
- Mixed language content
- Technical terminology
- Special characters
- Long documents
Next Steps
After preprocessing, the cleaned, normalized, and tokenized data is ready for the next stages of the AI pipeline.