Handling Mixed Data Types

You can train a Text Classifier ML model and use the Classify endpoint to predict labels for long-form text content.

Similarly, you can train a Predictor ML model and use the Predict endpoint to predict outcomes from structured data (that does NOT include long-form text).

If your dataset contains both long-form text and structured data, you can use Text Classifier and Predictor models together: the Text Classifier's predictions for the text become an additional input column for the Predictor.

Training

First, train a Text Classifier ML model using the long-form text content.
- Use only the text and label columns of the CSV file as training data.
Once the Text Classifier model is trained, use it to predict labels and confidence scores for the long-form text content.
- Replace the text column in the CSV file with one or more columns for the predicted labels/confidence scores.
Next, train a Predictor ML model using the CSV file from the previous step.
- This model learns how the text-derived scores and the structured columns together predict the label column.
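
The column replacement in the second step can be sketched with the standard library alone. This is a minimal illustration, not the product's own tooling: the helper name `replace_text_with_scores`, the `text_<label>_confidence` column naming, and the score values are all assumptions made for the example, and the scores stand in for whatever the Classify endpoint returns.

```python
import csv
import io


def replace_text_with_scores(rows, confidence_scores, text_column="text"):
    """Return new rows with the text column replaced by per-label score columns.

    Assumes confidence_scores holds one dict per row, mapping label -> confidence.
    """
    if len(rows) != len(confidence_scores):
        raise ValueError("Expected one confidence-score dict per row")
    out = []
    for row, scores in zip(rows, confidence_scores):
        # Copy every column except the raw text
        new_row = {k: v for k, v in row.items() if k != text_column}
        # Add one column per predicted label (hypothetical naming scheme)
        for label, confidence in scores.items():
            new_row[f"text_{label}_confidence"] = confidence
        out.append(new_row)
    return out


original_csv = (
    "label,text,links_count\n"
    'spam,"Buy cheap watches...",5\n'
    'not_spam,"Meeting at 3pm today...",0\n'
)
rows = list(csv.DictReader(io.StringIO(original_csv)))

# Made-up scores standing in for the Classify endpoint's output
scores = [{"spam": 0.85, "not_spam": 0.15}, {"spam": 0.1, "not_spam": 0.9}]

predictor_rows = replace_text_with_scores(rows, scores)

# Serialize the result as the CSV that would be used to train the Predictor
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(predictor_rows[0].keys()))
writer.writeheader()
writer.writerows(predictor_rows)
predictor_csv = buffer.getvalue()
print(predictor_csv)
```

For a two-label problem such as spam detection, a single score column (for example only the spam confidence) is usually enough, since the other score carries the same information.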

Pseudo-code for Training

# Original spam.csv:
# label,    text,                      sender_domain, time_sent_hour, time_sent_min, links_count, has_attachment, sender_history_score
# spam,     "Buy cheap watches...",    sketchy.com,  23,            45,            5,           True,          0.2
# not_spam, "Meeting at 3pm today...", company.com,  9,            15,            0,           False,         0.95

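# Step 1: Train Text Classifier ML model on email text content
import pandas as pd

df = pd.read_csv('spam.csv')
text_df = df[['label', 'text']]  # Only the label and text columns are used for training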

# Step 2: Get text classification confidence scores from Classify API endpoint
# (classify_api is a placeholder for your client for the Classify endpoint)
text_predictions = classify_api.classify(df['text'].tolist())
confidence_scores = [pred['confidence_scores'] for pred in text_predictions]

# Create Predictor ML model training data with text scores + metadata
predictor_df = df.drop(columns=['text'])  # Remove raw text
# Add confidence scores
predictor_df['text_spam_confidence'] = [score['spam'] for score in confidence_scores]

# Now predictor_df has the columns:
# label, sender_domain, time_sent_hour, time_sent_min, links_count, has_attachment, sender_history_score, text_spam_confidence
# spam,  sketchy.com,  23,            45,            5,           True,          0.2,                 0.85
# not_spam, company.com,  9,            15,            0,           False,         0.95,                0.1

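# Step 3: Train Predictor ML model using CSV file represented by predictor_df
predictor_df.to_csv('spam_predictor.csv', index=False)  # Example filename; use this CSV as the Predictor training data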

Making Predictions

First, use the Classify endpoint to get labels and confidence scores for the long-form text content.
Next, call the Predict endpoint with both:
- The structured data, and
- The labels/confidence scores predicted for the text, in the same columns used to train the Predictor model

The Predict endpoint then returns the final prediction, based on both the text and the structured data.

Pseudo-code for Making Predictions

# New email to classify:
email = {
    "text": "CONGRATULATIONS! You've won a free iPhone! Click here to claim...",
    "sender_domain": "win-prizes.net",
    "time_sent_hour": 2,
    "time_sent_min": 30,
    "links_count": 3,
    "has_attachment": False,
    "sender_history_score": 0.1
}

# Step 1: Get text classification confidence scores from Classify API endpoint
text_prediction = classify_api.classify([email['text']])[0]
confidence_scores = text_prediction['confidence_scores']

# Create Predictor API input with text scores + metadata
predictor_input = {
    "sender_domain": email["sender_domain"],
    "time_sent_hour": email["time_sent_hour"],
    "time_sent_min": email["time_sent_min"],
    "links_count": email["links_count"],
    "has_attachment": email["has_attachment"],
    "sender_history_score": email["sender_history_score"],
    "text_spam_confidence": confidence_scores["spam"]
}

# Step 2: Get final prediction from Predict API endpoint
final_prediction = predict_api.predict(predictor_input)
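
The two prediction steps above can also be wrapped in one helper so that the text score is always added under the same column name used in training. The sketch below is only an illustration: `predict_with_text` is a hypothetical name, and the `classify` and `predict` arguments are caller-supplied callables whose input and output shapes are assumed to match the pseudo-code above, not a documented client API. Stub callables are used so the example runs without network access.

```python
def predict_with_text(record, classify, predict,
                      text_column="text", score_label="spam"):
    """Run the Classify step, then the Predict step, for a single record.

    Assumed shapes (hypothetical wrappers around the two endpoints):
      classify(list_of_texts) -> list of {"confidence_scores": {label: float}}
      predict(dict_of_features) -> prediction
    """
    # Step 1: score the long-form text
    text_prediction = classify([record[text_column]])[0]
    confidence = text_prediction["confidence_scores"][score_label]

    # Build the Predictor input: structured fields plus the text score
    features = {k: v for k, v in record.items() if k != text_column}
    features[f"text_{score_label}_confidence"] = confidence

    # Step 2: final prediction from structured data + text score
    return predict(features)


# Stubs standing in for the real endpoints (values are made up)
def fake_classify(texts):
    return [{"confidence_scores": {"spam": 0.9, "not_spam": 0.1}} for _ in texts]


def fake_predict(features):
    label = "spam" if features["text_spam_confidence"] > 0.5 else "not_spam"
    return {"label": label, "features": features}


email = {
    "text": "CONGRATULATIONS! You've won a free iPhone! Click here to claim...",
    "sender_domain": "win-prizes.net",
    "links_count": 3,
    "has_attachment": False,
}
result = predict_with_text(email, fake_classify, fake_predict)
print(result["label"])
```

Passing the endpoint calls in as arguments keeps the helper independent of any particular client library and makes it easy to test with stubs like the ones shown.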