Handling Mixed Data Types
You can train a Text Classifier ML model and use the Classify endpoint to predict labels for long-form text content.
Similarly, you can train a Predictor ML model and use the Predict endpoint to predict outcomes from structured data (that does NOT include long-form text).
If your dataset contains both long-form text and structured data, you can use Text Classifier and Predictor models together to make predictions.
Training
- First, train a Text Classifier ML model using the long-form text content.
    - You will only use the text and label columns in the CSV file for training data.
- Once the Text Classifier model is trained, use it to predict labels and confidence scores for the long-form text content.
    - Replace the text column in the CSV file with one or more columns for the predicted labels/confidence scores.
- Next, train a Predictor ML model using the CSV file from the previous step.
    - This model will learn patterns and importance of both text and structured data in predicting the label column.
Pseudo-code for Training
import pandas as pd

# Original spam.csv:
# label, text, sender_domain, time_sent_hour, time_sent_min, links_count, has_attachment, sender_history_score
# spam, "Buy cheap watches...", sketchy.com, 23, 45, 5, True, 0.2
# not_spam, "Meeting at 3pm today...", company.com, 9, 15, 0, False, 0.95
df = pd.read_csv('spam.csv')

# Step 1: Train Text Classifier ML model on email text content
# Only the label and text columns are used as training data
text_df = df[['label', 'text']]

# Step 2: Get text classification confidence scores from Classify API endpoint
# (classify_api is a placeholder for your Classify endpoint client)
text_predictions = classify_api.classify(df['text'].tolist())
confidence_scores = [pred['confidence_scores'] for pred in text_predictions]

# Create Predictor ML model training data with text scores + metadata
predictor_df = df.drop(columns=['text'])  # Remove raw text
# Add confidence scores
predictor_df['text_spam_confidence'] = [score['spam'] for score in confidence_scores]

# Now predictor_df has the columns:
# label, sender_domain, time_sent_hour, time_sent_min, links_count, has_attachment, sender_history_score, text_spam_confidence
# spam, sketchy.com, 23, 45, 5, True, 0.2, 0.85
# not_spam, company.com, 9, 15, 0, False, 0.95, 0.1

# Step 3: Train Predictor ML model using CSV file represented by predictor_df
predictor_df.to_csv('spam_predictor.csv', index=False)  # example file name
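The training pseudo-code above adds a single text_spam_confidence column, which is enough for two labels. When the Text Classifier has more labels, the "one or more columns" step could look something like the sketch below. It is only an illustration: the helper name expand_confidence_scores, the text_<label>_confidence column naming, and the shape of each prediction (a dict with a confidence_scores mapping, mirroring the pseudo-code) are assumptions, not part of the product API.

```python
def expand_confidence_scores(rows, predictions, text_column="text"):
    """Return new rows with the raw text replaced by one score column per label.

    rows:        list of dicts, one per CSV row (hypothetical in-memory form)
    predictions: list of dicts shaped like {"confidence_scores": {label: score}}
                 (assumed response shape, mirroring the pseudo-code)
    """
    if len(rows) != len(predictions):
        raise ValueError("expected one prediction per row")

    # Collect every label seen so that all rows end up with the same columns.
    labels = sorted({label for p in predictions for label in p["confidence_scores"]})

    expanded = []
    for row, pred in zip(rows, predictions):
        scores = pred["confidence_scores"]
        # Copy the structured columns and drop the raw text.
        new_row = {k: v for k, v in row.items() if k != text_column}
        # The predicted label is taken to be the highest-scoring one.
        new_row["text_predicted_label"] = max(scores, key=scores.get)
        for label in labels:
            # A label missing from a prediction defaults to 0.0.
            new_row[f"text_{label}_confidence"] = scores.get(label, 0.0)
        expanded.append(new_row)
    return expanded


rows = [
    {"label": "spam", "text": "Buy cheap watches...", "links_count": 5},
    {"label": "not_spam", "text": "Meeting at 3pm today...", "links_count": 0},
]
predictions = [
    {"confidence_scores": {"spam": 0.85, "not_spam": 0.15}},
    {"confidence_scores": {"spam": 0.1, "not_spam": 0.9}},
]
expanded_rows = expand_confidence_scores(rows, predictions)
print(expanded_rows[0])
```

Whether to keep every score column or only the predicted label is a judgment call; keeping all scores gives the Predictor more signal at the cost of extra columns.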
Making Predictions
- First, use the Classify endpoint to get labels and confidence scores for the long-form text content.
- Next, use the Predict endpoint with both:
    - Structured data, and
    - Labels/confidence scores for text

The Predict endpoint then returns the final prediction based on both text and structured data.
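Because the Predictor model was trained on a CSV file in which the text column had been replaced by text-derived columns, the input sent to the Predict endpoint should carry the same feature names. A small client-side check along the following lines can catch a mismatch before the request is made. This is a sketch under assumptions: check_predictor_input is a hypothetical helper, not an API function, and the column names are taken from the spam example.

```python
def check_predictor_input(predictor_input, training_columns, target_column="label"):
    """Raise ValueError if the input keys differ from the training feature columns."""
    # Features are all training columns except the one being predicted.
    expected = set(training_columns) - {target_column}
    actual = set(predictor_input)
    missing = sorted(expected - actual)
    unexpected = sorted(actual - expected)
    if missing or unexpected:
        raise ValueError(
            f"missing columns: {missing}; unexpected columns: {unexpected}"
        )
    return True


# Columns of the CSV file used to train the Predictor model in the spam example.
training_columns = [
    "label", "sender_domain", "time_sent_hour", "time_sent_min",
    "links_count", "has_attachment", "sender_history_score",
    "text_spam_confidence",
]

candidate_input = {
    "sender_domain": "win-prizes.net",
    "time_sent_hour": 2,
    "time_sent_min": 30,
    "links_count": 3,
    "has_attachment": False,
    "sender_history_score": 0.1,
    "text_spam_confidence": 0.9,
}
input_ok = check_predictor_input(candidate_input, training_columns)
```

Passing the raw text through by mistake, or forgetting the confidence column, would raise a ValueError here instead of producing a prediction from mismatched inputs.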
Pseudo-code for Making Predictions
# New email to classify:
email = {
    "text": "CONGRATULATIONS! You've won a free iPhone! Click here to claim...",
    "sender_domain": "win-prizes.net",
    "time_sent_hour": 2,
    "time_sent_min": 30,
    "links_count": 3,
    "has_attachment": False,
    "sender_history_score": 0.1
}

# Step 1: Get text classification confidence scores from Classify API endpoint
# (classify_api is a placeholder for your Classify endpoint client)
text_prediction = classify_api.classify([email['text']])[0]
confidence_scores = text_prediction['confidence_scores']

# Create Predictor API input with text scores + metadata
# (same feature columns as the Predictor training data; the raw text is left out)
predictor_input = {
    "sender_domain": email["sender_domain"],
    "time_sent_hour": email["time_sent_hour"],
    "time_sent_min": email["time_sent_min"],
    "links_count": email["links_count"],
    "has_attachment": email["has_attachment"],
    "sender_history_score": email["sender_history_score"],
    "text_spam_confidence": confidence_scores["spam"]
}

# Step 2: Get final prediction from Predict API endpoint
# (predict_api is a placeholder for your Predict endpoint client)
final_prediction = predict_api.predict(predictor_input)
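The two steps above can be wrapped in one function so that several records go through the same path. The sketch below is a minimal illustration, not the product's client library: predict_mixed, stub_classify, and stub_predict are hypothetical names, and the stubs stand in for real calls to the Classify and Predict endpoints so the example runs on its own.

```python
def predict_mixed(records, classify, predict, text_column="text"):
    """Run text through `classify`, then structured data plus scores through `predict`."""
    texts = [record[text_column] for record in records]
    text_predictions = classify(texts)  # one call for the whole batch

    results = []
    for record, text_pred in zip(records, text_predictions):
        # Keep the structured fields and drop the raw text.
        predictor_input = {k: v for k, v in record.items() if k != text_column}
        predictor_input["text_spam_confidence"] = text_pred["confidence_scores"]["spam"]
        results.append(predict(predictor_input))
    return results


def stub_classify(texts):
    """Stand-in for the Classify endpoint: a naive keyword rule, for illustration only."""
    predictions = []
    for text in texts:
        spam_score = 0.9 if "won" in text.lower() else 0.1
        predictions.append(
            {"confidence_scores": {"spam": spam_score, "not_spam": 1 - spam_score}}
        )
    return predictions


def stub_predict(features):
    """Stand-in for the Predict endpoint: combines the text score with one structured field."""
    is_spam = features["text_spam_confidence"] > 0.5 and features["links_count"] > 0
    return "spam" if is_spam else "not_spam"


emails = [
    {"text": "CONGRATULATIONS! You've won a free iPhone!", "links_count": 3},
    {"text": "Meeting at 3pm today...", "links_count": 0},
]
results = predict_mixed(emails, stub_classify, stub_predict)
print(results)
```

Passing the two endpoint calls in as functions keeps the combining logic separate from any particular client, so the stubs can be swapped for real API calls without changing predict_mixed.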