AutoML - Lookalike Audience

All you need to know about our AutoML - Lookalike Audience module


The AutoML - Lookalike Audience API enables you to expand your target audience by identifying similar users from a seed set. By leveraging Hyperplane's Knowledge Graph and Foundation Models, within minutes you can train your own fine-tuned models that capture your customers' behaviors and drive actionable insights.

Lookalike Modeling

While traditional model training relies on explicit labels, gathering and validating these labels is oftentimes time-consuming and labor-intensive. Our Lookalike API offers a simpler alternative: given a seed set of positive_label_users, you will train models that capture the behavior across these users and expand your audience to other users who display similar behaviors. Optionally, you can provide a negative_label_users seed set whose characteristics you wish to avoid.

This API provides the simplest interface for you to start driving model-based decisions while leveraging Hyperplane's Foundation Models. Here are a few guidelines to produce high-quality models:

  • If the seed set is too small, the resulting model may be noisy and of poor quality. Check the model evaluation reports (WIP) to verify the stability and quality of your Lookalike Model.
  • Your model's performance on downstream tasks will be highly dependent on the quality of the seed audience. Try to validate the quality of your audience and experiment with different variations of audience definitions.


We provide examples of areas in your business that may benefit from Lookalike models.

Digital Advertising and Email Marketing

Lookalike audiences can be used to reach potential customers who share characteristics with a seed customer. In this use case, we might have:

  • positive_label_users: Users who have purchased or converted on some product or advertisement
  • Optional negative_label_users: Users who did not convert or who explicitly rejected a past marketing attempt.

Credit Limit Increases

Lookalike audiences can also be used in credit and default applications. For example, you may want to identify potential candidates for credit limit increases:

  • positive_label_users: Users who have recently requested and were approved for credit limit increases.
  • Optional negative_label_users: Users who have recently defaulted.

API Conventions


For the header of your request, you'll need a few pieces of information:

  • authentication token
  • module ID

Below is an example request showing what the headers should look like:

curl -X 'GET' \
  -H 'authorization: <AUTHENTICATION_TOKEN>' \
  -H 'module-id: <MODULE_ID>' \
  -H 'accept: application/json'
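If you are calling the API from Python, the same headers can be assembled once and reused. The helper below is a sketch for illustration; the function name is ours, not part of the API:

```python
def build_headers(auth_token: str, module_id: str) -> dict:
    """Headers required on every AutoML - Lookalike Audience request."""
    return {
        "authorization": auth_token,   # your authentication token
        "module-id": module_id,        # your module ID
        "accept": "application/json",
    }

# e.g. with the requests library:
# requests.get(url, headers=build_headers("<AUTHENTICATION_TOKEN>", "<MODULE_ID>"))
```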

For a comprehensive code guide on this, check out our AutoML - Lookalike Audience Recipe.

Training Run Request

To initiate a new model training run, you will provide a list of users that define your seed audience.


{
  "run_description": "AutoML - Lookalike Audience Guide Test Audience",
  "engagement_type": "MULTI_ENGAGE",
  "positive_label_users": [
    {"user_id": "001", "timestamp": "2023-01-01 00:00:00"},
    {"user_id": "002", "timestamp": "2023-01-01 00:00:00"},
    {"user_id": "003", "timestamp": "2023-02-01 00:00:00"},
    {"user_id": "004", "timestamp": "2022-06-12 00:00:00"},
    {"user_id": "005"},
    {"user_id": "006"},
    {"user_id": "007"},
    {"user_id": "008"},
    {"user_id": "009"},
    {"user_id": "010"}
  ],
  "negative_label_users": [
    {"user_id": "011", "timestamp": "2023-04-11 00:00:00"},
    {"user_id": "012", "timestamp": "2023-04-11 00:00:00"},
    {"user_id": "013", "timestamp": "2022-12-25 00:00:00"},
    {"user_id": "014"},
    {"user_id": "015"},
    {"user_id": "016"},
    {"user_id": "017"},
    {"user_id": "018"},
    {"user_id": "019"},
    {"user_id": "020"}
  ]
}

Note that providing negative_label_users is optional.

Engagement type determines the targeting behavior of this run. There are two valid values:

  • "MULTI_ENGAGE": The campaign can retarget users who have previously converted; labeled users are rescored and ranked along with the rest of the population.
  • "SINGLE_ENGAGE": The campaign only targets new users, so the returned scores include only new users.

Each provided user list must contain at least 10 users for the service to work. We highly recommend providing many more users, however, to ensure model performance.

Providing a timestamp for each user is also optional, but recommended so that the best snapshot of each user is selected. If a timestamp is provided, the most recent snapshot at or before the given date is used (inclusive). If no timestamp is provided, the most recent snapshot is used.
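The snapshot-selection rule above can be illustrated with a small sketch (the helper name and date format are ours; the service applies this logic server-side):

```python
from datetime import datetime

def select_snapshot(snapshot_dates, timestamp=None):
    """Pick the snapshot the service would use for one user.

    With a timestamp: the most recent snapshot at or before it (inclusive).
    Without one: the most recent snapshot overall.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    dates = sorted(datetime.strptime(d, fmt) for d in snapshot_dates)
    if timestamp is None:
        return dates[-1].strftime(fmt)
    cutoff = datetime.strptime(timestamp, fmt)
    eligible = [d for d in dates if d <= cutoff]  # inclusive comparison
    return eligible[-1].strftime(fmt) if eligible else None

# A snapshot dated exactly at the given timestamp is eligible:
select_snapshot(["2022-12-01 00:00:00", "2023-01-01 00:00:00"],
                "2023-01-01 00:00:00")  # → "2023-01-01 00:00:00"
```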

Once the request is made, a model is trained asynchronously and scores are generated for your population. Model training typically completes within 15-20 minutes.
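The request conventions above (ten-user minimum per list, optional negative seed, optional timestamps) can be captured in a small payload builder. This is a sketch; the helper name is ours, and the exact call is covered in the AutoML - Lookalike Audience Recipe:

```python
def build_training_request(run_description, positive_users,
                           negative_users=None, engagement_type="MULTI_ENGAGE"):
    """Assemble a training-run request body, enforcing the 10-user minimum."""
    if len(positive_users) < 10:
        raise ValueError("positive_label_users needs at least 10 users")
    if negative_users is not None and len(negative_users) < 10:
        raise ValueError("negative_label_users needs at least 10 users")

    body = {
        "run_description": run_description,
        "engagement_type": engagement_type,  # or "SINGLE_ENGAGE"
        "positive_label_users": positive_users,
    }
    if negative_users is not None:
        body["negative_label_users"] = negative_users  # optional seed set
    return body

# timestamp is optional per user; include it where you know the best snapshot date
positives = [{"user_id": f"{i:03d}", "timestamp": "2023-01-01 00:00:00"}
             for i in range(1, 11)]
request_body = build_training_request("Guide Test Audience", positives)
```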

Getting Run Status

You can check in on the status of runs using the automl/runs endpoint to list all runs or get a specific run's status with the automl/runs/{run_id}/status endpoint.

You will find one of the following statuses for your run:

RUN_INITIALIZED: The run has started and the data is being prepared

TRAINING_MODEL: The model is being trained on your given users

COMPUTING_METRICS: The model has been fitted and metrics are being generated

FINISHED_TRAINING: The entire training process is complete. The rest of your users are being prepared to be scored

SCORING_INFERENCE_USERS: Scores are being generated for all users in your population

FINISHED: The entire run process has completed successfully. You can now get top user scores from this run

FAILED: Something has gone wrong in the run process. Please try again or contact us to address the issue

It might take a few minutes for a run to initialize, so if you don't immediately see a status, please check again later.

Note that you will only be able to get top users after scoring is complete and the status is FINISHED.
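Since a run moves through these statuses asynchronously, a simple polling loop is a common pattern. In this sketch, `get_status` stands in for your own wrapper around the automl/runs/{run_id}/status endpoint:

```python
import time

TERMINAL_STATUSES = {"FINISHED", "FAILED"}

def wait_for_run(get_status, run_id, poll_seconds=60, timeout_seconds=3600):
    """Poll until the run reaches FINISHED or FAILED, or the timeout passes.

    `get_status` is any callable mapping a run_id to one of the status
    strings listed above.
    """
    waited = 0
    while waited <= timeout_seconds:
        status = get_status(run_id)
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"run {run_id} did not finish in {timeout_seconds}s")
```

Remember that a run may take a few minutes to initialize, so an empty or missing status early on is not necessarily an error.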

Getting Run Metadata and Results

You can check in on the results and the metadata of runs using the automl/runs/{run_id} endpoint for a specific run or the automl/runs endpoint to list all runs.


You can access the metadata and results for a specific run using the automl/runs/{run_id} endpoint.

Example response:

{
  "run_id": "string",
  "status": "string",
  "number_of_train_users": 0,
  "number_of_inference_users": 0,
  "run_description": "Custom run description to attach to run",
  "timestamp": 0,
  "run_metrics": {
    "train_auc": 0,
    "train_ks": 0,
    "test_auc": 0,
    "test_ks": 0,
    "cumulative_positive_rate": {
      "ascending": {
        "0.1": 0.25,
        "0.2": 0.3,
        "0.3": 0.35
      },
      "descending": {
        "0.1": 0.4,
        "0.2": 0.35,
        "0.3": 0.3
      }
    },
    "positive_rate_by_decile": {
      "0": 0.25,
      "1": 0.3,
      "2": 0.32
    },
    "label_positive_rate": 0
  },
  "input_label_summary": {
    "num_input_labels": 100,
    "num_input_users": 70,
    "num_matched_labels": 85,
    "num_training_users": 60,
    "monthly_positive_rates": {
      "2023-07": 0.48,
      "2023-08": 0.53,
      "2023-09": 0.41,
      "2023-10": 0.62
    },
    "monthly_counts": {
      "2023-07": 20,
      "2023-08": 40,
      "2023-09": 40,
      "2023-10": 20
    },
    "label_weight_counts": {
      "(0.0, 1.0)": 40,
      "(0.0, 0.5)": 10,
      "(1.0, 1.0)": 50
    }
  }
}
This schema provides detailed information about the run, including its unique identifier (run_id), current status (status), the number of users used for training and inference, a custom description for the run, and a timestamp indicating when the run was initiated.

Additionally, you can find various metrics related to the model's performance, such as AUC, KS, cumulative positive rates, positive rates by decile, and label positive rate. For a better understanding of each metric, please refer to Get the run summary for a specified AutoML run.

You may also see a breakdown of the input labels given for this run under input_label_summary. You will find analyses of the given labels that can help you determine the "health" of the training labels.

num_input_labels: The total number of labels (both positive and negative) given in the request

num_input_users: The number of distinct user_ids found in the given labels

num_matched_labels: The number of labels that were found to match the available data

num_training_users: The number of distinct user_ids that were matched

monthly_positive_rates: A breakdown of label value rates (# of positive labels / # of total labels) by month.

monthly_counts: A breakdown of label counts by month

label_weight_counts: A breakdown of counts for each label value, weight value pair. The dict is keyed by a stringified tuple of (label_value, weight_value)
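Because label_weight_counts is keyed by stringified tuples, the keys need to be parsed back into real tuples before analysis. One way to do that, using the values from the example response above:

```python
import ast

# Keys are stringified (label_value, weight_value) tuples;
# ast.literal_eval safely turns each key back into a real tuple.
label_weight_counts = {"(0.0, 1.0)": 40, "(0.0, 0.5)": 10, "(1.0, 1.0)": 50}

parsed = {ast.literal_eval(k): v for k, v in label_weight_counts.items()}
positives = sum(n for (label, _), n in parsed.items() if label == 1.0)
negatives = sum(n for (label, _), n in parsed.items() if label == 0.0)
# For the example above: 50 positive labels, 50 negative labels.
```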


To list all runs and their basic information, you can use the automl/runs endpoint.

Response example:

{
  "runs": [
    {
      "run_id": "string",
      "status": "string",
      "number_of_train_users": 0,
      "number_of_inference_users": 0,
      "run_description": "Custom run description to attach to run",
      "timestamp": 0,
      "run_metrics": {
        "train_auc": 0.8319017758421804,
        "train_ks": 0.49481177186093805,
        "test_auc": 0.7548071508721715,
        "test_ks": 0.3753896241517406,
        "cumulative_positive_rate": null,
        "positive_rate_by_decile": null,
        "label_positive_rate": 0.661318321743897
      },
      "input_label_summary": null
    }
  ]
}

This schema returns a list of all runs, where each run object includes the run_id, current status, the number of users used for training and inference, the run description, and the timestamp of when the run was initiated. Note, however, that only scalar (float) metrics are returned: nested metrics such as cumulative_positive_rate and positive_rate_by_decile are null, and input_label_summary is not returned by this endpoint. To obtain the full metrics and input label analyses, use the automl/runs/{run_id} endpoint instead.

Model Evaluation and Inference

Once training is complete, you can fetch a report of your model quality as well as inference scores on users outside of your selected seed set. The most recent snapshots of each user will be used to generate scores.

Scores are represented as a float between 0 and 1, where a higher score indicates greater similarity to the positive_label_users and a lower score indicates greater dissimilarity. If a negative_label_users set is defined, low scores suggest similarity to this set.


{
  "user_scores": [
    {"user_id": "013", "score": 0.8407377318317583},
    {"user_id": "015", "score": 0.6388057911910163},
    {"user_id": "014", "score": 0.4083728583053339},
    {"user_id": "011", "score": 0.34759904111341},
    {"user_id": "012", "score": 0.154499591657662}
  ]
}

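A response like the one above can be turned into a target audience by ranking on score and taking the top N or applying a threshold. This is a sketch using the example values; the helper name is ours:

```python
response_body = {
    "user_scores": [
        {"user_id": "013", "score": 0.8407377318317583},
        {"user_id": "015", "score": 0.6388057911910163},
        {"user_id": "014", "score": 0.4083728583053339},
        {"user_id": "011", "score": 0.34759904111341},
        {"user_id": "012", "score": 0.154499591657662},
    ]
}

def top_users(body, n=None, min_score=0.0):
    """Return user_ids sorted by descending similarity score."""
    ranked = sorted(body["user_scores"], key=lambda u: u["score"], reverse=True)
    ids = [u["user_id"] for u in ranked if u["score"] >= min_score]
    return ids[:n] if n is not None else ids

audience = top_users(response_body, n=2)  # → ["013", "015"]
```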
Again, for a comprehensive code guide on this, check out our AutoML - Lookalike Audience Recipe.

What’s Next

Check out our AutoML - Lookalike Audience Recipe.