Using ChatGPT To Generate Datasets 🤖

Generate data with realistic text content.

Recently I used ChatGPT to generate mock data in JSON format.

This is extremely powerful: you describe what the real data looks like in your prompt, and ChatGPT generates entries with realistic content in fields such as descriptions, titles, and tags.

This is a sample of the data ChatGPT generated:

  {
    "id": 3,
    "description": "Implementing OAuth2 for user authentication",
    "tags": [
      "@feature",
      "@security"
    ],
    "start": "2024-02-22T03:11:03",
    "end": "2024-02-22T06:36:18"
  }

Notice how the description, tags, and duration are all realistic.

Example

While building the documentation website for track-time-cli, I needed a database of mock data so that the command output shown in the documentation would look realistic.

To get ChatGPT to generate the entries for me, I used the following prompt:

Generate 5 time sheet entries with realistic software development tasks as descriptions. Use the following schema:

        {
          "id": 1,
          "description": "writing documentation",
          "tags": [
            "@feature",
            "@docs"
          ],
          "start": "2024-02-21T18:26:45",
          "end": "2024-02-21T19:33:08"
        }

The entries should be sequential in time, of varying durations (between 5 minutes and 5 hours), and have realistic descriptions that fit the software development time sheet theme.

The output should be in JSON, an array of time sheet entries conforming to the provided schema example.

Note that I specified the number of entries (5) and the type of activity (realistic software development tasks) that should be mentioned in the description.

I defined the output schema by providing an example of a correctly formatted time sheet entry. I also specified that the entries should be sequential in time and have realistic durations, as would be expected for this type of activity.
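
The ingredients described above (entry count, activity type, and an example entry that doubles as the schema) can be kept as parameters so the prompt is easy to reuse. The following is only a sketch: `build_prompt` and its parameter names are hypothetical, and the wording is adapted from the prompt shown earlier rather than copied from any tool.

```python
import json


def build_prompt(count, theme, example_entry):
    """Assemble a dataset-generation prompt from its three ingredients."""
    # The example entry is serialized as-is and serves as the schema.
    schema = json.dumps(example_entry, indent=2)
    return (
        f"Generate {count} time sheet entries with realistic {theme} "
        "as descriptions. Use the following schema:\n\n"
        f"{schema}\n\n"
        "The entries should be sequential in time, of varying durations "
        "(between 5 minutes and 5 hours), and have realistic descriptions "
        "that fit the time sheet theme.\n\n"
        "The output should be in JSON, an array of time sheet entries "
        "conforming to the provided schema example."
    )


example_entry = {
    "id": 1,
    "description": "writing documentation",
    "tags": ["@feature", "@docs"],
    "start": "2024-02-21T18:26:45",
    "end": "2024-02-21T19:33:08",
}

prompt = build_prompt(5, "software development tasks", example_entry)
print(prompt)
```

Changing `count` or `theme` yields a prompt for a different batch size or domain while the schema example stays the same.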

Result

This was the output:

[
  {
    "id": 1,
    "description": "Refactoring user interface code",
    "tags": [
      "@feature",
      "@frontend"
    ],
    "start": "2024-02-21T18:26:45",
    "end": "2024-02-21T22:39:32"
  },
  {
    "id": 2,
    "description": "Optimizing database queries for report generation",
    "tags": [
      "@performance",
      "@database"
    ],
    "start": "2024-02-21T22:39:32",
    "end": "2024-02-22T03:11:03"
  },
  {
    "id": 3,
    "description": "Implementing OAuth2 for user authentication",
    "tags": [
      "@feature",
      "@security"
    ],
    "start": "2024-02-22T03:11:03",
    "end": "2024-02-22T06:36:18"
  },
  {
    "id": 4,
    "description": "Weekly team sync-up meeting",
    "tags": [
      "@meeting",
      "@communication"
    ],
    "start": "2024-02-22T06:36:18",
    "end": "2024-02-22T10:16:24"
  },
  {
    "id": 5,
    "description": "Adding localization support for new languages",
    "tags": [
      "@feature",
      "@i18n"
    ],
    "start": "2024-02-22T10:16:24",
    "end": "2024-02-22T11:50:10"
  }
]
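
Generated output is not guaranteed to respect the constraints in the prompt, so it may be worth checking it before use. The sketch below assumes the reply has been saved as a JSON string; `validate_entries` and the constants are hypothetical names, and the checks mirror the rules stated in the prompt (required fields, durations between 5 minutes and 5 hours, entries sequential in time).

```python
import json
from datetime import datetime, timedelta

REQUIRED_KEYS = {"id", "description", "tags", "start", "end"}
MIN_DURATION = timedelta(minutes=5)
MAX_DURATION = timedelta(hours=5)


def validate_entries(entries):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    previous_end = None
    for entry in entries:
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"entry {entry.get('id')}: missing {sorted(missing)}")
            continue
        start = datetime.fromisoformat(entry["start"])
        end = datetime.fromisoformat(entry["end"])
        # Duration must fall inside the range requested in the prompt.
        if not MIN_DURATION <= end - start <= MAX_DURATION:
            problems.append(f"entry {entry['id']}: duration {end - start} out of range")
        # An entry may start exactly when the previous one ends, but not earlier.
        if previous_end is not None and start < previous_end:
            problems.append(f"entry {entry['id']}: starts before the previous entry ends")
        previous_end = end
    return problems


# First two entries of the output above, as they would arrive in a reply.
reply = """[
  {"id": 1, "description": "Refactoring user interface code",
   "tags": ["@feature", "@frontend"],
   "start": "2024-02-21T18:26:45", "end": "2024-02-21T22:39:32"},
  {"id": 2, "description": "Optimizing database queries for report generation",
   "tags": ["@performance", "@database"],
   "start": "2024-02-21T22:39:32", "end": "2024-02-22T03:11:03"}
]"""
entries = json.loads(reply)
problems = validate_entries(entries)
print(problems or "all entries look valid")
```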

After running this prompt a few times to generate 16 entries, I placed them all in an array within a mock JSON database. Running track-time-cli's list command against it produced the following output:

$ tt l
* Showing sheets since 2/21/2024, 12:00:00 AM [yesterday]

- Sheet main (16 entries) [1:10:19:16]
    (1)   [9:28:26]  yesterday       ->  14 hours ago    Writing API documentation                           @feature @docs
    (2)   [16:51]    14 hours ago    ->  14 hours ago    Implementing user authentication module             @feature @backend
    (3)   [50:01]    14 hours ago    ->  13 hours ago    Designing database schema for new features          @design @database
    (4)   [2:04:11]  13 hours ago    ->  11 hours ago    Improving responsive design for mobile devices      @feature @frontend
    (5)   [2:29:48]  11 hours ago    ->  9 hours ago     Conducting user experience research                 @research @ux
    (6)   [1:42:30]  9 hours ago     ->  7 hours ago     Implementing caching for improved performance       @performance @backend
    (7)   [9:08]     7 hours ago     ->  7 hours ago     Weekly team retrospective meeting                   @meeting @team
    (8)   [39:52]    7 hours ago     ->  6 hours ago     Refining search algorithm                           @algorithm @backend
    (9)   [2:13:23]  6 hours ago     ->  4 hours ago     Reviewing and merging pull requests                 @review @version-control
    (10)  [1:08:12]  4 hours ago     ->  3 hours ago     Developing new feature for customer analytics       @feature @analytics
    (11)  [2:27:32]  3 hours ago     ->  21 minutes ago  Solving compatibility issues in cross-platform app  @bugfix @mobile
    (12)  [3:14:14]  21 minutes ago  ->  in 3 hours      Organizing backlog and prioritizing tasks           @planning @management
    (13)  [3:07:36]  in 3 hours      ->  in 6 hours      Upgrading server infrastructure                     @maintenance @devops
    (14)  [1:41:17]  in 6 hours      ->  in 8 hours      Creating automated tests for new API endpoints      @testing @automation
    (15)  [2:02:29]  in 8 hours      ->  in 10 hours     Troubleshooting server downtime issues              @maintenance @devops
    (16)  [43:46]    in 10 hours     ->  in 10 hours     Designing user interface for the upcoming module    @design @frontend

* 0 Sheets not shown. use --all to show
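
Combining several generated batches into one array, as was done before producing the listing above, can be sketched as follows. This assumes that each run of the prompt may restart its ids at 1, so the ids are renumbered after merging. `merge_batches` is a hypothetical helper, and the layout of track-time-cli's database file is not shown in this post, so the sketch stops at producing the merged array.

```python
import json


def merge_batches(batches):
    """Flatten batches, order entries by start time, and renumber ids from 1."""
    # Copy each entry so the original batches are left untouched.
    merged = [dict(entry) for batch in batches for entry in batch]
    # ISO 8601 timestamps in the same format sort correctly as plain strings.
    merged.sort(key=lambda entry: entry["start"])
    for new_id, entry in enumerate(merged, start=1):
        entry["id"] = new_id
    return merged


batch_a = [
    {"id": 1, "description": "Refactoring user interface code",
     "tags": ["@feature", "@frontend"],
     "start": "2024-02-21T18:26:45", "end": "2024-02-21T22:39:32"},
    {"id": 2, "description": "Optimizing database queries for report generation",
     "tags": ["@performance", "@database"],
     "start": "2024-02-21T22:39:32", "end": "2024-02-22T03:11:03"},
]
batch_b = [
    {"id": 1, "description": "Implementing OAuth2 for user authentication",
     "tags": ["@feature", "@security"],
     "start": "2024-02-22T03:11:03", "end": "2024-02-22T06:36:18"},
]

merged = merge_batches([batch_b, batch_a])
print(json.dumps(merged, indent=2))
```

The merged array can then be pasted into whatever structure the mock database expects.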

Conclusions

ChatGPT is a great tool for generating realistic datasets: describe the schema and constraints in the prompt, and it fills in plausible content.

Try it out the next time you need some mock data with realistic textual content! 🚀
