Training Top2Vec with embedding_batch_size=256 crashed macOS 12.3.1
Environment: tensorflow_macos 2.8.0, tensorflow_metal 0.4.0, Anaconda Python 3.8.5
    % pip show tensorflow_macos
    WARNING: Ignoring invalid distribution -umpy (/Users/davidlaxer/tensorflow-metal/lib/python3.8/site-packages)
    Name: tensorflow-macos
    Version: 2.8.0
    Summary: TensorFlow is an open source machine learning framework for everyone.
    Home-page: https://www.tensorflow.org/
    Author: Google Inc.
    Author-email: packages@tensorflow.org
    License: Apache 2.0
    Location: /Users/davidlaxer/tensorflow-metal/lib/python3.8/site-packages
    Requires: absl-py, astunparse, flatbuffers, gast, google-pasta, grpcio, h5py, keras, keras-preprocessing, libclang, numpy, opt-einsum, protobuf, setuptools, six, tensorboard, termcolor, tf-estimator-nightly, typing-extensions, wrapt
    Required-by:

    (tensorflow-metal) (base) davidlaxer@x86_64-apple-darwin13 top2vec % pip show tensorflow_metal
    WARNING: Ignoring invalid distribution -umpy (/Users/davidlaxer/tensorflow-metal/lib/python3.8/site-packages)
    Name: tensorflow-metal
    Version: 0.4.0
    Summary: TensorFlow acceleration for Mac GPUs.
    Home-page: https://developer.apple.com/metal/tensorflow-plugin/
    Author:
    Author-email:
    License: MIT License. Copyright © 2020-2021 Apple Inc. All rights reserved.
    Location: /Users/davidlaxer/tensorflow-metal/lib/python3.8/site-packages
    Requires: six, wheel
    Required-by:
To train the model with embedding_model="universal-sentence-encoder", you'll have to download universal-sentence-encoder_4.
    from top2vec import Top2Vec

    top2vec_trained = Top2Vec(
        documents=papers_filtered_df.text.tolist(),
        split_documents=True,
        embedding_batch_size=256,  # the batch size in use when the crash occurred
        embedding_model="universal-sentence-encoder",
        use_embedding_model_tokenizer=True,
        embedding_model_path="/Users/davidlaxer/Downloads/universal-sentence-encoder_4",
        workers=8,
    )
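For context, as I understand the Top2Vec documentation, `embedding_batch_size` is the number of documents (or document chunks, with `split_documents=True`) passed to the embedding model at a time. A value of 256 therefore means fewer but much larger calls than a small batch size. The sketch below only illustrates that chunking in plain Python: `iter_batches` is a hypothetical helper, not Top2Vec's actual code, and the corpus size is made up.

```python
from typing import Iterator, List, Sequence


def iter_batches(docs: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size documents."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(docs), batch_size):
        yield list(docs[start:start + batch_size])


# Made-up corpus size, for illustration only.
docs = [f"document {i}" for i in range(1000)]

for size in (32, 256):
    batches = list(iter_batches(docs, size))
    # Print the batch size, the number of batches, and the size of the last batch.
    print(size, len(batches), len(batches[-1]))
```

If the crash is tied to how much work is submitted per call, comparing a run at a smaller batch size against 256 would be one way to narrow it down.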
Here's the project:
https://github.com/ddangelov/Top2Vec
Here's the Jupyter notebook:
https://github.com/ddangelov/Top2Vec/blob/master/notebooks/CORD-19_top2vec.ipynb
You'll have to load the COVID-19 data set from Kaggle here:
https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge
I set the filter threshold to 1,000 tokens:
    def filter_short(papers_df):
        # Count whitespace-separated tokens in each paper.
        papers_df["token_counts"] = papers_df["text"].str.split().map(len)
        # Keep only papers with more than 1,000 tokens.
        papers_df = papers_df[papers_df.token_counts > 1000].reset_index(drop=True)
        papers_df.drop('token_counts', axis=1, inplace=True)
        return papers_df
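To check what the 1,000-token cutoff keeps without loading the full Kaggle dataset, here is a small pandas-free sketch of the same rule. `keep_long` and the sample texts are hypothetical and exist only for this illustration; the comparison is strictly greater than, matching `token_counts>1000` above.

```python
def keep_long(texts, min_tokens=1000):
    """Keep texts with more than min_tokens whitespace-separated tokens."""
    return [t for t in texts if len(t.split()) > min_tokens]


# Three synthetic "papers" with 999, 1000, and 1001 tokens.
samples = ["word " * 999, "word " * 1000, "word " * 1001]

kept = keep_long(samples)
print(len(kept))  # only the 1001-token text passes the strict > 1000 test
```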
Trace
    panic(cpu 8 caller 0xffffff8020d449ad): userspace watchdog timeout: no successful checkins from WindowServer in 120 seconds
    service: logd, total successful checkins since wake (127621 seconds ago): 12763, last successful checkin: 0 seconds ago
    service: WindowServer, total successful checkins since wake (127621 seconds ago): 12751, last successful checkin: 120 seconds ago
    service: remoted, total successful checkins since wake (127621 seconds ago): 12763, last successful checkin: 0

[Trace](https://developer.apple.com/forums/content/attachment/d17c2c9b-569b-4c1a-9c61-892ced7f785b)