Ron Weiss

I'm currently a research scientist at Facebook Reality Labs, working on a neural interface. Prior to that, I spent over a decade at Google, most recently on the Google Brain team, working on end-to-end models for speech recognition, translation, and synthesis, as well as sound separation and unsupervised representation learning. I also worked on recommendations for Google Play Music. Before Google, I was a postdoc working on music information retrieval with Juan Bello in the Music and Audio Research Laboratory (MARL) at NYU. Earlier still, I was a graduate research assistant working with Dan Ellis in the Laboratory for the Recognition and Organization of Speech and Audio (LabROSA) at Columbia University. I defended my dissertation in 2009 (watch me write it at about 50,000× real time here).

My research interests lie at the intersection of audio signal processing and machine learning. My dissertation research was devoted to model-based source separation, but I also found time to do a bit of music signal analysis to create some wacky remixes on the side.

Invited Talks

Publications

[1] S. Wisdom, A. Jansen, R. J. Weiss, H. Erdogan, and J. R. Hershey.
Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation.
In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), October 2021.
[ bib | arxiv ]
[2] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, N. Dehak, and W. Chan.
WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis.
In Proc. Interspeech, August 2021.
[ bib | arxiv | web ]
[3] P. Wang, T. N. Sainath, and R. J. Weiss.
Multitask Training with Text Data for End-to-End Speech Recognition.
In Proc. Interspeech, August 2021.
[ bib | arxiv ]
[4] R. J. Weiss, R. J. Skerry-Ryan, E. Battenberg, S. Mariooryad, and D. P. Kingma.
Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), June 2021.
[ bib | DOI | video | arxiv | web | poster | slides ]
[5] I. Elias, H. Zen, J. Shen, Y. Zhang, Y. Jia, R. J. Weiss, and Y. Wu.
Parallel Tacotron: Non-Autoregressive and Controllable TTS.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), June 2021.
[ bib | DOI | arxiv | web ]
[6] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan.
WaveGrad: Estimating Gradients for Waveform Generation.
In Proc. International Conference on Learning Representations (ICLR), May 2021.
[ bib | reviews | arxiv | web ]
[7] S. Wisdom, E. Tzinis, H. Erdogan, R. J. Weiss, K. Wilson, and J. R. Hershey.
Unsupervised Sound Separation Using Mixture Invariant Training.
In Advances in Neural Information Processing Systems (NeurIPS), December 2020.
[ bib | reviews | arxiv | web ]
[8] S. Wisdom, E. Tzinis, H. Erdogan, R. J. Weiss, K. Wilson, and J. R. Hershey.
Unsupervised Speech Separation Using Mixtures of Mixtures.
In ICML 2020 Workshop on Self-supervision in Audio and Speech, July 2020.
[ bib | reviews | web ]
[9] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, A. Rosenberg, B. Ramabhadran, and Y. Wu.
Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 6699--6703, May 2020.
[ bib | DOI | arxiv | web ]
[10] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, and Y. Wu.
Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 6264--6268, May 2020.
[ bib | DOI | arxiv | web ]
[11] T. N. Sainath, R. Pang, R. J. Weiss, Y. He, C.-C. Chiu, and T. Strohman.
An Attention-Based Joint Acoustic and Text On-Device End-To-End Model.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 7039--7043, May 2020.
[ bib | DOI | .pdf ]
[12] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord.
Unsupervised speech representation learning using WaveNet autoencoders.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12):2041--2053, December 2019.
[ bib | DOI | arxiv ]
[13] Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Z. Chen, R. J. Skerry-Ryan, Y. Jia, A. Rosenberg, and B. Ramabhadran.
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning.
In Proc. Interspeech, Graz, Austria, September 2019.
[ bib | DOI | arxiv | web ]
[14] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu.
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech.
In Proc. Interspeech, Graz, Austria, September 2019.
[ bib | DOI | arxiv | web ]
[15] F. Biadsy, R. J. Weiss, P. J. Moreno, D. Kanevsky, and Y. Jia.
Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation.
In Proc. Interspeech, Graz, Austria, September 2019.
[ bib | DOI | arxiv | web ]
[16] Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, and Y. Wu.
Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model.
In Proc. Interspeech, Graz, Austria, September 2019.
[ bib | DOI | arxiv | web ]
[17] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. Lopez-Moreno.
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.
In Proc. Interspeech, Graz, Austria, September 2019.
[ bib | DOI | arxiv | web ]
[18] J. M. Antognini, M. Hoffman, and R. J. Weiss.
Audio Texture Synthesis with Random Neural Networks: Improving Diversity and Quality.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK, May 2019.
[ bib | DOI | web | poster | .pdf ]
[19] J. Guo, T. N. Sainath, and R. J. Weiss.
A Spelling Correction Model for End-to-End Speech Recognition.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK, May 2019.
[ bib | DOI | arxiv | slides ]
[20] Y. Jia, M. Johnson, W. Macherey, R. J. Weiss, Y. Cao, C.-C. Chiu, N. Ari, S. Laurenzo, and Y. Wu.
Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK, May 2019.
[ bib | DOI | arxiv | slides ]
[21] W. N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang.
Hierarchical Generative Modeling for Controllable Speech Synthesis.
In Proc. International Conference on Learning Representations (ICLR), New Orleans, USA, May 2019.
[ bib | reviews | arxiv | web ]
[22] W. N. Hsu, Y. Zhang, R. J. Weiss, Y. A. Chung, Y. Wang, Y. Wu, and J. Glass.
Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization.
In NeurIPS 2018 Workshop on Interpretability and Robustness in Audio, Speech, and Language, Montréal, Canada, December 2018.
Also presented at ICASSP 2019.
[ bib | reviews | web ]
[23] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. Lopez-Moreno, and Y. Wu.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis.
In Advances in Neural Information Processing Systems (NeurIPS), Montréal, Canada, December 2018.
[ bib | reviews | arxiv | web | poster ]
[24] R. J. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous.
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron.
In Proc. International Conference on Machine Learning (ICML), Stockholm, Sweden, July 2018.
[ bib | arxiv | web ]
[25] J. Antognini, M. Hoffman, and R. J. Weiss.
Synthesizing Diverse, High-Quality Audio Textures.
arXiv preprint arXiv:1806.08002, June 2018.
[ bib | arxiv | web ]
[26] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani.
State-of-the-art Speech Recognition With Sequence-to-Sequence Models.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Canada, April 2018.
[ bib | arxiv | web ]
[27] S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein, and K. Rao.
Multilingual Speech Recognition With A Single End-To-End Model.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Canada, April 2018.
[ bib | arxiv | web ]
[28] J. Chorowski, R. J. Weiss, R. A. Saurous, and S. Bengio.
On Using Backpropagation for Speech Texture Generation and Voice Conversion.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Canada, April 2018.
[ bib | arxiv | web ]
[29] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Canada, April 2018.
[ bib | arxiv | web ]
[30] J. P. Bello, P. Grosche, M. Müller, and R. Weiss.
Content-Based Methods for Knowledge Discovery in Music.
In Springer Handbook of Systematic Musicology, pages 823--840. Springer, March 2018.
[ bib | DOI ]
[31] B. Li, T. N. Sainath, A. Narayanan, J. Caroselli, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. Chin, K. C. Sim, R. J. Weiss, K. Wilson, E. Variani, C. Kim, O. Siohan, M. Weintraub, E. McDermott, R. Rose, and M. Shannon.
Acoustic Modeling for Google Home.
In Proc. Interspeech, Stockholm, Sweden, August 2017.
[ bib | DOI | .pdf ]
[32] R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen.
Sequence-to-Sequence Models Can Directly Translate Foreign Speech.
In Proc. Interspeech, Stockholm, Sweden, August 2017.
[ bib | DOI | arxiv | slides ]
[33] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous.
Tacotron: Towards End-To-End Speech Synthesis.
In Proc. Interspeech, Stockholm, Sweden, August 2017.
[ bib | DOI | arxiv ]
[34] C. Raffel, T. Luong, P. J. Liu, R. J. Weiss, and D. Eck.
Online and Linear-Time Attention by Enforcing Monotonic Alignments.
In Proc. International Conference on Machine Learning (ICML), Sydney, Australia, August 2017.
[ bib | arxiv | http ]
[35] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson.
CNN Architectures for Large-Scale Audio Classification.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, USA, March 2017.
[ bib | DOI | arxiv | .pdf ]
[36] T. N. Sainath, R. J. Weiss, K. W. Wilson, B. Li, A. Narayanan, E. Variani, M. Bacchiani, I. Shafran, A. Senior, K. W. Chin, A. Misra, and C. Kim.
Multichannel Signal Processing with Deep Neural Networks for Automatic Speech Recognition.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(5):965--979, February 2017.
[ bib | DOI | .pdf ]
[37] T. N. Sainath, R. J. Weiss, K. W. Wilson, B. Li, A. Narayanan, E. Variani, M. Bacchiani, I. Shafran, A. Senior, K. W. Chin, A. Misra, and C. Kim.
Raw Multichannel Processing Using Deep Neural Networks.
In New Era for Robust Speech Recognition: Exploiting Deep Learning. Springer, 2017.
[ bib | DOI | .pdf ]
[38] T. N. Sainath, A. Narayanan, R. J. Weiss, E. Variani, K. W. Wilson, M. Bacchiani, and I. Shafran.
Reducing the Computational Complexity of Multimicrophone Acoustic Models with Integrated Feature Extraction.
In Proc. Interspeech, San Francisco, USA, September 2016.
[ bib | DOI | .pdf ]
[39] B. Li, T. N. Sainath, R. J. Weiss, K. W. Wilson, and M. Bacchiani.
Neural Network Adaptive Beamforming for Robust Multichannel Speech Recognition.
In Proc. Interspeech, San Francisco, USA, September 2016.
[ bib | DOI | .pdf ]
[40] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, and M. Bacchiani.
Factored Spatial and Spectral Multichannel Raw Waveform CLDNNs.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Shanghai, China, March 2016.
[ bib | DOI | .pdf ]
[41] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, M. Bacchiani, and A. Senior.
Speaker Location and Microphone Spacing Invariant Acoustic Modeling from Raw Multichannel Waveforms.
In Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Scottsdale, USA, December 2015.
[ bib | DOI | .pdf ]
[42] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals.
Learning the Speech Front-End with Raw Waveform CLDNNs.
In Proc. Interspeech, Dresden, Germany, September 2015.
[ bib | .pdf ]
[43] Y. Hoshen, R. J. Weiss, and K. W. Wilson.
Speech Acoustic Modeling from Raw Multichannel Waveforms.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brisbane, Australia, April 2015.
[ bib | DOI | .pdf ]
[44] J. Weston, R. Weiss, and H. Yee.
Affinity Weighted Embedding.
In Proc. International Conference on Machine Learning (ICML), pages 1215--1223, Beijing, China, June 2014.
[ bib | http | .pdf ]
[45] J. Weston, H. Yee, and R. J. Weiss.
Learning to Rank Recommendations with the k-order Statistic Loss.
In Proc. ACM Conference on Recommender Systems (RecSys), pages 245--248, Hong Kong, October 2013.
[ bib | DOI | .pdf ]
[46] J. Weston, R. J. Weiss, and H. Yee.
Nonlinear Latent Factorization by Embedding Multiple User Interests.
In Proc. ACM Conference on Recommender Systems (RecSys), pages 65--68, Hong Kong, October 2013.
[ bib | DOI | .pdf ]
[47] J. Weston, R. Weiss, and H. Yee.
Affinity Weighted Embedding.
In Proc. International Conference on Learning Representations (ICLR), Scottsdale, USA, May 2013.
[ bib | arxiv | http | .pdf ]
[48] J. Weston, C. Wang, R. Weiss, and A. Berenzweig.
Latent Collaborative Retrieval.
In Proc. International Conference on Machine Learning (ICML), Edinburgh, Scotland, June 2012.
[ bib | arxiv | http | .pdf ]
[49] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay.
scikit-learn: Machine Learning in Python.
Journal of Machine Learning Research, 12:2825--2830, October 2011.
[ bib | arxiv | http | .pdf ]
[50] R. J. Weiss and J. P. Bello.
Unsupervised Discovery of Temporal Structure in Music.
IEEE Journal of Selected Topics in Signal Processing, 5(6):1240--1251, October 2011.
[ bib | DOI | .pdf ]
[51] T. Bertin-Mahieux, G. Grindlay, R. J. Weiss, and D. P. W. Ellis.
Evaluating Music Sequence Models Through Missing Data.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 177--180, Prague, Czech Republic, May 2011.
[ bib | DOI | .pdf ]
[52] R. J. Weiss, M. I. Mandel, and D. P. W. Ellis.
Combining Localization Cues and Source Model Constraints for Binaural Source Separation.
Speech Communication, 53(5):606--621, May 2011.
Special issue on Perceptual and Statistical Audition.
[ bib | DOI | .pdf ]
[53] T. Bertin-Mahieux, R. J. Weiss, and D. P. W. Ellis.
Clustering Beat-Chroma Patterns in a Large Music Database.
In Proc. International Society for Music Information Retrieval Conference (ISMIR), pages 111--116, Utrecht, Netherlands, August 2010.
[ bib | web | .pdf ]
[54] R. J. Weiss and J. P. Bello.
Identifying Repeated Patterns in Music Using Sparse Convolutive Non-Negative Matrix Factorization.
In Proc. International Society for Music Information Retrieval Conference (ISMIR), pages 123--128, Utrecht, Netherlands, August 2010.
Best Paper Award.
[ bib | web | slides | .pdf ]
[55] T. Cho, R. J. Weiss, and J. P. Bello.
Exploring Common Variations in State of the Art Chord Recognition Systems.
In Proc. Sound and Music Computing Conference (SMC), pages 1--8, Barcelona, Spain, July 2010.
[ bib | .pdf ]
[56] M. I. Mandel, R. J. Weiss, and D. P. W. Ellis.
Model-Based Expectation-Maximization Source Separation and Localization.
IEEE Transactions on Audio, Speech, and Language Processing, 18(2):382--394, February 2010.
[ bib | DOI | web | .pdf ]
[57] R. J. Weiss and D. P. W. Ellis.
Speech Separation Using Speaker-Adapted Eigenvoice Speech Models.
Computer Speech and Language, 24(1):16--29, January 2010.
Special issue on the Speech Separation and Recognition Challenge.
[ bib | DOI | .pdf ]
[58] R. J. Weiss and D. P. W. Ellis.
A Variational EM Algorithm for Learning Eigenvoice Parameters in Mixed Signals.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 113--116, Taipei, Taiwan, April 2009.
[ bib | DOI | poster | .pdf ]
[59] R. J. Weiss.
Underdetermined Source Separation Using Speaker Subspace Models.
PhD thesis, Department of Electrical Engineering, Columbia University, 2009.
[ bib | slides | .pdf ]
[60] R. J. Weiss and T. Kristjansson.
DySANA: Dynamic Speech and Noise Adaptation for Voice Activity Detection.
In Proc. Interspeech, pages 127--130, Brisbane, Australia, September 2008.
[ bib | http | poster | .pdf ]
[61] R. J. Weiss, M. I. Mandel, and D. P. W. Ellis.
Source Separation Based on Binaural Cues and Source Model Constraints.
In Proc. Interspeech, pages 419--422, Brisbane, Australia, September 2008.
[ bib | http | poster | .pdf ]
[62] R. J. Weiss and D. P. W. Ellis.
Monaural Speech Separation Using Source-Adapted Models.
In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 114--117, New Paltz, USA, October 2007.
[ bib | DOI | web | slides | .pdf ]
[63] R. J. Weiss and D. P. W. Ellis.
Estimating Single-Channel Source Separation Masks: Relevance Vector Machine Classifiers vs. Pitch-Based Masking.
In Proc. ISCA Tutorial and Research Workshop on Statistical Perceptual Audition (SAPA), pages 31--36, Pittsburgh, USA, September 2006.
[ bib | http | slides | .pdf ]
[64] D. P. W. Ellis and R. J. Weiss.
Model-Based Monaural Source Separation Using a Vector-Quantized Phase-Vocoder Representation.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages V-957--960, Toulouse, France, May 2006.
[ bib | DOI | .pdf ]

Teaching

I have taught or been a teaching assistant for:


Last updated on October 5, 2021.